September 27, 2011 | “Scaling to bigger and better hardware doesn’t help if your data is growing in size faster than your hardware,” says Titus Brown at Michigan State University. He and others in the NGS community are calling for software solutions to their NGS data woes instead of massive storage options. In an August post on his blog, “Daily Life in an Ivory Basement,” Brown wrote: “The bottom line is this: when your data cost is decreasing faster than your hardware cost, the long-term solution cannot be to buy, rent, borrow, beg, or steal more hardware. The solution must lie in software and algorithms.”
Thankfully, the options for both are expanding. Familiar names such as CLC bio, Geospiza, DNAnexus, GenomeQuest (see p. 24), Omicia (see p. 48), and others (see “Next-Gen Sequencing Software: The Present and the Future,” Bio•IT World, Sept. 2010) are being joined by a new batch of friendly competitors. For the most part, these offerings—from aligners to niche analytics—support the Illumina, 454, and SOLiD platforms, with some including Ion Torrent, Pacific Biosciences, and Complete Genomics data as well.
The software landscape for NGS analysis is broad and varied partly because “analysis” isn’t a cut and dried term, says Knome’s Nathaniel Pearson, director of research. “We’ve managed, as a community, to make people understand that analysis is as important as sequencing in the end… But now we have to tease out upstream and downstream analysis.”
Pearson defines “upstream analysis” as that closest to the sequencing machines, where the first work was done: base calling, variant calling, variant assessment, etc. “Now we’re seeing a focus moving toward downstream analysis, toward understanding many genomes at a time. As the stream of sequencing data from one machine comes together in a river with the streams coming from other machines, we need to make sense of that tide of data.”
Swimming Upstream
Knome’s area of interest can be summarized as “service with software,” says Pearson. kGAP—Knome’s Genome Analysis Platform—is the analysis software Knome uses to “richly annotate genomes and compare them to each other thoroughly,” says Pearson.
Knome’s sequencing and genome analysis service was launched in 2007. “Knome cut its teeth analyzing whole genomes for consumers. Given how costly whole genome sequencing remains, most of those consumers are still either healthy and wealthy aficionados of science and technology, or physician-aided families with urgent health problems—fairly small markets,” says Pearson.
“We do foresee that the consumer market will eventually democratize, as sequencing gets cheaper and insights for small numbers of relatively healthy genomes—especially in family settings—become more precise and useful,” he says.
Until then, Knome plans to keep refining its analysis pipeline and end-user software. Today, more than 95% of the firm’s customer base is researchers, about half from academia and half from industry, users that Pearson says can best understand diseases of widespread public interest.
When these customers receive Knome’s analysis they also receive software tools like KDK Site Finder, a simple query interface that lets clients find interesting sites in one or a set of genomes by “sensibly chosen criteria: allele frequency, call quality, novelty, zygosity—the usual suspects—as well as a rich archive of gene- and site-associated phenotype data from the literature.”
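Knome has not published Site Finder’s internals, but the kind of query Pearson describes can be pictured as a simple predicate applied to per-site records. The sketch below is purely illustrative; the record fields, thresholds, and function names are assumptions, not Knome’s API:

```python
# Hypothetical illustration of querying variant sites by common criteria
# (allele frequency, call quality, novelty, zygosity). Field names and
# thresholds are invented for this sketch; they are not Knome's software.
from dataclasses import dataclass

@dataclass
class Site:
    chrom: str
    pos: int
    ref: str
    alt: str
    allele_freq: float   # population allele frequency, 0.0-1.0
    call_quality: float  # e.g. a Phred-scaled quality score
    in_dbsnp: bool       # previously reported (i.e. not novel)?
    zygosity: str        # "het" or "hom"

def find_sites(sites, max_freq=0.01, min_qual=30.0,
               novel_only=True, zygosity=None):
    """Return the sites that pass all of the chosen filter criteria."""
    hits = []
    for s in sites:
        if s.allele_freq > max_freq:
            continue
        if s.call_quality < min_qual:
            continue
        if novel_only and s.in_dbsnp:
            continue
        if zygosity and s.zygosity != zygosity:
            continue
        hits.append(s)
    return hits

# Example: rare, high-confidence, novel homozygous sites
genome = [
    Site("chr1", 1_204_567, "G", "A", 0.002, 48.0, False, "hom"),
    Site("chr2", 9_876_543, "T", "C", 0.350, 50.0, True, "het"),
]
print(find_sites(genome, zygosity="hom"))
```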
The current version of kGAP runs in the Cloud, which has greatly increased its throughput. But Pearson doesn’t expect analysis costs to fall at the rate of sequencing costs. “They’re going to drop slower than sequencing costs overall because we’re more tied to computational costs—which is more of a Moore’s Law scale,” he says. “Some software will fall quickly; it’ll get commoditized. But the very best software will always be costing a bit more because it will entail evermore complex underlying calculations to make the bottom line look much simpler to use.”
He believes future analysis options will do for sequencing what Photoshop did for photography. “I think we’ll see software for the end user for understanding genomes [in which] a lot of the underlying calculations will be done very swiftly and very cleverly under the hood. And the user’s experience will be very easy and very fast, but that’s going to cost a bit.”
The team at Real Time Genomics might disagree. The company’s “single and only intent is to provide the world’s best genomic analysis software,” says CEO Phillip Whalen. And they’re giving it away for free.
The venture-funded company based in San Francisco unveiled its website only a few months ago, but the technical team has been working on this problem for seven or eight years.
“The decision we made when we basically took the wrappers off,” says Whalen, “was that for organizations we wanted to charge a license fee, but if [researchers are] working on a project and they decide, ‘I’d like a really tight, easy to use pipeline,’ absolutely the use of our software by an individual investigator is unrestricted.”
RTG Investigator is made up of two such pipelines: one geared for variant detection and one for metagenomics. The software runs from a command line interface and is geared toward research teams that include both bioinformaticians and biological investigators. “Our customers wring the last bit of information out of their datasets, and the tension of discovery demands a collaborative effort,” says Stewart Noyce, RTG’s director of product marketing.
“Right at the core is this extremely fast and sensitive searching technology,” says Graham Gaylard, RTG’s founder. “When I say sensitive, we actually can search with mismatches in the search pattern right at the very start.”
“The variant detection pipeline does all of the alignment—it’s a fully gapped aligner—so it does full read matching, assembly, and also processing right through to variant calls such as SNPs, complex calls, indels, CNVs [copy number variations], and structural variations,” says Gaylard. “It handles paired ends natively, not as an add-on. That gives us far superior efficiency. We’re as accurate as all of them, but we’re faster than the BWA/GATK pipeline by 10x.”
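RTG’s index and search algorithms are proprietary and undisclosed; as a conceptual sketch of what Gaylard means by tolerating mismatches even at the very start of the pattern, a naive scan that counts substitutions at every offset looks like this (real aligners index the reference rather than scanning it):

```python
def search_with_mismatches(read, reference, max_mismatches=2):
    """Naive scan: report every offset where `read` aligns to `reference`
    with at most `max_mismatches` substitutions -- including mismatches
    at the very start of the pattern. A conceptual sketch only."""
    hits = []
    for start in range(len(reference) - len(read) + 1):
        window = reference[start:start + len(read)]
        mismatches = sum(1 for a, b in zip(read, window) if a != b)
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits

# One exact hit at offset 6, one hit at offset 12 whose single mismatch
# falls on the first base of the pattern.
print(search_with_mismatches("TACGT", "GGTTCGTACGTAAACGT", max_mismatches=1))
```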
For metagenomics, the numbers are even better. “One of the functions our search technology replaces is BLASTX, a translated nucleotide search of protein databases. We’re 1,000x faster than that.” The Genome Institute at Washington University acquired some of the early licenses for the product a couple of years ago, and RTG has worked closely with them on the Human Microbiome Project. Gaylard says RTG has turned a 10-year compute task on their cluster into a three-month problem. “That has a big impact on how you do things,” he says.
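The translated-search engine itself is not public, but the job it replaces, a BLASTX-style query, begins by translating each nucleotide read in all six reading frames before matching against a protein database. A minimal, self-contained version of that first step (standard genetic code; the protein-database matching is omitted):

```python
# Six-frame translation of a nucleotide read, the first step of a
# BLASTX-style translated protein search. Standard genetic code only.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO_ACIDS))  # '*' marks a stop codon

def translate(seq):
    """Translate a nucleotide string codon by codon ('X' for ambiguous)."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))

def reverse_complement(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def six_frame(read):
    """Return the six conceptual translations of a nucleotide read."""
    rc = reverse_complement(read)
    return ([translate(read[f:]) for f in range(3)] +
            [translate(rc[f:]) for f in range(3)])

for frame in six_frame("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"):
    print(frame)
```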
The software is designed to make maximum use of the computing resources allocated to it, and will run on a laptop or cluster or can be pushed to the Cloud. Everything is proprietary—new algorithms, a new approach, and patent protected (or pending). “We have not gone out and taken something open source and tweaked it,” says Whalen. “We have attacked the problems from a computer science point of view with new ways of doing things. We’ve done that from scratch and come up with some results that our customers say are pretty compelling.”
Betting on Biologists
Though some users are happy at a command line, Enlis Genomics and others are betting that many biologists would like to dig into their data without also learning bioinformatics. Enlis’ “point-and-click genomics” software was designed by biologists, says founder Devon Jensen.
The software caught Illumina’s eye in July, winning the commercial category in the iDEA Challenge (see “Illumina Showcases New Visions in Genomic Interpretation,” Bio•IT World, July 2011). Part of the prize was a one-year co-promotion marketing agreement with Illumina; Jensen says the details of that agreement are still being finalized.
This isn’t variant calling, though. The software addresses the biologist’s question: after you have sequenced, assembled, and called variants, what do you do next? Tools like the Variation Filter and the Genome Difference tool let users query the genome and compare up to 100 genomes. “The focus of our software is making it easy to find what is biologically relevant in the sequence data of a patient, individual, or research animal,” says Jensen.
The Enlis software comes with an import/annotation tool that creates a .genome file format, encapsulating all the different types of genomic data in a single file to simplify handling and storage for the researcher. The focus is on speed and ease. “The software contains very fast algorithms for filtering variations and finding differences between whole genomes,” says Jensen. “We have organized all of the information in a way that allows a researcher to quickly assess whether a particular feature of a genome is important.”
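Enlis has not documented the .genome format or its comparison algorithms publicly; at its core, though, a “genome difference” reduces to set operations on variant keys. A hypothetical sketch (the key fields and function names are illustrative, not Enlis’s implementation):

```python
# Hypothetical sketch of a "genome difference": treat each genome as a
# set of variant keys and compare them with set operations.
def variant_keys(variants):
    """Key each variant by (chromosome, position, ref allele, alt allele)."""
    return {(v["chrom"], v["pos"], v["ref"], v["alt"]) for v in variants}

def genome_difference(genome_a, genome_b):
    a, b = variant_keys(genome_a), variant_keys(genome_b)
    return {"only_in_a": a - b, "only_in_b": b - a, "shared": a & b}

patient = [{"chrom": "chr7", "pos": 117199644, "ref": "ATCT", "alt": "A"},
           {"chrom": "chr1", "pos": 11850, "ref": "G", "alt": "T"}]
control = [{"chrom": "chr1", "pos": 11850, "ref": "G", "alt": "T"}]
print(genome_difference(patient, control)["only_in_a"])
```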
SoftGenetics’ NextGENe product is also aimed at the individual biologist or clinician, says John Fosnacht, the company’s co-founder and vice president. “It’s a Windows-based program that’s easy to use. It has a lot of tools in it that [users] can use on multiple applications. It doesn’t require any kind of bioinformatics support.”
Fosnacht says the company has several groups of customers, including core labs that don’t have huge bioinformatics resources. The Mayo Clinic, for example, is using a networked version of the software. The software will process a whole human genome in ten hours, Fosnacht says.
In a partnership with the rare disease group at NIH, SoftGenetics developed a variant comparison tool as a module in NextGENe to identify which of thousands of variants are most likely to be causative mutations in rare genetic disorders. The software takes the full set of variants (more than 275,000 variants in a family of six in one example) and filters out silent variants, variants already reported in dbSNP, and others according to additional parameters. The NIH researchers were left with a very manageable six candidate mutations.
“The filtering and prediction part takes less than half a day. That allows the molecular geneticist and researcher, instead of trying to do the impossible and look at 280,000 variants, to focus on relatively few,” says Fosnacht.
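SoftGenetics hasn’t published the module’s internals, but the workflow Fosnacht describes is essentially a cascade of exclusion filters whose survivor count shrinks at each step. A hypothetical sketch, with filter names and record fields invented for illustration:

```python
# Hypothetical filter cascade for candidate causative variants, loosely
# mirroring the workflow described above (not SoftGenetics' actual code).
def filter_cascade(variants):
    steps = [
        ("remove silent (synonymous) variants",
         lambda v: v["effect"] != "synonymous"),
        ("remove variants already in dbSNP",
         lambda v: not v["in_dbsnp"]),
        ("keep variants shared by all affected family members",
         lambda v: v["affected_carriers"] == v["affected_total"]),
    ]
    remaining = list(variants)
    print(f"start: {len(remaining)} variants")
    for name, keep in steps:
        remaining = [v for v in remaining if keep(v)]
        print(f"after {name}: {len(remaining)} variants")
    return remaining

demo = [
    {"effect": "missense", "in_dbsnp": False,
     "affected_carriers": 2, "affected_total": 2},
    {"effect": "synonymous", "in_dbsnp": False,
     "affected_carriers": 2, "affected_total": 2},
    {"effect": "missense", "in_dbsnp": True,
     "affected_carriers": 1, "affected_total": 2},
]
candidates = filter_cascade(demo)
```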
The software uses a modified Burrows-Wheeler transform method and excels at indel discovery and somatic mutation detection. NextGENe was able to find a 55-basepair deletion in a 50-bp read. “This is a patented functionality in the software that can elongate short reads,” says Fosnacht. “In reality it is a localized assembly. Once the reads are elongated, the software can detect an indel up to 33% of the elongated length. The same process can be used to actually merge paired reads into one long read. When employed, this process can produce Sanger-quality reads from short reads.”
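The elongation method is patented and its details are not public; conceptually, though, merging an overlapping read pair into one longer read comes down to finding a suffix/prefix overlap after reverse-complementing the second mate. A toy version, with an arbitrary minimum-overlap threshold:

```python
# Toy illustration of merging an overlapping read pair into one longer
# read. NextGENe's patented elongation method is not public; this only
# shows the general suffix/prefix-overlap idea.
def reverse_complement(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def merge_pair(read1, read2, min_overlap=10):
    """Merge read1 with the reverse complement of read2 if they share an
    exact overlap of at least `min_overlap` bases; otherwise return None."""
    mate = reverse_complement(read2)
    for overlap in range(min(len(read1), len(mate)), min_overlap - 1, -1):
        if read1[-overlap:] == mate[:overlap]:
            return read1 + mate[overlap:]
    return None

r1 = "ACGTACGTTTGACCTTGAGC"
r2 = reverse_complement("TTGACCTTGAGCAAATCCGG")  # mate read, as sequenced
print(merge_pair(r1, r2))  # -> ACGTACGTTTGACCTTGAGCAAATCCGG
```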
These types of projects make the most of what Fosnacht calls tertiary analysis tools. “We want to provide the third level of tools to the actual users to speed up the whole process. Unlike many ‘freeware’ or other programs that just give you a list—an Excel spreadsheet basically—of all the variants that were found, you can actually see them in our browser… A lot of people like to touch and feel, you might say, their data.”
DNASTAR agrees. “There just aren’t enough bioinformaticians out there to handle the data deluge,” says Tom Schwei, DNASTAR’s VP and general manager. “And they don’t want to wait in line for a week or two weeks for that bioinformatics core group. We believe that the end user, the person who is sponsoring the experiment, knows best their research objectives and their data and is in the best position to do the analysis… You shouldn’t have to be a bioinformatician to parse through the data and understand what you see.”
The company views the NGS market as simply an extension of what their customers are already doing. As such, DNASTAR recently moved their next-gen data products under the Lasergene umbrella, a 15-year-old brand name that also includes primer design software and cloning resources. SeqMan NGen is the GUI-based assembler; SeqMan Pro is the data analysis module. They are designed to work together, although they can be purchased separately.
Schwei says that the new Lasergene offerings are designed to be intuitive, fast, and easy to use. Users can easily compare their variants to dbSNP and the reference genome.
“SeqMan Pro’s strength is really the analysis of any number of samples. It can handle individual assemblies quite well, and it can handle multiple assemblies.” The software can manage 100 samples of a certain region, says Schwei. “We will do separation of the tags if people are running multiple samples in one lane on the assembly side. We’ll then report on those samples on the analysis side.”
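Tag separation (demultiplexing) is conceptually straightforward: bin each read by the index sequence at its start. A minimal sketch, assuming exact six-base tags at the 5’ end; DNASTAR’s actual tag handling is not described here:

```python
# Minimal sketch of separating multiplexed reads by a 6-base tag at the
# start of each read. Tag length and exact-match requirement are
# assumptions for illustration, not DNASTAR's implementation.
from collections import defaultdict

SAMPLE_TAGS = {"ACGTAC": "sample_1", "TGCATG": "sample_2"}

def demultiplex(reads, tag_length=6):
    bins = defaultdict(list)
    for read in reads:
        sample = SAMPLE_TAGS.get(read[:tag_length], "unassigned")
        bins[sample].append(read[tag_length:])  # strip the tag
    return bins

reads = ["ACGTACGGGTTTACCA", "TGCATGCCAATTGGCC", "NNNNNNACGTACGTAC"]
for sample, seqs in demultiplex(reads).items():
    print(sample, len(seqs))
```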
The software is also affordable. “For less than $10,000, scientists can get all the software they need—and the computer to run it on—to do any next-gen assembly and analysis project they need to do,” he says, thanks to proprietary assembly algorithms. “Basically, it no longer relies on the amount of memory you have on your computer,” Schwei says. “There’s no correlation between the amount of RAM and the size of the genome you have to assemble.”
Avadis NGS by Strand Scientific Intelligence enables “NGS analysis for the rest of us,” says Thon de Boer, director of product management, software. With a strong focus on visualization, Avadis NGS has three major workflows: DNA-seq, ChIP-seq, and RNA-seq. De Boer says Strand is focusing on “the individual researcher with their individual [sequencer] and their individual piece of software.” The desktop software manages analysis after alignment, what de Boer calls the “backend” analysis, and he says that Strand has been able to “sell to places that already have the Genomics Workbench from CLC bio, for instance, because people really like our visualization.”
“We have special never-seen-before visualizations around splicing—very informative alternative splicing analysis visualization. And the same goes for SNP analysis, what we call the variant supporter view, which is just a better way to look at all of the supporting reads for a particular SNP without being overwhelmed with the amount of data you have to look through.”
Strand has also had success partnering in the field. Ion Torrent is a reseller of Avadis NGS, and Strand recently announced a partnership with Germany-based BIOBASE to give all Avadis NGS users one-click access to BIOBASE’s Genome Trax curated biological database. “We bundle a lot of our software with publicly available data,” says de Boer, and the partnership with BIOBASE will expand the available data pool and “make it easy for customers to get all the information that they need right from our servers.”
Partek’s Genomics Suite is “a complete start-to-end solution,” says Hsiufen Chua, Partek’s regional manager in Singapore. “Just one package off the shelf and you can do all the genomics data analysis you need in the lab.”
Partek’s product integrates sequencing data with microarray data or real time PCR because, as Chua points out, most labs have several types of data. “[Customers] would like to bring together two sets of data because they would have samples that have been run on different platforms.” Genomics Suite allows users to compare the results in the same platform.
“From the point that [researchers] obtain the reads from the next-gen sequencer, we take care of them. We have solutions to help them align the reads down to the point where they can do quality control to see if the data they have is good enough to proceed for further analysis. If so, then we have the tools for them to do the statistical analysis—all the statistics. Following that, we also have the same tools to do the biological interpretation.”
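Partek has not spelled out its quality-control criteria here; the idea of a QC gate between alignment and statistical analysis can be sketched with a couple of illustrative thresholds (the metric names and cutoffs are assumptions, not Partek’s):

```python
# Hypothetical QC gate before downstream statistics: proceed only if the
# run clears simple thresholds. Metrics and cutoffs are illustrative.
def passes_qc(stats, min_mapping_rate=0.90, min_mean_base_quality=28.0):
    return (stats["mapping_rate"] >= min_mapping_rate
            and stats["mean_base_quality"] >= min_mean_base_quality)

run = {"mapping_rate": 0.94, "mean_base_quality": 33.1}
if passes_qc(run):
    print("QC passed: continue to statistical analysis")
else:
    print("QC failed: re-check library prep or sequencing run")
```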
Service Segment
But if a do-it-all platform is not what a researcher wants, the analysis-as-a-service segment of the market is expanding. While BGI (see p. 31) and Complete Genomics will do sequencing and analysis, Samsung has just launched beta testing of an analysis-only service.
Samsung’s Genome Analysis Service will provide analysis for whole-genome sequencing and RNA-seq for Life Technologies and Illumina data, says SungKwon Kim, director of the bioinformatics lab at Samsung SDS, with support for the Ion Torrent sequencer ready by the end of 2011.
The algorithms that Samsung SDS is using are a combination of open-source and vendor-provided software with Samsung’s own proprietary “tweaks,” says Kim. Samsung has built its own genome browser, but all of the data are available for download if the customer prefers another option.
Samsung is offering analysis on its own Cloud infrastructure in Korea, which Kim expects to be extremely efficient, safe, and fast. “I think our analysis job is much faster than other competitors,” he says. “Our whole genome analysis will take five days; our RNA analysis will take three days.”
He also cites Samsung’s reputation for enterprise-level IT. “We’ve been working with system innovation with banks, high-profile Fortune 500 companies, so when it comes to data security—I don’t think any other vendor companies should be able to match our capabilities in security and recovery handling.”
Kim says Samsung has been eyeing the NGS space for three years. “This [industry] is mainly driven by academics and research institutions who have some of the IT infrastructure and who have their own sequencers… but when the read price drops below $1,000, then I don’t think any research institute or academia will be able to handle [both] their own sequencing jobs and their own analysis jobs.”
With so many options, they shouldn’t have to. •
This article also appeared in the September-October 2011 issue of Bio•IT World magazine.