Computer Scientists’ Solution to a Biologist’s Problem
By Allison Proffitt
March 16, 2012 | HAMILTON, NEW ZEALAND—In 2004, a New Zealand biotech approached a group of computer scientists with a biology problem.
Genesis Research (a New Zealand-listed company that is no longer operational) was sequencing the poplar tree genome and had a problem that it estimated would take its in-house cluster about three months to solve, recalls Graham Gaylard, Real Time Genomics’ founder. “They wanted to see if Stuart and his crew could develop something better.”
“Our background was in high performance computing and data mining. We weren’t biologists,” explains Stuart Inglis, RTG’s CTO. “The biologists… came to us and explained their problem and we attacked it from a computer science perspective.”
And attack they did. A team of about seven computer scientists solved the problem in a few weeks, saving Genesis months of compute time, and filed a patent on the process that year.
Since then, Real Time Genomics has continued to use a computer science approach to speed up many sequencing analysis problems. The resulting software—RTG Investigator—is “proprietary patented technology with lots of trade secrets,” says Gaylard, but hinges on a core search engine that can be applied to many different problems.
The company’s R&D and software development arm, with its team of computer scientists, is based in Hamilton, New Zealand, while the “front office” functions, including the technical director, marketing, and some tech support, are based at the San Francisco headquarters, which opened in January 2009. Gaylard and CEO Phillip Whalen routinely make the trip between the two offices. It’s an efficient commute, Gaylard and Inglis joke. You leave New Zealand at 2 pm and arrive in California the day before.
Speed Search
The company has built its product portfolio in much the same way as it started: tackling partner problems one at a time. “Quite a lot of the bio input is coming from our partners,” says Inglis. “[They] bring their side, and say this is the problem we’d like to solve.”
One of the company’s early partnerships came at the launch of the Human Microbiome Project, the NIH effort to characterize the microbial communities found at several different sites on the human body.
The Genome Institute at Washington University did some benchmarking on the time it would take to process the glut of data. “We were aiming to produce 10 gigabases of sequence, 10^8 reads on 750 samples,” recalls George Weinstock, associate director of The Genome Institute. “We could do that in a matter of weeks or a month or so.” But the analysis was a different challenge; the team wanted to run a Blast search against all of Genbank. “Because bacteria are highly diverse, you really have to do a translated Blast, not just straight nucleotide comparison, so BlastX.”
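To make the "translated" part of Weinstock's point concrete: a translated search compares protein sequences rather than nucleotides, so each DNA read is first translated in all six reading frames (three offsets on each strand). The sketch below shows only that translation step, using the standard genetic code. The function names are illustrative, and nothing here reflects how BlastX or RTG's tools are implemented internally.

```python
from itertools import product

# Standard genetic code, listed in TCAG order (TTT, TTC, TTA, TTG, TCT, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(dna):
    """Map frame labels (+1, +2, +3, -1, -2, -3) to protein strings."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for frame, protein in six_frame_translations("ATGGCC").items():
        print(frame, protein)
```

Every read yields six candidate protein strings to search, which is one reason a translated search costs so much more than a direct nucleotide comparison.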
Weinstock estimated the analysis would take 50 years with the current tools.
Real Time Genomics proposed a solution at just the right time. “They were talking to [sequencing] centers and we said, ‘Well, we have this big problem for metagenomics,’” says Weinstock. “Really in the space of 6 to 9 months they’d solved the problem and put together their suite of metagenomics tools.”
Weinstock says that The Genome Institute stayed very involved along the way. “It’s been great working with them because we’ve had a very collaborative relationship. We put in a fair amount of effort in validating and looking for bugs and giving them feedback,” he says. Not only did the resulting solution need to be fast (Gaylard says the RTG pipeline is 1000x faster than BlastX), but it also needed to be accurate, and Weinstock and his group participated in much of that validation. “RTG put a lot of effort into [the project] and have been very, very responsive to input we gave,” Weinstock says.
The metagenomics tools RTG created with The Genome Institute became the foundation for the company’s metagenomics pipeline. “As soon as we see a problem, we render it down to a general solution, and then factor that into the product line. That’s been our approach. The more problems we’re exposed to, the better the general solution is,” says Gaylard.
Got Milk?
Though the company cut its teeth on metagenomics problems served up by Wash U, one of its first commercial customers tested RTG’s variant calling pipeline with a milk problem. Livestock Improvement Corporation (LIC) is a dairy farmers’ cooperative supporting the 11,000 farmers who account for 20% of New Zealand’s export earnings. Just a 15-minute drive from RTG’s Hamilton offices, LIC offers herd testing, software products to help farmers keep track of the data, and genetics research related to everything from milk production and fat and protein content to length of gestation.
LIC has been involved in genetics, mostly microarray and SNP chips, for quite a long time, says Richard Spelman, general manager of R&D at LIC. With help from a government grant, LIC began whole genome sequencing last year. LIC is a happy Illumina sequencing customer, but started its interpretation efforts in-house with BWA, GATK, and SAMtools.
When RTG entered the scene, Spelman set up a simple experiment: he gave one LIC bioinformatician BWA and GATK to do alignment and variant calling, and gave another the RTG variant calling pipeline to do both.
“One guy was doing [the project] the BWA/GATK kind of way, and the other guy was doing the RTG [pipeline],” he says. “Within a week we had RTG up and going, processed genomes, etc. Probably took us another 3-4 weeks to go through all the switches, all the options, etc under BWA/GATK.”
It wasn’t just ease of use; Spelman saw an immediate speed advantage too. The analysis was 8x to 10x quicker using RTG than previous options, he says. “We ended up looking at the amount of compute time we’d need if we went through [the open source] pipeline, compared to the RTG pipeline. I’d rather spend money on the biology, on sequencing, rather than putting huge computer infrastructure in place. Using RTG enabled me to sequence more genomes rather than throw [the money] into computer infrastructure.”
The first sequencing group consisted of 25 bulls sequenced at 30x coverage, but Spelman plans to sequence 500 more animals by the end of 2012 and believes he’ll need a couple of thousand sequences to do the population studies he has planned.
RTG’s trio caller, or family-based calling, is already drastically increasing the accuracy of the variant calling. “We’ve found that the accuracy of having the two parents and progeny has made a huge difference in the accuracy of our calls… It still astonishes me,” Spelman says. Calling SNPs on each individual separately yields about 10% non-Mendelian calls, but with family-based calling that falls to 0.1% to 0.3%. “SAMtools is the only other software that we know of and that we’ve tried [for this type of analysis], and we weren’t as happy with it.”
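For readers unfamiliar with the metric Spelman cites, a call is "non-Mendelian" when the progeny's genotype cannot be explained by one allele inherited from each parent. The sketch below shows how such a rate might be counted over a set of trio genotypes. It is a simplified illustration that assumes diploid autosomal sites with no missing data and ignores genuine de novo mutations. The function names and the tuple representation are hypothetical and are not drawn from RTG's software.

```python
def is_mendelian(child, mother, father):
    """Return True if one child allele can come from each parent.

    Each genotype is a pair of alleles, for example ("A", "G").
    """
    first, second = child
    return (first in mother and second in father) or (
        second in mother and first in father
    )


def non_mendelian_rate(trios):
    """Fraction of (child, mother, father) genotype trios that break Mendelian inheritance."""
    trios = list(trios)
    if not trios:
        return 0.0
    violations = sum(
        not is_mendelian(child, mother, father)
        for child, mother, father in trios
    )
    return violations / len(trios)


if __name__ == "__main__":
    # Three illustrative sites; the last cannot be inherited from these parents.
    sites = [
        (("A", "G"), ("A", "A"), ("G", "G")),
        (("A", "A"), ("A", "G"), ("A", "G")),
        (("G", "G"), ("A", "A"), ("A", "G")),
    ]
    print(f"non-Mendelian rate: {non_mendelian_rate(sites):.3f}")
```

Because true de novo mutations are rare, a high rate mostly signals genotyping errors, which is why the figure serves as a rough proxy for calling accuracy.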
With these tools, Spelman envisions being able to look at more complex families and sequence more animals, projects he wouldn’t have been able to do with his previous toolkit.
Pipeline Directions
With both the metagenomics and variant calling pipelines established, RTG is expanding its product suite. Gaylard mentions cancer callers, germline cell callers, and software for de novo assembly.
The two current pipelines are licensed to users and run from the command line, which lets bioinformaticians and biologists get as much data as possible from the software. Long term, Gaylard says, the company wants to “move further on down the chain” and offer a service model as well as tools for data management, curation, storage, and visualization.
RTG’s competition, as Gaylard sees it, is mainly open source options. Complete Genomics does something similar, he concedes, but only for human genomes. RTG can handle Complete Genomics data and many species other than human. Gaylard’s team started with the poplar tree genome and “worked our way down.” The human genome is smaller and less complicated than many of the other genomes in nature, he says.
The software also has no size limitations. The SNP caller can handle tens of thousands of SNPs and the metagenomics module can work against everything in Genbank. A strength of the software is the ability to combine datasets and their results, Gaylard says.
With the publicity gained from last year’s release of a free version of RTG Investigator (see “November-December FreeTrials”), Gaylard hopes to accelerate RTG’s growth by targeting large customers like Genentech and Merck.
“Some of these big customers are doing only hundreds of genomes a year. We see the market as thousands—hundreds-a-day type scope,” he says. “That’s where our software will really come into the fray: the ability to handle it at high throughput with accuracy.”