Playing the Markets: ClusterK Launches Cloud Scheduler, Open Source GATK Pipeline
February 18, 2015 | Life is too short to wait for GATK to finish.
At least that’s how Dmitry Pushkarev sees it. His new company, ClusterK, is releasing its genomics pipeline to illustrate how complex workflows like the Broad Institute’s GATK can be run efficiently—and much faster—on the cloud. The pipeline breaks the GATK workflow into thousands of small tasks, each taking 10-20 minutes, which can be run in parallel. “It allows the entire workflow to be distributed across dozens of compute nodes,” Pushkarev says, and results are returned much faster.
The pipeline can be run on local clusters or other cloud environments, but it was designed for Amazon Web Services. By taking advantage of AWS spot instances, Pushkarev says he can process a whole human genome—30x coverage—in three hours for less than $5.
Dmitry Pushkarev has the bona fides to make such a claim. Eight years ago Pushkarev came to the United States from Moscow to study physics at Stanford and found himself working on single molecule sequencing under Stephen Quake. He started with bacteria, then transitioned to whole genome sequencing projects using the Helicos platform.
Helicos produced a few terabytes of data, which was too much to process on the lab’s computers. “As I was driving back and forth down highway 101 to pick up a few dozen servers decommissioned from Yahoo to build our first lab cluster, it became clear that computation would play a significant part in genomics moving forward,” Pushkarev remembers.
In 2009, the lab published a genome that was fully sequenced at 30X for under $50,000 in reagents. “It’s hard to believe, but at the time it was considered a breakthrough,” Pushkarev says.
In 2010, Pushkarev began working to assemble the genome of Botryllus schlosseri, a marine species with a fully functioning stem cell system. The researchers believed the B. schlosseri genome could shed light on evolution, the immune system, and cancer genomics, but they didn’t have the right tools. “We realized there was not a good technology to assemble long reads. That’s when we started to work on our own long read DNA sequencing technology, which resulted in the creation of the long read sequencing protocol later known as Moleculo.”
Pushkarev, along with Quake and Mickey Kertesz, founded Moleculo in 2011. The company’s product, offered as a service, was about half wet lab and half bioinformatics: a molecular biology kit and protocol to create and tag the genomic DNA library. On the bioinformatics side, an algorithm took Illumina short reads and reconstructed long reads from the tags.
But it was a labor intensive undertaking.
“Basically to assemble one library, you had to spend almost 4,000 CPU hours,” Pushkarev remembers. “At the time we didn’t have our own cluster, so the only option we could find was AWS, Amazon Web Services. It took almost $400 of compute time just to assemble one library of data… This wasn’t feasible to scale. In this case, we actually paid more for compute than our salaries and reagents combined.”
But Amazon had just launched a spot market the year before, which offered hope, Pushkarev says.
Amazon Web Services offers spot instances as one option for buying EC2 compute time. Users bid a maximum price for spare capacity; as long as the fluctuating spot price stays below that bid, the instance runs at a steep discount, and when the price rises above it, the instance can be terminated without warning. Making the best use of spot instances therefore requires the right compute strategy. Tasks should be relatively small and the scheduling system needs to be fault tolerant, so that projects survive the abrupt stop. Once a task has been interrupted, there must be a strategy for bidding on and buying new instances to finish it.
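A minimal sketch of that pattern, assuming a hypothetical task queue (the `fetch_task`, `run_task`, `mark_done`, and `requeue_task` helpers below are placeholders, not ClusterK’s tools), shows why small tasks matter: a worker that disappears mid-task never loses more than one 10-20 minute unit of work.

```bash
#!/bin/bash
# Sketch of a fault-tolerant spot worker (hypothetical helpers, not ClusterK code).
# Tasks are kept small so an abrupt spot termination costs little: a task leased by
# a killed instance simply reappears on the queue after a timeout, and a freshly
# bid-for instance picks it up and finishes the work.
while task=$(fetch_task "$TASK_QUEUE"); do
    if run_task "$task"; then
        mark_done "$TASK_QUEUE" "$task"     # record completion so the task is not rerun
    else
        requeue_task "$TASK_QUEUE" "$task"  # ordinary failure: put the task back immediately
    fi
done
```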
The bidding process is the appeal. “If you do it well, you would on average save up to 90%,” Pushkarev said. “If you can choose different spot markets, you can almost always get the lowest price.”
For Moleculo, figuring out the best way to bid on and buy spot instances was essential.
“We built a very highly efficient scheduler that can use multiple instances, multiple availability zones, and multiple regions, and schedule a huge amount of compute across…a highly reliable system,” Pushkarev said. “If we could lower the compute price by a factor of ten that made it possible to launch the Moleculo product. Otherwise it would just be too expensive.”
The plan worked so well, Moleculo was able to take on another compute-intensive problem.
Human genome phasing was an important problem, but one that many believed couldn’t be solved because the computation was too intensive, Pushkarev says. In May 2012, Moleculo started working on a human genome phasing pipeline based on the protocol used for long reads, but with an entirely different computational approach.
“Now that we have access to this very cheap compute capacity, we can actually go ahead and try it. At this point we used the scheduler to actually build this new generation of algorithms to enable human genome phasing. And that actually turned out to be very successful.”
In December 2012, less than a year after Pushkarev and his colleagues launched Moleculo, Illumina bought the company. Pushkarev believes that the genome phasing capabilities were likely the main reason Illumina was interested in the company.
Moleculo’s team joined Illumina long enough to launch the product. What Moleculo offered as a service, Illumina released as a kit last June. The TruSeq kit included a library kit and two BaseSpace apps for assembly and haplotype phasing.
Moleculo’s advances—first synthetic long reads, then genome phasing—were enabled by technology, Pushkarev says. “In both cases, there’s an absolutely new breakthrough in sequencing technologies that was enabled by novel molecular biology and the library prep kit, but equally by the algorithms combined with access to compute capacity.”
“As a small startup, we had saved over $1 million in compute costs by being able to use the spot pricing. So we thought, what kind of value can we bring to new companies with this technology?”
Pushkarev launched ClusterK in mid-2013. Funded by VC and angel investors—many of whom funded Moleculo—the company has seven employees in its downtown Palo Alto location.
“The idea is to build a platform on top of the world’s underutilized capacity and use market-based mechanics to provide efficient scheduling on top of it. Currently we use AWS spot [markets] but we’re seeing other cloud providers moving in this direction, and hope to support more platforms in the near future,” Pushkarev says. “ClusterK scheduling technology has come a long way since what we developed at Moleculo, with current use cases ranging from HPC and bioinformatics to rendering and speech recognition.”
ClusterK’s first product, Cirrus, is a scheduler that bids on spot instances across multiple zones and regions; a ClusterK algorithm predicts which spot markets have the lowest price volatility.
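Cirrus’s prediction model is not public, but the raw ingredient is: AWS publishes price history for every spot market. A rough sketch of the idea, using the standard AWS CLI and simply ranking markets by recent price variance (the instance types and look-back window are illustrative, and this is not the Cirrus algorithm):

```bash
# Rank spot markets by recent price variance; low variance suggests a stable, cheap market.
# Requires the AWS CLI and GNU date; instance types and time window are illustrative.
aws ec2 describe-spot-price-history \
    --instance-types c4.2xlarge c4.4xlarge \
    --product-descriptions "Linux/UNIX" \
    --start-time "$(date -u -d '24 hours ago' +%Y-%m-%dT%H:%M:%S)" \
    --query 'SpotPriceHistory[*].[AvailabilityZone,InstanceType,SpotPrice]' \
    --output text |
awk '{ k = $1" "$2; n[k]++; s[k] += $3; ss[k] += $3*$3 }
     END { for (k in n)
             printf "%-30s mean=%.4f var=%.8f\n", k, s[k]/n[k], ss[k]/n[k] - (s[k]/n[k])^2 }' |
sort -t= -k3 -n | head          # least volatile markets first
```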
“There are 600 different spot markets; if you are only using one instance type in one zone, you’re stuck with that spot price,” Pushkarev explains. Building your own cluster at Amazon usually gets you compute power for about ten cents per core hour. By playing the markets, Pushkarev says, Cirrus users see costs of about $0.01 per core hour.
The GATK Use Case
As a proof of principle, Pushkarev is returning to the genomics community with a gift: an open source GATK pipeline.
Pushkarev says he chose GATK because it is one of the most complex, yet most commonly used, workflows in bioinformatics.
“Our biggest concern right now is that even at Stanford, people wait for days for their genome workflows to finish,” he said. “If people were just slightly more careful about how they design pipelines, and how to make them more parallel, you could save a lot of time and effort as researchers.”
Pushkarev’s GATK pipeline is available on GitHub and shows how GATK can be split into thousands of small tasks and packed quickly and inexpensively onto compute nodes.
“Traditionally, the best-practices GATK workflow consists of seven stages: align to the reference genome; sort; run Picard deduplication; indel realignment to refine local alignments; quality recalibration; variant calling; and variant recalibration,” Pushkarev says. “When run in series it can take up to a few days for a 40x whole genome, especially if HaplotypeCaller is used for variant calling.”
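For readers who have not run it, the serial version of that workflow looks roughly like the following. This is a sketch using GATK 3.x-era commands; reference, dbSNP, and HapMap file names are illustrative, and indexing and extra resource files are omitted for brevity.

```bash
# Serial GATK best-practices sketch (illustrative file names; GATK 3.x syntax).
bwa mem -t 16 ref.fa reads_1.fq.gz reads_2.fq.gz |                         # 1. align to reference
    samtools sort -O bam -o sorted.bam -                                    # 2. sort
java -jar picard.jar MarkDuplicates I=sorted.bam O=dedup.bam M=dup.metrics  # 3. deduplication
java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -R ref.fa \
    -I dedup.bam -o targets.intervals
java -jar GenomeAnalysisTK.jar -T IndelRealigner -R ref.fa -I dedup.bam \
    -targetIntervals targets.intervals -o realigned.bam                     # 4. indel realignment
java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R ref.fa -I realigned.bam \
    -knownSites dbsnp.vcf -o recal.table
java -jar GenomeAnalysisTK.jar -T PrintReads -R ref.fa -I realigned.bam \
    -BQSR recal.table -o recal.bam                                          # 5. quality recalibration
java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ref.fa -I recal.bam \
    -o raw.vcf                                                              # 6. variant calling
java -jar GenomeAnalysisTK.jar -T VariantRecalibrator -R ref.fa -input raw.vcf \
    -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap.vcf \
    -an QD -an MQ -an FS -mode SNP \
    -recalFile snp.recal -tranchesFile snp.tranches                         # 7. variant recalibration
```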
Pushkarev made two changes to increase parallelism.
First, “We split the input at the alignment stage into 500MB chunks, which usually take 15-20 minutes to align with BWA mem. These chunks are then combined on a chromosome level into one BAM file per chromosome,” he explained in a follow-up email to our conversation. “We then split chromosomes into regions of 30MB on average and run the variant calling step separately on them. In order to make sure that splits do not affect short variant calling, we scan throughout the entire chromosome to find ‘safe’ split locations.”
Pushkarev defines “safe” as a spot on the chromosome with no repeats, no sign of indels or variants, and for which almost all of the bases match the reference. As a result, split locations are unique to each sample being analyzed.
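The released pipeline’s exact criteria live in the GitHub code, but the idea can be sketched with standard tools: walk a pileup of the region around a candidate split point and accept the first position where every overlapping read matches the reference and no indels are reported. In the sketch below, the coordinates, depth threshold, and file names are illustrative, not the pipeline’s actual code.

```bash
# Sketch of a "safe" split-point scan: a position qualifies if coverage is reasonable
# and the pileup shows only reference-matching bases (no mismatches, indels, or gaps).
CHR=chr1; TARGET=30000000; WIN=5000
samtools mpileup -f ref.fa -r ${CHR}:$((TARGET-WIN))-$((TARGET+WIN)) sample.bam |
awk '$4 >= 10 {                   # column 4 = depth; require at least 10 reads
        b = $5                    # column 5 = per-read base calls at this position
        gsub(/\^./, "", b)        # drop read-start markers and their mapping quality
        gsub(/[$.,]/, "", b)      # drop read-end markers and reference matches
        if (b == "")              # nothing left: no mismatches, indels, or deletions
            { print $2; exit }    # report the first safe coordinate and stop
     }'
```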
With these two changes, a multi-day GATK workflow is divided into hundreds of 20-30 minute tasks, many of which can run in parallel. “For example,” Pushkarev says, “alignment of input splits, processing different chromosomes, and variant calling on different chromosome regions.”
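The first of those parallel steps might be sketched as below. This assumes an interleaved paired-end FASTQ with four-line records, GNU split and GNU parallel, and roughly 100 bp reads; the chunk size, thread counts, and chromosome are illustrative rather than the released pipeline’s exact commands.

```bash
# Split-align-merge sketch: many small alignment tasks, then one BAM per chromosome.
split -l 8000000 -d reads.fq chunk_            # ~500 MB of 100 bp reads per chunk, whole records
ls chunk_* | parallel -j 8 \
    'bwa mem -p -t 4 ref.fa {} | samtools sort -O bam -o {}.bam -'
for b in chunk_*.bam; do samtools index "$b"; done
samtools merge -R chr20 chr20.bam chunk_*.bam  # per-chromosome BAM for downstream steps
```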
The pipeline provides a very simple abstraction layer dubbed SWE—simple workflow engine—that lets users submit the entire workflow at once and define task dependencies and data transfer rules in a simple and elegant format, so that the entire workflow definition takes under 300 lines of shell code.
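SWE’s real syntax is in the GitHub repository; purely as an illustration of the idea, declaring dependencies up front might look something like the sketch below. The `submit` and `wait_for_all` helpers, their flags, and the task scripts are hypothetical, not SWE’s actual interface.

```bash
# Hypothetical workflow-engine sketch: each submit call returns a task ID, and
# downstream tasks name the IDs they depend on, so the whole DAG is submitted at once.
a1=$(submit --cmd "./align.sh chunk_00")                         # align one input split
a2=$(submit --cmd "./align.sh chunk_01")
m=$(submit --deps "$a1,$a2" --cmd "./merge_chr.sh chr20")        # per-chromosome merge
submit --deps "$m" --cmd "./call_variants.sh chr20:1-30000000"   # variant calling on one region
wait_for_all                                                     # block until every task finishes
```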
ClusterK customers can run the workflow on AWS using the Cirrus scheduler with S3 as a storage backend, which eliminates storage bottlenecks by distributing the data across the entire AWS region, allowing users to scale to thousands of whole genomes processed in parallel.
For a 30x whole genome analysis—from FASTQ files to a final VCF file—the process takes about 1,000 core hours. Parallelized and distributed across multiple instances, with clever spot buying, Cirrus can run that job for between $3 and $5, depending on whether UnifiedGenotyper or HaplotypeCaller is used.
In the open source offering, Pushkarev has provided some scripts to help run the pipeline locally, for example with MIT StarCluster and shared NFS for storage.
Cirrus pricing is based on its efficiency; users are charged a percentage of the savings they realize. For academic groups and small companies with some tech savvy, who can do some of the setup themselves, Pushkarev says special pricing can be arranged.