BGI Announces Cloud Genome Assembly Service
By Allison Proffitt
July 6, 2011 | SHENZHEN, CHINA—At the BGI Bioinformatics Software Release Conference today, researchers announced two new Cloud-based software-as-a-service offerings for next-gen data analysis. Hecate and Gaea (named for Greek gods) are “flexible computing” solutions for de novo assembly and genome resequencing.
These are “cloud-based services for genetic researchers” so that researchers don’t need to “purchase your own cloud clusters,” said Evan Xiang, part of the flexible computing group at BGI Shenzhen. Hecate will do de novo assembly, and Gaea will run the SOAP2, BWA, Samtools, DIndel, and BGI’s realSFS algorithms. Xiang expects an updated version of Gaea to be released later this year with more algorithms available.
Flexible computing, Xiang explained, is a more efficient cluster architecture than a traditional cloud. Jobs are grouped on the cluster by type to make the most of computing power and to address scalability issues: CPU-intensive jobs are grouped together, as are memory-intensive jobs and input/output-intensive jobs.
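The grouping idea Xiang described can be illustrated with a minimal sketch. The job names, the resource labels, and the `group_by_resource` helper below are hypothetical and are not taken from BGI's scheduler; the sketch only shows jobs being bucketed by their dominant resource so that each bucket could be dispatched to suitable nodes.

```python
from collections import defaultdict

# Hypothetical job records: (job name, dominant resource).
# These labels are illustrative only, not BGI's actual job types.
JOBS = [
    ("align_sample_a", "cpu"),
    ("assemble_contigs", "memory"),
    ("sort_bam", "io"),
    ("align_sample_b", "cpu"),
    ("merge_bam", "io"),
]


def group_by_resource(jobs):
    """Bucket jobs by dominant resource, preserving submission order."""
    groups = defaultdict(list)
    for name, resource in jobs:
        groups[resource].append(name)
    return dict(groups)


groups = group_by_resource(JOBS)
for resource, names in sorted(groups.items()):
    print(resource, names)
```

A real scheduler would also have to measure or estimate each job's resource profile; here the labels are simply assumed to be known in advance.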
Both the Hecate and Gaea services will run on the BGI compute cluster because “Amazon is slow,” Xiang said. Running the services on an in-house cluster also alleviates any internet access issues.
Hecate is built on a series of distributed algorithms that recognize and simplify non-branching, repeat-free regions of the genome, correct errors, and resolve ambiguous bubbles and short repeats, together with distributed graph-shrinkage algorithms that construct a linear DNA sequence. Hecate is based on BGI’s SOAPdenovo and SOAP2 algorithms but is more scalable than those algorithms alone.
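To give a sense of what simplifying non-branching regions means, here is a toy, single-machine sketch of path compaction on a de Bruijn-style k-mer graph. It is not Hecate's distributed implementation, which the announcement did not detail. The function names are hypothetical, only the forward strand is considered, and error correction and bubble resolution are left out.

```python
def kmers_of(seq, k):
    """Return the set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def successors(kmer, kmers):
    """K-mers in the set that overlap this k-mer's suffix by k-1 bases."""
    return [kmer[1:] + b for b in "ACGT" if kmer[1:] + b in kmers]


def predecessors(kmer, kmers):
    """K-mers in the set that overlap this k-mer's prefix by k-1 bases."""
    return [b + kmer[:-1] for b in "ACGT" if b + kmer[:-1] in kmers]


def compact(kmers):
    """Merge each maximal non-branching path into one sequence.

    Isolated cycles have no starting k-mer and are skipped in this toy.
    """
    kmers = set(kmers)

    def starts_path(kmer):
        preds = predecessors(kmer, kmers)
        # A path starts where there is not exactly one way in,
        # or where the single predecessor itself branches.
        return len(preds) != 1 or len(successors(preds[0], kmers)) != 1

    contigs = []
    for start in sorted(kmers):
        if not starts_path(start):
            continue
        contig, current = start, start
        while True:
            succs = successors(current, kmers)
            if len(succs) != 1:
                break  # dead end or branch: stop extending
            nxt = succs[0]
            if len(predecessors(nxt, kmers)) != 1:
                break  # next k-mer is a merge point
            contig += nxt[-1]
            current = nxt
        contigs.append(contig)
    return contigs


linear = compact(kmers_of("ATGGCGTGCA", 4))
branched = compact(kmers_of("ATGGCGTGCA", 4) | kmers_of("ATGGCGTGCT", 4))
```

In the first call the k-mers form a single unbranched chain and collapse back into one sequence; in the second, the two reads diverge at their final base, so the shared stretch and the two alternative ends come out as separate pieces.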
Xiang presented results from speed comparisons showing significant cost and time savings using Hecate for de novo assembly. Running SOAPdenovo on a single server for 70 hours resulted in 80% genome coverage at a hardware price of $150,000. Using 96 Hecate cores, the genome coverage increased to 84% in 42 hours at a price of $60,000.
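The relative savings implied by those figures can be worked out directly. The short calculation below uses only the numbers Xiang quoted; the `relative_change` helper is a hypothetical name introduced for illustration, and the percentages are derived here rather than reported by BGI.

```python
def relative_change(before, after):
    """Fractional change from before to after (negative means a reduction)."""
    return (after - before) / before


# Figures as quoted: single-server SOAPdenovo vs. 96 Hecate cores.
hours_change = relative_change(70, 42)
price_change = relative_change(150_000, 60_000)
coverage_gain_points = 84 - 80  # percentage points of genome coverage

print(f"run time: {hours_change:.0%}")
print(f"hardware price: {price_change:.0%}")
print(f"coverage: +{coverage_gain_points} percentage points")
```

On these numbers, the Hecate run took 40% less time on hardware costing 60% less, while covering four more percentage points of the genome.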
Gaea is designed to distribute resequencing computation across a cluster of nodes using the Hadoop Streaming framework, with personalized algorithm interfaces for SOAP and BWA. For the current version of Gaea (v 1.2), Xiang reported speed increases of 75x for SOAP2 and 90x for BWA using 100 cores. At 400 cores, the speedups rose to 300x and 346x, respectively, relative to running either algorithm on a single core. Xiang expects Gaea v 2.0 to see further improvements.
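One way to read these speedups is as parallel efficiency, the speedup divided by the number of cores. That metric was not part of Xiang's presentation; the sketch below simply derives it from the quoted figures, and the `REPORTED` table and `parallel_efficiency` function are illustrative names.

```python
# Reported Gaea v 1.2 speedups relative to a single core:
# (algorithm, cores) -> speedup
REPORTED = {
    ("SOAP2", 100): 75,
    ("BWA", 100): 90,
    ("SOAP2", 400): 300,
    ("BWA", 400): 346,
}


def parallel_efficiency(speedup, cores):
    """Fraction of ideal linear speedup achieved (1.0 would be perfect)."""
    return speedup / cores


efficiency = {
    key: parallel_efficiency(speedup, key[1])
    for key, speedup in REPORTED.items()
}

for (algorithm, cores), value in efficiency.items():
    print(f"{algorithm} on {cores} cores: {value:.1%} efficiency")
```

By this measure SOAP2 holds at 75% efficiency from 100 to 400 cores, while BWA slips slightly from 90% to 86.5%, which suggests near-linear scaling over that range.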
Gaea is also optimized for a biomarker analysis toolkit that includes SOAPsnp, DIndel and realSFS for SNP calling, indel calling and gap alignment.
More details about both products as well as a host of updated bioinformatics tools released at the event are available at soap.genomics.org.cn.