BGI Cloud on the Horizon
With growing resources, BGI is building its own cloud to share them all.
By Kevin Davies
February 2, 2011 | SHENZHEN, CHINA—Tianjian Chen, an architect in BGI’s computing platform group and a key engineer in the development of BGI’s new cloud computing resource, is 26 years old. No wonder his colleagues call him the “old man” of the team.
BGI is a rapidly growing genomics institute headquartered in Shenzhen (see, “Sequencing, Sequencing, Sequencing,” Bio•IT World, Nov 2009) attracting the cream of China’s young scientific and engineering talent. Chen graduated from Tsinghua University in Beijing, which he assures me is the best university in China. He was initially recruited by Baidu, China’s leading Internet search engine, but a few years ago, Chen admitted, “I lost my passion.” He has clearly found it again at BGI, which he calls “the most interesting employer I’ve worked for … we want to bring all the young scientists in China into the international scientific community.” He believes that BGI Cloud will help BGI share its enviable resources with its partners and customers.
Under the direction of Guoqing Li, associate director and database manager in BGI’s Bio-Cloud Computing department, BGI began laying plans for a Cloud resource in 2009. Li trained as a physicist at Nanjing University, near Shanghai, before joining BGI. He generously credits Chen with driving the recent progress. “In 2009, we didn’t have Tianjian. He’s a very important person in our project,” says Li.
“The data are growing so fast, the biologists have no idea how to handle this data,” says Li. “I think the Cloud will be the solution. We have to sequence more and more data. Maybe we have to sequence everybody! Every fish! The data keep growing and we need a lot of compute power to process.”
For Chen, there are three priorities for BGI Cloud:
- Connectivity: With partners across China and the world, “we’ve connected all the people and resources—the sequencers, the samples, the ideas, the compute power, and the storage together to make a greater contribution.”
- Scalability: Calling the explosion in next-gen sequencing (NGS) a “data tsunami,” Chen says BGI aims to provide the parallel computing resources to help users manage and process these datasets. “If you can’t do the analysis, it’s pointless. We use distributed computing technology in the bioinformatics area. We’re confident we can solve the scalability problem.”
- Reproducibility: Chen says bioinformatics researchers are happy to show their data and their pet program—SOAP, BWA, and so on. “That’s fine. But analysis is very complicated. The methodology he is actually using is a homemade pipeline. It’s very difficult to reproduce that result. We built this platform not only to solve the capability and connectivity of computing, we want to resolve the problems in reproducing designs and procedures.”
With new NGS gene assembly and SNP calling programs such as Hecate and Gaea about to be released (see, “In the Name of Gods”), Li says it was essential to develop a “run-time environment, a Web-based platform for Cloud storage and reference data, with a feature-rich GUI, and effective bioinformatics analysis software.”
Cloud Cover
Close to 100 people have been working on BGI Cloud. “BGI is an open facility,” he says. Most projects are shared with partners, who share similar problems in data management. “We have tons of hardware,” says Chen. “There’s a key point: we need methodology and software to keep this hardware effective.”
Unlike service providers such as Amazon EC2, Chen says, “we are weapon designers for life sciences! If you buy Amazon, you just get your shotgun and bullets. If you buy Cloud from us, we’ll give you a full automatic robot! They are infrastructure services. We can build a software layer both in our own hardware and on top of them.” With researchers ever more “anxious about the increase in data in their research projects,” Chen says, “I think it’s a perfect time to drag our weapon from the warehouse to show our design and solution.”
The initial programs on BGI Cloud were confined to BLAST, SOAP, and SOAPsnp for sequence alignment, and an analysis pipeline for RNA-Seq data. BGI is planning to release new software on the platform, including a gene browser for data visualization, and later in 2011, Pipeline Composer. “We are putting all our R&D resources on this project,” says Chen. “Everyone can compose their pipeline on this platform and redo everything. We will have a log mechanism built into the Composer. We want to deliver the product in a very graphic interface way, so people who don’t understand the coding skills can also use it to compose their own pipeline.”
BGI Cloud is open to anyone, not just partners or customers. Pricing will be based on results, not on the time an analysis takes. “We understand the scientific procedure. We want users to be satisfied with their result and then pay us.”
Data Transfer
BGI Cloud is currently deployed at two sites—Shenzhen and Hong Kong, the latter to serve visitors from overseas—with several thousand cores available. More important than size, Chen says, is scalability. “We’re now using Hadoop for our infrastructure for test purposes. For the online service environment, we are developing our own file system, and our own scheduler systems to support the MapReduce framework. We can easily add nodes to this platform as the workload goes up.” BGI Cloud has been designed with a raft of security features built in. Chen hopes to pass ISO 20007 certification in 2011.
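For readers unfamiliar with the MapReduce model Chen mentions, the following is a minimal, single-process sketch in Python. It is not BGI's code or its scheduler. The function names (`map_phase`, `shuffle`, `reduce_phase`, `count_kmers`) and the k-mer counting task are illustrative assumptions, chosen only to show how work splits into independent map tasks whose outputs are grouped by key and then reduced.

```python
from collections import defaultdict
from itertools import chain


def map_phase(read, k=3):
    """Map step: emit a (k-mer, 1) pair for every k-mer in one read."""
    return [(read[i:i + k], 1) for i in range(len(read) - k + 1)]


def shuffle(pairs):
    """Group values by key, as a framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: sum the counts collected for one k-mer."""
    return key, sum(values)


def count_kmers(reads, k=3):
    """Run map, shuffle, and reduce in sequence on a list of reads."""
    mapped = chain.from_iterable(map_phase(read, k) for read in reads)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical toy reads, not real sequencing data.
reads = ["ACGTAC", "GTACGT"]
counts = count_kmers(reads)
```

Because each call to `map_phase` depends on only one read, a real framework can spread those calls across as many nodes as are available, which is the property behind adding nodes as the workload goes up.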
When it comes to transferring large volumes of data, “FedEx is still the best choice,” says Chen, but for flexibility and privacy, BGI has developed its own data transmission technology. “The hypertransfer technology is based on the UDT protocol to solve the problem of latency.” Chen says most of the effort is being targeted to the data compression algorithm, achieving a data compression ratio of less than 25%.
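To put the compression figure in perspective, here is a rough back-of-the-envelope calculation in Python. It assumes that "compression ratio" means compressed size divided by original size, and it uses a hypothetical 4 TB dataset and a hypothetical 1 Gbps link. None of these numbers describe BGI's actual network, and protocol overhead is ignored.

```python
def transfer_hours(size_tb, ratio, link_gbps):
    """Hours needed to send size_tb terabytes compressed to `ratio` of its size.

    Uses decimal units: 1 TB = 1e12 bytes, 1 Gbps = 1e9 bits per second.
    """
    compressed_bits = size_tb * 1e12 * 8 * ratio
    seconds = compressed_bits / (link_gbps * 1e9)
    return seconds / 3600


# Assumed values for illustration only.
raw = transfer_hours(size_tb=4, ratio=1.0, link_gbps=1.0)      # uncompressed
packed = transfer_hours(size_tb=4, ratio=0.25, link_gbps=1.0)  # compressed to 25%

print(f"uncompressed: {raw:.1f} h, compressed: {packed:.1f} h")
```

Under these assumptions the transfer time drops from roughly 8.9 hours to roughly 2.2 hours, since transfer time scales linearly with the amount of data sent.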
As for scheduling, this may be “the key component of the Cloud platform,” says Chen. BGI has considered commercial offerings, but Chen says software businesses have inherent limitations, because no partner can provide “on-demand modifications. We’re very concerned about whether our needs can be satisfied in an acceptable time.”
While BGI rolls out these resources for 2nd-generation sequencers, Li promises further developments. “We won’t stop—we have another team working on 3rd-gen sequencing. Then we can move it onto the Cloud platform.” •
In the Name of Gods
In the past two years, BGI has released some superb NGS software including SOAP (Short Oligonucleotide Analysis Package), developed by Ruiqiang Li and colleagues. A new version of SOAP is faster, can support longer read lengths, and takes just 2 minutes to compare 1 million single-end reads to the human reference genome. BGI’s latest NGS alignment programs are named after Greek gods.
Hecate is a MapReduce program designed to run on a cluster, representing what Tianjian Chen calls “the first ring of the genome process tool chain. It solves the scalability of assembly. We can now assemble any genome on commodity machines. If you don’t get bigger machines, you just need more time. If you have enough funding, we can accelerate assembly about 10-20X.” In the near future, users can access Hecate either by logging into BGI Cloud or by downloading the program and putting it on Amazon’s EC2 platform.
Hecate is said to be much cheaper than other alignment programs. “We can use hard drive access to replace RAM access,” says BGI’s Guoqing Li. Because it is scalable, it is potentially applicable to any sized genome. And its flexibility means “we can divide and distribute the Cloud computing resources according to different tasks.”
In a presentation at BGI’s ICG-V conference in November, Guoqing Li presented preliminary data comparing Hecate to SOAPdenovo. Using 96 cores, Hecate achieves the same level of performance as SOAPdenovo for about one quarter of the cost. Li and colleagues aim to release an upgraded version of Hecate in early 2011, with improved coverage (from 60% to 80%), longer contigs, and doubled or quadrupled speed compared to Hecate 1.0.
Gaea is a new SNP calling program. “SNP calling is a pain for our researchers because the data are so big,” says Chen. “Our scientists say, ‘Can we get a program to compute all the data?’ I asked: ‘All the data? At the initial stage, it’s 4 terabytes (TB).’ Everyone was just mute! It would take months to process the data in traditional way. But we’re not scared. Applying the same distributed computing technology, we could process more than 100 TB of data with just 30 commodity machines. So 4 TB is not an issue.”
This article also appeared in the January-February 2011 issue of Bio-IT World Magazine.