BGI Releases Updated Bioinformatics Software and Datasets

By Bio-IT World Staff

November 14, 2011 | BGI used the opening of its annual international genomics conference, ICG-VI in Shenzhen, China, to announce the launch of new bioinformatics analysis pipelines and software and a new open-access database for large-scale data.

The new releases include tools for genome assembly and genetic variation analysis, as well as two cloud-based “green” solutions for genomics research. They comprise the Short Oligonucleotide Analysis Package (SOAP series) and cloud-based software (Hecate 2, Gaea 2, GAMA, GSNP and Adam) for next-gen data analysis.

The new research journal launched by BGI, GigaScience, released GigaDB, a freely accessible, large-scale database. All of the datasets in GigaDB will be assigned a Digital Object Identifier (DOI), which allows the data to be directly cited in future publications.

The newly updated SOAP suite includes SOAP3 -- a GPU-accelerated short read alignment tool; SOAPindel -- an indel finder; SOAPfusion -- a gene fusion detector; SOAPsplice -- a splice-junction detector; SOAPdenovo-Trans -- a de novo transcriptome assembler; and Metacluster 4.0 -- a binning tool for metagenomics data.

The SOAP toolkit is freely available at http://soap.genomics.org.cn.

Zhiyu Peng, VP of BGI’s research and cooperation division, introduced the RNA-Seq tools SOAPsplice and SOAPfusion. Peng says tests of SOAPsplice on both simulated and real datasets show high sensitivity and specificity, especially at low sequencing depth. BGI believes SOAPfusion has the highest sensitivity and lowest false discovery rate of any available gene-fusion detection tool.

The emergence of RNA-Seq technology “accelerates the speed in the detection of fusion genes and splice junction sites,” says Peng. “The gene fusion discovery performed by SOAPfusion… will greatly accelerate the study of genomic alterations in cancer.”

The SOAPdenovo-Trans assembler handles alternative splicing and expression level analysis for de novo transcriptome assembly using short reads. Yin Long Xie, BGI senior bioinformatician, says SOAPdenovo-Trans “[provides] a more accurate, complete and faster way to construct the transcript sets.”

BGI’s new metagenomics software tool, Metacluster 4.0, addresses the binning problem familiar in metagenomic datasets, and is capable of handling 100 species at varying abundance ratios.

Cloud-based Solutions

BGI’s new software offerings include updates to BGI Cloud, which it introduced at the 2010 ICG conference.

Mian Lu (Hong Kong University of Science and Technology) has been collaborating with BGI in the area of “green cloud computing.” For example, a data processing pipeline was re-implemented on a specialized GPU platform, reducing processing time from 90 hours to just six.

GSNP (detection of SNPs) and GAMA (allelic frequency estimations) are two discovery tools for genetic variation implemented on a GPU platform. Compared with its predecessor, SOAPsnp, GSNP achieves higher performance through improved representation for base information and massive data parallelism on the GPU. Lu says that a three-day run to process a genome sample can be reduced to about two hours using GSNP.

As for GAMA, an allele frequency analysis of a group of 1,000 individuals could take up to a year, but the new version of GAMA returns a result in two days.
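To put the quoted run times on a common scale, the short sketch below converts each before-and-after pair into a speedup factor. It uses only the figures cited above; treating "up to a year" as 365 days is an assumption made here for the arithmetic, and the labels are illustrative.

```python
# Speedup factors implied by the run times quoted in the article.
HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365  # assumption: "up to a year" is taken as 365 days


def speedup(before_hours, after_hours):
    """Return how many times faster the new run is than the old one."""
    return before_hours / after_hours


# (before, after) in hours; the labels are illustrative, not official names.
runs = {
    "GPU pipeline (90 h -> 6 h)": (90, 6),
    "GSNP (3 days -> 2 h)": (3 * HOURS_PER_DAY, 2),
    "GAMA (1 year -> 2 days)": (DAYS_PER_YEAR * HOURS_PER_DAY, 2 * HOURS_PER_DAY),
}

for label, (before, after) in runs.items():
    print(f"{label}: roughly {speedup(before, after):.1f}x faster")
```

On these numbers the pipeline re-implementation is about 15 times faster, GSNP about 36 times, and GAMA on the order of 180 times, with the last figure depending on how long "up to a year" is taken to be.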

"Adam was developed by exploiting hardware features, which could sort and remove duplicate[s] from massive data. Its performance has been improved by three times, handling 150GB data with a node of 25GB memory," said Dr. Lu.
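The article does not describe how Adam works internally. As a general illustration of sorting and de-duplicating data larger than a node's memory, the sketch below uses a textbook external merge sort: sort fixed-size chunks in memory, spill each to disk, then merge the sorted chunks while dropping repeats. The function names and the `chunk_size` parameter are hypothetical and are not Adam's API; records are assumed to be strings without newlines.

```python
import heapq
import os
import tempfile
from itertools import groupby


def _write_sorted_chunk(records, work_dir):
    """Sort one in-memory chunk and spill it to a temporary file."""
    records.sort()
    fd, path = tempfile.mkstemp(dir=work_dir, text=True)
    with os.fdopen(fd, "w") as handle:
        handle.writelines(record + "\n" for record in records)
    return path


def _read_lines(path):
    """Stream records back from a spilled chunk, one line at a time."""
    with open(path) as handle:
        for line in handle:
            yield line.rstrip("\n")


def sort_and_deduplicate(records, chunk_size):
    """Yield unique records in sorted order.

    Hypothetical helper, not Adam's method: at most chunk_size records
    are sorted in memory at once; the rest live in temporary files.
    """
    with tempfile.TemporaryDirectory() as work_dir:
        paths, chunk = [], []
        for record in records:
            chunk.append(record)
            if len(chunk) >= chunk_size:
                paths.append(_write_sorted_chunk(chunk, work_dir))
                chunk = []
        if chunk:
            paths.append(_write_sorted_chunk(chunk, work_dir))

        # k-way merge of the sorted chunks; duplicates become adjacent,
        # so groupby collapses each run of equal records to one.
        merged = heapq.merge(*(_read_lines(path) for path in paths))
        for record, _run in groupby(merged):
            yield record


if __name__ == "__main__":
    reads = ["ACGT", "TTGA", "ACGT", "GGCC", "TTGA"]
    print(list(sort_and_deduplicate(reads, chunk_size=2)))
```

The memory ceiling is set by `chunk_size` rather than by the total input size, which is the general idea behind handling a dataset several times larger than available RAM.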

(Further information about the new software and pipelines can be found at: http://jil.genomics.org.cn.)

BGI also announced updated flexible computing solutions for de novo assembly and resequencing analyses: Hecate 2 and Gaea 2. Hecate 2 has greater scalability than its predecessor. "Hecate 2 adopts more sophisticated models for solving massive scale constraint optimization problems in de novo assembly in a fine-grained manner, which enables data from different sequencing platform[s] to be assembled simultaneously and leads [to] a dramatic improvement of the assembly quality in terms of accuracy, length and coverage," said Evan Xiang, R&D director at BGI’s Flexible Computing Center.

GigaDB launched

The launch of GigaDB, which will host publicly available, large-scale datasets, each with a unique DOI, was accompanied by the release of 17 datasets. These datasets span much of the tree of life, with data hosted from plants, animals (vertebrate and invertebrate) and microbes. The plant data includes whole-genome data from the potato, the Chinese cabbage, the domestic cucumber, and sweet and grain sorghums. The animal data includes whole-genome data from three species of ants, a roundworm, the naked mole rat, the domestic sheep, domestic and wild silkworms, the Tibetan antelope, and three different datasets (whole genome, transcriptome, and methylome) from a single Asian man.

The unique DOI issued to each dataset will allow researchers to cite the data itself, as a separate entity from the data analysis papers, with the goal of promoting rapid data release. Data producers can now be properly acknowledged and recognized for their work rather than waiting for a more extensive research paper to be published.
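As a rough illustration of what citing a dataset by DOI can look like in practice, the sketch below turns a DOI into a resolvable link through the public doi.org resolver and assembles a simple citation string. The citation layout (creators, year, title, publisher, identifier) is one common pattern and is an assumption here, not a format prescribed by GigaDB; the DOI, authors and title in the example are placeholders, not real records.

```python
from urllib.parse import quote

DOI_RESOLVER = "https://doi.org/"


def doi_url(doi):
    """Build a resolvable link for a DOI, percent-encoding unsafe characters."""
    return DOI_RESOLVER + quote(doi, safe="/")


def format_data_citation(authors, year, title, publisher, doi):
    """Assemble a dataset citation (hypothetical layout, for illustration)."""
    return f"{authors} ({year}): {title}. {publisher}. {doi_url(doi)}"


if __name__ == "__main__":
    # Placeholder values only; this DOI does not identify a real dataset.
    print(
        format_data_citation(
            authors="Example A; Example B",
            year=2011,
            title="Example genome dataset",
            publisher="GigaScience Database",
            doi="10.0000/example-dataset",
        )
    )
```

Because the DOI stays fixed even if the hosting location changes, a citation built this way keeps pointing at the same dataset.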

Earlier this year, Bio-IT World published an interview with the editor-in-chief of the BGI journal GigaScience, Laurie Goodman. GigaDB is available at http://GigaDB.org.