DNAnexus to Host Short Read Archive (SRA) Database in Google Cloud

October 12, 2011

By Kevin Davies  

 MONTREAL -- DNAnexus, a San Francisco-based software company offering Cloud-based tools to manage next-generation genome sequence (NGS) data, will offer free hosting and access to the Short Read Archive (SRA), the funding-challenged trove of NGS read data hosted by the National Center for Biotechnology Information (NCBI).    

Earlier this year, NCBI announced that it would be phasing out funding for the SRA. “When we read about NCBI phasing out support, we realized this would be a huge loss for the community, but a great opportunity for us to step up and preserve it,” said DNAnexus co-founder and CEO Andreas Sundquist.

DNAnexus will host a copy of the SRA repository on Google’s Cloud Storage infrastructure. This new community resource was made publicly available at sra.dnanexus.com.

According to NCBI’s Jim Ostell, NCBI remains the primary archive for SRA. Ostell explained in an email to Bio-IT World that DNAnexus approached the institute about providing an alternative hosting environment in light of funding problems.    

“We said, as with other commercial vendors, that we'd be happy to work with them or their partners on transferring the public data in SRA into whatever system they wanted and to explain whatever technical issues may arise. We agreed with them that there was certainly a need for nice packaged tool sets for people working with high-throughput sequence, and if they wanted to demo their platform on the public SRA data that was fine with us. The more the better.”     

Ostell says he welcomes other vendors working on the public SRA data. “If anyone finds it useful, either to explore and analyze the public data, or to work on pre-release data of their own, then that's good too.”     

However, Ostell stressed that neither DNAnexus nor any other commercial outfit is taking over SRA. “They are not an archive, they don't issue accession numbers, and are not part of any official NIH data publishing process… It's been a strictly technical issue of transferring data, working with Google, and getting their platform in place.”     

Google Backing  

The relationship grew in part from investment interest in DNAnexus from Google Ventures, which separately announced a funding deal (see below). Sundquist says the relationship shows Google’s commitment to help “democratize DNA data.”  

Sundquist stresses that access to the SRA data will be free: there is no registration or fee structure. “We just want to make sure people can access this resource,” he says. “We think ultimately, there’s value in helping promote data exchange, preserving these data sets, and growing this space faster.”   

“We’re doing this to help NCBI with their mission,” says Sundquist. "Our hope is that the hosted version of the SRA will provide a complementary way for researchers to access these data. We worked with the NCBI to get a complete set of these data and they were supportive in helping us get our hosted version up and running." He projects that the SRA repository will grow tenfold each year. "The SRA has done a tremendous service to the research community by capturing these data and we want to help preserve it."  

 “The SRA has been an invaluable resource to the research community,” commented Rick Myers, president and director of the HudsonAlpha Institute for Biotechnology in Huntsville, Alabama. “However, the ever increasing size of datasets being submitted and the need to easily integrate them into downstream analyses has tested the limits of its utility. I am very pleased to see private entities such as DNAnexus step in to keep this resource freely accessible and provide a more intuitive and user-friendly portal for searching and retrieving these important genomic datasets.”  

“No-one thinks the Government should be providing access in perpetuity,” says Sundquist. “When the SRA was originally built, it was a different era, a different volume of data.” Given the explosion in NGS data, Sundquist expects to see the archive swell to hundreds, possibly thousands of times its present size in the years ahead. “[SRA] will be a tiny bit of data compared to five years from now. Think what it will be like when we’re sequencing millions or tens of millions of genomes!”  

A copy of all the data in the public SRA will be hosted in Google Cloud Storage. “The DNAnexus SRA website is an example of a ‘big data’ initiative that benefits from rethinking the interface in a 100% web-enabled world,” says Eric Morse, head of business development, Google Cloud Storage. “Combining Google’s massively scalable data storage infrastructure with DNAnexus’ expertise in web-based interfaces, genomics data analysis, and visualization, researchers can quickly access the world’s genomic information from any web browser.”  

Sundquist says DNAnexus has also cleaned up the SRA interface as “it’s been a little cumbersome to use.” Eventually, researchers might be able to submit data direct to DNAnexus to host in the SRA. “There is no sign up required for anyone who wants to use SRA, but if you want to do analysis, we’ll provide unlimited access” to DNAnexus tools for a limited time.  

While central NIH funding for the SRA is ending this month, NCBI will still accept certain classes of SRA data that don't necessarily generate massive amounts of data but are important for the scientific record (see: http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi).

Individual NIH institutes have asked NCBI to price out a cost-per-study for their larger sequencing projects. Many institutes have chosen to fund NCBI directly to keep SRA going for their studies, says Ostell.   

Fresh Funds  

DNAnexus also announced $15 million in new funding, led by Google Ventures – “the best ‘big data’ investor out there,” says Sundquist – and TPG Biotech. Since the company’s first round of just $1.5 million, it has grown to 25 employees, and Sundquist hopes to double the headcount in the next 9-12 months.

The Google tie-in is interesting, as most of the DNAnexus infrastructure is built on Amazon’s EC2 Cloud. “Now we’re working with both Amazon and Google on providing access to large genomic datasets,” notes Sundquist.   

Sundquist also announced a significant cut in pricing for academic customers. “In some ways, the academic community is the key to driving this space forward. Because of the great response, we’ve slashed our prices substantially by half for academia, effective immediately.”  

Sundquist says DNAnexus is “absolutely focused” on genome interpretation, recognizing a huge opportunity for growth. “For us, it’s not just about the ‘$1 million interpretation’ for one genome, you have to think about this interpretation and scale it up to thousands of genomes. That’s a whole different domain, a huge space that no-one has built anything around.”  

[Editor's Note: This story has been updated to include comments from NCBI's Jim Ostell and other minor clarifications.]