'Big Data' Journal GigaScience Makes Its Debut

July 12, 2012

By Kevin Davies 

The inaugural issue of the ‘big data’ journal GigaScience and its companion database, GigaDB, is published this week. The journal, which is published by China’s BGI, the world’s largest genome sequencing institute, and BioMed Central, the British open-access publisher, combines standard article publishing with complete data hosting and analysis tools, all of which are open access and freely available.

A good GigaScience paper “needs to be large-scale data, but what classifies as big data in different fields varies,” says journal editor Scott Edmunds. Aside from emphasizing big data, GigaScience is also stressing the importance of reproducibility of results. “We’ve asked our referees questions about data quality and reproducibility over their assessment about interest. We’re trying to do 100% reproducibility in a workflow system.” 

Of the 11 research articles in the first issue, Hong Kong-based Edmunds cites an impressive paper by Stephan Beck’s group at University College London, UK (http://goo.gl/2nZgD), which examines whole-genome analyses of DNA methylation.

The article includes a supplemental file amounting to 84 gigabytes (GB) of data, including all the supporting data and software tools needed to recreate the experiments. All of it is freely available for download and reuse from the journal’s companion database, GigaDB, hosted by BGI.

GigaDB supports open data by waiving all copyright in published datasets through its use of the Creative Commons CC0 public domain dedication, which enables anyone to access and reuse published data without restrictions. GigaScience papers also include Digital Object Identifiers (DOIs) for all datasets in GigaDB, aimed at making datasets more permanent, as well as fully trackable, linkable and citable.

While many of the GigaScience papers focus on applications of genomics and next-generation sequencing, areas of neuroscience will also be a major theme. “Neuroscience and imaging will be the next big field -- they’re starting to get used to sharing data,” says Edmunds. 

The debut issue contains no papers authored by scientists from BGI, although Edmunds says there are some currently in the peer review process.  

Laurie Goodman, GigaScience Editor-in-Chief, commented: “The full use of large-scale data has sadly lagged far behind our ability to produce it. The leaders of BGI realized they had the ability, given their vast computational resources, to create an innovative new journal format — one where enormous datasets could be fully hosted and directly linked to their original scientific studies. By including analysis tools in a data platform, as well as the planned addition of cloud technology later this year, GigaScience can serve as a means to put such data into the hands of researchers who do not have the vast computational resources required for optimal data use.” 

[Ed Note: In 2011, we published an extensive interview with Goodman on the motivation for launching GigaScience.]

Other papers of note include a commentary from Guy Cochrane, Ewan Birney and colleagues at the EMBL-European Bioinformatics Institute, who propose a graded system for storing DNA sequences under differing levels of compression, based on how easily the data can be reproduced and whether DNA samples are available for resequencing.

Another essay, by Michael Schatz and colleagues at Cold Spring Harbor Laboratory, proposes a sequencing-based pathogen surveillance system, provided scientific resistance to data sharing can be surmounted.

Also in the debut issue is UC Davis professor Jonathan Eisen, who opines on “Badomics words and the power and peril of the ome-meme.”