A Broad Perspective on Genome Data at Bio-IT World Expo

April 30, 2012

By Charli Kerns  

April 30, 2012 | BOSTON—Jill Mesirov and Martin Leach—the Broad Institute’s chief informatics and chief information officers, respectively—stressed themes of integrative biology and big data in a twin-bill opening keynote session at the 10th anniversary Bio-IT World Conference & Expo last week.   

Ten years after Broad Institute director Eric Lander delivered the opening keynote at the inaugural Bio-IT World Expo, Mesirov and Leach reflected on the past decade of explosive growth in genomics data, noting that as the technology to extract, measure, and analyze that data becomes increasingly complex, scientists’ ability to interpret the data in a meaningful way becomes more important.   

To that end, Mesirov and Leach presented two sides of the same coin. Mesirov offered ways of making the technology more accessible to non-computational scientists, while Leach discussed the need to train more data scientists.  

“The amount of data that we can acquire now compared to ten years ago is tremendous,” said Mesirov, director of computational biology and bioinformatics at the Broad Institute. Genomic data continue to offer huge promise for basic and biomedical research by enabling a better understanding of diseases at the molecular level and by uncovering the genetic basis and mechanisms of disease.

Mesirov used the Broad Institute’s data acquisition as an example of how much and how fast the technology has grown. In 1999, the Broad’s rate of sequence data acquisition was less than a gigabase a year—it’s now more than 150 terabases a year. All of this work has been enabled by the development of new high-throughput data acquisition technologies, as well as the fact that the cost of genome sequencing has plummeted by six orders of magnitude.

“Recently, what we’ve seen is this wonderful democratization of sequencing,” said Mesirov. This explosion in sequence data has resulted in shifts in scientific thinking. Scientists use much more complex computational algorithms to analyze the data, resulting in much more interdisciplinary and cross-lab work as well as more international collaboration. Scientists are also able to get a more global view of the data, shifting from analysis of single genes to functional evaluation of networks, processes and pathways.  

 

Double Down  

 

Leach noted how the density of data storage has also increased dramatically over the past two decades.   

 

“16 Gigabytes (GB) was the highest amount of data stored on a disc in 1993. That now fits into a pinky-sized SD card in my phone,” Leach said. He then gave a hypothetical example of what that advance could mean for sequencing billions of genomes, an effort some have called the Humanity Genome Project. Back-of-the-envelope calculations suggest that sequencing the world population would generate approximately 4 x 10^18 bytes of information. “Of course, I’m scratching my head asking, ‘How many hard drives would that take?’ ” said Leach.   

In a sign of the progress in data storage density, Leach said that a hypothetical attempt to store 6 billion genome sequences back in 1980 would have required 450 billion drives, roughly equal in weight to every car in the world. By 2015, however, using 16-terabyte drives, scientists could store that same amount of data on half a million drives, which would fit onto a soccer pitch.  

“Why would we even want all this raw data?” asked Leach. Focusing on just exomes would reduce the number of drives to about 5,000—77,000 square feet now becomes 800 square feet. “That’s now not so big,” said Leach. “The question still is what would we do with all that data?”  
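Leach's drive counts can be roughly re-derived from the figures quoted above. The sketch below is not his calculation: the function name `drives_needed`, the use of decimal terabytes, the choice of two stored copies, and the 1 percent exome fraction are all assumptions made here for illustration. Under those assumptions the arithmetic happens to land on the half a million and 5,000 drives mentioned in the talk.

```python
# Back-of-the-envelope storage arithmetic using figures quoted in the talk.
# Assumptions (not from the talk): decimal terabytes, two stored copies,
# and an exome that is about 1 percent of a whole genome.

TOTAL_BYTES = 4 * 10**18        # quoted estimate for sequencing the world population
DRIVE_BYTES = 16 * 10**12       # one 16-terabyte drive, decimal units (assumed)


def drives_needed(total_bytes: int, drive_bytes: int, copies: int = 1) -> int:
    """Return the number of whole drives needed to hold `copies` copies of the data."""
    needed = total_bytes * copies
    # Ceiling division with integers, so a partly filled drive still counts.
    return -(-needed // drive_bytes)


single_copy = drives_needed(TOTAL_BYTES, DRIVE_BYTES)            # one copy of everything
two_copies = drives_needed(TOTAL_BYTES, DRIVE_BYTES, copies=2)   # assumed redundancy

EXOME_FRACTION_PERCENT = 1      # assumed share of the genome covered by the exome
exome_bytes = TOTAL_BYTES * EXOME_FRACTION_PERCENT // 100
exome_two_copies = drives_needed(exome_bytes, DRIVE_BYTES, copies=2)

print(f"Whole genomes, one copy:   {single_copy:,} drives")
print(f"Whole genomes, two copies: {two_copies:,} drives")
print(f"Exomes only, two copies:   {exome_two_copies:,} drives")
```

A single copy at these sizes would need a quarter of a million drives, so the quoted totals presumably include some allowance, such as redundancy or a larger per-genome size, that the talk summary does not spell out.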

With advanced technologies come increasingly complex issues with data. Scientists need to integrate large datasets and multiple data types, which creates data management problems such as finding the data that are actually relevant. The workflows and algorithms are becoming much more complicated, and scientists are making greater demands on computing power. They also have to get all of these methods and software tools to work together.

Compounding this problem is the proliferation of data tools. “There’s a lot of data out there, and it’s hard to get answers out of that data,” said Leach. According to Mesirov, there are some 7,000 to 10,000 bioinformatics tools available for download online, as well as 5,000 data repositories. The Broad Institute alone offers 60 downloadable bioinformatics tools. The challenge is getting all these tools to work together, which may be out of reach for most biologists, who may not be able to code, especially at the level of complexity these tools require.  
 

Mesirov and Leach proposed complementary solutions to these challenges. Mesirov suggested creating a collaborative tool that enables experimental scientists, who are not necessarily experts in computational methods, to access and interpret the data.   

Mesirov announced a new resource called GenomeSpace, a project to build an online community in which diverse computational tools such as Cytoscape, Galaxy, GenePattern, IGV, Genomica, and the UCSC Genome Browser can be found and made to interoperate. The tools retain their identity and remain usable as stand-alone software, and GenomeSpace maintains their native look and feel. GenomeSpace currently has three biological projects: cancer stem cells, patient stratification, and lincRNAs. “Our goal is to bring the ever changing wealth of genomic analysis methods and whatever data is required to the fingertips of any working biologists,” said Mesirov.  

Meanwhile, Leach proposed increasing the number of data scientists to work on making sense of the data. “Once you can get all the different data from all the different areas, you need that breed of scientist that can actually work with the data that can know where and how to look and visualize that data,” Leach said.

Mesirov argued that integrated genomics holds the key to addressing important biomedical questions and that technology is the driving force behind the research. She ended her speech by saying, “Ten years from now at Bio-IT World’s 20th anniversary, the integrated genomics will bring us into an era of genomic medicine in the clinic.”  
 

Charli Kerns is a graduate student in the Boston University science journalism program.