INSIDE THE BOX: Solutions for Data Sharing in Life Sciences
By Ari Berman
July 16, 2012 | Inside the Box | Sharing scientific data is as fundamental to the progression of science as the research design itself. Without data sharing, experiments cannot be peer-reviewed, and scientists cannot build on existing findings by taking the next steps in the laboratory.
Unfortunately, data sharing is becoming more and more difficult. Compare scientific papers published in the early '90s to those published in 2012 -- the differences are striking. Back then, any and all data associated with a project could fit in a figure or two, so the paper itself was the point of data sharing. Today, more and more papers are published with reams of supplementary data: PDF tables alone can run to hundreds of pages, and even those are a distilled, reduced version of the original data. (New initiatives such as the journal GigaScience and its associated database should help address this issue.) This illustrates the crux of the issue: modern research produces tons of data, and publications are no longer a viable medium for sharing all those data.
So, why is data so big now? As widely discussed in previous Inside the Box columns, astounding advances in next-generation sequencing (NGS) technologies, the speed of data collection, and the complexity of experimental designs are responsible for the “data deluge.” That compounded complexity has made it harder for individual research laboratories to handle the data on their own, hence the growing number of collaborations formed to distribute the analysis load.
These phenomena, coupled with the NIH's insistence on complete and open sharing of published data, have vastly increased the need for direct data sharing. Sadly, sharing over the Internet isn't a viable option in most cases: the current infrastructure is inadequate for moving big data between institutions, because sufficiently high-bandwidth connections remain prohibitively expensive for most organizations.
There are a few feasible solutions in use today. These include shipping data on hard drives by courier (or “forklifting”); central storage using cloud-based solutions (like Amazon’s S3); and granting collaborating researchers remote access to local systems and data so analyses can be run in place. Each scenario has pros and cons, but they are mostly inefficient, expensive, and impractical as long-term solutions.
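To put the bandwidth problem in perspective, here is a rough back-of-the-envelope sketch in Python. The dataset size, link speeds, usable-bandwidth fraction, and courier delay are illustrative assumptions, not measurements.

```python
# Back-of-the-envelope transfer-time estimates. The dataset size, link
# speeds, courier delay, and 50% usable-bandwidth figure are all assumptions.

def transfer_hours(size_tb, link_gbps, efficiency=0.5):
    """Hours to move size_tb terabytes over a link_gbps connection,
    assuming only `efficiency` of the nominal bandwidth is usable."""
    size_bits = size_tb * 8e12               # 1 TB = 8 x 10^12 bits
    usable_bits_per_sec = link_gbps * 1e9 * efficiency
    return size_bits / usable_bits_per_sec / 3600

dataset_tb = 10                              # e.g., a batch of NGS runs (assumed)
for gbps in (0.1, 1, 10):
    print(f"{dataset_tb} TB over {gbps:>4} Gbps: "
          f"{transfer_hours(dataset_tb, gbps):7.1f} hours")

# A courier delivers a box of drives in roughly a day regardless of size,
# which is why "forklifting" data remains so common.
print(f"{dataset_tb} TB by overnight courier:   ~24.0 hours")
```

At a 100 Mbps campus link, those 10 TB take weeks; even a dedicated 1 Gbps link needs close to two days. The courier wins comfortably, which is exactly the problem.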
There are also solutions that maximize the use of the Internet for data transfer, like Aspera’s fasp-based file-transfer software, Globus GridFTP, and the CRAM compression format. While these tools make better use of the current infrastructure, they are still limited by physics and tend to saturate Internet connections, which can affect entire organizations. So, short of a massive improvement in Internet infrastructure, the original problem remains. Or does it?
In times of inadequate resources, most species either adapt or perish. We are at a similar crossroads with data sharing. So, how can we adapt the technology we have today to solve the infrastructure problem? The answer lies somewhere between cloud-based solutions and granting remote access to local systems. Big data is not a new problem, nor is it limited to the biological sciences. Luckily, the commercial analytics world has already produced innovations that offer lessons for solving our own problems.
Helping Hadoop
Distributed resource allocation systems like Hadoop are great examples of how to solve big data problems. Hadoop is built around a distributed file system (HDFS) that is “rack-aware,” meaning it knows the precise location of data within its structure. So, if someone wants to analyze some data, the analysis process is brought to the systems closest to where the data are stored. For instance, if a Hadoop volume is made up of 5,000 hard drives in 5,000 servers across two datacenters and someone wants to analyze data localized within datacenter 1 on rack 23, Hadoop executes the commands on servers within rack 23 to minimize the data transfer distance, thus speeding up the analysis.
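To make the rack-aware idea concrete, here is a toy sketch of the placement logic -- not Hadoop's actual scheduler, and the node names, rack layout, and block map are invented -- showing the preference for data-local, then rack-local, then remote execution.

```python
# Toy sketch of rack-aware task placement, in the spirit of Hadoop's scheduler.
# Node names, rack assignments, and the block map below are made up.

RACK_OF = {                        # which rack each server lives in (assumed)
    "node-a": "rack-23", "node-b": "rack-23",
    "node-c": "rack-07", "node-d": "rack-99",
}
BLOCK_LOCATIONS = {                # which servers hold replicas of each block
    "block-001": ["node-a", "node-c"],
}

def pick_node(block_id, idle_nodes):
    """Prefer an idle node that already holds the block, then an idle node
    on the same rack as a replica, and only then any idle node."""
    replicas = BLOCK_LOCATIONS[block_id]
    # 1. Data-local: the block is already on this node, no transfer needed.
    for node in replicas:
        if node in idle_nodes:
            return node, "data-local"
    # 2. Rack-local: same rack as a replica, traffic stays off the core network.
    replica_racks = {RACK_OF[n] for n in replicas}
    for node in idle_nodes:
        if RACK_OF[node] in replica_racks:
            return node, "rack-local"
    # 3. Remote: any idle node; the block must cross racks to get there.
    return next(iter(idle_nodes)), "remote"

print(pick_node("block-001", {"node-b", "node-d"}))   # ('node-b', 'rack-local')
```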
In and of itself, Hadoop is not a solution that will work for life sciences, but the underlying concept can be extended to data sharing in life sciences. The answer is to bring the analysis to the data, rather than the data to the analysis. Let’s move the smallest possible number of bits (command sets) over the Internet, not the largest (the data).
The next question is: how? If a centralized system existed that let researchers register their data and computing resources and grant other users permission to access them, then collaborators could initiate analyses from that central system on the remote systems that actually hold the data. In this workflow, the analysis would be brought to the data rather than the data to the analysis. The only things that would move are the commands that launch the analysis, the monitoring data relaying its progress, and the results sent back to the central system for interpretation. This takes the hard part of data sharing out of the equation entirely, because the data never have to move at all. One could simply analyze remote data and get results back as if the data were on local machines -- a simple inversion of workflows that yields an elegant solution to a large problem.
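As a thought experiment, here is a minimal sketch of what such a centralized registry might look like in Python. Every class, field, and method name here is hypothetical; the point is only to show that registrations, permission grants, and analysis requests are tiny compared with the data they describe.

```python
# Minimal sketch of a "bring the analysis to the data" registry.
# All class names, fields, and the submit/return flow are hypothetical.

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Dataset:
    dataset_id: str
    host_site: str                      # institution whose cluster holds the data
    shared_with: set = field(default_factory=set)

@dataclass
class AnalysisRequest:
    dataset_id: str
    requester: str
    command: str                        # a few bytes travel, not terabytes
    status: str = "queued"
    result_uri: Optional[str] = None    # small result file pushed back when done

class Registry:
    def __init__(self):
        self.datasets = {}
        self.requests = []

    def register(self, dataset: Dataset):
        self.datasets[dataset.dataset_id] = dataset

    def grant(self, dataset_id: str, collaborator: str):
        self.datasets[dataset_id].shared_with.add(collaborator)

    def submit(self, dataset_id: str, requester: str, command: str):
        """Queue an analysis to run at the site that hosts the data."""
        ds = self.datasets[dataset_id]
        if requester not in ds.shared_with:
            raise PermissionError(f"{requester} may not analyze {dataset_id}")
        req = AnalysisRequest(dataset_id, requester, command)
        self.requests.append(req)
        return req                      # the host site picks this up and runs it

# Usage: a collaborator launches an analysis remotely and waits for results.
reg = Registry()
reg.register(Dataset("tumor-wgs-042", host_site="lab-A"))
reg.grant("tumor-wgs-042", "collaborator@lab-B")
job = reg.submit("tumor-wgs-042", "collaborator@lab-B", "bwa mem ref.fa reads.fq")
print(job.status)   # "queued"; only commands, progress, and results cross the wire
```

The registry itself stores only metadata and commands; the sequence data never leave the hosting institution's cluster.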
The obvious downside to this data-sharing concept is that local computational resources would now be used to support the analyses of distant researchers rather than being fully available to local researchers. That makes the approach economically impractical at a time when research money is scarce. One solution is for researchers to budget money in their grants for hardware dedicated to information management and data sharing.
Another concern is the availability of analysis software on remote systems. If collaborating researchers want to analyze data in a certain way using certain software, that software would have to be available on the remote system as well. Perhaps there’s an easy solution to that problem, but it is still a problem.
In short, there are solutions to the data-sharing problem. The long-term answer lies either in a major advancement in Internet technology (or its affordability) or in the development of new and crazy ways of sharing and analyzing data -- like bringing the analysis to the data, rather than pushing data around. If this concept could be realized in a simple, centralized social network of sorts, it could revolutionize how research is done, leading to further technology innovations that allow faster analysis with greater precision.
Innovation always carries with it a period of ineptitude, followed by intelligent adaptation. I believe we’re close to this adaptation period and that there is significant hope on the horizon.
Ari Berman is a consultant with the BioTeam. Email: ari@bioteam.net