Arvados Project Looks to New Models of Genomic Data Management
By Aaron Krol
April 14, 2015 | As genomics becomes a truly global enterprise, with DNA data stored and shared across continents and the great divides of academia, healthcare, and the biotech industry, more players in the field are raising concerns about the state of the ramshackle IT infrastructure supporting this data. The Global Alliance for Genomics and Health (GA4GH), a coalition of nearly 300 member organizations, is gaining momentum as a creator of standards to support data sharing, but it faces serious obstacles in a science that, perhaps more than any other, is confronting big data’s far limits.
The sheer size of genomic datasets, which in large population studies can sometimes reach petabyte scale, makes transferring, retrieving, and searching through data a major challenge. It also creates problems when databases are updated or recombined, as legacy versions of datasets are expensive to preserve. Even more problematic is the issue of rerunning or validating analyses. Different researchers put together their own pipelines of analysis tools to make sense of raw genomic data, and getting these pipelines to produce the same results in different compute environments is notoriously difficult.
“In the past, when people have said, ‘here’s a link to my pipeline,’ it’s been with the caveat of ‘good luck ever making this run,’” says Adam Berrey, co-founder and CEO of Curoverse. Berrey’s company is the major contributor to Arvados, an open source project to provide a foundation for working with biological data, including genomes — one that is compatible with all major analysis tools, but sets new standards for data management.
The Arvados platform began at Harvard University as part of the Personal Genome Project (PGP), whose ambitious goal is to sequence the whole genomes of thousands of individuals and make that data fully public for researchers around the world. (The name Arvados comes from Star Trek: The Next Generation, where the planet Arvada III is the childhood home of Chief Medical Officer Beverly Crusher; ArvadOS would be the operating system for this future of medicine.) The PGP still runs on Arvados, but most of the platform’s creators at Harvard have now migrated to Curoverse, including Berrey’s co-founder Alexander Wait Zaranek.
At its new corporate home in Boston’s Seaport District, this team has continued to extend and refine the Arvados platform, while also building a commercial service around it. The Curoverse business model is similar to that of other IT companies working with open source software, like Red Hat with Linux or Cloudera with Hadoop. Curoverse helps customers implement Arvados, and typically provides a hosting environment for the platform. It currently has around a dozen pilot users through a private beta program, but today, the company announced its public beta, inviting all comers to test out Arvados in preparation for a full launch later this year.
Berrey believes the time is right for institutions working with biological data to experiment with more dedicated infrastructures. He points out that the rate at which genomic data is growing means that many users will soon need to upgrade their capacity, providing a perfect chance to install new hardware or shift to cloud models that implement Arvados.
“If you look at the state of the art in the industry… it’s an architecture that is pretty old, and really hasn’t taken advantage of the modern distributed computing and the computing innovation that transformed web-scale data,” he says. “There’s a real opportunity to introduce some new capabilities.”
Reproducible Computing
Curoverse is paying close attention to the efforts of GA4GH, as the genomics community tries to converge on a common architecture for sharing information. A senior software developer at Curoverse, Peter Amstutz, co-chairs a Global Alliance task team on containers and workflows; Wait Zaranek is also an active participant in GA4GH working groups.
Berrey hopes that Curoverse’s connection to GA4GH can follow the model of working relationships between the World Wide Web Consortium (W3C) and groups like the Apache Foundation in the early days of the Internet. As the standards-defining W3C worked on common solutions for basic infrastructures like web servers, it sought input from groups trying to implement those standards with open source software.
“Pulling those things together is what made the web possible,” says Berrey. “The standards were defined by the W3C, the software was built as open source software, and we have the web today because of that effort.”
The science of genomics finds itself in a similar situation today, having common problems and a cultural dedication to the open source movement, but little in the way of shared infrastructure. “The vision that [GA4GH] have laid out, that there will be a network of bioinformatics cores all over the world that will be storing genomic data, and we need to be able to query across that data… That is a really interesting problem technically,” Berrey says.
It’s taken some thoughtful data management strategies to navigate this problem in Arvados. For instance, the platform is built on a storage architecture that rarely copies or moves data from server to server. Instead, the individual files that make up datasets receive their own cryptographic hashes that make it easy for Arvados to find and organize them. The same files can then be assigned to multiple datasets without copying them, and when an analysis is performed on an entire dataset, that workflow can be sent out to each individual file rather than the data moving to a server where the workflow resides.
A "provenance map" feature in Arvados helps users track the tools and datasets involved in any specific workflow they run. Image credit: Curoverse
The hashing of each file also provides a quality control feature that prevents Arvados users from needlessly replicating their data. “Say you have a petabyte cluster,” explains Berrey. “Someone goes to upload a new file. We’ll know instantly that we already have this file [because Arvados recognizes its hash value], so instead of making another copy of this file, we’ll just give you the content address.”
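The storage model described above can be illustrated with a toy content-addressed store. This is a minimal sketch, not the Arvados API: the `ContentStore` class, its method names, and the choice of SHA-256 are all assumptions made for illustration, since the article does not say which hash function or interfaces Arvados uses.

```python
import hashlib


class ContentStore:
    """Toy in-memory content-addressed store (hypothetical, not Arvados code)."""

    def __init__(self):
        self.blobs = {}     # content address -> file bytes
        self.datasets = {}  # dataset name -> list of content addresses

    def put(self, data: bytes) -> str:
        # The address is derived from the bytes themselves (SHA-256 is assumed here).
        address = hashlib.sha256(data).hexdigest()
        # Keep the bytes only if this address is new; a repeat upload is a no-op.
        self.blobs.setdefault(address, data)
        return address

    def add_to_dataset(self, name: str, address: str) -> None:
        # A dataset is just a list of addresses, so files are shared, not copied.
        if address not in self.blobs:
            raise KeyError(f"unknown content address: {address}")
        self.datasets.setdefault(name, []).append(address)

    def map_dataset(self, name: str, func):
        # Apply the analysis function to each file where it is stored.
        return [func(self.blobs[address]) for address in self.datasets[name]]


store = ContentStore()
first = store.put(b"ACGTACGT")
second = store.put(b"ACGTACGT")  # same bytes uploaded again
other = store.put(b"TTGACA")

store.add_to_dataset("cohort_a", first)
store.add_to_dataset("cohort_b", first)  # same file, second dataset, no copy
store.add_to_dataset("cohort_b", other)

lengths = store.map_dataset("cohort_b", len)
```

Here the second upload returns the same address as the first and stores nothing new, and both datasets refer to one underlying copy of the shared file.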
Another key feature of Arvados is its treatment of pipelines. Because the combinations of tools used to analyze genomic data — some open source, others proprietary, and still others home-brewed — are non-standard and sensitive to their compute environments, replicating results with them is an uncertain prospect at the best of times. Arvados gets around this problem by letting users store stable copies of each pipeline they run, including an identical version of the dataset and every tool in the workflow. A URL address for that pipeline will then direct third parties to an exact copy of the original run, where they can reproduce the results themselves, or even tinker with the tools and datasets involved. Pipelines can be made fully public, or restricted to specific Arvados users for collaborative research.
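One way to picture a stable pipeline record is as a manifest that pins every input file and every tool by content hash, with the record's own identifier derived from that manifest. The sketch below works under those assumptions only; `pipeline_record`, the manifest layout, and the use of SHA-256 are hypothetical and are not drawn from Arvados itself.

```python
import hashlib
import json


def content_address(data: bytes) -> str:
    # SHA-256 is an assumption; any cryptographic hash would serve the sketch.
    return hashlib.sha256(data).hexdigest()


def pipeline_record(dataset_files, tools):
    """Build a (record_id, manifest) pair pinning exact inputs and tool builds.

    dataset_files: list of bytes objects, one per input file
    tools: dict mapping a tool name to the bytes of that tool's build
    """
    manifest = {
        # Sorting makes the record independent of the order files were listed in.
        "inputs": sorted(content_address(f) for f in dataset_files),
        "tools": {name: content_address(build) for name, build in tools.items()},
    }
    # Canonical JSON, so the same manifest always serializes to the same bytes.
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return content_address(canonical.encode("utf-8")), manifest


files = [b"sample reads 1", b"sample reads 2"]
tools_v1 = {"aligner": b"aligner build 1", "caller": b"caller build 1"}
tools_v2 = {"aligner": b"aligner build 2", "caller": b"caller build 1"}

id_a, manifest_a = pipeline_record(files, tools_v1)
id_b, _ = pipeline_record(list(reversed(files)), tools_v1)  # same run, reordered
id_c, _ = pipeline_record(files, tools_v2)                  # one tool changed
```

Because the identifier depends on every pinned hash, rerunning with identical data and tools yields the same record, while swapping a single tool build produces a different one. That is the property a shareable link to an exact run would rely on.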
Early adopters of Arvados have already used this feature to make their findings more transparent. The PathoMap project, for instance, which made headlines this February for mapping the microbes living in New York subway cars, placed several of its pipelines online in Arvados for anyone to run. (A login is required.)
Voyage to Arvada
While the high grade of reproducibility in Arvados is currently being used mainly to share research findings, it could have more practical roles to play in a medical system that increasingly uses genetic data to inform patient care. In a clinical setting where genetic tests are used to make diagnoses or choose between treatment options, the pipelines used to support these tests can change rapidly, making it important to preserve a record of how past patients’ test results were interpreted. Clinical labs also need to keep stable records of their pipelines when they demonstrate tests for regulators and accrediting organizations.
In the long term, the Arvados team hopes to support hospitals and other care centers that are processing greater and greater volumes of biological data. That includes genomes, but also imaging data for pathology and sensors on medical devices. Right now, the few medical centers that are taking systematic approaches to big biological data face the same kinds of problems that have plagued electronic health records, where a huge assortment of vendors and few common standards make it difficult to meaningfully mine or share information.
“There is no way you’re going to realize this vision for a federation of hospitals and bioinformatics cores sharing data if you don’t have open source software,” says Berrey. “It’s just not going to happen on proprietary systems.”
While a few customers in Curoverse’s private beta, such as Johns Hopkins University and Harvard Medical School, have clinical operations that could eventually migrate to Arvados, so far they have used the platform only for research purposes. Today’s public beta launch also opens the door for more medical centers to dip their toes in Arvados, although it’s unlikely that any users will make the leap to the clinical setting right away.
To deploy Arvados, Curoverse has entered a partnership with Intel, which will provide hardware pre-loaded with the Arvados platform to users who want to run the program on local clusters. Curoverse is also offering cloud deployments, run through the Amazon and Google clouds, as well as hybrid solutions for users who want to keep their private data on-premises while analyzing public data over the web.
As open source software, Arvados is also free for anyone to download and modify on their own hardware. The software licenses for the Arvados project are intentionally lenient, giving third parties the option to write proprietary tools on top of the open source foundations. Berrey sees this as good for both genomics as a whole and Curoverse’s business, by encouraging wider adoption of the platform; his company will also offer support and training to users who want to download Arvados on their existing clusters.
Meanwhile, even as Curoverse gears up for its first paid contracts in the second half of 2015, the company is still looking for new areas where Arvados can support the mission of GA4GH. The team’s next project, Berrey hints, will deal with new methods for storing data on genetic variants — current file types are hard to query, and have blind spots when it comes to large structural variation in the genome.
“Our goal is to create infrastructure where you could take a million whole genomes and do machine learning at extremely high speeds, and you could do extremely complex genotyping queries in sub-seconds,” Berrey says. “We think it’s possible to store the data that way.”
Whether or not Arvados does become the standard operating system for personalized medicine, it’s encouraging to see a company betting big on the GA4GH vision. The computing architecture for genomics is due for an overhaul, and Curoverse plans to be in on the ground floor when research institutions start to rebuild.