Open Data and Patient Modeling in Europe

Open-source debates abounded at the third annual Bio-IT World Europe conference.

By Allison Proffitt

November 21, 2011 | HANNOVER, GERMANY—“The tools and library situation in bioinformatics is an open-source zoo,” said Misha Kapushesky, functional genomics team leader with the European Bioinformatics Institute (EBI), during his presentation on the Gene Expression Atlas platform. If that’s so, attendees got quite a tour of the menagerie at the third annual Bio-IT World Europe conference*, with much emphasis on open-source platforms and cloud deployments.

The ArrayExpress Archive is an EBI database of functional genomics experiments where one can query and download data. Gene Expression Atlas contains a subset of curated and re-annotated archive data, which can be queried for individual gene expression results under different biological conditions across experiments. It’s meant to be a simple interface for identifying strong differential expression candidate genes in conditions of interest.

Kapushesky and colleagues have deployed the ArrayExpress high-throughput sequencing pipeline inside the R Cloud at EBI. The R Cloud was originally a server farm and is EBI’s “experiment to increase the use of its infrastructure,” Kapushesky said. The cloud is free, but use is constrained. Currently R Cloud is linked to the European Nucleotide Archive (ENA), ArrayExpress, and quality control reports.

Daniel White (Max Planck Institute of Molecular Cell Biology and Genetics) presented Fiji, an extension of the open ImageJ image analysis platform that bundles Java, Java 3D, and plugins organized in a menu structure. White said that Fiji now represents a philosophy and community as well as a platform, with users sharing code and developing plugins for image transformation, registration, segmentation, and analysis.

Urban Liebel (Karlsruhe Institute of Technology, KIT) presented the Harvester bioinformatics search portal (harvester.kit.edu). The portal scours all of the images and figures in PubMed and more than 30 other databases for images of interest. “I don’t know about you,” Liebel said, “but when I find a paper, I look at the figures first. If I can understand the figures, I read the paper.”

Liebel also presented Sciety (sciety.org), an application that works like Digg for PubMed articles. Much of the post-publication commentary on papers isn’t recorded, Liebel said. If a student brings a Nature paper to a supervisor, the group leader may know that the paper was later invalidated, but the student doesn’t. Sciety lets researchers comment on published research. There is an option to comment anonymously, but comments from users who log in and identify themselves carry more weight.

Programming the Patient

Hans Lehrach, director of vertebrate genomics at the Max Planck Institute for Molecular Genetics in Berlin, presented “Systems Patientomics” in one of two keynote talks. We’ve been treating patients not as individuals, but as members of large homogeneous groups, Lehrach said. Now that sequencing costs are falling and speeds are increasing, we must develop virtual patient systems taking into account data from mutation databases, genomics, tumor sequencing, and more. Lehrach compared the models to crash models employed by the automobile industry. We don’t run crash simulations with real people, he pointed out, and called for similar models to find optimal treatments for patients.

A trial is currently underway to test the models with the ITFoM (IT Future of Medicine) project. The goal, Lehrach said, is to define a reference model, then use many technologies to individualize those models. Within ten years, he believes such models will be able to lead to advances in cancer and metabolic disease.

He acknowledges that the compute requirements for such models will be huge—he estimates a petaflop required for each patient—but the effort currently has support from IBM, Amazon, Siemens, Intel, COSBi, and other computing powerhouses involved in the ITFoM project (see, “Hans Lehrach’s Predictive Biology Philosophy,” Bio•IT World, May 2009).

Connected to that project, Corrado Priami, CEO of Microsoft’s Center for Computational and Systems Biology (COSBi) in Trento, Italy, argued for the need for a new language to describe how biology works, and it just may be a programming language. We are trying to help biologists “program without knowing they are programming,” said Priami.

Priami presented an option for modeling a biological pathway using an interface based on natural language to explain a biological system. The model is easier to write, change, and reuse than traditional mathematical approaches, though it may be slower than classical equations for smaller systems.

Science as a Service

In the cloud forecast, Folker Meyer (Argonne National Laboratory/University of Chicago) sees the cloud as part of a solution, but not the end. Meyer defines “cloud” as basically grid computing with virtual machines, though he acknowledges that “cloud” now includes infrastructure-as-a-service (IaaS), platform-as-a-service (PaaS), and software-as-a-service (SaaS). Meyer’s group applies the cloud as a solution to huge metagenomics problems (the Earth Microbiome Project plans to collect 200,000 samples). MG-RAST (http://metagenomics.anl.gov), Argonne National Lab’s open-source metagenomics analysis server, was released for the cloud in March.

But the cloud cannot be the only answer. When sequencing massive datasets, the compute cost is now surpassing the sequencing cost, Meyer said. Using a 2009 example, Meyer reported that Illumina HiSeq 2000 data, running BLAST-X on soil samples, could cost $45,000 for the sequencing and $900,000 for the Amazon EC2 computing costs (not including data storage).

Meyer believes researchers should share results to minimize re-computing costs, and share raw data to minimize data access problems. The Open Source Data Framework (OSDF), which Meyer’s group announced in late September, could help fix the data analysis problem, he said.
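
To put the 2009 cost figures quoted above in proportion, the arithmetic can be sketched in a few lines of Python. This is only an illustration: the two dollar amounts come from the talk as reported here, while the variable names are assumptions made for the example, and storage costs are left out, as they were in the original estimate.

```python
# Dollar figures as reported in the talk (2009 example); names are illustrative.
sequencing_cost_usd = 45_000   # sequencing of the soil samples
compute_cost_usd = 900_000     # Amazon EC2 cost of the BLAST-X analysis

# How many times larger the compute bill is than the sequencing bill.
ratio = compute_cost_usd / sequencing_cost_usd

# Fraction of the combined bill that goes to computing.
compute_share = compute_cost_usd / (sequencing_cost_usd + compute_cost_usd)

print(f"Compute costs {ratio:.0f}x the sequencing")
print(f"Compute share of the combined cost: {compute_share:.1%}")
```

On these numbers, computing costs 20 times as much as sequencing and accounts for roughly 95% of the combined bill, which is the imbalance behind the call to share results rather than recompute them.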

The Argonne Workflow Engine (AWE), a RESTful interface with Google’s V8 JavaScript engine, could also help. It has been running MG-RAST for the last 18 months while scaling up by 600x. AWE distributes work across a number of resources including HPC clusters, clouds, and systems with accelerators (GPUs or FPGAs).

But what does a cloud solution look like for labs without HPC clusters or massive budgets? The concern voiced by University of Manchester’s Carole Goble (see, “Democratizing Informatics for the ‘Long Tail’ Scientist,” Bio•IT World, March 2011) is not big pharma’s use of the cloud, but how small academic labs can use the resource. Researchers with limited funding or limited access to large computer clusters, already dependent on open-source software solutions, have the ability to sequence, but need help with annotation.

And so Goble decided to do the “most naïve thing you could imagine”—she and her team moved her open-source workflow platform, Taverna, to the cloud. Annotation is perfect for the cloud, Goble said. It’s highly repetitive and deeply parallel. It works well on a pay-as-you-go model. Her goal was to provide annotation as a cloud service: science-as-a-service.

The experiment took four days and cost $600 to set up a Web interface and test Taverna in the AWS environment. The cost per run? $5 and less than two hours. She and her team used datasets from researchers who are studying why some African breeds of cattle seem much hardier than others. The researchers were able to use Taverna in the cloud to annotate the Boran and Cape Buffalo genomes in a couple of hours.

Goble said the experiment did expose some challenges. First, Eucalyptus clouds (an open-source platform for building private clouds) are not the same as AWS. Users cannot save development costs by testing in Eucalyptus clouds before moving to AWS; it just won’t work. But the bigger problem Goble uncovered was that the reference dataset the researchers wanted to use was located in the US Amazon space, while their work was being done in the EU Amazon space.

“This is a social problem!” Goble said. Public databases must exist in the cloud, but “just the West Coast zone is not enough.” The setup raises many data ownership issues. Goble’s next crusade is to convince major cloud providers, including Amazon, to host public data sets for little money so that researchers worldwide can easily access and work with the data.

* Bio-IT World Europe 2011, Hannover, Germany, October 11-13, 2011

This article also appeared in the November-December 2011 issue of Bio-IT World magazine.