Disabling Data Analysis Paralysis
By Paul Nicolaus
March 9, 2017 | Trillions of bacteria call our bodies home. Some are helpful, others not. One microbe in our gut helps break down food while another wreaks havoc and makes us sick. In recent years, interest in this complex ecosystem of microbes living in and on us has only grown as researchers attempt to better understand its impact on human health and disease.
Look no further than Virginia Commonwealth University (VCU) and its Multi-Omic Microbiome Study-Pregnancy Initiative (MOMS-PI) for an example of this interest and related research. The study is one of several that make up the Integrative Human Microbiome Project (iHMP), established in 2014 as the second phase of the National Institutes of Health (NIH) Common Fund’s Human Microbiome Project (HMP).
In collaboration with the Global Alliance to Prevent Prematurity and Stillbirth (GAPPS) at Seattle Children’s Hospital, the study follows the microbes of women throughout pregnancy and shortly after childbirth to examine the impact of the vaginal microbiome. A cohort of about 1,500 women provided samples from the mouth, skin, vagina, and rectum at every trimester throughout pregnancy, at labor and delivery, and at follow-up visits in addition to blood collected early in pregnancy and again at triage. Participants also completed surveys detailing health history, habits, and diet.
Newborns were sampled at multiple sites (rectum, mouth, skin, nares, meconium, stool) at birth and discharge. Placenta, cord blood, and chorioamniotic membranes were gathered, as were amniotic fluid samples from cesarean deliveries. In total, over 150,000 samples have been used to generate multi-omic data with next-generation DNA sequencing, high-accuracy mass spectrometry, interactomics tools, and other high-throughput technologies.
The problem with all this scientific progress, however, is that this project and others like it are generating massive datasets, which in turn creates a need for extensive bioinformatics and computational infrastructure. “The datasets that we’re generating under the MOMS-PI project and the other iHMP projects are really some of the first large-scale multi-omic datasets, and so that provides new challenges,” explained Jennifer Fettweis, MOMS-PI project director. “If it’s the first kind of data of that type, often the analysis tools and the platforms for dealing with that typically don’t already exist.”
The MOMS-PI team recognized that as data becomes less expensive to generate and the data types become more and more complicated, there is a need for new solutions. To unearth them, Fettweis said it made sense to pilot various projects in different directions to see what would work.
Evolving Tool Box
When former VCU bioinformatics grad student and grant-funded employee Shaun Norris was tapped in 2014 to find a way for the MOMS-PI data analysis team to make queries and understand the relationship between the sequenced data, the microbiome profile, and the clinical information received from the study’s surveys, he knew he needed a platform. “So I went out and did a lot of research about different technologies,” he recalled, “and did a lot of testing with different databases.”
After trying various options, he wound up building a software tool on top of Hadoop and Hive—both free, open source platforms sponsored by the Apache Software Foundation (ASF)—that makes it possible to store and analyze high volumes of data. Following close to a year of work, the Massive Multiomics Microbiome Database (M3DB) reached fruition in the summer of 2015.
Based on the command line arguments provided, the tool cleans up files, performs demultiplexing, and classifies microbiota. Once those steps are completed, the analysis team can also interact with the database and write queries as needed.
It is one example of a whole host of tools—some more statistical in nature and others resembling standard bioinformatics pipelines—that the MOMS-PI team has generated and made use of along the way. Beyond the in-house tools, Fettweis highlighted the use of OREAN (Omics REsearch ANalytics), an integrative web-based platform for multi-omic data visualization, as well as a suite developed by Curtis Huttenhower’s lab at Harvard University.
“He has something called the bioBakery,” Fettweis said, noting that the MOMS-PI research team collaborates with Huttenhower’s group at the Harvard School of Public Health as part of the iHMP project. Together with Ramnik Xavier, Huttenhower leads a multi-institutional Inflammatory Bowel Disease (IBD) Multi-omics Data (IBDMDB) research team that is examining how the gut microbiome changes over time in adults and children with IBD. “That’s one of the labs that’s developing a large number of high-quality tools for microbiome analysis,” she added.
The third core iHMP study, led by Stanford University Genetics Professor Michael Snyder and Jackson Laboratory for Genomic Medicine Microbial Genomics Professor George Weinstock, is searching for potential microbial causes of diabetes. All three are generating similar data types and dealing with similar data analysis difficulties, not to mention the task of pulling all this together in a central repository, which is handled by University of Maryland School of Medicine Epidemiology and Public Health Professor Owen White and his group at the HMP Data Analysis and Coordination Center (DACC).
Molded from a Cloud
While larger labs and large-scale endeavors may have the resources and staffing to build tools in-house, there are others that do not. The Office of Cyber Infrastructure and Computational Biology (OCICB) within the National Institute of Allergy and Infectious Diseases (NIAID) has developed a platform, publicly available since February 2016, that is targeted primarily at lab technicians and post-docs who want in on the action but may not have access to large grants or bioinformatics support.
“We leveraged the power of the cloud,” computational genomics specialist Ian Misner said of Nephele, an open-source cloud analysis pipeline with a name borrowed from Greek mythology (Nephele was a nymph molded from a cloud by Zeus). Misner has been working on the NIH project as a contractor with Medical Science & Computing since late 2015 in an effort to bring microbiome data and analysis tools together in a way that removes a major hurdle for today’s researchers—analyzing, transferring, and storing biomedical big data.
There has traditionally been a single model for establishing local computational infrastructure, Misner explained, which is to determine the average usage and then obtain the hardware necessary to maintain that. The problem is that spikes in usage can result in saturation, which leads to queuing environments and waiting. Nephele, on the other hand, utilizes the computing infrastructure of Amazon Web Services (AWS) to take advantage of additional resources as needed.
The platform was assembled into a pipeline that is accessible through a web interface using existing open source, publicly available software including QIIME, mothur, and bioBakery. Registration is free, and once an access code is received, users choose the appropriate pipeline and data type for 16S, 18S, ITS, or whole metagenome sequence data analysis, provide their data files through upload or URL link, set any processing parameters, and click submit.
Results are typically received in as little as an hour or two, and users can download visualizations such as heat maps, bar plots, and taxonomy tables. Results can also be interpreted using tutorial videos or a pipeline output guide. Beyond that, Nephele provides a feature that allows researchers to compare their information to the HMP’s Healthy Human Subjects data. “You don’t have to have any bioinformatics knowledge to either use it or to necessarily understand the results that are coming back,” Misner added. “We’ve tried to design it to be as accessible to everyone as possible.”
A number of schools have used it as a teaching tool for microbiology courses or general biology work, including Georgetown University, North Carolina State University, and the University of Florida. “For students, I think it’s a great opportunity for them to have easy access to some pretty complex analysis tools,” he said, “without having to get mired down in the bioinformatics and the computational biology side of things. They can work on answering their biological questions that their data are presenting to them in a relatively straightforward manner.”
While the platform has helped fill a void, Misner acknowledged that Nephele is far from a one-stop shop that can meet all needs. Rather, he has noticed that many tend to use it as a preliminary research tool. Those looking to dive deeper into the analysis can turn to the platform’s user tutorials and FAQs as stepping stones, but the reality is that in the scope of today’s science it is difficult for a general use tool to be the source of answers for the many and varied questions that are being posed.
“So we do have a particular role in the community with providing this type of tool, but we also don’t think that we are the end all be all of microbiome analysis tools,” he said. “We are not expecting that people will come and use it exclusively. While we’ve built upon open source software, we expect that others can extend from what we’ve done.”
Free Ride or Pay to Play?
Once data analysis in a new field becomes routine, someone typically figures out a solution and the field settles on one or two methods that most researchers follow, Fettweis explained, but microbiome research is just not there yet. As the search for solutions continues, she and her colleagues prefer to make use of open source options. “As we’re looking at ways to do this, we don’t just necessarily adopt commercial platforms,” she said. “That typically doesn’t work in a research environment.”
A frequent issue, explained Gregory Buck, director of VCU’s Center for the Study of Biological Complexity and principal investigator of the MOMS-PI study, is a desire to ask a question that’s not going to be answered by a particular software offering with its limited set of options. Furthermore, it is “difficult for us to invest in those kinds of things,” he said, “because once you’re invested in them and your data is analyzed, you’re almost obligated to keep using that package.”
At the moment, it makes more sense to have a team of bioinformaticists who can take existing freeware and modify it. Open source platforms are far from perfect, however, and tend to come with their own set of hurdles. As these options emerge, they are developed using existing reference data, but that evolves rapidly, meaning the software platforms can become out-of-date in a hurry.
“For example, some of the stuff that Curtis Huttenhower has developed has been very useful to us,” Buck said, “but is not as useful as it could be because it doesn’t have many of the relevant reference strains in it.” Most of the developers attempt to update along the way, but that relies on capital. In theory, commercial software could handle this challenge because there would be a continuous funding stream. But because the user group is relatively small, companies need to charge hefty subscription fees, and these are often too high for an academic lab to maintain.
Grants are cyclical by nature, and while funds may be available at the moment, those dollars could disappear down the road. And researchers cannot afford to be locked out of their own datasets when the funding runs dry. As financial support for the MOMS-PI study comes to an end this year, the research team is scrambling to gather additional resources needed to continue the project. “We’ve got some grants,” Buck said, “and we’re continuing to write proposals.”
After collecting pregnancy samples over the first three years, the study’s final delivery will occur in March. “We haven’t been able to do the analysis yet that we want to do on those samples,” he added, “so I don’t think that the big discoveries have come through yet.” The hope is that understanding how the microbiome is passed from one generation to the next will illuminate the causes of issues like preterm labor and miscarriage and eventually lead to interventions that promote the well-being of pregnant women and their babies.
So where will the data analysis solutions come from as this field forges ahead? In the immediate future, Norris believes the free, open-source tools are going to continue to lead the way despite their imperfections. He also thinks that dynamic will shift over time, though. “As the field gets bigger and bigger, more large players get involved in it as they realize there’s a market for it,” he said. “Case in point, Intel is working on developing a platform.”
From Fettweis’s vantage point, it all depends on the end user. If you’re studying the human microbiome and have identified disease biomarkers, at that point you know what you’re looking for and a commercial solution becomes viable. “If we’re thinking about the research, which is really pushing the cutting edge of what’s possible and innovating with new data types and new ways of looking at data,” she said, “there will always be research labs that are not using commercial solutions.”
Paul Nicolaus is a freelance writer specializing in health and medicine. Learn more at www.nicolauswriting.com.