The Latest in Data Integration and Personalized Medicine: Updates from the 2018 Molecular Medicine Tri-Conference
By Melissa Pandika
March 5, 2018 | The Molecular Medicine Tri-Conference celebrated its 25th anniversary last month in San Francisco. Personalized medicine emerged as a major theme, and with it the question of how to deal with the huge, ever-growing slew of data it requires.
To find out the latest in the efforts to confront this challenge, Bio-IT World caught up with leaders in data management and cloud computing at Tri-Con’s Converged IT & the Cloud track. In a series of talks and panel discussions, they shared their experiences dealing with the challenges of storing, transferring and integrating vast amounts of data from a broad range of sources.
Chris Lunt of the National Institutes of Health kicked off the Converged IT & the Cloud track with a keynote presentation on the All of Us Research Program. Set to launch this spring, the program will collect data from one million or more people in the United States and uncover patterns in those data that may inform the development of personalized prevention strategies and treatments. “We should be able to have an operations manual for your body,” Lunt said. “Participants need to give a lot of data to do this.” He shared his team’s plans to gather data from diverse participants through surveys, genome sequencing, blood and other samples, and electronic health records (EHR).
Lunt outlined the challenges he anticipates in integrating these data, including, for instance, the heterogeneity of research tools. He and his team will also need to decide on standards for genetic, digital health, and other data, and determine how to provide patients and participants with easy, API-based access to EHR data. They will also need to walk the fine line between making research findings publicly available and protecting patient privacy. To that end, Lunt said his team plans to provide tiered access: a public tier offering a full summary of the findings, such as the average BMI of participants, and a controlled tier through which researchers can access sensitive information about individuals, such as genetic data.
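To make the tiered model concrete, here is a minimal sketch of how a public, aggregate-only tier can sit alongside a controlled, record-level tier. Everything in it (the field names, the role check, and the in-memory participant list) is a hypothetical illustration, not the All of Us program’s actual API or data model.

```python
# Minimal sketch of tiered data access along the lines Lunt described:
# a public tier exposing only aggregate statistics (e.g., mean BMI) and a
# controlled tier that returns record-level data only to approved researchers.
# All names and values below are hypothetical illustrations.
from statistics import mean

PARTICIPANTS = [  # stand-in for a protected participant database
    {"id": "p1", "bmi": 27.4, "genome_uri": "gs://example-bucket/p1.vcf"},
    {"id": "p2", "bmi": 22.1, "genome_uri": "gs://example-bucket/p2.vcf"},
]

def public_tier_summary() -> dict:
    """Anyone may call this: only aggregates are returned."""
    return {
        "participant_count": len(PARTICIPANTS),
        "mean_bmi": round(mean(p["bmi"] for p in PARTICIPANTS), 1),
    }

def controlled_tier_records(requester_role: str) -> list:
    """Record-level data is released only after data-access approval."""
    if requester_role != "approved_researcher":
        raise PermissionError("controlled tier requires data-access approval")
    return PARTICIPANTS

print(public_tier_summary())  # safe to expose in a public data browser
```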
David Hiatt of storage company WekaIO and Robert Sinkovits of the San Diego Supercomputer Center (SDSC) discussed the potential of high-performance computing (HPC) to speed up discovery in personalized medicine research and relieve stress on IT infrastructures. HPC has traditionally served the physical sciences and engineering and been optimized for those disciplines’ small numbers of large files; life sciences data, which tends to arrive as large numbers of small files, is now reshaping how these systems are tuned. Researchers are using Comet, SDSC’s petascale supercomputer, for a number of life science applications, from simulating biological membranes to investigating how the flu virus’s molecular structure affects its infectivity.
Data transfer has also emerged as a challenge in the life sciences. David Mostardi of IBM Aspera, a file transfer software company, explained the limitations of FTP and other file transfer technologies built on TCP, the fundamental communication protocol of the Internet; over long distances these tools use only a small fraction of the available bandwidth. Aspera’s FASP technology, by contrast, uses all the available bandwidth, sending data hundreds of times faster than FTP or HTTP can and, given sufficient bandwidth, moving up to 100 terabytes of data anywhere in the world.
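For a rough sense of why effective bandwidth utilization dominates bulk transfer times, the sketch below estimates how long a 100-terabyte transfer would take over a 10 Gbps link at two assumed utilization levels. The link speed, the utilization figures, and the function name are illustrative assumptions, not measurements of FTP or FASP.

```python
# Back-of-the-envelope transfer-time estimate, illustrating how effective
# bandwidth utilization governs bulk data movement. The utilization figures
# below are assumptions for illustration, not vendor benchmarks.

def transfer_time_hours(data_tb: float, link_gbps: float, utilization: float) -> float:
    """Hours to move `data_tb` terabytes over a `link_gbps` link at the
    given fraction of usable bandwidth."""
    data_bits = data_tb * 1e12 * 8              # terabytes -> bits
    effective_bps = link_gbps * 1e9 * utilization
    return data_bits / effective_bps / 3600

if __name__ == "__main__":
    # 100 TB over a 10 Gbps link: a TCP-based tool achieving an assumed ~5%
    # utilization on a high-latency path vs. a protocol sustaining ~90%.
    for label, util in [("low utilization (assumed 5%)", 0.05),
                        ("high utilization (assumed 90%)", 0.90)]:
        print(f"{label}: {transfer_time_hours(100, 10, util):.1f} hours")
```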
The second day of the Converged IT & the Cloud track overlapped with the Bioinformatics & Big Data track to cast a spotlight on the rapid emergence of data commons: digital “common spaces” combining computing infrastructure, cloud-based storage, and analytical tools, where researchers can share and work with data. As the life sciences have become more data-intensive, Matthew Trunnell of the Fred Hutchinson Cancer Research Center said he and others in research computing have seen new imaging modalities, high-throughput data generation, and more multimodal data analysis problems. This, in turn, means research computing has become more expensive per capita, with a huge increase in users’ ability to consume computing resources.
“The way we have paid for research computing is probably not going to scale,” Trunnell said. Beyond addressing the economic challenges of computing, data commons also allow for larger sample sizes than any research group can achieve alone, potentially accelerating discovery. “We’re looking at a new paradigm,” he said. “This is going to be the year of the scientific data commons.”
Anthony Kerlavage of the National Cancer Institute echoed Trunnell’s forecast. “We are in a Cambrian explosion of data commons,” he said, citing the NCI Genomic Data Commons, the Broad Institute’s Human Cell Atlas, and the International Cancer Genome Consortium, among other examples. “This plethora of data commons can lead to competition for niches and perhaps survival,” he said, “but perhaps lead to synergy and cooperation.”
Kerlavage outlined the guiding principles of a data commons—that it should be modular, community-driven, open-source and standards-based—and shared NCI’s efforts to build a Cancer Research Data Commons (CRDC). Building on the NCI’s Genomic Data Commons and Cloud Resource, the CRDC would enable the cancer research community to share diverse data types—from The Cancer Genome Atlas, as well as individual NCI labs, for instance—across programs and institutions. It would create a data science infrastructure to connect data repositories, as well as knowledge bases and analytical tools.
Kerlavage acknowledged that enforcing metadata and other standards will be challenging, especially since individual laboratories often have their own legacy systems, which they may be reluctant to replace. “It requires a lot of painstaking work,” he said. “We’re trying to use artificial intelligence and create tools for smart semantics.”
Experts delved deeper into this challenge in a panel discussion focusing largely on incentivizing data commons. Despite the growth in data commons in recent years, implementing them remains a huge undertaking. “No researcher on their own wants to share their data,” said Robert Grossman, professor of medicine and computer science at the University of Chicago. “They just want the data in their lab to do what they want and publish as much as possible…. The commons are there for the community because by sharing you can accelerate research and impact patient outcomes.”
But other than contributing to the social good, what incentives do researchers have to share their data, especially if it entails the often-tedious work of harmonizing it? Panel members suggested taking a step back and remembering the bigger picture. Lara Mangravite, president of Sage Bionetworks, pointed out that data commons members need each other to answer a shared question. When she has seen researchers agree to metadata tagging, often “they’re committed to what can come out of the commons,” she said. Simon Twigger, senior scientific consultant at BioTeam, added that “if there’s something [in the data] I care about… then I’m willing to go through that pain.”
Lucila Ohno-Machado, associate dean of informatics and technology at University of California, San Diego Health, agreed, attributing the relatively greater success of cancer data commons to members’ clear-cut understanding of the benefits of sharing their data, “whereas for other initiatives, it’s not as clear,” she said.
Ohno-Machado added that patient privacy concerns might dissuade healthcare providers from sharing EHR data and highlighted the importance of weighing the need to protect patient data against the value of scientific discovery. Mangravite shared how Sage Bionetworks navigated this tricky area in its investigation of the utility of artificial intelligence and deep learning in evaluating mammograms. Rather than convincing health providers to upload patient mammograms to the web, researchers built containers that allowed commons members to apply their models to the mammograms without ever accessing the images themselves.
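The pattern Mangravite described is often called “model-to-data”: the code travels to the protected data, and only aggregate results travel back. Below is a minimal sketch of that idea; the function names, the scoring threshold, and the accuracy metric are hypothetical illustrations, not Sage Bionetworks’ actual container infrastructure.

```python
# Minimal sketch of the "model-to-data" pattern: the data holder runs a
# contributed model inside its own environment and returns only aggregate
# results, so raw images never leave the institution. All names and the
# evaluation metric here are hypothetical, for illustration only.
from typing import Callable, Iterable, Tuple

def evaluate_submitted_model(
    predict: Callable[[bytes], float],              # submitted model: image bytes -> score
    protected_images: Iterable[Tuple[bytes, int]],  # (image, label) pairs, never exported
) -> dict:
    """Run the model over protected data and return only summary metrics."""
    correct = total = 0
    for image, label in protected_images:
        score = predict(image)
        correct += int((score >= 0.5) == bool(label))
        total += 1
    # Only the aggregate accuracy is released back to the model's author.
    return {"n_images": total, "accuracy": correct / total if total else None}
```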
But by and large, “there’s a need for more approaches to make [sharing] more valuable, and we haven’t figured it out,” Ohno-Machado said. Mangravite is optimistic. “I think it’s going to happen, but it’s just very early days,” she said.
Twigger, like many on the panel, viewed the challenges of implementing data commons as more social than technological. “It’s a hairy social problem, getting scientists to agree on metadata,” he said. “People are very attached to their definitions of things.” He added that “scientists would rather share their toothbrush than share their data.”
The last day of Tri-Con expanded on the social challenges of data integration in a closing panel discussion on trends and future directions (see moderator Chris Dwan’s notes here). “There was a lot of time spent on people,” said Saira Kazmi about her experience implementing a metadata system as the scientific architect at The Jackson Laboratory. “It’s not a lot of effort in terms of technology.” Seeing the benefits of metadata tagging firsthand made her colleagues more enthusiastic about implementing it—for instance, when a postdoc in one lab assumed a piece of data didn’t exist, only for a search to reveal that a postdoc in another lab had already generated it.
When asked whether we will ever see total open-source data convergence, panel member Aaron Gardner, senior scientific consultant at BioTeam, Inc., responded that “it’s a human problem, not a technology problem, so it’s less predictable.” Jonathan Sheffi, product manager of Genomics & Life Sciences at Google Cloud, however, argued that the tension between data access and respect for data sovereignty and locality represents “a tech problem.” Technology solutions could allow researchers to strike a balance between the two, perhaps by enabling them to analyze the data in different data centers or in Amazon Web Services, rather than all in one cloud.
Annerose Berndt, vice president of clinical genomics at the University of Pittsburgh Medical Center, worries about data convergence compromising patient privacy. “I would tend more toward saying it’s not a technology problem,” she said. “There are very smart technologists who will figure out how to link data, but how will you overcome security and privacy?”
Berndt then raised a larger question: where will the demand that ultimately drives data convergence come from? In the age of personalized medicine, the fundamental question among patients is “What can I do to prevent disease?” and bringing data together to amass large enough reference populations is crucial to answering it. “Does the push come from end users rather than all of us working in IT and genomics?” she said. Perhaps patients’ demand for answers, rather than the IT and genomics communities’ desire for convergence, will fuel it.
* The Molecular Medicine Tri-Conference; February 11-16, 2018; San Francisco. The Tri-Conference is produced by Cambridge Healthtech Institute, the parent company of Bio-IT World.