Evolving Challenges And Directions In Data Commons: Updates From Bio-IT World West
By Melissa Pandika
March 22, 2019 | Data commons are rapidly emerging in biomedical research, driven largely by a growing mandate to share data toward accelerating discoveries. The inaugural Bio-IT World West, which took place last week in San Francisco, featured numerous sessions and panels on the challenges and future directions of these digital common spaces.
On Tuesday, Robert Grossman, director of the Center for Data Intensive Science (CDIS) at the University of Chicago, discussed how to progress from a data commons to a data ecosystem, an interoperable collection of data commons, computing resources, and data. He noted that recent years have seen the start of a transition from learning how to build cloud-based bioinformatics platforms over genomic and other data to tackling a more complex problem: “If I have data ecosystems of different projects by different groups, how do I build apps over them and make it easy for other researchers to use the data?”
Grossman then introduced the Data Commons Framework Services (DCFS), a set of software services for building and operating data commons and cloud-based resources, which can also support multiple data commons, knowledgebases, and applications as part of a data ecosystem.
The DCFS offers core services for a narrow middle architecture. In this approach, the goal is to “standardize the smallest number of services and let the community rapidly evolve the others,” Grossman explained. The DCFS, in turn, is built using the open-source Gen3 platform, which is being developed by CDIS and used for many data commons. “Gen3 is how data commons are made,” Grossman said. But “it’s not just a data commons software stack. It offers a stepping stone to data ecosystems.”
Grossman described the process of building a Gen3 data commons platform. Users first define a data model and use the Gen3 software to autogenerate the data commons and its associated API. Next, they import data into the commons through a Gen3 import application, use Gen3 to explore the data and create synthetic cohorts, and use platforms such as Terra to analyze them. They can then develop their own container-based applications, workflows, and Jupyter notebooks.
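To make that workflow concrete, here is a minimal sketch, in Python, of what the early steps might look like against a Gen3-style commons: a structured record conforming to a defined data model is submitted to the API that the platform generates from that model. The base URL, endpoint path, node name, and fields are illustrative assumptions, not the actual Gen3 interfaces.

```python
import json
import requests

# Illustrative assumption: base URL of a Gen3-style commons and the
# submission endpoint its platform autogenerates (paths are hypothetical).
COMMONS_URL = "https://commons.example.org"
SUBMISSION_ENDPOINT = f"{COMMONS_URL}/api/submission/myprogram/myproject"

# A minimal "subject" record from a hypothetical data model; the commons
# validates submissions against the model it was generated from.
subject_record = {
    "type": "subject",
    "submitter_id": "subject-001",
    "sex": "female",
    "age_at_enrollment": 54,
    "projects": [{"code": "myproject"}],
}


def submit_record(record: dict, token: str) -> dict:
    """Submit one structured record to the commons' generated API."""
    resp = requests.put(
        SUBMISSION_ENDPOINT,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        data=json.dumps(record),
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()


if __name__ == "__main__":
    # An access token would come from the commons' authentication service.
    print(submit_record(subject_record, token="<access-token>"))
```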
Grossman outlined four core services of Gen3 that might be sufficient to enable interoperability. First, the data commons and resources expose their APIs for access to data and other resources. Second, the data commons expose their data models through the API, and those data models include references to third-party ontologies and other authorities. Third, authentication and authorization systems can interoperate; while users wouldn’t want one authentication system to manage everything, “we want some kind of interoperability,” Grossman said. Fourth, by Q2 2019, structured data would also be able to be serialized, versioned, exported, processed, and imported.
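A rough sketch of what the first two services could look like from a client’s side: fetching the data model each commons exposes over its API and collecting any third-party ontology references it declares. The endpoint URLs and the "ontology_reference" key are assumptions made for illustration, not a documented Gen3 interface.

```python
import requests

# Hypothetical endpoints where two commons in an ecosystem expose their
# data models; real deployments would document their own paths.
COMMONS_DICTIONARIES = {
    "commons-a": "https://commons-a.example.org/api/dictionary",
    "commons-b": "https://commons-b.example.org/api/dictionary",
}


def fetch_data_model(url: str) -> dict:
    """Fetch the JSON data model a commons exposes through its API."""
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    return resp.json()


def ontology_references(model: dict) -> set:
    """Collect third-party ontology references declared in the model.

    Assumes each node property may carry an 'ontology_reference' field;
    the key name is an illustrative assumption, not a Gen3 standard.
    """
    refs = set()
    for node in model.values():
        for prop in node.get("properties", {}).values():
            if isinstance(prop, dict) and "ontology_reference" in prop:
                refs.add(prop["ontology_reference"])
    return refs


if __name__ == "__main__":
    for name, url in COMMONS_DICTIONARIES.items():
        model = fetch_data_model(url)
        print(name, sorted(ontology_references(model)))
```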
Basically, “each person creates their own commons and does whatever they want. All we ask is to expose the API and the data models,” Grossman said. “We don’t want to set the standards for the ecosystem, but what are the minimal things so that we have some interoperability?”
The next day, Matthew Trunnell of Fred Hutchinson Cancer Research Center moderated a multidisciplinary panel of leaders in data commons, who kicked off the session by introducing their work. Michael Kellen, chief technology officer of Sage Bionetworks, provided an overview of Synapse, a cloud-native informatics platform that allows data scientists to conduct, track, and disseminate biomedical research in real time. Next, Stanley Ahalt, director of the Renaissance Computing Institute (RENCI), described the NIH Data Commons, where investigators can interact with the digital objects of biomedical research, and the NHLBI Data STAGE, which uses the Data Commons to fuel discovery in heart, lung, blood, and sleep disorders.
Adam Resnick, director of the Center for Data-Driven Discovery in Biomedicine at Children’s Hospital of Philadelphia, discussed how an urgent moral imperative to curate and connect data uniquely positions the pediatric community to develop novel, data-driven approaches that may extend to other populations. He highlighted his work on the Kids First Data Resource Center, which seeks to unravel the genetic links between childhood cancer and birth defects. Finally, Lucila Ohno-Machado, associate dean of informatics and technology at University of California, San Diego Health, talked about establishing trust between patients and researchers when managing large biomedical data networks, drawing on her work with the patient-centered Scalable National Network for Effectiveness Research, a clinical data research network of 32 million patients and 14 health systems.
The discussion portion of the session began with a focus on efforts to center the patient in sharing clinical data. An audience member asked Ohno-Machado whether extending data sharing authorizations to patients will actually increase the amount of available data compared to limiting authorizations to research institutions or principal investigators. Ohno-Machado responded that she does think allowing patients to decide what to share—and to change their decision at any time—will increase the amount of available data. In fact, one of her group’s research projects has found that 90% of people want to share their data in some form.
Resnick said the percentage is even higher in the pediatric community. The cancer community also shares more data than the healthy population, Ohno-Machado said, probably because they feel a greater sense of urgency to accelerate the research.
Trunnell then shifted the conversation toward a more fundamental issue: how to define a data commons. Kellen defined it as a way to bring researchers around shared data, as well as the downstream products of the research, such as visualizations. Ahalt offered a more abstract definition, describing a data commons as “a collection of shared resources that by virtue of the importance of those resources, it draws a community in.” It also establishes methods for the use of those resources. Meanwhile, Resnick defined a data commons as “an economic infrastructure that defines and tries to create value,” one that emerges in response to the technological age, when specialization in research begins to falter as individuals try to tackle problems on their own.
But what incentives draw people and organizations in as data donors, and what keeps them there? With a data commons, “you’re able to do things you probably wouldn’t have even attempted because you couldn’t imagine they were solvable,” Ahalt said. Sharing data runs counter to most researchers’ training, he added, yet patient imperatives to accelerate discovery won’t allow researchers to sustain the status quo. “We’re seeing a shift to a radically different mindset to collaborate,” he said. “Not any one group has all of the talent they need.”
Resnick echoed Ahalt’s observation that the incentives of a data commons are often at odds with those of research institutions. “A commons is an economy of trust and attribution,” he said. “It’s a human layer of perceived assessment of one another that you can engender in a commons that essentially displace the metrics of the academic space.” In other words, we have a long way to go in thinking through incentive structures for data commons, he said.
Kellen cited incentivization as the greatest challenge he has encountered in building a data commons. “It’s not enough to lecture about the good of sharing,” he said. Many researchers voice their support, but don’t show it in practice, although funding mandates could encourage collaboration.
Building trust is crucial to realizing the benefits of data commons, Kellen added. Expanding small collaborations over time could help foster this trust. For instance, data in the Colorectal Cancer Subtyping Consortium (which uses Synapse) were initially closed to the public research community. Yet access to them has broadened over time, and they’re now in the public domain. “I don’t think the project would have gotten off the ground if we started that way,” Kellen said.
When asked how they manage to bring together so many entities to use data in one place, the panelists described a shift from a one-size-fits-all model of a data commons to a “commons of commons”—a group of commons that each allow limited access but enable some interconnectivity among them. Kellen noted that he’s reorienting his staff toward forming defined research communities, but with the infrastructure to do indexing across them.
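One way to picture such a “commons of commons” is a thin index layer that fans a query out to each member commons and merges the hits, while each commons keeps its own access controls. The sketch below assumes each commons exposes a simple public search endpoint; the URLs, endpoint path, and response shape are made up for illustration rather than drawn from any of the platforms named here.

```python
import requests

# Hypothetical member commons, each with its own access rules but a
# shared, minimal search endpoint the index layer can fan out to.
MEMBER_COMMONS = [
    "https://pediatric-commons.example.org",
    "https://cancer-commons.example.org",
]


def search_commons(base_url: str, query: str) -> list:
    """Query one commons' (hypothetical) public index endpoint."""
    resp = requests.get(
        f"{base_url}/index/search", params={"q": query}, timeout=30
    )
    resp.raise_for_status()
    return resp.json().get("results", [])


def federated_search(query: str) -> list:
    """Merge results across commons, tagging each hit with its source."""
    merged = []
    for base_url in MEMBER_COMMONS:
        for hit in search_commons(base_url, query):
            hit["source_commons"] = base_url
            merged.append(hit)
    return merged


if __name__ == "__main__":
    for hit in federated_search("medulloblastoma RNA-seq"):
        print(hit["source_commons"], hit.get("id"))
```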
“The instinct is to create one for all, but that actually ends up not being for all,” Resnick said. “What we do end up finding is that communities actually prefer having domains encapsulated in some fashion.” Rather than a single common entryway, “more people feel comfortable going through some door that’s specific for them.” Domain experts could curate and structure the data sets, increasing their value. But this added complexity raises other questions, such as how to merge different data models in these ecosystems into a common framework, Ahalt pointed out.
To conclude the session, Trunnell raised the question of how those involved in managing data on behalf of their institutions can think about their work in a way that positions them to connect with broader commons. Ahalt suggested looking to other disciplines, such as hydrology and seismology, where commons have also emerged. “We need to be more open to learning from others,” he said.
Returning to the issue of incentivization, Resnick added that while we often characterize a data commons as non-competitive, recontextualizing it as a competitive marketplace may help better convey its value to institutions. No single institution would have sufficient data to differentiate itself from other institutions, which could encourage “competition for collaboration”: Rather than competing for ownership of the data, “you have to compete for what you can do with the data.”