Data Sharing During A Pandemic: EMBL-EBI’s COVID-19 Portal, GA4GH Take On Data Sharing
By Allison Proffitt
April 23, 2020 | Earlier this week, EMBL’s European Bioinformatics Institute (EMBL-EBI) announced its COVID-19 Data Portal, which brings together a wide range of data types, including genomic, protein, and microscopy data, as well as scientific literature.
The European COVID-19 Data Platform is a partnership between EMBL-EBI, ELIXIR, and the European Commission, and is one of the ten projects included in the first iteration of the ERAvsCORONA Action Plan launched by the European Commission.
“In some sense, the lead partner is the European Commission,” Ewan Birney, Director of EMBL-EBI, told Bio-IT World. “They’re both good funders and have been incredibly good colleagues in designing this. We’re sort of the technical engine.”
EMBL-EBI has a 30-year history of delivering data sharing for scientists, Birney said. “I think people outside of molecular biology don’t appreciate that in molecular biology the default is to share data on publication of results. Because of that, we have some very robust systems that predate the internet, even, for data sharing—in particular for protein structures and DNA sequences,” Birney said. “From those two traditions—protein structures and DNA sequences—we have developed data sharing for other data types like proteomics and expression data and other things.”
The COVID-19 Data Portal is the entry point to the wider European COVID-19 Data Platform initiative; multiple SARS-CoV-2 Data Hubs are under construction. Once built, the hubs will organize the flow of sequence data from the outbreak and provide comprehensive open data sharing for the European and global research communities accessible through the Data Portal. Both the European COVID-19 Data Portal and the SARS-CoV-2 Data Hubs will use established EMBL-EBI data infrastructures.
The COVID-19 Data Portal includes datasets from many EMBL-EBI data resources, including the European Nucleotide Archive (ENA), UniProt, Protein Data Bank in Europe (PDBe), Electron Microscopy Data Bank (EMDB), Expression Atlas, and Europe PMC. In the coming weeks, the portal will also include genomic data from the outbreak and a dedicated Cohort Browser for searching clinical and epidemiological data. It’s an alphabet soup of partners, Birney admitted.
The COVID-19 Data Portal offers data analysis and visualization tools from EBI to help interpret the data, and the team is actively working on the user interface to make the portal intuitive and easy to use. In the coming weeks, with help from ELIXIR and other collaborators, additional datasets and tools from other European projects will be added to the COVID-19 Data Portal.
“I can’t tell you how proud I am of all the people who worked so hard at EMBL-EBI to deliver [the Portal],” Birney said. “Because they had to do an awful lot of things; it looks quite simple, but there’s quite a lot of rewiring behind the scenes to pull together this data from the perspective of one particular biological question, and they have really worked their socks off in a really complicated situation where they’ve got children at home and homeschooling and all that stuff.”
Database Glut
The new data portal joins a host of other databases and data pools worldwide dedicated to understanding, diagnosing, treating, and eventually vaccinating against the SARS-CoV-2 virus, but Birney is not concerned that the volume of datasets will confuse or slow research.
“I don’t worry so much about many, many different websites. I think there are many different views of the data and many different things to explore,” he said. “But what we should really strongly advocate for is the same data flow around the world: that we use the same fundamental open data infrastructure, and we use the data infrastructures that we’ve developed over the last 30 years.”
Birney also serves as Chair of GA4GH, the Global Alliance for Genomics and Health, which develops standards and harmonized approaches for effective and responsible genomic and health-related data sharing. GA4GH has been working since 2013 to develop and disseminate data sharing and use standards, and Birney advocates for applying those standards to COVID-19 research.
“We need the same standards,” Birney said. “The way Genomics England stores the data has to be the same way as Finland, the same way as the US, otherwise every time you go somewhere, you have to rethink everything. That would be a complete nightmare!”
Data In The Time Of Corona
But we are in a historically unique position, and data sharing and data use are happening under new external pressures.
“I think what you’re seeing is people saying, ‘Look, we’ve got to get this data out there in whatever format it’s in and use it,’” Heidi Rehm, one of two GA4GH Co-Chairs and medical director of the Clinical Research Sequencing Platform at the Broad Institute, observed of the spate of new databases and tools.
We need a “balance right now of flexibility and the actual sharing,” Rehm said. Researchers, companies, and other data holders are pragmatic, she theorized: “If [researchers] have to go in and manually convert lots of things to other things because it doesn’t match the format, there’re lots of people right now who are willing to put that time in. So let’s just put it out there, even though it’s not harmonized.”
This industry-wide data dump is not without risk. Taking and comparing data from different sources introduces risks of misinterpretation and false associations—sometimes with very serious outcomes.
“We do have to be careful that we take the same scientific rigor when we’re making claims about things,” Rehm warned, “[while,] at the same time, coming up with hypotheses that warrant further investigation. Any way we can get those hypotheses and early evidences out there, the better!”
Standards alone do not ensure scientific rigor, Rehm emphasized. GA4GH standards focus on data formats and file types such as CRAM and BAM, standardized APIs for data exchange, data use and researcher identity mechanisms that govern access to data, container systems for tools, and more. “These things are a lot more operational, in a lot of ways,” Rehm said.
But they all exist to encourage data sharing.
“We’re looking to encourage very rapid data sharing in all domains that relate to this outbreak,” Rehm said. “Everything from results being seen from viral testing so we know who’s contracting the virus and where, to proper tracing and assess[ing] risk across the population. Some of that is not scientific discovery, it’s literally a public health emergency.”