COVID-19 Datasets Bring AI Experts, Life Sciences Researchers Together For A Cure
By Allison Proffitt
April 10, 2020 | All of the Bio-IT community is eager to contribute to plans for treatments, diagnostics and vaccines for SARS-CoV-2 and the resulting disease, COVID-19. Companies are donating consulting services, compute resources, tools for clinical trials, and so much more. But the biggest donations might be the sheer volume of data being pooled for researchers to mine for answers.
On March 16, the Allen Institute for AI (AI2), Chan Zuckerberg Initiative (CZI), Georgetown University’s Center for Security and Emerging Technology (CSET), Microsoft, and the National Library of Medicine (NLM) released the COVID-19 Open Research Dataset (CORD-19).
The dataset, accessible through the Allen Institute for AI’s Semantic Scholar platform, includes scholarly literature about COVID-19, SARS-CoV-2, and the coronavirus group.
Semantic Scholar is a free, AI-powered tool for navigating scientific literature, Doug Raymond, the general manager for Semantic Scholar, told Bio-IT World. Established in 2015, Semantic Scholar collects millions of peer-reviewed journal articles, publications from preprint servers, related GitHub repositories, blog posts, clinical trial data, presentations, videos, and more. More than 180 million papers are included in Semantic Scholar.
The CORD-19 dataset currently includes over 47,000 scholarly articles, including 36,000 full-text articles from PubMed, found using a search query that includes COVID-19, coronavirus, SARS, MERS and other relevant terms. Preprints from bioRxiv and medRxiv are included based on the same query. The dataset includes information on coronaviruses in general, and papers date back to the 1970s, Raymond said.
“We’ve partnered with Elsevier, the World Health Organization, and a number of other institutions to get the full text of the articles, and then we’ve created a structured representation of this data in JSON format, which allows you to see all the metadata, the full text,” he said. “We’re planning to add additional metadata such as citations, which show the links between the different papers.”
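As a rough illustration of what a structured JSON representation makes possible, the Python sketch below parses one record and pulls out its title, section names, and full text. The sample record and its field names (`paper_id`, `metadata`, `body_text`, `section`, `text`) are assumptions made for this example and may not match the released files exactly.

```python
import json

# Hypothetical record for illustration only; the field names are assumptions
# and the content is made up, not taken from the dataset.
SAMPLE = """
{
  "paper_id": "example-0001",
  "metadata": {"title": "An example coronavirus paper"},
  "body_text": [
    {"section": "Introduction", "text": "Coronaviruses are a family of RNA viruses."},
    {"section": "Methods", "text": "We reviewed the published literature."}
  ]
}
"""


def summarize(record_json):
    """Return the title, section names, and joined full text of one record."""
    record = json.loads(record_json)
    paragraphs = record["body_text"]
    return {
        "paper_id": record["paper_id"],
        "title": record["metadata"]["title"],
        "sections": [p["section"] for p in paragraphs],
        "full_text": "\n".join(p["text"] for p in paragraphs),
    }


summary = summarize(SAMPLE)
print(summary["title"])
print(summary["sections"])
```

Because every paper shares one layout, the same few lines can be run over tens of thousands of files, which is far harder to do with a folder of PDFs.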
Currently the CORD-19 dataset is updated weekly and can be downloaded by researchers. Raymond says that they are working to publish daily updates.
In addition to the data pool, the AI2 team has released tools as well. CoViz allows researchers to identify associations between concepts that occur in the CORD-19 dataset. CORD-19 Explorer is a search engine built on top of the dataset.
“Essentially this is a way to take what previously was thousands of PDFs of papers and make it very, very easy to review that literature for any particular research interest.”
Structure Advantage
There is, in fact, a wealth of information on COVID-19 and coronaviruses in general, and many groups are working to collect and share those data. The World Health Organization has a COVID-19 Research Database, and the National Institutes of Health’s LitCOVID resource also tracks COVID-19 literature. Microsoft offers both a COVID-19 Resource Page and CORD-19 AI Powered Search. Overton has created a COVID-19 Policy Dataset, and the Cochrane Library has curated a COVID-19 Literature Review Collection.
“We’re sitting on this treasure trove of science we’ve created over—literally—the last century. We want to make anything relevant to COVID-19 open to the world to find a treatment and get us through what we’re going through right now, which is just surreal,” said Michael Dennis, echoing the thoughts of many.
Dennis is VP of Innovation at CAS, a division of the American Chemical Society. For more than 100 years, CAS has been collecting small molecules and cataloging their chemical structures, sequences, toxicity, and known biological activity. CAS has built a candidate compound dataset of about 50,000 compounds, chosen based on their chemical structures’ similarity to known antiviral compounds and on those structures’ druggability and toxicity. The collection is available within the CORD-19 dataset.
“It will give scientists a head start, if you will,” Dennis said.
CAS started by compiling a list of all the known antivirals using SciFinder-n, the CAS discovery platform for mining the 100 million small molecules in the CAS registry.
“We pulled out known antiviral compounds. One example is remdesivir; that has a CAS registry number and we know a lot about that molecule including its shape. We ended up with about 100 known antivirals. We didn’t focus on just COVID-19; we didn’t focus on just coronaviruses. We went a little broader,” Dennis said. From there, the team expanded the pool of candidates by using substructure searching and similarity searching to find compounds with chemical structures similar to those 100 known antivirals, then refined the list by size, toxicity, and biological activity. They looked for anti-infective agents, respiratory system agents, and enzyme inhibitors.
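The article does not say how CAS scores similarity. One common approach in cheminformatics is the Tanimoto coefficient computed over structural fingerprints, followed by simple property filters. The Python sketch below shows that general idea only; the fingerprints, compound names, molecular weights, toxicity flags, and thresholds are all hypothetical and are not drawn from the CAS dataset or its method.

```python
def tanimoto(a, b):
    """Tanimoto similarity: shared fingerprint bits divided by total distinct bits."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical fingerprints: each integer stands for one structural feature.
KNOWN_ANTIVIRALS = {"antiviral_A": frozenset({1, 2, 3, 4, 5, 6})}

# Hypothetical candidates with made-up size and toxicity annotations.
CANDIDATES = {
    "cand_1": {"bits": frozenset({1, 2, 3, 4, 5, 7}), "mol_weight": 420.0, "toxic": False},
    "cand_2": {"bits": frozenset({1, 2, 8, 9}), "mol_weight": 310.0, "toxic": False},
    "cand_3": {"bits": frozenset({1, 2, 3, 4, 5, 6}), "mol_weight": 390.0, "toxic": True},
    "cand_4": {"bits": frozenset({1, 2, 3, 4, 5, 6, 7}), "mol_weight": 980.0, "toxic": False},
}


def select_candidates(candidates, references, min_similarity=0.7, max_weight=600.0):
    """Keep candidates similar to any reference, then filter by size and toxicity."""
    selected = []
    for name, cand in candidates.items():
        best = max(tanimoto(cand["bits"], ref) for ref in references.values())
        if best >= min_similarity and cand["mol_weight"] <= max_weight and not cand["toxic"]:
            selected.append(name)
    return sorted(selected)


print(select_candidates(CANDIDATES, KNOWN_ANTIVIRALS))
```

In this toy data only `cand_1` survives: `cand_2` is too dissimilar, `cand_3` is flagged toxic, and `cand_4` exceeds the size cutoff. A real pipeline would use a chemistry toolkit to compute fingerprints from actual structures.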
“We ended up with a candidate compound dataset of about 50,000 compounds,” Dennis said. “We can’t guarantee they’re going to treat a [viral infection], but they’re related to known antivirals based on all the work we did.”
CAS released its COVID-19 structures dataset in mid-March and made it available through the CORD-19 dataset hosted at Semantic Scholar. CAS is already working on additional datasets. “We’re starting to look at SAR data, structure-activity relationship data. It has to do with how these molecules might bind to a target, a protein. That relationship is important in the treatment of any disease,” Dennis says.
Uniting Effort
Dennis says the CAS dataset has been downloaded by pharma companies, biotechs, and academic researchers all over the globe. Many are organizations CAS has had long relationships with, but some are new. “They’re organizations that aren’t traditional biotech or pharmaceutical companies. They’re organizations that focus more on software and AI. They normally wouldn’t license tools like SciFinder, but they want access to this kind of rocket fuel for their AI engines,” he said.
On the AI side, Raymond is seeing a similar convergence. “We’re seeing a lot of interest from both communities,” he said. “The NLP community, which is using natural language processing techniques to try to unearth information embedded in this dataset, is very much engaged and has been publishing tools and new reviews and information based on what we’ve released. We’re also seeing the medical research community take a great deal of interest in the resources as well.”
Both Dennis and Raymond believe that offering these biomedical datasets to both life sciences researchers and AI researchers will accelerate the discovery of a cure.
“I think it’s going to be a hybrid [effort],” Dennis said of the future cure. “I think it’s going to be the combo of the AI tech with the more traditional science that’s going to unlock the next treatment for COVID-19. And it’s out there. I’m 100% convinced we will find it.”
Raymond agreed. “We were founded as an AI institute for the common good. To have a threat like COVID-19 [impacts all of us.] It’s a great opportunity to show how AI can support a better way of doing science. We hope that not only are we able to help find treatments and ultimately a cure for COVID-19, but we’re able to accelerate scientific progress more generally.”