Bioassays Have An Integration Problem: Collaboration Will Be Key To Making Them FAIR
Contributed Commentary by Dana Vanderwall, Schrödinger, and Vladimir Makarov, Pistoia Alliance
February 10, 2023 | Whilst life science companies have come to recognize data as their greatest asset, it is also their greatest challenge. The answers to the biggest questions facing the industry today could already be held within the countless proprietary experiment notes, published literature, and patient records produced in previously conducted experiments. The data landscape is continuously growing in complexity and scale as organizations generate more research, but much of it is siloed in different formats and locations. This makes it difficult to discover, query, and share—rendering data essentially unusable.
Bioassay protocols are one example where legacy data management systems are holding R&D back, and where adopting the FAIR (Findable, Accessible, Interoperable, Reusable) principles would improve the usability of the data. Bioassay protocols constitute the essential metadata for most of the experimental results collected in the process of drug discovery. While assay protocols are widely accessible, and often stored in public data banks, they are universally kept in plain-text formats. This means they are not machine-readable and must be reviewed manually, which demands considerable time from highly qualified professionals. Scientists must sift through vast libraries of old records; there are currently more than 1.4 million unformatted bioassays. Pistoia Alliance research found that some researchers may spend up to twelve weeks per assay selecting and planning new experiments.
The Bioassays Challenge: Scope And Integration
The Pistoia Alliance BioAssay FAIR Annotation project, informally known as DataFAIRy, aims to overcome these challenges. Its objective is to enable high-quality curation of a large body of published biological assay protocols and to convert the metadata they contain into machine-readable FAIR data objects for use by the global scientific community.
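To make the idea of a machine-readable annotation more concrete, the following is a minimal Python sketch, not the project's actual data model. The protocol text, the field names, the `find_assays` helper, and the `EX:` term identifiers are all hypothetical placeholders; a real annotation would reference terms from ontologies such as the BioAssay Ontology.

```python
import json

# Hypothetical free-text protocol snippet, as it might sit in a public repository.
protocol_text = "Inhibition of a kinase measured by luminescence in a 384-well plate."

# Hypothetical structured annotation of the same protocol.
# The "EX:" identifiers are placeholders, not real ontology term IDs.
annotation = {
    "assay_id": "example-assay-001",
    "source_text": protocol_text,
    "annotations": [
        {"property": "detection method", "label": "luminescence", "term_id": "EX:0000001"},
        {"property": "assay format", "label": "biochemical", "term_id": "EX:0000002"},
        {"property": "plate format", "label": "384-well plate", "term_id": "EX:0000003"},
    ],
}


def find_assays(records, property_name, label):
    """Return the IDs of records annotated with the given property/label pair."""
    return [
        record["assay_id"]
        for record in records
        if any(
            a["property"] == property_name and a["label"] == label
            for a in record["annotations"]
        )
    ]


# The structured form can be serialized and queried without reading the free text.
print(json.dumps(annotation, indent=2))
print(find_assays([annotation], "detection method", "luminescence"))
```

Once protocols are annotated this way, a question such as "which assays use luminescence detection?" becomes a simple lookup rather than a manual review of plain-text records.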
Kickstarting a FAIR data project in such a technical area poses two major challenges: scope and integration. Scoping raises a question common to any data project: how much metadata is enough, and where do we start? There are many resources to turn to when it comes to enriching data to make it machine-readable, including the BioAssay Ontology, the Ontology of Biomedical Investigations, and the Cell Ontology, to name a few. Ontologies specifically applicable to the biological assay domain are widely available and reasonably mature. However, within these ontologies, the number of potentially useful classes and properties easily reaches the thousands. As a result, scientists find it difficult to select the ontologies and standards that are most relevant, which delays the project.
DataFAIRy: A Collaborative Case Study
From the beginning of the project, it was agreed that the scope of metadata must be clearly defined. Choosing the key classes and properties relied on the collective experience of team members, who tried to capture the typical questions that scientists want to answer with assay data in public repositories but currently cannot, because the data is so difficult to handle. Defining the common scientific queries first and then shaping the data project around them follows a best practice borrowed from software engineering. This approach allowed the DataFAIRy team to avoid being overwhelmed by scope and to identify value-added opportunities for filling gaps in metadata that can be generalized.
In the pilot phase of the project, the team used NLP software to annotate about 500 assay protocols selected from a variety of sources. The resulting annotations were deposited into PubChem. The next phase of the project plans to scale up the annotation process by 10- to 100-fold, and to resolve the unique technical challenges that large-scale FAIR annotation presents. It will also expand the assay templates with required and recommended metadata fields and a standard vocabulary for kits and high-value panels. The team aims to canvass service providers to serve their data in conformance with the templates, and to encourage assay kit and reagent makers to contribute to the development and provisioning of the metadata for their products.
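As an illustration of how a template with required and recommended metadata fields might be checked in practice, here is a short Python sketch. The `AssayTemplate` class, the field names, and the example record are assumptions made for illustration only; they are not the DataFAIRy templates themselves.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AssayTemplate:
    """Hypothetical assay template listing required and recommended fields."""

    name: str
    required: List[str]
    recommended: List[str] = field(default_factory=list)

    def check(self, record: Dict[str, str]) -> Dict[str, object]:
        """Report which template fields are missing or empty in a record."""
        missing_required = [f for f in self.required if not record.get(f)]
        missing_recommended = [f for f in self.recommended if not record.get(f)]
        return {
            "valid": not missing_required,  # valid only if every required field is filled
            "missing_required": missing_required,
            "missing_recommended": missing_recommended,
        }


# Hypothetical template and record; the field names are placeholders.
template = AssayTemplate(
    name="example kit template",
    required=["target", "detection_method"],
    recommended=["plate_format"],
)
record = {"target": "kinase X", "detection_method": ""}

report = template.check(record)
print(report)
```

Separating required from recommended fields lets a data provider see at a glance whether a record conforms to the template and where it could still be enriched.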
Through collaboration, the project aims to arrive at collective definitions of the required information, which should lead to the eventual emergence of a standard information model for assay protocol metadata. This model will be shared using resources such as FAIRsharing.org and the CEDAR Workbench (https://metadatacenter.org/), and supported by data management tools, such as the BioHarmony Annotator used by the Pistoia Alliance team, so that the standards can be more broadly adopted.
Changing The Industry Beyond Assays
Collaborative projects such as DataFAIRy strive to create a future where standardized reference ontologies and templates allow data to be ingested seamlessly because it is formatted in a common language that is readable by both people and computers. Assays are just one area that can benefit from being digitized to promote discovery, analysis, and usability. However, FAIR data must be rolled out across the entire industry if companies want to accelerate R&D and enable scientific queries to be answered more rapidly. Widespread change across the industry can only be achieved when organizations share their expertise to decide on and adopt the data standards that are essential to driving a FAIRer future.
Dr Dana Vanderwall is Executive Director, Enterprise Informatics at Schrödinger. He was previously Director of Biology & Preclinical IT at Bristol Myers Squibb, and lead of Computational Chemistry at GSK. He has a PhD in Biochemistry from the University of Maryland.
Dr Vladimir Makarov is a consultant and project lead at not-for-profit life sciences organization The Pistoia Alliance. He is currently Programme Manager for the Alliance’s AI and ML Centre of Excellence and for the DataFAIRy project. His past experience has centered mostly on informatics, including roles at Illumina, Pfizer, and BT Global Services. He has a PhD in computational biology from Baylor College of Medicine and is a former faculty member at California State University and the University of Maryland. He can be reached at vladimir.makarov@pistoiaalliance.org.