AI Is Nothing Without An AI-Ready Data Strategy

Contributed Commentary by Chris Stumpf, PhD, Revvity Signals

October 25, 2024 | Applied artificial intelligence (AI) is already transforming drug discovery research and development, accelerating the journey from lab to lifesaving therapies, and both the reality and the hype are very well known.

Less obvious, like the roughly 90% of an iceberg that sits below the waterline, is the critical role played by data. In short, AI is only as powerful as the data it consumes. While there is abundant scientific data in pharmaceuticals and biotechnology, it is often trapped in multiple independent systems, poorly modeled, and hard to access.

Despite the advances already made, it is only by optimizing the ingestion, storage, organization, and maintenance of data that leaders will be able to take full advantage of AI in our mission to deliver the promised revolution in drug discovery.

By way of example, Google DeepMind, whose co-founder was jointly awarded the 2024 Nobel Prize in Chemistry, and Isomorphic Labs continue to invest in AlphaFold, which uses AI to predict protein structures based on a vast library of existing data. Depending on the estimates, clinical development of a new cancer drug can take between five and eight years and could cost more than $4 billion. Even modest discovery improvements could therefore lead to significant savings and deliver life-changing outcomes more rapidly for millions of people.

Clearly, tools such as generative AI and machine-learning algorithms not only hold the promise of dramatic breakthroughs, but also fuel what scientists do best: think, experiment, and discover new therapies.

However, data is not enough on its own. Metadata, the ‘data about data,’ contextualizes each piece of information, detailing its origin, nature, and specifications. In R&D, effective metadata includes the methods used to collect the data, the conditions under which experiments were conducted, the parameters measured, and the protocols followed. Good metadata practice helps to make sure that data is findable, accessible, interoperable, and reusable (FAIR principles).

When intuitively integrated into the data collection process, enriched metadata enables AI and ML algorithms to perform deep analytics at scale and to reveal meaningful patterns for further investigation.
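To make the idea concrete, here is a minimal Python sketch of metadata captured alongside the measurements it describes. It is an illustration only: the class name, field names, identifier, and example values are all hypothetical, and a real system would follow whatever vocabulary and identifiers the organization has agreed on.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class ExperimentMetadata:
    """Hypothetical metadata record; every field name here is illustrative."""
    dataset_id: str            # stable identifier, so the data is findable
    method: str                # how the data was collected
    conditions: dict           # conditions under which the experiment ran
    parameters_measured: list  # what was measured
    protocol: str              # protocol that was followed


def attach_metadata(values, metadata):
    """Bundle raw values with their metadata at the moment of capture."""
    return {"metadata": asdict(metadata), "values": list(values)}


# Example values are invented for illustration only.
record = attach_metadata(
    [0.12, 0.15, 0.11],
    ExperimentMetadata(
        dataset_id="ASSAY-0001",
        method="LC-MS",
        conditions={"temperature_c": 25, "ph": 7.4},
        parameters_measured=["peak_area"],
        protocol="SOP-42",
    ),
)

# Serializing to a plain, widely supported format keeps the record
# accessible to other tools (one small step toward interoperability).
serialized = json.dumps(record, sort_keys=True)
```

Because the context travels with the values, a later analysis can filter or group by method, conditions, or protocol without anyone having to reconstruct how the data was produced.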

Unfortunately, many scientific organizations use rigid data schemas with predefined structures and strict parameters, forcing scientists to fit their rich, multifaceted data into limited input boxes, often at the expense of nuance and detail that could be crucial for future research. Scientific discovery is inherently unpredictable, and data captured today may be used in unexpected ways to answer questions that arise tomorrow.

Data bound too early into a rigid schema may lose its potential to inform future analyses or be re-interrogated under a new scientific lens. Additionally, rigid data models can impede the integration of disparate data types, which is an increasingly critical capability as research becomes more interdisciplinary. For example, combining biochemical data with clinical observations, patient demographics, and real-world evidence is essential for a full understanding of therapeutic outcomes.

In the context of AI, models trained on incomplete or inadequately structured datasets may yield suboptimal or biased predictions, undermining the accuracy, validity, and utility of the insights they generate.

In contrast, a different methodology called “late binding of schema” captures data in a flexible form and defers imposing structure until the data is actually used, leaving room to change the data’s presentation and restructure it when needed. This approach acknowledges the unpredictable nature of scientific discovery and the diverse needs of data analytics, including AI. Late binding of schema ensures that data remains a living entity within the research ecosystem, capable of continual growth and re-examination in the quest for scientific breakthroughs.
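A small Python sketch may help show what late binding looks like in practice. It assumes a simplified setup in which records are stored as free-form dictionaries and a schema is applied only when the data is read; the record contents, the `bind_schema` helper, and the two views are hypothetical and not drawn from any particular product.

```python
# Raw records are stored exactly as captured; no fixed schema is imposed.
# All compound names and values below are invented for illustration.
raw_store = [
    {"compound": "CPD-001", "ic50_nM": 12.5, "assay": "kinase",
     "notes": "slight precipitation"},
    {"compound": "CPD-002", "ic50_nM": 48.0, "assay": "kinase",
     "cell_line": "HEK293"},
    {"compound": "CPD-003", "assay": "binding", "kd_nM": 3.2},
]


def bind_schema(records, schema):
    """Project raw records onto a schema chosen at read time.

    `schema` maps each output field to (source key, converter).
    Fields absent from a record come back as None rather than
    causing the record to be rejected.
    """
    rows = []
    for rec in records:
        row = {}
        for out_name, (src_key, convert) in schema.items():
            value = rec.get(src_key)
            row[out_name] = convert(value) if value is not None else None
        rows.append(row)
    return rows


# Two different questions, two different schemas, one untouched raw store.
potency_view = {
    "compound": ("compound", str),
    "ic50_uM": ("ic50_nM", lambda v: v / 1000.0),  # unit conversion on read
}
context_view = {
    "compound": ("compound", str),
    "notes": ("notes", str),
    "cell_line": ("cell_line", str),
}

potency_rows = bind_schema(raw_store, potency_view)
context_rows = bind_schema(raw_store, context_view)
```

The design point is that details such as the free-text note or the cell line were never discarded to fit an input form, so a view that nobody anticipated at capture time can still be defined later.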

In one of the early examples of generative AI, a model in 2018 predicted new ligands that were then empirically verified as new retinoid X receptor modulators. A year later, a generative AI solution identified novel kinase DDR1 inhibitors designed to combat fibrosis, which then went on to a successful pre-clinical trial in just 21 days. In a February 2024 Frontiers in Pharmacology paper titled “Generative artificial intelligence in drug discovery: basic framework, recent advances, challenges, and opportunities,” the authors reported: “The biggest obstacles to AI-driven drug development are the immense rounds of trial and error, the absence of a generic/universal mathematical system, and the lack of adequate data.”

For the pharmaceuticals sector, help is at hand. The best solutions truly recognize the role and importance of FAIR data, incorporate full metadata capabilities, and adopt late binding of schema principles. Even as ChatGPT, Gemini, Copilot, and the others seize the public limelight, my conclusion is that, for research purposes, AI is nothing without data, and an underpinning AI-ready data strategy is crucial.

Chris Stumpf holds a PhD in Analytical Chemistry and Mass Spectrometry, and is Director, Drug Discovery Informatics Solutions, at Revvity Signals. Chris can be reached at chris.stumpf@revvity.com.