Building a Tech-Enabled Drug Discovery Pipeline at AstraZeneca
By Allison Proffitt
June 7, 2022 | AstraZeneca is using technology to speed and improve drug discovery at every step of the process, explained Anna Berg Asberg, global VP of R&D IT, last month in her presentation at the Bio-IT World Conference and Expo outlining AstraZeneca’s Bio-IT World Innovative Practices Award-winning project.
“End to end, we use technology to really accelerate [drug discovery], and we believe it’s really important how it’s moving in every step,” Asberg said. “You can’t put in all the effort and digitize one part of the chain. You need to have the whole chain with you. It’s always about maturity. If you find something that works really well here, it can be scaled to other parts. You need to have that adoption and maturity in the company—actionable willingness to take on the innovation.”
AstraZeneca’s award-winning project—Augmented Drug Design (ADD)—brings innovation to the drug design portion of the pipeline, but Asberg highlighted efforts in target discovery, digital labs, image analysis, clinical trials, and natural language processing as well. At each step, technology is being deployed to make the experience frictionless for users, whether they are researchers, data scientists, clinical trial participants, or regulatory affairs teams.
For Augmented Drug Design, AstraZeneca focused on chemists as the end users, and built the tool as a series of products on top of platforms, Asberg explained, reusing and remixing existing platform services and technologies to deliver outcomes.
“In drug discovery, there are multiple different problems… and we built [the tools] as different products so the scientists come in [and use them] almost as an application and have these different options. They dive into which one they need in that first step,” she said. “It’s all very easy to use.”
Driving Principles
Alla Bushoy, Engineering Team Lead on the project, outlined the guiding principles for the ADD architecture. “We used managed services where possible. The AWS cloud performs the heavy lifting for us, allowing us to focus on delivering business value. We moved away from traditional analytic applications architecture to microservices. It helps us to realize feature often, and pivot easily when we need to. Using open-source software and participation in the research consortium helps us to move faster and increase operational agility and save costs. And lastly,” she said, “using infrastructure-as-a-code gives us stable, consistent environment for faster iterations.”
Fast is the goal. The drug discovery process is a series of Design-Make-Test-Analyze (DMTA) cycles, each of which can take four to six weeks. Narrowing the candidates and speeding testing and analysis could decrease the number of DMTA cycles, which in turn speeds drug discovery.
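To make the arithmetic concrete, here is a minimal sketch of how trimming DMTA cycles shortens a project. The four-to-six-week range comes from the figures above; the function name and the cycle counts (10 and 7) are hypothetical, chosen only for illustration, and are not AstraZeneca figures.

```python
def dmta_duration_weeks(cycles, min_weeks_per_cycle=4, max_weeks_per_cycle=6):
    """Return the (shortest, longest) total duration in weeks for a number of DMTA cycles."""
    if cycles < 0:
        raise ValueError("cycles must be non-negative")
    return cycles * min_weeks_per_cycle, cycles * max_weeks_per_cycle


# Hypothetical project: better predictions cut 10 cycles down to 7.
before = dmta_duration_weeks(10)
after = dmta_duration_weeks(7)

# Weeks saved at the fast end and the slow end of the per-cycle range.
saved = (before[0] - after[0], before[1] - after[1])
print(before, after, saved)
```

Under these assumed numbers, removing three cycles saves between 12 and 18 weeks.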
The ADD functionality is broad. There are products or apps for molecule ideation, activity prediction, property prediction, and synthesis prediction. These tools are supported by a foundational layer which includes the Global Analytical Database (GAD), Predictive Insight Platform (PIP), and Chemistry Application Gateway (CAG).
Building Platform Foundations
The Chemistry Application Gateway (CAG) was built in Amazon Web Services (AWS) using Kubernetes (AWS EKS), Istio, Apache Ignite, AWS OpenSearch, Nginx, AWS network load balancers, and Grafana. The award entry calls CAG an autoscaling and auto-healing solution for chemistry applications at scale, providing the integration nexus between scientists’ drug discovery applications and new data and AI services provided by the platform. Data come from AstraZeneca chemistry databases, external data sources, local lab instruments, and CRO lab instrument results delivered to an AWS S3 CRO bucket. Once these data are all uploaded to AWS and processed, they are integrated into a searchable Data Hub. Since ADD launched about two years ago, more than 150 data and chemistry service APIs have been delivered.
“Providing a centralized data hub of chemistry data on the cloud has unlocked the data silos, creating a single source of truth for scientists,” Bushoy said. “Using scalable APIs, scientists and chemistry applications can access required data and has created a foundation to build upon for drug discovery platforms.”
Most machine learning models are not deployed and used after development, Bushoy said. She highlighted many reasons for that: models that were not tested with real users, insufficient infrastructure to scale a model, and a lack of agility for new research and uses. The Predictive Insight Platform (PIP), developed to address those issues, currently runs more than 100 AI/ML models, processing ~1.5 million compounds per day and generating DMPK, ADME, and other predictions. Models are now deployed in less than a day, and Bushoy predicts self-service deployment soon.
PIP uses an advanced chemistry batching and caching service to surface all models; the service is built on Node.js, AWS ElastiCache, and Postgres. Model deployment follows a GitOps methodology, continuously syncing models from a Git repository to the Kubernetes cluster, and uses Seldon Core, KEDA, WSO2 API Manager, Helm, and Istio for deployment and autoscaling, with Grafana for operational monitoring. To date, PIP has been integrated with nine DMTA applications, giving each the ability to run predictive design, DMPK, and safety models at scale, and the work has led to several publications.
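The batching and caching idea can be sketched in a few lines. This is an illustration only, not PIP's implementation: the real service is described as built on Node.js, ElastiCache, and Postgres, whereas this sketch uses Python with an in-memory dictionary as the cache. The class name, the `model_fn` interface, and the toy model are all assumptions.

```python
from typing import Callable, Dict, Iterable, List


class BatchingCachingPredictor:
    """Hypothetical wrapper: caches per-compound results and calls the model in batches."""

    def __init__(self, model_fn: Callable[[List[str]], List[float]], batch_size: int = 1000):
        if batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        self._model_fn = model_fn
        self._batch_size = batch_size
        self._cache: Dict[str, float] = {}  # in-memory stand-in for a shared cache
        self.model_calls = 0  # counts batched model invocations

    def predict(self, smiles_list: Iterable[str]) -> List[float]:
        smiles_list = list(smiles_list)
        # Unique compounds not yet cached, in first-seen order.
        missing = list(dict.fromkeys(s for s in smiles_list if s not in self._cache))
        # Send only the missing compounds to the model, one batch at a time.
        for start in range(0, len(missing), self._batch_size):
            batch = missing[start:start + self._batch_size]
            results = self._model_fn(batch)
            self.model_calls += 1
            self._cache.update(zip(batch, results))
        # Answer every request, duplicates included, from the cache.
        return [self._cache[s] for s in smiles_list]


def toy_model(batch: List[str]) -> List[float]:
    """Placeholder model: scores a SMILES string by its length."""
    return [float(len(s)) for s in batch]


predictor = BatchingCachingPredictor(toy_model, batch_size=2)
first = predictor.predict(["CCO", "c1ccccc1", "CCO", "CC"])
calls_after_first = predictor.model_calls
second = predictor.predict(["CCO", "CC"])  # served entirely from the cache
```

The design choice illustrated here is that repeated compounds never reach the model twice, and uncached compounds are grouped so that each model call amortizes its overhead over a whole batch.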
Finally, the Global Analytical Database (GAD) provides instant access to analytical and structural data through a central S3-based data lake and Elasticsearch-based metadata store, with the added ability to automate and predict analysis results through a scaled deployment of the ACD Spectrus database. Over 80 analytical instruments have been connected, and over 22 million files loaded, providing a rich source of data for analytical scientists to query and research, and datasets on which to build ML models to facilitate purification method selection.
Products on Top
On top of this foundational layer is the selection of tools or apps that ADD offers users.
REINVENT is an AI tool for de novo design, built in-house, that can generate cutting-edge compound libraries and post-process them with a range of computational tools.
AIZynth is ML-based retrosynthesis software that mitigates the bottleneck of manual synthesis assessment by generating hypothetical synthetic routes that can be used to rapidly prioritize compounds by ease of synthesis.
Aligned with AIZynth, Synthesis Informatics collects and curates internal and external reaction data for new AI models and for graph exploration within pharmaceutical sciences. Twelve million reactions to date have been extracted, transformed, and loaded into Reaction Connect, the central reaction data repository in Elasticsearch.
The CAZP platform uses AI to limit the “astronomical size” of the chemical space, predicting a more focused set of compounds to bring into the clinic.
Schrödinger Active Learning FEP+ combines physics-based models with active learning to accurately predict the binding affinity for a large number of molecules against the disease target. This scaled capability, built in collaboration with Schrödinger and AWS, has enabled AstraZeneca to run FEP calculations at 100 times the previous rate, generating up to 250K FEP data points per annum.
An AI-Augmented Design Environment (AIDE) is scheduled for later delivery and is planned to incorporate AI models and recommendations into MedChem design workflows.
Every new small molecule drug design project at AstraZeneca starts on the ADD platform, Asberg explained, with about 70% of the total projects now on the platform. About 1,000 scientists are using ADD, and the number of models on the platform has doubled in a year.
“We see a lot of potential—end to end—to buy tech applications or build tech applications, adopt all over the place and scale,” Asberg said. “It’s always about helping the scientist go faster with higher quality.”