Recursion, Novo Nordisk Release New Chemistry Foundation Model
By Bio-IT World Staff
November 19, 2024 | In a paper published last week in Nature Communications, teams from Recursion and Novo Nordisk introduced MolE, a new foundation model for chemistry. Benchmarked on absorption, distribution, metabolism, excretion, and toxicity (ADMET) tasks from the Therapeutic Data Commons, MolE outperformed previous models.
Using machine learning for chemical property prediction has long been limited by how molecules are represented. The earliest quantitative structure-activity relationship studies used physicochemical properties or molecular weight, not chemical structure. Later models used molecular fingerprints that encode substructures of a molecule, either as preset chemical groups or as atom environments; these, too, fail to preserve the complete molecular graph topology, especially when the fingerprint length is small. SMILES, a string-based representation of molecules, further improved models by making molecular structures easy to store and search, and SMILES strings have been used as inputs for deep learning architectures such as recurrent neural networks (RNNs) and Transformers. But a molecule does not have a unique SMILES representation, and for large molecules this can lead to inconsistencies.
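The bit clashes that plague short fingerprints can be illustrated with a toy example. The sketch below (plain Python, not RDKit's actual Morgan implementation; the `env_*` substructure names are made up for illustration) folds substructure identifiers into a short bit vector. With more distinct substructures than bits, collisions are unavoidable by the pigeonhole principle, so the vector cannot tell all the substructures apart:

```python
import zlib

def folded_fingerprint(substructures, n_bits=8):
    """Toy hashed fingerprint: fold each substructure ID into n_bits.
    Real fingerprints (e.g., Morgan/ECFP) hash atom environments in a
    similar spirit, but into far longer vectors (1024-4096 bits)."""
    bits = [0] * n_bits
    for sub in substructures:
        # zlib.crc32 gives a deterministic hash across runs
        bits[zlib.crc32(sub.encode()) % n_bits] = 1
    return bits

# 20 distinct (hypothetical) substructure identifiers folded into 8 bits:
subs = [f"env_{i}" for i in range(20)]
fp = folded_fingerprint(subs, n_bits=8)
# At most 8 bits can be set for 20 substructures, so several distinct
# substructures necessarily map to the same bit.
print(sum(fp), "bits set for", len(subs), "substructures")
```

Longer bit vectors reduce, but never eliminate, such clashes, which is one motivation for learned embeddings.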
MolE, short for Molecular Embeddings, is a model that learns molecular embeddings at the atomic environment level directly from a molecular graph using a transformer, the authors write (DOI: 10.1038/s41467-024-53751-y). The model retains the position of each atom relative to the others in the molecular structure.
Building the Model
The team used a new two-step pretraining strategy for graphs in which each atom predicts its atom environment, i.e., the atom type and connectivity of all neighboring atoms, the authors write. The first step is self-supervised, learning chemical structure representations from an unlabeled dataset of over 840 million molecules. In this step, the authors write, “the prediction task is not to predict the identity of the masked token, but to predict the corresponding atom environment (or functional atom environment) of radius 2, meaning all atoms that are separated from the masked atom by two or less bonds.” The second training step uses graph-level supervised pretraining with a large labeled dataset of about 465,000 molecules, they explain.
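What a "radius 2" atom environment covers can be made concrete with a short sketch (plain Python on an adjacency list, not the authors' code): a breadth-first walk that collects every atom within two bonds of a chosen center, which is the neighborhood the masked atom is trained to predict.

```python
from collections import deque

def atoms_within_radius(adjacency, start, radius=2):
    """Return all atoms separated from `start` by `radius` or fewer bonds.
    `adjacency` maps each atom index to the indices it is bonded to."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        atom, dist = frontier.popleft()
        if dist == radius:
            continue  # do not expand beyond the requested radius
        for nbr in adjacency[atom]:
            if nbr not in seen:
                seen.add(nbr)
                frontier.append((nbr, dist + 1))
    return seen

# A linear five-atom chain 0-1-2-3-4:
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(sorted(atoms_within_radius(chain, 0)))  # [0, 1, 2]
```

From a terminal atom of the chain, the radius-2 environment reaches only two atoms further along; from the middle atom it would cover the whole chain.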
To assess the training, the team used a set of 22 ADMET tasks included in the Therapeutic Data Commons (TDC) benchmark.
“An advantage of using this benchmark is that it provides a standardized way to compare model performance (using the mean and standard deviation of 5 independent runs),” the authors write. “As of September 2023, there have been ~15 different methods officially evaluated on this benchmark, including models using precomputed fingerprints (e.g., RDKit or Morgan Fingerprints), convolutional neural networks using SMILES, and different versions of graph neural networks such as ChemProp.”
Of the 22 tasks that make up the TDC benchmark, MolE scored best on six regression and four classification tasks and second best on four other tasks. The authors credit the training approach with the model’s success.
“We hypothesize that learning atom environments forces the model to aggregate the local chemical groups that will be used for prediction. Learning an embedding of atom environments and how to aggregate them into a molecular embedding can help to solve some problems of classical fingerprints such as sparsity and clashes when using bit vectors,” the authors write.