Large Language Model Predicts, as Well as Explains, Molecular Properties
By Deborah Borfitz
April 10, 2025 | Scientists use prior knowledge to solve new problems, and a novel artificial intelligence (AI) system known as LLM4SD (Large Language Model 4 Scientific Discovery) can likewise synthesize knowledge from existing literature. But the method “goes beyond simply repeating information” by interpreting it and coming up with new hypotheses, says Geoff Webb, Ph.D., data scientist from the faculty of information technology at Monash University (Australia) and co-author of a study testing the system with 58 separate research tasks.
LLM4SD not only predicts molecular properties for scientific discovery purposes; it also explains the results (Nature Machine Intelligence, DOI: 10.1038/s42256-025-00994-z). The model’s ability to spot new patterns in data that human scientists might overlook is analogous to AI algorithms capable of spotting cancers on images missed by radiologists, Webb says.
The framework “allows the conjectures of the LLM to be used in a failsafe manner,” he points out. “That is, we get the LLM to specify descriptors of the molecules that may be useful for predicting a target molecular property and then we get a conventional machine learning method to determine which of those descriptors is actually predictive and how.”
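The two-stage idea Webb describes can be sketched in a few lines of Python. This is a minimal illustration, not the LLM4SD implementation: the descriptor names, the toy numbers, and the scoring function (absolute Pearson correlation with the label) are all assumptions standing in for LLM-proposed descriptors and for whatever conventional machine learning method is actually applied.

```python
from statistics import mean, pstdev

def abs_correlation(values, labels):
    """Absolute Pearson correlation between one descriptor column and the labels."""
    mean_v, mean_l = mean(values), mean(labels)
    std_v, std_l = pstdev(values), pstdev(labels)
    if std_v == 0 or std_l == 0:
        return 0.0  # a constant column carries no signal
    cov = mean([(v - mean_v) * (l - mean_l) for v, l in zip(values, labels)])
    return abs(cov / (std_v * std_l))

def rank_descriptors(table, labels):
    """Return (descriptor name, score) pairs, most predictive first."""
    scores = {name: abs_correlation(col, labels) for name, col in table.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical descriptors an LLM might propose.
# All numbers are invented toy values for illustration only.
candidate_descriptors = {
    "molecular_weight": [180.0, 250.0, 310.0, 420.0, 510.0, 600.0],
    "logp": [1.2, 2.0, 2.8, 3.9, 4.6, 5.5],
    "ring_count_parity": [0, 1, 1, 0, 0, 1],  # deliberately uninformative
}
toxic = [0, 0, 0, 1, 1, 1]  # toy binary target property

ranking = rank_descriptors(candidate_descriptors, toxic)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

The "failsafe" aspect shows up in the ranking: a proposed descriptor that does not track the target simply receives a low score, so a poor conjecture from the LLM costs little.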
Another distinctive contribution of LLM4SD is its demonstrated ability to perform complex machine learning, adds Yizhen Zheng, Ph.D. candidate from the department of data science and AI at Monash University’s faculty of information technology. “It can take a large set of examples of molecules together with their respective target properties and learn descriptors that are predictive of the target property from those examples.”
Previous related methods have instead used “human-crafted molecular descriptors as input to the machine learning system for predicting the target property,” he says. The study showed that the LLM4SD approach was more effective than the human-crafted descriptors, and that combining the two methods was “even more effective than either in isolation.”
‘Interpretable Knowledge’
In the latest study, researchers conducted a comprehensive set of tasks focused on drug discovery-related properties. Of the 58 tasks tested with LLM4SD, 39 were “dedicated to toxicity and adverse drug effect predictions, which are highly relevant to assessing a drug’s safety and efficacy,” says Webb. The tasks included toxicity and adverse effect predictions for key biological targets and pathways (e.g., androgen receptor, estrogen receptor, mitochondrial membrane potential, and DNA damage p53-pathway) and side effect activity predictions (e.g., metabolism and nutrition disorders, vascular disorders, and social circumstances).
The side effect activity predictions used the SIDER dataset, which covers a broad range of potential drug-induced side effects across various physiological systems, he notes. LLM4SD was additionally tested on critical drug-like properties that affect a molecule’s pharmacokinetics and overall drug potential, including lipophilicity, solubility, blood-brain barrier permeability, and binding affinity predictions targeting HIV and BACE (a key enzyme in Alzheimer’s disease).
Predictions about molecular properties were made by combining knowledge from two sources, Webb says: scientific literature, from which the system draws insights already included in its training data, and scientific data, in which it identifies patterns across molecular datasets. “This information is presented as interpretable knowledge, transforming molecules into feature vectors that quantify how strongly they exhibit certain properties.”
LLM4SD can, for example, identify key descriptors like molecular weight and lipophilicity to predict whether a molecule can cross the blood-brain barrier, he adds. “This helps explain why a molecule behaves a certain way, improving both prediction and understanding.”
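As a rough illustration of how rules can turn a molecule into a feature vector, the sketch below applies three hand-written rule functions to a molecule’s precomputed properties. The rules, thresholds, and property names are hypothetical placeholders, not rules reported by the study; in LLM4SD the rules come from the LLM.

```python
# Each rule maps a molecule's properties to a number; together the outputs
# form the feature vector. Thresholds are illustrative assumptions only,
# not values taken from the LLM4SD paper.
RULES = [
    ("mol_weight_below_450", lambda m: 1.0 if m["mol_weight"] < 450 else 0.0),
    ("logp_between_1_and_4", lambda m: 1.0 if 1.0 <= m["logp"] <= 4.0 else 0.0),
    ("h_bond_donors", lambda m: float(m["h_bond_donors"])),
]

def to_feature_vector(molecule):
    """Apply every rule in order and collect the numeric results."""
    return [rule(molecule) for _, rule in RULES]

def explain(molecule):
    """Pair each feature value with the name of the rule that produced it."""
    return {name: rule(molecule) for name, rule in RULES}

# A made-up molecule described by precomputed properties.
example = {"mol_weight": 320.4, "logp": 2.7, "h_bond_donors": 1}

print(to_feature_vector(example))
print(explain(example))
```

Because every position in the vector is tied to a named rule, a downstream model’s weights can be read back as statements about molecular weight, lipophilicity, and so on, which is where the interpretability comes from.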
Boosting Efficiency
Since LLM4SD is freely available and open source, “its adoption could span across different regions and scientific fields, particularly in academia, biotech, and pharmaceutical research,” says Zheng. The expectation is that the tool will be widely used by a range of scientists, including biologists to understand molecular interactions and biological pathways; pharmacologists to predict drug efficacy, safety, and pharmacokinetics; and medicinal chemists to design and optimize drug candidates based on predictive insights.
The output of LLM4SD is threefold: prediction results; scientific rules, which offer “insights similar to well-known frameworks like the Lipinski Rule of Five”; and descriptor importance, which highlights the molecular features that drive a prediction. Although adoption of LLM tools in drug discovery is still in its infancy, Zheng says, “LLM4SD’s strength lies in its interpretability.”
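For readers unfamiliar with the Lipinski Rule of Five, the sketch below shows the kind of human-readable rule it represents. It uses the commonly cited thresholds (molecular weight over 500, logP over 5, more than 5 hydrogen-bond donors, more than 10 hydrogen-bond acceptors, with at most one violation tolerated). The class and function names are invented for this example, the aspirin values are approximate, and the second molecule is made up.

```python
from dataclasses import dataclass

@dataclass
class Molecule:
    mol_weight: float       # daltons
    logp: float             # octanol-water partition coefficient
    h_bond_donors: int
    h_bond_acceptors: int

def lipinski_violations(m):
    """Return the names of the Rule of Five criteria the molecule breaks."""
    checks = {
        "mol_weight > 500": m.mol_weight > 500,
        "logp > 5": m.logp > 5,
        "h_bond_donors > 5": m.h_bond_donors > 5,
        "h_bond_acceptors > 10": m.h_bond_acceptors > 10,
    }
    return [name for name, violated in checks.items() if violated]

def passes_rule_of_five(m, max_violations=1):
    """A molecule is usually considered drug-like with at most one violation."""
    return len(lipinski_violations(m)) <= max_violations

aspirin_like = Molecule(180.2, 1.2, 1, 4)        # approximate aspirin values
large_lipophilic = Molecule(720.0, 6.3, 2, 8)    # made-up molecule

print(passes_rule_of_five(aspirin_like), lipinski_violations(aspirin_like))
print(passes_rule_of_five(large_lipophilic), lipinski_violations(large_lipophilic))
```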
Together with its predictive accuracy, this could well accelerate the acceptance of LLM4SD in research and development environments, he adds. The tool currently uses Galactica, which has a knowledge cutoff of 2022, but that limitation can be addressed by replacing the LLM with a more recent one “to incorporate the latest scientific discoveries and keep the tool aligned with current research.”
LLM4SD is not viewed as a marketable software product, says Webb. The next goal of the development team is to extend the research tool “to handle more diverse biological data, including DNA sequences and protein sequences, to broaden its predictive capabilities and make it even more useful for biological and drug discovery research.”
“The synergy between LLMs and human expertise has the potential to significantly boost research efficiency,” Zheng adds, which could accelerate scientific discovery and improve decision-making in drug development.