At Harvard, Scientists Partner with AI Using the Language of Proteins
By Deborah Borfitz
July 16, 2024 | Artificial intelligence (AI) is the new frontier in biomedical research, with large language models changing the way science is done, from writing code and brainstorming research ideas to helping conduct research and literature reviews. In some of its latest applications, generative AI has been used to design molecules, producing insights at a scale and speed not possible with traditional deep learning methods alone, according to Marinka Zitnik, Ph.D., assistant professor of biomedical informatics at Harvard Medical School.
The potential of AI to aid in the development of new drugs to cure disease is enormous given the size of the chemical universe (10⁶⁰ chemical compounds) relative to the tiny fraction (10⁵) that has been synthesized in the lab as drugs approved for use by the Food and Drug Administration (FDA), she says. To date, ten drug candidates have been primarily optimized by AI—meaning AI identified the therapeutic target, or the chemical compound was applied to a previously unknown target—and nine of them are now in clinical trials (Nature Medicine, DOI: 10.1038/s41591-023-02361-0).
Zitnik highlights a fully AI-designed Insilico Medicine drug currently in phase 2 clinical trials for idiopathic pulmonary fibrosis. It’s a small molecule inhibitor that modulates the identified protein target.
“Currently, no fully AI-discovered drugs have been approved by the FDA, but that is in part because the process of developing new drugs from scratch can take more than a decade and is generally associated with very high costs,” Zitnik continues. This is where her work at Harvard comes in: it focuses on developing foundational models that enable AI to “learn on its own” and thereby make research more efficient.
Providing labels is a highly resource-intensive exercise that typically requires the expertise of scientists to identify cell types and possibly also the running of biological experiments or assays, she says. Without self-supervised learning to explore and characterize data, it is much harder to build predictive models—which is why no one had heretofore trained them on hundreds of millions of data points.
With self-supervised learning, no external labels are needed to identify objects in raw data for contextualization purposes because the data itself supervises the learning, Zitnik explains. The self-learning PINNACLE model for single-cell biology developed in her lab supports therapeutic research by tailoring its outputs to the biological contexts in which it operates.
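As a rough illustration of that idea—the data supervising its own learning—the following sketch masks random values in an unlabeled toy expression matrix and trains a small network to recover them. The data, mask rate, and network are placeholders for illustration, not PINNACLE’s actual architecture.

```python
# Minimal sketch of self-supervised (masked) learning: the data supervises itself.
# Toy data, mask rate, and network are illustrative assumptions, not PINNACLE's design.
import torch
import torch.nn as nn

torch.manual_seed(0)
n_cells, n_genes = 512, 100
expression = torch.rand(n_cells, n_genes)           # stand-in for unlabeled single-cell data

model = nn.Sequential(nn.Linear(n_genes, 64), nn.ReLU(), nn.Linear(64, n_genes))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(200):
    mask = torch.rand_like(expression) < 0.15       # hide 15% of values at random
    corrupted = expression.masked_fill(mask, 0.0)   # the model only sees the corrupted input
    recon = model(corrupted)
    loss = ((recon - expression)[mask] ** 2).mean() # learn by predicting the hidden values back
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```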
Finding Targets
AI is already being used in every step of the scientific discovery process, says Zitnik. That includes acquiring data and taking measurements at scale, helping generate hypotheses, and assisting scientists with the design of experiments and large-scale simulations.
When it comes to drug discovery, the process always starts with human knowledge, such as the genetic elements implicated in a disease or the disease phenotype, she says. Multi-modal graph learning models are then developed to leverage understanding of the condition and “predict and nominate” disease elements that could serve as potential protein targets to modulate disease effects.
Once targets are identified, researchers try to identify the best chemical compounds for the job using generative geometric models at the level of atoms, continues Zitnik. They next compile priority lists of molecules, based on their structural shape and their effect on the protein target, to modulate disease in patient populations in the relevant cell-type context. This leads to predictions that can then be tested experimentally in animals, with the longer-term goal of matching therapies to patient benefits.
Predicting and optimizing a protein target needs to happen in context, she says, drawing a parallel with the polysemic word apple, whose meaning is resolved via the context of surrounding words. Just as one can “grow an apple” or “buy an apple,” H2AFX encodes a pleiotropic protein whose function is resolved via cell context—“particularly this protein put in an environment where a drug will operate.” H2AFX will activate homologous recombination or end-joining pathways, respectively, when put in an environment with the cancer drugs olaparib or doxorubicin.
The PINNACLE model dynamically adjusts its outputs to the biological context. It leverages both natural language processing and advances in transformers, a type of AI that can learn context and track relationships between input and output sequence components. Up until a few years ago, it was not possible to consider a given protein together with its surrounding environment in order to identify its many potential roles, Zitnik notes.
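The word-sense analogy can be made concrete with a small sketch: in a transformer, the same token receives a different representation depending on its neighbors. The tiny, untrained encoder and vocabulary below are assumptions for illustration only.

```python
# Minimal sketch: the same token gets different representations in different contexts.
# Tiny untrained transformer; vocabulary and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab = {"<pad>": 0, "grow": 1, "buy": 2, "an": 3, "apple": 4}
embed = nn.Embedding(len(vocab), 32)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=32, nhead=4, batch_first=True), num_layers=2
)

def contextual_vectors(tokens):
    ids = torch.tensor([[vocab[t] for t in tokens]])
    return encoder(embed(ids))[0]                           # one vector per token, shaped by its neighbors

grow = contextual_vectors(["grow", "an", "apple"])[-1]      # "apple" after "grow"
buy = contextual_vectors(["buy", "an", "apple"])[-1]        # "apple" after "buy"
print(torch.cosine_similarity(grow, buy, dim=0))            # below 1: context changes the vector
```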
Biological Contexts
“Providing outputs tailored to biological contexts is essential for broad use of foundation models,” she says. The applications include tasks such as enhancing 3D structural representations of therapeutically relevant interactions, studying the effects of drugs across cell-type contexts, nominating therapeutic targets in a cell-type-specific manner, and “zero-shot retrieval” of tissue hierarchy whereby the model can make predictions without using any relevant training data.
The building of virtual cell models is on the bigger research agenda, says Zitnik. For example, PINNACLE can use protein networks and single-cell transcriptomic data to learn the construction of the cell-type-specific protein-protein interaction network and use that as a starting point for self-supervised geometric deep learning to produce protein representations tailored to cell type contexts.
The biological context in her illustration comes from a multi-organ single-cell transcriptomic atlas of humans, in which 15 donors provided 177 cell types. “We know how those cells relate to each other based on what tissues they were sampled from and how tissues are related to each other,” she says.
Link predictions are made from these complex protein networks spanning many cell and tissue types. In Zitnik’s example, which included 156 cell type contexts across 62 tissues of varying hierarchical scales, half a million protein latent representations were contextualized.
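A minimal sketch of that link-prediction step is shown below: a candidate protein-protein edge is scored from contextualized embeddings within a given cell-type context. The embeddings here are random placeholders, not representations learned by a geometric model, and the protein and cell-type names are illustrative.

```python
# Sketch of link prediction from contextualized protein embeddings: score a candidate
# protein-protein edge in a given cell-type context. Embeddings are random placeholders
# standing in for representations learned by a geometric deep learning model.
import torch

torch.manual_seed(0)
cell_types = ["cardiomyocyte", "microglia"]
proteins = ["H2AFX", "BRCA1", "APOE"]

# One embedding per (protein, cell-type context) pair, as in a contextualized model.
emb = {(p, c): torch.randn(64) for p in proteins for c in cell_types}

def link_score(p1: str, p2: str, context: str) -> float:
    """Probability-like score that p1 and p2 interact in this cell-type context."""
    return torch.sigmoid(emb[(p1, context)] @ emb[(p2, context)]).item()

for c in cell_types:
    print(c, round(link_score("H2AFX", "BRCA1", c), 3))
```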
The data might be used to identify changes at the single-cell level predictive of patient phenotypes, for example in Alzheimer’s disease, she continues. Transfer learning across these cellular contexts would enable predictions about whether candidate drugs would affect disease-relevant cell types in terms of functions such as synaptic signaling and lipid metabolism in the brain. Drug targets could thereby be nominated in a cell-type-specific manner.
Similarly, for an immunological disease like rheumatoid arthritis, researchers developing targeted therapies have used PINNACLE to identify the cell types that play a role in synovial tissue by looking at the biological context of known targets and existing therapies, says Zitnik. In this way, cell-specific drug targets were identified in epithelial, endothelial, stromal, immune, immune-stromal, stromal-epithelial, and germ line contexts.
A great deal of resources and effort has gone toward development of single-cell atlases over the last decade, she says, with over 80 million cells being catalogued. “We are now at the same point as the Human Genome Project [upon its completion 20 years ago]... the question becomes how do we leverage those single-cell datasets [using AI] to inform medical research?”
Protein Language Models
Proteins are “the workhorse molecules of life,” says Zitnik, and often directly linked to the effects of drugs. Understanding them is consequently vital to biology, including therapeutic development. Therefore, more comprehensive and functionally insightful models of proteins are being developed, catalyzed by research with foundation models.
The “big idea” here is to train a machine learning model on a massive, unlabeled dataset that can support a broad range of predictive tasks, she says. Among the variety of foundation models currently out there are GPT, Gemini, Llama, BERT, and CLIP, all better suited for “narrow, specialized tasks” such as predicting the most suitable next word or generating images.
Success is driven by data, hardware, self-supervised learning, and transformer neural architectures, continues Zitnik. Developing protein language models involves training on protein amino acid sequences and other biological parameters using the same concepts from natural language processing, such as auto-completion.
The objective is to teach the model to independently predict the identity of amino acids in a sequence. Unlike large language models, where the control tags are things like politics, sports, or a one-star or five-star review, the tags for protein models are instead different enzymes and antibodies such as immunoglobulin, chorismate mutase, glycosaminidase, and phage lysozyme.
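A hedged sketch of that training objective—prepend a control tag, hide some amino acids, and ask the model to recover them—is shown below. The tag set, tokenization, and tiny encoder are illustrative assumptions, not the actual model.

```python
# Sketch of masked amino-acid prediction with a control tag prepended to the sequence.
# Tag set, tokenization, and model size are illustrative assumptions.
import torch
import torch.nn as nn

AAS = list("ACDEFGHIKLMNPQRSTVWY")
TAGS = ["<phage_lysozyme>", "<chorismate_mutase>", "<immunoglobulin>"]
vocab = {tok: i for i, tok in enumerate(["<mask>"] + TAGS + AAS)}

embed = nn.Embedding(len(vocab), 64)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True), num_layers=2
)
head = nn.Linear(64, len(vocab))
loss_fn = nn.CrossEntropyLoss()

seq = list("MKAIFVLKGSLDRD")                        # toy protein fragment
tokens = ["<phage_lysozyme>"] + seq                 # control tag conditions the model
ids = torch.tensor([[vocab[t] for t in tokens]])

mask_pos = torch.tensor([3, 7, 11])                 # positions to hide (skipping the tag)
targets = ids[0, mask_pos].clone()
masked = ids.clone()
masked[0, mask_pos] = vocab["<mask>"]

logits = head(encoder(embed(masked)))               # (1, seq_len, vocab)
loss = loss_fn(logits[0, mask_pos], targets)        # recover the hidden amino acids
loss.backward()
print(float(loss))
```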
PINNACLE addresses three key knowledge gaps in today’s protein language models in terms of multimodality, flexibility, and generalizability, Zitnik says. “Many models incorporate only sequence information, leaving other data types unexplored.” It is also challenging to generalize existing models to new tasks when protein annotations are scarce. Additionally, “models cannot easily follow human instructions.”
In addition to incorporating amino acid and molecular phenotype data, PINNACLE easily interacts with scientists from diverse backgrounds and levels of expertise, she says. It can also “generalize to new phenotypes in a zero-shot manner.”
Among its unique capabilities are prioritization of protein and peptide sequences, peptide-ligand binding affinity prediction, functional annotation and retrieval of candidate mechanisms of action, and captioning and text generation describing peptide and amino acid sequences, says Zitnik. Examples include predicting whether a certain sequence plays a role in the manifestation of obesity, prioritizing proteins that functionally bind to transition metal ions and are involved in any biological process in which a relatively long-lasting adaptive behavioral change occurs, and free-response text generation describing the mechanism of action of acetylsalicylic acid (aspirin) on a sequence.
“This model allows us to do... zero-shot prediction, which is identifying proteins that are likely to have function for a phenotype even if there are no known proteins that are recorded as [such] in knowledge bases,” she points out. Multimodal models combining language modeling with 3D structures can also produce a more “functionally insightful” view of proteins.
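One common way such zero-shot retrieval can work is a shared text-protein embedding space: a phenotype description and candidate protein sequences are embedded and ranked by similarity, with no phenotype-specific labels. In the sketch below, both encoders are random stand-ins for trained models and the protein names and sequences are toy placeholders.

```python
# Sketch of zero-shot retrieval: rank proteins for a phenotype described in free text
# by similarity in a shared embedding space (CLIP-style). Both encoders are random
# placeholders for trained text and protein encoders; sequences are toy placeholders.
import torch

def encode_text(description: str) -> torch.Tensor:
    torch.manual_seed(abs(hash(description)) % 2**31)   # placeholder "encoder"
    return torch.nn.functional.normalize(torch.randn(128), dim=0)

def encode_protein(sequence: str) -> torch.Tensor:
    torch.manual_seed(abs(hash(sequence)) % 2**31)      # placeholder "encoder"
    return torch.nn.functional.normalize(torch.randn(128), dim=0)

query = encode_text("protein implicated in the manifestation of obesity")
proteins = {"proteinA": "MKAIFVLKGS", "proteinB": "ACDEFGHIKL", "proteinC": "MNPQRSTVWY"}

scores = {name: float(query @ encode_protein(seq)) for name, seq in proteins.items()}
for name, s in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(name, round(s, 3))
```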
Sequence-structure co-generation of biomolecular interactions can create “high-fidelity protein pockets,” she says, meaning areas where proteins interact with ligand molecules. A version of the protein language model called PocketGen generates residue sequence and full-atom structure within these protein pocket regions.
PocketGen supports the design of novel antibodies and enzymes as well as biosensors, says Zitnik. It can generate protein pockets with higher binding affinity and structural validity than existing models, the comparators here being PocketOpt, DEPACT, DyMEAN, FAIR, RFDiffusion, and RFDiffusionAA.
System of Models
Recent work in Zitnik’s group has focused on the evolving use of data-driven models for biomedical research. Databases and search engines have enabled great strides over the last few decades, such as patient outcome predictions using microarray gene expression analysis and the DeepBind tool for predicting the sequence specificities of DNA- and RNA-binding proteins. But those tools are not well suited to learning the underpinnings of protein structure and function at scale because they were designed for specialized tasks and trained on “clear and specified” datasets.
Broader and more interactive foundation models have since emerged that can be used for a range of different tasks, do not require scientists to have any coding experience, and allow them to provide feedback that gets integrated back into the model, Zitnik says. “Increasingly, we’re entering this stage of building... a system of models that interact with each other.”
In this scenario, AI agent systems take a more “creative role” in the scientific process, although they do not replace the creativity of human scientists who develop the actual hypotheses, she says.
With the launch of the Nobel Turing Challenge a few years ago, the ambitious goal is now “creation of highly autonomous systems with the potential to make Nobel-worthy discoveries,” says Zitnik. Large language models bring biomedical science closer to such systems.
“Even a single large language model can exhibit a broad range of capabilities [and] conversations between differently configured agents can help combine... capabilities in a modular and complementary manner,” Zitnik shares. “These models have illustrated to some extent that they can solve complex tasks by breaking [them] into simpler subtasks.” AI agents can also cooperate and form chats—potentially, with humans, scientific experimental platforms, and other tools.
Current AI models have their limitations, Zitnik continues. They are specific to different scientific experiments (involving proteins, small molecules, peptides, etc.) and must be integrated by humans for prediction tasks. The methods are also limited to low-throughput experiments. AlphaFold, she notes, draws from the knowledge of scientists “who incorporate established inductive biases [e.g., valid protein folding angles and van der Waals forces attracting neutral molecules to one another]” since the models themselves do not automatically identify them.
With conventional approaches to foundation models, she says, “we may fail to develop AI systems that can generate novel hypotheses” because such novelty would not have been encountered in the data used to train the models. Generating novel experiments requires creativity, whereas generating novel text requires semantic and syntactic conformance, and only the latter aligns well with next-token prediction—the technique within large language models of predicting what will come next in a sequence of data and then immediately checking whether the prediction is correct.
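For readers who want to see what that objective looks like in practice, here is a minimal sketch of next-token prediction: shift a token sequence by one position, predict each next token, and score the prediction against the true one. The toy vocabulary, data, and recurrent stand-in model are assumptions for illustration.

```python
# Minimal sketch of next-token prediction: predict token t+1 from tokens up to t,
# then score the prediction against the true next token. Toy sizes and data.
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, d_model, seq_len = 50, 32, 16

embed = nn.Embedding(vocab_size, d_model)
lstm = nn.LSTM(d_model, d_model, batch_first=True)    # stand-in for a transformer decoder
head = nn.Linear(d_model, vocab_size)
loss_fn = nn.CrossEntropyLoss()

tokens = torch.randint(0, vocab_size, (1, seq_len))   # toy token sequence
inputs, targets = tokens[:, :-1], tokens[:, 1:]       # shift by one position

hidden, _ = lstm(embed(inputs))
logits = head(hidden)                                 # prediction for each next token
loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()
print(float(loss))                                    # low loss means predictions match targets
```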
‘AI Scientists’
To have AI systems that can enable “novel, creative scientific research,” says Zitnik, “we need to have the ability to not only know information but generate novel creative hypotheses that are not naturally straightforward but... should be grounded in real scientific research and follow from existing scientific research.”
To provide the missing components, Zitnik’s group built a multi-agent system for single-cell biology called PINNACLE Prism, in which each AI agent is powered by its own large language model. As she describes it, the starting point is a task-identification agent, informed by the intent of a human scientist, which transmits the information to another agent that goes to the web (e.g., the Open Targets database) to extract all known evidence from studies of the disease. A third agent writes code and runs it through the pipeline, based on the task to be solved, and the output gets translated into a hypothesis by a fourth agent and refined over several iterations involving yet more agents (e.g., reasoning and debugger agents).
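Schematically, that agent relay could be sketched as below; the call_llm stub and the agent functions are hypothetical placeholders for illustration, not the system’s actual components or API.

```python
# Schematic sketch of the agent relay described above. `call_llm` is a hypothetical
# placeholder for any LLM backend; retrieval, execution, and refinement are stubbed.
def call_llm(role: str, prompt: str) -> str:
    return f"[{role} output for: {prompt[:40]}...]"     # placeholder, not a real model call

def task_agent(user_intent: str) -> str:
    return call_llm("task-identification", user_intent)

def evidence_agent(task: str) -> str:
    # A real agent would query sources such as the Open Targets database here.
    return call_llm("evidence-retrieval", task)

def coding_agent(task: str, evidence: str) -> str:
    code = call_llm("code-writing", task + evidence)
    return code                                          # a real system would execute this code

def hypothesis_agent(analysis_output: str) -> str:
    return call_llm("hypothesis", analysis_output)

def refine(hypothesis: str, rounds: int = 3) -> str:
    for _ in range(rounds):                              # reasoning/debugger agents iterate
        hypothesis = call_llm("reasoning-refinement", hypothesis)
    return hypothesis

task = task_agent("Which cell types should we target in idiopathic pulmonary fibrosis?")
evidence = evidence_agent(task)
output = coding_agent(task, evidence)
print(refine(hypothesis_agent(output)))
```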
The entire exercise takes about two minutes to complete, most of that time spent regenerating the code a few times to make sure it is correct, says Zitnik. This is exciting, she adds, because it otherwise takes a week simply to gather all the data. A paper describing the envisioned “AI scientists” capable of skeptical learning and reasoning is currently under review.
The process could theoretically be repeated for hundreds of diseases in a matter of one day, says Zitnik. It is just an illustration of where she and her team see the future going—human scientists asking research questions and their AI partners doing the work of retrieving, processing, and prioritizing data to generate new insights empowering medical research. The predictions these models are making “have to be carefully curated and validated to lead to discovery.”
Among the positive opportunities AI is enabling is greater attention to rare diseases that are currently understudied due to the lack of financial incentives in the private sector, she notes. Only 5% of rare diseases have any FDA-approved drug treatment because of the small number of patients who are afflicted with any one condition.
Currently, foundation models are being designed around major disease areas where curated data repositories exist for training AI models, says Zitnik. But since models can now transfer zero-shot from one task to another, the “hope and expectation” is that models trained on large databases for Alzheimer’s, Parkinson’s, and cancer will increasingly be adapted for small sporadic subtypes of those diseases.