NVIDIA Highlights AI, Large Language Model Advances in Life Sciences
By Allison Proffitt
January 20, 2023 | NVIDIA is the only semiconductor manufacturer ever to present at the J.P. Morgan Healthcare Conference, said Harlan Sur, J.P. Morgan’s semiconductor analyst, in his introduction of Kimberly Powell, NVIDIA’s VP of healthcare. This year, NVIDIA’s fourth presenting at the conference, Powell emphasized why NVIDIA is a crucial part of the healthcare ecosystem.
NVIDIA launched their AI computing platform for healthcare, Clara, in 2018, she said. “We recognize that healthcare is becoming the absolute largest data-generating industry, and we have global challenges in the increasing cost of healthcare delivery and access to healthcare. So we build computing platforms to serve these grand challenges,” Powell said to the J.P. Morgan audience.
Healthcare is becoming software-defined, Powell argued, and needs two connected computing platforms to deliver value through software: a platform for AI development with training and models, and a platform for AI deployment with edge-case data gathering. This is the ‘as-a-service’ architecture, Powell explained, and NVIDIA is building end-to-end platforms to meet this need.
MONAI is NVIDIA’s AI development platform for imaging and robotics, co-developed with the industry and open sourced. Holoscan is NVIDIA’s AI deployment platform, a commercial, off-the-shelf platform to house the various industry applications. “Not every medical device needs to reinvent their computing platform every time new sensor technology comes to market,” Powell said. Holoscan developer kits are on the market now. (Powell discussed both of these in depth at NVIDIA’s 2021 GTC event.)
Shifting Biology to Engineering
But 2022, Powell said, was “an absolute breakout year for NVIDIA-accelerated genomics.” NVIDIA is partnering across the genomics industry. Powell highlighted partnerships with Oxford Nanopore and Stanford to break a world record for clinical sequencing, with the Broad Institute to make Clara Parabricks free to researchers on Broad’s Terra platform, and—this month—with Bionano Genomics to accelerate optical genomic mapping workflows.
But this is just the beginning, Powell said.
“By reducing the cost, decreasing the speed, and partnering with the clinical community, we can bring the condition to move sequencing more into the standard of care, and we’re really excited to do that across the board with all of our sequencing partners.”
Generative AI and large language models (LLMs) are taking the AI world by storm, Powell said, with ChatGPT, launched by OpenAI in November 2022, being the most recent, flashy example. “These models are trained on extremely large, unlabeled datasets and they can learn context and meaning by tracking relationships in sequential data,” Powell explained. “Sounds very much like genomics if you ask me,” she quipped.
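To make the comparison between text and genomes concrete, here is a minimal Python sketch of how a DNA sequence can be treated as sequential tokens. It is an illustration only: the function names, the choice of overlapping 3-letter tokens, and the toy sequence are assumptions made for this example and are not taken from Powell's talk or from any NVIDIA software. Counting which token follows which is a crude stand-in for the contextual relationships a real language model learns.

```python
from collections import Counter


def kmer_tokens(sequence: str, k: int = 3) -> list[str]:
    """Split a DNA string into overlapping k-mers (hypothetical tokenizer)."""
    seq = sequence.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def next_token_counts(tokens: list[str]) -> dict[str, Counter]:
    """Count which token follows each token in the sequence."""
    counts: dict[str, Counter] = {}
    for current, following in zip(tokens, tokens[1:]):
        counts.setdefault(current, Counter())[following] += 1
    return counts


# Toy sequence, invented for illustration.
tokens = kmer_tokens("ATGCGATGCA", k=3)
counts = next_token_counts(tokens)
print(tokens)
print(counts["ATG"].most_common(1))
```

A real genomic language model replaces these simple counts with a neural network trained on millions of sequences, but the input is the same kind of ordered token stream.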
Powell highlighted three recent advances that she calls breakthroughs in the space. NVIDIA contributed to genomic language models and generative AI models for protein engineering.
Predicting Virus Evolution
With Argonne National Laboratory and the University of Chicago, NVIDIA developed GenSLMs, the world’s largest biological language model able to predict virus evolution. (bioRxiv DOI: 10.1101/2022.10.10.511571)
The team started with 110 million bacteria sequences and fine-tuned the model with 1.5 million SARS-CoV-2 genomes, Powell said. Classification tasks included the prediction of enhancer and promoter sequences and transcription factor binding sites.
“The model was able to not only predict the evolution of the virus—so could potentially be useful for an early warning system—but it was also able to accurately identify variants of concern,” Powell said.
The authors of the paper write: “GenSLM is a foundation model for biological sequence data and opens up avenues for building hierarchical AI models for several biological applications, including protein annotation workflows, metagenome reconstruction, protein engineering, and biological pathway design.”
Large Language Models for Biology
For genomic language models, NVIDIA worked with InstaDeep and the Technical University of Munich to find a model that allows for accurate molecular phenotype prediction. (During the week, InstaDeep announced its acquisition by BioNTech.)
The team used Cambridge-1, an NVIDIA supercomputer, to train a collection of large language models (LLMs), from 500M to 2.5B parameters, integrating information from 3,202 diverse human genomes, as well as 850 genomes from a wide range of species, including model and non-model organisms. Nucleotide Transformer, the highest performing model, achieved state-of-the-art results on 15 of the 18 prediction tasks, Powell said, proving its ability to generalize across many tasks (bioRxiv DOI: 10.1101/2023.01.11.523679).
Some of the takeaways she highlighted from the work: multi-species data was super important and the largest language model sizes performed the best. “That’s why NVIDIA is here,” Powell said, “to enable that to happen.” She said NVIDIA plans to make some of the models available to the community in the coming weeks.
Generative AI for New, Functional Proteins
Generative AI, Powell said, is poised for great applicability across life sciences, but especially in protein engineering. BioNeMo, a product NVIDIA announced at its GTC event last September, seeks to make it easier and more efficient to use generative AI and large language models across the drug discovery process. BioNeMo is currently in early access, but Powell announced a win in their work with Evozyne (with a paper forthcoming).
Evozyne and NVIDIA have built a generative model for protein engineering called ProT-VAE, a protein transformer variational autoencoder, using ProtT5, a pre-trained transformer model within BioNeMo, and Evozyne’s VAE encoder. Starting from sequence alone, Evozyne generated proteins that it was able to synthesize and validate experimentally in the lab.
“The beauty of this is they used BioNeMo to fine-tune that model for families of proteins. You can fine-tune it for a set of proteins that has the given properties, function, and characteristics that you want, and it can generate a library of those,” Powell explained.
By way of example, she highlighted human PAH, hPAH, a protein precursor to pigments, hormones, neurotransmitters, and more. The ProT-VAE model was trained on a PAH protein family and generated proteins with various mutations leading to enhanced function of the protein. Evozyne was able to synthesize and validate those proteins in the lab.
“This is the promise of these large language models: the ability to explore way outside the space,” Powell said. “We’re going to extend what is today the common use of directed evolution and extend it into machine-guided directed evolution. It’s really an accelerator to be able to discover new proteins.”