DeepMind Releases Open Protein Structure Database Including Complete Human Proteome
By Allison Proffitt
July 22, 2021 | DeepMind and the European Molecular Biology Laboratory (EMBL) today unveiled the most complete and accurate database of predicted protein structure models for the human proteome. The database includes about 350,000 protein structures, and will be freely and openly available to the scientific community.
The AlphaFold Protein Structure Database was published today in Nature (DOI: 10.1038/s41586-021-03828-1).
“We believe that this represents the most significant contribution AI has made to advancing the state of human knowledge to date,” said Demis Hassabis, founder and CEO of DeepMind, in a press briefing yesterday.
Solving the CASP Challenge
In December 2020, the second version of DeepMind’s neural network-based model for predicting 3D protein shape, AlphaFold, was recognized by the organizers of the Critical Assessment of protein Structure Prediction (CASP) benchmark as a solution to the 50-year-old grand challenge of protein structure prediction: predicting a protein’s shape computationally from its amino acid sequence rather than determining it experimentally through years of painstaking, laborious, and often costly techniques.
The DeepMind team published both the methodology and the open-source code behind AlphaFold2 last week in Nature (DOI: 10.1038/s41586-021-03819-2).
“Underpinning the latest version of AlphaFold is a novel machine learning approach that incorporates physical and biological knowledge about protein structure, leveraging multi-sequence alignments, into the design of the deep learning algorithm,” the authors wrote in the July 12 paper.
Today they share the AlphaFold Protein Structure Database, making the fruit of their work, as well as that of generations of structural biologists and researchers who contributed to the Protein Data Bank, available to the scientific community. The AlphaFold Protein Structure Database includes predicted structures for all of the approximately 20,000 proteins expressed by the human genome as well as the proteomes of 20 other model organisms, including E. coli, fruit fly, mouse, zebrafish, malaria parasite and tuberculosis bacteria.
EMBL Assist
The AlphaFold Protein Structure Database was created in partnership with the European Molecular Biology Laboratory (EMBL). The database and artificial intelligence system provide structural biologists with powerful new tools for examining a protein’s three-dimensional structure and offer a treasure trove of data that could unlock future advances and herald a new era for AI-enabled biology.
“When Demis and the team from AlphaFold first presented their results in November at CASP, I almost fell off my chair in excitement and amazement that this long-standing problem of how proteins fold had been solved,” said Ewan Birney, EMBL Deputy Director General and EMBL-EBI Director, comparing the dataset’s value and promise to that of the first human genome. “But that amazement deepened when Demis and the team came and said, ‘We think the best way to make the most use of this information is to make it fully open. Can we work with you to make a database of the structure predictions for all proteins?’”
EMBL and DeepMind worked together to build the database and assign both local and global confidence metrics to the predictions so that users understand both the promise and limitations of the data.
AlphaFold is already being used by partners such as the Drugs for Neglected Diseases Initiative (DNDi), which has advanced its research into life-saving cures for diseases that disproportionately affect the poorer parts of the world, and the Centre for Enzyme Innovation (CEI), which is using AlphaFold to help engineer faster enzymes for recycling some of our most polluting single-use plastics. For those scientists who rely on experimental protein structure determination, AlphaFold’s predictions have helped accelerate their research. For example, a team at the University of Colorado Boulder is finding promise in using AlphaFold predictions to study antibiotic resistance, while a group at the University of California San Francisco has used them to increase their understanding of SARS-CoV-2 biology.
“This is a perfect example of the virtuous circle of open data,” said EMBL Director General Edith Heard during the briefing. “AlphaFold was trained thanks to the public data generated by the scientific community using a multitude of scientific technologies over the last 17 years, many of which are accessible through databases hosted at EMBL-EBI. So it’s only fitting that the AlphaFold predictions, built on the decades of data that came before, should be available to scientists all over the world.”
While the structure database will undoubtedly be useful for structural biologists, Birney said he’s most excited about how the resource will empower geneticists, genomicists, and cancer biologists. He predicts that the AlphaFold structures will be integrated into more and more tools and sources of information. “This is a great new scientific tool, which complements existing technologies, and will allow us to push the boundaries of our understanding of the world,” he said.
Hassabis said the database and system will be periodically updated as DeepMind continues to invest in future improvements to AlphaFold. He said that over the next few months, the team plans to vastly expand the coverage to almost every sequenced protein known to science: more than 100 million structures covering most of the UniProt reference database.