Meta CEO Mark Zuckerberg announced Tuesday that the tech giant's Artificial Intelligence (AI) research team has created a model that predicts protein folding up to 60 times faster than state-of-the-art models. Protein folding is the process in which a linear polypeptide (a chain of amino acids) folds into a three-dimensional structure in order to function correctly, or to become biologically active. Unfolded or misfolded proteins can lead to diseases. Zuckerberg claims that the ability of Meta's model to predict protein folding at this speed will unlock new ways to treat disease and accelerate drug discovery.
What is the ESM Metagenomic Atlas?
Meta AI has created the ESM (evolutionary scale modelling) Metagenomic Atlas, the first database that reveals the structures of the metagenomic world at the scale of hundreds of millions of proteins. Metagenomics refers both to the technique of sequencing DNA purified directly from a natural environment and to the research field that studies microorganisms in their natural state. It has been extremely beneficial in mapping the protein universe.
Metagenomics has helped provide a huge amount of protein sequence data. The technique is used to study the structure and function of entire nucleotide sequences isolated and analysed from organisms in a natural environment. In other words, metagenomics refers to the direct genetic analysis of genomes present in an environmental sample.
ESM Metagenomic Atlas is the first view of the 'dark matter' of the protein universe, Meta says in a blog post. It is an open atlas of 617 million metagenomic protein structures, and is believed to be a rival of AlphaFold, an AI system developed by Alphabet subsidiary DeepMind that predicts the three-dimensional structure of a protein from its amino acid sequence.
What are metagenomic proteins?
Metagenomic proteins are found in microbes in the soil, deep in the ocean, and even inside our bodies, and vastly outnumber the proteins that make up animal and plant life. However, metagenomic proteins are the least understood proteins on Earth.
The genetic material present in an environmental sample is referred to as a metagenome. It is important to decode metagenomic structures because they can help us solve long-standing mysteries of evolutionary history and discover proteins that may help cure diseases, clean up the environment, and produce cleaner energy.
How Meta’s new model is better than state-of-the-art models
To predict protein structures at a huge scale, a breakthrough in the speed of structure prediction is essential. Meta's AI team trained a large language model to learn evolutionary patterns and generate accurate structure predictions end-to-end, directly from the sequence of a protein.
The model predicts protein folding up to 60 times faster than the current state-of-the-art models while maintaining accuracy, Meta says. This makes the tech giant's approach scalable to far larger databases.
Meta has made its models and the ESM Metagenomic Atlas accessible to the public. It has also introduced an Application Programming Interface (API) that allows scientists to easily retrieve specific protein structures.
Proteins and their functions in biological systems
Proteins are dynamic, complex structures made of one or more long chains of amino acids, and are present in all living organisms. Encoded by genes, proteins are responsible for varied and fundamental processes of life, and have an outstanding range of roles in biology. Examples of proteins at work in biological systems include the light-sensing molecules in the rods and cones of human eyes, molecular sensors responsible for hearing and the sense of touch, complex molecular machines converting sunlight into chemical energy in plants, motors driving motion in microbes and muscles, enzymes breaking down plastic, antibodies protecting living beings from disease, and molecular circuits that cause disease when they fail.
Metagenomics and its relevance to the world
Metagenomics uses gene sequencing to discover proteins in samples from environments across Earth, including microbes living in the soil, in extreme environments such as hydrothermal vents, deep in the oceans, in our guts and on our skin. A vast number of proteins, beyond those catalogued in well-studied organisms, exist in the natural world.
Metagenomics is helping reveal the incredible diversity of these proteins, uncovering billions of protein sequences that are new to science and catalogued for the first time in large databases. These databases have been compiled by public initiatives such as the National Center for Biotechnology Information (NCBI), European Bioinformatics Institute and Joint Genome Institute.
Metagenomic structures will help accelerate the discovery of proteins for practical applications in fields such as medicine, green chemistry, environmental applications, and renewable energy.
Metagenomic proteins are the ‘dark matter’ of the protein universe
According to Meta's blog post, Eugene V Koonin, senior investigator at NCBI, said metagenomics is revealing a vast diversity of proteins, many of which are new to science and have the potential to shed light on some of the fundamental questions about life. Also, numerous novel proteins discovered via metagenomics are the 'dark matter' of the protein universe because their structures and biological roles are unknown.
It is important to have a map of these structures because it will help provide far-reaching insight into the world of protein structures and functions, Koonin added.
Meta’s new model can predict 3D structures of proteins at a fast rate
Meta has developed a new protein-folding approach that uses large language models to create the first comprehensive view of protein structures in a metagenomics database consisting of hundreds of millions of proteins. Meta's AI research team observed that language models can greatly increase the speed at which three-dimensional structures can be predicted at the atomic level.
Meta believes that this advance will usher in a new era of structural understanding. This could make it possible for the first time to understand the structure of the billions of proteins that gene-sequencing technology is cataloguing.
Meta's new model can predict structures for nearly the entire MGnify90 database, a public resource cataloguing metagenomic sequences. ESM Metagenomic Atlas is the largest database of high-resolution predicted structures, according to Meta. The company says the database is three times larger than any existing protein structure database, and the first to cover metagenomic proteins comprehensively.
Meta's new language model, which has 15 billion parameters, is the largest language model of proteins to date, the tech giant says in the blog post.
Meta’s model can help discover proteins for use in medicine
Due to advancements in gene sequencing, it has been possible to catalogue billions of metagenomic protein sequences. Although the sequences of metagenomic proteins have been discovered, understanding their biology is quite difficult. This is because determining the three-dimensional structures for hundreds of millions of proteins experimentally is far beyond the reach of laboratory techniques such as X-ray crystallography. It can take weeks to years to determine the three-dimensional structure of a single protein through experimental laboratory techniques. One can learn more about metagenomic proteins through computational approaches.
Using the ESM Metagenomic Atlas, scientists can analyse the structures of hundreds of millions of metagenomic proteins. In this way, they can identify structures that have not been characterised before, discover new proteins that can have potential applications in medicine, and search for distant evolutionary relationships.
How Meta’s model predicts protein structures
Proteins can be written as sequences of characters, where each character corresponds to one of the 20 standard amino acids, the chemical building blocks of proteins.
For a protein made of 500 amino acids, there are 20^500 possible sequences. This is more than the number of atoms in the visible universe. Every sequence folds into a three-dimensional shape according to the laws of physics, and it is this shape that determines the biological function of a protein.
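The size of this sequence space can be checked with a few lines of arithmetic. The sketch below is only an illustration: the figure of roughly 10^80 atoms in the visible universe is a commonly quoted order-of-magnitude estimate assumed here, not a number from Meta's blog post, and the constant and function names are made up for the example.

```python
import math

NUM_AMINO_ACIDS = 20         # standard amino acids
SEQUENCE_LENGTH = 500        # protein length used in the example above
ATOMS_IN_UNIVERSE_EXP = 80   # assumption: ~10^80 atoms, order of magnitude only


def sequence_space_log10(length: int, alphabet: int = NUM_AMINO_ACIDS) -> float:
    """Return log10 of the number of possible sequences of the given length."""
    return length * math.log10(alphabet)


exponent = sequence_space_log10(SEQUENCE_LENGTH)
digits = len(str(NUM_AMINO_ACIDS ** SEQUENCE_LENGTH))  # exact integer arithmetic

print(f"20^500 is about 10^{exponent:.1f}")
print(f"20^500 has {digits} decimal digits")
print(f"Larger than ~10^{ATOMS_IN_UNIVERSE_EXP} atoms: {exponent > ATOMS_IN_UNIVERSE_EXP}")
```

Under that assumption, the number of possible 500-residue sequences exceeds the number of atoms by hundreds of orders of magnitude, which is why exhaustive search over sequences is not an option.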
Protein sequences contain statistical patterns that convey information about the folded structure of a protein.
For instance, two positions in a protein could be co-evolving with each other. If a certain amino acid appears at one of the positions, and that amino acid is usually paired with a certain amino acid at the other position, it could be a signal that the two positions are interacting with each other in the folded structure.
Evolution chooses amino acids that fit together in the folded structure. Therefore, these patterns in protein sequences can provide important information about the structure of a protein.
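One classical way to quantify such a co-evolution signal is the mutual information between two positions across a set of related sequences. The sketch below is a simplified illustration and not Meta's method: ESM is described as learning these patterns with a language model rather than computing them explicitly. The toy alignment and the helper names are invented for the example.

```python
import math
from collections import Counter


def mutual_information(column_a, column_b):
    """Mutual information (in bits) between two columns of aligned sequences."""
    n = len(column_a)
    counts_a = Counter(column_a)
    counts_b = Counter(column_b)
    counts_ab = Counter(zip(column_a, column_b))
    mi = 0.0
    for (a, b), count in counts_ab.items():
        p_ab = count / n
        p_a = counts_a[a] / n
        p_b = counts_b[b] / n
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi


def column(alignment, index):
    """Extract one position from every sequence in the alignment."""
    return [sequence[index] for sequence in alignment]


# Made-up alignment: each row is a related sequence, each column a position.
# Positions 0 and 2 always change together (K with E, E with K).
alignment = ["KAEL", "KGEV", "EAKL", "EGKV", "KAEV", "EGKL"]

mi_coupled = mutual_information(column(alignment, 0), column(alignment, 2))
mi_uncoupled = mutual_information(column(alignment, 0), column(alignment, 1))

print(f"positions 0 and 2: {mi_coupled:.3f} bits")
print(f"positions 0 and 1: {mi_uncoupled:.3f} bits")
```

In this toy data the perfectly paired positions score much higher than the loosely related ones, which is the kind of signal that hints at contact in the folded structure.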
How Meta trained language models to predict protein properties
ESM uses AI to read these patterns. Meta, in 2019, presented evidence that language models learn the properties of proteins, such as their structure and function. The tech giant trained a language model on the sequences of millions of natural proteins, using self-supervised learning. During training, parts of a protein sequence are hidden and the model learns to correctly fill in the blanks. To do this well, the model has to pick up the patterns that reflect the structure and function of proteins.
Meta, in 2020, released ESM-1b, a state-of-the-art protein language model. It has helped scientists predict the evolution of Covid-19 and discover genetic causes of disease.
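The fill-in-the-blanks setup can be sketched as a function that builds one training example by hiding a few positions of a sequence. This is a minimal illustration of masked-token training data, not Meta's training code: the mask token, the 15 per cent masking rate, the example sequence and the function name are all assumptions made for the sketch.

```python
import random

MASK_TOKEN = "<mask>"  # assumption: placeholder token chosen for illustration


def mask_sequence(sequence, mask_fraction=0.15, seed=0):
    """Hide a random subset of positions; return masked tokens and the answers."""
    rng = random.Random(seed)  # seeded so the example is repeatable
    num_masked = max(1, round(len(sequence) * mask_fraction))
    positions = sorted(rng.sample(range(len(sequence)), num_masked))
    tokens = list(sequence)
    targets = {}
    for position in positions:
        targets[position] = tokens[position]  # what the model must predict
        tokens[position] = MASK_TOKEN         # what the model actually sees
    return tokens, targets


sequence = "MKTAYIAKQRQISFVKSHFSRQ"  # made-up 22-residue sequence
tokens, targets = mask_sequence(sequence)

print(" ".join(tokens))
print(targets)
```

A model trained on many such examples is rewarded only when it recovers the hidden amino acids, so no hand-labelled structures are needed at this stage.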
The tech giant has now scaled up this approach to create ESM-2, a next-generation protein language model. It has 15 billion parameters, and enables three-dimensional structure prediction at an atomic resolution.
Meta's new model speeds up structure prediction by up to 60 times, which is fast enough to make predictions for an entire metagenomics database in weeks. The tech giant says that the new model was able to predict structures for more than 600 million metagenomic proteins in just two weeks.
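A rough calculation shows what these figures imply. The sketch below takes the 617 million structures in the atlas and the two-week run reported by Meta, and assumes, purely for illustration, that a model 60 times slower would run on the same hardware at a constant rate. Since the speed-up is quoted as "up to" 60 times, the slower estimate is an upper bound.

```python
STRUCTURES = 617_000_000   # structures in the atlas, per Meta
RUN_DAYS = 14              # "just two weeks"
SPEEDUP = 60               # "up to 60 times" faster; treated as constant here
SECONDS_PER_DAY = 24 * 60 * 60

# Aggregate throughput across whatever hardware was used for the run
structures_per_second = STRUCTURES / (RUN_DAYS * SECONDS_PER_DAY)

# Assumption: same hardware, same workload, 60x slower model
baseline_days = RUN_DAYS * SPEEDUP
baseline_years = baseline_days / 365

print(f"Throughput: {structures_per_second:.0f} structures per second")
print(f"At 1/60th the speed: {baseline_days} days ({baseline_years:.1f} years)")
```

Under these assumptions the run averages roughly 510 structures per second, while the slower model would need about 840 days for the same database.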
Meta hopes that the ESM Metagenomic Atlas and fast protein folding models will lead to scientific progress and help the world make important discoveries.