ESM-2 (evolutionary-scale prediction of atomic level protein structure with a language model)

✏️ I write these paper ‘summaries’ to help myself clearly understand a paper by summarizing it and synthesizing the related literature; I hope they might be helpful to some readers too. If you have any feedback, please write to me at helloramith.fyi or comment below.

Highlights of the ESM-2 Paper

💡• Train protein language models up to 15B parameters

• Infer structure directly from the primary sequence using the language model

• The LM leverages evolutionary patterns captured during pretraining to produce atomic-level predictions

• An order of magnitude faster (up to 60x) at high-resolution structure prediction

• Present the ESM Metagenomic Atlas (structural characterization of more than 617 million metagenomic proteins); the largest LM for proteins to date

225M of them are high confidence predictions. (https://esmatlas.com/)

“Metagenomics is a term used to describe both a technique of sequencing of DNA purified directly from a natural environment and the research field focusing on studying microbial communities in their natural state” (excerpt from Godzik 2011)

1. Introduction

1.1 - Structure and Function is Hidden in Sequences.

Biological properties of proteins influence which positions in their sequences can undergo mutations. Based on these observations, we can characterize evolutionary processes such as coevolution and conservation of amino acids, and from these we can infer properties of a protein’s function and structure.

In essence, information about structure and function of proteins is hidden in sequences.

Usually we rely on aligning sequences before we can draw conclusions about function and structure. Computing this intermediate representation, known as a multiple sequence alignment (MSA), is expensive because we have to 1) search for related sequences first, and 2) align them.

The search step in the pipelines of AlphaFold and RoseTTAFold can take more than 10 minutes.

What if we could get rid of this intermediate representation? That’s one thing this paper accomplishes.

1.2 - Large language models (LLMs)

Historically, LMs were pretrained with objectives such as predicting the next word in a sentence. But Devlin et al. [BERT] showed that masking some words in the input and predicting them (the “masked language model” objective, MLM) is a better pretraining strategy.

“unlike left-to-right language model pre-training, the MLM objective enables the representation to fuse the left and the right context, which allows us to pretrain a deep bidirectional Transformer”

1.3 - Contributions

Inspired by this widely adopted strategy, the authors hypothesize that filling in missing amino acids may force the model to learn representations rich enough to infer structure. Thus, they scale protein language models from 8 million up to 15 billion parameters, and find that as scale increases, structural information emerges in the learned representations.

Because of this one-to-two-orders-of-magnitude speed improvement and the fact that no MSA is needed, they expand structure prediction to metagenomic proteins, which are far greater in number and diversity. In summary, they characterize the structures of more than 617 million metagenomic sequences, which

took 2 weeks on 2000 GPUs 🤯

2. Method

2.1 - How does structure emerge from an LM trained on sequences?

The ESM-2 language model is trained on 65 million unique sequences. Because of the MLM objective, we ask the model to predict missing pieces (amino acids) of a sequence using the neighbouring amino-acid context; the assumption is that the model therefore needs to learn inter-dependencies between amino acids. Previous work [1] and [2] showed that transformer models trained with MLM on protein sequences develop attention patterns that correspond to the residue-residue contact map.

Sampled with even weighting across ~43 million UniRef50 training clusters. I will write a separate article about that paper.
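To make the objective concrete, here is a minimal sketch of BERT-style masked-language-model training on a protein sequence. This is not the authors’ training code; the 15% masking rate follows the BERT convention, and the `model` interface and token indices are assumptions.

```python
import torch
import torch.nn.functional as F

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # 20 standard residues, indices 0..19
MASK_IDX = len(AMINO_ACIDS)            # extra index for the <mask> token

def mlm_loss(model, seq_tokens, mask_rate=0.15):
    """Masked-LM sketch: hide some residues, predict them from the surrounding context.

    seq_tokens: LongTensor of residue indices, shape (length,).
    """
    corrupted = seq_tokens.clone()
    mask = torch.rand(seq_tokens.shape) < mask_rate   # choose positions to mask at random
    corrupted[mask] = MASK_IDX
    logits = model(corrupted)                         # (length, 20), hypothetical interface
    # the loss is computed only on the masked positions
    return F.cross_entropy(logits[mask], seq_tokens[mask])
```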

After training the LM, to compute a contact map from the attention patterns, the authors use the approach of [2], where a logistic regression is fit to identify contacts from the attention maps, as follows.

Figure 1: (Source: Rao, Roshan, et al. “Transformer protein language models are unsupervised structure learners.” bioRxiv (2020))
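A rough sketch of that recipe is below, assuming attention maps of shape (layers, heads, L, L). The arrays `train_attn`, `train_contacts`, and `query_attn` are hypothetical placeholders, and the exact regularization settings used by Rao et al. may differ.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def apc(x):
    """Average product correction, commonly applied to attention/coupling maps."""
    row = x.sum(-1, keepdims=True)
    col = x.sum(-2, keepdims=True)
    return x - row * col / x.sum((-1, -2), keepdims=True)

def attention_features(attn):
    """attn: (layers, heads, L, L) attention maps -> one feature vector per residue pair."""
    sym = attn + attn.transpose(0, 1, 3, 2)       # symmetrize over (i, j)
    corrected = apc(sym)
    n_maps = corrected.shape[0] * corrected.shape[1]
    return corrected.reshape(n_maps, -1).T        # shape (L*L, layers*heads)

# Fit a sparse logistic regression on proteins with known contacts,
# then reuse it to turn a new protein's attention maps into a contact map.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.15)
clf.fit(attention_features(train_attn), train_contacts.reshape(-1))

L = query_attn.shape[-1]
pred_contact_map = clf.predict_proba(attention_features(query_attn))[:, 1].reshape(L, L)
```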

2.2 - Ok, what about atomic-level structure? (Enter ESMFold)

While the contact map is extracted from the attention maps, extracting spatial coordinates of the atoms requires an equivariant transformer: the structure module introduced in AlphaFold. This equivariant transformer makes it possible to project out the spatial coordinates of the atoms from the internal language-model representation alone. The resulting architecture is referred to in the paper as ESMFold.

Steps in ESMFold (a conceptual sketch of this loop follows the list):

  1. Pass the representation learnt by ESM-2 through a series of folding blocks; each block sequentially updates a sequence representation and a pairwise representation.
  2. Pass the output to the structure module to obtain atomic coordinates.
  3. Repeat with 3 steps of recycling (view code).
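The sketch below only illustrates the control flow of those steps. Here `esm2`, `fold_blocks`, and `structure_module` are hypothetical callables, the pairwise width (128) is an assumption, and the real recycling also re-embeds previous states and coordinates into the next cycle.

```python
import torch

def esmfold_sketch(esm2, fold_blocks, structure_module, seq_tokens, num_recycles=3):
    """Conceptual sketch of the ESMFold steps above; interfaces and shapes are assumed."""
    lm_repr = esm2(seq_tokens)             # step 1: per-residue ESM-2 representation, (L, d)
    L = lm_repr.shape[0]
    seq_state = lm_repr                    # sequence ("single") representation
    pair_state = torch.zeros(L, L, 128)    # pairwise representation (width assumed)
    coords = None
    for _ in range(num_recycles + 1):      # step 3: initial pass plus 3 recycling steps
        for block in fold_blocks:          # step 1a: each block sequentially updates
            seq_state, pair_state = block(seq_state, pair_state)   # both representations
        coords = structure_module(seq_state, pair_state)           # step 2: 3D coordinates
        # (the real model feeds previous states/coordinates back into the next cycle)
    return coords
```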

Figure 3: (ESMFold architecture; image credit: Lin et al.)

Training: To train the model to output spatial coordinates, they use experimentally determined structures from the PDB (~25K clusters covering a total of ~325K structures).
This is augmented with 12M structures predicted with AlphaFold2.

During training, predicted structures from AlphaFold2 are sampled 75% of the time and real (experimental) structures 25% of the time. Evaluation: 194 CAMEO proteins and 51 CASP14 proteins.
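As a toy illustration of that sampling mix (the function and argument names here are mine, not from the released code):

```python
import random

def sample_training_structure(pdb_structures, af2_predicted_structures):
    """Draw one training example with the mix described above:
    75% AlphaFold2-predicted structures, 25% experimentally determined PDB structures."""
    if random.random() < 0.75:
        return random.choice(af2_predicted_structures)   # distilled / predicted structure
    return random.choice(pdb_structures)                 # experimental structure
```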

This language-model-based approach vastly simplifies the usual SOTA structure-prediction pipeline by eliminating the need for external sequence databases, multiple sequence alignments, and templates.

E.g., AlphaFold requires access to these at inference time.

3. Results

3.1 - How well does it predict structures?

As mentioned before, they evaluate performance on CAMEO and CASP14 proteins and measure how well each structure is predicted using the TM-score.
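For reference, the TM-score is usually defined as follows, where $d_i$ is the distance between the $i$-th pair of aligned residues after superposition and the maximum is taken over superpositions:

$$
\mathrm{TM\text{-}score} \;=\; \max\;\frac{1}{L_{\mathrm{target}}}\sum_{i=1}^{L_{\mathrm{aligned}}}\frac{1}{1+\left(d_i/d_0\right)^2},
\qquad d_0 \;=\; 1.24\,\sqrt[3]{L_{\mathrm{target}}-15}\;-\;1.8
$$

It ranges from 0 to 1, with scores above roughly 0.5 generally indicating that the predicted and experimental structures share the same fold.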

When predicting structure from single sequences alone, ESMFold achieves very good performance compared to AlphaFold and RoseTTAFold.

Figure 6: (Figure 2B from the ESM-2 paper)

3.2 - How important is the language model in the pipeline?

The key question that arises is: how important is the representation learnt by the LM for the task of structure prediction? To quantify this we need several metrics.

First, we need to characterize how good the language model’s (ESM-2’s) understanding is; this is where perplexity comes in. We already have the TM-score to determine how well a structure matches the ground truth.

Thus, the graph on the right in Fig. 2B shows that structure prediction accuracy improves as the language model’s perplexity decreases.

What’s Perplexity?
How well the language model can predict a protein sequence.
“ranges from 1 for a perfect model to 20 for a model that makes predictions at random. Intuitively, the perplexity describes the number of amino acids the model is choosing between for each prediction”
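A sketch of one way to compute such a perplexity is below (a leave-one-out “pseudo-perplexity”; the paper’s exact evaluation protocol may differ, and `model.predict_proba` is a hypothetical interface):

```python
import math

MASK_TOKEN = 20   # index of the <mask> token (the 20 standard amino acids use 0..19 here)

def pseudo_perplexity(model, seq_tokens):
    """Exp of the average negative log-likelihood when each residue is masked
    and predicted from its context (one way to realize the quoted 1-to-20 scale)."""
    total_nll = 0.0
    for i, true_aa in enumerate(seq_tokens):
        masked = list(seq_tokens)
        masked[i] = MASK_TOKEN
        probs = model.predict_proba(masked)[i]   # per-position distribution over 20 amino acids
        total_nll -= math.log(probs[true_aa])
    return math.exp(total_nll / len(seq_tokens)) # 1 for a perfect model, 20 for uniform guessing
```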

How can we achieve better perplexity?

Okay, now we know that a better language model representation (lower perplexity) leads to better structure prediction.
So how can we achieve a better language model representation? 🤔 Is scaling all you need?

To answer this question, the authors explore the effect of scaling and look at what happens to the model’s perplexity and its long-range contact precision (precision @ L):

Figure 7: (Figure 1D of the paper)

So they plot how the long-range precision @ L changes when moving from a smaller model (x-axis) to a larger one (y-axis). From the points above the diagonal, it seems that scaling does help achieve better long-range precision @ L (some proteins show clear improvement).
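For context, long-range precision @ L is typically computed as in the sketch below, assuming a symmetric matrix of predicted contact probabilities and a binary ground-truth contact map (contacts are usually defined as Cβ-Cβ distance < 8 Å).

```python
import numpy as np

def long_range_precision_at_L(pred_contact_probs, true_contacts, min_separation=24):
    """Of the top-L predicted contacts with sequence separation >= 24,
    the fraction that are true contacts."""
    L = true_contacts.shape[0]
    i, j = np.triu_indices(L, k=min_separation)            # long-range residue pairs only
    order = np.argsort(pred_contact_probs[i, j])[::-1]     # rank pairs by predicted probability
    top = order[:L]                                        # keep the top L pairs
    return true_contacts[i[top], j[top]].mean()
```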

So is scaling ‘the’ answer?

It seems it’s not that simple. While P@L certainly increases with scale for some proteins, looking at how many evolutionarily related sequences exist for each query tells another story: the LM performs poorly when there is little relevant training data for the query. This seems intuitive (more studying, more results?), but it also raises the question of whether the gains come more from memorization than from ‘understanding’ the subject.

Figure 8:

I think the authors could’ve used a different color scheme though (white points mean no change in perplexity, while a point on the diagonal means no improvement in long-range precision @ L). It would be nice to read the proportion of proteins that improved directly off the plot 🤔. Maybe there are too many points, which would make it cluttered?

3.3 - Ok, what about prediction speed?

~6x speedup compared to AlphaFold2; for shorter sequences the advantage grows toward the ~60x noted in the highlights.

Figure 9: (ESMFold performs better for shorter sequences; however, since the pairwise representations scale as O(L²) in sequence length L, the advantage shrinks for long sequences. Other methods also need to search for and construct the MSA, which can take an additional >10 minutes.)

3.4 - Comparison with other protein language models

Figure 10:

4. Conclusion

It’s remarkable that scaling protein language models results in learning the structural information hidden across databases of sequences, so that we no longer need to depend on the MSA.

Is it because the model has learnt to extract the signal that we previously obtained through MSAs? What can we say about performance on sequences with few evolutionarily related sequences in the training data, and why does the model still struggle there? It would be very interesting to analyze these directions.

Thanks for reading this, hope you found it useful. If you have any suggestions/comments please share below.

References

  1. Lin, Zeming, et al. “Evolutionary-scale prediction of atomic level protein structure with a language model.” bioRxiv (2022): 2022-07.
  2. https://twitter.com/Eric_Wallace_/status/1592929060539469824