A chat with Rudi Gunawan turned me on to the idea of using language models for peptide problems, which came up again in a chat with Yujia Xu about designer collagen. Some notes on protein language models…

Model types

Proteins seem like a language problem: we have an alphabet of amino acids and we have a natural sequence. So it makes sense to apply the same kind of transformer models that have become popular in natural language processing (NLP). If you’d like a more scholarly justification, see the brief review by Ofer et al., “The language of proteins: NLP, machine learning & protein sequences” (2021). This review also goes into other (non-transformer) NLP-style approaches to the problem.

So let’s quickly review some transformer models for natural language processing.

  • BERT is an encoder-only transformer trained by masking tokens in the input; its magic power is that it is bidirectional (it tries to predict the missing information using context from both directions, from N-terminus to C-terminus and from C-terminus to N-terminus). At the end of this we are left with an embedding that we can use as input to a neural network or whatever to make a prediction. This makes it natural for infilling tasks (fill in the missing blank), but not well suited for de novo generation tasks (see the infilling sketch after this list).

  • GPT-x (where x>=2) is a decoder-only transformer that predicts the next word, moving from left to right. That is to say, it has a “causal” autoregressive training objective. Its strength is that this makes it easy to generate next tokens (hence allowing ChatGPT to exist). A weakness is that it lacks the bidirectional context of BERT.

  • BART is an encoder-decoder model trained by text infilling: some text spans are replaced by a single mask token and the decoder predicts the uncorrupted tokens. This might be really interesting for peptides, but I haven’t come across it applied to them yet.

  • Convolutional neural networks are old-school (they’re not transformers), but you might try them for proteins as well… more below.

  • XLNet (and friends): an autoregressive language model, but with a training objective that obtains full bidirectional attention by maximizing the likelihood over all permutations of the factorization order. The authors claim that it performs better than BERT at BERT-like bidirectional tasks.
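
For a concrete (toy) look at the masked objective, here is a minimal sketch using the Huggingface fill-mask pipeline. The small ESM-2 checkpoint named below is just one convenient masked protein model (ESM-2 comes up again in the next section), and the peptide sequence is made up:

```python
# Minimal sketch: BERT-style infilling of one masked residue.
# Assumes the `transformers` package and the small ESM-2 checkpoint
# "facebook/esm2_t6_8M_UR50D" (any masked protein language model would do).
from transformers import pipeline

unmasker = pipeline("fill-mask", model="facebook/esm2_t6_8M_UR50D")

# One residue is hidden with the tokenizer's <mask> token; the model ranks
# candidate amino acids for that position using context from both sides.
for guess in unmasker("MKTAYIAKQR<mask>ISFVKSHFSRQLEERLGLIEVQ"):
    print(guess["token_str"], round(guess["score"], 3))
```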

Pre-trained protein/peptide language models

In general, the idea of a foundation model is that we pre-train it on a large corpus and then we should be able to apply it to new problems, either directly or with a small amount of fine-tuning. The dominant usage style in many of these science problems is to expose the embeddings (either at the local amino-acid level or at the global whole-sequence level) and then train some model that relates those embeddings to a property of interest.
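
To make that usage pattern concrete, here is a minimal sketch of the embed-then-fit workflow. It assumes the fair-esm package and a small ESM-2 checkpoint (ESM-2 is discussed below); the peptide sequences and property values are invented placeholders:

```python
# Minimal sketch: whole-sequence embeddings from a pretrained protein LM,
# then a small downstream model relating embeddings to a property.
# Assumes `pip install fair-esm scikit-learn`; sequences/labels are toys.
import numpy as np
import torch
import esm
from sklearn.linear_model import Ridge

model, alphabet = esm.pretrained.esm2_t6_8M_UR50D()
model.eval()
batch_converter = alphabet.get_batch_converter()

data = [("pep1", "GAVLIMFWPC"),
        ("pep2", "KRHDESTNQY"),
        ("pep3", "ACDEFGHIKL")]
y = np.array([0.1, 0.7, 0.4])  # invented property values

_, _, tokens = batch_converter(data)
with torch.no_grad():
    out = model(tokens, repr_layers=[6])   # layer 6 is the last layer of the t6 model
reps = out["representations"][6]           # per-residue ("local") embeddings

# Global embedding per sequence: average over residues, skipping the
# beginning-of-sequence token at position 0.
X = np.stack([reps[i, 1 : len(seq) + 1].mean(0).numpy()
              for i, (_, seq) in enumerate(data)])

# Any small model works on top of the embeddings; ridge regression here.
reg = Ridge(alpha=1.0).fit(X, y)
print(reg.predict(X))
```

Swapping in a bigger checkpoint or a fancier downstream model doesn’t change the structure; the embedding step stays the same.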

  • ProteinBERT (2022) is a BERT-style model pretrained on 106M proteins (essentially all known…) on two simultaneous tasks: (i) bidirectional language modeling of sequences; and (ii) gene ontology annotations. Unlike classic transformers, ProteinBERT outputs both local and global representations, allowing for both kinds of tasks. In the paper they test this on 9 other tasks including secondary structure, disorder, remote homology, post-translational modifications, fluorescence, and stability. Code and model weights can be found online. Something that is neat is that the same model weights can be used for any sequence length, so you can in principle support 10^4-AA-long sequences.

  • ProtGPT2 (2022) is pretty much what the name implies, trained on 50M non-annotated sequences. Because it is a GPT-flavor model it can generate new protein sequences efficiently, and the generated sequences are consistent with observed globular proteins in terms of amino acid frequencies, disorder propensities, etc., even though they are “evolutionarily distant” in an amino-acid-change sense. Model and dataset are on Huggingface; see the generation sketch after this list.

  • Convolutional autoencoding representations of proteins (2023): There are good reasons to eschew transformers, which scale quadratically with sequence length in run-time and memory. So these folks from Microsoft used an efficient CNN that scales linearly with sequence length. The results are competitive with (and sometimes superior to) transformers. Code and pretrained weights are online.

  • ESM-2 (2023): another masked transformer model. They combine this with a downstream task of protein structure prediction (ESMFold). Data and model weights are online; the default distribution has some convenience command-line programs for computing embeddings in bulk.

  • Regression Transformer (2023): This is an XLNet-style transformer. The core idea is that you concatenate the sequence (e.g., a SMILES string) and its properties (expressed numerically with a special tokenization scheme). Then you can fill in [MASK] tokens regardless of whether they are structural or property tokens. Applications include drug-likeness, molecular properties, protein sequence modeling, protein fluorescence/stability (results for the latter are comparable to ProteinBERT mentioned above), and organic reaction yield prediction.
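
For the generation side, here is a minimal sketch with ProtGPT2 via the Huggingface text-generation pipeline; the checkpoint name and the sampling settings below are illustrative (roughly the style shown on its model card), not tuned recommendations:

```python
# Minimal sketch: de novo sequence sampling from a decoder-only protein LM.
# Assumes the `transformers` package and the "nferruz/ProtGPT2" checkpoint.
from transformers import pipeline

protgpt2 = pipeline("text-generation", model="nferruz/ProtGPT2")

# "<|endoftext|>" acts as the start token; sampling then produces new
# protein-like sequences token by token (settings are illustrative only).
sequences = protgpt2(
    "<|endoftext|>",
    max_length=100,
    do_sample=True,
    top_k=950,
    repetition_penalty=1.2,
    num_return_sequences=5,
    eos_token_id=0,
)
for s in sequences:
    print(s["generated_text"])
```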

Stromateis

  • ChemBERTa is a RoBERTa-style transformer model trained on SMILES string representations of molecules.
    • You can get decent performance just by gzipping the SMILES strings. Not quite as good as the most recent ChemBERTa-based models, but cheap (a toy sketch of this compression trick follows this list).
  • Mass2SMILES is a transformer-based model that takes MS/MS spectra as inputs and returns SMILES strings plus functional-group presence/absence.
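
Here is a toy sketch of that compression trick: a normalized compression distance built from gzip plus a nearest-neighbor lookup, with no training at all. The SMILES strings and labels are invented, so treat it as the shape of the idea rather than a benchmark:

```python
# Toy sketch: classify SMILES strings with gzip-based normalized compression
# distance (NCD) and 1-nearest-neighbor. Standard library only, no training.
import gzip

def clen(s: str) -> int:
    """Length of the gzip-compressed string, a crude complexity estimate."""
    return len(gzip.compress(s.encode()))

def ncd(a: str, b: str) -> float:
    """Normalized compression distance between two strings."""
    ca, cb, cab = clen(a), clen(b), clen(a + b)
    return (cab - min(ca, cb)) / max(ca, cb)

# Tiny labeled "training set" of (SMILES, class) pairs; purely illustrative.
train = [
    ("CCO", "alcohol"),
    ("CCCCO", "alcohol"),
    ("CC(=O)O", "acid"),
    ("CCCC(=O)O", "acid"),
]

def predict(smiles: str) -> str:
    """Return the label of the nearest training example under NCD."""
    return min(train, key=lambda pair: ncd(smiles, pair[0]))[1]

print(predict("CCCO"))       # nearest training label under NCD
print(predict("CCC(=O)O"))   # nearest training label under NCD
```

On realistic datasets you would use longer strings and more neighbors (k > 1), where the compression signal is less noisy, but the mechanics are identical.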