Abstract

Computational biology and bioinformatics provide vast data gold mines from protein sequences, ideal for Language Models (LMs) taken from Natural Language Processing (NLP). These LMs reach for new prediction frontiers at low inference costs. Here, we trained two auto-regressive models (Transformer-XL, XLNet) and four auto-encoder models (BERT, Albert, Electra, T5) on data from UniRef and BFD containing up to 393 billion amino acids. The protein LMs (pLMs) were trained on the Summit supercomputer using 5616 GPUs and a TPU Pod with up to 1024 cores. Dimensionality reduction revealed that the raw pLM embeddings from unlabeled data captured some biophysical features of protein sequences. We validated the advantage of using the embeddings as exclusive input for several subsequent tasks: (1) per-residue (per-token) prediction of protein secondary structure (3-state accuracy Q3=81%-87%); (2) per-protein (pooling) predictions of protein sub-cellular location (ten-state accuracy Q10=81%) and of membrane versus water-soluble proteins (2-state accuracy Q2=91%). For secondary structure, the most informative embeddings (ProtT5) for the first time outperformed the state of the art without using multiple sequence alignments (MSAs) or evolutionary information, thereby bypassing expensive database searches. Taken together, the results implied that pLMs learned some of the grammar of the language of life. All our models are available through https://github.com/agemagician/ProtTrans.
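As a practical illustration of the embedding extraction described in the abstract, the sketch below pulls per-residue and mean-pooled per-protein ProtT5 embeddings with the Hugging Face transformers library. The checkpoint name, the toy sequence, and the mean-pooling choice are assumptions made for illustration only; see the GitHub repository above for the officially released models and usage examples.

```python
import re
import torch
from transformers import T5EncoderModel, T5Tokenizer

# Assumed checkpoint name; the ProtTrans repository lists the released models.
model_name = "Rostlab/prot_t5_xl_uniref50"
tokenizer = T5Tokenizer.from_pretrained(model_name, do_lower_case=False)
model = T5EncoderModel.from_pretrained(model_name).eval()

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # toy example sequence
# ProtT5 expects space-separated residues; rare amino acids (U, Z, O, B) are mapped to X.
prepared = " ".join(re.sub(r"[UZOB]", "X", sequence))

inputs = tokenizer(prepared, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # shape: (1, seq_len + 1, 1024)

per_residue = hidden[0, : len(sequence)]  # drop the trailing </s> token -> per-token embeddings
per_protein = per_residue.mean(dim=0)     # mean pooling -> one 1024-d vector per protein
```

Here, per_residue would feed a per-token task such as secondary-structure prediction, while per_protein corresponds to the pooled representation used for per-protein tasks like sub-cellular location.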

Keywords

Computer science, Artificial intelligence, Machine learning, Language acquisition, Natural language processing, Psychology, Mathematics education

MeSH Terms

Algorithms, Computational Biology, Natural Language Processing, Proteins, Supervised Machine Learning

Related Publications

Universal Sentence Encoder

We present models for encoding sentences into embedding vectors that specifically target transfer learning to other NLP tasks. The models are efficient and result in accurate pe...

2018, arXiv (Cornell University), 1289 citations

Publication Info

Year: 2021
Type: Article
Volume: 44
Issue: 10
Pages: 7112-7127
Citations: 1804
Access: Closed

Citation Metrics

OpenAlex: 1804
Influential: 146
CrossRef: 1558

Cite This

Ahmed Elnaggar, Michael Heinzinger, Christian Dallago, et al. (2021). ProtTrans: Toward Understanding the Language of Life Through Self-Supervised Learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(10), 7112-7127. https://doi.org/10.1109/tpami.2021.3095381

Identifiers

DOI: 10.1109/tpami.2021.3095381
PMID: 34232869
arXiv: 2007.06225

Data Quality

Data completeness: 93%