Abstract

Significance: Learning biological properties from sequence data is a logical step toward generative and predictive artificial intelligence for biology. Here, we propose scaling a deep contextual language model with unsupervised learning to sequences spanning evolutionary diversity. We find that without prior knowledge, information emerges in the learned representations on fundamental properties of proteins such as secondary structure, contacts, and biological activity. We show the learned representations are useful across benchmarks for remote homology detection, prediction of secondary structure, long-range residue–residue contacts, and mutational effect. Unsupervised representation learning enables state-of-the-art supervised prediction of mutational effect and secondary structure and improves state-of-the-art features for long-range contact prediction.
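The statement above describes learning representations of protein sequences with a large unsupervised language model and then reusing those representations for downstream prediction. As an illustrative sketch only, the snippet below extracts per-residue and per-sequence embeddings from a pretrained transformer protein language model using the fair-esm package released alongside this line of work; the package name, the esm.pretrained.esm1b_t33_650M_UR50S loader, and the batch-converter calls are assumptions drawn from that repository's public API, not part of this record.

    # Minimal sketch: pull learned representations from a pretrained protein
    # language model (assumes `pip install fair-esm` and the public fair-esm API).
    import torch
    import esm

    # Load a pretrained 33-layer transformer and its amino-acid alphabet.
    model, alphabet = esm.pretrained.esm1b_t33_650M_UR50S()
    batch_converter = alphabet.get_batch_converter()
    model.eval()  # inference only; no fine-tuning here

    # Any amino-acid sequence works; this one is purely illustrative.
    data = [("example_protein", "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVAT")]
    labels, strs, tokens = batch_converter(data)

    # Forward pass, keeping the final-layer hidden states.
    with torch.no_grad():
        out = model(tokens, repr_layers=[33], return_contacts=False)
    token_reprs = out["representations"][33]  # shape: (batch, seq_len + 2, 1280)

    # Per-sequence embedding: mean over residues, skipping the BOS/EOS positions.
    seq_repr = token_reprs[0, 1 : len(strs[0]) + 1].mean(0)
    print(seq_repr.shape)  # torch.Size([1280])

Representations of this kind are the features on which the benchmarks named in the abstract (secondary structure, long-range contacts, mutational effect) build supervised predictors.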

Keywords

Artificial intelligence; Generative grammar; Generative model; Unsupervised learning; Computer science; Machine learning; Protein secondary structure; Biology

MeSH Terms

Amino Acids; Protein Conformation; Sequence Analysis, Protein; Sequence Homology, Amino Acid; Unsupervised Machine Learning

Publication Info

Year: 2021
Type: Article
Volume: 118
Issue: 15
Citations: 2607
Access: Closed

Citation Metrics

2607 citations (OpenAlex)

Cite This

Alexander Rives, Joshua Meier, Tom Sercu, et al. (2021). Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proceedings of the National Academy of Sciences, 118(15). https://doi.org/10.1073/pnas.2016239118

Identifiers

DOI: 10.1073/pnas.2016239118
PMID: 33876751
PMCID: PMC8053943

Data Quality

Data completeness: 90%