Abstract

Classifications of proteins into groups of related sequences are in some respects like a periodic table for biology, allowing us to understand the underlying molecular biology of any organism. Pfam is a large collection of protein domains and families. Its scientific goal is to provide a complete and accurate classification of protein families and domains. The next release of the database will contain over 10,000 entries, which leads us to reflect on how far we are from completing this work. Currently, Pfam matches 72% of known protein sequences, but for proteins with known structure Pfam matches 95%, which we believe represents the likely upper bound. Based on our analysis, a further 28,000 families would be required to achieve this level of coverage for the current sequence database. We also show that, as more sequences are added to the sequence databases, the fraction of sequences that Pfam matches is reduced, suggesting that continued addition of new families is essential to maintain its relevance.

Keywords

Computational biology; Sequence (biology); Protein family; Relevance (law); Table (database); Biology; Protein sequencing; Sequence alignment; Computer science; Genetics; Database; Peptide sequence; Gene

Publication Info

Year: 2008
Type: review
Volume: 9
Issue: 3
Pages: 210-219
Citations: 128
Access: Closed

Citation Metrics

128 (OpenAlex)

Cite This

Stephen-John Sammut, Robert Finn, Alex Bateman (2008). Pfam 10 years on: 10 000 families and still growing. Briefings in Bioinformatics, 9(3), 210-219. https://doi.org/10.1093/bib/bbn010

Identifiers

DOI: 10.1093/bib/bbn010