Abstract

We have developed a computer algorithm that can extract the pattern of conserved primary sequence elements common to all members of a homologous protein family. The method involves clustering the pairwise similarity scores among a set of related sequences to generate a binary dendrogram (tree). The tree is then reduced in a stepwise manner by progressively replacing the node connecting the two most similar termini by one common pattern until only a single common "root" pattern remains. A pattern is generated at a node by (i) performing a local optimal alignment on the sequence/pattern pair connected by the node with the use of an extended dynamic programming algorithm and then (ii) constructing a single common pattern from this alignment with a nested hierarchy of amino acid classes to identify the minimal inclusive amino acid class covering each paired set of elements in the alignment. Gaps within an alignment are created and/or extended using a "pay once" gap penalty rule, and gapped positions are converted into gap characters that function as 0 or 1 amino acid of any type during subsequent alignment. This method has been used to generate a library of covering patterns for homologous families in the National Biomedical Research Foundation/Protein Identification Resource protein sequence data base. We show that a covering pattern can be more diagnostic for sequence family membership than any of the individual sequences used to construct the pattern.
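Step (ii) above, building one common pattern from an aligned pair, can be illustrated with a minimal sketch. It assumes the alignment has already been computed and is passed in as two equal-length strings. The class hierarchy, the class symbols, and the names `CLASSES`, `covers`, `minimal_class`, and `build_pattern` are hypothetical choices made for illustration. They are not the classes or the implementation used in the paper.

```python
# Hypothetical nested amino acid classes, for illustration only.
# These are NOT the classes used in the paper. Any two classes are
# either disjoint or one contains the other (a nested hierarchy).
CLASSES = {
    "a": frozenset("ILVM"),      # aliphatic
    "r": frozenset("FWY"),       # aromatic
    "h": frozenset("ILVMFWY"),   # hydrophobic = a + r
    "p": frozenset("KRH"),       # positively charged
    "n": frozenset("DE"),        # negatively charged
    "c": frozenset("KRHDE"),     # charged = p + n
    "X": frozenset("ACDEFGHIKLMNPQRSTVWY"),  # any residue
}
ALIGN_GAP = "-"    # gap symbol in a pairwise alignment (assumed convention)
PATTERN_GAP = "."  # pattern gap character: 0 or 1 residue of any type


def covers(symbol):
    """Return the residues that a residue letter or class symbol stands for."""
    return CLASSES.get(symbol, frozenset(symbol))


def minimal_class(s1, s2):
    """Return the smallest class (or single residue) covering both elements."""
    union = covers(s1) | covers(s2)
    if len(union) == 1:
        return next(iter(union))  # identical residues stay as they are
    candidates = [name for name, members in CLASSES.items() if union <= members]
    # Fall back to the "any residue" class if no listed class covers the pair.
    return min(candidates, key=lambda name: len(CLASSES[name]), default="X")


def build_pattern(aligned_a, aligned_b):
    """Collapse two aligned sequences or patterns into one covering pattern."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned inputs must have equal length")
    gaps = {ALIGN_GAP, PATTERN_GAP}
    out = []
    for x, y in zip(aligned_a, aligned_b):
        # A gapped position becomes a gap character in the new pattern.
        out.append(PATTERN_GAP if x in gaps or y in gaps else minimal_class(x, y))
    return "".join(out)


# Two leaves are merged into a node pattern, which is then merged with a third
# sequence, mimicking one step of the stepwise tree reduction.
node = build_pattern("KLV-DF", "RIVGEW")
root = build_pattern(node, "HFVAKY")
print(node, root)
```

With these toy classes the first merge gives `paV.nr` and the second gives `phV.cr`: each position widens only as far as needed to cover the newly added sequence, and a gapped position stays a gap character. A fuller version would compute each alignment with the extended dynamic programming step and the "pay once" gap penalty described in the abstract, which this sketch leaves out.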

Keywords

Sequence alignment; Sequence logo; Multiple sequence alignment; Sequence (biology); Pairwise comparison; Set (abstract data type); Computer science; Similarity (geometry); Pattern recognition (psychology); Alignment-free sequence analysis; Protein superfamily; Tree (set theory); Protein sequencing; Node (physics); Structural alignment; Computational biology; Cluster analysis; Genetics; Artificial intelligence; Biology; Peptide sequence; Combinatorics; Mathematics

Publication Info

Year: 1990
Type: article
Volume: 87
Issue: 1
Pages: 118-122
Citations: 292
Access: Closed

Cite This

Randall F. Smith, Temple F. Smith (1990). Automatic generation of primary sequence patterns from sets of related protein sequences. Proceedings of the National Academy of Sciences, 87(1), 118-122. https://doi.org/10.1073/pnas.87.1.118

Identifiers

DOI: 10.1073/pnas.87.1.118