Abstract
We have developed a computer algorithm that can extract the pattern of conserved primary sequence elements common to all members of a homologous protein family. The method involves clustering the pairwise similarity scores among a set of related sequences to generate a binary dendrogram (tree). The tree is then reduced in a stepwise manner by progressively replacing the node connecting the two most similar termini by one common pattern until only a single common "root" pattern remains. A pattern is generated at a node by (i) performing a local optimal alignment on the sequence/pattern pair connected by the node with the use of an extended dynamic programming algorithm and then (ii) constructing a single common pattern from this alignment with a nested hierarchy of amino acid classes to identify the minimal inclusive amino acid class covering each paired set of elements in the alignment. Gaps within an alignment are created and/or extended using a "pay once" gap penalty rule, and gapped positions are converted into gap characters that function as 0 or 1 amino acid of any type during subsequent alignment. This method has been used to generate a library of covering patterns for homologous families in the National Biomedical Research Foundation/Protein Identification Resource protein sequence data base. We show that a covering pattern can be more diagnostic for sequence family membership than any of the individual sequences used to construct the pattern.
Keywords
Affiliated Institutions
Related Publications
PANDIT: an evolution-centric database of protein and associated nucleotide domains with inferred trees
PANDIT is a database of homologous sequence alignments accompanied by estimates of their corresponding phylogenetic trees. It provides a valuable resource to those studying phyl...
Defining a similarity threshold for a functional protein sequence pattern: The signal peptide cleavage site
When preparing data sets of amino acid or nucleotide sequences it is necessary to exclude redundant or homologous sequences in order to avoid overestimating the predictive perfo...
Consolidation of glycosyl hydrolase family 30: A dual domain 4/7 hydrolase family consisting of two structurally distinct groups
In this work glycosyl hydrolase (GH) family 30 (GH30) is analyzed and shown to consist of its currently classified member sequences as well as several homologous sequence groups...
Identification and classification of protein fold families
We have developed a method for identifying fold families in the protein structure data bank. Pairwise sequence alignments are first performed to extract families of homologous p...
A tool for multiple sequence alignment.
Multiple sequence alignment can be a useful technique for studying molecular evolution and analyzing sequence-structure relationships. Until recently, it has been impractical to...
Publication Info
- Year
- 1990
- Type
- article
- Volume
- 87
- Issue
- 1
- Pages
- 118-122
- Citations
- 292
- Access
- Closed
External Links
Social Impact
Social media, news, blog, policy document mentions
Citation Metrics
Cite This
Identifiers
- DOI
- 10.1073/pnas.87.1.118