Automatic generation of primary sequence patterns from sets of related protein sequences.

Abstract

We have developed a computer algorithm that can extract the pattern of conserved primary sequence elements common to all members of a homologous protein family. The method involves clustering the pairwise similarity scores among a set of related sequences to generate a binary dendrogram (tree). The tree is then reduced in a stepwise manner by progressively replacing the node connecting the two most similar termini by one common pattern until only a single common "root" pattern remains. A pattern is generated at a node by (i) performing a local optimal alignment on the sequence/pattern pair connected by the node with the use of an extended dynamic programming algorithm and then (ii) constructing a single common pattern from this alignment with a nested hierarchy of amino acid classes to identify the minimal inclusive amino acid class covering each paired set of elements in the alignment. Gaps within an alignment are created and/or extended using a "pay once" gap penalty rule, and gapped positions are converted into gap characters that function as 0 or 1 amino acid of any type during subsequent alignment. This method has been used to generate a library of covering patterns for homologous families in the National Biomedical Research Foundation/Protein Identification Resource protein sequence data base. We show that a covering pattern can be more diagnostic for sequence family membership than any of the individual sequences used to construct the pattern.
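
Step (ii) above, choosing the minimal inclusive class for each aligned pair, can be illustrated with a short Python sketch. It is not the authors' implementation: the class hierarchy in `PARENT`, the class names, the root symbol `X`, and the gap character `g` are all hypothetical choices made for illustration, and the alignment is assumed to have been computed already.

```python
# Illustrative sketch only. The hierarchy below is an assumed example,
# not the amino acid class hierarchy used in the paper.
ROOT = "X"   # assumed wildcard class covering any amino acid
GAP = "g"    # assumed gap character for gapped positions

# Each symbol or class maps to its parent class; anything unlisted
# is treated as a direct child of ROOT.
PARENT = {
    "D": "acidic", "E": "acidic",
    "K": "basic", "R": "basic", "H": "basic",
    "acidic": "charged", "basic": "charged",
    "S": "polar", "T": "polar", "N": "polar", "Q": "polar",
    "charged": "hydrophilic", "polar": "hydrophilic",
    "I": "aliphatic", "L": "aliphatic", "V": "aliphatic", "M": "aliphatic",
    "F": "aromatic", "Y": "aromatic", "W": "aromatic",
    "aliphatic": "hydrophobic", "aromatic": "hydrophobic",
    "hydrophilic": ROOT, "hydrophobic": ROOT,
}


def ancestors(symbol):
    """Return the chain from a symbol up to ROOT, most specific first."""
    chain = [symbol]
    while symbol != ROOT:
        symbol = PARENT.get(symbol, ROOT)
        chain.append(symbol)
    return chain


def covering_class(a, b):
    """Return the most specific class that contains both a and b."""
    seen = set(ancestors(a))
    for candidate in ancestors(b):
        if candidate in seen:
            return candidate
    return ROOT  # not reached: ROOT is in every ancestor chain


def covering_pattern(aligned_a, aligned_b):
    """Build one common pattern from two aligned sequences or patterns.

    Positions where either side is a gap ('-' or GAP) become GAP;
    all other positions become the minimal covering class of the pair.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned inputs must have equal length")
    pattern = []
    for a, b in zip(aligned_a, aligned_b):
        if a in ("-", GAP) or b in ("-", GAP):
            pattern.append(GAP)
        else:
            pattern.append(covering_class(a, b))
    return pattern


print(covering_pattern("DKL-", "ESLA"))
# → ['acidic', 'hydrophilic', 'L', 'g']
```

Because `covering_pattern` accepts lists of class names as well as plain strings, its output can be fed back in as one side of the next alignment, which mirrors the stepwise reduction of the tree toward a single root pattern.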

Keywords

Sequence alignment, Sequence logo, Multiple sequence alignment, Sequence (biology), Pairwise comparison, Set (abstract data type), Computer science, Similarity (geometry), Pattern recognition (psychology), Alignment-free sequence analysis, Protein superfamily, Tree (set theory), Protein sequencing, Node (physics), Structural alignment, Computational biology, Cluster analysis, Genetics, Artificial intelligence, Biology, Peptide sequence, Combinatorics, Mathematics

Publication Info

Year: 1990
Type: article
Volume: 87
Issue: 1
Pages: 118-122
Citations: 292
Access: Closed

Cite This

APA Style

Randall F. Smith, Temple F. Smith (1990). Automatic generation of primary sequence patterns from sets of related protein sequences. Proceedings of the National Academy of Sciences, 87(1), 118-122. https://doi.org/10.1073/pnas.87.1.118

Identifiers

DOI: 10.1073/pnas.87.1.118