Abstract

When preparing data sets of amino acid or nucleotide sequences, it is necessary to exclude redundant or homologous sequences in order to avoid overestimating the predictive performance of an algorithm. Methods for doing this have been available for some time in the area of protein structure prediction. We have developed a similar procedure, based on pairwise alignments, for sequences with functional sites. We show how a correlation coefficient between sequence similarity and functional homology can be used to compare the efficiency of different similarity measures and to choose a nonarbitrary threshold value for excluding redundant sequences. The impact of the choice of scoring matrix used in the alignments is examined. We demonstrate that the parameter determining the quality of the correlation is the relative entropy of the matrix, rather than the assumed (PAM or identity) substitution model. Results are presented for the case of prediction of cleavage sites in signal peptides. By inspection of the false positives, several errors in the database were found. The procedure presented may be used as a general outline for finding a problem-specific similarity measure and threshold value for analysis of other functional amino acid or nucleotide sequence patterns.
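
The idea of choosing a threshold by correlating sequence similarity with functional homology can be illustrated with a minimal sketch. It assumes the Matthews correlation coefficient as the correlation measure, which the abstract does not confirm, and the names `matthews_cc`, `best_threshold` and `toy_pairs`, together with the toy scores, are hypothetical and not taken from the paper.

```python
import math


def matthews_cc(tp, tn, fp, fn):
    """Matthews correlation coefficient from a 2x2 confusion table."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention assumed here: return 0.0 when the coefficient is undefined.
    return (tp * tn - fp * fn) / denom if denom else 0.0


def best_threshold(pairs):
    """Scan candidate thresholds over pairwise alignment scores.

    pairs: list of (similarity_score, is_functionally_homologous).
    A pair is called "similar" when its score is >= the threshold.
    Returns the (threshold, correlation) with the highest correlation.
    """
    best = (None, -1.0)
    for t in sorted({score for score, _ in pairs}):
        tp = sum(1 for s, h in pairs if s >= t and h)
        fp = sum(1 for s, h in pairs if s >= t and not h)
        fn = sum(1 for s, h in pairs if s < t and h)
        tn = sum(1 for s, h in pairs if s < t and not h)
        c = matthews_cc(tp, tn, fp, fn)
        if c > best[1]:
            best = (t, c)
    return best


# Hypothetical toy data: (alignment score, functionally homologous?)
toy_pairs = [(10, False), (14, False), (17, False),
             (21, True), (26, True), (33, True)]
print(best_threshold(toy_pairs))
```

Running the same scan with scores from different scoring matrices would, under these assumptions, give one way to compare similarity measures by the peak correlation each one reaches.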

Keywords

False positive paradox; Pattern recognition (psychology); Protein sequencing; Similarity (geometry); Mathematics; Entropy (arrow of time); Matthews correlation coefficient; Artificial intelligence; Computational biology; Peptide sequence; Computer science; Algorithm; Biology; Genetics; Support vector machine; Physics

Publication Info

Year: 1996
Type: article
Volume: 24
Issue: 2
Pages: 165-177
Citations: 89
Access: Closed

Cite This

Henrik Nielsen, Jacob Engelbrecht, Gunnar von Heijne et al. (1996). Defining a similarity threshold for a functional protein sequence pattern: The signal peptide cleavage site. Proteins: Structure, Function, and Bioinformatics, 24(2), 165-177. https://doi.org/10.1002/(sici)1097-0134(199602)24:2<165::aid-prot4>3.0.co;2-i

Identifiers

DOI: 10.1002/(sici)1097-0134(199602)24:2<165::aid-prot4>3.0.co;2-i