Abstract
We have compared commonly used sequence comparison algorithms, scoring matrices, and gap penalties using a method that identifies statistically significant differences in performance. Search sensitivity with either the Smith‐Waterman algorithm or FASTA is significantly improved by using modern scoring matrices, such as BLOSUM45–55, and optimized gap penalties instead of the conventional PAM250 matrix. More dramatic improvement can be obtained by scaling similarity scores by the logarithm of the length of the library sequence (ln()‐scaling). With the best modern scoring matrix (BLOSUM55 or J093) and optimal gap penalties (-12 for the first residue in the gap and -2 for additional residues), Smith‐Waterman and FASTA performed significantly better than BLASTP. With ln()‐scaling and optimal scoring matrices (BLOSUM45 or Gonnet92) and gap penalties (-12, -1), the rigorous Smith‐Waterman algorithm performed better than either BLASTP or FASTA, although with the Gonnet92 matrix the difference from FASTA was not significant. Ln()‐scaling performed better than normalization based on other simple functions of library sequence length. Ln()‐scaling also performed better than scores based on normalized variance, but the differences were not statistically significant for the BLOSUM50 and Gonnet92 matrices. Optimal scoring matrices and gap penalties are reported for Smith‐Waterman and FASTA, using conventional or ln()‐scaled similarity scores. Searches with no penalty for gap extension, no penalty for gap opening, or an infinite penalty for gaps performed significantly worse than the best methods. Differences in performance between FASTA and Smith‐Waterman were not significant when partial query sequences were used. However, the best performance with complete query sequences was obtained with the Smith‐Waterman algorithm and ln()‐scaling.
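The two quantitative ideas in the abstract, the two-part gap penalty and length-dependent score scaling, can be illustrated with a small sketch. The gap penalty values come from the abstract. The scaling function is only an assumption for illustration: the abstract says scores are scaled by the logarithm of library sequence length but does not give the formula, so the sketch fits a least-squares line of score against ln(length) and reports each score's residual in standard-deviation units. The names `gap_penalty` and `ln_scaled_scores` are hypothetical.

```python
import math


def gap_penalty(length, first=-12, extend=-2):
    """Total penalty for a gap of `length` residues.

    `first` is charged for the first residue in the gap and `extend`
    for each additional residue, as described in the abstract.
    """
    if length <= 0:
        return 0
    return first + extend * (length - 1)


def ln_scaled_scores(scores, lengths):
    """Illustrative length correction (an assumption, not the paper's formula).

    Fits score = a + b * ln(length) by ordinary least squares over all
    library sequences, then returns each residual divided by the residual
    standard deviation.
    """
    if len(scores) != len(lengths) or len(scores) < 3:
        raise ValueError("need at least three (score, length) pairs")

    xs = [math.log(n) for n in lengths]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(scores) / count

    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all library sequences have the same length")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    residuals = [y - (intercept + slope * x) for x, y in zip(xs, scores)]
    # Two parameters were fitted, so use count - 2 degrees of freedom.
    sd = math.sqrt(sum(r * r for r in residuals) / (count - 2))
    if sd == 0:
        raise ValueError("scores are an exact function of ln(length)")
    return [r / sd for r in residuals]


if __name__ == "__main__":
    # A three-residue gap under the (-12, -2) and (-12, -1) penalties.
    print(gap_penalty(3), gap_penalty(3, -12, -1))

    # Made-up scores and lengths; the last entry stands out from the trend.
    demo_scores = [30, 35, 40, 45, 90]
    demo_lengths = [100, 200, 400, 800, 200]
    print(ln_scaled_scores(demo_scores, demo_lengths))
```

The point of a correction like this is that long library sequences tend to produce higher similarity scores by chance, so ranking by raw score favors them; ranking by the length-adjusted value removes that trend before hits are compared.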
Publication Info
- Year: 1995
- Type: article
- Volume: 4
- Issue: 6
- Pages: 1145-1160
- Citations: 342
- Access: Closed
Identifiers
- DOI: 10.1002/pro.5560040613