A Probabilistic Model of Local Sequence Alignment That Simplifies Statistical Significance Estimation

2008 PLoS Computational Biology 348 citations

Abstract

Sequence database searches require accurate estimation of the statistical significance of scores. Optimal local sequence alignment scores follow Gumbel distributions, but determining an important parameter of the distribution (lambda) requires time-consuming computational simulation. Moreover, optimal alignment scores are less powerful than probabilistic scores that integrate over alignment uncertainty ("Forward" scores), but the expected distribution of Forward scores remains unknown. Here, I conjecture that both expected score distributions have simple, predictable forms when full probabilistic modeling methods are used. For a probabilistic model of local sequence alignment, optimal alignment bit scores ("Viterbi" scores) are Gumbel-distributed with constant lambda = log 2, and the high scoring tail of Forward scores is exponential with the same constant lambda. Simulation studies support these conjectures over a wide range of profile/sequence comparisons, using 9,318 profile-hidden Markov models from the Pfam database. This enables efficient and accurate determination of expectation values (E-values) for both Viterbi and Forward scores for probabilistic local alignments.
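The score distributions stated in the abstract can be sketched in a few lines of Python. This is a minimal illustration and not the paper's implementation. The location parameters `mu` and `tau`, the database size `n_targets`, and all function names are hypothetical inputs introduced here. The abstract fixes only lambda = log 2 and does not say how the location parameters are obtained.

```python
import math

# Conjectured constant slope for bit scores (from the abstract).
LAMBDA = math.log(2)


def gumbel_pvalue(score, mu, lam=LAMBDA):
    """P(S >= score) for a Gumbel distribution.

    mu is an assumed location parameter; the abstract does not give it.
    """
    # 1 - exp(-exp(-lam * (score - mu))), written with expm1 for accuracy
    # when the probability is very small.
    return -math.expm1(-math.exp(-lam * (score - mu)))


def exp_tail_pvalue(score, tau, lam=LAMBDA):
    """P(S >= score) for an exponential high-scoring tail.

    tau is an assumed location (offset) parameter; capped at 1.0 because
    the exponential form only describes the tail.
    """
    return min(1.0, math.exp(-lam * (score - tau)))


def evalue(pvalue, n_targets):
    """Expected number of chance hits among n_targets comparisons."""
    return pvalue * n_targets


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    p_viterbi = gumbel_pvalue(score=25.0, mu=-5.0)
    p_forward = exp_tail_pvalue(score=25.0, tau=-3.0)
    print(evalue(p_viterbi, n_targets=1_000_000))
    print(evalue(p_forward, n_targets=1_000_000))
```

Under these assumptions, with lambda = log 2 each additional bit of score halves the exponential tail probability, and the Gumbel tail approaches the same behavior at high scores.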

Keywords

Gumbel distribution; Probabilistic logic; Viterbi algorithm; Statistical model; Sequence (biology); Hidden Markov model; Computer science; Markov chain; Mathematics; Artificial intelligence; Statistics; Algorithm; Pattern recognition (psychology); Extreme value theory; Biology; Genetics

MeSH Terms

Algorithms; Base Sequence; Chromosome Mapping; Computer Simulation; Data Interpretation, Statistical; Models, Genetic; Models, Statistical; Molecular Sequence Data; Sequence Alignment; Sequence Analysis, DNA

Related Publications

Accelerated Profile HMM Searches

Profile hidden Markov models (profile HMMs) and probabilistic inference methods have made important contributions to the theory of sequence database homology search. However, pr...

2011 PLoS Computational Biology 6891 citations

Profile hidden Markov models.

Abstract The recent literature on profile hidden Markov model (profile HMM) methods and software is reviewed. Profile HMMs turn a multiple sequence alignment into a position-spe...

1998 Bioinformatics 5657 citations

Publication Info

Year: 2008
Type: article
Volume: 4
Issue: 5
Pages: e1000069-e1000069
Citations: 348
Access: Closed

Citation Metrics

OpenAlex: 348
Influential: 37
CrossRef: 301

Cite This

Sean R. Eddy (2008). A Probabilistic Model of Local Sequence Alignment That Simplifies Statistical Significance Estimation. PLoS Computational Biology, 4(5), e1000069-e1000069. https://doi.org/10.1371/journal.pcbi.1000069

Identifiers

DOI: 10.1371/journal.pcbi.1000069
PMID: 18516236
PMCID: PMC2396288
