Abstract
We study the distribution of a statistic useful in calculating the significance of the number of k-tuple matches detected in biological sequence homology algorithms. The statistic is Rn,k, the total number of heads in head runs of length k or more in a sequence of iid Bernoulli trials of length n. Calculation of the mean is straightforward. Poisson approximation formulas have been used for the variance because they are simple and powerful. Unfortunately, when p = P(Head) is large, the Poisson approximation no longer works well. In our application, p is large, say .75, and we have turned instead to direct calculation of the variance. Surprisingly, we are able to show that the variance, which is based on the interactions of O(n2) random variables, can be computed in constant time, independent of the length of the sequence and probability p. This result can be used to calculate the mean and variance of a number of other head run statistics in constant time. Additionally, we show how to extend the result to sequences generated by a stationary Markov process where the variance can be calculated in O(n) time.
Keywords
Affiliated Institutions
Related Publications
Elements of Algebraic Topology
Elements of Algebraic Topology provides the most concrete approach to the subject. With coverage of homology and cohomology theory, universal coefficient theorems, Kunneth theor...
Introduction: Motivation
The introduction motivates the remainder of the book via two specific examples of theorems from the early days of symplectic topology in which intersection theory plays a promin...
SSAHA: A Fast Search Method for Large DNA Databases
We describe an algorithm, SSAHA ( S equence S earch and A lignment by H ashing A lgorithm), for performing fast searches on databases containing multiple gigabases of DNA. Seque...
Database of homology‐derived protein structures and the structural meaning of sequence alignment
Abstract The database of known protein three‐dimensional structures can be significantly increased by the use of sequence homology, based on the following observations. (1) The ...
PartTree: an algorithm to build an approximate tree from a large number of unaligned sequences
Abstract Motivation: To construct a multiple sequence alignment (MSA) of a large number (>∼10 000) of sequences, the calculation of a guide tree with a complexity of O(N2...
Publication Info
- Year
- 1998
- Type
- article
- Volume
- 5
- Issue
- 1
- Pages
- 87-100
- Citations
- 13
- Access
- Closed
External Links
Social Impact
Social media, news, blog, policy document mentions
Citation Metrics
Cite This
Identifiers
- DOI
- 10.1089/cmb.1998.5.87