On the Distribution of<i>K</i>-tuple Matches for Sequence Homology: A Constant Time Exact Calculation of the Variance

Abstract

We study the distribution of a statistic useful in calculating the significance of the number of k-tuple matches detected in biological sequence homology algorithms. The statistic is Rn,k, the total number of heads in head runs of length k or more in a sequence of iid Bernoulli trials of length n. Calculation of the mean is straightforward. Poisson approximation formulas have been used for the variance because they are simple and powerful. Unfortunately, when p = P(Head) is large, the Poisson approximation no longer works well. In our application, p is large, say .75, and we have turned instead to direct calculation of the variance. Surprisingly, we are able to show that the variance, which is based on the interactions of O(n2) random variables, can be computed in constant time, independent of the length of the sequence and probability p. This result can be used to calculate the mean and variance of a number of other head run statistics in constant time. Additionally, we show how to extend the result to sequences generated by a stationary Markov process where the variance can be calculated in O(n) time.

Keywords

MathematicsConstant (computer programming)Variance (accounting)Homology (biology)Sequence (biology)Distribution (mathematics)StatisticsCombinatoricsBiologyComputer scienceMathematical analysisGenetics

Affiliated Institutions

Icahn School of Medicine at Mount Sinai US

Related Publications

Elements of Algebraic Topology

James R. Munkres

Elements of Algebraic Topology provides the most concrete approach to the subject. With coverage of homology and cohomology theory, universal coefficient theorems, Kunneth theor...

2018 2203 citations

Introduction: Motivation

Chris Wendl

The introduction motivates the remainder of the book via two specific examples of theorems from the early days of symplectic topology in which intersection theory plays a promin...

2020 Cambridge University Press eBooks 2277 citations

SSAHA: A Fast Search Method for Large DNA Databases

Zemin Ning , Anthony J. Cox , James C. Mullikin

We describe an algorithm, SSAHA ( S equence S earch and A lignment by H ashing A lgorithm), for performing fast searches on databases containing multiple gigabases of DNA. Seque...

2001 Genome Research 962 citations

Database of homology‐derived protein structures and the structural meaning of sequence alignment

Chris Sander , Reinhard Schneider

Abstract The database of known protein three‐dimensional structures can be significantly increased by the use of sequence homology, based on the following observations. (1) The ...

1991 Proteins Structure Function and Bioin... 1639 citations

PartTree: an algorithm to build an approximate tree from a large number of unaligned sequences

Kazutaka Katoh , Hiroyuki Toh

Abstract Motivation: To construct a multiple sequence alignment (MSA) of a large number (&gt;∼10 000) of sequences, the calculation of a guide tree with a complexity of O(N2...

2006 Bioinformatics 108 citations

Publication Info

Year: 1998
Type: article
Volume: 5
Issue: 1
Pages: 87-100
Citations: 13
Access: Closed

External Links

View on DOI.org

Social Impact

Altmetric

On the Distribution of<i>K</i>-tuple Matches for Sequence Homology: A Constant Time Exact Calculation of the Variance

PlumX Metrics

Social media, news, blog, policy document mentions

Citation Metrics

OpenAlex

Cite This

APA Style

                            
                                    Gary Benson, 
                                
                                    Xiaoping Su
                                
                            (1998). 
                            On the Distribution of<i>K</i>-tuple Matches for Sequence Homology: A Constant Time Exact Calculation of the Variance. 
                            Journal of Computational Biology
                            , 5
                            (1)
                            , 87-100.
                            https://doi.org/10.1089/cmb.1998.5.87

Identifiers

DOI: 10.1089/cmb.1998.5.87