Local homology recognition and distance measures in linear time using compressed amino acid alphabets

R. C. Edgar

doi:10.1093/nar/gkh180

Abstract

Methods for discovery of local similarities and estimation of evolutionary distance by identifying k-mers (contiguous subsequences of length k) common to two sequences are described. Given unaligned sequences of length L, these methods have O(L) time complexity. The ability of compressed amino acid alphabets to extend these techniques to distantly related proteins was investigated. The performance of these algorithms was evaluated for different alphabets and choices of k using a test set of 1848 pairs of structurally alignable sequences selected from the FSSP database. Distance measures derived from k-mer counting were found to correlate well with percentage identity derived from sequence alignments. Compressed alphabets were seen to improve performance in local similarity discovery, but no evidence was found of improvements when applied to distance estimates. The performance of our local similarity discovery method was compared with the fast Fourier transform (FFT) used in MAFFT, which has O(L log L) time complexity. Our method, when configured to achieve coverage comparable to FFT, is more than an order of magnitude faster. We suggest using k-mer distance for fast, approximate phylogenetic tree construction, and show that a speed improvement of more than three orders of magnitude can be achieved relative to standard distance methods, which require alignments.
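The k-mer counting idea in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the six-group compressed alphabet in GROUPS, the function names, and the normalisation of the shared k-mer count by the number of k-mers in the shorter sequence. For fixed k, each sequence is scanned once, so the work grows linearly with sequence length.

```python
from collections import Counter

# Hypothetical 6-letter compressed alphabet (an illustrative grouping by broad
# physicochemical class; NOT one of the alphabets evaluated in the paper).
GROUPS = ["AGPST", "C", "DENQ", "FWY", "HKR", "ILMV"]
COMPRESS = {aa: str(i) for i, group in enumerate(GROUPS) for aa in group}


def compress(seq):
    """Map each residue to its group label; unknown residues become 'X'."""
    return "".join(COMPRESS.get(aa, "X") for aa in seq.upper())


def kmer_counts(seq, k):
    """Count every contiguous k-mer in a single pass over the sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_similarity(x, y, k):
    """Fraction of k-mers common to x and y (assumed normalisation)."""
    n_max = min(len(x), len(y)) - k + 1  # k-mers in the shorter sequence
    if n_max <= 0:
        return 0.0
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)
    # A k-mer occurring n times in x and m times in y contributes min(n, m).
    shared = sum(min(n, cy[kmer]) for kmer, n in cx.items())
    return shared / n_max


def kmer_distance(x, y, k):
    """Turn the similarity into a distance in the range [0, 1]."""
    return 1.0 - kmer_similarity(x, y, k)


if __name__ == "__main__":
    a, b = "ILIL", "VMVM"
    print(kmer_distance(a, b, 2))                      # full alphabet
    print(kmer_distance(compress(a), compress(b), 2))  # compressed alphabet
```

In this toy case the two sequences share no k-mers in the full 20-letter alphabet but become identical after compression, which shows how merging similar residues can expose matches between diverged sequences. Whether that helps in practice depends on the alphabet and on k, as the abstract's evaluation indicates.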

Keywords

Fast Fourier transform, Phylogenetic tree, Edit distance, Similarity (geometry), Biology, Time complexity, Pattern recognition (psychology), Sequence (biology), Mathematics, Smith–Waterman algorithm, Tree (set theory), Algorithm, Computer science, Sequence alignment, Combinatorics, Artificial intelligence, Genetics, Peptide sequence

Affiliated Institutions

Mill Valley Public Library US

Related Publications

MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform

Kazutaka Katoh

A multiple sequence alignment program, MAFFT, has been developed. The CPU time is drastically reduced as compared with existing methods. MAFFT includes two novel techniques. (i)...

2002 Nucleic Acids Research 16606 citations

FastTree: Computing Large Minimum Evolution Trees with Profiles instead of a Distance Matrix

Morgan N. Price, Paramvir Dehal, Adam P. Arkin

Gene families are growing rapidly, but standard methods for inferring phylogenies do not scale to alignments with over 10,000 sequences. We present FastTree, a method for constr...

2009 Molecular Biology and Evolution 5466 citations

Optimizing de novo transcriptome assembly from short-read RNA-Seq data: a comparative study

Qiongyi Zhao, Yi Wang, Yimeng Kong +3 more

Background: With the fast advances in nextgen sequencing technology, high-throughput RNA sequencing has emerged as a powerful and cost-effective way for transcriptome st...

2011 BMC Bioinformatics 484 citations

A fast, lock-free approach for efficient parallel counting of occurrences of k-mers

Guillaume Marçais, Carl Kingsford

Motivation: Counting the number of occurrences of every k-mer (substring of length k) in a long string is a central subproblem in many applications, including genome as...

2011 Bioinformatics 4605 citations

The Multiple Sequence Alignment Problem in Biology

Humberto Carrillo-Calvet, David J. Lipman

The study and comparison of sequences of characters from a finite alphabet is relevant to various areas of science, notably molecular biology. The measurement of sequence simila...

1988 SIAM Journal on Applied Mathematics 525 citations

Publication Info

Year: 2004
Type: article
Volume: 32
Issue: 1
Pages: 380-385
Citations: 151
Access: Closed

Cite This

APA Style

R. C. Edgar (2004). Local homology recognition and distance measures in linear time using compressed amino acid alphabets. Nucleic Acids Research, 32(1), 380-385. https://doi.org/10.1093/nar/gkh180
