Abstract

Motivation: When analyzing protein sequences using sequence similarity searches, orthologous sequences (that diverged by speciation) are more reliable predictors of a new protein’s function than paralogous sequences (that diverged by gene duplication), because duplication enables functional diversification. The utility of phylogenetic information in high-throughput genome annotation (‘phylogenomics’) is widely recognized, but existing approaches are either manual or indirect (e.g. not based on phylogenetic trees). Our goal is to automate phylogenomics using explicit phylogenetic inference. A necessary component is an algorithm to infer speciation and duplication events in a given gene tree.

Results: We give an algorithm to infer speciation and duplication events on a gene tree by comparison to a trusted species tree. This algorithm has a worst-case running time of \(O(n^{2})\), which is inferior to two previous algorithms that are \({\sim}O(n)\) for a gene tree of \(n\) sequences. However, our algorithm is extremely simple, and its asymptotic worst-case behavior is only realized on pathological data sets. We show empirically, using 1750 gene trees constructed from the Pfam protein family database, that it appears to be a practical (and often superior) algorithm for analyzing real gene trees.

Availability: http://www.genetics.wustl.edu/eddy/forester

Contact: zmasek@genetics.wustl.edu; eddy@genetics.wustl.edu
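The comparison of a gene tree to a species tree described under Results can be illustrated with a minimal sketch of the standard mapping idea: each gene-tree node is mapped to the species-tree node that is the last common ancestor of the species below it, and a node is called a duplication when it maps to the same species-tree node as one of its children. This is an illustration under stated assumptions (rooted binary trees, every gene leaf assigned to a species present in the species tree), not the authors' implementation. The names `Node`, `index_species_tree`, `infer_events`, and `species_of` are hypothetical, and the sketch climbs the species tree using node depths, which may differ from the bookkeeping used in the paper.

```python
class Node:
    """Node of a rooted binary tree (used for both gene and species trees)."""

    def __init__(self, name=None, left=None, right=None):
        self.name = name
        self.left = left
        self.right = right
        self.parent = None
        self.depth = 0      # distance from the root (set for species trees)
        self.event = None   # "speciation" or "duplication" (gene trees)
        for child in (left, right):
            if child is not None:
                child.parent = self

    def is_leaf(self):
        return self.left is None and self.right is None


def index_species_tree(root):
    """Set the depth of every node; return {species name: leaf node}."""
    leaves = {}
    stack = [root]
    while stack:
        node = stack.pop()
        if node.parent is not None:
            node.depth = node.parent.depth + 1
        if node.is_leaf():
            leaves[node.name] = node
        else:
            stack.extend((node.left, node.right))
    return leaves


def infer_events(gene_root, species_root, species_of):
    """Label internal gene-tree nodes; return the gene-to-species mapping.

    species_of maps each gene leaf name to its species name (assumed input).
    """
    leaves = index_species_tree(species_root)
    mapping = {}

    def visit(g):  # postorder traversal; recursion is fine for small trees
        if g.is_leaf():
            mapping[g] = leaves[species_of[g.name]]
            return
        visit(g.left)
        visit(g.right)
        a, b = mapping[g.left], mapping[g.right]
        # Climb from the deeper node until both paths meet. Each climb can
        # take O(n) steps in an unbalanced species tree, which is where a
        # quadratic worst case comes from.
        while a is not b:
            if a.depth > b.depth:
                a = a.parent
            else:
                b = b.parent
        mapping[g] = a
        same_as_child = a is mapping[g.left] or a is mapping[g.right]
        g.event = "duplication" if same_as_child else "speciation"

    visit(gene_root)
    return mapping


# Toy species tree: ((human, mouse), yeast)
species_root = Node("root", Node("HM", Node("human"), Node("mouse")), Node("yeast"))

# Toy gene tree: ((h1, m1), (h2, y1))
g_left = Node("gA", Node("h1"), Node("m1"))
g_right = Node("gB", Node("h2"), Node("y1"))
gene_root = Node("gRoot", g_left, g_right)

species_of = {"h1": "human", "m1": "mouse", "h2": "human", "y1": "yeast"}
mapping = infer_events(gene_root, species_root, species_of)
for g in (g_left, g_right, gene_root):
    print(g.name, "->", mapping[g].name, g.event)
```

In the toy example the root of the gene tree maps to the same species-tree node as its right child, so it is labeled a duplication, while both of its children are labeled speciations.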

Keywords

Gene duplication, Simple (philosophy), Genetic algorithm, Algorithm, Tree (set theory), Computer science, Gene, Computational biology, Biology, Genetics, Mathematics, Combinatorics

Publication Info

Year: 2001
Type: article
Volume: 17
Issue: 9
Pages: 821-828
Citations: 209
Access: Closed

Cite This

Christian M. Zmasek, Sean R. Eddy (2001). A simple algorithm to infer gene duplication and speciation events on a gene tree. Bioinformatics, 17(9), 821-828. https://doi.org/10.1093/bioinformatics/17.9.821

Identifiers

DOI: 10.1093/bioinformatics/17.9.821