Phylogenetic Supermatrix Analysis of GenBank Sequences from 2228 Papilionoid Legumes

Michelle M. McMahon; Michael J. Sanderson

doi:10.1080/10635150600999150

Abstract

A comprehensive phylogeny of papilionoid legumes was inferred from sequences of 2228 taxa in GenBank release 147. A semiautomated analysis pipeline was constructed to download, parse, assemble, align, combine, and build trees from a pool of 11,881 sequences. Initial steps included all-against-all BLAST similarity searches coupled with assembly, using a novel strategy for building length-homogeneous primary sequence clusters. This was followed by a combination of global and local alignment protocols to build larger secondary clusters of locally aligned sequences, thus taking into account the dramatic differences in length of the heterogeneous coding and noncoding sequence data present in GenBank. Next, clusters were checked for the presence of duplicate genes and other potentially misleading sequences and examined for combinability with other clusters on the basis of taxon overlap. Finally, two supermatrices were constructed: a "sparse" matrix based on the primary clusters alone (1794 taxa x 53,977 characters), and a somewhat more "dense" matrix based on the secondary clusters (2228 taxa x 33,168 characters). Both matrices were very sparse, with 95% of their cells containing gaps or question marks. These were subjected to extensive heuristic parsimony analyses using deterministic and stochastic heuristics, including bootstrap analyses. A "reduced consensus" bootstrap analysis was also performed to detect cryptic signal in a subtree of the data set corresponding to a "backbone" phylogeny proposed in previous studies. Overall, the dense supermatrix appeared to provide much more satisfying results, indicated by better resolution of the bootstrap tree, excellent agreement with the backbone papilionoid tree in the reduced bootstrap consensus analysis, few problematic large polytomies in the strict consensus, and less fragmentation of conventionally recognized genera. Nevertheless, at lower taxonomic levels several problems were identified and diagnosed. 
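The supermatrix construction step described above can be illustrated with a minimal sketch. It assumes each cluster is already aligned and stored as a mapping from taxon name to aligned sequence. The function names `build_supermatrix` and `missing_fraction` are hypothetical and are not taken from the authors' pipeline. Taxa absent from a cluster are padded with question marks, which is how a matrix can end up with most of its cells holding gaps or question marks.

```python
from typing import Dict, List, Tuple

# One aligned cluster: taxon name -> aligned sequence (all the same length).
Alignment = Dict[str, str]


def build_supermatrix(clusters: List[Alignment]) -> Tuple[List[str], List[str]]:
    """Concatenate aligned clusters into one matrix.

    Returns the sorted taxon list and one concatenated row per taxon.
    A taxon missing from a cluster is padded with '?' for that block.
    """
    taxa = sorted({taxon for aln in clusters for taxon in aln})
    rows: Dict[str, List[str]] = {taxon: [] for taxon in taxa}
    for aln in clusters:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("each cluster must be non-empty and equally aligned")
        width = lengths.pop()
        for taxon in taxa:
            rows[taxon].append(aln.get(taxon, "?" * width))
    return taxa, ["".join(rows[taxon]) for taxon in taxa]


def missing_fraction(matrix: List[str]) -> float:
    """Fraction of cells that are gaps ('-') or question marks ('?')."""
    total = sum(len(row) for row in matrix)
    missing = sum(row.count("-") + row.count("?") for row in matrix)
    return missing / total if total else 0.0
```

Calling `build_supermatrix` on two small clusters that share only one taxon already yields a matrix in which a large share of cells is missing, which is the same effect the abstract reports at far larger scale.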
A large number of methodological issues in supermatrix construction at this scale are discussed, including detection of annotation errors in GenBank sequences; the shortage of effective algorithms and software for local multiple sequence alignment; the difficulty of overcoming effects of fragmentation of data into nearly disjoint blocks in sparse supermatrices; and the lack of informative tools to assess confidence limits in very large trees.
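The fragmentation of data into nearly disjoint blocks can be made concrete with a small sketch. It groups clusters into connected components, linking two clusters whenever they share at least `min_shared` taxa. This threshold, the function name `overlap_components`, and the cluster labels in the usage note are assumptions for illustration. The abstract does not state the exact combinability criterion the authors used.

```python
from typing import Dict, List, Set


def overlap_components(
    clusters: Dict[str, Set[str]], min_shared: int = 1
) -> List[Set[str]]:
    """Group clusters into components connected by shared taxa.

    `clusters` maps a cluster label to the set of taxa it contains.
    Two clusters are linked when they share at least `min_shared` taxa.
    """
    names = list(clusters)
    parent = {name: name for name in names}

    def find(name: str) -> str:
        # Union-find lookup with path halving.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if len(clusters[a] & clusters[b]) >= min_shared:
                parent[find(a)] = find(b)

    groups: Dict[str, Set[str]] = {}
    for name in names:
        groups.setdefault(find(name), set()).add(name)
    # Sort for a deterministic result.
    return sorted(groups.values(), key=lambda group: sorted(group))
```

For example, with hypothetical clusters `{"rbcL": {"A", "B", "C"}, "matK": {"C", "D"}, "ITS": {"E", "F"}}`, the first two are linked through taxon `C`, while the third forms its own block. More than one component means no shared taxa tie the blocks together in a combined matrix.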

Keywords

Supermatrix; GenBank; Phylogenetic tree; Supertree; Biology; Taxon; Phylogenetics; Tree (set theory); Algorithm; Evolutionary biology; Mathematics; Pattern recognition (psychology); Artificial intelligence; Computer science; Botany; Combinatorics; Genetics; Gene

Affiliated Institutions

University of California, Davis (US)

Publication Info

Year: 2006
Type: article
Volume: 55
Issue: 5
Pages: 818-836
Citations: 178
Access: Closed

Cite This

APA Style

                            
Michelle M. McMahon, Michael J. Sanderson (2006). Phylogenetic Supermatrix Analysis of GenBank Sequences from 2228 Papilionoid Legumes. Systematic Biology, 55(5), 818-836. https://doi.org/10.1080/10635150600999150

Identifiers

DOI: 10.1080/10635150600999150