Abstract
This note describes the program EST_GENOME for aligning spliced DNA to unspliced genomic DNA. It is written in ANSI C and has been tested under Digital OSF3.2. The source code and documentation are available from ftp://www.sanger.ac.uk/ftp/pub/badger/est_genome.2.tar.Z.

The prediction of genes in uncharacterized genomic DNA sequence is currently one of the main problems facing sequence annotators. Methods based on de novo prediction, e.g. searching for motifs like the splice-site consensus, or on statistical properties such as biased codon usage (Solovyev et al., 1994; Hebsgaard et al., 1996), have been only partially successful. Investigators have often found that the surest way of predicting a gene is by alignment with a homologous protein sequence (Birney et al., 1996; Gelfand et al., 1996; Huang and Zhang, 1996) or with a spliced gene product [an expressed sequence tag (EST), mRNA or cDNA], particularly now that a large number of ESTs are available (Hillier et al., 1996).

Standard alignment tools are not ideal for finding the correct alignment of a spliced product to genomic DNA, because large introns can occur in the genomic sequence and because the programs ignore the conserved sequences found at donor/acceptor splice sites (intron/exon boundaries). In addition, very large genomic DNA sequences can be hard to align using quadratic-space dynamic programming because they require too much memory. The program EST_GENOME addresses these problems: it allows large introns, can recognize splice sites and uses limited memory.

EST_GENOME is used routinely at the Sanger Centre to help annotate human genomic sequence. As it is slow compared with search methods like BLAST (Altschul et al., 1990), we first screen genomic DNA against dbEST using BLASTN. Any matching ESTs are then realigned using EST_GENOME. The algorithm uses a modification of Smith and Waterman (1981).
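To make the memory argument concrete, the following is a minimal sketch of a plain Smith-Waterman local alignment score computed with only two rows of the dynamic programming matrix, so memory grows with the genomic length rather than with the product of the two lengths. It is not the EST_GENOME algorithm: it has no intron or splice-site handling and returns only the score, not the alignment. The function name `local_alignment_score` and the `gap` cost of 2 are assumptions chosen for illustration; only the match and mismatch values of 1 come from the text.

```python
def local_alignment_score(est, genome, match=1, mismatch=1, gap=2):
    """Best local alignment score between est and genome.

    Plain Smith-Waterman recurrence with a linear gap cost.
    `gap` is a hypothetical value for illustration, not a documented default.
    Only two rows are kept, so memory is O(len(genome)).
    """
    prev = [0] * (len(genome) + 1)  # scores for the previous EST position
    best = 0
    for i in range(1, len(est) + 1):
        curr = [0] * (len(genome) + 1)  # scores for the current EST position
        for j in range(1, len(genome) + 1):
            # Aligned bases score +match or cost -mismatch.
            if est[i - 1] == genome[j - 1]:
                diag = prev[j - 1] + match
            else:
                diag = prev[j - 1] - mismatch
            up = prev[j] - gap        # EST base aligned against a gap
            left = curr[j - 1] - gap  # genomic base aligned against a gap
            # A local alignment may restart anywhere, so scores never drop below 0.
            curr[j] = max(0, diag, up, left)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    # The EST matches a stretch in the middle of a longer genomic sequence.
    print(local_alignment_score("ACGT", "TTACGTTT"))
```

Keeping two rows is enough to obtain the score; recovering the aligned positions in limited memory needs additional bookkeeping that this sketch does not attempt.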
The penalty structure used to score an alignment is as follows (defaults are in parentheses). Aligned bases score +match (1) or cost -mismatch (1) as appropriate. An indel in
Publication Info
- Year
- 1997
- Type
- article
- Volume
- 13
- Issue
- 4
- Pages
- 477-478
- Citations
- 275
- Access
- Closed
Identifiers
- DOI
- 10.1093/bioinformatics/13.4.477