Construction of validated, non-redundant composite protein sequence databases

Alan J. Bleasby; John C. Wootton

doi:10.1093/protein/3.3.153

Abstract

A strategy has been developed for the construction of a validated, comprehensive composite protein sequence database. Entries are amalgamated from primary source databases by a largely automated set of processes in which redundant and trivially different entries are eliminated. A modular approach has been adopted to allow scientific judgement to be used at each stage of database processing and amalgamation. Source databases are assigned a priority depending on the quality of sequence validation and commenting. In each pairwise comparison of databases, entries from the lower-priority database are rejected according to optionally defined redundancy criteria based on sequence segment mismatches. Efficient algorithms for this methodology are embodied in the COMPO software system. COMPO has been applied for over 2 years in the construction and regular updating of the OWL composite protein sequence database from the source databases NBRF-PIR, SWISS-PROT, a GenBank translation retrieved from the feature tables, NBRF-NEW, NEWAT86, PSD-KYOTO and the sequences contained in the Brookhaven protein structure databank. OWL is part of the ISIS integrated data resource of protein sequence and structure [Akrigg et al. (1988) Nature, 335, 745-746]. The modular nature of the integration process greatly facilitates the frequent updating of OWL following releases of the source databases. The extent of redundancy in these sources is revealed by the comparison process. The advantages of a robust composite database for sequence similarity searching and information retrieval are discussed.
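The priority-ordered amalgamation described in the abstract can be illustrated with a minimal Python sketch. It is not the COMPO implementation: the abstract does not specify the redundancy criteria, so this sketch assumes a deliberately simple one (two sequences of equal length differing at no more than `max_mismatches` positions count as redundant). The names `build_composite`, `is_redundant`, and `max_mismatches`, and the entry identifiers in the usage lines, are hypothetical.

```python
from typing import Dict, List, Tuple

Entry = Tuple[str, str]  # (identifier, sequence); an assumed, simplified record


def mismatches(a: str, b: str) -> int:
    """Count positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def is_redundant(seq: str, kept_by_length: Dict[int, List[str]],
                 max_mismatches: int) -> bool:
    """Assumed criterion: same length and at most max_mismatches differences."""
    for other in kept_by_length.get(len(seq), []):
        if mismatches(seq, other) <= max_mismatches:
            return True
    return False


def build_composite(sources: List[List[Entry]],
                    max_mismatches: int = 1) -> List[Entry]:
    """Merge source databases given in descending priority order.

    Each entry of a lower-priority database is compared with the entries
    already accepted from higher-priority databases and rejected if redundant.
    Entries within one source are not compared with each other.
    """
    composite: List[Entry] = []
    kept_by_length: Dict[int, List[str]] = {}
    for database in sources:
        accepted = [(ident, seq) for ident, seq in database
                    if not is_redundant(seq, kept_by_length, max_mismatches)]
        # Index the accepted entries only after the whole database is screened.
        for ident, seq in accepted:
            kept_by_length.setdefault(len(seq), []).append(seq)
        composite.extend(accepted)
    return composite


# Hypothetical toy data: the first list has the higher priority.
high = [("PIR:A", "MKTAYIAK"), ("PIR:B", "GGGGSSSS")]
low = [("SP:1", "MKTAYIAR"), ("SP:2", "MKTAYIAKQ"), ("SP:3", "AAAAAAAA")]
composite = build_composite([high, low], max_mismatches=1)
```

Grouping the retained sequences by length keeps each comparison cheap in this toy setting; a production system would need criteria that also handle fragments and length differences, which this sketch does not attempt.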

Keywords

Database; Computer science; Sequence database; Redundancy (engineering); Modular design; Protein structure database; Sequence (biology); Information retrieval; GenBank; Data mining; Pairwise comparison; Artificial intelligence; Programming language

Affiliated Institutions

University of Leeds GB

Publication Info

Year: 1990
Type: article
Volume: 3
Issue: 3
Pages: 153-159
Citations: 192
Access: Closed

Cite This

APA Style

Alan J. Bleasby, John C. Wootton (1990). Construction of validated, non-redundant composite protein sequence databases. Protein Engineering Design and Selection, 3(3), 153-159. https://doi.org/10.1093/protein/3.3.153

Identifiers

DOI: 10.1093/protein/3.3.153