Abstract

Motivation: Despite significant efforts in expert curation, the clinical relevance of most of the 154 million dbSNP reference variants (RS) remains unknown. However, a wealth of knowledge about the biological function and disease impact of variants is buried in unstructured literature data. Previous studies have attempted to harvest and unlock such information with text-mining techniques, but they are of limited use because their mutation extraction results are not standardized or integrated with curated data.

Results: We propose an automatic method to extract variant mentions and normalize them to unique identifiers (dbSNP RSIDs). In benchmarking, our method achieves a high F-measure of ∼90% and compares favorably to the state of the art. Next, we applied our approach to the entire PubMed and validated the results by verifying that each extracted variant-gene pair matched the dbSNP annotation based on mapped genomic position, and by analyzing variants curated in ClinVar. We then determined which text-mined variants and genes constituted novel discoveries. Our analysis reveals 41 889 RS numbers (associated with 9151 genes) not found in ClinVar. Moreover, we obtained a rich set worth further review: 12 462 rare variants (MAF ≤ 0.01) in 3849 genes, which are presumed to be deleterious and are not frequently found in the general population. To our knowledge, this is the first large-scale study to analyze and integrate text-mined variant data with curated knowledge in existing databases. Our results suggest that databases can be significantly enriched by text mining and that the combined information can greatly assist human efforts in evaluating and prioritizing variants in genomic research.

Availability and implementation: The tmVar 2.0 source code and corpus are freely available at https://www.ncbi.nlm.nih.gov/research/bionlp/Tools/tmvar/
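To make the filtering step described in the Results concrete, the following is a minimal sketch and not the tmVar 2.0 implementation. It assumes variants are mentioned by explicit RS number (the actual method also normalizes other mention forms, which a regular expression cannot do), and it uses hypothetical names throughout (`VariantRecord`, `extract_rsids`, `novel_rare_variants`) together with made-up toy identifiers, genes, and frequencies. It shows only the idea of keeping text-mined variants that are absent from ClinVar and have MAF ≤ 0.01.

```python
import re
from dataclasses import dataclass
from typing import Iterable, List, Set

# Assumption: only explicit identifiers such as "rs123" are recognized here.
RSID_PATTERN = re.compile(r"\brs(\d+)\b", re.IGNORECASE)


@dataclass(frozen=True)
class VariantRecord:
    """Hypothetical record for one text-mined variant."""
    rsid: str
    gene: str
    maf: float  # minor allele frequency


def extract_rsids(text: str) -> List[str]:
    """Return lowercase-normalized RSIDs, de-duplicated, in order of first mention."""
    seen = {}
    for match in RSID_PATTERN.finditer(text):
        seen.setdefault("rs" + match.group(1), None)
    return list(seen)


def novel_rare_variants(
    records: Iterable[VariantRecord],
    clinvar_rsids: Set[str],
    maf_cutoff: float = 0.01,
) -> List[VariantRecord]:
    """Keep variants that are not in ClinVar and have MAF at or below the cutoff."""
    return [
        r for r in records
        if r.rsid not in clinvar_rsids and r.maf <= maf_cutoff
    ]


# Toy data: identifiers, genes, and frequencies are invented for illustration.
sentence = "Carriers of RS111 and rs222 were compared; rs111 was more frequent."
mined = [
    VariantRecord("rs111", "GENE_A", 0.004),
    VariantRecord("rs222", "GENE_B", 0.200),
    VariantRecord("rs333", "GENE_C", 0.001),
]
clinvar = {"rs333"}

mentioned = extract_rsids(sentence)
candidates = novel_rare_variants(mined, clinvar)
```

With the toy inputs above, `mentioned` holds the two distinct identifiers from the sentence, and `candidates` keeps only the record that is both rare and missing from the toy ClinVar set.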

Keywords

dbSNP; Genomic medicine; Precision medicine; Computer science; Computational biology; Genetics; Biology; Single-nucleotide polymorphism; Gene; Genotype

MeSH Terms

Data Curation; Data Mining; Databases, Factual; Genetic Predisposition to Disease; Genomics; Humans; Mutation; Phenotype; Polymorphism, Genetic; Precision Medicine; PubMed; Publications; Software

Publication Info

Year
2017
Type
article
Volume
34
Issue
1
Pages
80-87
Citations
97
Access
Closed

Citation Metrics

OpenAlex
97
Influential
7
CrossRef
70

Cite This

Chih-Hsuan Wei, Lon Phan, Juliana Feltz et al. (2017). tmVar 2.0: integrating genomic variant information from literature with dbSNP and ClinVar for precision medicine. Bioinformatics, 34(1), 80-87. https://doi.org/10.1093/bioinformatics/btx541

Identifiers

DOI
10.1093/bioinformatics/btx541
PMID
28968638
PMCID
PMC5860583
