A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data

Heng Li
2011 Bioinformatics 6,923 citations

Abstract

Motivation: Most existing methods for DNA sequence analysis rely on accurate sequences or genotypes. However, in applications of the next-generation sequencing (NGS), accurate genotypes may not be easily obtained (e.g. multi-sample low-coverage sequencing or somatic mutation discovery). These applications press for the development of new methods for analyzing sequence data with uncertainty.

Results: We present a statistical framework for calling SNPs, discovering somatic mutations, inferring population genetical parameters and performing association tests directly based on sequencing data without explicit genotyping or linkage-based imputation. On real data, we demonstrate that our method achieves comparable accuracy to alternative methods for estimating site allele count, for inferring allele frequency spectrum and for association mapping. We also highlight the necessity of using symmetric datasets for finding somatic mutations and confirm that for discovering rare events, mismapping is frequently the leading source of errors.

Availability: http://samtools.sourceforge.net

Contact: hengli@broadinstitute.org
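To make the idea of working "directly based on sequencing data without explicit genotyping" more concrete, the sketch below shows one common way such an analysis can be set up. It is an illustration under stated assumptions, not the paper's implementation: it assumes a biallelic site, diploid samples, independent base errors and Hardy-Weinberg genotype proportions, and the function names `genotype_likelihoods` and `estimate_ref_allele_frequency` are hypothetical. Each sample is summarized by three genotype likelihoods, and the reference allele frequency is then estimated by expectation-maximization over those likelihoods, without ever fixing a genotype call.

```python
from math import comb


def genotype_likelihoods(is_ref_calls, error_probs):
    """Return [L(0), L(1), L(2)], where g is the number of reference alleles.

    is_ref_calls: list of bool, True if the read base matches the reference.
    error_probs:  per-base sequencing error probabilities, each in (0, 1).
    Assumes a biallelic site and independent errors (illustrative only).
    """
    likelihoods = []
    for g in range(3):
        lik = 1.0
        for is_ref, eps in zip(is_ref_calls, error_probs):
            # Probability of the observed base if it came from a
            # reference-carrying or an alternate-carrying chromosome.
            p_from_ref = 1.0 - eps if is_ref else eps
            p_from_alt = eps if is_ref else 1.0 - eps
            # A read is drawn from either chromosome with equal probability.
            lik *= (g / 2) * p_from_ref + ((2 - g) / 2) * p_from_alt
        likelihoods.append(lik)
    return likelihoods


def estimate_ref_allele_frequency(sample_likelihoods, n_iter=100, init=0.5):
    """EM estimate of the reference allele frequency from genotype likelihoods.

    sample_likelihoods: one [L(0), L(1), L(2)] list per sample.
    Assumes Hardy-Weinberg proportions as the genotype prior.
    """
    psi = init
    n = len(sample_likelihoods)
    for _ in range(n_iter):
        expected_ref_alleles = 0.0
        for liks in sample_likelihoods:
            # Unnormalized genotype posterior: likelihood times binomial prior.
            post = [
                liks[g] * comb(2, g) * psi ** g * (1.0 - psi) ** (2 - g)
                for g in range(3)
            ]
            total = sum(post)
            expected_ref_alleles += sum(g * p for g, p in enumerate(post)) / total
        # M-step: expected reference allele count over 2n chromosomes.
        psi = expected_ref_alleles / (2 * n)
    return psi
```

For example, passing the likelihoods of one sample with only reference reads and one sample with only alternate reads yields an estimate near 0.5. The uncertainty of each sample's genotype stays in the calculation throughout, which is the property that matters for low-coverage data.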

Keywords

Imputation (statistics), Genotyping, Biology, Computational biology, Genetics, DNA sequencing, 1000 Genomes Project, Allele frequency, Population, Genetic association, Single-nucleotide polymorphism, Data mining, Genotype, Computer science, Missing data, Machine learning, Gene

Publication Info

Year: 2011
Type: article
Volume: 27
Issue: 21
Pages: 2987-2993
Citations: 6923
Access: Closed

Citation Metrics

6923 (OpenAlex)

Cite This

Heng Li (2011). A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics, 27(21), 2987-2993. https://doi.org/10.1093/bioinformatics/btr509

Identifiers

DOI: 10.1093/bioinformatics/btr509