Abstract

An important assessment prior to genome assembly and related analyses is genome profiling, where the k-mer frequencies within raw sequencing reads are analyzed to estimate major genome characteristics such as size, heterozygosity, and repetitiveness. Here we introduce GenomeScope 2.0 (https://github.com/tbenavi1/genomescope2.0), which applies combinatorial theory to establish a detailed mathematical model of how k-mer frequencies are distributed in heterozygous and polyploid genomes. We describe and evaluate a practical implementation of the polyploid-aware mixture model that quickly and accurately infers genome properties across thousands of simulated and several real datasets spanning a broad range of complexity. We also present a method called Smudgeplot (https://github.com/KamilSJaron/smudgeplot) that visualizes and estimates the ploidy and structure of a genome by analyzing heterozygous k-mer pairs. We successfully apply the approach to systems of known variable ploidy levels in the Meloidogyne genus and the extreme case of octoploid Fragaria × ananassa.

Prior to genome assembly, the raw sequencing reads must be analyzed to assess major genome characteristics such as genome size, heterozygosity, and repetitiveness. For this purpose, the authors introduce GenomeScope 2.0, an extension of GenomeScope for polyploid genomes, and Smudgeplot, which can estimate a genome’s ploidy.
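To make the idea of genome profiling concrete, the following is a minimal sketch of how a k-mer frequency spectrum can be computed from reads, together with a deliberately naive genome size estimate. It is not the GenomeScope 2.0 mixture model and is not taken from either tool: the function names (`canonical`, `kmer_spectrum`, `naive_genome_size`) and the `min_coverage` cutoff are assumptions made for illustration, and the estimate only makes sense for a roughly haploid or homozygous genome with a single dominant coverage peak.

```python
from collections import Counter

# Translation table used to build reverse complements.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def kmer_spectrum(reads, k):
    """Map each coverage value to the number of distinct canonical k-mers seen that often.

    Hypothetical helper for illustration; real pipelines use a dedicated k-mer counter.
    """
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Skip k-mers containing ambiguous bases such as N.
            if set(kmer) <= set("ACGT"):
                counts[canonical(kmer)] += 1
    return Counter(counts.values())


def naive_genome_size(spectrum, min_coverage=2):
    """Estimate genome size as (total k-mer occurrences) / (peak coverage).

    Assumption: k-mers below min_coverage are treated as sequencing errors
    and ignored. This is a textbook heuristic, not the polyploid-aware model.
    """
    kept = {cov: n for cov, n in spectrum.items() if cov >= min_coverage}
    if not kept:
        raise ValueError("no k-mers at or above min_coverage")
    # Peak = coverage value shared by the most distinct k-mers.
    peak = max(kept, key=lambda cov: (kept[cov], -cov))
    total_occurrences = sum(cov * n for cov, n in kept.items())
    return total_occurrences / peak
```

A model such as the one described in the abstract fits the whole shape of this spectrum, including the additional peaks produced by heterozygous and duplicated sequence, rather than reading off a single peak as this sketch does.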

Keywords

Polyploid, Genome, Ploidy, Loss of heterozygosity, Biology, Computational biology, Profiling (computer programming), Genome size, Computer science, Genetics, Gene, Allele

MeSH Terms

Algorithms; Animals; Computational Biology; Fragaria; Genome, Plant; Heterozygote; Phylogeny; Polyploidy; Software; Tylenchoidea

Publication Info

Year: 2020
Type: article
Volume: 11
Issue: 1
Pages: 1432
Citations: 2055
Access: Closed

Citation Metrics

OpenAlex: 2055
Influential: 222

Cite This

T. Rhyker Ranallo-Benavidez, Kamil S. Jaroň, Michael C. Schatz (2020). GenomeScope 2.0 and Smudgeplot for reference-free profiling of polyploid genomes. Nature Communications, 11(1), 1432. https://doi.org/10.1038/s41467-020-14998-3

Identifiers

DOI: 10.1038/s41467-020-14998-3
PMID: 32188846
PMCID: PMC7080791