Abstract
An important assessment prior to genome assembly and related analyses is genome profiling, where the k-mer frequencies within raw sequencing reads are analyzed to estimate major genome characteristics such as size, heterozygosity, and repetitiveness. Here we introduce GenomeScope 2.0 (https://github.com/tbenavi1/genomescope2.0), which applies combinatorial theory to establish a detailed mathematical model of how k-mer frequencies are distributed in heterozygous and polyploid genomes. We describe and evaluate a practical implementation of the polyploid-aware mixture model that quickly and accurately infers genome properties across thousands of simulated and several real datasets spanning a broad range of complexity. We also present a method called Smudgeplot (https://github.com/KamilSJaron/smudgeplot) to visualize and estimate the ploidy and genome structure of a genome by analyzing heterozygous k-mer pairs. We successfully apply the approach to systems of known variable ploidy levels in the Meloidogyne genus and the extreme case of octoploid Fragaria × ananassa.

Prior to genome assembly, the raw sequencing reads must be analyzed for assessment of major genome characteristics such as genome size, heterozygosity, and repetitiveness. For this purpose, the authors introduce GenomeScope 2.0, an extension of GenomeScope for polyploid genomes, and Smudgeplot, which can estimate a genome’s ploidy.
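To make the idea of genome profiling concrete, the following is a minimal sketch of how a k-mer frequency spectrum can be built from reads and turned into a rough genome size estimate. It is not the GenomeScope 2.0 mixture model: the peak-based estimate is a deliberately naive stand-in that ignores heterozygosity and ploidy, and the function names (`canonical`, `kmer_spectrum`, `naive_genome_size`) and the `min_cov` error cutoff are assumptions made for illustration only.

```python
from collections import Counter

# Translation table for reverse-complementing DNA strings.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def kmer_spectrum(reads, k):
    """Map each coverage value to the number of distinct canonical k-mers seen that often."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Skip k-mers containing ambiguous bases such as N.
            if set(kmer) <= set("ACGT"):
                counts[canonical(kmer)] += 1
    return Counter(counts.values())


def naive_genome_size(spectrum, min_cov=2):
    """Rough size estimate: total k-mer occurrences divided by the peak coverage.

    Assumption: k-mers with coverage below min_cov are sequencing errors, and the
    most populated remaining coverage value is the single-copy (haploid) peak.
    """
    kept = {cov: n for cov, n in spectrum.items() if cov >= min_cov}
    if not kept:
        raise ValueError("no k-mers at or above min_cov")
    peak = max(kept, key=kept.get)
    total_occurrences = sum(cov * n for cov, n in kept.items())
    return total_occurrences / peak
```

In practice the spectrum would come from a dedicated k-mer counter run on the full read set, and a model-based fit would replace the single-peak heuristic, which breaks down when heterozygous or polyploid peaks are present.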
Publication Info
- Year: 2020
- Type: article
- Volume: 11
- Issue: 1
- Pages: 1432-1432
- Citations: 2055
- Access: Closed
Identifiers
- DOI: 10.1038/s41467-020-14998-3
- PMID: 32188846
- PMCID: PMC7080791