Abstract

Millions of new viral sequences have been identified from metagenomes, but the quality and completeness of these sequences vary considerably. Here we present CheckV, an automated pipeline for identifying closed viral genomes, estimating the completeness of genome fragments and removing flanking host regions from integrated proviruses. CheckV estimates completeness by comparing sequences with a large database of complete viral genomes, including 76,262 identified from a systematic search of publicly available metagenomes, metatranscriptomes and metaviromes. After validation on mock datasets and comparison to existing methods, we applied CheckV to large and diverse collections of metagenome-assembled viral sequences, including IMG/VR and the Global Ocean Virome. This revealed 44,652 high-quality viral genomes (that is, >90% complete), although the vast majority of sequences were small fragments, which highlights the challenge of assembling viral genomes from short-read metagenomes. Additionally, we found that removal of host contamination substantially improved the accurate identification of auxiliary metabolic genes and interpretation of viral-encoded functions. The quality of viral genomes assembled from metagenome data is assessed by CheckV.
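As a rough illustration of the completeness idea in the abstract, the sketch below treats completeness as the ratio of a contig's length to the expected length of its full genome, and applies the >90% high-quality threshold quoted above. The function names and the simple length-ratio formula are assumptions made for illustration; they are not CheckV's actual implementation, which derives the expected genome length by comparing sequences against its database of complete viral genomes.

```python
def estimate_completeness(contig_length: int, expected_genome_length: int) -> float:
    """Return estimated completeness as a percentage, capped at 100.

    Hypothetical helper: `expected_genome_length` stands in for the length
    of the matched complete reference genome (an assumption of this sketch).
    """
    if contig_length <= 0 or expected_genome_length <= 0:
        raise ValueError("lengths must be positive")
    return min(100.0, 100.0 * contig_length / expected_genome_length)


def is_high_quality(completeness: float) -> bool:
    """Apply the abstract's high-quality threshold (>90% complete)."""
    return completeness > 90.0


if __name__ == "__main__":
    # A 47.5 kb contig matched to a 50 kb reference genome.
    pct = estimate_completeness(47_500, 50_000)
    print(f"completeness: {pct:.1f}%  high-quality: {is_high_quality(pct)}")
```

The cap at 100% reflects that a contig longer than its reference cannot be more than complete; in practice such cases may instead point to host contamination, which the abstract notes is handled as a separate step.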

Keywords

Genome; Human virome; Metagenomics; Biology; Computational biology; Completeness (order theory); Genetics; Gene

MeSH Terms

Genome, Viral; Metagenome; Metagenomics; Molecular Sequence Annotation; Software

Publication Info

Year: 2020
Type: article
Volume: 39
Issue: 5
Pages: 578-585
Citations: 1454
Access: Closed

Citation Metrics

OpenAlex: 1454
Influential: 272

Cite This

Stephen Nayfach, Antônio Pedro Camargo, Frederik Schulz et al. (2020). CheckV assesses the quality and completeness of metagenome-assembled viral genomes. Nature Biotechnology, 39(5), 578-585. https://doi.org/10.1038/s41587-020-00774-7

Identifiers

DOI: 10.1038/s41587-020-00774-7
PMID: 33349699
PMCID: PMC8116208

Data Quality

Data completeness: 90%