Abstract

The objective of this paper is speaker recognition under noisy and unconstrained conditions.

We make two key contributions. First, we introduce a very large-scale audio-visual speaker recognition dataset collected from open-source media. Using a fully automated pipeline, we curate VoxCeleb2, which contains over a million utterances from over 6,000 speakers. This is several times larger than any publicly available speaker recognition dataset.

Second, we develop and compare Convolutional Neural Network (CNN) models and training strategies that can effectively recognise identities from voice under various conditions. The models trained on the VoxCeleb2 dataset surpass the performance of previous works on a benchmark dataset by a significant margin.
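As a concrete illustration of the kind of model the abstract describes (a convolutional trunk over spectrograms that maps an utterance to a fixed-dimensional speaker embedding, trained with a classification head over speaker identities and scored by embedding similarity at test time), here is a minimal PyTorch sketch. The layer sizes and the SpeakerEmbeddingCNN name are illustrative assumptions for this summary, not the paper's exact configuration; the paper itself trains much deeper ResNet-style networks.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeakerEmbeddingCNN(nn.Module):
    """Maps a log-mel spectrogram to a fixed-dimensional speaker embedding."""

    def __init__(self, embed_dim: int = 512, n_speakers: int = 6000):
        super().__init__()
        # Small convolutional trunk; illustrative only (the paper uses deeper
        # ResNet-style networks).
        self.trunk = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(32), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(64), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(128), nn.ReLU(inplace=True),
        )
        # Global average pooling makes the embedding independent of
        # utterance length.
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.embed = nn.Linear(128, embed_dim)
        # Classification head over training identities (training only).
        self.classifier = nn.Linear(embed_dim, n_speakers)

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        # spec: (batch, 1, n_mels, n_frames) log-mel spectrogram
        x = self.trunk(spec)
        x = self.pool(x).flatten(1)
        return self.embed(x)

# Training step (sketch): cross-entropy over speaker identities.
model = SpeakerEmbeddingCNN()
spec = torch.randn(8, 1, 64, 300)          # batch of short utterances
labels = torch.randint(0, 6000, (8,))
loss = F.cross_entropy(model.classifier(model(spec)), labels)

# Verification (sketch): score a trial pair by cosine similarity of embeddings.
emb_a = model(torch.randn(1, 1, 64, 300))
emb_b = model(torch.randn(1, 1, 64, 300))
score = F.cosine_similarity(emb_a, emb_b)  # higher => more likely same speaker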

Keywords

Computer science, Speaker recognition, Convolutional neural network, Speech recognition, Margin (machine learning), Benchmark (surveying), Pipeline (software), Speaker diarisation, Artificial intelligence, Key (lock), Feature extraction, Pattern recognition (psychology), Machine learning

Publication Info

Year: 2018
Type: Article
Citations: 2123
Access: Closed

Citation Metrics

OpenAlex: 2123
Influential: 346
CrossRef: 1429

Cite This

Joon Son Chung, Arsha Nagrani, Andrew Zisserman (2018). VoxCeleb2: Deep Speaker Recognition. Interspeech 2018. https://doi.org/10.21437/interspeech.2018-1929

Identifiers

DOI: 10.21437/interspeech.2018-1929
arXiv: 1806.05622
