Boosting attribute and phone estimation accuracies with deep neural networks for detection-based speech recognition

Abstract

Generating high-precision lattices of sub-phonetic attributes (also known as phonological features) and phones is a key front-end component of detection-based, bottom-up speech recognition. In this paper we employ deep neural networks (DNNs) to improve detection accuracy over conventional shallow multi-layer perceptrons (MLPs) with one hidden layer. We explore a range of DNN architectures with five to seven hidden layers and up to 2048 hidden units per layer. Trained on the SI84 data and tested on the Nov92 WSJ data, the proposed DNNs achieve significant improvements over the shallow MLPs, producing frame-level attribute estimation accuracies above 90% for all 21 attributes tested in the full system. On the phone detection task, we also obtain a frame-level accuracy of 86.6%. This level of high-precision detection of basic speech units opens the door to a new family of flexible speech recognition system designs for both top-down and bottom-up, lattice-based search strategies and knowledge integration.
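
As a rough illustration of the frame-level accuracy metric quoted above, the following is a minimal sketch, not the authors' evaluation code. It assumes each frame comes with a vector of class posteriors (as a detector's output layer might produce) and a single reference label; the function name `frame_accuracy` and the toy numbers are hypothetical.

```python
from typing import Sequence


def frame_accuracy(posteriors: Sequence[Sequence[float]],
                   labels: Sequence[int]) -> float:
    """Fraction of frames whose highest-scoring class equals the reference label."""
    if len(posteriors) != len(labels):
        raise ValueError("posteriors and labels must cover the same frames")
    if not labels:
        raise ValueError("at least one frame is required")
    correct = 0
    for frame_scores, ref in zip(posteriors, labels):
        # Index of the largest posterior; ties go to the lowest index.
        hyp = max(range(len(frame_scores)), key=frame_scores.__getitem__)
        correct += int(hyp == ref)
    return correct / len(labels)


# Toy data (made-up numbers): 4 frames, 2 classes,
# e.g. "attribute present" (0) versus "attribute absent" (1).
toy_posteriors = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]]
toy_labels = [0, 1, 1, 1]
print(frame_accuracy(toy_posteriors, toy_labels))  # 3 of 4 frames correct
```

Under this reading, a per-attribute figure would be obtained by running the same computation once per attribute detector, and the phone figure by using one class per phone.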

Keywords

Computer science; Boosting (machine learning); Perceptron; Speech recognition; Artificial neural network; Phone; Artificial intelligence; Frame (networking); Voice activity detection; Pattern recognition (psychology); Multilayer perceptron; Deep neural networks; Speech processing; Telecommunications

Publication Info

Year: 2012
Type: article
Pages: 4169-4172
Citations: 64
Access: Closed

Cite This

APA Style

Dong Yu, Sabato Marco Siniscalchi, Li Deng, et al. (2012). Boosting attribute and phone estimation accuracies with deep neural networks for detection-based speech recognition. 4169-4172. https://doi.org/10.1109/icassp.2012.6288837

Identifiers

DOI: 10.1109/icassp.2012.6288837