Abstract
While research in large vocabulary continuous speech recognition (LVCSR) has sparked the development of many state-of-the-art research ideas, research in this domain suffers from two main drawbacks. First, because of the large number of parameters and poorly labeled transcriptions, it is very difficult to use error analysis to gain insight into further improvements. Second, training and testing new research ideas on LVCSR systems often takes significantly longer than on small vocabulary tasks. A small vocabulary task like TIMIT provides a phonetically rich, hand-labeled corpus and offers a good test bed for studying algorithmic improvements. However, research ideas explored on small vocabulary tasks often do not provide gains on LVCSR systems. In this paper, we address these issues by taking the standard "recipe" used in typical LVCSR systems and applying it to the TIMIT phonetic recognition corpus, which provides a standard benchmark for comparing methods. We find that at the speaker-independent (SI) level, our results are comparable to those of other SI HMM systems. By taking advantage of speaker adaptation and discriminative training techniques commonly used in LVCSR systems, we achieve an error rate of 20%, the best result reported on the TIMIT task to date, moving us closer to the human-reported phonetic recognition error rate of 15%. We propose this system as a baseline for future research and believe that it will serve as a good framework for exploring ideas that will carry over to LVCSR systems.
Related Publications
Exemplar-Based Sparse Representation Features: From TIMIT to LVCSR
The use of exemplar-based methods, such as support vector machines (SVMs), k-nearest neighbors (kNNs) and sparse representations (SRs), in speech recognition has thus far been l...
Deep Belief Networks using discriminative features for phone recognition
Deep Belief Networks (DBNs) are multi-layer generative models. They can be trained to model windows of coefficients extracted from speech and they discover multiple layers of fe...
Applying Convolutional Neural Networks concepts to hybrid NN-HMM model for speech recognition
Convolutional Neural Networks (CNN) have shown success in achieving translation invariance for many image processing tasks. The success is largely attributed to the use of loca...
Global optimization of a neural network-hidden Markov model hybrid
An original method for integrating artificial neural networks (ANN) with hidden Markov models (HMM) is proposed. ANNs are suitable for performing phonetic classification, wherea...
A statistical approach to automatic speech recognition using the atomic speech units constructed from overlapping articulatory features
In recent years, the development of a feature-based general statistical framework has been pursued for automatic speech recognition via novel designs of minimal or atomic units ...
Publication Info
- Year: 2009
- Type: article
- Pages: 359-364
- Citations: 33
- Access: Closed
Identifiers
- DOI: 10.1109/asru.2009.5373263