Abstract

We present a novel per-dimension learning rate method for gradient descent called ADADELTA. The method dynamically adapts over time using only first order information and has minimal computational overhead beyond vanilla stochastic gradient descent. The method requires no manual tuning of a learning rate and appears robust to noisy gradient information, different model architecture choices, various data modalities and selection of hyperparameters. We show promising results compared to other methods on the MNIST digit classification task using a single machine and on a large scale voice dataset in a distributed cluster environment.
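
The abstract is a high-level summary; the update rule itself is not reproduced on this page. As a minimal sketch, the Python snippet below implements the per-dimension ADADELTA update the paper describes: decaying averages of squared gradients and squared updates set each coordinate's effective step size, so no global learning rate needs tuning. The function name, the toy quadratic objective, and the training loop are illustrative choices, and the decay rate rho and conditioning constant eps are set to commonly cited defaults (0.95 and 1e-6) rather than values stated in the abstract.

import numpy as np

def adadelta_update(x, grad, state, rho=0.95, eps=1e-6):
    """One ADADELTA step for a parameter array x given its gradient.

    state holds per-dimension accumulators:
      acc_grad   ~ E[g^2]   (decaying average of squared gradients)
      acc_update ~ E[dx^2]  (decaying average of squared updates)
    """
    # E[g^2]_t = rho * E[g^2]_{t-1} + (1 - rho) * g_t^2
    state["acc_grad"] = rho * state["acc_grad"] + (1.0 - rho) * grad ** 2
    # Per-dimension step: -(RMS[dx]_{t-1} / RMS[g]_t) * g_t, no learning rate needed
    delta = -np.sqrt(state["acc_update"] + eps) / np.sqrt(state["acc_grad"] + eps) * grad
    # E[dx^2]_t = rho * E[dx^2]_{t-1} + (1 - rho) * dx_t^2
    state["acc_update"] = rho * state["acc_update"] + (1.0 - rho) * delta ** 2
    return x + delta, state

# Toy usage: minimize f(x) = sum(x^2) from noisy gradients.
rng = np.random.default_rng(0)
x = rng.normal(size=5)
state = {"acc_grad": np.zeros_like(x), "acc_update": np.zeros_like(x)}
for _ in range(2000):
    grad = 2.0 * x + 0.01 * rng.normal(size=x.shape)  # noisy gradient of f
    x, state = adadelta_update(x, grad, state)
print(x)  # entries should have drifted close to the minimum at zero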

Keywords

MNIST database; Stochastic gradient descent; Computer science; Overhead (engineering); Artificial intelligence; Gradient descent; Hyperparameter; Dimension (graph theory); Word error rate; Task (project management); Machine learning; Selection (genetic algorithm); Scale (ratio); Pattern recognition (psychology); Deep learning; Artificial neural network; Mathematics

Related Publications

Optimization for training neural nets

Various techniques of optimizing criterion functions to train neural-net classifiers are investigated. These techniques include three standard deterministic techniques (variable...

1992 · IEEE Transactions on Neural Networks · 210 citations

Publication Info

Year
2012
Type
preprint
Citations
5515
Access
Closed

Citation Metrics

5515 citations (source: OpenAlex)

Cite This

Matthew D. Zeiler (2012). ADADELTA: An Adaptive Learning Rate Method. arXiv (Cornell University). https://doi.org/10.48550/arxiv.1212.5701

Identifiers

DOI
10.48550/arxiv.1212.5701

Data Quality

Data completeness: 77%