Abstract

We present a novel per-dimension learning rate method for gradient descent called ADADELTA. The method dynamically adapts over time using only first order information and has minimal computational overhead beyond vanilla stochastic gradient descent. The method requires no manual tuning of a learning rate and appears robust to noisy gradient information, different model architecture choices, various data modalities and selection of hyperparameters. We show promising results compared to other methods on the MNIST digit classification task using a single machine and on a large scale voice dataset in a distributed cluster environment.
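
The abstract is a high-level summary; the update rule itself is not reproduced on this page. As a minimal sketch, the Python snippet below implements the per-dimension ADADELTA update the paper describes: decaying averages of squared gradients and squared updates set each coordinate's effective step size, so no global learning rate needs tuning. The function name, the toy quadratic objective, and the training loop are illustrative choices, and the decay rate rho and conditioning constant eps are set to commonly cited defaults (0.95 and 1e-6) rather than values stated in the abstract.

import numpy as np

def adadelta_update(x, grad, state, rho=0.95, eps=1e-6):
    """One ADADELTA step for a parameter array x given its gradient.

    state holds per-dimension accumulators:
      acc_grad   ~ E[g^2]   (decaying average of squared gradients)
      acc_update ~ E[dx^2]  (decaying average of squared updates)
    """
    # E[g^2]_t = rho * E[g^2]_{t-1} + (1 - rho) * g_t^2
    state["acc_grad"] = rho * state["acc_grad"] + (1.0 - rho) * grad ** 2
    # Per-dimension step: -(RMS[dx]_{t-1} / RMS[g]_t) * g_t, no learning rate needed
    delta = -np.sqrt(state["acc_update"] + eps) / np.sqrt(state["acc_grad"] + eps) * grad
    # E[dx^2]_t = rho * E[dx^2]_{t-1} + (1 - rho) * dx_t^2
    state["acc_update"] = rho * state["acc_update"] + (1.0 - rho) * delta ** 2
    return x + delta, state

# Toy usage: minimize f(x) = sum(x^2) from noisy gradients.
rng = np.random.default_rng(0)
x = rng.normal(size=5)
state = {"acc_grad": np.zeros_like(x), "acc_update": np.zeros_like(x)}
for _ in range(2000):
    grad = 2.0 * x + 0.01 * rng.normal(size=x.shape)  # noisy gradient of f
    x, state = adadelta_update(x, grad, state)
print(x)  # entries should have drifted close to the minimum at zero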

Keywords

MNIST database; Stochastic gradient descent; Computer science; Overhead (engineering); Artificial intelligence; Gradient descent; Hyperparameter; Dimension (graph theory); Word error rate; Task (project management); Machine learning; Selection (genetic algorithm); Scale (ratio); Pattern recognition (psychology); Deep learning; Artificial neural network; Mathematics

Related Publications

Optimization for training neural nets

Various techniques of optimizing criterion functions to train neural-net classifiers are investigated. These techniques include three standard deterministic techniques (variable...

1992 · IEEE Transactions on Neural Networks · 210 citations

Publication Info

Year
2012
Type
preprint
Citations
5515
Access
Closed

Citation Metrics

5515 citations (source: OpenAlex)

Cite This

Matthew D. Zeiler (2012). ADADELTA: An Adaptive Learning Rate Method. arXiv (Cornell University). https://doi.org/10.48550/arxiv.1212.5701

Identifiers

DOI
10.48550/arxiv.1212.5701

Data Quality

Data completeness: 77%