Abstract
We note that common implementations of adaptive gradient algorithms, such as Adam, limit the potential benefit of weight decay regularization, because the weights do not decay multiplicatively (as would be expected for standard weight decay) but by an additive constant factor. We propose a simple way to resolve this issue by decoupling weight decay and the optimization steps taken w.r.t. the loss function. We provide empirical evidence that our proposed modification (i) decouples the optimal choice of weight decay factor from the setting of the learning rate for both standard SGD and Adam, and (ii) substantially improves Adam's generalization performance, allowing it to compete with SGD with momentum on image classification datasets (on which it was previously typically outperformed by the latter). We also demonstrate that longer optimization runs require smaller weight decay values for optimal results and introduce a normalized variant of weight decay to reduce this dependence. Finally, we propose a version of Adam with warm restarts (AdamWR) that has strong anytime performance while achieving state-of-the-art results on CIFAR-10 and ImageNet32x32. Our source code will become available after the review process.
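For illustration, below is a minimal NumPy sketch of the decoupled update described above. The function and parameter names (`adamw_step`, `lr`, `wd`) are our own, and the exact scaling of the decay term follows the commonly used AdamW formulation; it is a sketch of the idea, not the paper's pseudocode.

```python
import numpy as np

def adamw_step(theta, grad, m, v, t,
               lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, wd=1e-2):
    """One Adam update with decoupled weight decay (illustrative sketch).

    The decay term wd * theta shrinks the weights directly and is kept
    separate from the adaptive gradient step, instead of being added to
    the gradient as L2 regularization inside Adam effectively does.
    """
    # Adam moment estimates computed on the unregularized gradient.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)

    # Decoupled (multiplicative) weight decay alongside the Adam step.
    theta = theta - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * theta)
    return theta, m, v


# Toy usage on the quadratic loss 0.5 * ||theta||^2 (gradient = theta).
theta = np.ones(3)
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 101):
    theta, m, v = adamw_step(theta, theta.copy(), m, v, t)
```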
Publication Info
- Year: 2018
- Type: article
- Citations: 1137
- Access: Closed