Large Scale Distributed Deep Networks

Abstract

Recent work in unsupervised feature learning and deep learning has shown that being able to train large models can dramatically improve performance. In this paper, we consider the problem of training a deep network with billions of parameters using tens of thousands of CPU cores. We have developed a software framework called DistBelief that can utilize computing clusters with thousands of machines to train large models. Within this framework, we have developed two algorithms for large-scale distributed training: (i) Downpour SGD, an asynchronous stochastic gradient descent procedure supporting a large number of model replicas, and (ii) Sandblaster, a framework that supports a variety of distributed batch optimization procedures, including a distributed implementation of L-BFGS. Downpour SGD and Sandblaster L-BFGS both increase the scale and speed of deep network training. We have successfully used our system to train a deep network 30x larger than previously reported in the literature, and it achieves state-of-the-art performance on ImageNet, a visual object recognition task with 16 million images and 21k categories. We show that these same techniques dramatically accelerate the training of a more modestly sized deep network for a commercial speech recognition service. Although we focus on and report the performance of these methods as applied to training large neural networks, the underlying algorithms are applicable to any gradient-based machine learning algorithm.
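
The abstract describes Downpour SGD only at a high level: many model replicas running stochastic gradient descent asynchronously. As a rough illustration of that idea, and not the DistBelief implementation, the following toy sketch uses Python threads as stand-ins for replicas. The shared `ParameterStore`, the linear model, the synthetic data, and every hyperparameter are assumptions introduced here for illustration only.

```python
import random
import threading


class ParameterStore:
    """Hypothetical shared parameter store; a lock guards each read and update."""

    def __init__(self, dim):
        self.w = [0.0] * dim
        self._lock = threading.Lock()

    def fetch(self):
        with self._lock:
            return list(self.w)

    def apply_gradient(self, grad, lr):
        with self._lock:
            for i, g in enumerate(grad):
                self.w[i] -= lr * g


def minibatch_gradient(w, batch):
    """Gradient of the mean squared error for a linear model y ~ w . x."""
    grad = [0.0] * len(w)
    for x, y in batch:
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        for i, xi in enumerate(x):
            grad[i] += 2.0 * err * xi / len(batch)
    return grad


def replica(store, shard, steps, batch_size, lr, seed):
    """One model replica: fetch parameters, compute a gradient, push it back."""
    rng = random.Random(seed)
    for _ in range(steps):
        w = store.fetch()  # may already be stale when the update is pushed
        batch = rng.sample(shard, batch_size)
        store.apply_gradient(minibatch_gradient(w, batch), lr)


def train(num_replicas=4, steps=500, batch_size=8, lr=0.05, seed=0):
    rng = random.Random(seed)
    true_w = [2.0, -3.0]  # assumed ground truth for the synthetic data
    data = []
    for _ in range(400):
        x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        data.append((x, sum(t * xi for t, xi in zip(true_w, x))))

    # Each replica trains on its own shard of the data.
    shards = [data[i::num_replicas] for i in range(num_replicas)]
    store = ParameterStore(dim=len(true_w))
    threads = [
        threading.Thread(
            target=replica,
            args=(store, shards[i], steps, batch_size, lr, seed + 1 + i),
        )
        for i in range(num_replicas)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return store.fetch(), true_w


if __name__ == "__main__":
    learned, target = train()
    print("learned:", learned, "target:", target)
```

The replicas never wait for one another: each computes its gradient on whatever parameters it last fetched, so some updates are based on slightly stale values. In this small noiseless problem the staleness is harmless; whether it stays harmless at scale is an empirical question that the sketch does not address.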

Keywords

Computer science, Deep learning, Artificial intelligence, Asynchronous communication, Stochastic gradient descent, Deep neural networks, Artificial neural network, Machine learning, Scale (ratio), Distributed computing, Task (project management), Feature (linguistics), Focus (optics), Computer network

Affiliated Institutions

Google (United States) US

Publication Info

Year: 2012
Type: article
Volume: 25
Pages: 1223-1231
Citations: 2906
Access: Closed

Cite This

APA Style

                            
                                    Jay B. Dean, 
                                
                                    Greg S. Corrado, 
                                
                                    Rajat Monga
                                
                                et al.
                            
                            (2012). 
                            Large Scale Distributed Deep Networks. 
                            
                            , 25
                            
                            , 1223-1231.