Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition

Abstract

Existing deep convolutional neural networks (CNNs) require a fixed-size (e.g., 224 × 224) input image. This requirement is "artificial" and may reduce the recognition accuracy for the images or sub-images of an arbitrary size/scale. In this work, we equip the networks with another pooling strategy, "spatial pyramid pooling", to eliminate the above requirement. The new network structure, called SPP-net, can generate a fixed-length representation regardless of image size/scale. Pyramid pooling is also robust to object deformations. With these advantages, SPP-net should in general improve all CNN-based image classification methods. On the ImageNet 2012 dataset, we demonstrate that SPP-net boosts the accuracy of a variety of CNN architectures despite their different designs. On the Pascal VOC 2007 and Caltech101 datasets, SPP-net achieves state-of-the-art classification results using a single full-image representation and no fine-tuning. The power of SPP-net is also significant in object detection. Using SPP-net, we compute the feature maps from the entire image only once, and then pool features in arbitrary regions (sub-images) to generate fixed-length representations for training the detectors. This method avoids repeatedly computing the convolutional features. In processing test images, our method is 24-102× faster than the R-CNN method, while achieving better or comparable accuracy on Pascal VOC 2007. In the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2014, our methods rank #2 in object detection and #3 in image classification among all 38 teams. This manuscript also introduces the improvement made for this competition.
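
The fixed-length property described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `spatial_pyramid_pool`, the pyramid levels (1, 2, 4), the use of max pooling, and the floor/ceil bin boundaries are all assumptions chosen for illustration. The sketch shows that feature maps of different spatial sizes, and a sub-window cut from a feature map computed once, all pool to a vector of the same length.

```python
import numpy as np


def spatial_pyramid_pool(feature_map, levels=(1, 2, 4)):
    """Max-pool a (C, H, W) feature map into a fixed-length vector.

    For each pyramid level n (an assumed setting, not taken from the paper),
    the map is divided into an n x n grid of bins and the per-channel
    maximum of each bin is kept. The output length is C * sum(n * n),
    independent of H and W.
    """
    channels, height, width = feature_map.shape
    pooled = []
    for n in levels:
        for i in range(n):
            # Floor for the start and ceil for the end keeps every bin
            # non-empty, even when the map is smaller than the grid.
            r0 = (i * height) // n
            r1 = ((i + 1) * height + n - 1) // n
            for j in range(n):
                c0 = (j * width) // n
                c1 = ((j + 1) * width + n - 1) // n
                pooled.append(feature_map[:, r0:r1, c0:c1].max(axis=(1, 2)))
    return np.concatenate(pooled)


rng = np.random.default_rng(0)
# Two hypothetical conv feature maps with the same channel count
# but different spatial sizes.
map_a = rng.random((8, 13, 13))
map_b = rng.random((8, 10, 17))
vec_a = spatial_pyramid_pool(map_a)
vec_b = spatial_pyramid_pool(map_b)

# Detection-style use: the feature map is computed once, then an
# arbitrary window of it is pooled to the same fixed length.
region = map_b[:, 2:9, 3:11]
vec_region = spatial_pyramid_pool(region)

print(vec_a.shape, vec_b.shape, vec_region.shape)
```

With 8 channels and levels (1, 2, 4), each vector has 8 × (1 + 4 + 16) = 168 entries, so a fully connected layer of fixed input size could follow regardless of the image or region size.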

Keywords

Pooling, Pascal (unit), Artificial intelligence, Computer science, Convolutional neural network, Pattern recognition (psychology), Pyramid (geometry), Contextual image classification, Object detection, Deep learning, Feature extraction, Computer vision, Image (mathematics), Mathematics

Related Publications

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

Shaoqing Ren , Kaiming He , Ross Girshick +1 more

State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances like SPPnet and Fast R-CNN have reduced the running t...

2015 arXiv (Cornell University) 6211 citations

AttentionNet: Aggregating Weak Directions for Accurate Object Detection

Donggeun Yoo , Sunggyun Park , Joon‐Young Lee +2 more

We present a novel detection method using a deep convolutional neural network (CNN), named AttentionNet. We cast an object detection problem as an iterative classification probl...

2015 178 citations

Fast R-CNN

Ross Girshick

This paper proposes a Fast Region-based Convolutional Network method (Fast R-CNN) for object detection. Fast R-CNN builds on previous work to efficiently classify object proposa...

2015 2015 IEEE International Conference on... 26511 citations

Fast R-CNN

Ross Girshick

This paper proposes a Fast Region-based Convolutional Network method (Fast R-CNN) for object detection. Fast R-CNN builds on previous work to efficiently classify object proposa...

2015 arXiv (Cornell University) 1766 citations

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

Shaoqing Ren , Kaiming He , Ross Girshick +1 more

State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances like SPPnet and Fast R-CNN have reduced the running tim...

2015 arXiv (Cornell University) 18214 citations

Publication Info

Year: 2015
Type: article
Volume: 37
Issue: 9
Pages: 1904-1916
Citations: 10916
Access: Closed

Cite This

APA Style

Kaiming He, Xiangyu Zhang, Shaoqing Ren et al. (2015). Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(9), 1904-1916. https://doi.org/10.1109/tpami.2015.2389824

Identifiers

DOI: 10.1109/tpami.2015.2389824