Abstract

In this paper we discuss several forms of spatiotemporal convolutions for video analysis and study their effects on action recognition. Our motivation stems from the observation that 2D CNNs applied to individual frames of the video have remained solid performers in action recognition. In this work we empirically demonstrate the accuracy advantages of 3D CNNs over 2D CNNs within the framework of residual learning. Furthermore, we show that factorizing the 3D convolutional filters into separate spatial and temporal components yields significant gains in accuracy. Our empirical study leads to the design of a new spatiotemporal convolutional block, "R(2+1)D", which produces CNNs that achieve results comparable or superior to the state-of-the-art on Sports-1M, Kinetics, UCF101, and HMDB51.
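The factorization mentioned in the abstract replaces each t×d×d 3D convolution with a 1×d×d spatial convolution into M intermediate channels followed by a t×1×1 temporal convolution. In the paper, M is chosen so the (2+1)D block has approximately the same parameter count as the full 3D filter it replaces. A minimal sketch of that shape arithmetic (function names are illustrative, not from the paper; bias terms are ignored):

```python
def midplanes(t, d, n_in, n_out):
    # Intermediate channel count M, chosen so the (2+1)D block
    # matches the parameter budget of a full t x d x d 3D conv.
    return (t * d * d * n_in * n_out) // (d * d * n_in + t * n_out)

def params_3d(t, d, n_in, n_out):
    # Parameters of a single t x d x d 3D convolution (no bias).
    return t * d * d * n_in * n_out

def params_2plus1d(t, d, n_in, n_out):
    # Parameters of the factorized block: a 1 x d x d spatial conv
    # into M channels, then a t x 1 x 1 temporal conv to n_out.
    m = midplanes(t, d, n_in, n_out)
    return d * d * n_in * m + t * m * n_out

# Example: a 3x3x3 kernel mapping 64 -> 64 channels.
print(midplanes(3, 3, 64, 64))       # -> 144 intermediate channels
print(params_3d(3, 3, 64, 64))       # -> 110592
print(params_2plus1d(3, 3, 64, 64))  # -> 110592 (same budget)
```

Matching the parameter count lets the comparison in the paper isolate the effect of the factorization itself, rather than conflating it with a change in model capacity.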

Keywords

Action recognition, Computer science, Residual learning, Artificial intelligence, Convolutional neural network, Pattern recognition, Mathematics, Algorithm

Publication Info

Year
2018
Type
article
Pages
6450-6459
Citations
3320
Access
Closed

Citation Metrics

3320 (OpenAlex)

Cite This

Du Tran, Heng Wang, Lorenzo Torresani et al. (2018). A Closer Look at Spatiotemporal Convolutions for Action Recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 6450-6459. https://doi.org/10.1109/cvpr.2018.00675

Identifiers

DOI
10.1109/cvpr.2018.00675