Abstract

Convolutional Neural Networks (CNNs) have been established as a powerful class of models for image recognition problems. Encouraged by these results, we provide an extensive empirical evaluation of CNNs on large-scale video classification using a new dataset of 1 million YouTube videos belonging to 487 classes. We study multiple approaches for extending the connectivity of a CNN in the time domain to take advantage of local spatio-temporal information and suggest a multiresolution, foveated architecture as a promising way of speeding up training. Our best spatio-temporal networks display significant performance improvements compared to strong feature-based baselines (55.3% to 63.9%), but only a surprisingly modest improvement compared to single-frame models (59.3% to 60.9%). We further study the generalization performance of our best model by retraining the top layers on the UCF-101 Action Recognition dataset and observe significant performance improvements compared to the UCF-101 baseline model (63.3% up from 43.9%).
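
The multiresolution, foveated idea named in the abstract can be illustrated with a minimal sketch. The abstract gives no implementation details, so everything below is an assumption made for illustration: the function names are hypothetical, a frame is a plain list of grayscale rows, the low-resolution "context" view is produced by averaging 2x2 blocks, and the high-resolution "fovea" view is a center crop of the same size. It is not the authors' code.

```python
def center_crop(frame, height, width):
    """Return the central height x width region of a 2-D frame (list of rows)."""
    top = (len(frame) - height) // 2
    left = (len(frame[0]) - width) // 2
    return [row[left:left + width] for row in frame[top:top + height]]


def downsample_2x(frame):
    """Halve the resolution by averaging non-overlapping 2x2 pixel blocks."""
    rows, cols = len(frame), len(frame[0])
    return [
        [
            (frame[r][c] + frame[r][c + 1]
             + frame[r + 1][c] + frame[r + 1][c + 1]) / 4.0
            for c in range(0, cols - 1, 2)
        ]
        for r in range(0, rows - 1, 2)
    ]


def foveated_inputs(frame):
    """Split one frame into two half-size views (assumed scheme, see lead-in).

    context: the whole frame at half resolution
    fovea:   the center of the frame at full resolution
    """
    context = downsample_2x(frame)
    fovea = center_crop(frame, len(context), len(context[0]))
    return context, fovea


if __name__ == "__main__":
    # Toy 8x8 frame whose pixel value encodes its position.
    frame = [[r * 8 + c for c in range(8)] for r in range(8)]
    context, fovea = foveated_inputs(frame)
    total = len(context) * len(context[0]) + len(fovea) * len(fovea[0])
    print(len(context), len(fovea), total)
```

Under these assumptions, the two views together contain half as many input pixels as the original frame (two 4x4 views versus one 8x8 frame in the toy case), which is one plausible reading of how such an architecture could reduce training cost.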

Keywords

Computer science; Convolutional neural network; Artificial intelligence; Pattern recognition (psychology); Generalization; Feature (linguistics); Contextual image classification; Frame (networking); Feature extraction; Domain (mathematical analysis); Retraining; Class (philosophy); Scale (ratio); Machine learning; Image (mathematics); Mathematics

Publication Info

Year: 2014
Type: article
Citations: 6224 (OpenAlex)
Access: Closed

Cite This

Andrej Karpathy, George Toderici, Sanketh Shetty et al. (2014). Large-Scale Video Classification with Convolutional Neural Networks. https://doi.org/10.1109/cvpr.2014.223

Identifiers

DOI: 10.1109/cvpr.2014.223