Abstract

Inspired by recent work in machine translation and object detection, we introduce an attention-based model that automatically learns to describe the content of images. We describe how we can train this model in a deterministic manner using standard backpropagation techniques and stochastically by maximizing a variational lower bound. We also show through visualization how the model is able to automatically learn to fix its gaze on salient objects while generating the corresponding words in the output sequence. We validate the use of attention with state-of-the-art performance on three benchmark datasets: Flickr8k, Flickr30k and MS COCO.
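The abstract mentions two training regimes for the attention mechanism: a deterministic ("soft") variant trainable with standard backpropagation, and a stochastic ("hard") variant trained by maximizing a variational lower bound. The snippet below is a minimal, hypothetical sketch of the deterministic soft-attention step only, written in PyTorch rather than taken from the authors' code; the `SoftAttention` class, layer names, and dimensions are illustrative assumptions.

```python
# Minimal sketch (not the authors' implementation) of soft attention over
# image regions: score each CNN annotation vector against the decoder's
# previous hidden state, normalize with softmax, and take the weighted
# average as the context vector fed to the caption decoder.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftAttention(nn.Module):
    def __init__(self, feature_dim, hidden_dim, attn_dim):
        super().__init__()
        self.proj_features = nn.Linear(feature_dim, attn_dim)  # project region features
        self.proj_hidden = nn.Linear(hidden_dim, attn_dim)     # project decoder state
        self.score = nn.Linear(attn_dim, 1)                    # scalar relevance per region

    def forward(self, features, hidden):
        # features: (batch, L, feature_dim) annotation vectors for L image regions
        # hidden:   (batch, hidden_dim)     previous decoder hidden state
        e = self.score(torch.tanh(
            self.proj_features(features) + self.proj_hidden(hidden).unsqueeze(1)
        )).squeeze(-1)                                          # (batch, L) unnormalized scores
        alpha = F.softmax(e, dim=1)                             # attention weights over regions
        context = (alpha.unsqueeze(-1) * features).sum(dim=1)   # expected context vector
        return context, alpha                                   # alpha is the visualizable "gaze"

# Example usage with arbitrary sizes:
# attn = SoftAttention(feature_dim=512, hidden_dim=1024, attn_dim=256)
# context, alpha = attn(torch.randn(4, 196, 512), torch.randn(4, 1024))
```

Because the context vector is a differentiable weighted average of the region features, gradients flow through the attention weights and the whole model can be optimized end-to-end with standard backpropagation, matching the deterministic case described in the abstract; the stochastic variant instead samples a single region from the attention distribution and is trained by maximizing a variational lower bound on the caption likelihood.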

Keywords

Computer science, Benchmark, Artificial intelligence, Gaze, Visualization, Object, Salient, Sequence, Backpropagation, Image, Machine learning, Artificial neural network, Computer vision

Publication Info

Year
2015
Type
preprint
Citations
1750
Access
Closed

Citation Metrics

1750 (OpenAlex)

Cite This

Kelvin Xu, Jimmy Ba, Ryan Kiros et al. (2015). Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. arXiv (Cornell University). https://doi.org/10.48550/arxiv.1502.03044

Identifiers

DOI
10.48550/arxiv.1502.03044