Structure Preserving Convolutional Attention for Image Captioning
by
Shichen Lu, Ruimin Hu, Jing Liu, Longteng Guo, Fei Zheng
Abstract
In the task of image captioning, learning attentive image regions is necessary to focus adaptively and precisely on the object semantics relevant to each decoded word. In this paper, we propose a convolutional attention module that preserves the spatial structure of the image by performing the convolution operation directly on the 2D feature maps. The proposed attention mechanism contains two components: convolutional spatial attention and cross-channel attention, which determine the intended regions for describing the image along the spatial and channel dimensions, respectively. Both attentions are computed at each decoding step. To preserve the spatial structure, the two attention components operate with convolutions directly on the entire feature maps rather than on the vector representation of each image grid. Experiments on two large-scale datasets (MSCOCO and Flickr30K) demonstrate the strong performance of our proposed method.
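The abstract describes two attention components computed directly on 2D feature maps rather than on flattened grid vectors. The sketch below illustrates that idea in numpy under stated assumptions: it is not the authors' implementation, and all weight names (`Wf`, `Wh`, `wa`, `Wc`, `wc`) and shapes are hypothetical. A 1x1 convolution stands in for the paper's convolution on the feature maps (a larger kernel would work the same way); spatial attention produces a weight per location, channel attention a weight per channel.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def conv2d_1x1(feats, w):
    """1x1 convolution: (C_in, H, W) projected by w (C_out, C_in) -> (C_out, H, W)."""
    c, h, wd = feats.shape
    return (w @ feats.reshape(c, -1)).reshape(-1, h, wd)

def spatial_attention(feats, hidden, Wf, Wh, wa):
    """Attend over spatial locations of the 2D feature maps.

    feats: (C, H, W) feature maps; hidden: (D,) decoder state.
    The convolution keeps the H x W layout intact, so spatial
    structure is preserved until the final weighted pooling.
    """
    c, h, w = feats.shape
    fproj = conv2d_1x1(feats, Wf)                              # (K, H, W)
    hproj = (Wh @ hidden)[:, None, None]                       # (K, 1, 1), broadcast
    scores = np.tensordot(wa, np.tanh(fproj + hproj), axes=1)  # (H, W)
    alpha = softmax(scores.reshape(-1), axis=0).reshape(h, w)  # spatial weights
    context = (feats * alpha[None]).sum(axis=(1, 2))           # (C,) attended vector
    return context, alpha

def channel_attention(feats, hidden, Wc, wc):
    """Reweight channels given the decoder state (hypothetical scoring form)."""
    pooled = feats.mean(axis=(1, 2))             # (C,) per-channel summary
    scores = wc @ np.tanh(Wc @ hidden) + pooled  # (C,) channel scores
    beta = softmax(scores, axis=0)               # channel weights
    return feats * beta[:, None, None], beta
```

At each decoding step both functions would be called with the current decoder state, the channel-reweighted maps feeding the spatial attention (or vice versa; the abstract does not fix an order, so the composition here is an assumption).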
Archived Files and Locations: application/pdf, 2.4 MB (web.archive.org; res.mdpi.com, publisher)
Open Access Publication (ISSN-L: 2076-3417; indexed in DOAJ, ISSN ROAD, and Keepers Registry)