Structure Preserving Convolutional Attention for Image Captioning

by Shichen Lu, Ruimin Hu, Jing Liu, Longteng Guo, Fei Zheng

Published in Applied Sciences by MDPI AG.

2019, Issue 14, p. 2888

Abstract

In image captioning, learning attentive image regions is necessary to adaptively and precisely focus on the object semantics relevant to each decoded word. In this paper, we propose a convolutional attention module that preserves the spatial structure of the image by performing the convolution operation directly on the 2D feature maps. The proposed attention mechanism contains two components, convolutional spatial attention and cross-channel attention, which determine the regions used to describe the image along the spatial and channel dimensions, respectively. Both attention components are computed at each decoding step. To preserve the spatial structure, instead of operating on the vector representation of each image grid, both components are computed directly on the entire feature maps with convolution operations. Experiments on two large-scale datasets (MSCOCO and Flickr30K) demonstrate the outstanding performance of our proposed method.
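
The abstract describes spatial and channel attention computed with convolutions directly on the 2D feature maps, conditioned on the decoder state at each step. The sketch below is a minimal, hypothetical PyTorch rendering of that idea for orientation only; the layer sizes, the 3x3 kernel, the squeeze-style channel branch, and the way the decoder state is fused are assumptions, not the authors' implementation.

```python
# Minimal sketch: convolutional spatial attention + cross-channel attention
# computed on 2D feature maps, conditioned on the decoder hidden state.
# All shapes and the fusion scheme are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvAttention(nn.Module):
    def __init__(self, feat_channels=2048, hidden_size=512):
        super().__init__()
        # project the decoder hidden state so it can be broadcast over the grid
        self.h_proj = nn.Linear(hidden_size, feat_channels)
        # convolutional spatial attention: one score per grid location
        self.spatial_conv = nn.Conv2d(feat_channels, 1, kernel_size=3, padding=1)
        # cross-channel attention: one gate per channel (squeeze-style, assumed)
        self.channel_fc = nn.Linear(feat_channels + hidden_size, feat_channels)

    def forward(self, feats, h):
        # feats: (B, C, H, W) CNN feature maps; h: (B, hidden) decoder state
        B, C, H, W = feats.shape
        # fuse the decoder state into the maps without flattening the grid
        fused = feats + self.h_proj(h).view(B, C, 1, 1)
        # spatial attention over the H*W locations, kept as a 2D map
        spatial_logits = self.spatial_conv(torch.tanh(fused))            # (B, 1, H, W)
        spatial_attn = F.softmax(spatial_logits.view(B, -1), dim=-1).view(B, 1, H, W)
        # channel attention from globally pooled features and the decoder state
        pooled = feats.mean(dim=(2, 3))                                  # (B, C)
        channel_attn = torch.sigmoid(self.channel_fc(torch.cat([pooled, h], dim=-1)))
        # attended context vector for the current decoding step
        context = (feats * spatial_attn).sum(dim=(2, 3)) * channel_attn  # (B, C)
        return context, spatial_attn, channel_attn

# usage: one attention step per decoded word
feats = torch.randn(2, 2048, 7, 7)   # e.g. conv features from a CNN backbone
h = torch.randn(2, 512)              # decoder LSTM hidden state
context, sp, ch = ConvAttention()(feats, h)
```

The point of the sketch is that the spatial scores are produced by a convolution over the intact H x W grid rather than by scoring flattened region vectors, which is the structure-preserving property the abstract emphasizes.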

Archived Files and Locations

application/pdf  2.4 MB
file_turdtq6dibcerlseuzw7dwqnqy
web.archive.org (webarchive)
res.mdpi.com (publisher)
Type: article-journal
Stage: published
Date: 2019-07-19
Language: en
Container Metadata
Open Access Publication
In DOAJ
In ISSN ROAD
In Keepers Registry
ISSN-L:  2076-3417
Catalog Record
Revision: abd321da-55e3-4504-bb2d-ef15caf58624