Multi-Modal Reasoning Graph for Scene-Text Based Fine-Grained Image Classification and Retrieval
by
Andres Mafla, Sounak Dey, Ali Furkan Biten, Lluis Gomez, Dimosthenis Karatzas
2020
Abstract
Scene text instances found in natural images carry explicit semantic
information that can provide important cues to solve a wide array of computer
vision problems. In this paper, we focus on leveraging multi-modal content in
the form of visual and textual cues to tackle the task of fine-grained image
classification and retrieval. First, we obtain the text instances from images
by employing a text reading system. Then, we combine textual features with
salient image regions to exploit the complementary information carried by the
two sources. Specifically, we employ a Graph Convolutional Network to perform
multi-modal reasoning and obtain relationship-enhanced features by learning a
common semantic space between salient objects and text found in an image. By
obtaining an enhanced set of visual and textual features, the proposed model
greatly outperforms the previous state of the art on two different tasks,
fine-grained classification and image retrieval, on the Con-Text and Drink
Bottle datasets.
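
The abstract names the ingredients (region features, text features, a common semantic space, a Graph Convolutional Network) but not the exact architecture. The sketch below is therefore only an illustration of the general idea, not the authors' implementation. It assumes a generic graph convolution step with symmetric normalization, edges weighted by non-negative cosine similarity, and linear projections into a shared space. All function names (`project`, `normalized_adjacency`, `gcn_layer`), dimensions, and numeric values are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def project(features, weight):
    """Linearly map modality-specific features into the shared space."""
    return matmul(features, weight)


def normalized_adjacency(nodes):
    """Build D^-1/2 A D^-1/2, where A holds self-loops on the diagonal and
    non-negative cosine similarity elsewhere (an assumed edge definition)."""
    n = len(nodes)
    a = [[1.0 if i == j else max(0.0, cosine(nodes[i], nodes[j]))
          for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def gcn_layer(nodes, weight):
    """One graph convolution step: ReLU(A_hat @ H @ W)."""
    a_hat = normalized_adjacency(nodes)
    out = matmul(matmul(a_hat, nodes), weight)
    return [[max(0.0, x) for x in row] for row in out]


# Toy inputs (hypothetical): 2 salient regions with 3-d visual features,
# 1 scene-text token with a 2-d textual embedding.
visual = [[1.0, 0.0, 0.5], [0.2, 0.9, 0.1]]
textual = [[0.3, 0.7]]

# Hypothetical projection weights into a 2-d common space.
w_visual = [[0.5, 0.1], [0.2, 0.6], [0.3, 0.3]]
w_text = [[0.4, 0.2], [0.1, 0.8]]
w_gcn = [[0.7, -0.2], [0.1, 0.9]]

# Visual and textual nodes share one graph after projection.
nodes = project(visual, w_visual) + project(textual, w_text)
enhanced = gcn_layer(nodes, w_gcn)
```

After this step each row of `enhanced` mixes information from every node it is connected to, so a region's feature is influenced by the text found in the same image and vice versa. In a trained model the weights would be learned rather than fixed by hand.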
arXiv:2009.09809v1