Multi-Modal Multi-Scale Deep Learning for Large-Scale Image Annotation

by Yulei Niu, Zhiwu Lu, Ji-Rong Wen, Tao Xiang, Shih-Fu Chang

Released as an article.

2017  

Abstract

Image annotation aims to annotate a given image with a variable number of class labels corresponding to diverse visual concepts. In this paper, we address two main issues in large-scale image annotation: 1) how to learn a rich feature representation suitable for predicting a diverse set of visual concepts ranging from objects and scenes to abstract concepts; 2) how to annotate an image with the optimal number of class labels. To address the first issue, we propose a novel multi-scale deep model for extracting rich and discriminative features capable of representing a wide range of visual concepts. Specifically, we propose a novel two-branch deep neural network architecture comprising a very deep main network branch and a companion feature-fusion branch designed to fuse the multi-scale features computed from the main branch. The deep model is also made multi-modal by taking noisy user-provided tags as model input to complement the image input. To tackle the second issue, we introduce a label-quantity prediction auxiliary task alongside the main label prediction task to explicitly estimate the optimal number of labels for a given image. Extensive experiments are carried out on two large-scale image annotation benchmark datasets, and the results show that our method significantly outperforms the state of the art.
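The abstract describes the architecture only at a high level. The sketch below is one plausible reading of it: a main image branch tapped at several depths, a companion branch that fuses the multi-scale features, a tag branch for noisy user-provided tags, and an auxiliary head that predicts the label quantity. All module names, layer sizes, the sum-based fusion, and the top-k selection policy are assumptions chosen for illustration, not the paper's actual networks (which are far deeper and trained on the benchmark datasets mentioned above).

```python
# Minimal PyTorch-style sketch of the ideas in the abstract.
# Every name, layer size, and the fusion/selection scheme is an assumption
# for illustration, not the paper's actual implementation.
import torch
import torch.nn as nn


class MultiModalMultiScaleAnnotator(nn.Module):
    def __init__(self, num_labels: int, tag_vocab_size: int, feat_dim: int = 256):
        super().__init__()
        # Main branch: a small convolutional stack standing in for the
        # "very deep" main network; features are tapped at three scales.
        self.block1 = nn.Sequential(nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU())
        self.block2 = nn.Sequential(nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU())
        self.block3 = nn.Sequential(nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU())
        self.pool = nn.AdaptiveAvgPool2d(1)
        # Companion fusion branch: project each scale to a shared dimension
        # and fuse by summation (one possible fusion scheme).
        self.proj1 = nn.Linear(64, feat_dim)
        self.proj2 = nn.Linear(128, feat_dim)
        self.proj3 = nn.Linear(256, feat_dim)
        # Tag branch: encodes noisy user tags given as a multi-hot vector.
        self.tag_encoder = nn.Sequential(nn.Linear(tag_vocab_size, feat_dim), nn.ReLU())
        # Main task head (per-label scores) and auxiliary label-quantity head.
        self.label_head = nn.Linear(2 * feat_dim, num_labels)
        self.quantity_head = nn.Linear(2 * feat_dim, 1)

    def forward(self, image: torch.Tensor, tags: torch.Tensor):
        f1 = self.block1(image)
        f2 = self.block2(f1)
        f3 = self.block3(f2)
        # Pool each scale, project, and fuse in the companion branch.
        visual = (self.proj1(self.pool(f1).flatten(1))
                  + self.proj2(self.pool(f2).flatten(1))
                  + self.proj3(self.pool(f3).flatten(1)))
        textual = self.tag_encoder(tags)
        joint = torch.cat([visual, textual], dim=1)
        return self.label_head(joint), self.quantity_head(joint).squeeze(1)


def annotate(model: nn.Module, image: torch.Tensor, tags: torch.Tensor, max_labels: int = 10):
    """Pick the top-k labels per image, where k comes from the quantity head.
    The rounding and clamping policy is an illustrative assumption."""
    model.eval()
    with torch.no_grad():
        scores, quantity = model(image, tags)            # (B, L), (B,)
        k = quantity.round().clamp(min=1, max=max_labels).long()
        ranked = scores.argsort(dim=1, descending=True)  # best labels first
        return [ranked[i, : int(k[i])].tolist() for i in range(scores.size(0))]


if __name__ == "__main__":
    # Toy run with random inputs to show the expected tensor shapes.
    model = MultiModalMultiScaleAnnotator(num_labels=81, tag_vocab_size=1000)
    images = torch.randn(2, 3, 224, 224)
    tags = torch.zeros(2, 1000)
    tags[:, [3, 42, 77]] = 1.0                            # a few noisy tags
    print(annotate(model, images, tags))
```

Tying the quantity head to the same fused representation lets the predicted label count adapt per image rather than relying on a fixed top-k cutoff, which is the point of the auxiliary task described in the abstract.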

Archived Files and Locations

application/pdf  910.9 kB
arxiv.org (repository)
web.archive.org (webarchive)
Type  article
Stage   submitted
Date   2017-09-05
Version   v1
Language   en
arXiv  1709.01220v1