MuMuQA: Multimedia Multi-Hop News Question Answering via Cross-Media Knowledge Extraction and Grounding
by
Revanth Gangi Reddy, Xilin Rui, Manling Li, Xudong Lin, Haoyang Wen, Jaemin Cho, Lifu Huang, Mohit Bansal, Avirup Sil, Shih-Fu Chang, Alexander Schwing, Heng Ji
2021
Abstract
Recently, there has been an increasing interest in building question
answering (QA) models that reason across multiple modalities, such as text and
images. However, QA using images is often limited to just picking the answer
from a pre-defined set of options. In addition, images in the real world,
especially in news, have objects that are co-referential to the text, with
complementary information from both modalities. In this paper, we present a new
QA evaluation benchmark with 1,384 questions over news articles that require
cross-media grounding of objects in images onto text. Specifically, the task
involves multi-hop questions that require reasoning over image-caption pairs to
identify the grounded visual object being referred to and then predicting a
span from the news body text to answer the question. In addition, we introduce
a novel multimedia data augmentation framework, based on cross-media knowledge
extraction and synthetic question-answer generation, to automatically augment
data that can provide weak supervision for this task. We evaluate both
pipeline-based and end-to-end pretraining-based multimedia QA models on our
benchmark, and show that they achieve promising performance while still lagging
considerably behind human performance, leaving large room for future work on
this challenging new task.
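The two-hop setup described above (first grounding the question's image reference via the image-caption pair, then extracting an answer span from the article body) can be illustrated with a rough sketch. This is not the authors' implementation: the grounding step below is a hypothetical placeholder, and the span extractor is an off-the-shelf Hugging Face question-answering pipeline used purely as a stand-in reader.

```python
# Minimal sketch of a pipeline-style approach to the MuMuQA task, assuming a
# hypothetical grounding step and a generic extractive reader. Not the
# benchmark's reference system.

from transformers import pipeline

# Stand-in extractive QA reader (the paper's models may differ).
reader = pipeline("question-answering", model="deepset/roberta-base-squad2")


def ground_question_to_caption(question: str, caption: str) -> str:
    """Hypothetical hop 1: decide which visual object the question refers to,
    using the caption as the textual bridge to that object. Here we simply
    return the caption as a proxy for the grounded entity mention."""
    return caption


def answer(question: str, caption: str, article_body: str) -> str:
    # Hop 1: identify the grounded entity the question is asking about.
    bridge = ground_question_to_caption(question, caption)
    # Hop 2: condition the question on the bridge entity and extract an
    # answer span from the news body text.
    hop2_question = f"{question} (regarding: {bridge})"
    result = reader(question=hop2_question, context=article_body)
    return result["answer"]


if __name__ == "__main__":
    print(answer(
        "What organization does the person shown speaking lead?",
        "A speaker addresses delegates at the assembly hall.",
        "The secretary-general of the organization addressed delegates on Monday...",
    ))
```

In practice, the grounding hop would involve actual cross-media reasoning (e.g., object detection plus object-caption coreference); the sketch only shows how the two hops compose.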
Archived Files and Locations
application/pdf, 6.7 MB (file_juswdeearvc5fhfiznbmpnanr4)
  arxiv.org (repository)
  web.archive.org (webarchive)
arXiv: 2112.10728v1
access all versions, variants, and formats of this work (e.g., pre-prints)