MarioQA: Answering Questions by Watching Gameplay Videos
by
Jonghwan Mun, Paul Hongsuck Seo, Ilchae Jung, Bohyung Han
2017
Abstract
We present a framework to analyze various aspects of models for video
question answering (VideoQA) using customizable synthetic datasets, which are
constructed automatically from gameplay videos. Our work is motivated by the
fact that existing models are often tested only on datasets that either require
excessively high-level reasoning or mostly contain instances answerable by
single-frame inference. Hence, it is difficult to measure the capacity and
flexibility of trained models, and existing techniques often rely on ad-hoc
implementations of deep neural networks without clear insight into datasets and
models. We are particularly interested in understanding temporal relationships
between video events to solve VideoQA problems, because reasoning about
temporal dependencies is one of the aspects that most clearly distinguishes
videos from images. To address this objective, we automatically generate a customized
synthetic VideoQA dataset using Super Mario Bros. gameplay videos so that
it contains events with different levels of reasoning complexity. Using the
dataset, we show that properly constructed datasets with events at various
complexity levels are critical for learning effective models and improving
overall performance.
Archived Files and Locations
application/pdf, 2.3 MB
arxiv.org (repository), web.archive.org (webarchive)
arXiv:1612.01669v2