Localizing Moments in Video with Natural Language
by Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, Bryan Russell
2017
Abstract
We consider retrieving a specific temporal segment, or moment, from a video
given a natural language text description. Methods designed to retrieve whole
video clips with natural language determine what occurs in a video but not
when. To address this issue, we propose the Moment Context Network (MCN) which
effectively localizes natural language queries in videos by integrating local
and global video features over time. A key obstacle to training our MCN model
is that current video datasets do not include pairs of localized video segments
and referring expressions, or text descriptions which uniquely identify a
corresponding moment. Therefore, we collect the Distinct Describable Moments
(DiDeMo) dataset which consists of over 10,000 unedited, personal videos in
diverse visual settings with pairs of localized video segments and referring
expressions. We demonstrate that MCN outperforms several baseline methods and
believe that our initial results together with the release of DiDeMo will
inspire further research on localizing video moments with natural language.
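The abstract describes MCN as localizing a query by integrating local and global video features over time. Below is a minimal sketch of that idea in PyTorch, assuming pre-extracted per-moment visual features and pre-embedded query word vectors; the layer sizes, the class name, and the squared-distance scoring rule are illustrative assumptions for this sketch, not the authors' released implementation.

import torch
import torch.nn as nn

class MomentScorer(nn.Module):
    # Hedged sketch: embed a candidate moment (local features, global
    # video context, temporal endpoints) and a language query into a
    # shared space; a lower distance means a better match. Dimensions
    # and layer choices here are assumptions, not the paper's spec.
    def __init__(self, vis_dim=4096, word_dim=300, embed_dim=256):
        super().__init__()
        # moment vector = [local feats ; global feats ; 2 endpoints]
        self.vis_fc = nn.Linear(2 * vis_dim + 2, embed_dim)
        self.lang_lstm = nn.LSTM(word_dim, embed_dim, batch_first=True)

    def forward(self, local_feat, global_feat, endpoints, words):
        # local_feat, global_feat: (B, vis_dim); endpoints: (B, 2)
        # words: (B, T, word_dim) pre-embedded query tokens
        moment = self.vis_fc(
            torch.cat([local_feat, global_feat, endpoints], dim=-1))
        _, (h, _) = self.lang_lstm(words)
        query = h[-1]  # final hidden state, (B, embed_dim)
        return ((moment - query) ** 2).sum(dim=-1)  # squared L2 distance

# Usage: score six candidate segments of one video against a query
# and return the best-matching moment (dummy tensors for illustration).
scorer = MomentScorer()
local = torch.randn(6, 4096)
glob = torch.randn(6, 4096)
ends = torch.tensor([[i / 6, (i + 1) / 6] for i in range(6)])
words = torch.randn(6, 8, 300)  # the same query broadcast per moment
best = scorer(local, glob, ends, words).argmin()

At retrieval time one would score every candidate segment and return the argmin; training would use a ranking loss that pulls the described moment closer to the query than competing segments, consistent with the retrieval framing in the abstract.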
Archived Files and Locations
application/pdf, 10.0 MB
arxiv.org (repository), web.archive.org (webarchive)
arXiv: 1708.01641v1