Memory Warps for Learning Long-Term Online Video Representations
by Tuan-Hung Vu, Wongun Choi, Samuel Schulter, Manmohan Chandraker (2018)
Abstract
This paper proposes a novel memory-based online video representation that is
efficient, accurate and predictive. This is in contrast to prior works that
often rely on computationally heavy 3D convolutions, ignore actual motion when
aligning features over time, or operate in an off-line mode to utilize future
frames. In particular, our memory (i) holds the feature representation, (ii) is
spatially warped over time to compensate for observer and scene motions, (iii)
can carry long-term information, and (iv) enables predicting feature
representations in future frames. By exploring a variant that operates at
multiple temporal scales, we efficiently learn across even longer time
horizons. We apply our online framework to object detection in videos,
obtaining a large 2.3× speed-up while losing only 0.9% mAP on the ImageNet-VID
dataset, compared to prior works that even use future frames. Finally, we
demonstrate the predictive property of our representation in two novel
detection setups, where features are propagated over time to (i) significantly
enhance a real-time detector by more than 10% mAP in a multi-threaded online
setup and to (ii) anticipate objects in future frames.
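The warping step (ii) is the core of the representation. One plausible reading, sketched below in PyTorch (an assumption; the abstract does not prescribe a framework), is that the memory feature map is displaced along an estimated backward flow field via bilinear sampling and then blended with features extracted from the incoming frame. The function names warp_memory and update_memory and the fixed blending weight alpha are hypothetical, not the authors' code.

    import torch
    import torch.nn.functional as F

    def warp_memory(memory: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
        """Warp a (N, C, H, W) memory along a (N, 2, H, W) backward flow.

        flow[:, 0] / flow[:, 1] give, for each location in the current
        frame, the x / y pixel displacement to its source location in the
        previous frame.
        """
        _, _, h, w = memory.shape
        ys, xs = torch.meshgrid(
            torch.arange(h, dtype=memory.dtype, device=memory.device),
            torch.arange(w, dtype=memory.dtype, device=memory.device),
            indexing="ij",
        )
        # Displace the base grid by the flow, then normalize coordinates
        # to [-1, 1] as expected by grid_sample.
        x = 2.0 * (xs.unsqueeze(0) + flow[:, 0]) / max(w - 1, 1) - 1.0
        y = 2.0 * (ys.unsqueeze(0) + flow[:, 1]) / max(h - 1, 1) - 1.0
        grid = torch.stack((x, y), dim=-1)  # (N, H, W, 2)
        return F.grid_sample(memory, grid, align_corners=True)

    def update_memory(memory, features, flow, alpha=0.5):
        """Align the old memory to the current frame, then blend in the
        new per-frame features (alpha is a hypothetical fixed weight)."""
        return alpha * warp_memory(memory, flow) + (1.0 - alpha) * features

Read this way, the predictive property (iv) falls out naturally: applying warp_memory with an extrapolated flow, without blending in new features, propagates the representation into a future frame that has not yet been observed.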
Archived Files and Locations
application/pdf 4.2 MB
arxiv.org (repository): 1803.10861v1
web.archive.org (webarchive)