Efficient training for future video generation based on hierarchical disentangled representation of latent variables
by
Naoya Fushishita, Antonio Tejero-de-Pablos, Yusuke Mukuta, Tatsuya Harada
2021
Abstract
Generating videos that predict the future of a given sequence has been an active research area in recent years. However, an essential problem remains unsolved: most methods require a large computational cost and high memory usage for training. In this paper, we propose a novel method for generating future-prediction videos with less memory usage than conventional methods. This is a critical stepping stone on the path towards generating videos whose image quality is comparable to that of the latest works in the field of image generation. We achieve high efficiency by training our method in two stages: (1) image reconstruction, to encode video frames into latent variables, and (2) latent variable prediction, to generate the future sequence. Our method achieves an efficient compression of video into low-dimensional latent variables by decomposing each frame according to its hierarchical structure. That is, we consider that a video can be separated into a background and foreground objects, and that each object independently holds time-varying and time-independent information. Our experiments show that the proposed method can efficiently generate future-prediction videos, even for complex datasets that previous methods cannot handle.
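
To make the two-stage scheme concrete, below is a minimal sketch in PyTorch of how such a pipeline could look. This is not the authors' implementation: the module names, latent dimensions, MLP layers, and the one-step predictor are all hypothetical placeholders illustrating the idea of (1) learning a per-frame autoencoder with disentangled background/foreground and static/dynamic latents, then (2) training a predictor only over the low-dimensional time-varying latent.

    # Minimal sketch (not the authors' code) of the two-stage training
    # described in the abstract. All names and sizes are hypothetical.
    import torch
    import torch.nn as nn

    class HierarchicalEncoder(nn.Module):
        """Encodes one frame into disentangled latents: background,
        time-independent foreground, and time-varying foreground."""
        def __init__(self, frame_dim=3 * 64 * 64, z_bg=32, z_fg_static=32, z_fg_dyn=16):
            super().__init__()
            self.backbone = nn.Sequential(nn.Flatten(), nn.Linear(frame_dim, 256), nn.ReLU())
            self.to_bg = nn.Linear(256, z_bg)                # background latent
            self.to_fg_static = nn.Linear(256, z_fg_static)  # time-independent object latent
            self.to_fg_dyn = nn.Linear(256, z_fg_dyn)        # time-varying object latent

        def forward(self, frame):
            h = self.backbone(frame)
            return self.to_bg(h), self.to_fg_static(h), self.to_fg_dyn(h)

    class Decoder(nn.Module):
        """Reconstructs a frame from the concatenated latents."""
        def __init__(self, frame_dim=3 * 64 * 64, z_total=32 + 32 + 16):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(z_total, 256), nn.ReLU(),
                                     nn.Linear(256, frame_dim))

        def forward(self, z_bg, z_fg_static, z_fg_dyn):
            return self.net(torch.cat([z_bg, z_fg_static, z_fg_dyn], dim=-1))

    enc, dec = HierarchicalEncoder(), Decoder()

    # Stage 1: per-frame reconstruction; no video sequences are needed yet.
    frame = torch.randn(8, 3, 64, 64)                 # dummy batch of frames
    recon = dec(*enc(frame)).view(8, 3, 64, 64)
    stage1_loss = nn.functional.mse_loss(recon, frame)

    # Stage 2: freeze the autoencoder and train a small predictor that rolls
    # only the low-dimensional time-varying latent forward in time; the
    # background and static-object latents are simply reused.
    predictor = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 16))
    z_bg, z_static, z_dyn_t = enc(frame)
    z_dyn_next = predictor(z_dyn_t.detach())          # predicted next-step dynamics
    next_frame_pred = dec(z_bg.detach(), z_static.detach(), z_dyn_next)

Under these assumptions, the sequence model only ever sees the small dynamic latent rather than full frames, so the memory cost of training on video scales with the latent size instead of the image resolution, which is the efficiency argument the abstract makes.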
Archived Files and Locations
application/pdf, 18.2 MB
arxiv.org (repository); web.archive.org (webarchive)
arXiv: 2106.03502v2