StyleGAN-V: A Continuous Video Generator with the Price, Image Quality and Perks of StyleGAN2
by
Ivan Skorokhodov, Sergey Tulyakov, Mohamed Elhoseiny
2022
Abstract
Videos show continuous events, yet most, if not all, video synthesis
frameworks treat them discretely in time. In this work, we treat videos as
what they should be: time-continuous signals, and extend the paradigm of
neural representations to build a continuous-time video generator. For this, we
first design continuous motion representations through the lens of positional
embeddings. Then, we explore the question of training on very sparse videos and
demonstrate that a good generator can be learned by using as few as 2 frames
per clip. After that, we rethink the traditional image + video discriminators
pair and design a holistic discriminator that aggregates temporal information
by simply concatenating frames' features. This decreases the training cost and
provides a richer learning signal to the generator, making it possible to train
directly on 1024^2 videos for the first time. We build our model on top of
StyleGAN2 and it is just ≈5% more expensive to train at the same
resolution while achieving almost the same image quality. Moreover, our latent
space features similar properties, enabling spatial manipulations that our
method can propagate in time. We can generate arbitrarily long videos at
an arbitrarily high frame rate, while prior work struggles to generate even 64
frames at a fixed rate. Our model is tested on four modern 256^2 and one
1024^2-resolution video synthesis benchmarks. In terms of sheer metrics, it
performs on average ≈30% better than the closest runner-up. Project
website: https://universome.github.io.
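
As an illustration of the two mechanisms the abstract describes (continuous motion representations built from positional embeddings, and a holistic discriminator that aggregates temporal information by concatenating per-frame features), the sketch below is a minimal, hypothetical PyTorch rendition. It is not the authors' implementation: the frequency schedule, backbone layers, layer sizes, and all names (time_positional_embedding, HolisticVideoDiscriminator, feat_dim) are assumptions made purely for illustration.

# Minimal, hypothetical sketch (not the authors' code) of two ideas from the abstract:
# (1) conditioning on a continuous timestamp via sinusoidal positional embeddings,
# (2) a "holistic" video discriminator that concatenates per-frame features.
import torch
import torch.nn as nn


def time_positional_embedding(t: torch.Tensor, dim: int = 64) -> torch.Tensor:
    """Map continuous timestamps t (shape [B]) to sinusoidal embeddings [B, dim]."""
    freqs = 2.0 ** torch.arange(dim // 2, dtype=torch.float32)  # assumed frequency schedule
    angles = t[:, None] * freqs[None, :]
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=1)


class HolisticVideoDiscriminator(nn.Module):
    """Encodes each frame with a shared 2D backbone, then concatenates the
    per-frame feature vectors over time and scores the whole clip with an MLP."""

    def __init__(self, num_frames: int, feat_dim: int = 256):
        super().__init__()
        self.backbone = nn.Sequential(  # stand-in for a StyleGAN2-style image encoder
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, feat_dim, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Sequential(
            nn.Linear(num_frames * feat_dim, 512), nn.LeakyReLU(0.2),
            nn.Linear(512, 1),  # real/fake logit for the whole clip
        )

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: [B, T, 3, H, W], e.g. a sparse clip of as few as 2 frames
        b, t = video.shape[:2]
        feats = self.backbone(video.flatten(0, 1)).view(b, t, -1)  # [B, T, feat_dim]
        return self.head(feats.flatten(1))  # concatenate features over time, then score


# Usage: score a batch of 2-frame 256x256 clips and embed arbitrary timestamps.
clips = torch.randn(4, 2, 3, 256, 256)
logits = HolisticVideoDiscriminator(num_frames=2)(clips)
emb = time_positional_embedding(torch.rand(4))
print(logits.shape, emb.shape)  # torch.Size([4, 1]) torch.Size([4, 64])

Because the discriminator only reuses a per-frame image encoder plus a small MLP over concatenated features, it adds little cost on top of an image GAN, which is consistent with the abstract's claim of roughly 5% training overhead relative to StyleGAN2.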
Archived Files and Locations
application/pdf, 4.2 MB: arxiv.org (repository), web.archive.org (webarchive)
arXiv: 2112.14683v3