Motion Segmentation using Frequency Domain Transformer Networks
by
Hafez Farazi, Sven Behnke
2020
Abstract
Self-supervised prediction is a powerful mechanism to learn representations
that capture the underlying structure of the data. Despite recent progress, the
self-supervised video prediction task is still challenging. One of the critical
factors that makes the task hard is motion segmentation: segmenting
individual objects from the background and estimating their motion separately.
In video prediction, the shape, appearance, and transformation of each object
must be inferred solely from the task of predicting the next frame in pixel space. To
address this task, we propose a novel end-to-end learnable architecture that
predicts the next frame by modeling foreground and background separately while
simultaneously estimating and predicting the foreground motion using Frequency
Domain Transformer Networks. Experimental evaluations show that this yields
interpretable representations and that our approach can outperform some widely
used video prediction methods like Video Ladder Network and Predictive Gated
Pyramids on synthetic data.
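The frequency-domain motion modeling mentioned in the abstract rests on the Fourier shift theorem: a spatial translation of an image corresponds to a linear phase shift of its spectrum, so motion can be estimated from phase differences between frames and extrapolated to synthesize the next frame. The following is a minimal NumPy sketch of that underlying idea (plain phase correlation on a single-channel image under pure integer translation), not the authors' learnable architecture; all function names here are illustrative.

```python
import numpy as np

def estimate_shift(frame_a, frame_b):
    """Estimate the integer translation taking frame_a to frame_b
    via phase correlation (Fourier shift theorem)."""
    Fa = np.fft.fft2(frame_a)
    Fb = np.fft.fft2(frame_b)
    # Normalized cross-power spectrum isolates the phase difference.
    cross_power = np.conj(Fa) * Fb
    cross_power /= np.abs(cross_power) + 1e-12
    corr = np.fft.ifft2(cross_power)
    peak = np.unravel_index(np.argmax(np.abs(corr)), corr.shape)
    # Wrap peaks past the midpoint to negative shifts.
    dy, dx = (p if p <= s // 2 else p - s for p, s in zip(peak, corr.shape))
    return dy, dx

def shift_in_frequency_domain(frame, dy, dx):
    """Translate a frame by (dy, dx) by multiplying its spectrum
    with a linear phase ramp -- the 'transformation' applied in
    frequency space rather than pixel space."""
    F = np.fft.fft2(frame)
    ky = np.fft.fftfreq(frame.shape[0])[:, None]
    kx = np.fft.fftfreq(frame.shape[1])[None, :]
    phase = np.exp(-2j * np.pi * (ky * dy + kx * dx))
    return np.real(np.fft.ifft2(F * phase))

# Toy prediction loop: estimate motion between two observed frames,
# then apply the same phase shift again to predict the next frame.
frame0 = np.zeros((16, 16))
frame0[4:7, 5:8] = 1.0
frame1 = np.roll(frame0, (2, 3), axis=(0, 1))  # object moved by (2, 3)

dy, dx = estimate_shift(frame0, frame1)
predicted_frame2 = shift_in_frequency_domain(frame1, dy, dx)
```

In the toy loop above, the estimated shift is (2, 3) and the predicted frame equals the object translated by (2, 3) once more; the paper's contribution is making this estimate-and-apply step differentiable and combining it with a learned foreground/background split.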
Archived Files and Locations
application/pdf 4.0 MB
arxiv.org (repository), web.archive.org (webarchive)
arXiv:2004.08638v1