Attentive and Contrastive Learning for Joint Depth and Motion Field Estimation
by
Seokju Lee, Francois Rameau, Fei Pan, In So Kweon
2021
Abstract
Estimating the motion of the camera together with the 3D structure of the
scene from a monocular vision system is a complex task that often relies on the
so-called scene rigidity assumption. When observing a dynamic environment, this
assumption is violated, which leads to an ambiguity between the ego-motion of
the camera and the motion of the objects. To solve this problem, we present a
self-supervised learning framework for 3D object motion field estimation from
monocular videos. Our contributions are two-fold. First, we propose a two-stage
projection pipeline to explicitly disentangle the camera ego-motion and the
object motions with a dynamics attention module, called DAM. Specifically, we
design an integrated motion model that estimates the motion of the camera and
object in the first and second warping stages, respectively, controlled by the
attention module through a shared motion encoder. Second, we propose object
motion field estimation through contrastive sample consensus, called CSAC,
taking advantage of a weak semantic prior (a bounding box from an object detector)
and geometric constraints (each object respects the rigid body motion model).
Experiments on KITTI, Cityscapes, and Waymo Open Dataset demonstrate the
relevance of our approach and show that our method outperforms state-of-the-art
algorithms for the tasks of self-supervised monocular depth estimation, object
motion segmentation, monocular scene flow estimation, and visual odometry.
Archived Files and Locations
application/pdf, 4.5 MB
arxiv.org (repository), web.archive.org (webarchive)
arXiv:2110.06853v1