Rethinking Self-supervised Correspondence Learning: A Video Frame-level Similarity Perspective
by Jiarui Xu, Xiaolong Wang (2021)
Abstract
Learning a good representation for space-time correspondence is the key for
various computer vision tasks, including tracking object bounding boxes and
performing video object pixel segmentation. To learn generalizable
representations for correspondence at large scale, a variety of self-supervised
pretext tasks have been proposed to explicitly perform object-level or
patch-level similarity learning. Instead of following this literature, we
propose to learn correspondence using Video Frame-level Similarity (VFS)
learning, i.e.,
simply learning from comparing video frames. Our work is inspired by the recent
success in image-level contrastive learning and similarity learning for visual
recognition. Our hypothesis is that if the representation is good for
recognition, it requires the convolutional features to find correspondence
between similar objects or parts. Our experiments show surprising results that
VFS surpasses state-of-the-art self-supervised approaches for both OTB visual
object tracking and DAVIS video object segmentation. We perform a detailed
analysis of what matters in VFS and reveal new properties of image- and
frame-level similarity learning. The project page is available at
https://jerryxu.net/VFS.
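
The abstract does not spell out the training objective, but the core idea,
learning by comparing frames of the same video, can be illustrated with a
short sketch. The following is a minimal PyTorch example assuming a
SimSiam-style setup (ResNet-50 encoder, projection and prediction MLPs, and a
symmetric negative cosine similarity loss with stop-gradient on the targets);
the class and variable names are hypothetical, and the authors' actual
implementation is the one linked from the project page.

    # Sketch of frame-level similarity learning in the spirit of VFS.
    # Assumption (not stated in the abstract): a SimSiam-style objective.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    import torchvision.models as models

    class FrameSimilarityModel(nn.Module):
        def __init__(self, feat_dim=2048, proj_dim=256):
            super().__init__()
            backbone = models.resnet50(weights=None)
            backbone.fc = nn.Identity()  # keep the pooled 2048-d feature
            self.encoder = backbone
            self.projector = nn.Sequential(
                nn.Linear(feat_dim, proj_dim), nn.ReLU(inplace=True),
                nn.Linear(proj_dim, proj_dim),
            )
            self.predictor = nn.Sequential(  # SimSiam-style prediction head
                nn.Linear(proj_dim, proj_dim), nn.ReLU(inplace=True),
                nn.Linear(proj_dim, proj_dim),
            )

        def forward(self, frame_a, frame_b):
            # Encode two frames sampled from the same video.
            z_a = self.projector(self.encoder(frame_a))
            z_b = self.projector(self.encoder(frame_b))
            p_a, p_b = self.predictor(z_a), self.predictor(z_b)
            # Symmetric negative cosine similarity; stop-gradient on targets.
            loss = -(F.cosine_similarity(p_a, z_b.detach(), dim=-1).mean()
                     + F.cosine_similarity(p_b, z_a.detach(), dim=-1).mean()) / 2
            return loss

    # Usage: frames_a/frames_b are batches of frame pairs from the same videos.
    model = FrameSimilarityModel()
    frames_a = torch.randn(4, 3, 224, 224)
    frames_b = torch.randn(4, 3, 224, 224)
    loss = model(frames_a, frames_b)
    loss.backward()

Under this reading, the only supervision signal is temporal co-occurrence:
two frames of one video are treated as two "views" of the same content, the
role that augmented crops of one image play in image-level similarity learning.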
Archived Files and Locations
application/pdf, 5.1 MB: arXiv:2103.17263v1 (arxiv.org; mirrored at web.archive.org)