Rethinking Self-supervised Correspondence Learning: A Video Frame-level Similarity Perspective

by Jiarui Xu, Xiaolong Wang

Released as an article.

2021  

Abstract

Learning a good representation for space-time correspondence is the key for various computer vision tasks, including tracking object bounding boxes and performing video object pixel segmentation. To learn a generalizable representation for correspondence at large scale, a variety of self-supervised pretext tasks have been proposed to explicitly perform object-level or patch-level similarity learning. Instead of following the previous literature, we propose to learn correspondence using Video Frame-level Similarity (VFS) learning, i.e., simply learning from comparing video frames. Our work is inspired by the recent success of image-level contrastive learning and similarity learning for visual recognition. Our hypothesis is that if the representation is good for recognition, it requires the convolutional features to find correspondence between similar objects or parts. Our experiments show the surprising result that VFS surpasses state-of-the-art self-supervised approaches for both OTB visual object tracking and DAVIS video object segmentation. We perform detailed analysis of what matters in VFS and reveal new properties of image- and frame-level similarity learning. The project page is available at https://jerryxu.net/VFS.
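The abstract describes VFS only at a high level: sample frames from the same video and train with an image-level similarity objective. As a rough illustration, the sketch below pairs two frames and optimizes a SimSiam-style negative-free similarity loss in PyTorch. The names encoder, predictor, and vfs_loss are illustrative assumptions rather than the authors' released code, and the paper also studies variants of the objective (e.g., with negative pairs).

    # Minimal sketch of frame-level similarity learning, assuming a
    # SimSiam-style negative-free objective. Not the official VFS code.
    import torch
    import torch.nn.functional as F

    def vfs_loss(encoder, predictor, frame_a, frame_b):
        # frame_a, frame_b: two augmented frames sampled from the same video,
        # each a tensor of shape (N, C, H, W).
        z_a, z_b = encoder(frame_a), encoder(frame_b)   # embeddings (N, D)
        p_a, p_b = predictor(z_a), predictor(z_b)       # predictions (N, D)

        def d(p, z):
            # Negative cosine similarity with stop-gradient on the target.
            return -F.cosine_similarity(p, z.detach(), dim=-1).mean()

        # Symmetrized loss over the two frame orderings.
        return 0.5 * d(p_a, z_b) + 0.5 * d(p_b, z_a)

In training, frame_a and frame_b would be two temporally separated, independently augmented frames drawn from the same clip, so that minimizing the loss pulls frame-level representations of the same video together.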

Archived Files and Locations

application/pdf  5.1 MB
file_qalhah2eh5e5nm3du5abgdrl7e
arxiv.org (repository)
web.archive.org (webarchive)
Type  article
Stage   submitted
Date   2021-03-31
Version   v1
Language   en
arXiv  2103.17263v1
Work Entity
access all versions, variants, and formats of this work (e.g., pre-prints)
Catalog Record
Revision: b09e8ade-be6f-46ab-aa15-9f9d95aeae1c
API URL: JSON