DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting
by
Yongming Rao, Wenliang Zhao, Guangyi Chen, Yansong Tang, Zheng Zhu, Guan Huang, Jie Zhou, Jiwen Lu
2022
Abstract
Recent progress has shown that large-scale pre-training using contrastive
image-text pairs can be a promising alternative for high-quality visual
representation learning from natural language supervision. Benefiting from a
broader source of supervision, this new paradigm exhibits impressive
transferability to downstream classification tasks and datasets. However, the
problem of transferring the knowledge learned from image-text pairs to more
complex dense prediction tasks has barely been explored. In this work, we
present a new framework for dense prediction by implicitly and explicitly
leveraging the pre-trained knowledge from CLIP. Specifically, we convert the
original image-text matching problem in CLIP to a pixel-text matching problem
and use the pixel-text score maps to guide the learning of dense prediction
models. By further using the contextual information from the image to prompt
the language model, we enable our model to better exploit the pre-trained
knowledge. Our method is model-agnostic: it can be applied to arbitrary
dense prediction systems and various pre-trained visual backbones
including both CLIP models and ImageNet pre-trained models. Extensive
experiments demonstrate the superior performance of our method on semantic
segmentation, object detection, and instance segmentation tasks. Code is
available at https://github.com/raoyongming/DenseCLIP.
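
The following minimal PyTorch-style sketch illustrates the two ideas summarized in the abstract: pixel-text score maps obtained by matching dense image features against class-prompt embeddings, and context-aware prompting that conditions the text embeddings on visual context. All function and module names, tensor shapes, and the cross-attention design are illustrative assumptions for exposition, not the authors' released implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def pixel_text_score_maps(pixel_feats, text_feats, temperature=0.07):
        # pixel_feats: (B, C, H, W) dense features from the image encoder.
        # text_feats:  (B, K, C) context-conditioned embeddings of K class prompts.
        # Returns:     (B, K, H, W) pixel-text similarity score maps.
        # L2-normalize so the dot product is a cosine similarity, mirroring
        # CLIP's image-text matching objective at the pixel level.
        pixel_feats = F.normalize(pixel_feats, dim=1)
        text_feats = F.normalize(text_feats, dim=-1)
        return torch.einsum("bchw,bkc->bkhw", pixel_feats, text_feats) / temperature

    class ContextAwarePrompt(nn.Module):
        # Hypothetical module (an assumption for this sketch): refine the
        # class-prompt embeddings with image context via cross-attention,
        # so that the prompts fed to the language side depend on the image.
        def __init__(self, dim, num_heads=4):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

        def forward(self, text_feats, visual_tokens):
            # text_feats:    (B, K, C) class-prompt embeddings.
            # visual_tokens: (B, HW, C) flattened visual context tokens.
            delta, _ = self.attn(query=text_feats, key=visual_tokens, value=visual_tokens)
            return text_feats + delta  # context-conditioned prompts

Under these assumptions, the score maps could serve both as extra features for the dense prediction head and as targets of an auxiliary per-pixel classification loss against the ground-truth masks, which is the sense in which the abstract describes guiding the dense prediction model.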
Archived Files and Locations
application/pdf, 1.6 MB
available from arxiv.org (repository) and web.archive.org (webarchive)
arXiv version: 2112.01518v2