DeCLUTR: Deep Contrastive Learning for Unsupervised Textual Representations
by
John Giorgi, Osvald Nitski, Bo Wang, Gary Bader
2021
Abstract
Sentence embeddings are an important component of many natural language
processing (NLP) systems. Like word embeddings, sentence embeddings are
typically learned on large text corpora and then transferred to various
downstream tasks, such as clustering and retrieval. Unlike word embeddings, the
highest performing solutions for learning sentence embeddings require labelled
data, limiting their usefulness to languages and domains where labelled data is
abundant. In this paper, we present DeCLUTR: Deep Contrastive Learning for
Unsupervised Textual Representations. Inspired by recent advances in deep
metric learning (DML), we carefully design a self-supervised objective for
learning universal sentence embeddings that does not require labelled training
data. When used to extend the pretraining of transformer-based language models,
our approach closes the performance gap between unsupervised and supervised
pretraining for universal sentence encoders. Importantly, our experiments
suggest that the quality of the learned embeddings scales with both the number
of trainable parameters and the amount of unlabelled training data, making
further improvements straightforward. Our code and pretrained models are
publicly available and can be easily adapted to new domains or used to embed
unseen text.
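The approach described above lends itself to a short illustration. Below is a minimal PyTorch sketch of an InfoNCE-style contrastive objective over anchor-positive text span pairs, using mean pooling over a transformer's token embeddings. The base model, pooling choice, temperature, and the span pairs themselves are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch of an InfoNCE-style contrastive objective over paired text
# spans. Span sampling, pooling, and temperature are assumptions for
# illustration; see the authors' code for the actual implementation.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
encoder = AutoModel.from_pretrained("roberta-base")

def embed(texts):
    """Mean-pool token embeddings into one fixed-length vector per text."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state           # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1).float()  # (B, T, 1)
    return (hidden * mask).sum(1) / mask.sum(1)           # (B, H)

def contrastive_loss(anchors, positives, temperature=0.05):
    """Each anchor is scored against all in-batch positives; the matching
    row is the target, and every other positive acts as a negative."""
    a = F.normalize(embed(anchors), dim=-1)
    p = F.normalize(embed(positives), dim=-1)
    logits = a @ p.T / temperature        # (B, B) scaled cosine similarities
    targets = torch.arange(len(anchors))  # positive for anchor i is row i
    return F.cross_entropy(logits, targets)

# Anchor/positive pairs would be spans drawn from the same document;
# these strings are placeholders for sampled spans.
loss = contrastive_loss(
    ["the cat sat on the mat", "stock prices fell sharply on Monday"],
    ["a cat was sitting on a mat", "shares dropped in early trading"],
)
loss.backward()
```

Minimizing this loss pulls together embeddings of spans drawn from the same document and pushes apart spans from different documents, providing the self-supervised training signal that stands in for labelled sentence pairs.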
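Because the pretrained models are public, embedding unseen text reduces to loading a checkpoint and pooling its token embeddings. The sketch below assumes a Hugging Face checkpoint name; consult the authors' repository for the exact identifiers.

```python
# Embedding unseen text with a released checkpoint (identifier assumed).
import torch
from transformers import AutoModel, AutoTokenizer

name = "johngiorgi/declutr-base"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(name)
encoder = AutoModel.from_pretrained(name)

texts = ["A sentence the model has never seen before."]
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    hidden = encoder(**batch).last_hidden_state      # (B, T, H)
mask = batch["attention_mask"].unsqueeze(-1).float()
embeddings = (hidden * mask).sum(1) / mask.sum(1)    # (B, H) sentence vectors
```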
Archived Files and Locations
application/pdf, 1.2 MB (arXiv preprint 2006.03659v3)
Available from arxiv.org (repository) and web.archive.org (webarchive)