CL4AC: A Contrastive Loss for Audio Captioning
by
Xubo Liu, Qiushi Huang, Xinhao Mei, Tom Ko, H Lilian Tang, Mark D. Plumbley, Wenwu Wang
2021
Abstract
Automated audio captioning (AAC) is a cross-modal translation task that aims
to use natural language to describe the content of an audio clip. As shown in
the submissions received for Task 6 of the DCASE 2021 Challenge, this problem
has received increasing interest in the community. Existing AAC systems are
usually based on an encoder-decoder architecture, in which the audio signal is
encoded into a latent representation and aligned with its corresponding text
description; a decoder is then used to generate the captions. However,
training of an AAC system often encounters the problem of data scarcity, which
may lead to inaccurate latent representations and poor audio-text alignment. To
address this problem, we propose a novel encoder-decoder framework called
Contrastive Loss for Audio Captioning (CL4AC). In CL4AC, self-supervision
signals derived from the original audio-text paired data are used to exploit
the correspondence between audio and text by contrasting samples, which can
improve the quality of the latent representation and the alignment between
audio and text, even when trained on limited data. Experiments are performed
on the Clotho dataset to show the effectiveness of our proposed approach.
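To make the idea of contrasting audio-text pairs concrete, the PyTorch sketch
below shows one common way such a contrastive objective can be implemented:
matched clip-caption pairs in a batch are pulled together while mismatched
pairs act as negatives. This is an illustrative, minimal example only; the
function name, the InfoNCE-style formulation, and the temperature value are
assumptions for the sketch, not necessarily the exact loss used in CL4AC.

import torch
import torch.nn.functional as F

def contrastive_loss(audio_emb: torch.Tensor,
                     text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Illustrative InfoNCE-style loss (an assumption, not the paper's
    exact formulation): the i-th audio clip should match the i-th
    caption; all other captions in the batch serve as negatives.

    audio_emb, text_emb: (batch, dim) tensors from the audio encoder
    and the text branch, respectively.
    """
    # L2-normalise so the dot product becomes a cosine similarity.
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix between all audio/text pairs.
    logits = audio_emb @ text_emb.t() / temperature

    # Matched audio-text pairs lie on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy: audio-to-text and text-to-audio.
    loss_a2t = F.cross_entropy(logits, targets)
    loss_t2a = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_a2t + loss_t2a)

In a training loop, this term would typically be added to the usual caption
generation loss, so the encoder learns representations that both discriminate
between clips and support decoding.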
Archived Files and Locations
application/pdf 2.6 MB
arxiv.org (repository) | web.archive.org (webarchive)
arXiv:2107.09990v1