Does Visual Self-Supervision Improve Learning of Speech Representations for Emotion Recognition?
by
Abhinav Shukla, Stavros Petridis, Maja Pantic
2020
Abstract
Self-supervised learning has attracted considerable recent research interest. However, most work on self-supervision in speech is unimodal, and there has been limited study of the interaction between audio and visual modalities for cross-modal self-supervision. This work (1) investigates
visual self-supervision via face reconstruction to guide the learning of audio
representations; (2) proposes an audio-only self-supervision approach for
speech representation learning; (3) shows that a multi-task combination of the
proposed visual and audio self-supervision is beneficial for learning richer
features that are more robust in noisy conditions; (4) shows that
self-supervised pretraining can outperform fully supervised training and is especially useful for preventing overfitting on smaller datasets. We evaluate
our learned audio representations for discrete emotion recognition, continuous
affect recognition and automatic speech recognition. We outperform existing
self-supervised methods for all tested downstream tasks. Our results
demonstrate the potential of visual self-supervision for audio feature learning
and suggest that joint visual and audio self-supervision leads to more
informative audio representations for speech and emotion recognition.
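
The abstract describes a multi-task setup in which audio representations are learned jointly from an audio-only pretext task and a visual pretext task (reconstructing the speaker's face from audio). Below is a minimal sketch of that multi-task idea, not the authors' implementation: the encoder architecture, the choice of spectrogram frames and a single still face as the two reconstruction targets, the module names, and the loss weight lam are all illustrative assumptions.

    # Hypothetical sketch: audio encoder trained with a combined
    # audio-reconstruction and face-reconstruction loss. Shapes, heads,
    # and the loss weighting are assumptions for illustration only.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class AudioEncoder(nn.Module):
        """Maps a log-mel spectrogram (B, 1, T, n_mels) to frame features."""
        def __init__(self, feat_dim=256):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            )
            self.proj = nn.Linear(64, feat_dim)

        def forward(self, spec):
            h = self.conv(spec)                  # (B, 64, T/4, n_mels/4)
            h = h.mean(dim=-1).transpose(1, 2)   # pool frequency -> (B, T/4, 64)
            return self.proj(h)                  # (B, T/4, feat_dim)

    class MultiTaskSelfSupervision(nn.Module):
        """Combines an audio-only pretext loss (spectrogram reconstruction)
        with a visual pretext loss (face reconstruction) driven by the same
        audio features."""
        def __init__(self, feat_dim=256, n_mels=80, img_size=64, lam=1.0):
            super().__init__()
            self.encoder = AudioEncoder(feat_dim)
            self.audio_head = nn.Linear(feat_dim, n_mels)            # per-frame mel prediction
            self.visual_head = nn.Linear(feat_dim, img_size * img_size)  # flattened face
            self.lam = lam

        def forward(self, spec, mel_target, face_target):
            # spec: (B, 1, T, n_mels); mel_target: (B, T, n_mels);
            # face_target: (B, img_size, img_size)
            feats = self.encoder(spec)                               # (B, T/4, D)

            # audio pretext: reconstruct mel frames, upsampled back to T steps
            audio_pred = self.audio_head(feats).transpose(1, 2)      # (B, n_mels, T/4)
            audio_pred = F.interpolate(audio_pred, size=mel_target.size(1))
            loss_audio = F.l1_loss(audio_pred.transpose(1, 2), mel_target)

            # visual pretext: predict a face image from the clip-level feature
            clip_feat = feats.mean(dim=1)                            # (B, D)
            face_pred = self.visual_head(clip_feat)                  # (B, H*W)
            loss_visual = F.l1_loss(face_pred, face_target.flatten(1))

            return loss_audio + self.lam * loss_visual

In practice one would pretrain with such a combined loss on unlabeled audio-visual speech, then keep the audio encoder and train a light classifier on top for the downstream tasks (emotion recognition, continuous affect, ASR).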
Archived Files and Locations
application/pdf, 1.7 MB — arxiv.org (repository), web.archive.org (webarchive)
arXiv: 2005.01400v2