An Exploration of Self-Supervised Pretrained Representations for End-to-End Speech Recognition
by
Xuankai Chang, Takashi Maekaku, Pengcheng Guo, Jing Shi, Yen-Ju Lu, Aswin Shanmugam Subramanian, Tianzi Wang, Shu-wen Yang, Yu Tsao, Hung-yi Lee, Shinji Watanabe
2021
Abstract
Self-supervised pretraining on speech data has made substantial progress.
High-fidelity representations of the speech signal are learned from large
amounts of untranscribed data and show promising performance. Recently, several
works have focused on evaluating the quality of self-supervised pretrained
representations on various tasks without domain restriction, e.g., SUPERB.
However, such evaluations do not provide a comprehensive comparison across many
ASR benchmark corpora. In this paper, we focus on the general application of
pretrained speech representations to advanced end-to-end automatic speech
recognition (E2E-ASR) models. We select several pretrained speech
representations and present experimental results on various open-source and
publicly available corpora for E2E-ASR. Without any modification of the
back-end model architectures or training strategy, some of the experiments with
pretrained representations, e.g., WSJ and WSJ0-2mix with HuBERT, reach or
outperform the current state-of-the-art (SOTA) recognition performance.
Moreover, we explore further scenarios in which pretrained representations are
effective, such as cross-lingual and overlapped speech. The scripts,
configurations, and trained models have been released in ESPnet so that the
community can reproduce and improve upon our experiments.
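As an illustration of the general idea described in the abstract, the following is a minimal sketch (not the authors' released ESPnet recipe) of extracting pretrained HuBERT representations with torchaudio's published bundle, so that they could stand in for conventional filterbank features at the front-end of an E2E-ASR encoder. The audio path is a placeholder, and the choice of layer or layer-weighting is left to the downstream model.

    # Hypothetical sketch: extract HuBERT features as an ASR front-end.
    import torch
    import torchaudio

    bundle = torchaudio.pipelines.HUBERT_BASE      # pretrained HuBERT (base)
    model = bundle.get_model().eval()

    waveform, sr = torchaudio.load("sample.wav")   # placeholder audio file
    if sr != bundle.sample_rate:
        waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)

    with torch.inference_mode():
        # `features` is a list of (batch, frames, dim) tensors, one per
        # transformer layer; a downstream E2E-ASR encoder would consume a
        # single layer or a learned weighted sum of layers.
        features, _ = model.extract_features(waveform)

    print(len(features), features[-1].shape)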
Archived Files and Locations
application/pdf 783.0 kB
arxiv.org (repository); web.archive.org (webarchive)
2110.04590v1