Large scale weakly and semi-supervised learning for low-resource video ASR
by
Kritika Singh, Vimal Manohar, Alex Xiao, Sergey Edunov, Ross Girshick, Vitaliy Liptchinsky, Christian Fuegen, Yatharth Saraf, Geoffrey Zweig, Abdelrahman Mohamed
2020
Abstract
Many semi- and weakly-supervised approaches have been investigated for
overcoming the labeling cost of building high-quality speech recognition
systems. On the challenging task of transcribing social media videos in
low-resource conditions, we conduct a large-scale systematic comparison
between two self-labeling methods on one hand, and weakly-supervised
pretraining using contextual metadata on the other. We investigate
distillation methods at the frame level and the sequence level for hybrid,
encoder-only CTC-based, and encoder-decoder speech recognition systems on
Dutch and Romanian, using 27,000 and 58,000 hours of unlabeled audio
respectively. Although all approaches improved upon their respective baseline
WERs by more than 8%, sequence-level distillation for encoder-decoder models
provided the largest relative WER reduction of 20% compared to the strongest
data-augmented supervised baseline.
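
The abstract names frame-level and sequence-level distillation without
spelling them out here. As a rough sketch, not the paper's implementation:
frame-level distillation typically minimizes the KL divergence between the
teacher's and the student's per-frame output posteriors, while sequence-level
distillation amounts to self-labeling, i.e. training the student on the
teacher's decoded hypotheses as pseudo-transcripts. The PyTorch fragment
below assumes generic (batch, time, units) logit tensors and a hypothetical
teacher_model.decode() method; the temperature parameter and all names are
illustrative.

    import torch
    import torch.nn.functional as F

    def frame_level_kd_loss(student_logits, teacher_logits, temperature=1.0):
        """KL(teacher || student), averaged over all frames.

        Both tensors have shape (batch, time, num_units), e.g. senone
        posteriors for a hybrid system or token posteriors for a CTC encoder.
        """
        t = temperature
        teacher_probs = F.softmax(teacher_logits / t, dim=-1)
        student_logp = F.log_softmax(student_logits / t, dim=-1)
        # Per-frame KL divergence, then mean over batch and time; the t**2
        # factor keeps gradient magnitudes comparable across temperatures.
        kl = (teacher_probs
              * (teacher_probs.clamp_min(1e-8).log() - student_logp)).sum(-1)
        return kl.mean() * t ** 2

    @torch.no_grad()
    def make_pseudo_labels(teacher_model, unlabeled_audio):
        # Sequence-level distillation reduces to self-labeling: decode the
        # unlabeled audio with the teacher and use its 1-best hypotheses as
        # transcripts for ordinary supervised training of the student.
        # decode() is a hypothetical method, not an API from the paper.
        return teacher_model.decode(unlabeled_audio)

In practice the distillation term would be added to, or substituted for, the
ordinary supervised loss on the labeled subset, with any mixing weight tuned
on held-out data.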
Archived Files and Locations
application/pdf, 138.6 kB (arXiv:2005.07850v1), via arxiv.org (repository) and web.archive.org (webarchive)