On Addressing Practical Challenges for RNN-Transducer release_fogqm5uwfbde3lpmfjs755woq4

by Rui Zhao, Jian Xue, Jinyu Li, Wenning Wei, Lei He, Yifan Gong

Released as a article .

2021  

Abstract

In this paper, several works are proposed to address practical challenges for deploying RNN Transducer (RNN-T) based speech recognition system. These challenges are adapting a well-trained RNN-T model to a new domain without collecting the audio data, obtaining time stamps and confidence scores at word level. The first challenge is solved with a splicing data method which concatenates the speech segments extracted from the source domain data. To get the time stamp, a phone prediction branch is added to the RNN-T model by sharing the encoder for the purpose of force alignment. Finally, we obtain word-level confidence scores by utilizing several types of features calculated during decoding and from confusion network. Evaluated with Microsoft production data, the splicing data adaptation method improves the baseline and adaption with the text to speech method by 58.03% and 15.25% relative word error rate reduction, respectively. The proposed time stamping method can get less than 50ms word timing difference on average while maintaining the recognition accuracy of the RNN-T model. We also obtain high confidence annotation performance with limited computation cost.
In text/plain format

Archived Files and Locations

application/pdf  257.3 kB
file_5cxm7447bzbyhkuptjhnzdjfna
arxiv.org (repository)
web.archive.org (webarchive)
Read Archived PDF
Preserved and Accessible
Type  article
Stage   submitted
Date   2021-05-04
Version   v2
Language   en ?
arXiv  2105.00858v2
Work Entity
access all versions, variants, and formats of this works (eg, pre-prints)
Catalog Record
Revision: 6b9b14c9-6b6c-4a36-b6b3-804c1b57249b
API URL: JSON