Low Latency End-to-End Streaming Speech Recognition with a Scout Network

by Chengyi Wang, Yu Wu, Shujie Liu, Jinyu Li, Liang Lu, Guoli Ye, Ming Zhou

Released as an article.

2020  

Abstract

The attention-based Transformer model has achieved promising results for speech recognition (SR) in the offline mode. However, in the streaming mode, the Transformer model usually incurs significant latency to maintain its recognition accuracy when applying a fixed-length look-ahead window in each encoder layer. In this paper, we propose a novel low-latency streaming approach for Transformer models, which consists of a scout network and a recognition network. The scout network detects the whole word boundary without seeing any future frames, while the recognition network predicts the next subword by utilizing the information from all the frames before the predicted boundary. Our model achieves the best performance (2.7/6.4 WER) with only 639 ms latency on the test-clean and test-other data sets of Librispeech.
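
The boundary-triggered context described in the abstract can be illustrated with a minimal sketch. It assumes the scout network emits one boundary probability per frame and that a frame may attend to every frame up to the first detected boundary at or after it. The names `detect_boundaries`, `visible_limit`, `attention_mask`, and the 0.5 threshold are hypothetical and are not taken from the paper.

```python
from typing import List


def detect_boundaries(scout_probs: List[float], threshold: float = 0.5) -> List[int]:
    """Return the frame indices whose (assumed) scout probability reaches the threshold."""
    return [t for t, p in enumerate(scout_probs) if p >= threshold]


def visible_limit(num_frames: int, boundaries: List[int]) -> List[int]:
    """For each frame t, return the last frame index it may attend to.

    The limit is the first boundary at or after t. Frames after the final
    boundary fall back to the last available frame (an assumption of this sketch).
    """
    ordered = sorted(boundaries)
    limits = []
    j = 0
    for t in range(num_frames):
        # Skip boundaries that lie strictly before the current frame.
        while j < len(ordered) and ordered[j] < t:
            j += 1
        limits.append(ordered[j] if j < len(ordered) else num_frames - 1)
    return limits


def attention_mask(num_frames: int, boundaries: List[int]) -> List[List[bool]]:
    """Build a mask where mask[t][s] is True if frame t may attend to frame s."""
    limits = visible_limit(num_frames, boundaries)
    return [[s <= limits[t] for s in range(num_frames)] for t in range(num_frames)]


# Toy scout output for 7 frames with two likely word boundaries.
probs = [0.1, 0.2, 0.9, 0.1, 0.3, 0.8, 0.2]
boundaries = detect_boundaries(probs)
mask = attention_mask(len(probs), boundaries)
```

In this toy case the frames of the first word see nothing beyond its boundary, so the look-ahead is bounded by the word length rather than by a fixed window applied in every encoder layer.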

Archived Files and Locations

application/pdf  391.3 kB
file_jiwndg3grvcujjxbpl5j3ofssq
arxiv.org (repository)
web.archive.org (webarchive)
Type: article
Stage: submitted
Date: 2020-05-03
Version: v4
Language: en
arXiv: 2003.10369v4
Catalog Record
Revision: 931274e5-5151-4750-b58e-ed4c8bbdfb62