Low Latency End-to-End Streaming Speech Recognition with a Scout Network
by
Chengyi Wang, Yu Wu, Shujie Liu, Jinyu Li, Liang Lu, Guoli Ye, Ming Zhou
2020
Abstract
The attention-based Transformer model has achieved promising results for
speech recognition (SR) in the offline mode. However, in the streaming mode,
the Transformer model usually incurs significant latency to maintain its
recognition accuracy when applying a fixed-length look-ahead window in each
encoder layer. In this paper, we propose a novel low-latency streaming approach
for Transformer models, which consists of a scout network and a recognition
network. The scout network detects whole-word boundaries without seeing any
future frames, while the recognition network predicts the next subword by
utilizing the information from all the frames before the predicted boundary.
Our model achieves the best performance (2.7/6.4 WER) with only 639 ms latency
on the test-clean and test-other data sets of Librispeech.
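The division of labor described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the scout network emits a per-frame boundary score and that a fixed threshold turns those scores into boundaries. It also assumes the recognizer may attend to every frame up to and including the boundary frame. The function names, the threshold value, and the inclusive boundary convention are all hypothetical.

```python
from typing import List


def boundaries_from_scores(scores: List[float], threshold: float = 0.5) -> List[int]:
    """Return frame indices whose boundary score meets the threshold.

    Assumption: each score was computed from past frames only, so a
    boundary can be declared as soon as its frame arrives.
    """
    return [t for t, s in enumerate(scores) if s >= threshold]


def visible_frames(num_frames: int, boundaries: List[int], word_index: int) -> List[bool]:
    """Mask of frames the recognizer may use while decoding one word.

    Frames up to and including the word's predicted boundary are visible
    (assumed convention). If no boundary has been predicted for this word,
    all frames are visible.
    """
    if word_index < len(boundaries):
        last = boundaries[word_index]
    else:
        last = num_frames - 1
    return [t <= last for t in range(num_frames)]


# Toy per-frame scores standing in for scout-network outputs.
scores = [0.1, 0.2, 0.9, 0.1, 0.8, 0.3]
boundaries = boundaries_from_scores(scores)
first_word_mask = visible_frames(len(scores), boundaries, 0)
second_word_mask = visible_frames(len(scores), boundaries, 1)
```

In this sketch the added latency for a word is bounded by the gap between when it is spoken and when its boundary is detected, rather than by a fixed look-ahead window in each encoder layer.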
arXiv: 2003.10369v4