All-neural beamformer for continuous speech separation
by Zhuohuang Zhang, Takuya Yoshioka, Naoyuki Kanda, Zhuo Chen, Xiaofei Wang, Dongmei Wang, Sefik Emre Eskimez (2021)
Abstract
Continuous speech separation (CSS) aims to separate overlapping voices from a
continuous influx of conversational audio containing an unknown number of
utterances spoken by an unknown number of speakers. A common application
scenario is transcribing a meeting conversation recorded by a microphone array.
Prior studies explored various deep learning models for time-frequency mask
estimation, followed by a minimum variance distortionless response (MVDR)
filter to improve the automatic speech recognition (ASR) accuracy. The
performance of these methods is fundamentally upper-bounded by MVDR's spatial
selectivity. Recently, the all deep learning MVDR (ADL-MVDR) model was proposed
for neural beamforming and demonstrated superior performance in a target speech
extraction task using pre-segmented input. In this paper, we further adapt
ADL-MVDR to the CSS task with several enhancements to enable end-to-end neural
beamforming. The proposed system achieves significant word error rate reduction
over a baseline spectral masking system on the LibriCSS dataset. Moreover, the
proposed neural beamformer is shown to be comparable to a state-of-the-art
MVDR-based system in real meeting transcription tasks, including AMI, while
showing potential to further simplify the runtime implementation and reduce
the system latency with frame-wise processing.
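For context, a standard formulation of the MVDR filter referenced above is the textbook closed-form solution (not necessarily the exact variant used in this paper); the notation below is conventional rather than taken from the source:

    \mathbf{w} = \frac{\boldsymbol{\Phi}_{\mathrm{n}}^{-1}\,\mathbf{d}}{\mathbf{d}^{\mathsf{H}}\,\boldsymbol{\Phi}_{\mathrm{n}}^{-1}\,\mathbf{d}},

where \boldsymbol{\Phi}_{\mathrm{n}} is the noise spatial covariance matrix (in mask-based pipelines, estimated from the predicted time-frequency masks) and \mathbf{d} is the steering vector of the target speaker. The spatial selectivity bound noted in the abstract stems from this closed-form solution; the ADL-MVDR approach replaces the explicit matrix inversion and steering-vector estimation with recurrent networks, which is what makes fully neural, frame-wise beamforming possible.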
Archived Files and Locations
application/pdf (820.9 kB): arxiv.org (repository), web.archive.org (webarchive); arXiv:2110.06428v1