Speaker Recognition from Raw Waveform with SincNet
by Mirco Ravanelli, Yoshua Bengio
2018
Abstract
Deep learning is progressively gaining popularity as a viable alternative to
i-vectors for speaker recognition. Promising results have been recently
obtained with Convolutional Neural Networks (CNNs) when fed by raw speech
samples directly. Rather than employing standard hand-crafted features, the
latter CNNs learn low-level speech representations from waveforms, potentially
allowing the network to better capture important narrow-band speaker
characteristics such as pitch and formants. Proper design of the neural network
is crucial to achieve this goal. This paper proposes a novel CNN architecture,
called SincNet, that encourages the first convolutional layer to discover more
meaningful filters. SincNet is based on parametrized sinc functions, which
implement band-pass filters. In contrast to standard CNNs, which learn all
elements of each filter, the proposed method learns only the low and high
cutoff frequencies directly from data. This offers a very compact and
efficient way to derive a customized filter bank specifically tuned for the
desired application. Our experiments, conducted on both speaker identification
and speaker verification tasks, show that the proposed architecture converges
faster and performs better than a standard CNN on raw waveforms.
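The core idea of a band-pass filter parametrized only by its two cutoff frequencies can be sketched as the difference of two low-pass sinc filters, smoothed with a window. The sketch below is illustrative, not the paper's implementation: the function name, kernel size, sampling rate, and choice of a Hamming window are assumptions for the example (SincNet itself trains the cutoffs inside a neural network).

```python
import numpy as np

def sinc_bandpass_kernel(f1_hz, f2_hz, kernel_size=251, fs=16000):
    """Band-pass FIR kernel built as the difference of two low-pass sinc
    filters. In a SincNet-style layer, only the cutoffs f1_hz < f2_hz would
    be learned; every other tap follows from the closed-form expression.
    (Illustrative sketch; names and defaults are assumptions.)"""
    # Normalized cutoff frequencies (cycles per sample).
    f1 = f1_hz / fs
    f2 = f2_hz / fs
    # Symmetric time axis centered on zero.
    n = np.arange(kernel_size) - (kernel_size - 1) / 2
    # np.sinc(x) = sin(pi x) / (pi x), so 2f * sinc(2 f n) is an ideal
    # low-pass filter with cutoff f; subtracting two gives a band-pass.
    kernel = 2 * f2 * np.sinc(2 * f2 * n) - 2 * f1 * np.sinc(2 * f1 * n)
    # Window to reduce ripple from truncating the ideal (infinite) filter.
    kernel *= np.hamming(kernel_size)
    return kernel
```

Each filter is thus described by two scalars instead of `kernel_size` free taps, which is what makes the parametrization compact and forces the first layer toward interpretable band-pass shapes.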
Archived Files and Locations
application/pdf 419.9 kB
file_ysi2ykn4rfb2rcljcb67peudbm
arxiv.org (repository), web.archive.org (webarchive)
1808.00158v1
access all versions, variants, and formats of this work (e.g., pre-prints)