Dr-Vectors: Decision Residual Networks and an Improved Loss for Speaker Recognition
by Jason Pelecanos, Quan Wang and Ignacio Lopez Moreno (2021)
Abstract
Many neural network speaker recognition systems model each speaker using a
fixed-dimensional embedding vector. These embeddings are generally compared
using either linear or 2nd-order scoring and, until recently, do not handle
utterance-specific uncertainty. In this work we propose scoring these
representations in a way that can capture uncertainty, enroll/test asymmetry,
and additional non-linear information. This is achieved by incorporating a
2nd-stage neural network (known as a decision network) as part of an end-to-end
training regimen. In particular, we propose the concept of decision residual
networks, which involves the use of a compact decision network to leverage
cosine scores and to model the residual signal that is needed. Additionally, we
present a modification to the generalized end-to-end softmax loss function to
target the separation of same/different speaker scores. We observed significant
performance gains for the two techniques.
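The abstract gives only the high-level idea, but the decision-residual scoring it describes can be sketched concretely: compute the ordinary cosine score between an enrollment embedding and a test embedding, then let a small second-stage network add a learned correction on top of it. The sketch below is illustrative only; the layer sizes, the ReLU non-linearity, and the function names are assumptions, not the paper's actual architecture.

```python
import numpy as np

def cosine_score(e, t):
    """Standard cosine similarity between enrollment and test embeddings."""
    return float(np.dot(e, t) / (np.linalg.norm(e) * np.linalg.norm(t)))

def decision_residual_score(e, t, W1, b1, w2, b2):
    """Hypothetical decision-residual scoring.

    A compact decision network (here a one-hidden-layer MLP with assumed
    parameters W1, b1, w2, b2) consumes the concatenated enrollment/test
    embeddings -- so it can, in principle, model enroll/test asymmetry and
    non-linear structure -- and its output is added as a residual on top
    of the cosine score.
    """
    base = cosine_score(e, t)
    h = np.maximum(0.0, W1 @ np.concatenate([e, t]) + b1)  # ReLU hidden layer
    residual = float(w2 @ h + b2)
    return base + residual
```

Note that when the decision network's weights are zero, the score reduces exactly to the cosine baseline, which is the sense in which the network only has to learn "the residual signal that is needed."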
Archived Files and Locations
application/pdf (443.7 kB), available from arxiv.org (repository) and web.archive.org (webarchive); arXiv: 2104.01989v2