WaveFlow: A Compact Flow-based Model for Raw Audio
by
Wei Ping, Kainan Peng, Kexin Zhao, Zhao Song
2019
Abstract
In this work, we present WaveFlow, a small-footprint generative flow for raw
audio, which is trained with maximum likelihood, without the probability
density distillation and auxiliary losses used in Parallel WaveNet and ClariNet. It
provides a unified view of likelihood-based models for raw audio, including
WaveNet and WaveGlow as special cases. We systematically study these
likelihood-based generative models for raw waveforms in terms of test
likelihood and speech fidelity. We demonstrate that WaveFlow can synthesize
high-fidelity speech on par with WaveNet, while requiring only a few sequential
steps to generate very long waveforms with hundreds of thousands of time-steps.
Furthermore, WaveFlow closes the significant likelihood gap that has existed
between autoregressive models and flow-based models for efficient synthesis.
Finally, our small-footprint WaveFlow has only 5.91M parameters and generates
22.05 kHz high-fidelity speech 42.6 times faster than real time on a GPU,
without engineered inference kernels.
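The abstract emphasizes that WaveFlow is trained with plain maximum likelihood rather than distillation. For any normalizing flow, that objective comes from the change-of-variables formula: log p_X(x) = log p_Z(f(x)) + log|det ∂f(x)/∂x|, where f maps data to a simple base distribution. A minimal sketch with a single elementwise affine flow and a standard-normal base (a hypothetical toy transform, not the paper's architecture):

```python
import numpy as np

def affine_flow_logpdf(x, log_scale, shift):
    """Exact log-likelihood of x under an elementwise affine flow
    z = x * exp(log_scale) + shift with a standard-normal base,
    via the change-of-variables formula (toy illustration)."""
    z = x * np.exp(log_scale) + shift
    # log N(z; 0, I), summed over all dimensions
    base_logp = -0.5 * np.sum(z ** 2 + np.log(2.0 * np.pi))
    # log |det Jacobian| of an elementwise affine map is just sum(log_scale)
    log_det = np.sum(np.broadcast_to(log_scale, x.shape))
    return base_logp + log_det

x = np.array([0.5, -1.0, 2.0])
ll = affine_flow_logpdf(x, np.zeros_like(x), np.zeros_like(x))
```

With an identity transform (zero log-scale and shift), the value reduces to the standard-normal log-density of x itself; maximum-likelihood training would maximize this quantity over the flow's parameters by gradient ascent.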
Archived Files and Locations
application/pdf, 470.3 kB
arxiv.org (repository) · web.archive.org (webarchive)
arXiv: 1912.01219v1