On the Limitations of Multimodal VAEs
by
Imant Daunhawer, Thomas M. Sutter, Kieran Chin-Cheong, Emanuele Palumbo, Julia E. Vogt
2022
Abstract
Multimodal variational autoencoders (VAEs) have shown promise as efficient
generative models for weakly-supervised data. Yet, despite their advantage of
weak supervision, they exhibit a gap in generative quality compared to unimodal
VAEs, which are completely unsupervised. In an attempt to explain this gap, we
uncover a fundamental limitation that applies to a large family of
mixture-based multimodal VAEs. We prove that the sub-sampling of modalities
enforces an undesirable upper bound on the multimodal ELBO and thereby limits
the generative quality of the respective models. Empirically, we showcase the
generative quality gap on both synthetic and real data and present the
tradeoffs between different variants of multimodal VAEs. We find that none of
the existing approaches fulfills all desired criteria of an effective
multimodal generative model when applied to more complex datasets than those
used in previous benchmarks. In summary, we identify, formalize, and validate
fundamental limitations of VAE-based approaches for modeling weakly-supervised
data and discuss implications for real-world applications.
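To make the abstract's central claim concrete: the mixture-based family it refers to (e.g., the MMVAE objective; the notation below is an illustrative assumption, not the paper's exact statement) averages per-modality ELBOs, each conditioned on a single sub-sampled modality. A minimal sketch in LaTeX, for M modalities x_{1:M} with encoder q_\phi and decoder p_\theta:

\mathcal{L}(x_{1:M})
  = \frac{1}{M} \sum_{m=1}^{M}
    \mathbb{E}_{q_\phi(z \mid x_m)}
    \left[ \log \frac{p_\theta(x_{1:M}, z)}{q_\phi(z \mid x_m)} \right]
  \le \log p_\theta(x_{1:M}).

Because each expectation conditions on only one modality x_m, the variational gap cannot close whenever the remaining modalities carry information about z that x_m alone does not; roughly, this is the undesirable upper bound on the multimodal ELBO that the abstract describes (the precise bound is proved in the paper itself).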
Archived Files and Locations
application/pdf, 18.2 MB (arXiv:2110.04121v2)
arxiv.org (repository), web.archive.org (webarchive)