Learning Multimodal Latent Generative Models with Energy-Based Prior

ECCV 2024 Oral

Shiyu Yuan¹, Jiali Cui², Hanao Li², and Tian Han²

¹Department of Systems and Enterprises
²Department of Computer Science

Stevens Institute of Technology, USA

Abstract

Multimodal models have gained increasing popularity recently. Many works have been proposed to learn representations for different modalities. Such a representation can capture information shared across these domains, leading to improved and more coherent joint and cross generation. However, these works mainly consider a standard Gaussian or Laplacian as their prior distribution. It can be challenging for such a uni-modal and non-informative distribution to capture all the information from multiple data types. Meanwhile, energy-based models (EBMs) have shown their effectiveness in multiple tasks due to their expressiveness and flexibility. Their capacity, however, has yet to be explored for multimodal generative models. In this paper, we propose a novel framework to train multimodal latent generative models together with energy-based models. The proposed method leads to a more expressive and informative prior that can better capture the information within multiple modalities. Our experiments show that our model is effective and can improve generation coherence and latent classification on different multimodal datasets.

Paper

The publication can be obtained here.

@article{syuan2024mulener,
  title={Learning Multimodal Latent Generative Models with Energy-Based Prior},
  author={Yuan, Shiyu and Cui, Jiali and Li, Hanao and Han, Tian},
  journal={ECCV},
  year={2024}
}

Contributions

(1) We propose an energy-based prior model for multimodal latent generative models to capture complex shared information within multiple modalities.
(2) We develop the variational training scheme where the generation model, inference model, and energy-based prior can be jointly and effectively learned.
(3) We conduct various experiments and ablation studies and demonstrate superior performance compared to Laplacian prior baselines.
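
For orientation, the energy-based prior in contribution (1) can be written as an exponential tilting of a reference prior, using the notation of Experiment 3 (\(p_0\), \(f_\alpha\), \(p_\alpha\)). The normalizing constant \(Z(\alpha)\), the inference model \(q_\phi\), the generation model \(p_\beta\), and the generic evidence lower bound below are assumptions introduced for illustration, with \(x\) standing for the collection of all modalities; the exact objective used in the paper may differ.

\[
p_\alpha(z) = \frac{1}{Z(\alpha)} \exp\big(f_\alpha(z)\big)\, p_0(z), \qquad Z(\alpha) = \int \exp\big(f_\alpha(z)\big)\, p_0(z)\, dz
\]

\[
\log p(x) \;\ge\; \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\beta(x \mid z)\big] - \mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p_\alpha(z)\big)
\]

\[
\nabla_\alpha \Big( -\mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p_\alpha(z)\big) \Big) = \mathbb{E}_{q_\phi(z \mid x)}\big[\nabla_\alpha f_\alpha(z)\big] - \mathbb{E}_{p_\alpha(z)}\big[\nabla_\alpha f_\alpha(z)\big]
\]

Under this generic bound, the update for \(\alpha\) is a difference of two expectations. The second one requires samples from \(p_\alpha(z)\), which is one plausible reason for drawing Langevin samples from the prior as in Experiment 3.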

Settings

1. Base version: EBM prior.
2. Generalized version: EBM prior with modality-specific prior.

Code

The code can be obtained here.

Experiments

Experiment 1.1: EBM prior Base version on MNIST-SVHN: Digit Coherence

Our proposed multimodal generative model with EBM prior generates images with highly consistent digit information on both clean and noisy backgrounds. The qualitative results are shown in Figure 1. We further evaluate our model quantitatively using Joint Coherence and Cross Coherence in the table below. It can be seen that our model achieves superior generation performance compared to the listed baseline models.

Figure 1: Joint Generated samples for MNIST-SVHN.
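
The two coherence metrics named above can be sketched as follows. This is a minimal illustration assuming the definitions commonly used in the multimodal VAE literature: joint coherence is the fraction of jointly generated samples whose modalities are classified as the same digit, and cross coherence is the fraction of cross-generated samples classified as the digit of the source input. The function names, the use of pretrained digit classifiers, and the toy predictions are hypothetical and not taken from the paper's code.

```python
import numpy as np


def joint_coherence(preds_per_modality):
    """Fraction of joint samples whose predicted digits agree across modalities.

    preds_per_modality: int array of shape (num_modalities, num_samples);
    entry [m, i] is the digit that a (hypothetical) pretrained classifier
    assigns to modality m of jointly generated sample i.
    """
    preds = np.asarray(preds_per_modality)
    # A sample is coherent if every modality matches the first modality.
    agree = np.all(preds == preds[0], axis=0)
    return float(agree.mean())


def cross_coherence(target_preds, source_labels):
    """Fraction of cross-generated samples classified as the source digit."""
    target_preds = np.asarray(target_preds)
    source_labels = np.asarray(source_labels)
    return float((target_preds == source_labels).mean())


# Toy predictions standing in for classifier outputs (made-up numbers).
mnist_preds = np.array([3, 7, 1, 0])
svhn_preds = np.array([3, 7, 2, 0])
joint = joint_coherence(np.stack([mnist_preds, svhn_preds]))

# Digits predicted on SVHN images generated from MNIST inputs,
# compared against the labels of those MNIST inputs.
cross = cross_coherence(target_preds=[5, 5, 9, 4], source_labels=[5, 6, 9, 4])
```

In this toy case three of the four samples agree for each metric, so both values are 0.75.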

Experiment 1.2: EBM prior Base version on PolyMNIST: Digit Coherence

Figure 2: Joint Generated samples for PolyMNIST (EBM prior: left; Laplacian prior: right).

Experiment 2: EBM prior Generalized version on PolyMNIST: Digit Coherence

Figure 3: Joint Generated samples for PolyMNIST (EBM prior: left; Laplacian prior: right).

Experiment 3: Markov chain transition from standard prior to EBM prior on CUB

We examine the exponential tilting of the reference prior \(p_0(z)\) through Langevin samples initialized from \(p_0(z)\) with target distribution \(p_\alpha(z)\). As the reference distribution \(p_0(z)\) takes the form of a Laplacian, we expect the energy-based correction \(f_\alpha\) to tilt \(p_0\) into an irregular shape with some shallow local modes. Therefore, the trajectory of a Markov chain initialized from the reference distribution \(p_0(z)\) with a well-learned target \(p_\alpha(z)\) should depict the transition towards more coherent information between image and caption. Figure 4 depicts such transitions for CUB, using a model trained with 50 Langevin steps. The quality of synthesis improves significantly as the number of steps increases.

Figure 4: Transition of Markov chains initialized from \(p_0(z)\) towards \(\tilde{p}_{\alpha}(z)\) over 50 Langevin steps.
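
The Markov chain described in this experiment can be sketched with unadjusted Langevin dynamics on \(\log p_\alpha(z) = f_\alpha(z) + \log p_0(z) + \text{const}\). The sketch below is an illustration under stated assumptions, not the paper's implementation: the learned network \(f_\alpha\) is replaced by a hypothetical toy function (the log of a sum of Gaussian bumps) with an analytic gradient, the latent space is two-dimensional, and the step size and bump locations are made up. Only the Laplacian reference prior and the 50 steps come from the text.

```python
import numpy as np


def grad_log_laplace(z, scale=1.0):
    # Gradient of log p_0(z) for a zero-mean Laplacian reference prior.
    return -np.sign(z) / scale


def grad_f_alpha(z, centers, width=0.5):
    # Gradient of a toy stand-in for f_alpha (ASSUMPTION, not the learned net):
    # f(z) = log sum_k exp(-||z - c_k||^2 / (2 * width^2)).
    diff = z[:, None, :] - centers[None, :, :]            # (n, k, d)
    logits = -(diff ** 2).sum(-1) / (2 * width ** 2)      # (n, k)
    logits -= logits.max(axis=1, keepdims=True)           # numerical stability
    w = np.exp(logits)
    w /= w.sum(axis=1, keepdims=True)                     # softmax over bumps
    return -(w[:, :, None] * diff).sum(axis=1) / width ** 2


def langevin_prior_sample(z0, centers, n_steps=50, step_size=0.01, rng=None):
    # z <- z + s * grad log p_alpha(z) + sqrt(2 s) * noise
    rng = np.random.default_rng() if rng is None else rng
    z = z0.copy()
    for _ in range(n_steps):
        grad = grad_f_alpha(z, centers) + grad_log_laplace(z)
        noise = rng.standard_normal(z.shape)
        z = z + step_size * grad + np.sqrt(2 * step_size) * noise
    return z


def mean_dist_to_nearest(z, centers):
    # Diagnostic: average distance from each sample to its closest bump.
    d = np.linalg.norm(z[:, None, :] - centers[None, :, :], axis=-1)
    return float(d.min(axis=1).mean())


rng = np.random.default_rng(0)
centers = np.array([[2.0, 2.0], [-2.0, -2.0]])            # hypothetical modes
z0 = rng.laplace(loc=0.0, scale=1.0, size=(512, 2))       # samples from p_0
zK = langevin_prior_sample(z0, centers, n_steps=50, step_size=0.01, rng=rng)

dist_before = mean_dist_to_nearest(z0, centers)
dist_after = mean_dist_to_nearest(zK, centers)
```

In this toy setting the chains drift from the single-mode Laplacian towards the modes introduced by the correction term, so `dist_after` should be smaller than `dist_before`. With a learned \(f_\alpha\), the gradient would typically be obtained by automatic differentiation instead of a closed form.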
