MANGO: Multimodal Attention-based Normalizing Flow Approach to Fusion Learning
1CVIU Lab, University of Arkansas 2University of Florida 3COSMOS Research Center, University of Arkansas, Little Rock, USA 4ICSI, University of California, Berkeley, USA
🎉 Accepted to NeurIPS 2025 🎉
Highlights
- Invertible Cross-Attention Layer: We propose a novel Invertible Cross-Attention (ICA) layer that integrates with normalizing flows to provide explicit, interpretable, and tractable multimodal fusion.
- New Cross-Attention Mechanisms: We design three complementary mechanisms—Modality-to-Modality Cross-Attention (MMCA), Inter-Modality Cross-Attention (IMCA), and Learnable Inter-Modality Cross-Attention (LICA)—to efficiently capture complex inter- and intra-modality correlations.
- Scalable Multimodal Normalizing Flow: We introduce the Multimodal Attention-based Normalizing Flow (MANGO), enabling scalable and effective modeling of high-dimensional multimodal data, achieving state-of-the-art performance across diverse tasks including semantic segmentation, image-to-image translation, and movie genre classification.
Abstract
Multimodal learning has achieved notable success in recent years. However, current multimodal fusion methods adopt the attention mechanism of Transformers to implicitly learn the underlying correlations of multimodal features. As a result, the multimodal model cannot capture the essential features of each modality, making it difficult to comprehend the complex structures and correlations of multimodal inputs. This paper introduces a novel Multimodal Attention-based Normalizing Flow (MANGO) approach to developing explicit, interpretable, and tractable multimodal fusion learning (the source code of this work will be publicly available). In particular, we propose a new Invertible Cross-Attention (ICA) layer to develop a Normalizing Flow-based model for multimodal data. To efficiently capture the complex, underlying correlations in multimodal data within our proposed invertible cross-attention layer, we propose three new cross-attention mechanisms: Modality-to-Modality Cross-Attention (MMCA), Inter-Modality Cross-Attention (IMCA), and Learnable Inter-Modality Cross-Attention (LICA). Finally, we introduce a new Multimodal Attention-based Normalizing Flow to enable the scalability of our proposed method to high-dimensional multimodal data. Our experimental results on three different multimodal learning tasks, i.e., semantic segmentation, image-to-image translation, and movie genre classification, demonstrate the state-of-the-art (SoTA) performance of the proposed approach.
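To give a concrete sense of how cross-attention can be made invertible inside a normalizing flow, the sketch below shows an affine coupling layer whose scale and shift are produced by cross-attending from one half of a modality's features to a second modality. This is a hypothetical illustration under standard coupling-flow assumptions, not the authors' exact ICA layer: the class name, dimensions, and the choice of `nn.MultiheadAttention` are ours.

```python
import torch
import torch.nn as nn

class CrossAttentionCoupling(nn.Module):
    """Illustrative invertible coupling layer conditioned via cross-attention.

    Half of modality x passes through unchanged; the other half is transformed
    by a scale/shift computed from cross-attention between the unchanged half
    (queries) and a second modality y (keys/values). Because the scale and
    shift never depend on the transformed half, the layer is exactly
    invertible and its log-determinant is tractable.
    """

    def __init__(self, dim, y_dim, num_heads=2):
        super().__init__()
        assert dim % 2 == 0
        half = dim // 2
        self.attn = nn.MultiheadAttention(
            half, num_heads, kdim=y_dim, vdim=y_dim, batch_first=True
        )
        self.to_scale_shift = nn.Linear(half, 2 * half)

    def _scale_shift(self, x1, y):
        ctx, _ = self.attn(x1, y, y)          # queries from x1, keys/values from y
        s, t = self.to_scale_shift(ctx).chunk(2, dim=-1)
        return torch.tanh(s), t               # bounded log-scale for stability

    def forward(self, x, y):
        x1, x2 = x.chunk(2, dim=-1)
        s, t = self._scale_shift(x1, y)
        z2 = x2 * torch.exp(s) + t            # invertible affine transform
        log_det = s.sum(dim=(1, 2))           # log |det Jacobian| per sample
        return torch.cat([x1, z2], dim=-1), log_det

    def inverse(self, z, y):
        z1, z2 = z.chunk(2, dim=-1)
        s, t = self._scale_shift(z1, y)       # z1 == x1, so s, t are recomputable
        x2 = (z2 - t) * torch.exp(-s)
        return torch.cat([z1, x2], dim=-1)
```

The key design point is that attention conditions the transform without entering the Jacobian: inverting the layer only requires re-running the same attention on the untouched half, so the flow stays exact regardless of how expressive the attention module is.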
Experimental Results
Comparison of RGB-D Semantic Segmentation Performance on NYUDv2 and SUN RGB-D with Prior Methods.
Comparison of Multimodal Image Translation Performance on Taskonomy with Prior Multimodal Methods.
Comparison of Movie Genre Classification Performance on the MM-IMDB dataset with Prior Multimodal Methods.
Acknowledgements
This work is partly supported by NSF CAREER (No. 2442295), NSF SCH (No. 2501021), NSF E-RISE (No. 2445877), NSF SBIR Phase 2 (No. 2247237) and USDA/NIFA Award. We also acknowledge the Arkansas High-Performance Computing Center (HPC) for GPU servers. Nitin Agarwal’s participation was supported by U.S. NSF (OIA-1946391, OIA-1920920), AFOSR (FA9550-22-1-0332), ARO (W911NF-23-1-0011, W911NF-24-1-0078, W911NF-25-1-0147), ONR (N00014-21-1-2121, N00014-21-1-2765, N00014-22-1-2318), AFRL, DARPA, Australian DSTO Strategic Policy Grants Program, Arkansas Research Alliance, the Jerry L. Maulden/Entergy Endowment, and the Donaghey Foundation at the University of Arkansas at Little Rock.
BibTeX
@inproceedings{truong2025mango,
title={MANGO: Multimodal Attention-based Normalizing Flow Approach to Fusion Learning},
author={Thanh-Dat Truong and Christophe Bobda and Nitin Agarwal and Khoa Luu},
booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
year={2025}
}