Modality-Aware Gated Attention Network for Audio-Visual Event Localization

This paper proposes a Modality-Aware Gated Attention Network (MAGAN) that focuses on event-relevant visual regions, consolidates informative audio frequencies, and captures event-specific modality biases. Specifically, a cross-modal gated co-attention (CMGCA) scheme is presented for modeling the correspondence between the potential (self-guided) localization maps and the modality-guided localization maps through two gated components, i.e., audio-to-visual attention and visual-to-audio attention. Furthermore, a cross-modal gated co-interaction (CMGCI) mechanism that incorporates both unimodal gated interaction and multimodal gated interaction is introduced to capture event-specific modality biases by considering unimodal independence and multimodal synergy simultaneously. Extensive experiments on the AVE dataset demonstrate the superiority and effectiveness of the model over state-of-the-art approaches in both fully- and weakly-supervised AVE settings.
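As a minimal illustrative sketch (not the authors' implementation, whose details are not given in the abstract), the audio-to-visual branch of CMGCA can be pictured as fusing a self-guided visual localization map with an audio-guided map through a learned gate. All projection matrices, the gating form, and the function name below are assumptions introduced for illustration:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def audio_to_visual_gated_attention(audio, visual, rng):
    """Hypothetical audio-to-visual gated attention sketch.

    audio:  (T, d)    segment-level audio features
    visual: (T, N, d) N region features per video segment
    Returns the attended visual features (T, d) and the
    fused localization map (T, N).
    """
    T, N, d = visual.shape
    # Random stand-ins for projections that would be learned in the real model.
    Wq = rng.standard_normal((d, d)) / np.sqrt(d)
    Wk = rng.standard_normal((d, d)) / np.sqrt(d)
    Wg = rng.standard_normal((d, d)) / np.sqrt(d)

    q = audio @ Wq                      # (T, d) audio-guided query
    k = visual @ Wk                     # (T, N, d) region keys
    # Modality-guided localization map: audio indicates where to look.
    guided = softmax(np.einsum('td,tnd->tn', q, k) / np.sqrt(d), axis=-1)
    # Self-guided (potential) localization map from visual features alone.
    self_scores = np.einsum('tnd,td->tn', k, visual.mean(axis=1) @ Wk)
    self_map = softmax(self_scores / np.sqrt(d), axis=-1)
    # A sigmoid gate fuses the two maps per region (the "gated" component).
    gate = sigmoid(np.einsum('td,tnd->tn', audio @ Wg, visual))
    attn = gate * guided + (1.0 - gate) * self_map
    attn = attn / attn.sum(axis=-1, keepdims=True)  # renormalize per segment
    return np.einsum('tn,tnd->td', attn, visual), attn
```

The visual-to-audio branch would mirror this with the roles of the modalities swapped, attending over audio frequency bins instead of visual regions.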
