Making Visual Dialogue More Engaging: A New Task, Method, and Metric
Authors
- Guanghui Ye, Hunan University
- Huan Zhao, Hunan University
- Yingxue Gao, Hunan University
- Zhixue Zhao, University of Sheffield
- Kehan Wang, Hunan University
- Xupeng Zha, Hunan University
- Zhihua Jiang, Jinan University
DOI: https://doi.org/10.1609/aaai.v40i19.38650
Abstract
Large language model (LLM)-based visual dialogue (VD) systems have made response generation for image-grounded conversations more accurate and coherent. However, user engagement (the extent to which a user is interested, emotionally involved, and willing to continue the conversation) remains a challenge. To fully explore engaging VD, we propose: (i) a new task named Audio-enhanced VD (AVD), which introduces audio dialogue contexts as additional input, since audio can convey the speaker's emotions more vividly, with the aim of generating responses that are not only correct but also more engaging. Specifically, we employ a text-to-speech model as a modality translator that generates a paired acoustic utterance for each input textual utterance; (ii) an accompanying approach named Visually-grounded and Interleaved Text-Audio Dialogue Modeling (VITA-DM), which exploits both image-grounded information and interleaved text-audio utterances for visual dialogue modeling, in contrast to previous multi-modal LLM (MLLM)-based methods, which typically model the text and audio modalities separately. We also present three pre-training tasks to better learn multi-modal interactions across language, vision, and audio; (iii) a novel metric named Multi-Modal Engagement (MME), which fills the gap in engagement estimation for VD and provides a fine-grained assessment along emotional, attentional, and reply engagement dimensions (EE, AE, RE). We experiment on two popular datasets and provide extensive automatic, engagement-specific, and human evaluations that support the validity of our approach. Furthermore, our empirical results reveal that emotion contributes the most to engagement, justifying our emphasis on the emotional aspect throughout the definition, solution, and evaluation of our task.
How to Cite
Ye, G., Zhao, H., Gao, Y., Zhao, Z., Wang, K., Zha, X., & Jiang, Z. (2026). Making Visual Dialogue More Engaging: A New Task, Method, and Metric. Proceedings of the AAAI Conference on Artificial Intelligence, 40(19), 16145-16153. https://doi.org/10.1609/aaai.v40i19.38650
Section
AAAI Technical Track on Data Mining & Knowledge Management III