Making Visual Dialogue More Engaging: A New Task, Method, and Metric
Authors
- Guanghui Ye, Hunan University
- Huan Zhao, Hunan University
- Yingxue Gao, Hunan University
- Zhixue Zhao, University of Sheffield
- Kehan Wang, Hunan University
- Xupeng Zha, Hunan University
- Zhihua Jiang, Jinan University
DOI: https://doi.org/10.1609/aaai.v40i19.38650
Abstract
Large language model (LLM)-based visual dialogue (VD) systems have made response generation for image-grounded conversations more accurate and coherent. However, user engagement (the extent to which a user is interested, emotionally involved, and willing to continue the conversation) remains a challenge. To fully explore engaging VD, we propose: (i) a new task named Audio-enhanced VD (AVD), which introduces audio dialogue contexts as additional input, since audio can convey the speaker's emotions more vividly, with the aim of generating responses that are not only correct but also more engaging. Specifically, we employ a text-to-speech model as a modality translator that generates a paired acoustic utterance for each input textual utterance; (ii) an accompanying approach named Visually-grounded and Interleaved Text-Audio Dialogue Modeling (VITA-DM), which exploits both image-grounded information and interleaved text-audio utterances for visual dialogue modeling, in contrast to previous multi-modal LLM (MLLM)-based methods, which typically model the text and audio modalities separately. We also present three pre-training tasks to better learn multi-modal interactions across language, vision, and audio; (iii) a novel metric named Multi-Modal Engagement (MME), which fills the gap in engagement estimation for VD and provides a fine-grained assessment along emotional, attentional, and reply engagement dimensions (EE, AE, RE). We experiment on two popular datasets and provide extensive automatic, engagement-specific, and human evaluations that support the validity of our approach. Furthermore, our empirical results reveal that emotion contributes the most to engagement, justifying our emphasis on the emotional aspect throughout the definition, solution, and evaluation of our task.
How to Cite
Ye, G., Zhao, H., Gao, Y., Zhao, Z., Wang, K., Zha, X., & Jiang, Z. (2026). Making Visual Dialogue More Engaging: A New Task, Method, and Metric. Proceedings of the AAAI Conference on Artificial Intelligence, 40(19), 16145-16153. https://doi.org/10.1609/aaai.v40i19.38650
Section
AAAI Technical Track on Data Mining & Knowledge Management III