Antidote: A Unified Framework for Mitigating LVLM Hallucinations in Counterfactual Presupposition and Object Perception

Yuanchen Wu1,2, Lu Zhang2 Hang Yao1 Junlong Du2 Ke Yan2,†
Shouhong Ding2 Yunsheng Wu2 Xiaoqiang Li1,
1 School of Computer Engineering & Science, Shanghai University 2 Tencent YouTu Lab
{yuanchenwu,yaohang,xqli}@shu.edu.cn {xluzhang,jeffdu,kerwinyan,ericshding,simonwu}@tencent.com

Abstract

Large Vision-Language Models (LVLMs) have achieved impressive results across various cross-modal tasks. However, hallucinations, i.e., the models generating counterfactual responses, remain a challenge. Though recent studies have attempted to alleviate object perception hallucinations, they focus on the models’ response generation and overlook the task question itself. This paper discusses the vulnerability of LVLMs in solving counterfactual presupposition questions (CPQs), where the models are prone to accept the presuppositions of counterfactual objects and produce severe hallucinatory responses. To this end, we introduce “Antidote” (code: https://github.com/Wu0409/Antidote), a unified, synthetic data-driven post-training framework for mitigating both types of hallucination above. It leverages synthetic data to incorporate factual priors into questions to achieve self-correction, and reformulates the mitigation process as a preference optimization problem. Furthermore, we construct “CP-Bench”, a novel benchmark to evaluate LVLMs’ ability to correctly handle CPQs and produce factual responses. Applied to the LLaVA series, Antidote can simultaneously enhance performance on CP-Bench by over 50%, POPE by 1.8-3.3%, and CHAIR & SHR by 30-50%, all without relying on external supervision from stronger LVLMs or human feedback and without introducing noticeable catastrophic forgetting.

1 Introduction

Recently, Large Vision-Language Models (LVLMs) have achieved significant advancements, showing promising performance across various tasks, such as image captioning, visual question answering (VQA), and visual dialogue [43, 23, 5, 45]. Despite their capabilities and versatility, hallucination, characterized by generating counterfactual information, remains a significant challenge. It undermines their reliability and limits applications in sensitive domains like healthcare and autonomous systems. In LVLMs, studies of hallucination mainly focus on “object perception”, including “object existence” and “image description” [50, 17, 47]. The former concerns whether the objects mentioned actually exist in the image, while the latter further evaluates whether the models output counterfactual attributes or relationships between objects. To alleviate the above types of hallucinations, recent works improve the instruction tuning process [20, 15], post-calibrate the model response via experts or decoding strategies [46, 17], and conduct post-training of models [50, 41]. These methods have proven effective in alleviating object hallucinations on popular benchmarks, such as POPE [18] and CHAIR [32].


Figure 1: The hallucinatory responses induced by CPQs. Although recent hallucination mitigation methods improve LVLMs in object perception, the resulting models can still be easily deceived by CPQs and produce severe hallucinatory responses.

However, a phenomenon emerges in Figure 1: for a simple question about the existence of a “car”, the LVLMs with recent hallucination mitigation methods (HACL [15] as an example) can confirm its absence. When we implicitly presuppose its existence and pose a relevant question “What is the brand of the car?”, the model suddenly outputs hallucinatory responses. This issue is particularly notable when querying about objects that are absent in the current image but frequently appear in similar visual contexts, underscoring a potential limitation of existing hallucination alleviation techniques. We call this type of question “Counterfactual Presupposition Question (CPQ)”. Compared to the conventional “object perception” hallucination focusing on response generation, CPQ further requires the model to judge presuppositions against the image. It is more challenging and evaluates the severity of hallucination in practical VQA scenarios where the validity of the presupposition cannot be guaranteed. Additionally, we investigate recent advanced open-source LVLMs (e.g., InternVL-2 [5] and Qwen2-VL [38]) that have demonstrated superior capabilities on general tasks and anti-hallucination on object perception, surpassing closed-source models such as GPT-4o [29] and Claude-3.5 [3]. However, as illustrated in Figure 2, we observe that they still suffer from CPQs and output hallucinatory responses as shown in Figure 3(a). To this end, we address the CPQ challenge, together with object perception hallucination, in the following two aspects:


Figure 2: Performance comparison of LVLMs on benchmarks of general capabilities (MMBench [26]), hallucination of object perception (POPE [18]) and CPQ (the proposed CP-Bench). Higher values indicate better performance on corresponding benchmarks.

  1. We introduce a unified, synthetic data-driven post-training framework called “Antidote”. We conjecture that the primary causes of the above hallucinations are object co-occurrence and over-learning of instructions. Hence, we aim to obtain images where the statistically co-occurring objects/scenes are decoupled (e.g., a speedboat without bridges in the background) and construct QA pairs targeting these decoupled, non-existent objects (e.g., bridge), as presented in Figure 3(b). We develop an automated data synthesis pipeline, comprising steps of image caption curation, visual scene understanding, factual verification, and sample construction. It allows us to derive factual priors for each sample, and then incorporate them into the prompt for models’ self-correction. This process reformulates hallucination mitigation as a preference optimization problem, where the original response is treated as a “rejected” sample, and the corrected response as a “preferred” sample. By employing Antidote, LVLMs learn a preference constraint during training, enabling them to discriminate counterfactual presuppositions and generate factual responses. Extensive experiments demonstrate its effectiveness on CPQ and object perception hallucinations.

  2. We construct “CP-Bench”, a benchmark evaluating LVLMs’ ability to discern counterfactual presuppositions and generate factual responses. It is composed of two parts: the synthetic validation (dev) set and the manually annotated test set. The dev set is automatically constructed from the data synthesis pipeline of Antidote. The CPQs in the test set fall into four categories of daily scenarios, i.e., item, knowledge, scene, and activity. The evaluations on recent advanced closed-source and open-source LVLMs highlight the critical gap in current models’ ability to discern counterfactual presuppositions, providing a new insight into hallucinations of LVLMs.


(a) CPQ samples from the proposed CP-Bench.


(b) Training samples from Antidote’s pipeline.

Figure 3: Examples of hallucination induced by CPQs and the synthetic samples of Antidote. The CPQs are selected from the test set of the proposed CP-Bench, which will be introduced in Section 4. “Hallucination candidates” are the non-existent objects that commonly appear in similar scenes. More examples can be viewed in Appendix.

2 Related Work

Hallucination in LVLMs. Recently, many large vision-language models (LVLMs) have emerged [23, 5, 27], extending the reasoning capabilities of LLMs to the vision modality. This enables LVLMs to complete various tasks, such as visual question answering and general visual dialogue. However, hallucination, stemming from the inherent nature of LLMs [11], modality misalignment [5], and the quality of instruction tuning data [20], raises concerns about their reliability and applicability. To evaluate the severity of hallucinations in LVLMs, POPE [18] identifies hallucinations related to object existence, while CHAIR [32] evaluates the proportion of hallucinated objects in image descriptions. To further broaden the scope of evaluation to include categories, attributes, and emotions within image descriptions, SHR [50], a GPT-assisted evaluation metric, has been proposed. This paper discusses LVLMs’ hallucinations in counterfactual presupposition questions and introduces a corresponding benchmark, CP-Bench, to evaluate the severity of this issue. Our findings reveal that recent open-source LVLMs largely overlook this critical issue.

Hallucination Mitigation. Previous works mitigating hallucinations of LVLMs primarily focus on object existence and image descriptions, i.e., object perception hallucination. Three mainstream approaches have emerged for mitigating these hallucinations: supervised fine-tuning (SFT), post-calibration, and post-training. SFT fine-tunes models on hallucination-free data [47], such as LRV [20] and InstructBLIP [6]. Post-calibration applies additional post-processing techniques to model outputs, such as contrastive decoding strategies [17, 39, 36] and leveraging existing tools or expert models [46]. Post-training focuses on reducing the hallucination of off-the-shelf LVLMs, commonly employing retraining or preference optimization [50, 41, 51]. In contrast, the proposed Antidote not only effectively reduces object perception hallucinations but also overcomes the hallucinations induced by CPQs. It adopts the preference optimization paradigm but differs in that it neither relies on expert models (e.g., GPT-4V) [50] to generate preference samples nor exclusively utilizes dis-preferred data [51]. Instead, we fully leverage the advantages of our synthetic data pipeline, seamlessly utilizing factual information without additional cost to enable the model to self-correct its responses.

3 Our Post-training Framework: Antidote

3.1 Motivation

As illustrated in Figure 3(a), two key issues can be observed: (1) LVLMs tend to blindly follow the instruction in the task query (Image #1 and Image #2). When asking “what does the label on the beer show?” for Image #1, the model does not check whether the subject of the question (i.e., the beer) exists and directly follows the instruction to identify the text on the label. (2) LVLMs overfit to similar scene-based QA patterns (Image #3 and Image #4). When asking “what is the player’s number?” in a scenario where the player has no number on the uniform, the model generates a hallucinated answer of the kind that commonly appears in VQA tasks with similar scenes. Based on the above observations, we aim to obtain images where statistically co-occurring objects are decoupled (e.g., a car without wheels) and construct queries targeting these decoupled, non-existent objects (e.g., wheel), so as to calibrate the bias of LVLMs. Thanks to the advancements in image generation models [30, 7] and LLMs [37, 44], we can synthesize the images and corresponding questions in a controlled manner (Figure 4). Then, we incorporate the factual priors obtained during this process into the models’ prompts for self-correction. Finally, the self-correction process reformulates hallucination mitigation as a preference optimization problem, and we conduct direct preference optimization (DPO) to post-train the LVLMs.

3.2 Data Synthesis Pipeline

Step 1: Construction of Caption Pool. The caption pool is critical for enhancing the diversity and richness of the training set for Antidote. Captions can be sourced either from web-crawled datasets or generated by LLMs. Here, we collect the captions from CC3M [35] to build our pool. Since CC3M contains many noisy or unsuitable captions for image and question generation, DeepSeek-V2 [19] is adopted to perform a re-captioning and filtering process. To enhance the pool’s diversity, we employ a fuzzy deduplication strategy through MinHash and LSH algorithms [13].
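
The fuzzy deduplication in Step 1 can be illustrated with a minimal MinHash/LSH sketch. The word-level 3-gram shingles, the 64 hash functions, the 16 bands, and the 0.8 Jaccard threshold are all assumptions chosen for illustration; the settings actually used for the caption pool are not stated here.

```python
import hashlib
from collections import defaultdict

NUM_PERM = 64   # assumed number of hash functions
BANDS = 16      # assumed number of LSH bands (4 rows per band)

def shingles(text, n=3):
    """Set of word n-grams; short captions fall back to a single shingle."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def _hash(seed, token):
    digest = hashlib.md5(f"{seed}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def minhash(text):
    """MinHash signature: per seed, the minimum hash over the shingle set."""
    sh = shingles(text)
    return tuple(min(_hash(seed, s) for s in sh) for seed in range(NUM_PERM))

def jaccard(a, b):
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)

def deduplicate(captions, threshold=0.8):
    """Drop a caption if a kept caption shares an LSH bucket with it and
    their exact Jaccard similarity reaches the (assumed) threshold."""
    rows = NUM_PERM // BANDS
    buckets = defaultdict(list)   # (band index, band signature) -> kept indices
    kept = []
    for caption in captions:
        sig = minhash(caption)
        keys = [(b, sig[b * rows:(b + 1) * rows]) for b in range(BANDS)]
        candidates = {i for k in keys for i in buckets.get(k, [])}
        if any(jaccard(caption, kept[i]) >= threshold for i in candidates):
            continue
        for k in keys:
            buckets[k].append(len(kept))
        kept.append(caption)
    return kept
```

Banding only proposes candidate pairs; the exact Jaccard check on the candidates keeps false positives from removing distinct captions.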

Step 2: Visual Scene Understanding. First, we prompt DeepSeek-V2 with instructions, such as “removing abstract concepts and specific terms” and “limit to less than 15 words”, to rewrite captions $\mathcal{C}_{img}$ for subsequent image generation. Second, we utilize DeepSeek-V2’s comprehension and reasoning capability to identify the objects $\mathcal{O}_{pre}$ within the scenes described by captions. Third, we leverage DeepSeek-V2’s world knowledge to generate objects $\mathcal{O}_{hallu}$ that typically occur in similar scenes. These objects serve as hallucination candidates that will not be present in the generated images. The prompt template P1 in this step is detailed in Appendix. Inspired by recent self-reflection strategies [14, 42], each triplet $<\mathcal{C}_{img}, \mathcal{O}_{pre}, \mathcal{O}_{hallu}>$ is sent back to DeepSeek-V2 to verify whether each element conforms to the rules in P1, such as “the number of generated objects”, “generating objects with visible entities”, and “avoiding conflicts in $\mathcal{O}_{pre}$ and $\mathcal{O}_{hallu}$”.

Step 3: Data Synthesis. In image generation, $\mathcal{C}_{img}$ serves as the prompt and $\mathcal{O}_{hallu}$ as the negative prompt. Benefiting from recent image generation models [30, 7], the generated images exhibit a high degree of photorealism and diverse content. This paper adopts Stable-Diffusion-3 [7] as the generator. However, the generator cannot ensure that the generated objects fully align with $\mathcal{O}_{pre}$ while suppressing the existence of $\mathcal{O}_{hallu}$. Hence, we introduce a “Factual Assessor” driven by an open-set grounding model, Grounding-DINO [24]. It checks the presence of $\mathcal{O}_{pre}$ and $\mathcal{O}_{hallu}$ in the generated images. If an object in $\mathcal{O}_{pre}$ is not detected, it is removed from $\mathcal{O}_{pre}$. Similarly, objects in $\mathcal{O}_{hallu}$ that are detected are removed from $\mathcal{O}_{hallu}$. If either $\mathcal{O}_{pre}$ or $\mathcal{O}_{hallu}$ is $\emptyset$, the corresponding triplet is discarded. Finally, the remaining triplets are sent to DeepSeek-V2 to generate task queries, such as CPQs and description queries.
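
The filtering rules of the Factual Assessor can be sketched as follows. This is only an illustration of the logic described above: `detect` is a hypothetical callable standing in for the open-set grounding model, and no real Grounding-DINO API is used.

```python
def assess_triplet(caption, present, hallucinated, detect):
    """Apply the filtering rules described above to one triplet.

    `detect(name)` is a hypothetical callable that returns True when the
    grounding model finds `name` in the generated image.
    Returns the cleaned triplet, or None when it must be discarded.
    """
    # Objects that should be visible are kept only if they are detected.
    kept_present = [o for o in present if detect(o)]
    # Hallucination candidates are kept only if they are NOT detected.
    kept_hallu = [o for o in hallucinated if not detect(o)]
    # An empty object set on either side discards the whole triplet.
    if not kept_present or not kept_hallu:
        return None
    return caption, kept_present, kept_hallu


# Toy usage with a set-based stand-in for the detector.
visible = {"speedboat", "water"}
print(assess_triplet(
    "a speedboat on a lake",
    ["speedboat", "pier"],
    ["bridge", "water"],
    lambda name: name in visible,
))
```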

In early experiments, we note that LLMs tend to generate similar questions when the same main object is chosen (e.g., frequently focusing on “color” or “brand” when selecting “car” in $\mathcal{O}_{pre}$ or $\mathcal{O}_{hallu}$), degrading the diversity of task queries in the generated dataset. Thus, a key-value memory bank is maintained to save processed captions and corresponding queries. The captions are encoded into sentence embeddings using BGE-m3 [4], which serve as the keys of the memory bank. For each $<\mathcal{C}_{img}, \mathcal{O}_{pre}, \mathcal{O}_{hallu}>$, we retrieve from the memory bank the questions whose captions are semantically close to $\mathcal{C}_{img}$. These are then integrated into the prompt “Do not generate questions similar to the following: …” to mitigate redundancy in generation.
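
A minimal sketch of such a memory bank is given below. The `embed` callable is a hypothetical stand-in for the sentence encoder, and the similarity threshold and top-k values are assumptions, since the retrieval settings are not specified here.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

class QueryMemoryBank:
    """Keys are caption embeddings; values are the queries generated for them."""

    def __init__(self, embed, min_similarity=0.8, top_k=3):
        self.embed = embed                    # hypothetical sentence encoder
        self.min_similarity = min_similarity  # assumed retrieval threshold
        self.top_k = top_k                    # assumed number of neighbours
        self.entries = []                     # list of (embedding, [queries])

    def add(self, caption, queries):
        self.entries.append((self.embed(caption), list(queries)))

    def retrieve(self, caption):
        """Queries whose captions are semantically close to `caption`."""
        key = self.embed(caption)
        scored = [(cosine(key, emb), qs) for emb, qs in self.entries]
        scored = [item for item in scored if item[0] >= self.min_similarity]
        scored.sort(key=lambda item: item[0], reverse=True)
        return [q for _, qs in scored[: self.top_k] for q in qs]

    def avoidance_prompt(self, caption):
        similar = self.retrieve(caption)
        if not similar:
            return ""
        return ("Do not generate questions similar to the following: "
                + "; ".join(similar))
```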


Figure 4: The data synthesis pipeline for Antidote. The pipeline consists of three stages: (a) construction of caption pool; (b) visual scene understanding; (c) data synthesis.


Figure 5: Overview of the proposed Antidote post-training. The factual information from the synthetic data is seamlessly integrated into the input task prompt. The LVLMs can utilize this information to self-correct their responses, which serve as “positive” samples. The original responses are regarded as “negative” samples to achieve preference alignment for hallucination alleviation.

3.3 Self-Correction via Preference Alignment

Antidote is a universal post-training framework for alleviating hallucination in CPQs, object existence, and image description. The overview of post-training is presented in Figure 5. Through the data synthesis pipeline, we construct three types of task queries for post-training:

  1. Presupposition Questions: We prompt DeepSeek-V2 to generate CPQs based on $\mathcal{O}_{hallu}$ using P2 in Appendix. In early experiments, we observed that post-training with CPQs alone makes the baseline model overly “cautious” in responding to questions. Thus, we construct a True Presupposition Question (TPQ) set based on $\mathcal{O}_{pre}$. For the Antidote for presupposition questions, we prompt the baseline model with “Given the fact that there is {facts of $\mathcal{O}_{pre}$ / $\mathcal{O}_{hallu}$}, please answer: {CPQ/TPQ}” to self-correct the original answer. For CPQs and TPQs, the self-corrected answer is used as the positive response $y_{pos}$, while the original answer is used as the negative response $y_{neg}$.

  2. Object Existence: We randomly select objects in $\mathcal{O}_{pre}$ and $\mathcal{O}_{hallu}$ as the object candidates to build the training set of object existence. The question prompts are generated by DeepSeek-V2, such as “Is / Are there {object} in the image?” and “Can you see {object} in the image?”. For the Antidote for object existence, we prompt the model with “Given the fact that there is / isn’t {object}, please answer: {question}” to self-correct its response.

  3. Image Description: We generate task queries of image description by DeepSeek-V2, such as “Please describe the image in detail.” and “Can you describe what you see in the image thoroughly?”. For the Antidote of image description, we integrate $<\mathcal{C}_{img}, \mathcal{O}_{pre}, \mathcal{O}_{hallu}>$ into the query and prompt the model to self-correct its response: “Given the hint of the image: the image caption: {$\mathcal{C}_{img}$}, the object(s) you can see: {$\mathcal{O}_{pre}$}, the object(s) you cannot see: {$\mathcal{O}_{hallu}$}, please {query}”.
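
The three self-correction prompts above can be assembled as in the following sketch. The template wording follows the quoted prompts; the function names, the way facts are verbalized (e.g., “no car in the image”), and the comma-joined object lists are assumptions made for illustration.

```python
def presupposition_prompt(facts, question):
    # `facts` is a short statement derived from O_pre (for TPQs) or from
    # O_hallu (for CPQs); its exact wording is an assumption here.
    return f"Given the fact that there is {facts}, please answer: {question}"

def existence_prompt(obj, exists, question):
    verb = "is" if exists else "isn't"
    return f"Given the fact that there {verb} {obj}, please answer: {question}"

def description_prompt(caption, present, absent, query):
    # `query` is the description instruction without a leading "Please".
    return (
        f"Given the hint of the image: the image caption: {caption}, "
        f"the object(s) you can see: {', '.join(present)}, "
        f"the object(s) you cannot see: {', '.join(absent)}, "
        f"please {query}"
    )


print(existence_prompt("a bridge", False, "Is there a bridge in the image?"))
print(description_prompt("a speedboat on a lake", ["speedboat"], ["bridge"],
                         "describe the image in detail."))
```

The factual prior appears only in the prompt used to obtain the corrected response; the preference pair itself is later paired with the original task prompt.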

Response Filtering: Since not all model responses contain hallucinations, especially in object existence queries, such samples are unhelpful for hallucination mitigation and even lead to difficulty in optimization [40]. Thus, we compare the original and self-corrected responses and filter out samples with similar answers. In our experiments, we extract the embeddings of both responses using BGE-m3 [4] and calculate their cosine similarity to perform the filtering.
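
A minimal sketch of this filtering step is shown below. The `embed` callable is a hypothetical stand-in for the sentence encoder, and the 0.9 similarity cut-off and the dictionary field names are assumptions, as the actual threshold is not given here.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def filter_pairs(samples, embed, max_similarity=0.9):
    """Keep only samples whose original and self-corrected responses differ.

    Each sample is a dict with the (assumed) keys "original" and "corrected".
    Pairs with near-identical responses carry little preference signal and
    are dropped.
    """
    kept = []
    for sample in samples:
        sim = cosine_similarity(embed(sample["original"]),
                                embed(sample["corrected"]))
        if sim < max_similarity:
            kept.append(sample)
    return kept
```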

Preference Optimization: By direct preference optimization (DPO) [31], we encourage the model to favor the corrected positive response and reject the hallucinatory negative response without building an explicit reward model [34]. Given the above constructed preference pairs $\mathcal{D}$, the policy model $\pi_{\theta}$ (i.e., the post-trained LVLMs with Antidote) is optimized by maximizing the log-likelihood of the preferred response $y_{pos}$ while minimizing the likelihood of the hallucinated response $y_{neg}$. The training objective function is given by:

$$\mathcal{L}_{dpo}(\pi_{\theta};\pi_{ref}) = -\mathbb{E}_{\mathcal{D}}\Bigg[\log\sigma\Bigg(\beta\log\frac{\pi_{\theta}(y_{pos}\mid[x_{T},x_{I}])}{\pi_{ref}(y_{pos}\mid[x_{T},x_{I}])} - \beta\log\frac{\pi_{\theta}(y_{neg}\mid[x_{T},x_{I}])}{\pi_{ref}(y_{neg}\mid[x_{T},x_{I}])}\Bigg)\Bigg], \tag{1}$$

where $x_{T}$ and $x_{I}$ represent the text task prompt (without factual prior) and the image, and $\pi_{ref}$ denotes the reference model (i.e., the original baseline LVLMs). The function $\sigma$ is the sigmoid function, and $\beta$ is a hyperparameter controlling the preference margin. In the above preference optimization process, the implicit reward of a response $y$ is defined as:

$$\hat{r}(x_{T},x_{I},y)=\beta\log\frac{\pi_{\theta}(y\mid[x_{T},x_{I}])}{\pi_{ref}(y\mid[x_{T},x_{I}])}. \tag{2}$$

Through maximizing the reward margin between the self-corrected response $y_{pos}$ and the hallucinated response $y_{neg}$, we ensure that the model increasingly favors non-hallucinated samples over hallucinatory ones, leading to a robust self-correction process.
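
For a single preference pair, the objective in Eq. (1) can be evaluated as in the sketch below. It assumes the sequence log-probabilities under the policy and the reference model have already been computed; the function names are illustrative, and the default $\beta = 0.1$ is only a placeholder value.

```python
import math

def implicit_reward(policy_logp, ref_logp, beta=0.1):
    """beta * log(pi_theta(y | x) / pi_ref(y | x)) from sequence log-probs."""
    return beta * (policy_logp - ref_logp)

def dpo_loss(policy_pos, policy_neg, ref_pos, ref_neg, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(r(y_pos) - r(y_neg))."""
    margin = (implicit_reward(policy_pos, ref_pos, beta)
              - implicit_reward(policy_neg, ref_neg, beta))
    # -log(sigmoid(m)) = log(1 + exp(-m)), written in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# When the policy equals the reference, the margin is 0 and the loss is log 2.
print(dpo_loss(-10.0, -10.0, -10.0, -10.0))
```

The loss decreases monotonically as the reward margin between $y_{pos}$ and $y_{neg}$ grows, which is the behavior the paragraph above relies on.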

4 CP-Bench: Evaluate Hallucination of CPQ

Motivation. Recent hallucination benchmarks primarily focus on response generation, including object existence, attributes, and relations, while overlooking the textual semantics within queries. Figure 3(a) illustrates the vulnerability of recent LVLMs in solving counterfactual presupposition questions. To address this gap, CP-Bench is developed to quantify the LVLMs’ ability to judge the correctness of presuppositions and generate factual responses.

Details. CP-Bench consists of two subsets: a validation (dev) set and a test set. For the dev set, the images and questions are generated using the above data synthesis pipeline of Antidote. For the test set, the images are sampled from the CC3M dataset [35]. The corresponding queries are constructed by the following steps: object candidates are first obtained via an open-set grounding model, followed by generating query candidates via DeepSeek-V2 using prompt P2. Challenging candidates are manually selected as the final task queries. Each set consists of 1,000 curated samples, equally split into 500 CPQs and 500 true presupposition questions (TPQs). All counterfactual candidates in CPQs of CP-Bench are chosen from objects commonly associated with similar semantics or scenes, such as “railroad” in train-related contexts. This increases the benchmark’s difficulty, as prior research has shown that LVLMs often suffer from inherent statistical biases in their pre-training or fine-tuning datasets [18, 47]. The samples in CP-Bench can be classified into four categories of daily scenarios: item, knowledge, scene, and activity. More statistical details of CP-Bench can be viewed in Figure 6.

Evaluation. Given the open-ended responses of LVLMs, GPT-4o [2] is introduced to convert the evaluation into a binary classification task, assessing whether the models correctly recognize the correctness of presuppositions and output factual responses. CPQs are labeled as the “positive” class, while TPQs are labeled as the “negative” class. For TPQs, GPT-4o evaluates whether the models accurately identify the presence of the objects and generate corresponding responses. For CPQs, GPT-4o evaluates whether the models can discriminate and reveal the counterfactual information implicit in presuppositions. The primary evaluation metrics are the F1-score, Recall, Accuracy, and Precision. The prompt P3 used for CP-Bench evaluation is provided in Appendix.
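
Once each response has been reduced to a binary verdict, the four metrics follow from the confusion counts, as in this sketch. It assumes that a label of `True` marks a CPQ (the positive class) and that a prediction of `True` means the judge decided the model treated the presupposition as counterfactual; this boolean encoding is an assumption made for illustration.

```python
def binary_metrics(labels, predictions):
    """Precision, recall, F1 and accuracy with CPQs as the positive class."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y and p)
    fp = sum(1 for y, p in pairs if not y and p)
    fn = sum(1 for y, p in pairs if y and not p)
    tn = sum(1 for y, p in pairs if not y and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}
```

Under this encoding, a model that accepts every presupposition has zero recall, while a model that rejects every presupposition has perfect recall but low precision, so the F1-score penalizes both over-acceptance and over-caution.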


Figure 6: The statistical details of CP-Bench (test) and CPQ examples. CP-Bench includes four types (i.e., scene, knowledge, item, and activity) of CPQs and TPQs from different scenes. It can comprehensively evaluate the LVLMs’ ability to discriminate the correctness of presuppositions and generate factual responses.

5 Experiment

5.1 Implementation Setup

Experiment Baselines: We post-trained the LLaVA series with the proposed Antidote, including LLaVA-1.5-Vicuna-7B/13B [23] and LLaVA-Next-Mistral-7B [22]. All models above have been fully tuned on their collected visual instruction data before post-training. In practical implementation, we adopt LoRA [10] for training efficiency. The LoRA rank $r$ is 64, $\alpha$ is 128, and the scale parameter $\beta$ in direct preference optimization is 0.1. More hyper-parameter settings can be viewed in Appendix.

Evaluation Benchmarks. Besides CP-Bench, we assess the effectiveness of Antidote using three popular hallucination benchmarks and four general benchmarks. POPE [18] is used for evaluating object existence, while CHAIR [32] and SHR [50] are used to evaluate the hallucinations in image descriptions. Compared to CHAIR, which focuses on evaluating object-related hallucinations in responses, SHR focuses on sentence-level hallucinations with the introduction of LLMs. To check for catastrophic forgetting, we verify the general capability (such as visual reasoning, perception, and cross-domain generalization) of models trained with Antidote on Science-QA [33], MMBench [25], MMVet [48], and LLaVA-Wild [23].

Data Composition. The training set of Antidote is built by the data synthesis pipeline introduced in Section 3.2. Initially, we generated 14k triplets of $<\mathcal{C}_{img}, \mathcal{O}_{pre}, \mathcal{O}_{hallu}>$ and filtered out approximately 4k triplets with the Factual Assessor. Then, we generated the queries of CPQs, TPQs, object existence, and image descriptions for each remaining triplet. For each baseline LVLM, we applied response filtering after its inference and self-correction, discarding around 15% of the total samples. Finally, we sampled 5,000 CPQs, 5,000 TPQs, 2,000 questions of object existence, and 8,000 image description queries, i.e., a total of 20k samples, for post-training. The discussion of data proportion settings can be viewed in Appendix.

5.2 Main Results

How does Antidote improve LVLMs’ ability to discriminate presuppositions? In Table 1, we compare the performance of closed-source models, open-source models, and baseline models post-trained with Antidote on CP-Bench. The results reveal that closed-source LVLMs significantly outperform open-source models in identifying CPQs and producing factual responses. The best performance is achieved by Claude-3.5-Sonnet and GPT-4o, which recall nearly 90% of CPQs on the test set. Among open-source models, LLaVA-Next-Vicuna-13B stands out, achieving an F1-score of 56.0% and a recall of 41.6%. Notably, Antidote brings substantial improvements. For example, it boosts the F1-scores of the LLaVA-1.5 7B and 13B models from 5.7 and 17.3 to 78.4 and 83.5, respectively. As illustrated in Figure 7, models post-trained with Antidote produce factual responses such as “I cannot answer … as there is no …”. These results highlight Antidote’s efficacy in helping models recognize counterfactual presuppositions, pushing open-source models closer to their closed-source counterparts on this challenging task.
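
To make the recall and F1 figures above concrete, the sketch below computes precision, recall, and F1 for a binary "rejects the presupposition" decision. It assumes that CPQs are the positive class and that each model response has already been judged as rejecting or accepting the presupposition. The actual CP-Bench judging protocol is defined elsewhere in the paper, so the function name and input format here are purely illustrative.

```python
def precision_recall_f1(is_cpq, rejected):
    """Binary metrics with CPQs as the positive class.

    is_cpq[i]   -- True if question i has a counterfactual presupposition.
    rejected[i] -- True if the response was judged to reject the presupposition.
    """
    pairs = list(zip(is_cpq, rejected))
    tp = sum(1 for y, p in pairs if y and p)        # CPQ correctly rejected
    fp = sum(1 for y, p in pairs if not y and p)    # true presupposition rejected
    fn = sum(1 for y, p in pairs if y and not p)    # CPQ wrongly accepted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

Under this reading, a model that accepts almost every presupposition has low recall and therefore a low F1, even if it rarely rejects a true presupposition.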

Table 1: Performance Comparison on CP-Bench (test and dev). The detailed evaluation results of the dev set are provided in Appendix.

Table 2: Hallucination evaluation (POPE) on object existence. The performance is the average of the results across the three subsets.

Table 3: Hallucination evaluation on image description, CHAIR and SHR. Max new tokens is set as 64 for each model. Smaller values correspond to fewer hallucinations.

Does model size affect the ability to discriminate the correctness of presuppositions? (1) With identical architectures and training data, larger models are better at judging presupposition correctness. For instance, as Qwen2-VL [38] scales from 7B to 72B parameters, the recall increases from 13.2% to 33.3%, with a similar trend observed in the InternVL2 series. (2) Across different models, however, model size is not a decisive factor. Notably, MiniCPM-V2.5 [45], with only 8B parameters, achieves a recall that is 8.5% higher than that of Qwen2-VL-72B, demonstrating superior performance in recognizing CPQs. Moreover, InternVL2 [5] and Qwen2-VL, which surpass closed-source models in general performance, do not perform well on CP-Bench. Both models have utilized large-scale instruction fine-tuning datasets to enhance visual capabilities. We believe that their performance on CP-Bench is strongly correlated with over-fitting during instruction tuning.

How do LVLMs with Antidote perform on typical hallucination benchmarks? Here, we compare various types of mitigation approaches, including contrastive decoding (e.g., VCD [23] and VDD [49]), auxiliary learning (e.g., HACL [15]), and post-training (e.g., Volcano [16] and SeVA [51]). (1) For object existence, we evaluate on POPE, where the results (Table 2) are averaged across three evaluation sets: the random, popular, and adversarial sets (the results for each set can be found in the Appendix). On LLaVA-1.5-7B, Antidote improves the original F1-score from 86.07 to 87.89 (+1.82%), with an even greater improvement on the 13B version, from 85.67 to 88.99 (+3.32%). Notably, we observed significant improvements on the adversarial subset, where objects are first ranked by co-occurrence frequency and the top-k frequent objects are sampled. Over the original 7B and 13B versions, Antidote improves results on this subset by 2.58% and 4.12%, respectively. This demonstrates that Antidote can effectively mitigate the statistical biases inherent in LVLMs, which substantially contribute to object hallucination [18]. (2) For image description, we first evaluated on CHAIR (Table 3), which quantifies hallucination as the ratio of objects mentioned in the description that are not present in the ground truth. On the LLaVA-1.5 series, we observed a substantial reduction in hallucinations, with hallucination rates decreasing by over 50%. For the 7B version, we reduced CHAIR$_s$ from the prior best score of 15.1 to 9.4. We also tested on LLaVA-Next-Mistral-7B, improving its CHAIR$_s$ and CHAIR$_i$ scores to 10.7 and 3.5, respectively. Additionally, we evaluated on SHR [50], an advanced benchmark that uses detailed object-level descriptions from the Visual-Genome dataset as factual information and relies on GPT-4 to judge hallucinations in descriptions. Antidote likewise significantly reduces hallucinations relative to the baseline models on this metric.
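
The two CHAIR ratios discussed above can be sketched as follows, using the commonly cited definitions: CHAIR$_i$ is the fraction of mentioned objects that are hallucinated, and CHAIR$_s$ is the fraction of descriptions containing at least one hallucinated object. The function assumes that object mentions have already been extracted and mapped to ground-truth category names. The synonym matching used by the real benchmark is omitted, and all names below are illustrative.

```python
def chair_scores(mentioned_objects, ground_truth_objects):
    """Return (CHAIR_s, CHAIR_i) as fractions in [0, 1].

    mentioned_objects[k]    -- list of object names mentioned in description k.
    ground_truth_objects[k] -- set of object names present in image k.
    """
    total_mentions = 0
    hallucinated_mentions = 0
    hallucinated_descriptions = 0
    for mentioned, truth in zip(mentioned_objects, ground_truth_objects):
        bad = [obj for obj in mentioned if obj not in truth]
        total_mentions += len(mentioned)
        hallucinated_mentions += len(bad)
        if bad:
            hallucinated_descriptions += 1
    n = len(mentioned_objects)
    chair_s = hallucinated_descriptions / n if n else 0.0
    chair_i = hallucinated_mentions / total_mentions if total_mentions else 0.0
    return chair_s, chair_i
```

For instance, if one of three descriptions mentions a single absent object out of six mentions overall, this gives CHAIR$_s$ = 1/3 and CHAIR$_i$ = 1/6. Benchmark tables usually report these values multiplied by 100.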

Table 4: The evaluation on the benchmark of LVLMs’ general capabilities.

Table 5: Attention visualization between the text tokens and vision tokens. The intensity of each text token’s background indicates the attention weight magnitude of image tokens, with darker highlights representing higher attention. The attention values above the 0.995th quantile are shown with the highest color intensity (such as shrimp and vegetables).

Table 6: The effect of hyper-parameter in LoRA during the post-training and the SFT alternative. The baseline LVLM is LLaVA-1.5-7B. F1-score is adopted in CP-Bench and POPE.

Table 7: Comparison of model responses before and after Antidote post-training. The cases are selected from the proposed CP-Bench benchmark.

5.3 Analysis

Catastrophic Forgetting. Since Antidote is a post-training method that fine-tunes the baseline models’ parameters, we evaluated whether it causes catastrophic forgetting by assessing the general capability of the post-trained LLaVA-1.5 series. Table 4 shows that performance on these benchmarks did not significantly degrade and even improved on some of them, such as a 2.8% and 2.6% increase on Science-QA [28]. This suggests that suppressing object perception hallucinations and enhancing CPQ discrimination can generalize to improvements in overall capabilities. There was a slight decrease in performance on LLaVA-Wild [23], where we observed that the post-trained version was more “cautious” than the baseline model when answering uncertain or challenging questions, a behavior that its GPT-4 evaluator does not prefer.

Attention Visualization. We empirically investigate how attention to visual tokens contributes to important object-related text tokens before and after applying the proposed Antidote. We visualize some representative instances in the attention visualization (Table 5). For instance, when asked the CPQ “What is the fork made of in the image?”, the original LLaVA-1.5 does not significantly attend to visual tokens while outputting “fork”, and incorrectly attends to visual token information when outputting “metal”. After training with Antidote, however, the model’s attention to visual tokens becomes more accurate, focusing on the exact areas of the image corresponding to object-related text tokens, such as “shrimp” and “flowers”.
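
The color scaling described in the attention visualization caption (values above the 0.995 quantile drawn at full intensity) can be sketched as below. This is only an illustration under stated assumptions: how attention is aggregated over heads and layers is not specified here, so `attn_rows` is taken as given, and both function names are hypothetical.

```python
def image_attention_mass(attn_rows, image_span):
    """Sum each text token's attention over the image-token positions.

    attn_rows[t][j] is the attention weight from text token t to sequence
    position j; image_span = (start, stop) marks the image tokens.
    """
    start, stop = image_span
    return [sum(row[start:stop]) for row in attn_rows]


def saturating_intensity(values, q=0.995):
    """Scale values to [0, 1], saturating everything above the q-quantile."""
    if not values:
        return []
    ordered = sorted(values)
    # Simple lower-rank quantile estimate; plotting libraries may interpolate.
    cap = ordered[int(q * (len(ordered) - 1))]
    if cap <= 0:
        return [0.0] * len(values)
    return [min(v, cap) / cap for v in values]
```

Capping at a high quantile keeps a handful of very large attention values from washing out the contrast among the remaining tokens.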

Comparison with SFT. A straightforward alternative to Antidote’s preference optimization is direct supervised fine-tuning (SFT) on the self-corrected responses in Antidote. As shown in Table 6, while SFT is effective in addressing model hallucinations for CPQs and image descriptions, it significantly underperforms Antidote, particularly on POPE and MMBench, and suffers from catastrophic forgetting to some extent. Unlike SFT, which merely increases the probability of self-corrected responses, Antidote’s preference alignment can be viewed as a form of contrastive learning (further discussion can be found in the Appendix), in which the model is trained to distinguish between self-corrected and hallucinatory responses. It exploits the preference information by increasing the model’s probability of self-corrected responses relative to hallucinatory ones, guiding the model to suppress hallucinations while reducing over-fitting to preference samples.
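
The contrast drawn above between SFT and preference optimization can be illustrated numerically. The sketch below assumes the standard direct preference optimization (DPO) objective with $\beta = 0.1$, the value given in the experimental setup. The inputs are sequence-level log-probabilities of the self-corrected (chosen) and hallucinatory (rejected) responses under the policy and a frozen reference model. Function names are illustrative, and the exact objective used by Antidote is the one specified in the paper.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def sft_loss(logp_chosen):
    """SFT only raises the likelihood of the chosen response."""
    return -logp_chosen


def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for a single preference pair."""
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    # The loss depends only on the margin between the two implicit rewards.
    return -log_sigmoid(chosen_reward - rejected_reward)
```

When the policy equals the reference model the margin is zero and the DPO loss is $\log 2$. The loss decreases only as the chosen response becomes more likely relative to the rejected one, which is the contrastive behavior described in the text, whereas the SFT loss ignores the rejected response entirely.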

LoRA Fine-tuning. We evaluate the LoRA setting for Antidote. In parameter-efficient learning, the rank $r$ determines the extent to which the model’s knowledge can be altered during post-training. As presented in Table 6, a relatively higher rank $r$ gives greater flexibility in adjusting the model’s knowledge. However, a larger $r$ can lead to catastrophic forgetting (when $r=128$) and even cause over-optimization of Antidote, resulting in model collapse (when $r=256$). In conclusion, we set the rank $r$ to 64 and the scaling factor $\alpha$ to 128 ($2 \times r$ by default) in experiments.
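
As a rough illustration of what $r$ and $\alpha$ control, the sketch below applies the usual LoRA update $W x + (\alpha / r)\, B A x$ with plain Python lists and counts the extra trainable parameters. It assumes the standard LoRA scaling of $\alpha / r$. The helper names are hypothetical, and the 4096-dimensional layer in the usage note is only an example size.

```python
def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_scale(rank, alpha):
    """Standard LoRA scaling factor applied to the low-rank update."""
    return alpha / rank


def lora_extra_params(d_in, d_out, rank):
    """Trainable parameters added by A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


def lora_forward(x, W, A, B, alpha):
    """Compute W x + (alpha / r) * B (A x), with r inferred from A."""
    rank = len(A)
    scale = lora_scale(rank, alpha)
    base = matvec(W, x)               # frozen pretrained path
    update = matvec(B, matvec(A, x))  # low-rank trainable path
    return [b + scale * u for b, u in zip(base, update)]
```

With $r = 64$ and $\alpha = 128$ the update is scaled by 2. For a hypothetical 4096 × 4096 projection, `lora_extra_params(4096, 4096, 64)` adds 524,288 trainable parameters, about 3% of the frozen matrix; doubling $r$ doubles this capacity, consistent with the trade-off between flexibility and forgetting noted above.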

6 Conclusion

This paper addresses the hallucinations in Large Vision-Language Models (LVLMs), focusing both on Counterfactual Presupposition Questions (CPQs) and object perception. The primary contribution of this work is Antidote, a synthetic data-driven post-training framework that significantly mitigates hallucinations by enhancing LVLMs’ capability to discern counterfactual presuppositions and objects. To complement this, we introduce CP-Bench, a novel benchmark tailored to evaluate LVLMs’ performance on CPQs. Extensive experiments demonstrate that Antidote not only significantly improves performance in recognizing counterfactual presuppositions but also enhances performance across a range of hallucination-related benchmarks, including POPE, CHAIR, and SHR, all without inducing catastrophic forgetting. These results underscore the promise of Antidote as a robust solution for enhancing the reliability of LVLMs in diverse vision-language tasks.

References