Aligning Large Multi-Modal Model with Robust Instruction Tuning

¹University of Maryland, College Park, ²Microsoft Corporation

Abstract

Despite promising progress in multi-modal tasks, current large multi-modal models (LMMs) are prone to hallucinating descriptions that are inconsistent with the associated image and human instructions.

Hallucination Examples of LMMs

RED text is inconsistent with the image content. BLUE text is consistent with the image content.

Visual Instruction-Following Data

Based on the Visual Genome dataset with bounding boxes and dense captions, we interact with language-only GPT4 and collect 120K visual instruction-following samples in total. LRV-Instruction includes both positive and negative instructions.

For more details about the text prompt for GPT4, please refer to our paper.
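
Because GPT4 is used here as a language-only model, the image has to reach it as text. The sketch below shows one way dense captions and bounding boxes might be serialized into a generation prompt. The `Region` schema, the function names, and the prompt wording are all assumptions made for illustration; the actual prompt is the one given in the paper.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """One dense caption with its bounding box (hypothetical schema)."""
    caption: str
    x: int  # left edge in pixels
    y: int  # top edge in pixels
    w: int  # box width in pixels
    h: int  # box height in pixels


def regions_to_text(regions: list[Region]) -> str:
    """Serialize each region on its own line so a text-only model can read it."""
    return "\n".join(
        f"{r.caption} X: {r.x} Y: {r.y} Width: {r.w} Height: {r.h}"
        for r in regions
    )


def build_generation_prompt(regions: list[Region], n_instructions: int = 5) -> str:
    """Assemble an illustrative prompt (assumed wording, not the paper's)."""
    return (
        "Image content (dense captions with bounding boxes):\n"
        f"{regions_to_text(regions)}\n\n"
        f"Write {n_instructions} instructions with answers about this image. "
        "Include both positive and negative instructions."
    )


if __name__ == "__main__":
    demo = [Region("a dog lying on grass", 10, 20, 100, 50)]
    print(build_generation_prompt(demo, n_instructions=3))
```

The returned string would then be sent to the language model through whatever client is in use; that call is left out on purpose, since it is not described on this page.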

GPT4-Assisted Visual Instruction Evaluation

We introduce GPT4-Assisted Visual Instruction Evaluation (GAVIE) as a more flexible and robust approach to measuring the hallucination generated by LMMs without the need for human-annotated ground-truth answers. GPT4 takes the dense captions with bounding box coordinates as the image content and compares the human instruction with the model response. We then ask GPT4 to work as a smart teacher and score (0-10) the student's answer based on two criteria.

For more details about the text prompt for GAVIE, please refer to our paper.
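
The evaluation flow described above can be sketched as two small helpers: one that assembles the grading prompt from the image content, the instruction, and the model response, and one that reads two 0-10 scores back from the reply. This is a minimal sketch, not the GAVIE implementation. The function names, the prompt wording, and the `Score1`/`Score2` reply format are assumptions, and the two criteria are deliberately left unnamed here.

```python
import re


def build_gavie_prompt(image_content: str, instruction: str, response: str) -> str:
    """Assemble an illustrative grading prompt (assumed wording)."""
    return (
        "You are a smart teacher grading a student's answer.\n"
        f"Image content:\n{image_content}\n\n"
        f"Instruction: {instruction}\n"
        f"Student answer: {response}\n\n"
        "Score the answer from 0 to 10 on each of the two criteria.\n"
        "Reply in the form 'Score1: <n>, Score2: <n>'."
    )


def parse_scores(reply: str) -> tuple[int, int]:
    """Extract the two scores from a reply in the assumed format."""
    scores = [int(s) for s in re.findall(r"Score\d:\s*(\d+)", reply)]
    # Reject replies that do not contain exactly two in-range scores.
    if len(scores) != 2 or not all(0 <= s <= 10 for s in scores):
        raise ValueError(f"could not parse two 0-10 scores from: {reply!r}")
    return scores[0], scores[1]


if __name__ == "__main__":
    prompt = build_gavie_prompt(
        "a dog lying on grass X: 10 Y: 20 Width: 100 Height: 50",
        "What is the dog doing?",
        "The dog is lying on the grass.",
    )
    print(prompt)
    print(parse_scores("Score1: 9, Score2: 10"))
```

Validating the score range in the parser keeps a malformed or off-scale reply from silently entering an average, which matters when no human-annotated answer is available as a cross-check.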

More Examples from LRV-Instruction

We finetune MiniGPT4 on LRV-Instruction and successfully mitigate hallucination while improving performance. RED text is inconsistent with the image content. BLUE text is consistent with the image content.