Describing Differences in Image Sets with Natural Language
What is the difference between two sets of images?
We all know it's valuable to understand your data, but sifting through thousands of images to find their differences is impractical. To aid in this discovery process, we explore the task of automatically describing the differences between two sets of images, which we term Set Difference Captioning: given two image sets 𝓓A and 𝓓B, output natural language descriptions of concepts that are more often true for 𝓓A than for 𝓓B.
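To make the task concrete, here is a minimal interface sketch; the function name and signature are illustrative placeholders, not part of the paper (D_A/D_B stand for 𝓓A/𝓓B).

```python
from typing import List

def set_difference_captioning(d_a: List[str], d_b: List[str], top_k: int = 5) -> List[str]:
    """Given image paths for two sets D_A and D_B, return up to `top_k`
    natural-language descriptions of concepts that are more often true
    for images in D_A than for those in D_B."""
    raise NotImplementedError  # one possible pipeline is sketched below
```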
We introduce VisDiff, a set difference captioning system that uses a captioning VLM, an LLM, and CLIP to discover and rank differences between image sets ranging from a few dozen to several thousand images. To evaluate set difference captioning, we also introduce VisDiffBench, a benchmark of 187 paired image sets, each with a ground-truth difference description. VisDiffBench consists of distribution shifts from ImageNetR and ImageNet* as well as our new dataset PairedImageSets, which covers 150 diverse real-world differences spanning three difficulty levels.
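Below is a minimal sketch of a proposer-ranker pipeline in the spirit of VisDiff, assuming: a user-supplied proposer (e.g., a captioning VLM plus an LLM prompted with caption samples) generates candidate differences, and a CLIP-based ranker orders them by how well per-image similarity to each candidate separates the two sets, measured with AUROC. The model choice, helper names, and scoring details are assumptions for illustration, not the authors' exact implementation.

```python
import torch
from PIL import Image
from sklearn.metrics import roc_auc_score
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_scores(image_paths, text):
    """Cosine similarity between one candidate description and each image."""
    images = [Image.open(p).convert("RGB") for p in image_paths]
    inputs = processor(text=[text], images=images, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img @ txt.T).squeeze(-1).tolist()

def rank_candidates(candidates, d_a, d_b):
    """Rank candidate differences by how well per-image CLIP similarity
    separates D_A (label 1) from D_B (label 0), measured with AUROC."""
    labels = [1] * len(d_a) + [0] * len(d_b)
    ranked = []
    for text in candidates:
        scores = clip_scores(d_a, text) + clip_scores(d_b, text)
        ranked.append((roc_auc_score(labels, scores), text))
    return sorted(ranked, reverse=True)

def visdiff_sketch(d_a, d_b, propose_differences, top_k=5):
    """End-to-end sketch: `propose_differences` is a user-supplied callable
    (e.g., a VLM that captions sampled images plus an LLM prompted to list
    candidate differences); the CLIP ranker then orders its proposals."""
    candidates = propose_differences(d_a, d_b)
    return rank_candidates(candidates, d_a, d_b)[:top_k]
```

Ranking by AUROC rather than the raw gap in mean similarity makes the ordering insensitive to how large CLIP's similarity values happen to be for a given phrasing; the paper's actual proposer and ranker prompts and models may differ.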