
CounterCurate [ACL 2024 Findings]

This is the implementation of CounterCurate, the data curation pipeline for both physical and semantic counterfactual image-caption pairs.

[Paper][Project Page][HuggingFace]

In our paper, we used Flickr30k-Images and the Visual Genome dataset. If you want to fully test out CounterCurate, you must download these two datasets from their respective authors into the datasets folder.

[Flickr30k-Images][VisualGenome]

We further employ the following GitHub repositories; please clone them into the datasets folder for demo purposes.

[Flickr30k-Entities][PointQA][SugarCrepe]

For further training and testing, please use the following:

[OpenCLIP][LLaVA][GLIGEN]

For all the code we provide, you can find a few lines of editable parameters right after the import statements. Please adjust them to match your environment.

Datasets

Please clone our HuggingFace repository, which contains our generated training data (train_data.tar.gz), sample data that can be used to train CLIP (clip_sample_data/*.csv) and LLaVA (llava_sample_data/*.json), and the GPT-4V-generated prompts that aided us in the data curation process (gpt4v_*.json).
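As a quick orientation, here is a minimal sketch that unpacks train_data.tar.gz and lists the sample training files. It assumes the HuggingFace repository has already been cloned; the local path datasets/CounterCurate is hypothetical, so adjust it to wherever you cloned the repository.

```python
import glob
import tarfile
from pathlib import Path

# Hypothetical local path to the cloned HuggingFace repository; adjust to your setup.
hf_repo = Path("datasets/CounterCurate")

# Unpack the generated training data inside the cloned repository (extraction target is an assumption).
with tarfile.open(hf_repo / "train_data.tar.gz") as tar:
    tar.extractall(hf_repo)

# List the sample files for CLIP (CSV), LLaVA (JSON), and the GPT-4V prompts.
print(sorted(glob.glob(str(hf_repo / "clip_sample_data" / "*.csv"))))
print(sorted(glob.glob(str(hf_repo / "llava_sample_data" / "*.json"))))
print(sorted(glob.glob(str(hf_repo / "gpt4v_*.json"))))
```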

Note that the data in HuggingFace does not contain any original images from Flickr30k-Images. You must first run

python datasets/prepare_training_data.py

to copy all the original images into their respective locations in the train_data folder. After this step, you can begin training and testing. All other files in the datasets folder are used to curate the generated data.
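After the script finishes, a quick sanity check such as the following confirms that the images were copied. The train_data location and the image extensions are assumptions; adjust them to your layout.

```python
from pathlib import Path

# Assumed location of the prepared training data; adjust to your setup.
train_data = Path("datasets/train_data")

# Count copied images by common extensions (assumption: images are stored as JPEG/PNG).
image_paths = [p for p in train_data.rglob("*") if p.suffix.lower() in {".jpg", ".jpeg", ".png"}]
print(f"Found {len(image_paths)} images under {train_data}")
```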

Benchmarking

Training

Grouping

In the paper, we used grouping to place the positive and negative image-caption pairs of the same index within the same batch, which improves both training efficiency and accuracy when training CLIP models. Here, we provide the two versions of grouping we used. The folder train/grouping implements the general grouping strategy for positive/negative image-caption pairs, i.e., when there is exactly one positive and one negative per item; the folder train/grouping_attr is specifically tailored for training on Flickr30k-Attributes, where each positive (original) image-caption pair has 2-3 negative counterfactuals (noun, adjective, reverse). Both folders contain a data.py and a train.py. To reproduce, please replace the files under your OpenCLIP repository open_clip/src/training/.../py accordingly.
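The snippet below is a minimal sketch of the general grouping idea, not the actual data.py/train.py: it keeps each positive pair and its counterfactual negative at adjacent indices inside the same batch, so a contrastive loss over the batch always sees the hard negative alongside its positive. All names are illustrative; for the Flickr30k-Attributes variant, the same idea generalizes to 2-3 negatives per positive.

```python
import random

def grouped_batches(items, batch_size, seed=0):
    """Yield batches where each item's positive and negative pair stay together.

    `items` is a list of dicts with keys "pos" and "neg", each an (image, caption) pair.
    Within a batch, index 2*i holds the positive of item i and 2*i + 1 its negative.
    """
    assert batch_size % 2 == 0, "batch size must be even to hold pos/neg pairs together"
    rng = random.Random(seed)
    order = list(range(len(items)))
    rng.shuffle(order)

    pairs_per_batch = batch_size // 2
    for start in range(0, len(order), pairs_per_batch):
        batch = []
        for idx in order[start:start + pairs_per_batch]:
            batch.append(items[idx]["pos"])   # positive (original) image-caption pair
            batch.append(items[idx]["neg"])   # its counterfactual negative
        yield batch

# Illustrative usage with dummy data.
dummy = [{"pos": (f"img_{i}.jpg", f"caption {i}"),
          "neg": (f"img_{i}_cf.jpg", f"counterfactual caption {i}")} for i in range(8)]
for batch in grouped_batches(dummy, batch_size=4):
    print(batch)
```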

Sample Data

Note that the sample data files use hard-coded relative paths (such as ../../train_data). This is due to the testing environment on which we evaluated CounterCurate. Please feel free to alter the file paths.
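If you need to point the sample data at a different location, a small rewrite such as the sketch below can replace the path prefix in the CLIP CSV files. The ../../train_data prefix, the clip_sample_data working directory, and the CSV layout are assumptions; check the actual files before running it.

```python
import csv
from pathlib import Path

old_prefix = "../../train_data"        # prefix used in the released sample data (assumption)
new_prefix = "/path/to/train_data"     # wherever your train_data folder actually lives

# Assumes this is run from inside the cloned HuggingFace repository.
for csv_path in Path("clip_sample_data").glob("*.csv"):
    with open(csv_path, newline="") as f:
        rows = list(csv.reader(f))
    # Rewrite any cell that starts with the old prefix; leave everything else untouched.
    rows = [[cell.replace(old_prefix, new_prefix, 1) if cell.startswith(old_prefix) else cell
             for cell in row] for row in rows]
    with open(csv_path, "w", newline="") as f:
        csv.writer(f).writerows(rows)
```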

Citation

If you find CounterCurate useful for your research and applications, please cite using this BibTeX:

@article{zhang2024countercurate,
  title={CounterCurate: Enhancing Physical and Semantic Visio-Linguistic Compositional Reasoning via Counterfactual Examples},
  author={Zhang, Jianrui and Cai, Mu and Xie, Tengyang and Lee, Yong Jae},
  journal={Findings of the Association for Computational Linguistics: ACL 2024},
  year={2024}
}

Acknowledgement