
R1-V: Reinforcing Super Generalization Ability in Vision Language Models with Less Than $3

News: We have released new VLM-RL environments, a training codebase, and the research paper G1: Bootstrapping Perception and Reasoning Abilities of Vision-Language Model via Reinforcement Learning. Check it out!


Roadmap for R1-V

We are building a general framework for RLVR (reinforcement learning with verifiable rewards) in VLMs. We believe in the power of working in the trenches and in long-termism.

Our Interest: General Vision-Language Intelligence & Visual/GUI Agent

Our Goal: 🔄 Algorithm Enhancement ⚡ Efficiency Optimization 🎯 Task Diversity 🌲 Impactful Open Source Research.

Ideas and contributions are welcome. Stay tuned!

Blogs:

🎯 RLVR in Vision Language Models: Findings, Questions and Directions

Resources:

🤗 R1V Training Dataset: CLEVR-70k-Counting

🤗 R1V Training Dataset: CLEVR-70k-Complex

🤗 R1V Training Dataset: GEOQA-8k

🤗 R1-Distilled Visual Reasoning Dataset

R1-V Team:

Liang Chen · Lei Li · Haozhe Zhao · Yifan Song · Vinci · Zihao Yue

Contributors:


Updates

For contributors


Note: In our later experiments, we found that letting the 2B base model output the result directly, instead of following the <think></think><answer></answer> format, leads to a much higher score (86%) on SuperCLEVR. This suggests that enforcing Chain-of-Thought reasoning may not only be unnecessary but potentially detrimental to the 2B model's performance.
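For concreteness, here is a minimal sketch (not the project's actual training or evaluation code) of how a tagged Chain-of-Thought response differs from a direct answer, and how the tagged format can be parsed:

import re

# Response in the <think></think><answer></answer> format used during GRPO training
cot_response = "<think>There are 3 red cubes and 2 blue spheres, 3 + 2 = 5.</think><answer>5</answer>"

# Direct-answer response (the variant that scored 86% on SuperCLEVR with the 2B base model)
direct_response = "5"

def parse_tagged_answer(text: str):
    """Return the content of <answer>...</answer> if the response follows the tagged format, else None."""
    match = re.search(r"<think>.*?</think>\s*<answer>(.*?)</answer>", text, re.DOTALL)
    return match.group(1).strip() if match else None

print(parse_tagged_answer(cot_response))    # -> "5"
print(parse_tagged_answer(direct_response)) # -> None (direct answers skip the tags)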


Setup

conda create -n r1-v python=3.11
conda activate r1-v

bash setup.sh

Note

If you run into a bug when running the script, first try aligning your environment with ./src/requirements.txt (for example, pip install -r ./src/requirements.txt).

Supported Models

  1. Qwen2-VL
  2. Qwen2.5-VL
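Both families load through the standard Hugging Face transformers interface. Below is a minimal sketch for the 2B Qwen2-VL checkpoint, assuming a transformers version recent enough to include the Qwen2-VL classes (Qwen2.5-VL loads analogously via Qwen2_5_VLForConditionalGeneration):

from transformers import Qwen2VLForConditionalGeneration, AutoProcessor

model_id = "Qwen/Qwen2-VL-2B-Instruct"

# Load the model and its processor; device_map="auto" places weights on available GPUs
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)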

Supported Training Datasets

  1. 🤗 R1V Training Dataset: CLEVR-70k-Counting: Item Counting Problems
  2. 🤗 R1V Training Dataset: CLEVR-70k-Complex: Number Related Reasoning
  3. 🤗 R1V Training Dataset: GEOQA-8k: Geometry Reasoning
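Before training, the datasets can be inspected with the Hugging Face datasets library. The snippet below is a minimal sketch; the split and column names are assumptions and should be checked against the dataset cards:

from datasets import load_dataset

# Load the counting dataset used in the GRPO example below (downloads from the Hugging Face Hub)
ds = load_dataset("leonardPKU/clevr_cogen_a_train", split="train")

print(ds)            # dataset size and columns
print(ds[0].keys())  # inspect the actual field names before wiring up training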

Supported Evaluations

  1. SuperClevr-200: Item Counting Problems
  2. GeoQA-Test-Direct-Answer-735: Geometry Reasoning

Training

GRPO

cd src/r1-v

export DEBUG_MODE="true" # Enable debug output if you want to see the model rollouts during RL
export LOG_PATH="./debug_log_2b.txt"

torchrun --nproc_per_node="8" \
    --nnodes="1" \
    --node_rank="0" \
    --master_addr="127.0.0.1" \
    --master_port="12345" \
    src/open_r1/grpo.py \
    --output_dir <OUTPUT_DIR> \
    --model_name_or_path <PATH-TO-MODEL> \
    --dataset_name leonardPKU/clevr_cogen_a_train \
    --deepspeed local_scripts/zero3.json \
    --max_prompt_length 512 \
    --max_completion_length 512 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 2 \
    --logging_steps 1 \
    --bf16 \
    --report_to wandb \
    --gradient_checkpointing false \
    --attn_implementation flash_attention_2 \
    --max_pixels 401408 \
    --num_train_epochs 2 \
    --run_name Qwen2-VL-2B-GRPO-CLEVR-70k \
    --save_steps 100 \
    --save_only_model true \
    --num_generations 8   # number of rollouts G in GRPO; reducing it speeds up training and lowers memory cost, at the price of higher variance

Replace <OUTPUT_DIR> and <PATH-TO-MODEL> with your output directory and the local path or Hub ID of the base model (e.g. Qwen2-VL-2B-Instruct).

Note

  1. To reproduce the results, keep per_device_train_batch_size at 1 for now, as there is a known bug with batched training. See the reproduction report here. We realize this matters for efficiency and are working with the community to solve it.
  2. If you hit an OOM error, try reducing --num_generations.
  3. To speed up training with vLLM, please refer to this script.
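For intuition, GRPO samples num_generations completions per prompt, scores each with a verifiable reward, and computes advantages relative to the group. The sketch below is a simplified illustration of such an accuracy reward and the group-relative advantage, not the exact code in src/open_r1/grpo.py:

import re

def accuracy_reward(completion: str, ground_truth: str) -> float:
    """Toy verifiable reward: 1.0 if the content of <answer>...</answer> matches the label, else 0.0."""
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0

# One prompt, a group of G = num_generations sampled completions
completions = [
    "<think>3 cubes + 2 spheres = 5</think><answer>5</answer>",
    "<think>I count 4 objects</think><answer>4</answer>",
]
rewards = [accuracy_reward(c, "5") for c in completions]
mean = sum(rewards) / len(rewards)

# Group-relative advantage used by GRPO: reward minus the group mean (optionally divided by the std)
advantages = [r - mean for r in rewards]
print(rewards, advantages)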

SFT

We also provide SFT code. Please follow the script below and edit the config to customize the SFT task.

accelerate launch --config_file src/r1-v/configs/zero2.yaml src/r1-v/src/open_r1/sft.py --config src/r1-v/configs/qwen2vl_sft_config.yaml

Evaluation

SuperCLEVR


We provide an example script that evaluates out-of-distribution (OOD) counting performance on a subset of SuperCLEVR within 1 minute. You can also modify the script and dataset to test on your own data.

cd ./src/eval
wget https://www.cs.jhu.edu/~zhuowan/zhuowan/SuperCLEVR/to_be_released/images.zip
unzip images.zip

Change the model path in the script, then run:

python test_qwen2vl_counting_superclevr.py

Tested scores:

Qwen2VL-2B-Instruct: 48.0%

Qwen2VL-2B-Instruct-GRPO-100step: 82.5%
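At its core, this evaluation extracts a number from each model response and compares it with the ground-truth count. Below is a minimal sketch of such a check; the actual test_qwen2vl_counting_superclevr.py also handles prompting, image loading, and batching, and may parse answers differently:

import re

def extract_count(response: str):
    """Pull the last integer from a model response; counting answers are expected to be numbers."""
    numbers = re.findall(r"-?\d+", response)
    return int(numbers[-1]) if numbers else None

def counting_accuracy(responses, labels):
    correct = sum(extract_count(r) == label for r, label in zip(responses, labels))
    return correct / len(labels)

# Hypothetical example
print(counting_accuracy(["There are 7 objects.", "<answer>4</answer>"], [7, 5]))  # -> 0.5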

GEOQA


We provide an example script to evaluate on the GEOQA test set (direct-answer form).

Prepare the images for testing:

cd ./src/eval
git lfs install
git clone https://huggingface.co/datasets/Luckyjhg/Geo170K
cd Geo170K
unzip images.zip

Evaluation Script

python test_qwen2vl_geoqa.py

Tested scores:

Qwen2VL-7B-Instruct: 30.63%

Qwen2VL-7B-Instruct-GRPO-2epochs: 38.72%

Qwen2.5VL-3B-Instruct: 35.41%

Qwen2.5VL-3B-Instruct-GRPO-1epochs: 47.48%

To enable faster inference with multiple GPUs, you can also use the script at R1-V/src/scripts/test_grpo_geoqa_multigpu.sh:

bash src/scripts/test_grpo_geoqa_multigpu.sh
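One common way to parallelize such an evaluation is to shard the test set across GPUs and merge the per-shard predictions before scoring; the sketch below only illustrates this sharding idea and is not the actual script:

def shard(items, num_shards, shard_id):
    """Round-robin slice of the test items assigned to one GPU/process."""
    return items[shard_id::num_shards]

# Hypothetical illustration: 735 GeoQA direct-answer questions split across 8 GPUs;
# each worker runs inference on its slice, and the per-shard results are merged for scoring.
test_items = list(range(735))
shards = [shard(test_items, 8, gpu_id) for gpu_id in range(8)]
print([len(s) for s in shards])  # shard sizes sum back to 735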

Acknowledgements

We sincerely thank DeepSeek, Open-R1, QwenVL, Open-R1-Multimodal (our initial codebase), CLEVR, SuperCLEVR, and G-LLAVA for providing the open-source resources used to build this project. Special thanks to Kimi and bAInance Labs for providing computational resources, and to Yuxin Wu, Xinyu Zhou, and Baobao Chang for their valuable advice.


Citation

@misc{chen2025r1v,
  author       = {Chen, Liang and Li, Lei and Zhao, Haozhe and Song, Yifan and Vinci},
  title        = {R1-V: Reinforcing Super Generalization Ability in Vision-Language Models with Less Than $3},
  howpublished = {\url{https://github.com/Deep-Agent/R1-V}},
  note         = {Accessed: 2025-02-02},
  year         = {2025}
}