
R1-V: Reinforcing Super Generalization Ability in Vision Language Models with Less Than $3

News: We have released new VLM-RL environments, a training codebase, and the research paper G1: Bootstrapping Perception and Reasoning Abilities of Vision-Language Model via Reinforcement Learning. Check it out!


Roadmap for R1-V

We are building a general framework for RLVR (reinforcement learning with verifiable rewards) in VLMs. We believe in the power of working in the trenches and in long-termism.

Our Interest: General Vision-Language Intelligence & Visual/GUI Agent

Our Goal: 🔄 Algorithm Enhancement ⚡ Efficiency Optimization 🎯 Task Diversity 🌲 Impactful Open Source Research.

Ideas and contributions are welcome. Stay tuned!

Blogs:

🎯 RLVR in Vision Language Models: Findings, Questions and Directions

Resources:

🤗 R1V Training Dataset: CLEVR-70k-Counting

🤗 R1V Training Dataset: CLEVR-70k-Complex

🤗 R1V Training Dataset: GEOQA-8k

🤗 R1-Distilled Visual Reasoning Dataset

R1-V Team:

Liang Chen · Lei Li · Haozhe Zhao · Yifan Song · Vinci · Zihao Yue

Contributors:


Updates

For contributors


Note: In our later experiments, we found that letting the 2B base model output the result directly, instead of following the <think></think><answer></answer> format, leads to a much higher score (86%) on SuperCLEVR. This suggests that enforcing Chain-of-Thought reasoning may not only be unnecessary but potentially detrimental to the 2B model's performance.
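For concreteness, here is a minimal sketch (not the project's actual training or evaluation code) of how a tagged Chain-of-Thought response differs from a direct answer, and how the tagged format can be parsed:

import re

# Response in the <think></think><answer></answer> format used during GRPO training
cot_response = "<think>There are 3 red cubes and 2 blue spheres, 3 + 2 = 5.</think><answer>5</answer>"

# Direct-answer response (the variant that scored 86% on SuperCLEVR with the 2B base model)
direct_response = "5"

def parse_tagged_answer(text: str):
    """Return the content of <answer>...</answer> if the response follows the tagged format, else None."""
    match = re.search(r"<think>.*?</think>\s*<answer>(.*?)</answer>", text, re.DOTALL)
    return match.group(1).strip() if match else None

print(parse_tagged_answer(cot_response))    # -> "5"
print(parse_tagged_answer(direct_response)) # -> None (direct answers skip the tags)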


Setup

conda create -n r1-v python=3.11
conda activate r1-v

bash setup.sh

Note

If you run into a bug when running the script, first try aligning your environment with ./src/requirements.txt (for example, pip install -r ./src/requirements.txt).

Supported Models

  1. Qwen2-VL
  2. Qwen2.5-VL
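Both families load through the standard Hugging Face transformers interface. Below is a minimal sketch for the 2B Qwen2-VL checkpoint, assuming a transformers version recent enough to include the Qwen2-VL classes (Qwen2.5-VL loads analogously via Qwen2_5_VLForConditionalGeneration):

from transformers import Qwen2VLForConditionalGeneration, AutoProcessor

model_id = "Qwen/Qwen2-VL-2B-Instruct"

# Load the model and its processor; device_map="auto" places weights on available GPUs
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)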

Supported Training Datasets

  1. 🤗 R1V Training Dataset: CLEVR-70k-Counting: Item Counting Problems
  2. 🤗 R1V Training Dataset: CLEVR-70k-Complex: Number Related Reasoning
  3. 🤗 R1V Training Dataset: GEOQA-8k: Geometry Reasoning
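Before training, the datasets can be inspected with the Hugging Face datasets library. The snippet below is a minimal sketch; the split and column names are assumptions and should be checked against the dataset cards:

from datasets import load_dataset

# Load the counting dataset used in the GRPO example below (downloads from the Hugging Face Hub)
ds = load_dataset("leonardPKU/clevr_cogen_a_train", split="train")

print(ds)            # dataset size and columns
print(ds[0].keys())  # inspect the actual field names before wiring up training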

Supported Evaluations

  1. SuperClevr-200: Item Counting Problems
  2. GeoQA-Test-Direct-Answer-735: Geometry Reasoning

Training

GRPO

cd src/r1-v

export DEBUG_MODE="true" # Enable debug output if you want to see the model rollouts during RL
export LOG_PATH="./debug_log_2b.txt"

torchrun --nproc_per_node="8" \
    --nnodes="1" \
    --node_rank="0" \
    --master_addr="127.0.0.1" \
    --master_port="12345" \
    src/open_r1/grpo.py \
    --output_dir <OUTPUT_DIR> \
    --model_name_or_path <PATH-TO-MODEL> \
    --dataset_name leonardPKU/clevr_cogen_a_train \
    --deepspeed local_scripts/zero3.json \
    --max_prompt_length 512 \
    --max_completion_length 512 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 2 \
    --logging_steps 1 \
    --bf16 \
    --report_to wandb \
    --gradient_checkpointing false \
    --attn_implementation flash_attention_2 \
    --max_pixels 401408 \
    --num_train_epochs 2 \
    --run_name Qwen2-VL-2B-GRPO-CLEVR-70k \
    --save_steps 100 \
    --save_only_model true \
    --num_generations 8   # number of rollouts G in GRPO; reducing it speeds up training and lowers memory cost, at the price of higher variance

Replace <OUTPUT_DIR> and <PATH-TO-MODEL> with your output directory and the local path or Hub ID of the base model (e.g. Qwen2-VL-2B-Instruct).

Note

  1. To reproduce the results, keep per_device_train_batch_size at 1 for now, as there is a known bug with batched training. See the reproduction report here. We realize this matters for efficiency and are working with the community to solve it.
  2. If you hit an OOM error, try reducing --num_generations.
  3. To speed up training with vLLM, please refer to this script.
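For intuition, GRPO samples num_generations completions per prompt, scores each with a verifiable reward, and computes advantages relative to the group. The sketch below is a simplified illustration of such an accuracy reward and the group-relative advantage, not the exact code in src/open_r1/grpo.py:

import re

def accuracy_reward(completion: str, ground_truth: str) -> float:
    """Toy verifiable reward: 1.0 if the content of <answer>...</answer> matches the label, else 0.0."""
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0

# One prompt, a group of G = num_generations sampled completions
completions = [
    "<think>3 cubes + 2 spheres = 5</think><answer>5</answer>",
    "<think>I count 4 objects</think><answer>4</answer>",
]
rewards = [accuracy_reward(c, "5") for c in completions]
mean = sum(rewards) / len(rewards)

# Group-relative advantage used by GRPO: reward minus the group mean (optionally divided by the std)
advantages = [r - mean for r in rewards]
print(rewards, advantages)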

SFT

We also provide SFT code. Please follow the script below and edit the config to customize the SFT task.

accelerate launch --config_file src/r1-v/configs/zero2.yaml src/r1-v/src/open_r1/sft.py --config src/r1-v/configs/qwen2vl_sft_config.yaml

Evaluation

SuperCLEVR


We provide an example script that evaluates out-of-distribution (OOD) counting performance on a subset of SuperCLEVR within 1 minute. You can also modify the script and dataset to test on your own data.

cd ./src/eval
wget https://www.cs.jhu.edu/~zhuowan/zhuowan/SuperCLEVR/to_be_released/images.zip
unzip images.zip

Change the model path in the script, then run:

python test_qwen2vl_counting_superclevr.py

Tested scores:

Qwen2VL-2B-Instruct: 48.0%

Qwen2VL-2B-Instruct-GRPO-100step: 82.5%
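At its core, this evaluation extracts a number from each model response and compares it with the ground-truth count. Below is a minimal sketch of such a check; the actual test_qwen2vl_counting_superclevr.py also handles prompting, image loading, and batching, and may parse answers differently:

import re

def extract_count(response: str):
    """Pull the last integer from a model response; counting answers are expected to be numbers."""
    numbers = re.findall(r"-?\d+", response)
    return int(numbers[-1]) if numbers else None

def counting_accuracy(responses, labels):
    correct = sum(extract_count(r) == label for r, label in zip(responses, labels))
    return correct / len(labels)

# Hypothetical example
print(counting_accuracy(["There are 7 objects.", "<answer>4</answer>"], [7, 5]))  # -> 0.5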

GEOQA


We provide an example script to evaluate on the GEOQA test set (direct-answer form).

Prepare the images for testing:

cd ./src/eval
git lfs install
git clone https://huggingface.co/datasets/Luckyjhg/Geo170K
cd Geo170K
unzip images.zip

Evaluation Script

python test_qwen2vl_geoqa.py

Tested scores:

Qwen2VL-7B-Instruct: 30.63%

Qwen2VL-7B-Instruct-GRPO-2epochs: 38.72%

Qwen2.5VL-3B-Instruct: 35.41%

Qwen2.5VL-3B-Instruct-GRPO-1epochs: 47.48%

To enable faster inference with multiple GPUs, you can also use the script at R1-V/src/scripts/test_grpo_geoqa_multigpu.sh:

bash src/scripts/test_grpo_geoqa_multigpu.sh
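One common way to parallelize such an evaluation is to shard the test set across GPUs and merge the per-shard predictions before scoring; the sketch below only illustrates this sharding idea and is not the actual script:

def shard(items, num_shards, shard_id):
    """Round-robin slice of the test items assigned to one GPU/process."""
    return items[shard_id::num_shards]

# Hypothetical illustration: 735 GeoQA direct-answer questions split across 8 GPUs;
# each worker runs inference on its slice, and the per-shard results are merged for scoring.
test_items = list(range(735))
shards = [shard(test_items, 8, gpu_id) for gpu_id in range(8)]
print([len(s) for s in shards])  # shard sizes sum back to 735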

Acknowledgements

We sincerely thank DeepSeek, Open-R1, QwenVL, Open-R1-Multimodal (our initial codebase), CLEVR, SuperCLEVR, and G-LLAVA for providing the open-source resources used to build this project. Special thanks to Kimi and bAInance Labs for providing computational resources, and to Yuxin Wu, Xinyu Zhou, and Baobao Chang for their valuable advice.


Citation

@misc{chen2025r1v,
  author       = {Chen, Liang and Li, Lei and Zhao, Haozhe and Song, Yifan and Vinci},
  title        = {R1-V: Reinforcing Super Generalization Ability in Vision-Language Models with Less Than $3},
  howpublished = {\url{https://github.com/Deep-Agent/R1-V}},
  note         = {Accessed: 2025-02-02},
  year         = {2025}
}