om-ai-lab/VLM-R1: Solve Visual Understanding with Reinforced VLMs

VLM-R1: A stable and generalizable R1-style Large Vision-Language Model

πŸŽ‰ Our VLM-R1 Math model reaches the top of the Open-Compass Math Leaderboard (under 4B parameters) and OVD model achieves the state-of-the-art performance on OVDEval.

Since the introduction of Deepseek-R1, numerous works have emerged focusing on reproducing and improving upon it. In this project, we propose VLM-R1, a stable and generalizable R1-style Large Vision-Language Model.

Specifically, for the task of Referring Expression Comprehension (REC), we trained Qwen2.5-VL with both the R1 and SFT approaches. The results reveal that, on the in-domain test data, the SFT model's performance changes little from the base model when the number of training steps is relatively small (100–600 steps), while the R1 model shows a steady improvement (as shown on the left of the figure below). More importantly, on the out-of-domain test data, the SFT model's performance deteriorates slightly as the number of steps increases, whereas the RL model generalizes its reasoning ability to the out-of-domain data (as shown on the right of the figure below).

[Figure: in-domain (left) and out-of-domain (right) performance of the SFT and R1 models over training steps]

* We found that previous REC SFT experiments used a mismatched pixel config. We therefore re-ran the study with the correct config on more complex out-of-domain data. See our findings for details.

πŸš€ Features

This repository supports:

πŸ—žοΈ Update

πŸ€– Models

| Version | Base VLM | Checkpoint | Task Type |
| --- | --- | --- | --- |
| VLM-R1-Qwen2.5VL-3B-OVD-0321 | Qwen2.5VL-3B | omlab/VLM-R1-Qwen2.5VL-3B-OVD-0321 | Open-Vocabulary Detection |
| VLM-R1-Qwen2.5VL-3B-Math-0305 | Qwen2.5VL-3B | omlab/VLM-R1-Qwen2.5VL-3B-Math-0305 | Multi-Modal Math |
| VLM-R1-Qwen2.5VL-3B-REC-500steps | Qwen2.5VL-3B | omlab/Qwen2.5VL-3B-VLM-R1-REC-500steps | REC/Reasoning-Grounding |
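
The released checkpoints load like standard Qwen2.5-VL models. Below is a minimal, unofficial sketch assuming a recent transformers version with Qwen2.5-VL support; the image path and prompt are placeholders rather than part of this repository.

# Minimal sketch: run a released VLM-R1 checkpoint with Hugging Face transformers.
# Assumes transformers ships Qwen2.5-VL support; "example.jpg" and the prompt are placeholders.
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "omlab/VLM-R1-Qwen2.5VL-3B-OVD-0321"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("example.jpg")
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Please detect all objects in the image."},
    ]}
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=256)
answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)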

🎯 ToDo

πŸ› οΈ Setup

conda create -n vlm-r1 python=3.10
conda activate vlm-r1
bash setup.sh

πŸ’ͺ🏻 Training

Referring Expression Comprehension (REC)

πŸ“š GRPO

  1. Download the COCO Train2014 images and unzip them; we refer to the image directory as <your_image_root>.
  2. Download the RefCOCO/+/g and LISA-Grounding annotation files and unzip them (LISA-Grounding is used for out-of-domain evaluation).
  3. Change data_paths and image_folders in the run_scripts/run_grpo_rec.sh file.

These jsonl files are included in the annotation files downloaded in step 2.

Note: please use jsonl files instead of json files.
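
If your annotations are stored as a single json array instead, a small conversion script like the sketch below (file names are hypothetical) produces the required jsonl:

# Sketch: convert a json file holding a list of records into jsonl (one record per line).
# "refcoco_train.json" and "refcoco_train.jsonl" are hypothetical file names.
import json

with open("refcoco_train.json", "r", encoding="utf-8") as f:
    records = json.load(f)  # expects a top-level list of records

with open("refcoco_train.jsonl", "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")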

data_paths="path/to/refcoco_train.jsonl:path/to/refcocop_train.jsonl:path/to/refcocog_train.jsonl"
image_folders="path/to/coco:path/to/coco:path/to/coco"

  4. bash run_scripts/run_grpo_rec.sh

Note

If you encounter a 'CUDA out of memory' error, try reducing per_device_train_batch_size.

πŸ“š Multi-Node GRPO

For multi-node training, please refer to multinode_training_demo.sh.

πŸ“š SFT

We use LLaMA-Factory to train the SFT model.

  1. Clone the LLaMA-Factory repository and install the dependencies.

git clone https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e ".[torch,metrics]"

  2. Download the dataset_info.json, mllm_rec_json.json, and qwen2_5_vl_full_sft.yaml we provided here. Put the json files in the LLaMA-Factory/data directory and the yaml file in the LLaMA-Factory/examples/train_full directory.
  3. Run the following command to train the SFT model.

llamafactory-cli train examples/train_full/qwen2_5_vl_full_sft.yaml

For your own data

We support loading jsonl data in the format below in src/open-r1-multimodal/src/open_r1/grpo_jsonl.py. Please note that you may need to use different reward functions for your specialized tasks. PRs adding your own reward functions or sharing other interesting findings are welcome!

The jsonl file has the following format:

{ "id": 1, "image": "Clevr_CoGenT_TrainA_R1/data/images/CLEVR_trainA_000001_16885.png", "conversations": [ {"from": "human", "value": "What number of purple metallic balls are there?"}, {"from": "gpt", "value": "0"} ] }

If you want to use multi-image input, you can use the following format:

{ "id": 1, "image": ["Clevr_CoGenT_TrainA_R1/data/images/CLEVR_trainA_000001_16885.png", "Clevr_CoGenT_TrainA_R1/data/images/CLEVR_trainA_000001_16886.png"], "conversations": [ {"from": "human", "value": "What number of purple metallic balls in total within the two images?"}, {"from": "gpt", "value": "3"} ] }

Note

The image path in the jsonl file should be relative to the image folder specified in --image_folders; the absolute path of the input image is constructed as os.path.join(image_folder, data['image']).
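
For example, a minimal sketch of this resolution with a hypothetical image folder (the relative path is taken from the record above):

# Sketch: how a record's relative image path resolves against --image_folders.
import os

image_folder = "/path/to/images1/"  # one entry from --image_folders (placeholder)
relative_path = "Clevr_CoGenT_TrainA_R1/data/images/CLEVR_trainA_000001_16885.png"  # data['image']

absolute_path = os.path.join(image_folder, relative_path)
print(absolute_path)
# /path/to/images1/Clevr_CoGenT_TrainA_R1/data/images/CLEVR_trainA_000001_16885.png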

Multiple data files and image folders can be specified using ":" as a separator:

--data_file_paths /path/to/data1.jsonl:/path/to/data2.jsonl
--image_folders /path/to/images1/:/path/to/images2/
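
The two colon-separated lists are presumably paired entry by entry (the examples above use one image folder per data file). The sketch below illustrates that pairing with placeholder paths; it is not the project's actual loading code.

# Sketch: pairing colon-separated data files with their image folders (placeholder paths).
data_file_paths = "/path/to/data1.jsonl:/path/to/data2.jsonl"
image_folders = "/path/to/images1/:/path/to/images2/"

data_files = data_file_paths.split(":")
folders = image_folders.split(":")
assert len(data_files) == len(folders), "each data file needs a matching image folder"

for jsonl_path, image_folder in zip(data_files, folders):
    print(f"{jsonl_path} -> images under {image_folder}")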

The script can be run like this (see run_grpo_rec.sh for a complete example):

torchrun --nproc_per_node="8" \
  --nnodes="1" \
  --node_rank="0" \
  --master_addr="127.0.0.1" \
  --master_port="12345" \
  src/open_r1/grpo_jsonl.py \
  --output_dir output/$RUN_NAME \
  --model_name_or_path Qwen/Qwen2.5-VL-3B-Instruct \
  --deepspeed ${REPO_HOME}/src/open-r1-multimodal/local_scripts/zero3.json \
  --data_file_paths /path/to/your/data.jsonl \
  --image_folders /path/to/your/image/folder \
  ...

Multi-image Input

We provide an example multi-image script, run_grpo_gui.sh. This task, from GUI-Testing-Arena, requires the model to analyze two GUI screenshots, taken before and after a user action, and determine whether any UI interaction defects are present. Download the images and unzip them into /path/to/images/, then modify the image_folders parameter in the script and run it.

bash run_scripts/run_grpo_gui.sh

πŸ“Š Evaluation


  1. Download the provided LISA-Grounding images.

cd ./src/eval

Remember to change the model path, image root, and annotation path in the scripts.

# 'X' is the number of GPUs you have
torchrun --nproc_per_node=X test_rec_r1.py        # for GRPO
torchrun --nproc_per_node=X test_rec_baseline.py  # for SFT

🀝 Acknowledgements

We would like to express our sincere gratitude to DeepSeek, Open-R1, QwenVL, Open-R1-Multimodal, R1-V, RefCOCO, RefGTA, LLaMA-Factory, OVDEval, GUI-Testing-Arena, and LISA for providing open-source resources that contributed to the development of this project.

⭐️ Citation

If you find this project useful, please consider citing us.

@article{shen2025vlm,
  title={Vlm-r1: A stable and generalizable r1-style large vision-language model},
  author={Shen, Haozhan and Liu, Peng and Li, Jingcheng and Fang, Chunxin and Ma, Yibo and Liao, Jiajia and Shen, Qiaoli and Zhang, Zilun and Zhao, Kangjia and Zhang, Qianqian and Xu, Ruochen and Zhao, Tiancheng},
  journal={arXiv preprint arXiv:2504.07615},
  year={2025}
}