GitHub - 2U1/Qwen-VL-Series-Finetune: An open-source implementation for fine-tuning the Qwen-VL series by Alibaba Cloud.

This repository contains a script for training Qwen2-VL, Qwen2.5-VL, Qwen3-VL and Qwen3.5 using only HuggingFace and Liger-Kernel.

Other projects

[Phi3-Vision Finetuning]
[Llama3.2-Vision Finetuning]
[Molmo Finetune]
[Pixtral Finetune]
[SmolVLM Finetune]
[Gemma3 Finetune]

Update

[2026/05/18] 🔥Upgrade to liger_kernel==0.8.0. Liger 0.8.0 adds official patches for qwen3_5 / qwen3_5_moe and ships LigerExperts, a fused MoE expert kernel that auto-accelerates qwen3_vl_moe and qwen3_5_moe under --use_liger_kernel True. The 0.7-era hardcoded fallback that force-disabled Liger for Qwen3.5 in SFT/DPO/GRPO has been removed, and the mm_token_type_ids GRPO wrapper is now skipped automatically on Liger ≥ 0.8.0 (kept as a no-op shim for older installs).
[2026/03/07] 🔥Supports reasoning mode training for Qwen3-VL and Qwen3.5
[2026/03/07] 🔥Supports Qwen3.5 Series.
[2026/03/07] Supports Qwen3-VL classification
[2026/03/07] Update codebase to transformers==5.3.0
[2025/11/28] 🔥Supports video training with DPO and GRPO.
[2025/11/27] 🔥Supports Qwen3-VL-MoE
[2025/11/26] Update support for liger-kernel in Qwen3-VL.
[2025/10/16] 🔥Supports Qwen3-VL(non-moe)
[2025/08/21] Add option to use a 2-layer MLP for classification.
[2025/08/21] Add option to unfreeze only a few layers of the LLM and vision tower.
[2025/08/08] 🔥Monkey patch Qwen2.5-VL's window attention and forward pass to use less memory and run faster.
[2025/07/25] Updated Classification training script.
[2025/05/29] 🔥Supports GRPO training.
[2025/04/16] 🔥Supports DPO training.
[2025/03/04] Add Option for using liger kernel.
[2025/02/18] 🔥Supports mixed-modality dataset with zero3.
[2025/02/05] Fixed code to properly use images.
[2025/02/03] Support Liger-kernel for Qwen2.5-VL.
[2025/02/03] 🔥Supports Qwen2.5-VL.
[2025/01/24] Add option for using DoRA.
[2025/01/24] Fix error in LoRA training.
[2025/01/18] 🔥Supports mixed-modality data.
[2024/09/12] 🔥Now the model is trained using Liger-Kernel.
[2024/09/11] Supports setting different learning rates to projector and vision model.
[2024/09/11] 🔥Supports multi-image and video training.

Fine-tuning Qwen-VL Series

Warning

Read Training Notes before running any training script. It contains required settings and compatibility notes for Qwen3.5, QLoRA + vision, QLoRA + liger, DeepSpeed, and video training.

Supported Features

Deepspeed
LoRA/QLoRA
Full-finetuning
Enable finetuning vision_model while using LoRA
Unfreeze only the top-k layers
Disable/enable Flash Attention 2
Multi-image and video training
Training optimized with liger kernel
Mixed-modality dataset
Direct Preference Optimization (DPO)
Group Relative Policy Optimization (GRPO)

Docker

To simplify environment setup for training, you can use the provided pre-built environment.
The setup is done in the conda env named train.

You could find more information about the image here.

docker pull john119/vlm
docker run --gpus all -it -v /host/path:/docker/path --name vlm --ipc=host john119/vlm /bin/bash

Installation

Environments

Ubuntu 22.04
Nvidia-Driver 550.120
Cuda version 12.8

Install the required packages using environment.yaml.

Using `requirements.txt`

pip install -r requirements.txt -f https://download.pytorch.org/whl/cu128
pip install qwen-vl-utils
pip install flash-attn --no-build-isolation

Using `environment.yaml`

conda env create -f environment.yaml
conda activate train
pip install qwen-vl-utils
pip install flash-attn --no-build-isolation

Note: You should install flash-attn after installing the other packages.

Training Notes

Qwen3.5 series: use --disable_flash_attn2 True for now. In local testing, Flash Attention 2 raised CUDA errors while sdpa was stable. This applies to SFT, CLS, DPO, and GRPO.
QLoRA + vision: do not combine quantization (--bits 4 / --bits 8) with vision training (--vision_lora True, --freeze_vision_tower False, or --unfreeze_topk_vision > 0). Use --bits 16 if you want to train vision-related modules.
QLoRA + liger: disable liger when using QLoRA.
DeepSpeed: zero2 is usually faster and often more stable than zero3, but it uses more memory.
Video: do not set fps and nframes at the same time.
Top-k unfreeze: if you use --unfreeze_topk_llm or --unfreeze_topk_vision, keep the corresponding base module frozen first with --freeze_llm True or --freeze_vision_tower True.
Learning rates: vision_model usually works better with a learning rate about 5x to 10x smaller than language_model.
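
The compatibility rules above can be checked before launching a run. The following is a minimal sketch, not part of this repository: `check_training_args` is a hypothetical helper, the dictionary keys mirror the command-line flags with the leading dashes removed, and the `use_liger_kernel` key is an assumption about the flag name.

```python
def check_training_args(args: dict) -> list[str]:
    """Return human-readable problems found in a dict of training flags."""
    problems = []
    quantized = args.get("bits", 16) in (4, 8)

    # Any of these settings means vision-related modules are being trained.
    trains_vision = (
        args.get("vision_lora", False)
        or not args.get("freeze_vision_tower", False)
        or args.get("unfreeze_topk_vision", 0) > 0
    )
    if quantized and trains_vision:
        problems.append("QLoRA cannot be combined with vision training; use --bits 16")

    # Assumption: liger is switched on through a boolean flag with this name.
    if quantized and args.get("use_liger_kernel", False):
        problems.append("disable liger when using QLoRA")

    if args.get("fps") is not None and args.get("nframes") is not None:
        problems.append("set either fps or nframes, not both")

    if args.get("unfreeze_topk_llm", 0) > 0 and not args.get("freeze_llm", False):
        problems.append("--unfreeze_topk_llm requires --freeze_llm True")
    if args.get("unfreeze_topk_vision", 0) > 0 and not args.get("freeze_vision_tower", False):
        problems.append("--unfreeze_topk_vision requires --freeze_vision_tower True")

    return problems


for issue in check_training_args({"bits": 4, "freeze_vision_tower": True, "fps": 1.0, "nframes": 8}):
    print(issue)
```

An empty list means none of the rules listed above were violated; it does not guarantee that the run itself will succeed.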

Dataset Preparation

The script requires a dataset formatted according to the LLaVA specification. The dataset should be a JSON file where each entry contains information about conversations and images. Ensure that the image paths in the dataset match the provided --image_folder.

When using a multi-image dataset, the image tokens should all be <image>, and the image file names should be given as a list.

Please see the examples below and format your data accordingly.

Example for single image dataset

[
  {
    "id": "000000033471",
    "image": "000000033471.jpg",
    "conversations": [
      {
        "from": "human",
        "value": "<image>\nWhat are the colors of the bus in the image?"
      },
      {
        "from": "gpt",
        "value": "The bus in the image is white and red."
      },
      {
        "from": "human",
        "value": "What feature can be seen on the back of the bus?"
      },
      {
        "from": "gpt",
        "value": "The back of the bus features an advertisement."
      },
      {
        "from": "human",
        "value": "Is the bus driving down the street or pulled off to the side?"
      },
      {
        "from": "gpt",
        "value": "The bus is driving down the street, which is crowded with people and other vehicles."
      }
    ]
  }
  ...
]

Example for multi image dataset

[
  {
    "id": "000000033471",
    "image": ["000000033471.jpg", "000000033472.jpg"],
    "conversations": [
      {
        "from": "human",
        "value": "<image>\n<image>\nIs the perspective of the camera different?"
      },
      {
        "from": "gpt",
        "value": "Yes, the perspective of the camera is different."
      }
    ]
  }
  ...
]
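
The multi-image rule (one <image> token per listed file) is easy to verify before training. The sketch below assumes that the tokens appear only in the human turns and that the count must equal the number of image files; `check_image_tokens` is a hypothetical helper, and whether the training script enforces the same check is not confirmed here.

```python
def check_image_tokens(sample: dict) -> bool:
    """True if the number of <image> tokens equals the number of image files."""
    images = sample.get("image", [])
    if isinstance(images, str):  # single-image samples store a plain string
        images = [images]
    n_tokens = sum(
        turn["value"].count("<image>")
        for turn in sample["conversations"]
        if turn["from"] == "human"
    )
    return n_tokens == len(images)


sample = {
    "id": "000000033471",
    "image": ["000000033471.jpg", "000000033472.jpg"],
    "conversations": [
        {"from": "human", "value": "<image>\n<image>\nIs the perspective of the camera different?"},
        {"from": "gpt", "value": "Yes, the perspective of the camera is different."},
    ],
}
print(check_image_tokens(sample))  # True
```

Running this over every entry of the JSON file gives a quick list of samples whose token count and image list disagree.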

Example for video dataset

[ { "id": "sample1", "video": "sample1.mp4", "conversations": [ { "from": "human", "value": "

Example for DPO dataset

[
  {
    "id": "e4cd017f-3854-4019-9cc1-9dacacb5d424",
    "image": "example_0_0.png",
    "prompt": "<image>\nDescribe the key features of the image in great detail.",
    "chosen": "The image features a modern, multi-colored train parked on a train track, occupying a significant portion of the scene. The train has a unique and striking design, with red, white, and blue colors, as well as yellow accents. It appears to be a high-speed train, indicating its advanced engineering and potential for fast travel.\n\nThe train is situated next to a platform, suggesting that it may be at a train station. The overall scene conveys a sense of transportation and modern technology, with the vibrant and sleek train as the main subject.",
    "rejected": "The image features a modern, colorful train parked on a set of train tracks. The train has a striking design, with red, white, and blue colors as well as yellow accents. It appears to be a high-speed train, ready for departure.\n\nThe train spans a significant portion of the image, from the left to the right side. The train tracks can be seen clearly beneath the train, emphasizing its position on the railway line. The overall scene gives off an impression of a contemporary and efficient mode of transportation."
  },
  {
    "id": "5e19e647-e5d3-4bcf-82e9-d262570743ae",
    "image": "example_1_0.png",
    "prompt": "<image>\nIs this bus in the USA?",
    "chosen": "Yes, based on the image, it can be assumed that this bus is in the USA. The location of the bus cannot be accurately determined.",
    "rejected": "No, it's not in the USA. The image does not provide specific information on where the bus is located. However, we can say that it's not in the United States."
  }
  ...
]

Example for GRPO dataset

[
  {
    "id": "06bc8a17-bb1c-4007-8c08-92c41e2628b2",
    "image": "image_2.jpg",
    "conversations": [
      {
        "from": "human",
        "value": "<image>\nBased on the image, which geometric method is used to determine the bearing angle, and why is it the most appropriate choice?"
      },
      {
        "from": "gpt",
        "reasoning": "Let's analyze the image step-by-step. The image shows a right-angled triangle with points B, C, and A. The angle at point B is a right angle, indicating that trigonometric functions can be applied. To find the bearing angle, we need to relate the sides of the triangle. The tangent function is suitable here because it relates the opposite side (BC) to the adjacent side (AB) in a right-angled triangle. By using the tangent function, we can calculate the angle at point A, which is the bearing angle. Therefore, the most appropriate geometric method is the use of trigonometric functions.",
        "value": "A"
      }
    ]
  }
  ...
]

Reasoning Format

You can keep using the normal dataset format, but if you want to train with an explicit reasoning trace you should add a separate reasoning field instead of manually concatenating <think>...</think> into value.

Use --enable_reasoning True only for the following model families:

Qwen/Qwen3-VL-*-Thinking
Qwen/Qwen3.5-*

When --enable_reasoning True is enabled, the dataset pipeline follows the official chat template behavior for supported models:

The assistant prompt scaffold is treated as prompt-only and masked out from the loss.
If a reasoning field is present, the prompt is prefixed with the model's reasoning prefill, such as <|im_start|>assistant\n<think>\n, and the label starts from the reasoning body.
The reasoning field is inserted into the reasoning block.
The value field is treated as the final answer body after the reasoning block.

This is intended to make training-time formatting match the model's default inference-time chat template as closely as possible for supported reasoning models.

For unsupported models such as Qwen2-VL, Qwen2.5-VL, and non-thinking Qwen3-VL-Instruct, --enable_reasoning True raises an error on purpose.

Qwen3.5 special case

Qwen3.5 is the only supported family where samples may mix reasoning and non-reasoning data under --enable_reasoning True.
If a Qwen3.5 sample has a reasoning field, the prompt uses the open thinking scaffold and the label starts from the reasoning body.
If a Qwen3.5 sample does not have a reasoning field, the dataset uses the official non-thinking scaffold <think>\n\n</think>\n\n as prompt-only and trains only on the final answer.
Even with --enable_reasoning False, Qwen3.5 still uses the official non-thinking prompt scaffold so that training stays compatible with normal enable_thinking=False inference.
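
The split between prompt-only scaffold and supervised text described above can be sketched as follows. This is an illustration under stated assumptions, not the repository's dataset code: `split_assistant_turn` is a hypothetical helper, the closing sequence `\n</think>\n\n` between the reasoning body and the answer is assumed from the non-thinking scaffold, and the end-of-turn token is left out.

```python
from typing import Optional, Tuple

ASSISTANT_START = "<|im_start|>assistant\n"
THINK_OPEN = "<think>\n"                 # reasoning prefill quoted in the text
EMPTY_THINK = "<think>\n\n</think>\n\n"  # official non-thinking scaffold


def split_assistant_turn(value: str, reasoning: Optional[str] = None) -> Tuple[str, str]:
    """Return (prompt_only_prefix, supervised_text) for one assistant turn."""
    if reasoning:
        # The loss starts at the reasoning body; the prefill stays masked.
        prefix = ASSISTANT_START + THINK_OPEN
        # Assumption: the reasoning block is closed like this before the answer.
        target = f"{reasoning}\n</think>\n\n{value}"
    else:
        # No reasoning: the empty scaffold is prompt-only, only the answer is trained.
        prefix = ASSISTANT_START + EMPTY_THINK
        target = value
    return prefix, target


prefix, target = split_assistant_turn(
    "A damaged vehicle is partially submerged in a swimming pool.",
    reasoning="The vehicle is partially submerged and visibly damaged.",
)
print(repr(prefix))
print(repr(target))
```

Tokens of `prefix` would receive the ignore label during training, and only tokens of `target` would contribute to the loss.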

Qwen3-VL-Thinking restriction

Qwen3-VL-*-Thinking does not support reasoning-optional samples in this repo.
If you use --enable_reasoning True with Qwen3-VL-*-Thinking, every assistant sample must include a non-empty reasoning field.

SFT / GRPO format

Add reasoning to the assistant turn:

[
  {
    "id": "sample_reasoning",
    "image": "example.jpg",
    "conversations": [
      {
        "from": "human",
        "value": "<image>\nDescribe what happened here."
      },
      {
        "from": "gpt",
        "reasoning": "The vehicle is in a place where it normally should not be. It is partially submerged and visibly damaged, so an accident is the most likely explanation.",
        "value": "A damaged vehicle is partially submerged in a swimming pool."
      }
    ]
  }
]

DPO format

Add chosen_reasoning and rejected_reasoning alongside the corresponding answers:

[
  {
    "id": "sample_dpo_reasoning",
    "image": "example.jpg",
    "prompt": "<image>\nDescribe what happened here.",
    "chosen_reasoning": "The scene is unusual because the vehicle is in a pool and appears damaged, which strongly suggests an accident or deliberate crash scenario.",
    "chosen": "A damaged vehicle is partially submerged in a swimming pool.",
    "rejected_reasoning": "The image simply shows a vehicle near water, so there is not enough evidence to say anything unusual happened.",
    "rejected": "A car is parked beside a swimming pool."
  }
]

Notes

The position of the reasoning key inside the JSON object does not matter. It can appear before or after value.
Keep the final answer in value, chosen, and rejected. Do not manually wrap them with <think> when using --enable_reasoning True.
For DPO, each sample must provide both chosen_reasoning and rejected_reasoning, or neither of them.
Reasoning-optional samples are supported only for Qwen3.5. Qwen3-VL-*-Thinking requires reasoning on every sample when --enable_reasoning True is enabled.
For Qwen3.5, the scaffold inserted by the official chat template remains prompt-only. The loss starts from the reasoning body if present, otherwise from the final answer.
If you want complete manual control over the output format, leave --enable_reasoning False. Note that Qwen3.5 still uses the official non-thinking scaffold for compatibility.
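
Two of the notes above (both-or-neither reasoning fields for DPO, and no manual <think> wrapping) lend themselves to a quick pre-flight check. A minimal sketch, where `check_dpo_sample` is a hypothetical helper and not a function from this repository:

```python
def check_dpo_sample(sample: dict) -> list[str]:
    """Return problems found in one DPO sample that uses reasoning fields."""
    problems = []
    has_chosen = bool(sample.get("chosen_reasoning"))
    has_rejected = bool(sample.get("rejected_reasoning"))
    if has_chosen != has_rejected:
        problems.append("provide both chosen_reasoning and rejected_reasoning, or neither")
    for key in ("chosen", "rejected"):
        # The final answers must not carry a manually added reasoning block.
        if "<think>" in sample.get(key, ""):
            problems.append(f"{key} contains a manual <think> block")
    return problems


sample = {
    "prompt": "<image>\nDescribe what happened here.",
    "chosen_reasoning": "The vehicle is in a pool and appears damaged.",
    "chosen": "A damaged vehicle is partially submerged in a swimming pool.",
    "rejected": "A car is parked beside a swimming pool.",
}
print(check_dpo_sample(sample))
```

The sample above is reported because it has chosen_reasoning without a matching rejected_reasoning.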

Adding new domain-specific data on top of general open-source data will enhance downstream capabilities while retaining the foundational skills. Of course, you can also fine-tune solely on the new data, depending on your requirements.

Supervised Fine Tuning

⚠️

For dense Qwen3-VL models, full fine-tuning with liger-kernel can be slow. Consider turning off liger-kernel or switching to zero2 in that case.
✅Qwen3-VL-MoE and Qwen3.5-MoE use Liger 0.8.0's fused LigerExperts kernel automatically, so leaving --use_liger_kernel True is recommended for MoE variants.

Tip: You could use adamw_bnb_8bit for optimizer to save memory.

To run the training script, use the following command:

Full Finetuning

Finetune with LoRA

If you want to train only the language model with LoRA and perform full training for the vision model:

bash scripts/finetune_lora.sh

If you want to train both the language model and the vision model with LoRA:

bash scripts/finetune_lora_vision.sh

IMPORTANT: If you want to tune embed_token with LoRA, you need to tune lm_head together.

Training arguments

--deepspeed (str): Path to DeepSpeed config file (default: "scripts/zero2.json").
--data_path (str): Path to the LLaVA formatted training data (a JSON file). (Required)
--image_folder (str): Path to the images folder as referenced in the LLaVA formatted training data. (Required)
--model_id (str): Path to the Qwen2-VL model. (Required)
--use_liger (bool): Option for using liger kernel to save memory.
--output_dir (str): Output directory for model checkpoints
--num_train_epochs (int): Number of training epochs (default: 1).
--per_device_train_batch_size (int): Training batch size per GPU per forwarding step.
--gradient_accumulation_steps (int): Gradient accumulation steps (default: 4).
--freeze_vision_tower (bool): Option to freeze vision_model (default: False).
--freeze_llm (bool): Option to freeze LLM (default: False).
--freeze_merger (bool): Option to tune projector (default: False).
--num_lora_modules (int): Number of target modules to add LoRA (-1 means all layers).
--vision_lr (float): Learning rate for vision_model.
--merger_lr (float): Learning rate for merger(projector).
--learning_rate (float): Learning rate for language module.
--bf16 (bool): Option for using bfloat16.
--fp16 (bool): Option for using fp16.
--image_min_pixels (int): Option for the minimum input tokens for an image.
--image_max_pixels (int): Option for the maximum input tokens for an image.
--video_min_pixels (int): Option for the minimum input tokens for a video.
--video_max_pixels (int): Option for the maximum input tokens for a video.
--image_resized_width (int): Option for setting the width of the input image.
--image_resized_height (int): Option for setting the height of the input image.
--video_resized_width (int): Option for setting the width of the input video.
--video_resized_height (int): Option for setting the height of the input video.
--fps (float): Frames per second for video data.
--nframes (int): Number of frames for video data.
--enable_reasoning (bool): Enable structured reasoning fields for supported reasoning models (Qwen3-VL-Thinking, Qwen3.5). Qwen3.5 may mix reasoning and non-reasoning samples. Qwen3-VL-Thinking requires a non-empty reasoning field on every sample. For DPO, each sample must provide both chosen_reasoning and rejected_reasoning, or neither of them.
--unfreeze_topk_llm (int): Number of top layers to unfreeze in the language model.
--unfreeze_topk_vision (int): Number of top layers to unfreeze in the vision model.
--lora_enable (bool): Option for using LoRA.
--vision_lora (bool): Option for including vision_tower in LoRA module. lora_enable should be True to use this option.
--use_dora (bool): Option for using DoRA instead of LoRA. lora_enable should be True to use this option.
--lora_namespan_exclude (str): Exclude modules with namespans to add LoRA.
--max_seq_length (int): Maximum sequence length (default: 32K).
--bits (int): Quantization bits (default: 16).
--disable_flash_attn2 (bool): Disable Flash Attention 2.
--report_to (str): Reporting tool (choices: 'tensorboard', 'wandb', 'none') (default: 'tensorboard').
--logging_dir (str): Logging directory (default: "./tf-logs").
--lora_rank (int): LoRA rank (default: 128).
--lora_alpha (int): LoRA alpha (default: 256).
--lora_dropout (float): LoRA dropout (default: 0.05).
--logging_steps (int): Logging steps (default: 1).
--dataloader_num_workers (int): Number of data loader workers (default: 4).
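
The batch-size arguments above combine multiplicatively under ordinary data-parallel training. A minimal sketch of that arithmetic, assuming one process per GPU and no other batch scaling (both function names are illustrative and not part of this repository):

```python
import math


def effective_batch_size(per_device: int, grad_accum: int, num_gpus: int) -> int:
    """Samples consumed per optimizer step."""
    return per_device * grad_accum * num_gpus


def steps_per_epoch(num_samples: int, batch_size: int) -> int:
    """Optimizer steps needed to see every sample once (last batch may be partial)."""
    return math.ceil(num_samples / batch_size)


ebs = effective_batch_size(per_device=2, grad_accum=4, num_gpus=8)
print(ebs)                         # 64
print(steps_per_epoch(1000, ebs))  # 16
```

Keeping the effective batch size constant when changing the GPU count usually means adjusting --gradient_accumulation_steps in the opposite direction.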

Train with video dataset

You can train the model using a video dataset. You can also set the LoRA configs to train on video with LoRA.

bash scripts/finetune_video.sh

When training with video, it behaves like multi-image input, so adjust max_pixels and fps based on the available VRAM.

If you run out of VRAM, you can use zero3_offload instead of zero3.
You can use zero2_offload for slightly faster training.

Image Resolution for vram usage

The model supports a wide range of input resolutions. By default, it uses the native resolution. Native or higher pixel counts are recommended for better performance, but large images take too much memory and computation time, so you can adjust the pixel counts instead. The model splits the image into token * 28 * 28, so you only need to change the token_num part in the script.

⚠️

For Qwen3-VL models, it should be token * 32 * 32.

For example:

--image_min_pixels $((256 * 28 * 28))
--image_max_pixels $((1280 * 28 * 28))
--video_min_pixels $((128 * 28 * 28))
--video_max_pixels $((768 * 28 * 28))
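
The shell arithmetic in these flags is simply token_num multiplied by the squared patch factor. The sketch below reproduces it in Python for checking budgets; the helper names are illustrative, and the floor division in `tokens_for` is an assumption about how a fixed-size input maps back to a token count.

```python
def pixel_budget(token_num: int, patch: int = 28) -> int:
    """Pixels corresponding to `token_num` visual tokens."""
    return token_num * patch * patch


def tokens_for(width: int, height: int, patch: int = 28) -> int:
    """Approximate visual tokens for a width x height input."""
    return (width * height) // (patch * patch)


print(pixel_budget(256))            # 200704, same as $((256 * 28 * 28))
print(pixel_budget(1280))           # 1003520
print(pixel_budget(256, patch=32))  # 262144, using the Qwen3-VL factor of 32
print(tokens_for(448, 448))         # 256
```

A 448 x 448 input therefore sits exactly at the 256-token minimum used in the example flags.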

Alternatively, you can set the image/video height and width directly to control memory usage.

--image_resized_width 448
--image_resized_height 448
--video_resized_width 448
--video_resized_height 448

These values will be rounded to the nearest multiple of 28.
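
To see which size a requested width or height ends up at, the rounding can be sketched as below. This is an illustration only: the half-up handling of exact ties and the lower bound of one factor are assumptions, and `round_to_multiple` is not a function from this repository.

```python
def round_to_multiple(value: int, factor: int = 28) -> int:
    """Round `value` to the nearest multiple of `factor` (ties go up)."""
    rounded = ((value + factor // 2) // factor) * factor
    # Assumption: never go below a single patch.
    return max(factor, rounded)


for requested in (448, 450, 500):
    print(requested, "->", round_to_multiple(requested))
# 448 -> 448
# 450 -> 448
# 500 -> 504
```

Choosing sizes that are already multiples of 28 avoids any surprise from the rounding step.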

Merge LoRA Weights

bash scripts/merge_lora.sh

Note: Remember to replace the paths in finetune.sh or finetune_lora.sh with your specific paths. (Also in merge_lora.sh when using LoRA.)

Evaluation during Training

You can run generation-based evaluation during training by providing an evaluation dataset and a custom compute_metrics function. This allows you to monitor metrics like accuracy, BLEU, or any custom metric based on the model's generated text outputs.

Step 1: Prepare Evaluation Dataset

The evaluation dataset uses the same format as the training dataset. Place your evaluation data JSON file and specify the path using --eval_path.

[
  {
    "id": "eval_001",
    "image": "test_image.jpg",
    "conversations": [
      {
        "from": "human",
        "value": "<image>\nWhat is shown in this image?"
      },
      {
        "from": "gpt",
        "value": "A cat sitting on a couch."
      }
    ]
  }
]

Step 2: Define compute_metrics Function

Create a custom compute_metrics function in your training script. The function receives a GenerativeEvalPrediction object containing:

predictions: List of generated text strings from the model
references: List of ground truth answer strings

from src.trainer import GenerativeEvalPrediction

def compute_metrics(eval_pred: GenerativeEvalPrediction):
    predictions = eval_pred.predictions
    references = eval_pred.references

    # Example: Exact match accuracy
    correct = sum(
        1 for p, r in zip(predictions, references)
        if p.strip().lower() == r.strip().lower()
    )
    accuracy = correct / len(predictions) if predictions else 0

    return {"accuracy": accuracy}

Step 3: Modify Training Script

Update your training script (src/train/train_sft.py) to pass compute_metrics to the trainer:

from src.trainer import QwenSFTTrainer, GenerativeEvalPrediction

def compute_metrics(eval_pred: GenerativeEvalPrediction):
    predictions = eval_pred.predictions
    references = eval_pred.references
    correct = sum(1 for p, r in zip(predictions, references) if p.strip() == r.strip())
    # Guard against an empty evaluation set
    return {"accuracy": correct / len(predictions) if predictions else 0}

# ... (model and data setup code)

trainer = QwenSFTTrainer(
    model=model,
    processing_class=processor,
    args=training_args,
    compute_metrics=compute_metrics,  # Add this line
    **data_module
)

Step 4: Add Evaluation Arguments

Add these arguments to your training script:

--eval_path /path/to/eval.json \
--eval_strategy steps \
--eval_steps 500 \
--per_device_eval_batch_size 1 \
--generation_max_new_tokens 256 \
--prediction_loss_only False \
# ... other arguments

Evaluation Arguments

--eval_path (str): Path to the evaluation data JSON file.
--eval_strategy (str): Evaluation strategy - "steps" or "epoch" (default: "no").
--eval_steps (int): Number of steps between evaluations (when eval_strategy="steps").
--per_device_eval_batch_size (int): Batch size for evaluation (default: 8).
--generation_max_new_tokens (int): Maximum new tokens to generate during evaluation (default: 512).
--prediction_loss_only (bool): Set to False to enable generation-based evaluation (default: True).

Example: Custom Metrics with Multiple Scores

from src.trainer import GenerativeEvalPrediction
import re

def compute_metrics(eval_pred: GenerativeEvalPrediction):
    predictions = eval_pred.predictions
    references = eval_pred.references

    # Exact match
    exact_matches = sum(
        1 for p, r in zip(predictions, references)
        if p.strip().lower() == r.strip().lower()
    )

    # Contains match (reference appears in prediction)
    contains_matches = sum(
        1 for p, r in zip(predictions, references)
        if r.strip().lower() in p.strip().lower()
    )

    n = len(predictions)
    return {
        "exact_match": exact_matches / n if n > 0 else 0,
        "contains_match": contains_matches / n if n > 0 else 0,
    }

Note: Generation-based evaluation is slower than loss-only evaluation because it runs model.generate() for each sample. Consider using a smaller evaluation dataset or less frequent evaluation steps.
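
Exact match can be too strict for free-form answers, so a softer score is sometimes useful. The sketch below computes a whitespace-token F1. It only relies on the predictions and references attributes described earlier; `EvalPredictionStub` is a hypothetical stand-in for GenerativeEvalPrediction so that the example runs on its own, and the tokenization is deliberately naive.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class EvalPredictionStub:
    """Hypothetical stand-in exposing the two documented attributes."""
    predictions: list
    references: list


def token_f1(prediction: str, reference: str) -> float:
    """F1 over lowercase whitespace tokens, counting repeated tokens."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def compute_metrics(eval_pred) -> dict:
    scores = [token_f1(p, r) for p, r in zip(eval_pred.predictions, eval_pred.references)]
    return {"token_f1": sum(scores) / len(scores) if scores else 0.0}


stub = EvalPredictionStub(predictions=["a cat", "dog"], references=["a cat", "cat"])
print(compute_metrics(stub))  # {'token_f1': 0.5}
```

In a real run the stub would be dropped and the function passed to the trainer as compute_metrics.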

DPO Finetuning

You can train the model using Direct Preference Optimization (DPO).
The process is quite similar to Supervised Fine-Tuning (SFT), and you can also apply LoRA during DPO training just like in SFT.

If you are training a supported reasoning model, add --enable_reasoning True and provide chosen_reasoning / rejected_reasoning in the dataset as described in Reasoning Format. Each DPO sample must contain both reasoning fields or neither of them, and reasoning-optional samples are supported only for Qwen3.5.

bash scripts/finetune_dpo.sh

Most of the training arguments are the same as for SFT, but a few additional arguments are added for DPO training.

Training arguments

--dpo_loss (str): Loss type for dpo. (default: 'sigmoid')
--precompute_ref_log_probs (bool): Whether to precompute the reference log probs (default: False)
--beta (float): The beta value for DPO (default: 0.1)
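
For intuition about what --beta controls, the standard sigmoid DPO objective for a single preference pair can be written out by hand. This is a sketch of the textbook formula, assuming the 'sigmoid' option corresponds to it; it is not taken from this repository's trainer, and the log-probability values are made up for illustration.

```python
import math


def dpo_sigmoid_loss(pi_chosen: float, pi_rejected: float,
                     ref_chosen: float, ref_rejected: float,
                     beta: float = 0.1) -> float:
    """-log(sigmoid(beta * margin)) for summed sequence log-probabilities."""
    margin = (pi_chosen - pi_rejected) - (ref_chosen - ref_rejected)
    x = beta * margin
    # Numerically stable softplus(-x), equal to -log(sigmoid(x)).
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


# Policy identical to the reference: the loss equals log(2).
print(dpo_sigmoid_loss(-10.0, -12.0, -10.0, -12.0))
# Policy prefers the chosen answer more strongly than the reference: lower loss.
print(dpo_sigmoid_loss(-10.0, -14.0, -10.0, -12.0))
```

A larger beta makes the loss more sensitive to the same margin, which keeps the policy closer to the reference model.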

GRPO Finetuning

You can train the model using Group Relative Policy Optimization (GRPO).
The process is quite similar to Supervised Fine-Tuning (SFT), and you can also apply LoRA during GRPO training just like in SFT.

If you are training a supported reasoning model, add --enable_reasoning True and store the assistant reasoning in the reasoning field of the assistant turn as described in Reasoning Format. Reasoning-optional samples are supported only for Qwen3.5.

Prerequisites

What	Where	Notes
Reward functions	src/train/reward_funcs.py	Add any function that ends with _reward. The training script picks them up automatically.
Custom system prompts	src/constants.py	Append your own prompt strings here.
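
As an illustration of the naming convention in the table, a reward function could look like the sketch below. The signature (a list of completions plus keyword arguments, returning one float per completion) follows the common TRL-style convention and is an assumption about what src/train/reward_funcs.py expects; `think_tag_reward` and `_to_text` are hypothetical names.

```python
import re

# A reasoning block followed by a non-empty answer.
_PATTERN = re.compile(r"<think>.*?</think>\s*\S.*", re.DOTALL)


def _to_text(completion) -> str:
    """Accept either a plain string or a list of chat messages with string content."""
    if isinstance(completion, str):
        return completion
    return "".join(message.get("content", "") for message in completion)


def think_tag_reward(completions, **kwargs) -> list:
    """1.0 if the completion is '<think>...</think>' plus an answer, else 0.0."""
    return [
        1.0 if _PATTERN.fullmatch(_to_text(c).strip()) else 0.0
        for c in completions
    ]


print(think_tag_reward(["<think>steps</think> A", "A"]))  # [1.0, 0.0]
```

Because the name ends with _reward, a function like this would match the pickup rule stated in the table once placed in the rewards file.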

You can start training with the script below.
Before training, please check the dataset format once more. The format is a bit different from other training methods.

bash scripts/finetune_grpo.sh

Most of the training arguments are the same as for SFT, but a few additional arguments are added for GRPO training.

Training arguments

--temperature (float): Generation config (default: 0.9)
--top_p (float): Generation config (default: 1.0)
--top_k (int): Generation config (default: 50)
--min_p (float): Generation config (default: None)
--repetition_penalty (float): Generation config (default: 1.0)
--max_completion_length (int): Max length for the completion (default: 256)
--max_prompt_length (int): Max length for the prompt (default: 512)
--beta (float): KL Coefficient. (default: 0.04)
--liger_grpo_loss_type (str): When --use_liger_loss True, choose the GRPO loss variant exposed by liger-kernel 0.8.0. One of grpo, bnpo, dr_grpo, dapo (LigerFusedLinearGRPOLoss default), cispo, sapo, luspo. Defaults to None, which keeps Liger's built-in default. dr_grpo also requires --max_completion_length.

Classification Finetuning

The model is tailored for classification tasks, similar to other SequenceClassification models.

For the classification task, you need to prepare the dataset in a specific format. The dataset should be a JSON file where each entry contains an image and its corresponding label. The labels should be integers starting from 0.
You can set the text in the prompt field to provide a question and options for the classification task. If your dataset does not contain the prompt field, the script automatically uses the USER_MESSAGE from cls_dataset.py.

Please see the example below for the dataset format.

Example for Classification Dataset

[
  {
    "id": "06bc8a17-bb1c-4007-8c08-92c41e2628b2",
    "image": "image_2.jpg",
    "prompt": "Question: What is in the image? \n Options: \n 1. A train \n 2. A bus \n 3. A car \n 4. A bicycle",
    "label": "3"
  }
  ...
]

Note: You should set the CLASS_2_ID variable in the cls_dataset.py.

The dataset can contain single/multi-image or video data, and the model will be trained to classify the images/videos based on the provided labels.

For now, you can select loss from one of the following:

cross_entropy
focal_loss
class_balanced_cross_entropy
class_balanced_focal_loss

Also you can set early stopping patience and threshold for the training. For example, you can set --early_stopping_patience 5 and --early_stopping_threshold 0.01 to stop the training if the validation loss does not improve for 5 epochs with a threshold of 0.01.

Most of the training arguments are the same as for SFT, but a few additional arguments are added for classification training.

Tip: In models like the Qwen family, which have strong context embeddings, even a shallow nonlinearity (a 1-layer MLP) can often improve separability in the tail. This works by introducing a bit of curvature that a purely linear head cannot provide.

Training arguments

--loss_type (str): Loss type for classification (default: 'cross_entropy').
--focal_alpha (str): Focal Loss alpha value. If None use CrossEntropyLoss. ex '1.0,7.5' (default: None).
--focal_gamma (float): Focal Loss gamma value. (default: 0.0)
--num_labels (int): Number of labels for classification
--class_balanced_beta (float): Class Balanced beta value. (default: 0.999)
--early_stopping_patience (int): Early stopping patience (default: 0)
--early_stopping_threshold (float): Early stopping threshold (default: 0.01)
--mlp_head_dim (int): Dimension of the MLP head (default: 0)
--mlp_head_dropout (float): Dropout rate for the MLP head (default: 0.0)
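
To show how --focal_gamma, --focal_alpha and --class_balanced_beta interact, the textbook formulas can be written for a single example. This sketch follows the usual definitions of focal loss and effective-number class weighting; the normalization of the weights and both function names are assumptions, and the repository's implementation may differ.

```python
import math


def class_balanced_weights(samples_per_class, beta: float = 0.999):
    """Weights proportional to (1 - beta) / (1 - beta ** n) for each class."""
    raw = [(1 - beta) / (1 - beta ** n) for n in samples_per_class]
    scale = len(raw) / sum(raw)  # assumption: weights sum to the number of classes
    return [w * scale for w in raw]


def focal_loss(probs, label: int, gamma: float = 0.0, alpha=None) -> float:
    """Focal loss for one example given its predicted class probabilities."""
    p_t = probs[label]
    weight = 1.0 if alpha is None else alpha[label]
    return -weight * (1 - p_t) ** gamma * math.log(p_t)


probs = [0.2, 0.8]
print(focal_loss(probs, 1))             # gamma = 0 reduces to cross entropy
print(focal_loss(probs, 1, gamma=2.0))  # the easy example is down-weighted
print(class_balanced_weights([1000, 10]))
```

With the default gamma of 0.0 and no alpha, the focal term disappears and the result equals plain cross entropy, which matches the defaults listed above.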

You can run the training script using the following command:

bash scripts/finetune_cls.sh

Experimental Features

Sampler for the dataset. The trainer scripts support a custom sampler for the dataset. You can write your own sampler by inheriting from DistributedSampler.

Inference

Note: You should use the merged weight when trained with LoRA.

Gradio Inference (WebUI)

Install gradio
Launch app

python -m src.serve.app \
    --model-path /path/to/merged/weight

You can launch gradio based demo with this command. This can also set some other generation configs like repetition_penalty, temperature etc.

Issue for libcudnn error

Could not load library libcudnn_cnn_train.so.8. Error: /usr/local/cuda-12.1/lib/libcudnn_cnn_train.so.8: undefined symbol: _ZN5cudnn3cnn34layerNormFwd_execute_internal_implERKNS_7backend11VariantPackEP11CUstream_stRNS0_18LayerNormFwdParamsERKNS1_20NormForwardOperationEmb, version libcudnn_cnn_infer.so.8

Running unset LD_LIBRARY_PATH may resolve this error. For details, see this issue.

TODO

Support for video data
Add demo for multi-image and video
Handle mixed-modality data in dataset and collator
Support Qwen2.5-VL
Monkey-patch liger-kernel for Qwen2.5-VL
Update the code base to the latest transformers.
Add DPO
Add GRPO
Support Qwen3-VL(non-moe)
Support Qwen3-VL-Moe
Support Qwen3.5

Known Issues

libcudnn issue

License

This project is licensed under the Apache-2.0 License. See the LICENSE file for details.

Citation

If you find this repository useful in your project, please consider giving a ⭐ and citing:

@misc{Qwen2-VL-Finetuning,
    author = {Yuwon Lee},
    title = {Qwen2-VL-Finetune},
    year = {2024},
    publisher = {GitHub},
    url = {https://github.com/2U1/Qwen2-VL-Finetune}
}

Acknowledgement

This project is based on

LLaVA-NeXT: An amazing open-source project of LMM.
Mipha: Open-source project of SMM with amazing capabilities.
Qwen2-VL-7B-Instruct: Awesome pretrained MLLM based on Qwen2.
Liger-Kernel: Collection of Triton kernels designed specifically for LLM training.
VLM-R1: Open-source project of Reinforcement Learning with VLMs.
