GitHub - 2U1/Gemma3-Finetune: An open-source implementation for the Gemma3 series by Google.

Fine-tuning Gemma3

This repository contains a script for training Gemma3 using only HuggingFace.

Other projects

[Phi3-Vision Finetuning]
[Llama3.2-Vision Finetuning]
[Qwen2-VL Finetuning]
[Molmo Finetuning]
[Pixtral Finetune]
[SmolVLM Finetune]

Update

Supported Features

Docker

To simplify the setup process for training, you can use the provided pre-built environment.
The setup is done in the conda env named train.

You could find more information about the image here.

docker pull john119/vlm
docker run --gpus all -it -v /host/path:/docker/path --name vlm --ipc=host john119/vlm /bin/bash

Installation

Environments

Install the required packages using either environment.yaml or requirements.txt.

Using environment.yaml

conda env create -f environment.yaml
conda activate train

Using requirements.txt

pip install -r requirements.txt -f https://download.pytorch.org/whl/cu124

Note: It is strongly recommended to train Gemma3 models with the eager attention implementation instead of flash_attention_2.

Dataset Preparation

Note: You can also fine-tune the model using text-only data.

The script requires a dataset formatted according to the LLaVA specification. The dataset should be a JSON file where each entry contains information about conversations and images. Ensure that the image paths in the dataset match the provided --image_folder.

When using a multi-image dataset, every image token should be <image>, and the image file names should be given as a list. Please see the examples below and format your data accordingly.

Example for text only data

[
  {
    "id": "000000033471",
    "conversations": [
      {
        "from": "human",
        "value": "Identify the odd one out: Twitter, Instagram, Telegram"
      },
      {
        "from": "gpt",
        "value": "Telegram"
      },
      {
        "from": "human",
        "value": "What makes Telegram different from Twitter and Instagram?"
      },
      {
        "from": "gpt",
        "value": "Telegram is a cloud-based instant messaging app that focuses on privacy and security. Unlike Twitter and Instagram which are mainly used for following news, celebrities, and sharing images, Telegram was created as a secure messaging app for private and group communication. Telegram also offers more advanced features than Twitter and Instagram, such as the ability to schedule messages, create bots, and send encrypted messages."
      }
    ]
  }
  ...
]

Example for single image dataset

[
  {
    "id": "000000033471",
    "image": "000000033471.jpg",
    "conversations": [
      {
        "from": "human",
        "value": "<image>\nWhat are the colors of the bus in the image?"
      },
      {
        "from": "gpt",
        "value": "The bus in the image is white and red."
      },
      {
        "from": "human",
        "value": "What feature can be seen on the back of the bus?"
      },
      {
        "from": "gpt",
        "value": "The back of the bus features an advertisement."
      },
      {
        "from": "human",
        "value": "Is the bus driving down the street or pulled off to the side?"
      },
      {
        "from": "gpt",
        "value": "The bus is driving down the street, which is crowded with people and other vehicles."
      }
    ]
  }
  ...
]

Example for multi image dataset

[
  {
    "id": "000000033471",
    "image": ["000000033471.jpg", "000000033472.jpg"],
    "conversations": [
      {
        "from": "human",
        "value": "<image>\n<image>\nIs the perspective of the camera different?"
      },
      {
        "from": "gpt",
        "value": "Yes, the perspective of the camera is different."
      }
    ]
  }
  ...
]
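Before training, it can help to sanity-check a dataset file against the format described above. The following is a minimal sketch, not part of the repository: the function names are hypothetical, and the rule that the number of <image> tokens in the human turns should equal the number of listed images is an assumption drawn from the examples.

```python
import json
from pathlib import Path

IMAGE_TOKEN = "<image>"


def check_entry(entry, image_folder):
    """Return a list of human-readable problems found in one dataset entry."""
    problems = []
    images = entry.get("image", [])
    if isinstance(images, str):  # single-image entries store a bare string
        images = [images]

    # Count image tokens across all human turns (assumed convention).
    n_tokens = sum(
        turn.get("value", "").count(IMAGE_TOKEN)
        for turn in entry.get("conversations", [])
        if turn.get("from") == "human"
    )
    if n_tokens != len(images):
        problems.append(
            f"{entry.get('id')}: {n_tokens} image tokens for {len(images)} images"
        )

    # Every listed file should exist under the folder passed as --image_folder.
    for name in images:
        if not (Path(image_folder) / name).is_file():
            problems.append(f"{entry.get('id')}: missing file {name}")
    return problems


def check_dataset(json_path, image_folder):
    """Load a LLaVA-style JSON file and collect problems from every entry."""
    with open(json_path, encoding="utf-8") as f:
        data = json.load(f)
    return [p for entry in data for p in check_entry(entry, image_folder)]
```

Calling check_dataset("data.json", "images/") returns an empty list when nothing looks wrong; both paths here are placeholders.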

Example for video dataset

[ { "id": "sample1", "video": "sample1.mp4", "conversations": [ { "from": "human", "value": "

Note: Gemma3 treats a video as a sequence of images.

Example for DPO dataset

[
  {
    "id": "e4cd017f-3854-4019-9cc1-9dacacb5d424",
    "image": "example_0_0.png",
    "prompt": "<image>\nDescribe the key features of the image in great detail.",
    "chosen": "The image features a modern, multi-colored train parked on a train track, occupying a significant portion of the scene. The train has a unique and striking design, with red, white, and blue colors, as well as yellow accents. It appears to be a high-speed train, indicating its advanced engineering and potential for fast travel.\n\nThe train is situated next to a platform, suggesting that it may be at a train station. The overall scene conveys a sense of transportation and modern technology, with the vibrant and sleek train as the main subject.",
    "rejected": "The image features a modern, colorful train parked on a set of train tracks. The train has a striking design, with red, white, and blue colors as well as yellow accents. It appears to be a high-speed train, ready for departure.\n\nThe train spans a significant portion of the image, from the left to the right side. The train tracks can be seen clearly beneath the train, emphasizing its position on the railway line. The overall scene gives off an impression of a contemporary and efficient mode of transportation."
  },
  {
    "id": "5e19e647-e5d3-4bcf-82e9-d262570743ae",
    "image": "example_1_0.png",
    "prompt": "<image>\nIs this bus in the USA?",
    "chosen": "Yes, based on the image, it can be assumed that this bus is in the USA. The location of the bus cannot be accurately determined.",
    "rejected": "No, it's not in the USA. The image does not provide specific information on where the bus is located. However, we can say that it's not in the United States."
  }
  ...
]

Example for GRPO dataset

[
  {
    "id": "06bc8a17-bb1c-4007-8c08-92c41e2628b2",
    "image": "image_2.jpg",
    "conversations": [
      {
        "from": "human",
        "value": "Based on the image, which geometric method is used to determine the bearing angle, and why is it the most appropriate choice?"
      },
      {
        "from": "gpt",
        "value": "Let's analyze the image step-by-step. The image shows a right-angled triangle with points B, C, and A. The angle at point B is a right angle, indicating that trigonometric functions can be applied. To find the bearing angle, we need to relate the sides of the triangle. The tangent function is suitable here because it relates the opposite side (BC) to the adjacent side (AB) in a right-angled triangle. By using the tangent function, we can calculate the angle at point A, which is the bearing angle. Therefore, the most appropriate geometric method is the use of trigonometric functions.\n\nA"
      }
    ]
  }
  ...
]

Adding new domain-specific data on top of general open-source data enhances downstream capabilities while retaining foundational skills. Of course, you can also choose to fine-tune solely on the new data, depending on your requirements.

Supervised Fine Tuning

Note: DeepSpeed zero2 is faster than zero3 but consumes more memory. zero2 is also more stable than zero3 most of the time.

Tip: You can use adamw_bnb_8bit as the optimizer to save memory.

To run the training script, use the following command:

Full Finetuning

bash scripts/finetune.sh

Finetune with LoRA

If you want to train only the language model with LoRA and perform full training for the vision model:

bash scripts/finetune_lora.sh

If you want to train both the language model and the vision model with LoRA:

bash scripts/finetune_lora_vision.sh

IMPORTANT: If you want to tune embed_token with LoRA, you need to tune lm_head together with it.

Training arguments

Note: The learning rate of the vision_model should be 5x to 10x smaller than that of the language_model.
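To make the learning-rate note concrete, here is a hedged sketch of how parameters could be split into two optimizer groups. It is an illustration only and not the repository's code: the function name, the default values, and the rule of matching the substring "vision" in parameter names are all assumptions. The training script has its own arguments for this.

```python
def build_param_groups(named_params, language_lr=1e-5, vision_lr_ratio=0.1):
    """Split (name, param) pairs into language and vision optimizer groups.

    vision_lr_ratio=0.1 makes the vision learning rate 10x smaller;
    0.2 would make it 5x smaller. Matching on "vision" is an assumption
    about how the vision parameters are named.
    """
    vision, other = [], []
    for name, param in named_params:
        (vision if "vision" in name else other).append(param)
    return [
        {"params": other, "lr": language_lr},
        {"params": vision, "lr": language_lr * vision_lr_ratio},
    ]
```

The returned list has the shape that PyTorch optimizers accept as parameter groups, for example when passed together with model.named_parameters().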

Train with video dataset

You can train the model using a video dataset. However, Gemma3 processes videos as a sequence of images, so you’ll need to select specific frames and treat them as multiple images for training. You can also set the LoRA configs to train on video data with LoRA.

bash scripts/finetune_video.sh

If you run out of VRAM, you can use zero3_offload instead of zero3. However, using zero3 is preferred.
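Since a video is handled as a sequence of images, some rule is needed to decide which frames to keep. The sketch below shows one common choice, evenly spaced sampling. It is an assumption for illustration: the function name is hypothetical, and the training script may select frames differently.

```python
def sample_frame_indices(total_frames, num_frames):
    """Pick up to num_frames indices spread evenly over range(total_frames)."""
    if total_frames <= 0 or num_frames <= 0:
        return []
    if num_frames >= total_frames:
        # Short clip: keep every frame.
        return list(range(total_frames))
    step = total_frames / num_frames
    # Take the frame at the middle of each of the num_frames equal segments.
    return [int(step * i + step / 2) for i in range(num_frames)]
```

For a 100-frame clip and num_frames=4, this yields the indices 12, 37, 62 and 87.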

Merge LoRA Weights

bash scripts/merge_lora.sh

Note: Remember to replace the paths in finetune.sh or finetune_lora.sh with your specific paths. (Also in merge_lora.sh when using LoRA.)
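For readers unfamiliar with what merging does, the sketch below shows the standard LoRA arithmetic on tiny matrices: the merged weight is W + (alpha / r) * B A. It uses plain Python lists purely for illustration. The function names are hypothetical, and this is not how merge_lora.sh is implemented.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return W + (alpha / rank) * (B @ A) without modifying the inputs.

    weight: out x in, lora_b: out x rank, lora_a: rank x in.
    """
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]
```

After merging, the adapter matrices are no longer needed at inference time, because their effect is folded into the base weight.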

DPO Finetuning

You can train the model using Direct Preference Optimization (DPO).
The process is quite similar to Supervised Fine-Tuning (SFT), and you can also apply LoRA during DPO training just like in SFT.

bash scripts/finetune_dpo.sh

Most of the training arguments are the same as for SFT, but a few additional arguments are used for DPO training.
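As background for the DPO-specific arguments, the following sketch computes the standard DPO loss for a single chosen/rejected pair from summed log-probabilities. It is a plain-Python illustration and not the training code. The function name is hypothetical and beta=0.1 is an assumed value, not necessarily the script's default.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss -log(sigmoid(beta * margin)) for one preference pair."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(x)) == log(1 + exp(-x)), evaluated in a stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss equals log(2) when the policy and the reference model agree, and it drops below that as the policy favors the chosen response more strongly than the reference does.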

Training arguments

GRPO Finetuning

You can train the model using Group Relative Policy Optimization (GRPO).
The process is quite similar to Supervised Fine-Tuning (SFT), and you can also apply LoRA during GRPO training just like in SFT.
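The "group relative" part of GRPO refers to scoring each sampled completion against the other completions for the same prompt. A minimal sketch of that normalization follows. The function name and the eps value are hypothetical, and using the population standard deviation is an assumption, since implementations differ on this detail.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize the rewards of one prompt's completions within their group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when every reward in the group is equal.
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Completions that score above the group mean get a positive advantage and those below get a negative one. A group with identical rewards gives all zeros and therefore no learning signal.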

For video data, you should preprocess the video and save the frames to a local directory. Then format your data as a multi-image dataset.
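Once the frames are saved, each video sample can be rewritten as a multi-image entry. The helper below is a hypothetical sketch of that conversion. The function name is not from the repository, and placing one <image> token per frame at the start of the question is an assumption based on the multi-image convention described earlier.

```python
def frames_to_entry(sample_id, frame_names, question, answer):
    """Build one multi-image dataset entry from a list of saved frame files."""
    # One image token per frame, each on its own line, before the question.
    prompt = "<image>\n" * len(frame_names) + question
    return {
        "id": sample_id,
        "image": list(frame_names),
        "conversations": [
            {"from": "human", "value": prompt},
            {"from": "gpt", "value": answer},
        ],
    }
```

A list of such entries can then be written out with json.dump and used like any other multi-image dataset.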

Prerequisites

| What | Where | Notes |
| --- | --- | --- |
| Reward functions | src/train/reward_funcs.py | Add any function that ends with _reward. The training script picks them up automatically. |
| Custom system prompts | src/constants.py | Append your own prompt strings here. |
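The naming convention for reward functions can be illustrated as follows. This is a hedged sketch and not the contents of src/train/reward_funcs.py. The (completions, **kwargs) signature with completions as plain strings, the letter_choice_reward function, and the collect_reward_funcs helper are all assumptions made for illustration, and the actual discovery logic in the training script may differ.

```python
import inspect
import re


def letter_choice_reward(completions, **kwargs):
    """Hypothetical reward: 1.0 if a completion ends with a lone letter A-D."""
    rewards = []
    for text in completions:
        lines = text.strip().splitlines()
        last_line = lines[-1].strip() if lines else ""
        rewards.append(1.0 if re.fullmatch(r"[A-D]", last_line) else 0.0)
    return rewards


def collect_reward_funcs(namespace):
    """Return functions in a namespace dict whose names end with _reward."""
    return [
        fn
        for name, fn in sorted(namespace.items())
        if inspect.isfunction(fn) and name.endswith("_reward")
    ]
```

Passing vars(module) or globals() to collect_reward_funcs mimics the suffix-based pickup described in the table.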

You could start training using this script.

bash scripts/finetune_grpo.sh

Most of the training arguments are the same as for SFT, but a few additional arguments are used for GRPO training.

Training arguments

Note: Liger GRPO loss and vLLM back-end are not yet supported. Both will be added soon.

Issue for libcudnn error

Could not load library libcudnn_cnn_train.so.8. Error: /usr/local/cuda-12.1/lib/libcudnn_cnn_train.so.8: undefined symbol: _ZN5cudnn3cnn34layerNormFwd_execute_internal_implERKNS_7backend11VariantPackEP11CUstream_stRNS0_18LayerNormFwdParamsERKNS1_20NormForwardOperationEmb, version libcudnn_cnn_infer.so.8

You can run unset LD_LIBRARY_PATH to work around this error. See this issue for details.

TODO

Known Issues

License

This project is licensed under the Apache-2.0 License. See the LICENSE file for details.

Citation

If you find this repository useful in your project, please consider giving a ⭐ and citing:

@misc{Gemma3-Finetuning,
  author = {Yuwon Lee},
  title = {Gemma3-Finetune},
  year = {2025},
  publisher = {GitHub},
  url = {https://github.com/2U1/Gemma3-Finetune}
}

Acknowledgement

This project is based on