Qwen3: How to Run & Fine-tune
Learn to run & fine-tune Qwen3 locally with Unsloth + our Dynamic 2.0 quants
Qwen's new Qwen3 models deliver state-of-the-art advancements in reasoning, instruction-following, agent capabilities, and multilingual support. All of our uploads use Unsloth Dynamic 2.0 quantization for SOTA 5-shot MMLU and KL Divergence performance, meaning you can run and fine-tune quantized Qwen3 LLMs with minimal accuracy loss.
We also uploaded Qwen3 with native 128K context length. Qwen achieves this by using YaRN to extend its original 40K window to 128K.
Unsloth also now supports fine-tuning and GRPO of Qwen3 and Qwen3 MoE models: 2x faster, with 70% less VRAM, and 8x longer context lengths. You can fine-tune Qwen3 (14B) for free using our Colab notebook.
Qwen3 - Unsloth Dynamic 2.0 with optimal configs:
Dynamic 2.0 GGUF (to run)
Dynamic 4-bit Safetensor (to finetune/deploy)
Official Recommended Settings
According to Qwen, these are the recommended settings for inference:
Thinking Mode Settings:
Temperature = 0.6
Min_P = 0.0
Top_P = 0.95
Top_K = 20
Non-Thinking Mode Settings:
Temperature = 0.7
Min_P = 0.0 (optional, but 0.01 works well; llama.cpp's default is 0.1)
Top_P = 0.8
Top_K = 20
Chat template/prompt format:
<|im_start|>user\nWhat is 2+2?<|im_end|>\n<|im_start|>assistant\n
For NON thinking mode, we purposely enclose <think> and </think> with nothing:
<|im_start|>user\nWhat is 2+2?<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n
For thinking mode, DO NOT use greedy decoding, as it can lead to performance degradation and endless repetitions.
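As a concrete example, here is a minimal vLLM sketch applying the thinking-mode settings above (the model id is an assumption; any Qwen3 upload works):

```python
# Minimal sketch: Qwen's recommended thinking-mode sampling settings in vLLM.
from vllm import LLM, SamplingParams

llm = LLM(model="unsloth/Qwen3-8B")  # assumed model id; pick any Qwen3 upload

params = SamplingParams(
    temperature=0.6,   # thinking mode (use 0.7 for non-thinking)
    top_p=0.95,        # thinking mode (use 0.8 for non-thinking)
    top_k=20,
    min_p=0.0,
    max_tokens=1024,
)
outputs = llm.generate(["What is 2+2?"], params)
print(outputs[0].outputs[0].text)
```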
Switching Between Thinking and Non-Thinking Mode
Qwen3 models come with a built-in "thinking mode" that boosts reasoning and improves response quality. Instructions for switching modes differ depending on the inference engine you're using, so make sure you follow the correct ones below.
Instructions for llama.cpp and Ollama:
You can add /think and /no_think to user prompts or system messages to switch the model's thinking mode from turn to turn. The model will follow the most recent instruction in multi-turn conversations.
Here is an example of multi-turn conversation:
> Who are you /no_think
<think>
</think>
I am Qwen, a large-scale language model developed by Alibaba Cloud. [...]
> How many 'r's are in 'strawberries'? /think
<think>
Okay, let's see. The user is asking how many times the letter 'r' appears in the word "strawberries". [...]
</think>
The word strawberries contains 3 instances of the letter r. [...]
Instructions for transformers and vLLM:
Thinking mode:
enable_thinking=True
By default, Qwen3 has thinking enabled. When you call tokenizer.apply_chat_template, you don't need to set anything manually:
text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
enable_thinking=True # Default is True
)
In thinking mode, the model will generate an extra <think>...</think>
block before the final answer — this lets it "plan" and sharpen its responses.
Non-thinking mode:
enable_thinking=False
Setting enable_thinking=False makes Qwen3 skip all the thinking steps and behave like a standard LLM:
text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
enable_thinking=False # Disables thinking mode
)
This mode provides final responses directly: no <think> blocks, no chain-of-thought.
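To see both modes end to end, here is a minimal sketch (the 8B model id and generation settings are assumptions carried over from this guide) that renders the template, generates, and decodes only the new tokens:

```python
# Minimal sketch: render the Qwen3 chat template, generate, decode the reply.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("unsloth/Qwen3-8B")
model = AutoModelForCausalLM.from_pretrained("unsloth/Qwen3-8B", device_map="auto")

messages = [{"role": "user", "content": "What is 2+2?"}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,  # flip to False for direct answers
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    do_sample=True,        # sampling, NOT greedy decoding
    temperature=0.6, top_p=0.95, top_k=20,
    max_new_tokens=1024,
)
# Decode only the newly generated tokens; contains <think>...</think> when
# thinking is enabled.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```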
🦙 Ollama: Run Qwen3 Tutorial
- Install Ollama if you haven't already! Note that Ollama can only run models up to 32B in size; to run the full Qwen3-235B-A22B model, use llama.cpp (see the tutorial below).
apt-get update
apt-get install pciutils -y
curl -fsSL https://ollama.com/install.sh | sh
- Run the model! Note you can call ollama serve in another terminal if it fails. We include all our fixes and suggested parameters (temperature etc.) in the params file in our Hugging Face upload!
ollama run hf.co/unsloth/Qwen3-8B-GGUF:UD-Q4_K_XL
- To disable thinking, use the following (or set it in the system prompt):
>>> Write your prompt here /nothink
If you're experiencing any looping, Ollama might have set your context window to 2,048 tokens or so. If so, bump it up to 32,000 and see if the issue persists.
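If you'd rather script this than use the interactive CLI, a minimal sketch against Ollama's local REST API (assuming the default port 11434) looks like this:

```python
# Minimal sketch: query the local Ollama server, with thinking disabled via
# the /no_think soft switch and the context window bumped to avoid looping.
import requests

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "hf.co/unsloth/Qwen3-8B-GGUF:UD-Q4_K_XL",
        "messages": [{"role": "user", "content": "What is 2+2? /no_think"}],
        "stream": False,
        "options": {"num_ctx": 32000},  # avoid the 2,048-token default
    },
)
print(resp.json()["message"]["content"])
```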
📖 Llama.cpp: Run Qwen3 Tutorial
- Obtain the latest llama.cpp from its GitHub repository. You can follow the build instructions below. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.
apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
-DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp
- Download the model via the snippet below (after installing pip install huggingface_hub hf_transfer). You can choose Q4_K_M, or other quantized versions.
# !pip install huggingface_hub hf_transfer
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
from huggingface_hub import snapshot_download
snapshot_download(
repo_id = "unsloth/Qwen3-14B-GGUF",
local_dir = "unsloth/Qwen3-14B-GGUF",
allow_patterns = ["*UD-Q4_K_XL*"],
)
- Run the model and try any prompt; see the launch sketch below. To disable thinking, use the following (or set it in the system prompt):
>>> Write your prompt here /nothink
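Assuming the download above produced Qwen3-14B-UD-Q4_K_XL.gguf (the exact filename may differ), a minimal sketch of launching llama-cli from Python is:

```python
# Minimal sketch: launch llama-cli on the downloaded GGUF.
# The filename and flag values are assumptions mirroring the 235B example below.
import subprocess

subprocess.run([
    "./llama.cpp/llama-cli",
    "--model", "unsloth/Qwen3-14B-GGUF/Qwen3-14B-UD-Q4_K_XL.gguf",
    "--threads", "32",        # number of CPU threads
    "--ctx-size", "16384",    # context length
    "--n-gpu-layers", "99",   # layers offloaded to GPU; remove for CPU-only
    "--temp", "0.6",          # thinking-mode sampling settings
    "--min-p", "0.0",
    "--top-p", "0.95",
    "--top-k", "20",
])
```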
For Qwen3-235B-A22B, we will specifically use llama.cpp for optimized inference and a plethora of options.
- We follow similar steps to the above, but this time we also need a few extra steps because the model is so big.
- Download the model via the snippet below (after installing pip install huggingface_hub hf_transfer). You can choose UD-Q2_K_XL, or other quantized versions.
# !pip install huggingface_hub hf_transfer
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
from huggingface_hub import snapshot_download
snapshot_download(
repo_id = "unsloth/Qwen3-235B-A22B-GGUF",
local_dir = "unsloth/Qwen3-235B-A22B-GGUF",
allow_patterns = ["*UD-Q2_K_XL*"],
)
- Run the model and try any prompt.
- Edit --threads 32 for the number of CPU threads, --ctx-size 16384 for the context length, and --n-gpu-layers 99 for how many layers to offload to the GPU. Try lowering it if your GPU goes out of memory, and remove it entirely for CPU-only inference.
Use -ot ".ffn_.*_exps.=CPU" to offload all MoE layers to the CPU! This effectively lets you fit all non-MoE layers on a single GPU, improving generation speed. You can customize the regex to keep more layers on the GPU if you have more GPU capacity.
./llama.cpp/llama-cli \
--model unsloth/Qwen3-235B-A22B-GGUF/Qwen3-235B-A22B-UD-Q2_K_XL.gguf \
--threads 32 \
--ctx-size 16384 \
--n-gpu-layers 99 \
-ot ".ffn_.*_exps.=CPU" \
--seed 3407 \
--prio 3 \
--temp 0.6 \
--min-p 0.0 \
--top-p 0.95 \
--top-k 20 \
-no-cnv \
--prompt "<|im_start|>user\nCreate a Flappy Bird game in Python. You must include these things:\n1. You must use pygame.\n2. The background color should be randomly chosen and is a light shade. Start with a light blue color.\n3. Pressing SPACE multiple times will accelerate the bird.\n4. The bird's shape should be randomly chosen as a square, circle or triangle. The color should be randomly chosen as a dark color.\n5. Place on the bottom some land colored as dark brown or yellow chosen randomly.\n6. Make a score shown on the top right side. Increment if you pass pipes and don't hit them.\n7. Make randomly spaced pipes with enough space. Color them randomly as dark green or light brown or a dark gray shade.\n8. When you lose, show the best score. Make the text inside the screen. Pressing q or Esc will quit the game. Restarting is pressing SPACE again.\nThe final game should be inside a markdown section in Python. Check your code for errors and fix them before the final markdown section.<|im_end|>\n<|im_start|>assistant\n"
🦥 Fine-tuning Qwen3 with Unsloth
Unsloth makes Qwen3 fine-tuning 2x faster, with 70% less VRAM use, and supports 8x longer context lengths. Qwen3 (14B) fits comfortably in a Google Colab 16GB VRAM Tesla T4 GPU.
Because Qwen3 supports both reasoning and non-reasoning, you can fine-tune it with a non-reasoning dataset, but this may affect its reasoning ability. If you want to maintain its reasoning capabilities (optional), use a mix of direct answers and chain-of-thought examples: roughly 75% reasoning and 25% non-reasoning.
Our Conversational notebook uses a combination of 75% NVIDIA's Open Math Reasoning dataset and 25% Maxime's FineTome dataset (non-reasoning); a sketch of building such a mix follows. Check out our free Unsloth Colab notebooks to fine-tune Qwen3.
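Here is a rough sketch of building a 75/25 mix with the datasets library. The dataset ids, split, and column names are assumptions based on the notebook; the real notebook also applies the model's chat template:

```python
# Minimal sketch: mix ~75% reasoning with ~25% non-reasoning data.
from datasets import load_dataset, concatenate_datasets

# Assumed dataset ids/splits; swap in your own reasoning and chat datasets.
reasoning = load_dataset("unsloth/OpenMathReasoning-mini", split="cot")
chat = load_dataset("mlabonne/FineTome-100k", split="train")

# concatenate_datasets needs matching schemas, so flatten both to one
# "text" column first (column names here are assumptions).
reasoning = reasoning.map(
    lambda x: {"text": x["problem"] + "\n" + x["generated_solution"]},
    remove_columns=reasoning.column_names,
)
chat = chat.map(
    lambda x: {"text": "\n".join(turn["value"] for turn in x["conversations"])},
    remove_columns=chat.column_names,
)

# Subsample the non-reasoning set so the final mix is ~75% reasoning.
n_chat = int(len(reasoning) * 25 / 75)
mixed = concatenate_datasets(
    [reasoning, chat.shuffle(seed=3407).select(range(n_chat))]
).shuffle(seed=3407)
print(len(reasoning), n_chat, len(mixed))
```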
If you have an old version of Unsloth and/or are fine-tuning locally, install the latest version of Unsloth:
pip install --upgrade --force-reinstall --no-cache-dir unsloth unsloth_zoo
Fine-tuning Qwen3 MoE models
Fine-tuning support includes MoE models: 30B-A3B and 235B-A22B. Qwen3-30B-A3B works on just 17.5GB VRAM with Unsloth. When fine-tuning MoE models, it's probably not a good idea to fine-tune the router layer, so we disable it by default.
The 30B-A3B fits in 17.5GB VRAM, but you may lack RAM or disk space, since the full 16-bit model must be downloaded and converted to 4-bit on the fly for QLoRA fine-tuning. This is due to issues importing 4-bit BnB MoE models directly, and it only affects MoE models.
If you're fine-tuning MoE models, please use FastModel and not FastLanguageModel:
from unsloth import FastModel
import torch
model, tokenizer = FastModel.from_pretrained(
model_name = "unsloth/Qwen3-30B-A3B",
max_seq_length = 2048, # Choose any for long context!
load_in_4bit = True, # 4 bit quantization to reduce memory
load_in_8bit = False, # [NEW!] A bit more accurate, uses 2x memory
full_finetuning = False, # [NEW!] We have full finetuning now!
# token = "hf_...", # use one if using gated models
)
To use the notebooks, just click Runtime, then Run all. You can change settings in the notebook to whatever you desire; we have set them automatically by default. Change the model name to whatever you like by matching it with the model's name on Hugging Face, e.g. 'unsloth/Qwen3-8B' or 'unsloth/Qwen3-0.6B-unsloth-bnb-4bit'.
There are other settings which you can toggle:
- **max_seq_length = 2048** – Controls context length. While Qwen3 supports 40,960, we recommend 2,048 for testing. Unsloth enables 8x longer context fine-tuning.
- **load_in_4bit = True** – Enables 4-bit quantization, reducing memory use 4x for fine-tuning on 16GB GPUs.
- For full fine-tuning, set full_finetuning = True; for 8-bit fine-tuning, set load_in_8bit = True.
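After loading the model, you would typically attach LoRA adapters before training. Here is a minimal sketch with assumed hyperparameters (not the notebook's exact values); note the MoE router layer is left untrained by default:

```python
# Minimal LoRA setup sketch on top of the FastModel.from_pretrained call above.
model = FastModel.get_peft_model(
    model,
    r = 16,               # LoRA rank: higher = more capacity, more VRAM
    lora_alpha = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_dropout = 0,
    use_gradient_checkpointing = "unsloth",  # reduces VRAM for long context
    random_state = 3407,
)
```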
We made a new advanced GRPO notebook for fine-tuning Qwen3. Learn to use our new proximity-based reward function (closer answers = rewarded) and Hugging Face's Open-R1 math dataset. Unsloth now also has better evaluations and uses the latest version of vLLM.
Learn about:
- Enabling reasoning in Qwen3 (Base) and guiding it to do a specific task
- Pre-finetuning to bypass GRPO's tendency to learn formatting
- Improved evaluation accuracy via new regex matching
- Custom GRPO templates beyond just <think>, e.g. <start_working_out> ... <end_working_out>
- Proximity-based scoring: closer answers earn more points (e.g., predicting 9 when the answer is 10) and outliers are penalized (see the sketch below)
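To make the idea concrete, here is a toy sketch of a proximity-based reward. This is an illustration, not Unsloth's exact implementation:

```python
# Toy sketch: closer numeric guesses earn partial credit, outliers are penalized.
def proximity_reward(predicted: float, answer: float) -> float:
    if predicted == answer:
        return 3.0                      # exact match: full reward
    error = abs(predicted - answer)
    if error <= abs(answer) * 0.1:      # within 10%: partial credit
        return 1.0
    if error > abs(answer) * 2:         # wild outlier: penalize
        return -1.0
    return 0.0

print(proximity_reward(9, 10))    # close answer still earns points
print(proximity_reward(100, 10))  # outlier is penalized
```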
If you'd like to read a full end-to-end guide on how to use Unsloth notebooks for fine-tuning, or just to learn more about fine-tuning, view our documentation.