Fine-Tuning Large Language Models (LLMs) Using QLoRA

Last Updated : 23 Jul, 2025

Fine-tuning adapts large language models (LLMs) to specific tasks, improving their accuracy and making them more efficient for those tasks. However, full fine-tuning of an LLM is computationally expensive and memory-intensive. QLoRA (Quantized Low-Rank Adaptation) is a technique that significantly reduces this cost while maintaining model quality.

What is QLoRA?

QLoRA is an advanced fine-tuning method that quantizes an LLM to reduce memory usage and applies Low-Rank Adaptation (LoRA) so that only a small set of added parameters is trained while the quantized base model stays frozen.

Figure: QLoRA technique

Before going into QLoRA, it is important to understand Parameter-Efficient Fine-Tuning (PEFT) techniques, which aim to fine-tune large models efficiently by reducing the number of trainable parameters. **LoRA** (Low-Rank Adaptation) and **QLoRA** are two prominent PEFT methods that significantly lower memory usage while retaining fine-tuning effectiveness.

Key Components of QLoRA

  1. **4-bit Quantization (NF4):** QLoRA stores the frozen base-model weights in the 4-bit NormalFloat (NF4) data type, which is designed for weights that are approximately normally distributed, as is typical in deep neural networks. Compared with uniform 4-bit quantization, NF4 spaces its 16 levels according to that distribution, which helps preserve precision.
  2. **LoRA Adapters:** Instead of modifying the full model, LoRA adds small low-rank matrices to specific layers, allowing efficient adaptation with far fewer trainable parameters. In this tutorial the adapters are attached to the query and value projections of the transformer's attention layers, a common choice because these projections play a central role in the attention mechanism.
  3. **Memory-Efficient Training:** By combining quantization with LoRA, QLoRA significantly reduces VRAM usage, making fine-tuning feasible on consumer-grade GPUs. The base weights take roughly a quarter of their 16-bit size, and gradients and optimizer states are kept only for the small adapter matrices.
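To make the quantization idea concrete, here is a minimal pure-Python sketch of blockwise 4-bit quantization with absmax scaling. It is a simplification and an assumption on our part: it uses 16 evenly spaced levels rather than the actual NF4 codebook, and the function names and block size are hypothetical, chosen only for illustration.

```python
# Simplified blockwise 4-bit quantization (illustrative; NOT the real NF4 codebook).

def quantize_block(values, levels=16):
    """Map the floats of one block to integer codes 0..levels-1 using absmax scaling."""
    absmax = max(abs(v) for v in values) or 1.0   # fall back to 1.0 for an all-zero block
    codes = []
    for v in values:
        normalized = v / absmax                    # now in [-1, 1]
        codes.append(round((normalized + 1) / 2 * (levels - 1)))
    return codes, absmax


def dequantize_block(codes, absmax, levels=16):
    """Recover approximate floats from the codes and the block's scale."""
    return [((c / (levels - 1)) * 2 - 1) * absmax for c in codes]


def quantize(weights, block_size=4):
    """Quantize a flat list of weights block by block, one scale per block."""
    return [
        quantize_block(weights[start:start + block_size])
        for start in range(0, len(weights), block_size)
    ]


weights = [0.12, -0.40, 0.05, 0.33, 2.0, -1.5, 0.7, 0.0]
blocks = quantize(weights)
restored = [v for codes, absmax in blocks for v in dequantize_block(codes, absmax)]
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(blocks)
print(f"largest reconstruction error: {max_error:.4f}")
```

Each block keeps its own scale, so a few large weights in one block do not degrade the precision of small weights in another. Real implementations such as bitsandbytes pack two 4-bit codes per byte and use a non-uniform codebook.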

Fine-Tuning LLMs using QLoRA in Python

1. Install Required Libraries

We will install the following libraries: **torch**, **transformers**, **peft**, **datasets**, **accelerate** and **bitsandbytes**.

```python
!pip install torch transformers peft bitsandbytes accelerate datasets
```

2. Import Necessary Libraries

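We import the classes and functions used in the following steps. **AutoModelForCausalLM** loads a pre-trained causal language model and **AutoTokenizer** loads its matching tokenizer. **TrainingArguments** and **Trainer** configure and run training, **LoraConfig** and **get_peft_model** attach LoRA adapters, and **load_dataset** fetches a dataset from the Hugging Face Hub.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer
from peft import LoraConfig, get_peft_model
from datasets import load_dataset
import bitsandbytes as bnb
```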

3. Load a Pretrained Quantized Model

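Let's load Llama 2 7B Chat, a 7-billion-parameter model, with 4-bit quantization to save memory. **BitsAndBytesConfig** selects the NF4 data type with 16-bit compute, and the **device_map="auto"** argument automatically places the model on the available GPU. The meta-llama checkpoints are gated on the Hugging Face Hub, so access must be granted before the download works.

```python
from transformers import BitsAndBytesConfig

model_name = "meta-llama/Llama-2-7b-chat-hf"

# 4-bit NF4 quantization settings for the frozen base model
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # Enables 4-bit quantization
    bnb_4bit_quant_type="nf4",             # Use NormalFloat4 instead of the default FP4
    bnb_4bit_compute_dtype=torch.float16,  # Dequantize to fp16 for matrix multiplications
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
```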
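A quick back-of-the-envelope calculation shows why 4-bit loading matters for a model of this size. The sketch below assumes a round figure of 7 billion parameters and counts only weight storage; it ignores quantization constants, activations, adapter weights and optimizer state, so real usage will be higher. The helper name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


num_params = 7_000_000_000  # assumed round parameter count for a 7B model

fp16_gb = weight_memory_gb(num_params, 16)
int4_gb = weight_memory_gb(num_params, 4)

print(f"16-bit weights: {fp16_gb:.1f} GB")  # 14.0 GB
print(f"4-bit weights:  {int4_gb:.1f} GB")  # 3.5 GB
```

Under these assumptions the quantized weights fit comfortably on a GPU with 8 GB or more of VRAM, whereas the 16-bit weights alone would not.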

4. Define LoRA Configuration

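We configure LoRA for the model and print its trainable parameters. **LoraConfig()** sets up the LoRA configuration, where **r** is the rank of the low-rank matrices, **lora_alpha** is the scaling factor applied to the adapter output, **lora_dropout** is the dropout probability used inside the adapter and **target_modules** lists the layers that receive adapters. **get_peft_model()** then wraps the model with the adapters.

```python
from peft import prepare_model_for_kbit_training

# Freeze the base weights and ready the quantized model for training
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=8,  # Low-rank dimension
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # Fine-tuning specific layers
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```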
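To illustrate what an adapter computes, the following pure-Python sketch applies the LoRA update to a toy 3x3 layer: the output is the frozen projection `W x` plus `(alpha / r) * B (A x)`. The matrices, the input and the helper names are made up for illustration, and the sketch omits dropout; it is not how peft implements the layer internally.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, r):
    """Compute W x + (alpha / r) * B (A x); W is frozen, only A and B are trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))   # project down to rank r, then back up
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


# Toy layer with hidden size 3 and rank 1 (hypothetical values)
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # frozen 3x3 weight
A = [[0.5, -0.5, 1.0]]                                   # r x d = 1x3
B_zero = [[0.0], [0.0], [0.0]]                           # d x r, zero at initialization
B_trained = [[0.1], [0.0], [-0.2]]                       # d x r after some training
x = [1.0, 2.0, 3.0]

out_initial = lora_forward(W, A, B_zero, x, alpha=2, r=1)
out_trained = lora_forward(W, A, B_trained, x, alpha=2, r=1)

print(out_initial)
print(out_trained)
```

Because `B` starts at zero, the adapted layer initially behaves exactly like the frozen base layer, and training only moves the small `A` and `B` matrices.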

5. Load and Prepare Dataset

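In this step, we load the first 10,000 training examples of the **imdb** dataset and define **tokenize_function** to preprocess the text. The **dataset.map()** function applies tokenization to all examples. Llama tokenizers have no padding token by default, so we reuse the end-of-sequence token, and we set an explicit **max_length** so that every example is padded or truncated to the same size.

```python
dataset = load_dataset("imdb", split="train[:10000]")  # Sentiment analysis dataset

# Llama tokenizers ship without a pad token, so reuse the end-of-sequence token
tokenizer.pad_token = tokenizer.eos_token

def tokenize_function(examples):
    # Pad or truncate every review to a fixed length of 512 tokens
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=512)

# Drop the raw "text" and "label" columns so only token fields reach the model
tokenized_dataset = dataset.map(tokenize_function, batched=True, remove_columns=dataset.column_names)
```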
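As a rough illustration of what padding to a maximum length with truncation does to each example, here is a toy pure-Python version. The function `pad_or_truncate`, the token IDs and the pad ID of 0 are hypothetical; a real tokenizer also handles special tokens and may pad on the left rather than the right.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = token_ids[:max_length]                 # truncation: cut off extra tokens
    mask = [1] * len(ids)                        # 1 marks a real token
    padding = max_length - len(ids)              # padding: fill up to max_length
    return ids + [pad_id] * padding, mask + [0] * padding


short_ids, short_mask = pad_or_truncate([5, 6, 7], max_length=5)
long_ids, long_mask = pad_or_truncate([1, 2, 3, 4, 5, 6], max_length=4)

print(short_ids, short_mask)  # [5, 6, 7, 0, 0] [1, 1, 1, 0, 0]
print(long_ids, long_mask)    # [1, 2, 3, 4] [1, 1, 1, 1]
```

The attention mask lets the model ignore padded positions, which is why fixed-length batches do not change what the model sees in the real tokens.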

6. Set Training Arguments

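We set the following arguments: **output_dir** is where checkpoints are written, **per_device_train_batch_size** is the batch size per GPU, **save_strategy** saves a checkpoint after every epoch, **logging_steps** logs the loss every 10 steps, **num_train_epochs** is the number of passes over the data and **fp16** enables mixed precision training. No evaluation strategy is set because no evaluation dataset is passed to the Trainer in this tutorial.

```python
training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=4,
    save_strategy="epoch",
    logging_steps=10,
    num_train_epochs=3,
    fp16=True,  # Enable mixed precision training
    push_to_hub=False,
)
```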
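These settings determine how long training runs. The sketch below estimates the number of optimizer steps for the 10,000-example subset used in this tutorial, assuming a single GPU and no gradient accumulation; the helper name `training_steps` is hypothetical.

```python
import math


def training_steps(num_examples, batch_size, epochs):
    """Estimate steps per epoch and total steps (single device, no gradient accumulation)."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch, steps_per_epoch * epochs


steps_per_epoch, total_steps = training_steps(num_examples=10_000, batch_size=4, epochs=3)
log_events = total_steps // 10  # logging_steps=10

print(steps_per_epoch, total_steps, log_events)  # 2500 7500 750
```

With more GPUs or gradient accumulation the effective batch size grows and the step count shrinks accordingly.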

7. Fine-Tune the Model

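We will use the **Trainer** class from the Hugging Face Transformers library to run the training loop. A language-modeling data collator supplies the labels that the causal language modeling loss needs:

```python
from transformers import DataCollatorForLanguageModeling

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    # Copies input_ids into labels so the causal LM loss can be computed
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)

trainer.train()
```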

**Output:**

Trainable parameters: 0.02M (0.3% of full model parameters)
Training...
Epoch 1: Loss 1.23
Epoch 2: Loss 0.89
Epoch 3: Loss 0.75
Training complete.

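The output shows that only a small fraction of the model's parameters (0.3% in the log above) was trained, which illustrates QLoRA's efficiency. The exact figure reported by **print_trainable_parameters()** depends on the base model and the LoRA rank.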
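As a rough cross-check on the trainable fraction, the adapter size can be estimated by hand. The sketch below assumes Llama-2-7B dimensions (hidden size 4096, 32 decoder layers, square query and value projections) and an approximate base size of 6.74 billion parameters. These figures are assumptions that the tutorial does not state, and the helper name `lora_params` is hypothetical.

```python
def lora_params(d_in, d_out, r):
    """Parameters in one LoRA adapter: A is r x d_in and B is d_out x r."""
    return r * (d_in + d_out)


hidden_size = 4096            # assumed hidden size of Llama-2-7B
num_layers = 32               # assumed number of decoder layers
modules_per_layer = 2         # q_proj and v_proj
rank = 8                      # r from the LoRA configuration
base_params = 6_740_000_000   # approximate base model size (assumption)

per_adapter = lora_params(hidden_size, hidden_size, rank)
trainable = per_adapter * modules_per_layer * num_layers
percent = trainable / base_params * 100

print(f"trainable parameters: {trainable:,}")
print(f"share of base model:  {percent:.4f}%")
```

Under these assumptions the adapters hold about 4.2 million parameters, well under 0.1% of the model, so a real log for this configuration may show different numbers than the sample output.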

Advantages of Using QLoRA

Limitations and Trade-offs of QLoRA

By using 4-bit quantization and LoRA adapters, QLoRA helps researchers and developers to fine-tune massive models on consumer-grade GPUs efficiently. This technique makes it easier to adapt LLMs for specific tasks without requiring expensive hardware.