What is Quantization

Last Updated : 6 Nov, 2025

Quantization is a model optimization technique that reduces the precision of numerical values such as weights and activations in models to make them faster and more efficient. It helps lower memory usage, model size, and computational cost while maintaining almost the same level of accuracy.


Need of Quantization

Quantization plays a crucial role in Large Language Models, which contain billions of parameters. These parameters are typically stored as 32-bit (FP32) or 16-bit (FP16) floating-point values, which demand significant memory and compute. Quantization converts them into lower-precision formats such as 8-bit (INT8) or 4-bit (INT4). This allows faster inference and lower hardware requirements, making large models more practical for real-world deployment.

How It Works

Quantization is all about making your model lighter and faster without hurting its accuracy too much. It does this by converting high-precision numbers like floats into lower-precision integers (like INT8).

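The process revolves around two key parameters: the scale (S), which sets how much of the floating-point range each integer step covers, and the zero-point (Z), which is the integer that represents the floating-point value 0.

The process works in three steps: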

**1. Range Determination**

The range of values (minimum and maximum) that each weight or activation can take is determined first. In static quantization, this range is calculated beforehand using a calibration dataset. In dynamic quantization, the range is determined on the fly during inference.
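The difference between the two approaches can be sketched in plain Python. This is only an illustration: `static_range`, `dynamic_range` and the sample data are hypothetical names and values, not part of any library.

```python
# Illustrative only: plain-Python range estimation, not a library API.

def static_range(calibration_batches):
    """Static quantization: compute one fixed (min, max) ahead of time
    from a calibration dataset and reuse it for every later input."""
    lo = min(min(batch) for batch in calibration_batches)
    hi = max(max(batch) for batch in calibration_batches)
    return lo, hi


def dynamic_range(activations):
    """Dynamic quantization: compute (min, max) from the values actually
    seen at inference time, so the range can change per input."""
    return min(activations), max(activations)


# Hypothetical calibration data collected before deployment.
calibration_batches = [[-0.8, 0.1, 0.5], [-0.2, 0.9, 0.3]]
fixed = static_range(calibration_batches)

# A new input that arrives at inference time.
live_activations = [-0.4, 0.2, 1.3]
per_input = dynamic_range(live_activations)

print("static range :", fixed)
print("dynamic range:", per_input)
```

In this toy data the live input contains 1.3, which lies outside the static range, so a static scheme would have to clip it. The dynamic scheme adapts to it, at the cost of computing the range during inference.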

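**2. Scaling and Zero-Point Calculation**

Once the range is known, the next step is to map floating-point values into an integer space (for example, from [-1, 1] to [0, 255]). The scale and zero-point are computed as:

$$S = \frac{x_{\text{max}} - x_{\text{min}}}{q_{\text{max}} - q_{\text{min}}}$$

$$Z = q_{\text{min}} - \frac{x_{\text{min}}}{S}$$

where $x_{\text{min}}$ and $x_{\text{max}}$ are the smallest and largest floating-point values in the range, $q_{\text{min}}$ and $q_{\text{max}}$ are the bounds of the integer range (0 and 255 for unsigned 8-bit), $S$ is the scale (the floating-point width of one integer step) and $Z$ is the zero-point (the integer that represents the floating-point value 0). In practice $Z$ is rounded to an integer.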
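As a worked example, the two formulas can be applied to the mapping from [-1, 1] to [0, 255] mentioned above. The helper name `scale_and_zero_point` is hypothetical, and rounding the zero-point to an integer is an assumption of this sketch.

```python
def scale_and_zero_point(x_min, x_max, q_min, q_max):
    """Return (S, Z) for mapping the float range [x_min, x_max]
    onto the integer range [q_min, q_max]."""
    if x_max <= x_min:
        raise ValueError("x_max must be greater than x_min")
    scale = (x_max - x_min) / (q_max - q_min)
    # Assumption: Z is rounded so it can be stored as an integer.
    zero_point = round(q_min - x_min / scale)
    return scale, zero_point


scale, zero_point = scale_and_zero_point(-1.0, 1.0, 0, 255)
print(f"S = {scale:.6f}")
print(f"Z = {zero_point}")
```

Here S equals 2/255, so each integer step covers a little under 0.008 of the float range. The unrounded zero-point is 127.5, which is why the stored integer lands next to the middle of the 0 to 255 range.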

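**3. Quantization and Dequantization**

**Quantization:** Converts floats to integers using

$$x_q = \text{round}\left(\frac{x}{S} + Z\right)$$

**Dequantization:** Converts them back to approximate floating-point values using

$$\hat{x} = S \times (x_q - Z)$$

where $x$ is the original floating-point value, $x_q$ is its quantized integer and $\hat{x}$ is the dequantized value. For values inside the range, $\hat{x}$ differs from $x$ by at most half a step ($S/2$) because of rounding.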
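The round trip through these two formulas can be sketched as follows. The function names are hypothetical, the scale and zero-point are example values for the [-1, 1] to [0, 255] mapping, and the clamp to the integer range is an addition of this sketch that the formulas do not show.

```python
def quantize(x, scale, zero_point, q_min=0, q_max=255):
    """x_q = round(x / S + Z), clamped to the integer range."""
    q = round(x / scale + zero_point)
    # Assumption: clamp so values at the edge of the range stay representable.
    return max(q_min, min(q_max, q))


def dequantize(q, scale, zero_point):
    """Recover an approximate float: S * (x_q - Z)."""
    return scale * (q - zero_point)


# Example parameters for mapping [-1, 1] onto [0, 255].
scale = 2.0 / 255
zero_point = 128

values = [-1.0, -0.5, 0.0, 0.25, 1.0]
quantized = [quantize(v, scale, zero_point) for v in values]
restored = [dequantize(q, scale, zero_point) for q in quantized]

for v, q, r in zip(values, quantized, restored):
    print(f"{v:+.3f} -> {q:3d} -> {r:+.5f} (error {abs(v - r):.5f})")
```

Each restored value sits within about half a step of the original, which is the rounding error that quantization trades for the smaller integer representation.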

Types of Quantization

Types of Quantization refer to the different techniques used to reduce model size and computation needs while maintaining accuracy. Each type balances precision, speed and memory efficiency differently, depending on the target hardware and task requirements.

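1. Post-Training Quantization (PTQ)

PTQ quantizes a model after it has been trained, without any retraining. Value ranges come from a calibration dataset (static quantization) or are computed during inference (dynamic quantization).

2. Quantization-Aware Training (QAT)

QAT simulates the effect of quantization during training so that the model learns weights that tolerate low precision. It usually preserves accuracy better than PTQ but requires more computation.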
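One common way to simulate quantization during training is "fake quantization": a value is quantized and immediately dequantized, so the rest of the computation sees the rounding error while still working with floats. The sketch below shows only that forward step in plain Python. The function name, the weights and the scale are hypothetical, and it is not the API of any training framework.

```python
def fake_quantize(x, scale, zero_point=0, q_min=-128, q_max=127):
    """Quantize then dequantize, returning a float that carries the
    rounding error a real INT8 value would have."""
    q = round(x / scale + zero_point)
    q = max(q_min, min(q_max, q))      # clamp to the signed 8-bit range
    return scale * (q - zero_point)


# Hypothetical weights and a scale that covers roughly [-1, 1].
weights = [0.731, -0.262, 0.048, -0.915]
scale = 2.0 / 255

simulated = [fake_quantize(w, scale) for w in weights]

for w, s in zip(weights, simulated):
    print(f"{w:+.4f} -> {s:+.4f}")
```

In an actual training loop the rounding step is typically paired with a straight-through estimator so that gradients can flow through it. That part is omitted here.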

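Quantization Techniques

Let's look at the various quantization techniques that we can use:

**1. QLoRA (Quantized Low-Rank Adaptation):** Fine-tunes small low-rank adapter matrices on top of a frozen base model stored in 4-bit precision.

**2. GPTQ (Generative Pre-trained Transformer Quantization):** A post-training method that quantizes weights layer by layer while minimizing the resulting error in each layer's output.

**3. Uniform Quantization:** Maps values to equally spaced levels, as in the scale and zero-point scheme described above.

**4. Non-Uniform Quantization:** Uses unevenly spaced levels, placing more of them where values are concentrated.

**5. Min-Max Quantization:** Sets the quantization range from the observed minimum and maximum values.

**6. Logarithmic Quantization:** Spaces levels on a logarithmic scale (for example, powers of two), which gives finer resolution to small values.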
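To make the contrast between uniform and logarithmic quantization concrete, here is a small sketch in plain Python. Both function names and the step size of 0.25 are hypothetical choices for illustration. The logarithmic variant shown snaps each value to the nearest power of two, which is only one of several possible log-scale schemes.

```python
import math


def uniform_quantize(x, step):
    """Snap x to the nearest multiple of a fixed step size."""
    return round(x / step) * step


def log2_quantize(x):
    """Snap |x| to the nearest power of two (nearest in the log domain),
    keeping the sign. Zero is passed through unchanged."""
    if x == 0:
        return 0.0
    exponent = round(math.log2(abs(x)))
    return math.copysign(2.0 ** exponent, x)


values = [0.03, 0.12, 0.4, 0.9]
for v in values:
    print(f"{v:.2f} -> uniform {uniform_quantize(v, 0.25):.5f}, "
          f"log {log2_quantize(v):.5f}")
```

With this coarse uniform step, the two smallest values both collapse to zero, while the logarithmic scheme keeps them distinct. The price is coarser resolution for large values.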

Quantization in ML, DL and LLMs

Quantization plays a slightly different role across Machine Learning (ML), Deep Learning (DL) and Large Language Models (LLMs). The core idea remains the same, i.e. reducing precision to save resources, but the impact varies depending on the model type.

| Aspect | Quantization in ML | Quantization in DL | Quantization in LLMs |
|---|---|---|---|
| **Purpose** | Simplify models for faster inference and deployment | Reduce model size and computation during training and inference | Make large-scale models efficient for inference and deployment |
| **Data Type Precision** | Usually converts float to integer (e.g. FP32 → INT16) | Converts weights and activations to lower precision (e.g. FP32 → INT8) | Converts billions of parameters to 8-bit or 4-bit precision (e.g. FP16 → INT4) |
| **Impact on Accuracy** | Minor or negligible | Slight accuracy drop if not calibrated well | Accuracy may drop slightly but can be managed with fine-tuning |
| **Use Case** | Small or traditional ML models like regression or decision trees | Neural networks and CNNs for vision or speech tasks | Transformer-based models like GPT, LLaMA or BERT |
| **Goal** | Reduce latency and improve inference speed | Optimize GPU/TPU usage and training efficiency | Enable deployment on limited hardware like consumer GPUs or edge devices |

Step-By-Step Implementation

Here we load a lightweight language model in 4-bit quantized mode, using the NF4 format introduced with QLoRA, to reduce memory usage. We then ask it questions and generate natural-language answers directly from the quantized model.

Step 1: Importing Required Libraries

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
```

Step 2: Selecting an LLM

We use TinyLlama, a 1.1B-parameter model.

```python
model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
```

Step 3: Defining the Quantization Configuration

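```python
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # store weights in 4-bit precision
    bnb_4bit_use_double_quant=True,        # also quantize the quantization constants
    bnb_4bit_quant_type="nf4",             # 4-bit NormalFloat data type
    bnb_4bit_compute_dtype=torch.float16,  # run computations in FP16
)
```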

Step 4: Loading the Model and Tokenizer

```python
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
```

**Output:** [Screenshot: Load Quantized Model]

Step 5: Creating a Function to Ask Questions

```python
def ask_question(question, max_new_tokens=128):
    # Build a chat-style prompt with system, user and assistant markers.
    prompt = f"<|system|>\nYou are a helpful assistant.<|user|>\n{question}<|assistant|>\n"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,  # sample instead of greedy decoding
        top_p=0.9,
        temperature=0.7,
        repetition_penalty=1.1,
        pad_token_id=tokenizer.eos_token_id,
    )
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    # Keep only the text generated after the assistant marker.
    if "<|assistant|>" in response:
        response = response.split("<|assistant|>")[-1].strip()
    return response
```

Step 6: Testing the Quantized Model

```python
questions = [
    "What are the advantages of using 4-bit quantization in large language models?"
]

for q in questions:
    print("\nQuestion:", q)
    print("Answer:", ask_question(q))
```



Applications of Quantization

  1. **Embedded Systems:** Quantization is widely used in embedded systems for real-time inference tasks such as anomaly detection in industrial IoT systems, facial recognition in surveillance systems or voice processing in smart speakers.
  2. **Healthcare Devices:** AI models in portable medical devices, such as wearable health monitors or diagnostic tools, use quantization to ensure fast and efficient operation.
  3. **Model Optimization for Inference:** Quantization speeds up inference by reducing computational load. This makes models like GPT and BERT more responsive during real-time interactions.
  4. **Edge and Mobile Deployment:** By lowering precision to INT8 or INT4, large models can run efficiently on edge devices, smartphones and IoT hardware with limited resources.
  5. **Latency Reduction:** Quantization speeds up matrix multiplications and attention mechanisms. This enhances user experience in chatbots and virtual assistants.
  6. **Cost-Effective Scaling:** Companies use quantized LLMs to lower cloud inference costs while keeping accuracy close to that of full-precision models.

Benefits of Quantization

  1. **Memory Efficiency:** Quantized models have a much smaller memory footprint. For example, an INT8 model can be 4 times smaller than an FP32 model, which is significant for deploying models on devices with limited memory.
  2. **Inference Speed:** Quantized models run faster on hardware with specialized support for low-precision operations, such as NVIDIA GPUs (through TensorRT) or Google TPUs, resulting in reduced inference time and improved user experience on mobile applications.
  3. **Power Efficiency:** Quantization significantly reduces power consumption during inference, which is vital for edge devices like smart cameras, drones or wearables.
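The memory saving follows directly from the number of bits stored per parameter. A rough back-of-the-envelope sketch, using the 1.1B parameter count of TinyLlama as the example, is given below. The helper name is hypothetical, and the estimate ignores scales, zero-points and other metadata that a real quantized checkpoint also stores.

```python
BITS_PER_PARAM = {"FP32": 32, "FP16": 16, "INT8": 8, "INT4": 4}


def model_size_gb(num_params, bits):
    """Approximate weight storage in gigabytes (10**9 bytes),
    ignoring quantization metadata."""
    return num_params * bits / 8 / 1e9


num_params = 1_100_000_000  # 1.1B parameters

sizes = {name: model_size_gb(num_params, bits)
         for name, bits in BITS_PER_PARAM.items()}

for name, gb in sizes.items():
    print(f"{name}: {gb:.2f} GB")
```

The ratio between the FP32 and INT8 figures is the factor of 4 mentioned in the list above. INT4 halves the INT8 figure again.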

Challenges

  1. **Accuracy Drop:** Lowering precision from FP32 to INT8 may cause quantization errors, reducing model accuracy in complex reasoning tasks.
  2. **Architecture Sensitivity:** Transformer-based LLMs often show instability when quantized aggressively due to sensitive attention layers.
  3. **Calibration Difficulty:** Determining correct activation and weight ranges is challenging and can lead to distorted outputs if done poorly.
  4. **Training Overhead:** Quantization-Aware Training (QAT) improves accuracy but demands heavy computation and large datasets.
  5. **Hardware Constraints:** Some devices lack efficient low-bit arithmetic support, limiting deployment of advanced quantization schemes like INT4.
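The source of the accuracy drop can be illustrated by measuring how the round-trip error grows as the bit width shrinks. The sketch below uses a min-max range with unsigned integer levels. The function name and the evenly spaced sample weights are hypothetical, and weight-level error is only a proxy for the accuracy of an actual model.

```python
def round_trip_error(values, num_bits):
    """Mean absolute error after quantizing to num_bits and dequantizing,
    using a min-max range and unsigned integer levels."""
    x_min, x_max = min(values), max(values)
    q_max = 2 ** num_bits - 1
    scale = (x_max - x_min) / q_max
    zero_point = round(-x_min / scale)
    total = 0.0
    for x in values:
        q = max(0, min(q_max, round(x / scale + zero_point)))
        total += abs(x - scale * (q - zero_point))
    return total / len(values)


# Hypothetical weights: 101 evenly spaced values in [-1, 1].
weights = [-1 + 0.02 * i for i in range(101)]

err8 = round_trip_error(weights, 8)
err4 = round_trip_error(weights, 4)

print(f"mean error at 8 bits: {err8:.5f}")
print(f"mean error at 4 bits: {err4:.5f}")
```

With only 16 levels available at 4 bits, each level has to cover a much wider slice of the range than at 8 bits, so the average error is noticeably larger.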