GitHub - intel/auto-round: Advanced Quantization Algorithm for LLMs and VLMs, with support for CPU, Intel GPU, CUDA and HPU. Seamlessly integrated with Torchao, Transformers, and vLLM. (original) (raw)

What's New

[2025.07] AutoRound now offers experimental support for GGUF format, and recommends using RTN mode (--iters 0) for all bits other than 3 bits. A more advanced algorithm tailored for specific configurations may be available in upcoming release. Example models: Intel/Qwen3-235B-A22B-q2ks-mixed-AutoRound-inc-v0 and Intel/DeepSeek-R1-0528-q2ks-mixed-AutoRound-inc-v0.
[2025.05] AutoRound provides some recipes for DeepSeek-R1-0528, please refer to DeepSeek-R1-0528-int2-mixed-sym-inc, DeepSeek-R1-0528-int4-sym-gptq-incand DeepSeek-R1-0528-int4-asym-awq-inc for more details.
[2025/05] AutoRound has been integrated into vLLM. You can now run models in the AutoRound format directly with vLLM versions later than v0.85.post1.
[2025/04] AutoRound has been integrated into Transformers. You can run models in the AutoRound format directly with Transformers versions later than 4.51.3.
[2025/03] The INT2-mixed DeepSeek-R1 model (~200GB) retains 97.9% accuracy. Check out OPEA/DeepSeek-R1-int2-mixed-sym-inc.

Installation

Install from pypi

CPU/Intel GPU/CUDA

pip install auto-round

HPU

pip install auto-round-lib

Build from Source

CPU/Intel GPU/CUDA

pip install .

HPU

python setup.py install lib

Model Quantization

Command Line Usage (Gaudi/CPU/Intel GPU/CUDA)

A user guide detailing the full list of supported arguments is provided by calling auto-round -h on the terminal. Set the format you want in format and multiple formats exporting has been supported. Please check out step-by-step-instruction for more details about calibration dataset or evaluation.

auto-round
--model Qwen/Qwen3-0.6B
--bits 4
--group_size 128
--format "auto_gptq,auto_awq,auto_round"
--output_dir ./tmp_autoround

We offer another two configurations, auto-round-best and auto-round-light, designed for optimal accuracy and improved speed, respectively. Details are as follows.

Other Recipes

best accuracy, 3X slower, low_gpu_mem_usage could save ~20G but ~30% slower

auto-round-best
--model Qwen/Qwen3-0.6B
--bits 4
--group_size 128
--low_gpu_mem_usage

light accuracy, 2-3X speedup, slight accuracy drop at W4 and larger accuracy drop at W2

auto-round-light
--model Qwen/Qwen3-0.6B
--bits 4
--group_size 128 \

In conclusion, we recommend using auto-round for INT4 and auto-round-best for INT2. However, you may adjust the configuration to suit your specific requirements and available resources.

W4G128 Average Accuracy of 13 tasks and Time Cost Results(Testing was conducted on the Nvidia A100 80G using the version of PyTorch 2.6.0 with enable_torch_compile):

Model	Qwen2.5-0.5B-Instruct	Falcon3-3B	Qwen2.5-7B-Instruct	Meta-Llama-3.1-8B-Instruct	Falcon3-10B	Qwen2.5-72B-Instruct
16bits	0.4192	0.5203	0.6470	0.6212	0.6151	0.7229
Best	0.4137(7m)	0.5142(23m)	0.6426(58m)	0.6116(65m)	0.6092(81m)	0.7242(575m)
Default	0.4129(2m)	0.5133(6m)	0.6441(13m)	0.6106(13m)	0.6080(18m)	0.7252(118m)
Light	0.4052(2m)	0.5108(3m)	0.6453(5m)	0.6104(6m)	0.6063(6m)	0.7243(37m)

W2G64 resultsW2G64 Average Accuracy of 13 tasks and Time Cost Results(Testing was conducted on the Nvidia A100 80G using the version of PyTorch 2.6.0 with enable_torch_compile). We recommend using higher precision for the head, tail, and non-expert modules to alleviate the significant accuracy drop.

Model	Qwen2.5-0.5B-Instruct	Falcon3-3B	Qwen2.5-7B-Instruct	Falcon3-10B	Qwen2.5-72B-Instruct
16bits	0.4192	0.5203	0.6470	0.6151	0.7229
Best	0.2989(6m)	0.4267(24m)	0.5343(56m)	0.5207(79m)	0.6715(564m)
Default	0.2878(2m)	0.4219(6m)	0.5209(13m)	0.5133(18m)	0.6713(122m)
Light	0.2760(2m)	0.4063(3m)	0.4764(5m)	0.4810(7m)	0.6581(38m)

API Usage (HPU/CPU/XPU/CUDA)

from transformers import AutoModelForCausalLM, AutoTokenizer from auto_round import AutoRound

model_name = "Qwen/Qwen3-0.6B" model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto") tokenizer = AutoTokenizer.from_pretrained(model_name)

bits, group_size, sym = 4, 128, True autoround = AutoRound(model, tokenizer, bits=bits, group_size=group_size, sym=sym)

the best accuracy, 4-5X slower, low_gpu_mem_usage could save ~20G but ~30% slower

autoround = AutoRound(model, tokenizer, nsamples=512, iters=1000, low_gpu_mem_usage=True, bits=bits, group_size=group_size, sym=sym)

2-3X speedup, slight accuracy drop at W4G128

autoround = AutoRound(model, tokenizer, nsamples=128, iters=50, lr=5e-3, bits=bits, group_size=group_size, sym=sym )

output_dir = "./tmp_autoround"

format= 'auto_round'(default), 'auto_gptq', 'auto_awq'

autoround.quantize_and_save(output_dir, format='auto_round')

Detailed Hyperparameters

model: The PyTorch model to be quantized.
tokenizer: An optional tokenizer for processing input data. If none, a dataset must be provided.
bits (int): Number of bits for quantization (default is 4).
group_size (int): Size of the quantization group (default is 128).
sym (bool): Whether to use symmetric quantization (default is True).
enable_quanted_input (bool): Whether to use the output of the previous quantized block as the input for the current block for tuning (default is True).
enable_minmax_tuning (bool): Whether to enable weight min-max tuning (default is True).
iters (int): Number of tuning iterations (default is 200).
lr (float): The learning rate for rounding value (default is None, it will be set to 1.0/iters automatically).
minmax_lr (float): The learning rate for min-max tuning (default is None, it will be set to lr automatically).
nsamples (int): Number of samples for tuning (default is 128).
seqlen (int): Data length of the sequence for tuning (default is 2048).
batch_size (int): Batch size for training (default is 8).
scale_dtype (str): The data type of quantization scale to be used (default is "float16"), different kernels have different choices.
amp (bool): Whether to use automatic mixed precision (default is True).
nblocks (int): Packing several blocks as one for tuning together (default is 1).
gradient_accumulate_steps (int): Number of gradient accumulation steps (default is 1).
low_gpu_mem_usage (bool): Whether to save GPU memory at the cost of ~20% more tuning time (default is False).
dataset Union[str, list, tuple, torch.utils.data.DataLoader]: The dataset name for tuning (default is " NeelNanda/pile-10k"). Local json file and combination of datasets have been supported, e.g. " ./tmp.json,NeelNanda/pile-10k:train, mbpp:train+validation+test"
layer_config (dict): Configuration for weight quantization (default is None), mainly for mixed bits or mixed precision.
device: The device to be used for tuning. The default is set to 'auto', allowing for automatic detection.

API Usage for VLMs

If you encounter issues during quantization, try setting iters=0 (to enable RTN) and use group_size=32 for better results.

Click to expand

This feature is experimental and may be subject to changes, including potential bug fixes, API modifications, or adjustments to default hype-parameters

By default, AutoRoundMLLM only quantizes the text module of VLMs and uses NeelNanda/pile-10k for calibration. To quantize the entire model, you can enable quant_nontext_module by setting it to True, though support for this feature is limited. For more information, please refer to the AutoRoundMLLM readme.

from auto_round import AutoRoundMLLM from transformers import Qwen2VLForConditionalGeneration, AutoProcessor, AutoTokenizer

load the model

model_name = "Qwen/Qwen2-VL-2B-Instruct" model = Qwen2VLForConditionalGeneration.from_pretrained( model_name, trust_remote_code=True, torch_dtype="auto") tokenizer = AutoTokenizer.from_pretrained(model_name) processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)

quantize the model

bits, group_size, sym = 4, 128, True autoround = AutoRoundMLLM(model, tokenizer, processor, bits=bits, group_size=group_size, sym=sym) autoround.quantize()

save the quantized model, set format='auto_gptq' or 'auto_awq' to use other formats

output_dir = "./tmp_autoround" autoround.save_quantized(output_dir, format='auto_round', inplace=True)

Export Formats

AutoRound Format: This format is well-suited for CPU, HPU devices, 2 bits, as well as mixed-precision inference. [2,3,4,8] bits are supported.

AutoGPTQ Format: This format is well-suited for symmetric quantization on CUDA devices and is widely adopted by the community, [2,3,4,8] bits are supported. However, the asymmetric kernel has issues that can cause considerable accuracy drops, particularly at 2-bit quantization and small models. Besides, recently 3 bits may have some accuracy issues in Transformers.

AutoAWQ Format: This format is well-suited for asymmetric 4-bit quantization on CUDA devices and is widely adopted within the community, only 4-bits quantization is supported.

llmcompressor Format: This format is for reusing llmcompressor format, only INT8 W8A8 dynamic quantization is supported.

GGUF Format: Experimental feature. This format is well-suited for CPU devices and is widely adopted by the community.

Quantization Costs

Testing was conducted on the Nvidia A100 80G using the nightly version of PyTorch 2.6.0.dev20241029+cu124. Please note that data loading and packing costs have been excluded from the evaluation. We recommend enabling torch.compile for PyTorch versions 2.6 and above.

To optimize GPU memory usage, in addition to activating low_gpu_mem_usage, you can set gradient_accumulate_steps=8and abatch_size=1, though this may increase tuning time.

The 3B and 14B models were evaluated on Qwen 2.5, the 8X7B model is Mixtral, while the remaining models utilized LLaMA 3.1.

Torch version/Config W4G128	3B	8B	14B	70B	8X7B
2.6 with torch compile	7min10GB	12min18GB	23min22GB	120min42GB	28min46GB
2.6 with torch compile low_gpu_mem_usage=True	12min6GB	19min10GB	33min11GB	140min25GB	38min36GB
2.6 with torch compile low_gpu_mem_usage=True gradient_accumulate_steps=8,bs=1	15min3GB	25min6GB	45min7GB	187min19GB	75min36GB
2.5 w/o torch compile	8min10GB	16min20GB	30min25GB	140min49GB	50min49GB

Model Inference

Please run the quantization code first

AutoRound format

CPU: pip install intel-extension-for-pytorch(much higher speed on Intel CPU) or pip install intel-extension-for-transformers,

HPU: docker image with Gaudi Software Stack is recommended. More details can be found in Gaudi Guide.

CUDA: no extra operations for sym quantization, for asym quantization, need to install auto-round from source

HPU/CPU/XPU/CUDA

Please avoid manually moving the quantized model to a different device (e.g., model.to('cpu')) during inference, as this may cause unexpected exceptions.

from transformers import AutoModelForCausalLM, AutoTokenizer

quantized_model_path = "./tmp_autoround" model = AutoModelForCausalLM.from_pretrained(quantized_model_path, device_map="auto", torch_dtype="auto") tokenizer = AutoTokenizer.from_pretrained(quantized_model_path) text = "There is a girl who likes adventure," inputs = tokenizer(text, return_tensors="pt").to(model.device) print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50)[0]))

Specify backend

AutoRound automatically selects the best available backend based on the installed libraries and prompts the user to install additional libraries when a better backend is found. On CUDA, the default priority is Marlin > ExLLaMAV2 > Triton, but the final choice depends on factors such as bits, group_size, packing format compatibility, etc. And the backend may not always be the most suitable for certain devices. Please refer to the following table for the details and specify the backend you want.

Name	Devices	Bits	Dtypes	Priority	Packing format	Requirements
ipex	cpu/xpu	4	BF16/FP16	5	gptq_zp+-1/awq	intel-extension-for-pytorch
itrex	cpu	2,4,8	BF16/FP16	1	gptq_zp+-1/awq	intel-extension-for-transformers
marlin	cuda	4,8	BF16/FP16	6	gptq/gptq_zp+-1	gptqmodel
exllamav2 orgptqmodel:exllamav2	cuda	4	BF16/FP16	5	gptq	gptqmodel
exllamav2 orgptq:exllamav2	cuda	4	FP16	5	gptq_zp+-1	auto-gptq
gptq:cuda	cuda	2,3,4,8	FP16	1	gptq_zp+-1	auto-gptq
triton	xpu/cuda	2,4,8	BF16/FP16	2	gptq/gptq_zp+-1	auto-round
awq	cuda	4	FP16	5	awq	auto-awq
hpu	hpu	4	BF16	0	gptq/gptq_zp+-1	auto-round
torch	xpu/cpu/cuda	2,3,4,8	BF16/FP16	0	gptq/gptq_zp+-1	auto-round

from transformers import AutoModelForCausalLM, AutoTokenizer from transformers import AutoRoundConfig

quantized_model_path = "./tmp_autoround" quantization_config = AutoRoundConfig(backend="auto") model = AutoModelForCausalLM.from_pretrained(quantized_model_path, device_map="auto", torch_dtype="auto", quantization_config=quantization_config) tokenizer = AutoTokenizer.from_pretrained(quantized_model_path) text = "There is a girl who likes adventure," inputs = tokenizer(text, return_tensors="pt").to(model.device) print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50)[0]))

Convert GPTQ/AWQ format to AutoRound

Most GPTQ/AWQ models can be converted to the AutoRound format for better compatibility and support with Intel devices. Please note that the quantization config will be changed if the model is serialized.

from transformers import AutoModelForCausalLM, AutoTokenizer from transformers import AutoRoundConfig

model_name = "ybelkada/opt-125m-gptq-4bit" quantization_config = AutoRoundConfig() model = AutoModelForCausalLM.from_pretrained(model_name, device_map="cpu", torch_dtype="auto", quantization_config=quantization_config) tokenizer = AutoTokenizer.from_pretrained(model_name) text = "There is a girl who likes adventure," inputs = tokenizer(text, return_tensors="pt").to(model.device) print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50, do_sample=False)[0]))

Evaluation

Click to expand

auto-round --model saved_quantized_model
--eval
--task lambada_openai
--eval_bs 1

Support List

AutoRound supports basically all the major large language models.

Supported Models List

Please note that an asterisk (*) indicates third-party quantized models, which may lack accuracy data and use a different recipe. We greatly appreciate their efforts and encourage more users to share their models, as we cannot release most of the models ourselves.

Model	Supported
nvidia/Llama-3.1-Nemotron-70B-Instruct-HF	model-opea-int4-sym-autoround, model-opea-int4-sym-autogptq,
meta-llama/Llama-3.2-90B-Vision-Instruct	model-opea-int4-sym-autoround, model-opea-int4-sym-autogptq
Qwen/QwQ-32B-Preview	model-opea-int4-sym-autoround-mixed,model-opea-int4-sym-autoawq-mixed
THUDM/cogvlm2-llama3-chat-19B	model-opea-int4-sym-autoround
Qwen/Qwen2-VL-Instruct	model-opea-int4-sym-autoround,model-opea-int4-sym-autogptq
meta-llama/Llama-3.2-11B-Vision	model-opea-int4-sym-autoround, model-opea-int4-sym-autogptq
microsoft/Phi-3.5-vision-instruct	model-opea-int4-sym-autoround, model-opea-int4-sym-gptq
liuhaotian/llava-v1.5-7b	model-opea-int4-sym-autoround,model-opea-int4-sym-autogptq
Qwen/Qwen2.5-7B-Instruct	model-opea-int4-sym-autoround,model-opea-int4-sym-autogptq model-kaitchup-autogptq-int4*, recipe
Qwen/Qwen2.5-14B-Instruct	model-opea-int4-sym-autoround,model-opea-int4-sym-autogptq
Qwen/Qwen2.5-32B-Instruct	model-opea-int4-sym-autoround
Qwen/Qwen2.5-Coder-32B-Instruct	model-kaitchup-autogptq-int4*
Qwen/Qwen2.5-72B-Instruct	model-opea-int4-sym-autoround,model-opea-int4-sym-autogptq, model-kaitchup-autogptq-int4, model-kaitchup-autogptq-int2, recipe
meta-llama/Meta-Llama-3.1-70B-Instruct	model-opea-int4-sym-autoround, model-opea-int4-sym-autogptq,model-opea-int4-asym-autoround
meta-llama/Meta-Llama-3.1-8B-Instruct	model-opea-int4-sym-autoround,model-opea-int4-sym-autogptq,model-kaitchup-autogptq-int4, model-kaitchup-autogptq-sym-int4, recipe
meta-llama/Meta-Llama-3.1-8B	model-kaitchup-autogptq-sym-int4*
Qwen/Qwen2-7B	model-autoround-sym-int4, model-autogptq-sym-int4
THUDM/glm-4-9b-chat	model-opea-int4-sym-autoround,model-opea-int4-sym-autogptq
Qwen/Qwen2-57B-A14B-Instruct	model-autoround-sym-int4,model-autogptq-sym-int4
01-ai/Yi-1.5-9B	model-LnL-AI-autogptq-int4*
01-ai/Yi-1.5-9B-Chat	model-LnL-AI-autogptq-int4*
Intel/neural-chat-7b-v3-3	model-autogptq-int4
Intel/neural-chat-7b-v3-1	model-autogptq-int4
TinyLlama-1.1B-intermediate	model-LnL-AI-autogptq-int4*
mistralai/Mistral-7B-v0.1	model-autogptq-lmhead-int4, model-autogptq-int4
google/gemma-2b	model-autogptq-int4
tiiuae/falcon-7b	model-autogptq-int4-G64
sapienzanlp/modello-italia-9b	model-fbaldassarri-autogptq-int4*
microsoft/phi-2	model-autoround-sym-int4 model-autogptq-sym-int4
microsoft/Phi-3.5-mini-instruct	model-kaitchup-autogptq-sym-int4*
mistralai/Mistral-7B-Instruct-v0.2	outdated-recipe
mistralai/Mixtral-8x7B-Instruct-v0.1	outdated-recipe
mistralai/Mixtral-8x7B-v0.1	outdated-recipe
meta-llama/Meta-Llama-3-8B-Instruct	outdated-recipe
google/gemma-7b	outdated-recipe
meta-llama/Llama-2-7b-chat-hf	outdated-recipe
baichuan-inc/Baichuan2-7B-Chat	outdated-recipe
01-ai/Yi-6B-Chat	outdated-recipe
facebook/opt-2.7b	outdated-recipe
bigscience/bloom-3b	outdated-recipe
EleutherAI/gpt-j-6b	outdated-recipe

VLM Support Matrix

For most VLMs, we typically support the default quantization configuration, which involves quantizing only the language component while excluding the visual component. Besides, we also support quantizing non-text modules of models that follow the Hugging Face standard, i.e., those with a typical processor, though inference may have some issues due to model architecture or kernel limitations.

Model	calibration dataset	quant nontext module	Quantized Model Link
allenai/Molmo	pile	X	Molmo-7B-D-0924-int4-sym, Molmo-72B-0924-int4-sym-gptq, Molmo-72B-0924-int4-sym
deepseek-ai/deepseek-vl2	pile/llava	√	deepseek-vl2-int4-sym-gptq
fancyfeast/llama-joycaption-beta-one-hf-llava	pile	X	NeoChen1024-int4-gptq
google/gemma-3	pile/llava	√	gemma-3-12b-it-AutoRound-gguf-q4-0, gemma-3-27b-it-AutoRound-gguf-q4-0, gemma-3-12b-it-int4-AutoRound, gemma-3-27b-it-int4-AutoRound
HuggingFaceTB/SmolVLM	pile/llava	√	SmolVLM-Instruct-int4-sym
ibm-granite/granite-vision-3.2	pile/llava	-
liuhaotian/Llava-v1.5	pile/llava	X	llava-v1.5-7b-int4-sym
meta-llama/Llama-3.2-Vision	llava	√	Llama-3.2V-11B-cot-int4-sym, Llama-3.2-11B-Vision-Instruct-qvision-int4-sym, Llama-3.2-90B-Vision-Instruct-int4-sym, Llama-3.2-11B-Vision-Instruct-int4-sym
microsoft/Phi3.5-Vision	pile/llava	√	Phi-3.5-vision-instruct-int4-sym, Phi-3.5-vision-instruct-qvision-int4-sym
mistralai/Mistral-Small-3.1	pile/llava	X	Mistral-Small-3.1-24B-Instruct-2503-int4-AutoRound-gptq-sym, Mistral-Small-3.1-24B-Instruct-2503-int4-AutoRound-awq-sym
moonshotai/Kimi-VL	pile/llava	√
Qwen/Qwen2-VL	pile/llava	-	Qwen2-VL-7B-Instruct-int4-sym, Qwen2-VL-72B-Instruct-int4-sym, Qwen2-VL-72B-Instruct-int2-sym
rhymes-ai/Aria	pile/llava	√
THUDM/CogVLM2	pile/llava	√	cogvlm2-llama3-chat-19B-int4-sym, cogvlm2-llama3-chat-19B-qvision-int4-sym
THUDM/glm-4v	pile	X	glm-4v-9b-int4-sym

√ means support, - means support to export but cannot infer, X means not support.

Integration

AutoRound has been integrated into multiple repositories.

huggingface/transformers

Intel Neural Compressor

ModelCloud/GPTQModel

pytorch/ao

Reference

If you find AutoRound useful for your research, please cite our paper:

@article{cheng2023optimize, title={Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs}, author={Cheng, Wenhua and Zhang, Weiwei and Shen, Haihao and Cai, Yiyang and He, Xin and Lv, Kaokao and Liu, Yi}, journal={arXiv preprint arXiv:2309.05516}, year={2023} }