Support Matrix — Model Optimizer 0.27.1

Feature Support Matrix

Linux

| Quantization Format | Details | Supported Model Formats | Deployment |
|---|---|---|---|
| FP4 | Per-block FP4 weights & activations. GPUs: Blackwell and later. | PyTorch | TensorRT, TensorRT-LLM |
| FP8 | Per-tensor FP8 weights & activations. GPUs: Ada and later. | PyTorch, ONNX* | TensorRT*, TensorRT-LLM |
| INT8 | Per-channel INT8 weights, per-tensor INT8 activations. Uses the SmoothQuant algorithm. GPUs: Ampere and later. | PyTorch, ONNX* | TensorRT*, TensorRT-LLM |
| W4A16 (INT4 Weights Only) | Block-wise INT4 weights, FP16 activations. Uses the AWQ algorithm. GPUs: Ampere and later. | PyTorch, ONNX | TensorRT, TensorRT-LLM |
| W4A8 (INT4 Weights, FP8 Activations) | Block-wise INT4 weights, per-tensor FP8 activations. Uses the AWQ algorithm. GPUs: Ada and later. | PyTorch*, ONNX* | TensorRT-LLM |
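The PyTorch formats above are applied through the `modelopt.torch.quantization` API. Below is a minimal sketch of post-training quantization with calibration; the toy model and synthetic calibration data are placeholders, and config names such as `mtq.INT8_SMOOTHQUANT_CFG` should be verified against your installed 0.27.1 release.

```python
# Minimal PTQ sketch with ModelOpt's PyTorch API. The model and calibration
# data here are toy placeholders; substitute your own.
import torch
import torch.nn as nn
import modelopt.torch.quantization as mtq

model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 8)).cuda()
calib_data = [torch.randn(16, 64, device="cuda") for _ in range(8)]

def forward_loop(model):
    # Run representative inputs through the model so ModelOpt can collect
    # the activation statistics needed for calibration.
    with torch.no_grad():
        for batch in calib_data:
            model(batch)

# Pick the config matching a row of the table above, e.g. INT8 SmoothQuant
# (Ampere and later) or mtq.FP8_DEFAULT_CFG (Ada and later).
model = mtq.quantize(model, mtq.INT8_SMOOTHQUANT_CFG, forward_loop)
```

The quantized module can then be exported for the deployment targets listed in the table (TensorRT, TensorRT-LLM).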

Windows

| Quantization Format | Details | Supported Model Formats | Deployment |
|---|---|---|---|
| W4A16 (INT4 Weights Only) | Block-wise INT4 weights, FP16 activations. Uses the AWQ algorithm. GPUs: Ampere and later. | PyTorch*, ONNX | ORT-DirectML, TensorRT*, TensorRT-LLM* |
| W4A8 (INT4 Weights, FP8 Activations) | Block-wise INT4 weights, per-tensor FP8 activations. Uses the AWQ algorithm. GPUs: Ada and later. | PyTorch* | TensorRT-LLM* |
| FP8 | Per-tensor FP8 weights & activations (PyTorch); per-tensor activations and per-channel weights (ONNX). Uses Max calibration. GPUs: Ada and later. | PyTorch*, ONNX | TensorRT*, TensorRT-LLM*, ORT-CUDA |
| INT8 | Per-channel INT8 weights, per-tensor INT8 activations. Uses SmoothQuant (PyTorch)* and Max calibration (ONNX). GPUs: Ada and later. | PyTorch*, ONNX | TensorRT*, TensorRT-LLM*, ORT-CUDA |
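The ONNX formats above are applied through the `modelopt.onnx.quantization` module. The sketch below is hedged: the exact argument names (`onnx_path`, `quantize_mode`, `calibration_data`, `output_path`) are assumptions based on that module's quantize entry point, so confirm them against the 0.27.1 documentation before use.

```python
# Hedged sketch of ONNX PTQ with ModelOpt on Windows. Argument names are
# assumptions; check the modelopt.onnx.quantization docs for your release.
import numpy as np
from modelopt.onnx.quantization import quantize

# Representative inputs for calibration (placeholder shape).
calibration_data = np.random.rand(32, 3, 224, 224).astype(np.float32)

quantize(
    onnx_path="model.onnx",
    quantize_mode="int8",            # or "fp8" / "int4" per the table above
    calibration_data=calibration_data,
    output_path="model.quant.onnx",
)
```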

Note

Features marked with an asterisk (*) are considered experimental.

Model Support Matrix

Linux

Please check out the model support matrix here.

Windows

| Model | ONNX INT4 AWQ (W4A16) | ONNX INT8 Max (W8A8) | ONNX FP8 Max (W8A8) |
|---|---|---|---|
| Llama3.1-8B-Instruct | Yes | No | No |
| Phi3.5-mini-Instruct | Yes | No | No |
| Mistral-7B-Instruct-v0.3 | Yes | No | No |
| Llama3.2-3B-Instruct | Yes | No | No |
| Gemma-2b-it | Yes | No | No |
| Gemma-2-2b | Yes | No | No |
| Gemma-2-9b | Yes | No | No |
| Nemotron Mini 4B Instruct | Yes | No | No |
| Qwen2.5-7B-Instruct | Yes | No | No |
| DeepSeek-R1-Distill-Llama-8B | Yes | No | No |
| DeepSeek-R1-Distill-Qwen-1.5B | Yes | No | No |
| DeepSeek-R1-Distill-Qwen-7B | Yes | No | No |
| DeepSeek-R1-Distill-Qwen-14B | Yes | No | No |
| Mistral-NeMo-Minitron-2B-128k-Instruct | Yes | No | No |
| Mistral-NeMo-Minitron-4B-128k-Instruct | Yes | No | No |
| Mistral-NeMo-Minitron-8B-128k-Instruct | Yes | No | No |
| whisper-large | No | Yes | Yes |
| sam2-hiera-large | No | Yes | Yes |
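Quantized ONNX models from this table deploy on Windows through the backends listed in the feature matrix, including ORT-DirectML. Below is a minimal sketch of loading such a model with ONNX Runtime's DirectML execution provider; the file name is a hypothetical quantization output, and the actual input names and shapes depend on how the model was exported.

```python
# Minimal sketch: run a quantized ONNX model via ORT-DirectML on Windows.
# Requires the onnxruntime-directml package; the model path is a placeholder.
import onnxruntime as ort

session = ort.InferenceSession(
    "llama3.1-8b-instruct.int4_awq.onnx",   # hypothetical INT4 AWQ output
    providers=["DmlExecutionProvider", "CPUExecutionProvider"],
)

# Inspect the graph inputs to build a feed dict for the actual model.
for inp in session.get_inputs():
    print(inp.name, inp.shape, inp.type)
```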

Note