
Quick Start: Quantization (Windows)

Quantization is a crucial technique for reducing memory usage and speeding up inference in deep learning models.

The ONNX quantization API in ModelOpt-Windows offers advanced Post-Training Quantization (PTQ) options like Activation-Aware Quantization (AWQ).

ONNX Model Quantization (PTQ)

The ONNX quantization API requires a model and calibration data, along with quantization settings such as the algorithm and the calibration EPs. Here's an example snippet that applies INT4 AWQ quantization:

    import os

    import onnx

    from modelopt.onnx.quantization.int4 import quantize as quantize_int4
    # import other packages as needed

    calib_inputs = get_calib_inputs(dataset, model_name, cache_dir, calib_size, batch_size, ...)

    quantized_onnx_model = quantize_int4(
        onnx_path,
        calibration_method="awq_lite",
        calibration_data_reader=None if use_random_calib else calib_inputs,
        calibration_eps=["dml", "cpu"],
    )

    onnx.save_model(
        quantized_onnx_model,
        output_path,
        save_as_external_data=True,
        location=os.path.basename(output_path) + "_data",
        size_threshold=0,
    )
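The get_calib_inputs helper above is not part of ModelOpt-Windows; it stands in for whatever code prepares calibration batches for your model. Below is a minimal, hypothetical sketch of such a helper with a simplified signature, assuming the quantizer accepts a list of input-name to numpy-array feed dictionaries (see the quantize_int4 API reference for the exact calibration_data_reader contract):

    import numpy as np

    def get_calib_inputs(dataset, input_name="input_ids", calib_size=32, seq_len=512):
        # Hypothetical helper: turn the first `calib_size` samples of a tokenized
        # dataset into feed dictionaries of numpy arrays for calibration.
        inputs = []
        for sample in list(dataset)[:calib_size]:
            ids = np.asarray(sample[input_name], dtype=np.int64)[:seq_len]
            inputs.append({input_name: ids[np.newaxis, :]})  # add a batch dimension
        return inputs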

See modelopt.onnx.quantization.quantize_int4 for details about the INT4 quantization API.

Refer to the Support Matrix for details about supported features and models.

To learn more about ONNX PTQ, refer to ONNX Quantization - Windows and the example scripts.

Deployment

The quantized ONNX model can be deployed with frameworks like onnxruntime. Ensure that the model's opset is 19 or higher for FP8 quantization and 21 or higher for INT4 quantization; these minimums come from the opset requirements of ONNX's Q/DQ nodes for the FP8 and INT4 data types. Refer to Apply Post Training Quantization (PTQ) for details.
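As a minimal deployment sketch, the quantized model can be loaded with onnxruntime using the DirectML execution provider and a CPU fallback. The model path and input shape below are hypothetical, and DmlExecutionProvider assumes the onnxruntime-directml package is installed:

    import numpy as np
    import onnxruntime as ort

    # DmlExecutionProvider is provided by onnxruntime-directml on Windows;
    # CPUExecutionProvider is kept as a fallback.
    session = ort.InferenceSession(
        "quantized_model.onnx",
        providers=["DmlExecutionProvider", "CPUExecutionProvider"],
    )

    # Hypothetical single-input model; adapt the feed dict to your model's inputs.
    input_name = session.get_inputs()[0].name
    dummy_input = np.zeros((1, 128), dtype=np.int64)
    outputs = session.run(None, {input_name: dummy_input})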

    # Write steps (say, an upgrade_opset() method) to upgrade or patch the opset of the model, if needed.
    # The opset upgrade, if needed, can be done on either the base ONNX model or the quantized model.
    quantized_onnx_model = upgrade_opset(quantized_onnx_model)

    # Finally, save the quantized model.
    onnx.save_model(
        quantized_onnx_model,
        output_path,
        save_as_external_data=True,
        location=os.path.basename(output_path) + "_data",
        size_threshold=0,
    )
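upgrade_opset above is a placeholder, not a ModelOpt API. A minimal sketch of such a helper, assuming it is enough to raise the default (ai.onnx) opset import to the required version, might look like:

    import onnx

    def upgrade_opset(model: onnx.ModelProto, target_opset: int = 21) -> onnx.ModelProto:
        # Hypothetical helper: bump the default-domain opset import if it is below target_opset.
        for opset in model.opset_import:
            if opset.domain in ("", "ai.onnx") and opset.version < target_opset:
                opset.version = target_opset
        return model

Depending on the model, patching the opset import alone may not be sufficient; onnx.version_converter.convert_version can be used instead to rewrite nodes to the target opset.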

For detailed instructions about deploying quantized models with the DirectML backend (ORT-DML), see the DirectML guide. Also, refer to the example scripts for any model-specific inference guidance or scripts.

Note

Ready-to-deploy optimized ONNX models from ModelOpt-Windows are available in the NVIDIA collections on Hugging Face.