modelopt.onnx.quantization.quantize — Model Optimizer 0.31.0
quantize(onnx_path, quantize_mode='int8', calibration_data=None, calibration_method=None, calibration_cache_path=None, calibration_shapes=None, calibration_eps=['cpu', 'cuda:0', 'trt'], override_shapes=None, op_types_to_quantize=None, op_types_to_exclude=None, nodes_to_quantize=None, nodes_to_exclude=None, use_external_data_format=False, keep_intermediate_files=False, output_path=None, log_level='INFO', log_file=None, trt_plugins=None, trt_plugins_precision=None, high_precision_dtype=None, mha_accumulation_dtype='fp16', disable_mha_qdq=False, dq_only=True, block_size=None, use_zero_point=False, passes=['concat_elimination'], simplify=False, **kwargs)
Quantizes the provided ONNX model.
Parameters:
- onnx_path (str) – Path to the input ONNX model.
- quantize_mode (str) – Quantization mode. One of 'int8' (default), 'int4', or 'fp8'.
- calibration_data (ndarray | dict [ str , ndarray ]) – Calibration data: a numpy array, a list of numpy arrays, or a dict mapping input names to numpy arrays.
- calibration_method (str | None) – Calibration method. The available options depend on quantize_mode: for 'int8', 'entropy' (default) or 'max'; for 'fp8', 'max' (default); for 'int4', 'awq_clip' (default), 'awq_lite', 'awq_full', or 'rtn_dq'.
- calibration_cache_path (str | None) – Path to pre-calculated activation tensor ranges, also known as calibration cache.
- calibration_shapes (str | None) – Input shapes used for the calibration process.
- calibration_eps (list [ str ]) – Priority order for the execution providers (EPs) used to calibrate the model. Any subset of ['trt', 'cuda:x', 'dml:x', 'cpu'], where 'x' is the device id.
  Note: If a custom op is detected in the model, 'trt' is automatically added to the EP list.
- override_shapes (str | None) – Override model input shapes with static shapes.
- op_types_to_quantize (list [ str ] | None) – List of op types to quantize. If None (default), all supported operators are quantized. This flag does not support regular expression.
- op_types_to_exclude (list [ str ] | None) – List of op types to exclude from quantization. This flag does not support regular expression.
- nodes_to_quantize (list [ str ] | None) – List of node names to quantize. If None (default), all supported nodes are quantized. This flag supports regular expression.
- nodes_to_exclude (list [ str ] | None) – List of node names to exclude from quantization. This flag supports regular expression.
- use_external_data_format (bool) – If True, separate data path will be used to store the weights of the quantized model.
- keep_intermediate_files (bool) – If True, keep all intermediate files generated during the ONNX model’s conversion/calibration.
- output_path (str | None) – Output filename to save the quantized ONNX model. If None, save in the same directory as the original ONNX model with .quant suffix.
- log_level (str) – Log level. One of ‘DEBUG’, ‘INFO’, ‘WARNING’, ‘ERROR’.
- log_file (str | None) – Path to the log file for the quantization process.
- trt_plugins (str | None) – Specifies custom TensorRT plugin library paths in .so format (compiled shared libraries). For multiple paths, separate them with a semicolon, e.g. "lib_1.so;lib_2.so". If this is not None or the model has custom ops, TensorrtExecutionProvider becomes the first choice of calibration execution provider, meaning that TensorRT is required.
- trt_plugins_precision (list [ str ] | None) – A space-separated list indicating the precision for each custom op. Each item should have the format <op_type>:<precision>, where precision can be fp32 (default) or fp16. For example: op_type_1:fp16 op_type_2:fp32.
- high_precision_dtype (str | None) – High precision data type, one of ['fp32', 'fp16']. If high_precision_dtype == 'fp16', the model's weights and activations are converted to fp16.
- mha_accumulation_dtype (str) – MHA accumulation dtype. One of ['fp32', 'fp16']; 'fp16' by default. If quantize_mode == 'fp8' and mha_accumulation_dtype == 'fp32', Cast nodes are added to the input and output tensors of MHA's bmm1 and bmm2.
- disable_mha_qdq (bool) – If True, don't add Q/DQ layers to MatMuls in the MHA pattern.
- dq_only (bool) – If True (default), only add DQ nodes to the model. If False, add Q/DQ nodes to the model.
- block_size (int | None) – Block size parameter for int4 quantization.
- use_zero_point (bool) – If True, use zero-point based quantization.
- passes (list [ str ]) – List of optimization pass names; if set, the appropriate pre/post-processing passes are invoked.
- simplify (bool) – If True, simplify the given model before quantization.
- kwargs (Any) – Additional keyword arguments for int4 quantization (see the sketch after this list), including:
  - awqlite_alpha_step (float): Alpha step for lite, range [0, 1].
  - awqclip_alpha_step (float): Alpha step used to find the best alpha for clip.
  - awqclip_alpha_min (float): Minimum alpha for clip, range [awqclip_alpha_min, 1].
  - awqclip_bsz_col (int): Batch size for processing the column dimension in clip.
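The following is a minimal sketch of how these int4 AWQ knobs are forwarded through **kwargs. The model path ("decoder.onnx"), the "input_ids" input name, and the calibration shapes are hypothetical placeholders, not values prescribed by this API.

```python
# Minimal sketch: int4 AWQ quantization with tuning knobs forwarded
# via **kwargs. Path, input name, and shapes are placeholders.
import numpy as np
from modelopt.onnx.quantization import quantize

calib = {"input_ids": np.random.randint(0, 32000, size=(1, 128)).astype(np.int64)}

quantize(
    "decoder.onnx",                  # hypothetical input model
    quantize_mode="int4",
    calibration_method="awq_clip",   # default for int4
    calibration_data=calib,
    block_size=128,                  # block size for int4 quantization
    awqclip_alpha_step=0.05,         # forwarded through **kwargs
    awqclip_bsz_col=1024,            # forwarded through **kwargs
)
```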
Returns:
None. The quantized ONNX model is written to the supplied output_path, or, if output_path is None, to the same directory as the input model with a filename like "<model_name>.quant.onnx".
Return type:
None
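For the common case, a minimal sketch of the default int8 flow is shown below; again, the model path, input name, and tensor shape are illustrative assumptions, not part of the API.

```python
# Minimal sketch: default int8 quantization with entropy calibration.
# "model.onnx", the "input" name, and the (32, 3, 224, 224) shape are
# placeholders; adapt them to the actual model.
import numpy as np
from modelopt.onnx.quantization import quantize

calib = {"input": np.random.rand(32, 3, 224, 224).astype(np.float32)}

quantize(
    "model.onnx",
    quantize_mode="int8",
    calibration_data=calib,
    calibration_eps=["cuda:0", "cpu"],  # prefer the CUDA EP, fall back to CPU
    output_path="model.quant.onnx",
)
```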