config — Model Optimizer 0.31.0

This document lists the quantization formats supported by Model Optimizer and example quantization configs.

Quantization Formats

The following table lists the quantization formats supported by Model Optimizer and the corresponding quantization config. See Quantization Configs for the specific quantization config definitions.

Please see choosing the right quantization formats to learn more about the formats and their use-cases.

Note

The recommended configs given below are for LLM models. For CNN models, only INT8 quantization is supported. Please use quantization config INT8_DEFAULT_CFG for CNN models.

Quantization Format	Model Optimizer config
INT8	INT8_SMOOTHQUANT_CFG
FP8	FP8_DEFAULT_CFG
INT4 Weights only AWQ (W4A16)	INT4_AWQ_CFG
INT4-FP8 AWQ (W4A8)	W4A8_AWQ_BETA_CFG

Quantization Configs

A quantization config is a dictionary specifying the values for the keys "quant_cfg" and "algorithm". The "quant_cfg" key specifies the quantization configurations. The "algorithm" key specifies the algorithm argument to calibrate. Please see QuantizeConfig for the quantization config definition.

‘Quantization configurations’ is a dictionary mapping wildcards or filter functions to their ‘quantizer attributes’. The wildcards or filter functions are matched against the quantizer module names. The quantizer modules have names ending with weight_quantizer and input_quantizer, and they perform weight quantization and input quantization (or activation quantization) respectively. The quantizer modules are generally instances of TensorQuantizer. The quantizer attributes are defined by QuantizerAttributeConfig. See QuantizerAttributeConfig for details on the quantizer attributes and their values.

The key “default” from the quantization configuration dictionary is applied if no other wildcard or filter functions match the quantizer module name.

The quantizer attributes are applied in the order they are specified. For the missing attributes, the default attributes as defined by QuantizerAttributeConfig are used.
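The matching rules described above can be sketched in plain Python. This is a minimal illustration, not Model Optimizer's implementation: it assumes that wildcard matching behaves like fnmatch, that the last matching entry wins, and that missing attributes fall back to defaults. The names resolve_attributes and DEFAULT_ATTRS are hypothetical and exist only in this sketch.

```python
from fnmatch import fnmatch

# A small subset of the QuantizerAttributeConfig defaults listed in this document.
DEFAULT_ATTRS = {"enable": True, "num_bits": 8, "axis": None}


def resolve_attributes(quantizer_name, quant_cfg):
    """Return the attributes a quantizer would receive under the assumed rules."""
    matched = None
    for pattern, attrs in quant_cfg.items():
        if pattern == "default":
            continue
        # Keys may be wildcard strings or filter functions taking the module name.
        if callable(pattern):
            is_match = pattern(quantizer_name)
        else:
            is_match = fnmatch(quantizer_name, pattern)
        if is_match:
            matched = attrs  # entries are applied in order, so later matches win
    if matched is None:
        matched = quant_cfg.get("default", {})
    # Attributes missing from the matched entry fall back to the defaults.
    return {**DEFAULT_ATTRS, **matched}


example_cfg = {
    "*weight_quantizer": {"num_bits": 4, "axis": 0},
    "*lm_head*": {"enable": False},
    "default": {"num_bits": 8, "axis": None},
}

weight_attrs = resolve_attributes("model.layers.0.mlp.weight_quantizer", example_cfg)
lm_head_attrs = resolve_attributes("lm_head.weight_quantizer", example_cfg)
other_attrs = resolve_attributes("model.layers.0.attn.output_quantizer", example_cfg)
```

In this sketch, the lm_head quantizer matches two entries and ends up with the attributes of the later one, while the output quantizer matches nothing and receives the "default" entry.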

Quantizer attributes can also be a list of dictionaries. In this case, the matched quantizer module is replaced with a SequentialQuantizer module which is used to quantize a tensor in multiple formats sequentially. Each quantizer attribute dictionary in the list specifies the quantization formats for each quantization step of the sequential quantizer. For example, SequentialQuantizer is used in ‘INT4 Weights, FP8 Activations’ quantization in which the weights are quantized in INT4 followed by FP8.
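A config using a list of quantizer attributes might look like the sketch below. It is illustrative only: the block size of 128, the choice of "awq_lite", and the name W4A8_STYLE_QUANT_CFG are assumptions made for this example, not values copied from a shipped Model Optimizer config.

```python
# Illustrative only: the attribute values below are assumptions for this sketch.
W4A8_STYLE_QUANT_CFG = {
    "quant_cfg": {
        # A list of attribute dicts: step 1 quantizes weights to INT4 in blocks,
        # step 2 quantizes the result to FP8 (E4M3).
        "*weight_quantizer": [
            {"num_bits": 4, "block_sizes": {-1: 128}, "enable": True},
            {"num_bits": (4, 3), "axis": None, "enable": True},
        ],
        # Activations use a single FP8 quantizer.
        "*input_quantizer": {"num_bits": (4, 3), "axis": None, "enable": True},
    },
    "algorithm": "awq_lite",
}

weight_steps = W4A8_STYLE_QUANT_CFG["quant_cfg"]["*weight_quantizer"]
```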

In addition, the dictionary entries can also be PyTorch module class names mapping to class-specific quantization configurations. The PyTorch modules should have a quantized equivalent.

To get the string representation of a module class, do:

from torch import nn
from modelopt.torch.quantization import QuantModuleRegistry

# Get the class name for nn.Conv2d
class_name = QuantModuleRegistry.get_key(nn.Conv2d)

Here is an example of a quantization config:

MY_QUANT_CFG = {
    "quant_cfg": {
        # Quantizer wildcard strings mapping to quantizer attributes
        "*weight_quantizer": {"num_bits": 8, "axis": 0},
        "*input_quantizer": {"num_bits": 8, "axis": None},

        # Module class names mapping to quantizer configurations
        "nn.LeakyReLU": {"*input_quantizer": {"enable": False}},
    }
}

Example Quantization Configurations

These example configs can be accessed as attributes of modelopt.torch.quantization and can be given as input to mtq.quantize(). For example:

import modelopt.torch.quantization as mtq

model = mtq.quantize(model, mtq.INT8_DEFAULT_CFG, forward_loop)

You can also create your own config by following these examples. For instance, if you want to quantize a model with the INT4 AWQ algorithm but need to skip quantizing the layer named lm_head, you can create a custom config and quantize your model as follows:

import copy

# Create a custom config that skips the lm_head layer
CUSTOM_INT4_AWQ_CFG = copy.deepcopy(mtq.INT4_AWQ_CFG)
CUSTOM_INT4_AWQ_CFG["quant_cfg"]["*lm_head*"] = {"enable": False}

# Quantize the model
model = mtq.quantize(model, CUSTOM_INT4_AWQ_CFG, forward_loop)
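The forward_loop argument is not defined in this section. A minimal sketch is given here, assuming forward_loop is a callable that takes the model as its only argument and runs calibration data through it. The helper make_forward_loop and the stand-in model are hypothetical and are used only to keep the example self-contained.

```python
def make_forward_loop(calib_batches):
    """Build a forward_loop callable that feeds each calibration batch to the model."""

    def forward_loop(model):
        for batch in calib_batches:
            model(batch)  # outputs are discarded; only the forward pass matters

    return forward_loop


# Stand-in "model" that records its inputs, so the sketch runs without PyTorch.
seen_batches = []
forward_loop = make_forward_loop([[1, 2], [3, 4], [5, 6]])
forward_loop(seen_batches.append)
```

With a real model, calib_batches would typically be a data loader yielding input tensors.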

Functions

need_calibration	Check if calibration is needed for the given config.

ModeloptConfig AWQClipCalibConfig

Bases: QuantizeAlgorithmConfig

The config for awq_clip (AWQ clip) algorithm.

AWQ clip searches for the clipped amax for per-group quantization. This search requires much more compute than AWQ lite. To avoid out-of-memory (OOM) errors, the linear layer weights are processed in batches along the out_features dimension, with batch size max_co_batch_size. AWQ clip calibration also takes longer than AWQ lite.

Default config (JSON):

{ "method": "awq_clip", "max_co_batch_size": 1024, "max_tokens_per_batch": 64, "min_clip_ratio": 0.5, "shrink_step": 0.05, "debug": false }

field debug: bool | None

If True, module’s search metadata will be kept as a module attribute named awq_clip.

field max_co_batch_size: int | None

Reduce this number if CUDA Out of Memory error occurs.
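As a rough illustration of what batching along the out_features dimension means, the sketch below splits the output channels of a linear layer into index ranges of at most max_co_batch_size. The function name out_feature_batches is hypothetical, and the actual implementation may slice the weights differently.

```python
def out_feature_batches(out_features, max_co_batch_size=1024):
    """Return (start, end) index ranges covering all output channels."""
    return [
        (start, min(start + max_co_batch_size, out_features))
        for start in range(0, out_features, max_co_batch_size)
    ]


# A layer with 2500 output channels is processed in three chunks.
batches = out_feature_batches(2500, max_co_batch_size=1024)
```

A smaller max_co_batch_size gives more, smaller chunks, which lowers peak memory at the cost of more iterations.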

field max_tokens_per_batch: int | None

The total tokens used for clip search would be max_tokens_per_batch * number of batches. Original AWQ uses a total of 512 tokens to search for clip values.

field method: Literal['awq_clip']

field min_clip_ratio: float | None

It should be in (0, 1.0). Clip will search for the optimal clipping value in the range [original block amax * min_clip_ratio, original block amax].

Constraints:

gt = 0.0
lt = 1.0

field shrink_step: float | None

It should be in range (0, 1.0]. The clip ratio will be searched from min_clip_ratio to 1 with the step size specified.

Constraints:

gt = 0.0
le = 1.0
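The search grid implied by min_clip_ratio and shrink_step can be sketched as follows. This is a simplified, pure-Python illustration for a single weight block and a single activation vector. It assumes symmetric INT4 fake quantization and a squared output-error objective; the function names are hypothetical, and the library's actual search may differ in both the loss and the traversal order.

```python
def clip_ratio_candidates(min_clip_ratio=0.5, shrink_step=0.05):
    """Clip ratios from 1.0 down to min_clip_ratio in steps of shrink_step."""
    n_steps = int((1.0 - min_clip_ratio) / shrink_step + 1e-9)
    return [round(1.0 - i * shrink_step, 10) for i in range(n_steps + 1)]


def fake_quant_int4(values, amax):
    """Symmetric INT4 fake quantization with the given clipping value."""
    scale = amax / 7.0
    return [max(-8, min(7, round(v / scale))) * scale for v in values]


def search_clip_ratio(weights, activations, min_clip_ratio=0.5, shrink_step=0.05):
    """Pick the clip ratio whose quantized weights best preserve the output."""
    amax = max(abs(w) for w in weights) or 1.0  # guard against an all-zero block
    reference = sum(x * w for x, w in zip(activations, weights))
    errors = {}
    for ratio in clip_ratio_candidates(min_clip_ratio, shrink_step):
        quantized = fake_quant_int4(weights, amax * ratio)
        output = sum(x * q for x, q in zip(activations, quantized))
        errors[ratio] = (output - reference) ** 2
    best_ratio = min(errors, key=errors.get)
    return best_ratio, errors


candidates = clip_ratio_candidates()
best_ratio, errors = search_clip_ratio(
    weights=[0.9, -0.1, 0.12, 0.05, -0.08, 0.11],
    activations=[0.1, 1.0, -2.0, 1.5, 0.7, -1.2],
)
```

With the default values, the grid contains 11 candidate ratios between 1.0 and 0.5.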

ModeloptConfig AWQFullCalibConfig

Bases: AWQLiteCalibConfig, AWQClipCalibConfig

The config for awq or awq_full algorithm (AWQ full).

AWQ full performs awq_lite followed by awq_clip.

Default config (JSON):

{ "method": "awq_full", "max_co_batch_size": 1024, "max_tokens_per_batch": 64, "min_clip_ratio": 0.5, "shrink_step": 0.05, "debug": false, "alpha_step": 0.1 }

field debug: bool | None

If True, module’s search metadata will be kept as module attributes named awq_lite and awq_clip.

field method: Literal['awq_full']

ModeloptConfig AWQLiteCalibConfig

Bases: QuantizeAlgorithmConfig

The config for awq_lite (AWQ lite) algorithm.

AWQ lite applies a channel-wise scaling factor which minimizes the output difference after quantization. See AWQ paper for more details.

Default config (JSON):

{ "method": "awq_lite", "alpha_step": 0.1, "debug": false }

field alpha_step: float | None

The alpha will be searched from 0 to 1 with the step size specified.

Constraints:

gt = 0.0
le = 1.0
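To make the alpha search concrete, here is a simplified sketch for a single weight row. It assumes the per-channel scale is the mean absolute activation raised to the power alpha, as in the AWQ paper's formulation, together with symmetric INT4 fake quantization and a squared output-error objective. The function names are hypothetical, and the library's actual scale definition and loss may differ.

```python
def alpha_candidates(alpha_step=0.1):
    """Alpha values from 0 to 1 in steps of alpha_step."""
    n_steps = int(1.0 / alpha_step + 1e-9)
    return [round(i * alpha_step, 10) for i in range(n_steps + 1)]


def fake_quant(values, num_bits=4):
    """Symmetric integer fake quantization using the max absolute value."""
    amax = max(abs(v) for v in values) or 1.0
    qmax = 2 ** (num_bits - 1) - 1
    scale = amax / qmax
    return [max(-qmax - 1, min(qmax, round(v / scale))) * scale for v in values]


def search_alpha(weight_row, act_samples, alpha_step=0.1):
    """Return the alpha minimizing the output error, plus all measured errors."""
    n_channels = len(weight_row)
    act_scale = [
        sum(abs(x[j]) for x in act_samples) / len(act_samples)
        for j in range(n_channels)
    ]
    errors = {}
    for alpha in alpha_candidates(alpha_step):
        s = [max(a, 1e-8) ** alpha for a in act_scale]
        # Scale the weights up, quantize them, and scale the activations down.
        w_q = fake_quant([w * sj for w, sj in zip(weight_row, s)])
        err = 0.0
        for x in act_samples:
            ref = sum(xj * wj for xj, wj in zip(x, weight_row))
            out = sum((xj / sj) * wq for xj, sj, wq in zip(x, s, w_q))
            err += (out - ref) ** 2
        errors[alpha] = err
    best_alpha = min(errors, key=errors.get)
    return best_alpha, errors


best_alpha, alpha_errors = search_alpha(
    weight_row=[0.8, -0.05, 0.3, 0.02],
    act_samples=[[0.1, 4.0, 0.2, 6.0], [0.2, -3.5, 0.1, 5.0]],
)
```

With alpha = 0 the scale is 1 for every channel, which corresponds to plain quantization without any channel-wise scaling.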

field debug: bool | None

If True, module’s search metadata will be kept as a module attribute named awq_lite.

field method: Literal['awq_lite']

ModeloptConfig CompressConfig

Bases: ModeloptBaseConfig

Default configuration for compress mode.

Default config (JSON):

{ "compress": { "*": true } }

field compress: dict[str, bool]

ModeloptConfig MaxCalibConfig

Bases: QuantizeAlgorithmConfig

The config for max calibration algorithm.

Max calibration estimates the max values of activations or weights and uses these max values to set the quantization scaling factor. See Integer Quantization for the concepts.
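The idea can be sketched in a few lines of plain Python. This sketch assumes symmetric signed integer quantization, where the scaling factor is amax divided by the largest representable integer. The function name max_calibrate is hypothetical and the library's calibrator may track statistics differently.

```python
def max_calibrate(batches, num_bits=8):
    """Track the running max absolute value and derive a scaling factor."""
    amax = 0.0
    for batch in batches:
        amax = max(amax, max(abs(v) for v in batch))
    qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits
    return amax, amax / qmax


amax, scale = max_calibrate([[0.5, -2.0], [1.0, 1.5]])
```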

Default config (JSON):

{ "method": "max", "distributed_sync": true }

field distributed_sync: bool | None

If True, the amax will be synced across the distributed processes.

field method: Literal['max']

ModeloptConfig QuantizeAlgorithmConfig

Bases: ModeloptBaseConfig

Calibration algorithm config base.

Default config (JSON):

field method: Literal[None]

ModeloptConfig QuantizeConfig

Bases: ModeloptBaseConfig

Default configuration for quantize mode.

Default config (JSON):

{ "quant_cfg": { "default": { "num_bits": 8, "axis": null } }, "algorithm": "max" }

ModeloptConfig QuantizerAttributeConfig

Bases: ModeloptBaseConfig

Quantizer attribute type.

Default config (JSON):

{ "enable": true, "num_bits": 8, "axis": null, "fake_quant": true, "unsigned": false, "narrow_range": false, "learn_amax": false, "type": "static", "block_sizes": null, "bias": null, "trt_high_precision_dtype": "Float", "calibrator": "max", "rotate": false }

field axis: int | tuple[int, ...] | None

This field is for static per-channel quantization. It cannot coexist with `block_sizes`. You should set axis if you want a fixed shape of scale factor.

For example:

- If axis is set to None, the scale factor will be a scalar (per-tensor quantization).
- If axis is set to 0, the scale factor will be a vector of shape (dim0,) (per-channel quantization).
- If axis is set to (-2, -1), the scale factor will be a tensor of shape (dim-2, dim-1).

The axis value must be in the range [-rank(input_tensor), rank(input_tensor)).
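The difference between per-tensor and per-channel scale factors can be illustrated on a small 2D weight stored as nested lists. This is a pure-Python sketch with hypothetical helper names; it only computes the amax values from which scale factors would be derived.

```python
def amax_per_tensor(matrix):
    """axis=None: a single amax shared by the whole tensor."""
    return max(abs(v) for row in matrix for v in row)


def amax_per_channel(matrix):
    """axis=0: one amax per index along dimension 0, giving shape (dim0,)."""
    return [max(abs(v) for v in row) for row in matrix]


weight = [[0.1, -0.4, 0.2], [1.5, -0.3, 0.9]]
per_tensor = amax_per_tensor(weight)
per_channel = amax_per_channel(weight)
```

Here the first row would be quantized with a much smaller scale under axis=0 than under axis=None, where the larger second row dominates.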

field bias

Configuration for bias handling in affine quantization. The keys are:

- "enable": Boolean to enable/disable bias handling, default is False
- "type": Specify the type of bias ["static", "dynamic"], default is "static"
- "method": Specify the method of bias calibration ["mean", "max_min"], default is "mean"
- "axis": Tuple of integers specifying axes for bias computation, default is None

Examples:

bias = {"enable": True}
bias = {"enable": True, "type": "static", "axis": -1}
bias = {"enable": True, "type": "dynamic", "axis": (-1, -3)}

field block_sizes

This field is for static or dynamic block quantization. It cannot coexist with axis. You should set block_sizes if you want a fixed number of elements to share each scale factor.

The keys are the axes for block quantization and the values are block sizes for quantization along the respective axes. Keys must be in the range [-tensor.dim(), tensor.dim()). Values, which are the block sizes for quantization, must be positive integers or None. A positive block size specifies the block size for quantization along that axis. None means that the block size will be the maximum possible size in that dimension. This is useful for specifying certain quantization formats such as per-token dynamic quantization, which has the amax shared along the last dimension.

In addition, there can be special string keys "type", "scale_bits" and "scale_block_sizes".

Key "type" should map to "dynamic" or "static", where "dynamic" indicates dynamic block quantization and "static" indicates static calibrated block quantization. By default, the type is "static".

Key "scale_bits" specifies the quantization bits for the per-block quantization scale factor (i.e., a double quantization scheme).

Key "scale_block_sizes" specifies the block size for double quantization. By default, the per-block quantization scale is not quantized.

For example, block_sizes = {-1: 32} will quantize the last axis of the input tensor in blocks of size 32 with static calibration, with a total of numel(tensor) / 32 scale factors. block_sizes = {-1: 32, "type": "dynamic"} will perform dynamic block quantization. block_sizes = {-1: None, "type": "dynamic"} can be used to specify per-token dynamic quantization.
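The scale-factor count in the example above can be checked with a small pure-Python sketch. The helper block_amax is hypothetical; it computes one amax per block along the last axis and treats a block size of None as "the whole axis", mirroring the per-token case.

```python
def block_amax(row, block_size=None):
    """One amax per block along the last axis; None means a single block."""
    size = len(row) if block_size is None else block_size
    return [
        max(abs(v) for v in row[i : i + size])
        for i in range(0, len(row), size)
    ]


# A 4 x 64 tensor stored as nested lists, with arbitrary integer values.
tensor = [[(r * 64 + c) % 7 - 3 for c in range(64)] for r in range(4)]

# block_sizes = {-1: 32}: two blocks per row.
blockwise = [block_amax(row, 32) for row in tensor]
num_scales = sum(len(row_scales) for row_scales in blockwise)

# block_sizes = {-1: None}: one amax per row (per token).
per_token = [block_amax(row, None) for row in tensor]
```

The 4 x 64 tensor has 256 elements, so a block size of 32 yields 256 / 32 = 8 scale factors.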

field calibrator: str | Callable | tuple

The calibrator can be a string from ["max", "histogram"] or a constructor to create a calibrator which subclasses _Calibrator. See standardize_constructor_args for more information on how to specify the constructor.

field enable: bool

If True, enables the quantizer. If False, bypasses the quantizer and returns the input tensor.

field fake_quant: bool

If True, enable fake quantization.

field learn_amax: bool

learn_amax is deprecated and reserved for backward compatibility.

field narrow_range: bool

If True, enable narrow range quantization. Used only for integer quantization.

field num_bits: int | tuple[int, int]

num_bits can be:

- A positive integer argument for integer quantization. num_bits specifies the number of bits used for integer quantization.
- A constant integer tuple (E, M) for floating point quantization emulating Nvidia’s FPx quantization. E is the number of exponent bits and M is the number of mantissa bits. Supported FPx quantization formats: FP8 (E4M3, E5M2), FP6 (E3M2, E2M3), FP4 (E2M1).
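The two forms of num_bits can be told apart with a small helper. The function describe_num_bits is hypothetical and is not part of Model Optimizer. It relies only on the fact that a floating point format with E exponent bits and M mantissa bits uses 1 + E + M bits in total, including the sign bit.

```python
def describe_num_bits(num_bits):
    """Return a readable name for an integer or (E, M) num_bits value."""
    if isinstance(num_bits, int):
        return f"INT{num_bits}"
    exponent_bits, mantissa_bits = num_bits
    total_bits = 1 + exponent_bits + mantissa_bits  # sign + exponent + mantissa
    return f"FP{total_bits} (E{exponent_bits}M{mantissa_bits})"


int8_name = describe_num_bits(8)
fp8_name = describe_num_bits((4, 3))
fp6_name = describe_num_bits((3, 2))
fp4_name = describe_num_bits((2, 1))
```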

field rotate: bool

If True, the input of the quantizer will be rotated with a Hadamard matrix given by scipy.linalg.hadamard, i.e. input = input @ scipy.linalg.hadamard(input.shape[-1]) / sqrt(input.shape[-1]).

This can be used for rotation-based PTQ methods, e.g. QuaRot or SpinQuant. See https://arxiv.org/abs/2404.00456 for an example.
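The rotation formula can be reproduced without SciPy for power-of-two sizes. The sketch below builds the Hadamard matrix with the Sylvester construction and applies the normalized rotation to a single vector. The helper names hadamard and rotate are local to this sketch, and non-power-of-two sizes are not handled.

```python
import math


def hadamard(n):
    """Sylvester-construction Hadamard matrix; n must be a power of two."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = [[1]]
    while len(h) < n:
        # H_2k = [[H_k, H_k], [H_k, -H_k]]
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    return h


def rotate(vector):
    """Compute vector @ hadamard(n) / sqrt(n) for a 1D input."""
    n = len(vector)
    h = hadamard(n)
    norm = math.sqrt(n)
    return [sum(vector[i] * h[i][j] for i in range(n)) / norm for j in range(n)]


x = [1.0, -2.0, 0.5, 3.0]
rotated = rotate(x)
restored = rotate(rotated)
```

Because the normalized Hadamard matrix is symmetric and orthogonal, applying the rotation twice recovers the input and the vector norm is preserved.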

field trt_high_precision_dtype: str

The value is a string from ["Float", "Half", "BFloat16"]. The QDQs will be assigned the appropriate data type, and this variable will only be used when the user is exporting the quantized ONNX model.

Constraints:

pattern = ^Float$|^Half$|^BFloat16$

field type: str

The value is a string from ["static", "dynamic"]. If "dynamic", dynamic quantization will be enabled which does not collect any statistics during calibration.

Constraints:

pattern = ^static$|^dynamic$

field unsigned: bool

If True, enable unsigned quantization. Used only for integer quantization.

ModeloptConfig SVDQuantConfig

Bases: QuantizeAlgorithmConfig

The config for SVDQuant.

Refer to the SVDQuant paper for more details.

Default config (JSON):

{ "method": "svdquant", "lowrank": 32 }

field lowrank: int | None

Specifies the rank of the LoRA used in the SVDQuant method, which captures outliers from the original weights.

field method: Literal['svdquant']

ModeloptConfig SmoothQuantCalibConfig

Bases: QuantizeAlgorithmConfig

The config for smoothquant algorithm (SmoothQuant).

SmoothQuant applies a smoothing factor which balances the scale of outliers in weights and activations. See SmoothQuant paper for more details.

Default config (JSON):

{ "method": "smoothquant", "alpha": 1.0 }

field alpha: float | None

This hyper-parameter controls the migration strength. The migration strength is within [0, 1]; a larger value migrates more quantization difficulty to weights.

Constraints:

ge = 0.0
le = 1.0
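The role of alpha can be illustrated with the smoothing factor from the SmoothQuant paper, s_j = max|X_j|^alpha / max|W_j|^(1 - alpha), computed per input channel j. The sketch below is a pure-Python illustration with a hypothetical function name. Whether Model Optimizer computes the factor in exactly this way is an assumption.

```python
def smoothing_factors(act_amax, weight_amax, alpha=1.0):
    """Per-channel smoothing factors following the SmoothQuant formulation."""
    return [
        a ** alpha / w ** (1.0 - alpha)
        for a, w in zip(act_amax, weight_amax)
    ]


# Per-input-channel max absolute values of activations and weights.
act_amax = [16.0, 1.0]
weight_amax = [1.0, 4.0]
s = smoothing_factors(act_amax, weight_amax, alpha=0.5)

# Dividing activations by s and multiplying weights by s leaves the output unchanged.
x = [8.0, -1.0]
w = [0.5, 2.0]
original = sum(xj * wj for xj, wj in zip(x, w))
smoothed = sum((xj / sj) * (wj * sj) for xj, wj, sj in zip(x, w, s))
```

With alpha = 1.0, the default listed above, the factor reduces to the activation amax, so the activation range is fully absorbed into the weights.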

field method: Literal['smoothquant']

need_calibration(config)

Check if calibration is needed for the given config.