config — Model Optimizer 0.31.0 (original) (raw)
This document lists the quantization formats supported by Model Optimizer and example quantization configs.
Quantization Formats
The following table lists the quantization formats supported by Model Optimizer and the corresponding quantization config. See Quantization Configs for the specific quantization config definitions.
Please see choosing the right quantization formats to learn more about the formats and their use-cases.
Note
The recommended configs given below are for LLM models. For CNN models, only INT8 quantization is supported. Please use the quantization config INT8_DEFAULT_CFG for CNN models.
Quantization Format | Model Optimizer config |
---|---|
INT8 | INT8_SMOOTHQUANT_CFG |
FP8 | FP8_DEFAULT_CFG |
INT4 Weights only AWQ (W4A16) | INT4_AWQ_CFG |
INT4-FP8 AWQ (W4A8) | W4A8_AWQ_BETA_CFG |
Quantization Configs
A quantization config is a dictionary specifying the values for the keys "quant_cfg" and "algorithm". The "quant_cfg" key specifies the quantization configurations. The "algorithm" key specifies the algorithm argument to calibrate. Please see QuantizeConfig for the quantization config definition.
‘Quantization configurations’ is a dictionary mapping wildcards or filter functions to their ‘quantizer attributes’. The wildcards or filter functions are matched against the quantizer module names. The quantizer modules have names ending with weight_quantizer and input_quantizer, and they perform weight quantization and input quantization (or activation quantization) respectively. The quantizer modules are generally instances of TensorQuantizer. The quantizer attributes are defined by QuantizerAttributeConfig. See QuantizerAttributeConfig for details on the quantizer attributes and their values.
The key “default” from the quantization configuration dictionary is applied if no other wildcard or filter functions match the quantizer module name.
The quantizer attributes are applied in the order they are specified. For the missing attributes, the default attributes as defined by QuantizerAttributeConfig are used.
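The matching rules above can be illustrated with a small standalone sketch. It is not Model Optimizer's implementation: it assumes shell-style wildcard matching (Python's fnmatch), lets later matching entries override earlier ones, and falls back to "default" only when nothing else matches. The helper name resolve_attributes and the quantizer names are hypothetical.

```python
from fnmatch import fnmatchcase


def resolve_attributes(quantizer_name, quant_cfg):
    """Merge the attributes of every entry whose key matches quantizer_name.

    Illustrative only: keys are wildcard strings or filter functions, entries
    are visited in insertion order, and "default" is used as a fallback.
    """
    attrs, matched = {}, False
    for key, cfg in quant_cfg.items():
        if key == "default":
            continue
        hit = key(quantizer_name) if callable(key) else fnmatchcase(quantizer_name, key)
        if hit:
            attrs.update(cfg)  # later entries override earlier ones
            matched = True
    if not matched:
        attrs.update(quant_cfg.get("default", {}))
    return attrs


quant_cfg = {
    "*weight_quantizer": {"num_bits": 8, "axis": 0},
    "*input_quantizer": {"num_bits": 8, "axis": None},
    # Filter function: disable every quantizer under a hypothetical "lm_head" layer
    (lambda name: name.startswith("lm_head.")): {"enable": False},
    "default": {"enable": False},
}

weight_attrs = resolve_attributes("layers.0.mlp.weight_quantizer", quant_cfg)
head_attrs = resolve_attributes("lm_head.input_quantizer", quant_cfg)
other_attrs = resolve_attributes("layers.0.attn.k_bmm_quantizer", quant_cfg)
print(weight_attrs)  # {'num_bits': 8, 'axis': 0}
print(head_attrs)    # {'num_bits': 8, 'axis': None, 'enable': False}
print(other_attrs)   # {'enable': False}
```

Attributes that are still missing after this merge would then take the defaults defined by QuantizerAttributeConfig.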
Quantizer attributes can also be a list of dictionaries. In this case, the matched quantizer module is replaced with a SequentialQuantizer module which is used to quantize a tensor in multiple formats sequentially. Each quantizer attribute dictionary in the list specifies the quantization formats for each quantization step of the sequential quantizer. For example, SequentialQuantizer is used in ‘INT4 Weights, FP8 Activations’ quantization in which the weights are quantized in INT4 followed by FP8.
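As an illustration of the list form, a W4A8-style entry could look like the sketch below. The block size of 128, the "awq_lite" algorithm choice, and the exact attribute values are assumptions chosen for illustration; consult W4A8_AWQ_BETA_CFG for the shipped definition.

```python
# Hypothetical config: weights go through two quantization steps, activations one.
SEQUENTIAL_W4A8_SKETCH = {
    "quant_cfg": {
        "*weight_quantizer": [
            # Step 1: INT4 with one scale per block of 128 elements on the last axis
            {"num_bits": 4, "block_sizes": {-1: 128, "type": "static"}, "enable": True},
            # Step 2: FP8 (E4M3) per-tensor quantization of the INT4 result
            {"num_bits": (4, 3), "axis": None, "enable": True},
        ],
        "*input_quantizer": {"num_bits": (4, 3), "axis": None, "enable": True},
        "default": {"enable": False},
    },
    "algorithm": "awq_lite",
}

steps = SEQUENTIAL_W4A8_SKETCH["quant_cfg"]["*weight_quantizer"]
print(len(steps), [step["num_bits"] for step in steps])  # 2 [4, (4, 3)]
```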
In addition, the dictionary entries can also be PyTorch module class names that map to class-specific quantization configurations. The PyTorch modules should have a quantized equivalent.
To get the string representation of a module class, do:
import torch.nn as nn
from modelopt.torch.quantization import QuantModuleRegistry
# Get the class name for nn.Conv2d
class_name = QuantModuleRegistry.get_key(nn.Conv2d)
Here is an example of a quantization config:
MY_QUANT_CFG = {
    "quant_cfg": {
        # Quantizer wildcard strings mapping to quantizer attributes
        "*weight_quantizer": {"num_bits": 8, "axis": 0},
        "*input_quantizer": {"num_bits": 8, "axis": None},
        # Module class names mapping to quantizer configurations
        "nn.LeakyReLU": {"*input_quantizer": {"enable": False}},
    }
}
Example Quantization Configurations
These example configs can be accessed as attributes of modelopt.torch.quantization and can be given as input to mtq.quantize(). For example:
import modelopt.torch.quantization as mtq
model = mtq.quantize(model, mtq.INT8_DEFAULT_CFG, forward_loop)
You can also create your own config by following these examples. For instance, if you want to quantize a model with the INT4 AWQ algorithm but need to skip quantizing the layer named lm_head, you can create a custom config and quantize your model as follows:
import copy
# Create custom config
CUSTOM_INT4_AWQ_CFG = copy.deepcopy(mtq.INT4_AWQ_CFG)
CUSTOM_INT4_AWQ_CFG["quant_cfg"]["*lm_head*"] = {"enable": False}
# Quantize model
model = mtq.quantize(model, CUSTOM_INT4_AWQ_CFG, forward_loop)
Functions
need_calibration | Check if calibration is needed for the given config. |
---|
ModeloptConfig AWQClipCalibConfig
Bases: QuantizeAlgorithmConfig
The config for awq_clip (AWQ clip) algorithm.
AWQ clip searches for the clipped amax for per-group quantization. This search requires much more compute than AWQ lite. To avoid OOM, the linear layer weights are batched along the out_features dimension with a batch size of max_co_batch_size. AWQ clip calibration also takes longer than AWQ lite.
Default config (JSON):
{ "method": "awq_clip", "max_co_batch_size": 1024, "max_tokens_per_batch": 64, "min_clip_ratio": 0.5, "shrink_step": 0.05, "debug": false }
field debug: bool | None
If True, the module’s search metadata will be kept as a module attribute named awq_clip.
field max_co_batch_size: int | None
Reduce this number if a CUDA out-of-memory error occurs.
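The batching idea can be sketched as follows, assuming the weight is processed in row ranges along out_features. The helper name iter_out_feature_batches is hypothetical, and the real implementation operates on tensors.

```python
def iter_out_feature_batches(out_features, max_co_batch_size):
    """Yield (start, end) row ranges covering out_features in chunks."""
    for start in range(0, out_features, max_co_batch_size):
        yield start, min(start + max_co_batch_size, out_features)


# A layer with 2500 output channels and the default batch size of 1024
batches = list(iter_out_feature_batches(2500, 1024))
print(batches)  # [(0, 1024), (1024, 2048), (2048, 2500)]
```

A smaller max_co_batch_size produces more, smaller chunks, which lowers the peak memory needed by each search step at the cost of more iterations.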
field max_tokens_per_batch: int | None
The total number of tokens used for clip search is max_tokens_per_batch * number of batches. The original AWQ uses a total of 512 tokens to search for clip values.
field method: Literal['awq_clip']
field min_clip_ratio: float | None
It should be in (0, 1.0). Clip will search for the optimal clipping value in the range [original block amax * min_clip_ratio, original block amax].
Constraints:
- gt = 0.0
- lt = 1.0
field shrink_step: float | None
It should be in the range (0, 1.0]. The clip ratio will be searched from min_clip_ratio to 1 with the step size specified.
Constraints:
- gt = 0.0
- le = 1.0
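How min_clip_ratio and shrink_step define the search grid can be sketched in plain Python. This is a simplified illustration, not the library's implementation: it assumes symmetric INT4 fake quantization of a single weight block and a squared output error as the search objective, and all function names are hypothetical.

```python
def clip_ratio_candidates(min_clip_ratio=0.5, shrink_step=0.05):
    """Ratios from min_clip_ratio up to 1.0 in increments of shrink_step."""
    n_steps = int((1.0 - min_clip_ratio) / shrink_step + 1e-9)
    return [min_clip_ratio + i * shrink_step for i in range(n_steps + 1)]


def fake_quant(values, amax, num_bits=4):
    """Symmetric integer fake quantization of a list of floats."""
    qmax = 2 ** (num_bits - 1) - 1
    if amax == 0:
        return [0.0 for _ in values]
    scale = amax / qmax
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def search_clip_ratio(block, inputs, min_clip_ratio=0.5, shrink_step=0.05):
    """Pick the clip ratio that minimizes the output error of one weight block."""
    orig_amax = max(abs(w) for w in block)
    reference = [dot(block, x) for x in inputs]
    best_ratio, best_err = None, float("inf")
    for ratio in clip_ratio_candidates(min_clip_ratio, shrink_step):
        quantized = fake_quant(block, orig_amax * ratio)
        err = sum((dot(quantized, x) - ref) ** 2 for x, ref in zip(inputs, reference))
        if err < best_err:
            best_ratio, best_err = ratio, err
    return best_ratio, best_err


block = [0.9, -0.12, 0.07, 0.31, -0.05, 0.22, -0.18, 0.03]  # one outlier weight
inputs = [[0.1, 1.0, -0.8, 0.5, 0.9, -0.7, 0.6, 1.1],
          [0.0, -0.9, 1.2, 0.4, -1.0, 0.8, -0.5, 0.7]]
candidates = clip_ratio_candidates()
ratio, err = search_clip_ratio(block, inputs)
print(f"{len(candidates)} candidates, best clip ratio: {ratio:.2f}, error: {err:.5f}")
```

With the defaults (min_clip_ratio 0.5, shrink_step 0.05) the grid has 11 candidates, so a smaller shrink_step makes the search finer but proportionally more expensive.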
ModeloptConfig AWQFullCalibConfig
Bases: AWQLiteCalibConfig, AWQClipCalibConfig
The config for awq or awq_full algorithm (AWQ full).
AWQ full performs awq_lite followed by awq_clip.
Default config (JSON):
{ "method": "awq_full", "max_co_batch_size": 1024, "max_tokens_per_batch": 64, "min_clip_ratio": 0.5, "shrink_step": 0.05, "debug": false, "alpha_step": 0.1 }
field debug: bool | None
If True, the module’s search metadata will be kept as module attributes named awq_lite and awq_clip.
field method: Literal['awq_full']
ModeloptConfig AWQLiteCalibConfig
Bases: QuantizeAlgorithmConfig
The config for awq_lite (AWQ lite) algorithm.
AWQ lite applies a channel-wise scaling factor which minimizes the output difference after quantization. See AWQ paper for more details.
Default config (JSON):
{ "method": "awq_lite", "alpha_step": 0.1, "debug": false }
field alpha_step: float | None
The alpha will be searched from 0 to 1 with the step size specified.
Constraints:
- gt = 0.0
- le = 1.0
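The alpha grid search can be sketched as below. This is a minimal illustration under stated assumptions, not the library's code: it follows the AWQ paper's idea of a per-input-channel scale act_scale ** alpha, uses per-row symmetric INT4 fake quantization instead of per-group quantization, and measures a squared output error. All names are hypothetical, and the formula Model Optimizer uses for the scale may differ.

```python
def fake_quant_row(row, num_bits=4):
    """Symmetric per-row integer fake quantization."""
    qmax = 2 ** (num_bits - 1) - 1
    amax = max(abs(v) for v in row)
    if amax == 0:
        return list(row)
    scale = amax / qmax
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in row]


def output_error(weight, scales, inputs):
    """Squared error between W @ x and Q(W * s) @ (x / s)."""
    scaled = [[w * s for w, s in zip(row, scales)] for row in weight]
    quantized = [fake_quant_row(row) for row in scaled]
    err = 0.0
    for x in inputs:
        x_scaled = [xi / s for xi, s in zip(x, scales)]
        for row, q_row in zip(weight, quantized):
            ref = sum(w * xi for w, xi in zip(row, x))
            out = sum(q * xi for q, xi in zip(q_row, x_scaled))
            err += (out - ref) ** 2
    return err


def search_alpha(weight, inputs, alpha_step=0.1):
    """Grid-search alpha in [0, 1]; the per-channel scale is act_scale ** alpha."""
    n_in = len(weight[0])
    act_scale = [sum(abs(x[i]) for x in inputs) / len(inputs) for i in range(n_in)]
    n_steps = int(1.0 / alpha_step + 1e-9)
    best_alpha, best_err = None, float("inf")
    for step in range(n_steps + 1):
        alpha = step * alpha_step
        scales = [max(a, 1e-4) ** alpha for a in act_scale]
        err = output_error(weight, scales, inputs)
        if err < best_err:
            best_alpha, best_err = alpha, err
    return best_alpha, best_err


weight = [[0.8, -0.05, 0.3, 0.1], [-0.2, 0.6, -0.07, 0.4]]   # shape (out, in)
inputs = [[0.1, 6.0, -0.2, 0.3], [-0.3, -5.0, 0.1, 0.2], [0.2, 7.0, 0.4, -0.1]]
best_alpha, best_err = search_alpha(weight, inputs)
print(f"best alpha: {best_alpha:.1f}, squared output error: {best_err:.6f}")
```

Because alpha = 0 gives a scale of 1 for every channel, the search result is never worse than plain quantization on the calibration inputs.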
field debug: bool | None
If True, the module’s search metadata will be kept as a module attribute named awq_lite.
field method: Literal['awq_lite']
ModeloptConfig CompressConfig
Bases: ModeloptBaseConfig
Default configuration for compress mode.
Default config (JSON):
{ "compress": { "*": true } }
field compress: dict[str, bool]
ModeloptConfig MaxCalibConfig
Bases: QuantizeAlgorithmConfig
The config for max calibration algorithm.
Max calibration estimates the max values of activations or weights and uses these max values to set the quantization scaling factor. See Integer Quantization for the concepts.
Default config (JSON):
{ "method": "max", "distributed_sync": true }
field distributed_sync: bool | None
If True, the amax will be synced across the distributed processes.
field method: Literal['max']
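The max calibration described above can be sketched in a few lines of plain Python. This is an illustrative approximation, assuming symmetric INT8 quantization with scale amax / 127 and modelling the distributed sync as a maximum over per-process values; the function names are hypothetical.

```python
def calibrate_amax(batches):
    """Track the largest absolute value seen across all calibration batches."""
    amax = 0.0
    for batch in batches:
        amax = max(amax, max(abs(v) for v in batch))
    return amax


def sync_amax(per_process_amax):
    """With distributed_sync, every process ends up with the global maximum."""
    return max(per_process_amax)


def fake_quant(values, amax, num_bits=8):
    """Quantize to signed integers with scale amax / qmax, then dequantize."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for INT8
    scale = amax / qmax
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


batches = [[0.2, -1.5, 0.7], [2.54, -0.3, 0.1]]
amax = calibrate_amax(batches)
synced = sync_amax([amax, 3.0, 1.2])
dequantized = fake_quant([2.54, -1.0, 0.009], amax)
print(amax)    # 2.54
print(synced)  # 3.0
print(dequantized)
```

Values far below the scale (here 0.009 against a step of about 0.02) round to zero, which is why a single outlier that inflates amax can hurt accuracy.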
ModeloptConfig QuantizeAlgorithmConfig
Bases: ModeloptBaseConfig
Calibration algorithm config base.
Default config (JSON):
field method: Literal[None]
ModeloptConfig QuantizeConfig
Bases: ModeloptBaseConfig
Default configuration for quantize mode.
Default config (JSON):
{ "quant_cfg": { "default": { "num_bits": 8, "axis": null } }, "algorithm": "max" }
field algorithm: str | dict | QuantizeAlgorithmConfig | None | list[str | dict | QuantizeAlgorithmConfig | None]
field quant_cfg: dict[str | Callable, QuantizerAttributeConfig | list[QuantizerAttributeConfig] | dict[str | Callable, QuantizerAttributeConfig | list[QuantizerAttributeConfig]]]
ModeloptConfig QuantizerAttributeConfig
Bases: ModeloptBaseConfig
Quantizer attribute type.
Default config (JSON):
{ "enable": true, "num_bits": 8, "axis": null, "fake_quant": true, "unsigned": false, "narrow_range": false, "learn_amax": false, "type": "static", "block_sizes": null, "bias": null, "trt_high_precision_dtype": "Float", "calibrator": "max", "rotate": false }
field axis: int | tuple[int, ...] | None
This field is for static per-channel quantization. It cannot coexist with block_sizes. You should set axis if you want a fixed shape for the scale factor.
For example, if axis is set to None, the scale factor will be a scalar (per-tensor quantization); if axis is set to 0, the scale factor will be a vector of shape (dim0,) (per-channel quantization); if axis is set to (-2, -1), the scale factor will be a tensor of shape (dim-2, dim-1).
The axis value must be in the range [-rank(input_tensor), rank(input_tensor)).
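The effect of axis on the shape of the scale factor can be sketched for a 2-D tensor. The helper amax_for_axis is hypothetical and handles only a single integer axis or None; it is meant to show the reduction pattern, not the library's code.

```python
def amax_for_axis(tensor, axis):
    """Reduce |tensor| to the amax shape implied by `axis` (2-D tensors only)."""
    if axis is None:
        # Per-tensor: a single scalar for the whole tensor
        return max(abs(v) for row in tensor for v in row)
    rank = 2
    if not -rank <= axis < rank:
        raise ValueError(f"axis must be in [{-rank}, {rank})")
    axis %= rank
    if axis == 0:
        # Per-channel along dim 0: one value per row
        return [max(abs(v) for v in row) for row in tensor]
    # Per-channel along dim 1: one value per column
    return [max(abs(v) for v in col) for col in zip(*tensor)]


weight = [[0.5, -2.0, 1.0],
          [0.1, 0.3, -0.2]]
print(amax_for_axis(weight, None))  # 2.0
print(amax_for_axis(weight, 0))     # [2.0, 0.3]
print(amax_for_axis(weight, -1))    # [0.5, 2.0, 1.0]
```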
field bias: dict[int | str, Literal['static', 'dynamic'] | Literal['mean', 'max_min'] | tuple[int, ...] | bool | int | None] | None
Configuration for bias handling in affine quantization. The keys are:
- "enable": Boolean to enable/disable bias handling, default is False
- "type": Specify the type of bias ["static", "dynamic"], default is "static"
- "method": Specify the method of bias calibration ["mean", "max_min"], default is "mean"
- "axis": Tuple of integers specifying axes for bias computation, default is None
Examples:
bias = {"enable": True}
bias = {"enable": True, "type": "static", "axis": -1}
bias = {"enable": True, "type": "dynamic", "axis": (-1, -3)}
field block_sizes: dict[int | str, int | tuple[int, int] | str | dict[int, int] | None] | None
This field is for static or dynamic block quantization. It cannot coexist with axis. You should set block_sizes if you want a fixed number of elements to share every scale factor.
The keys are the axes for block quantization and the values are block sizes for quantization along the respective axes. Keys must be in the range [-tensor.dim(), tensor.dim()). Values, which are the block sizes for quantization, must be positive integers or None. A positive block size specifies the block size for quantization along that axis. None means that the block size will be the maximum possible size in that dimension; this is useful for specifying certain quantization formats such as per-token dynamic quantization, which has the amax shared along the last dimension.
In addition, there can be special string keys "type", "scale_bits" and "scale_block_sizes".
Key "type" should map to "dynamic" or "static", where "dynamic" indicates dynamic block quantization and "static" indicates static calibrated block quantization. By default, the type is "static".
Key "scale_bits" specifies the quantization bits for the per-block quantization scale factor (i.e., a double quantization scheme).
Key "scale_block_sizes" specifies the block size for double quantization. By default, the per-block quantization scale is not quantized.
For example, block_sizes = {-1: 32} will quantize the last axis of the input tensor in blocks of size 32 with static calibration, with a total of numel(tensor) / 32 scale factors. block_sizes = {-1: 32, "type": "dynamic"} will perform dynamic block quantization. block_sizes = {-1: None, "type": "dynamic"} can be used to specify per-token dynamic quantization.
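The scale-factor counts in these examples can be checked with a small sketch. It assumes a 2-D tensor blocked along its last axis and requires the last dimension to be divisible by the block size, which is a simplification made here; block_amax is a hypothetical helper.

```python
def block_amax(tensor, block_size):
    """amax per block along the last axis of a 2-D tensor.

    block_size=None means one block spanning the whole last dimension.
    """
    result = []
    for row in tensor:
        size = len(row) if block_size is None else block_size
        if len(row) % size != 0:
            raise ValueError("last dimension must be divisible by the block size")
        result.append([max(abs(v) for v in row[i:i + size])
                       for i in range(0, len(row), size)])
    return result


# 4 tokens x 64 features, filled with a simple deterministic pattern
tensor = [[(t + 1) * ((f % 7) - 3) / 10 for f in range(64)] for t in range(4)]

per_block = block_amax(tensor, 32)    # like block_sizes = {-1: 32}
per_token = block_amax(tensor, None)  # like block_sizes = {-1: None}

numel = sum(len(row) for row in tensor)
n_block_scales = sum(len(r) for r in per_block)
n_token_scales = sum(len(r) for r in per_token)
print(n_block_scales, numel // 32)  # 8 8
print(n_token_scales)               # 4
```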
field calibrator: str | Callable | tuple
The calibrator can be a string from ["max", "histogram"] or a constructor to create a calibrator which subclasses _Calibrator. See standardize_constructor_args for more information on how to specify the constructor.
field enable: bool
If True, enables the quantizer. If False, bypasses the quantizer and returns the input tensor.
field fake_quant: bool
If True, enable fake quantization.
field learn_amax: bool
learn_amax is deprecated and reserved for backward compatibility.
field narrow_range: bool
If True, enable narrow range quantization. Used only for integer quantization.
field num_bits: int | tuple[int, int]
num_bits can be:
- A positive integer for integer quantization. num_bits specifies the number of bits used for integer quantization.
- A constant integer tuple (E, M) for floating point quantization emulating NVIDIA’s FPx quantization. E is the number of exponent bits and M is the number of mantissa bits. Supported FPx quantization formats: FP8 (E4M3, E5M2), FP6 (E3M2, E2M3), FP4 (E2M1).
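To make the (E, M) notation concrete, the sketch below enumerates the non-negative values of a small floating point format. It assumes an IEEE-like layout with exponent bias 2^(E-1) - 1, subnormals, and no encodings reserved for infinity or NaN. That assumption matches the FP4 and FP6 element formats of the OCP Microscaling specification, but not FP8 E4M3 or E5M2, which reserve some encodings. The function name fp_grid is hypothetical.

```python
def fp_grid(exponent_bits, mantissa_bits):
    """Non-negative values of a tiny float format without inf/NaN encodings."""
    bias = 2 ** (exponent_bits - 1) - 1
    values = set()
    for e in range(2 ** exponent_bits):
        for m in range(2 ** mantissa_bits):
            frac = m / 2 ** mantissa_bits
            if e == 0:
                values.add(frac * 2.0 ** (1 - bias))        # subnormal
            else:
                values.add((1 + frac) * 2.0 ** (e - bias))  # normal
    return sorted(values)


print(fp_grid(2, 1))       # FP4 E2M1: [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
print(max(fp_grid(2, 3)))  # FP6 E2M3 maximum: 7.5
print(max(fp_grid(3, 2)))  # FP6 E3M2 maximum: 28.0
```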
field rotate: bool
If True, the input of the quantizer will be rotated with a Hadamard matrix given by scipy.linalg.hadamard, i.e. input = input @ scipy.linalg.hadamard(input.shape[-1]) / sqrt(input.shape[-1]).
This can be used for rotation-based PTQ methods, e.g. QuaRot or SpinQuant. See https://arxiv.org/abs/2404.00456 for example.
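The rotation formula can be reproduced without SciPy for a 1-D input. The sketch below builds the Hadamard matrix with Sylvester's construction for power-of-two sizes; the helper names are hypothetical, and it is only meant to show why the rotation helps with outliers.

```python
import math


def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of two)."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = [[1]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    return h


def rotate(x):
    """Compute x @ H / sqrt(n) for a 1-D input x of length n."""
    n = len(x)
    h = hadamard(n)
    return [sum(x[i] * h[i][j] for i in range(n)) / math.sqrt(n) for j in range(n)]


x = [8.0, 0.1, -0.2, 0.1]  # one large outlier
y = rotate(x)
print(y)          # the outlier's energy is spread over all entries
print(rotate(y))  # rotating twice recovers x, up to rounding
```

The scaled matrix is orthogonal, so the rotation preserves the vector norm while reducing the largest absolute value, which makes the rotated tensor easier to quantize with a single scale.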
field trt_high_precision_dtype: str
The value is a string from ["Float", "Half", "BFloat16"]. The QDQs will be assigned the appropriate data type, and this variable will only be used when the user is exporting the quantized ONNX model.
Constraints:
- pattern = ^Float$|^Half$|^BFloat16$
field type: str
The value is a string from ["static", "dynamic"]. If "dynamic", dynamic quantization will be enabled, which does not collect any statistics during calibration.
Constraints:
- pattern = ^static$|^dynamic$
field unsigned: bool
If True, enable unsigned quantization. Used only for integer quantization.
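How num_bits, unsigned and narrow_range interact for integer quantization can be sketched with the conventional range definitions. The bounds Model Optimizer uses internally are an assumption here, and int_bounds is a hypothetical helper.

```python
def int_bounds(num_bits, unsigned=False, narrow_range=False):
    """Conventional integer range for a quantizer with the given attributes."""
    max_bound = 2 ** (num_bits - 1 + int(unsigned)) - 1
    if unsigned:
        min_bound = 0
    elif narrow_range:
        min_bound = -max_bound
    else:
        min_bound = -max_bound - 1
    return min_bound, max_bound


print(int_bounds(8))                     # (-128, 127)
print(int_bounds(8, narrow_range=True))  # (-127, 127)
print(int_bounds(8, unsigned=True))      # (0, 255)
print(int_bounds(4))                     # (-8, 7)
```

Narrow range drops the most negative code so that the range is symmetric around zero, which is convenient for symmetric scale-only quantization.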
ModeloptConfig SVDQuantConfig
Bases: QuantizeAlgorithmConfig
The config for SVDQuant.
Refer to the SVDQuant paper for more details.
Default config (JSON):
{ "method": "svdquant", "lowrank": 32 }
field lowrank: int | None
Specifies the rank of the LoRA used in the SVDQuant method, which captures outliers from the original weights.
field method: Literal['svdquant']
ModeloptConfig SmoothQuantCalibConfig
Bases: QuantizeAlgorithmConfig
The config for smoothquant algorithm (SmoothQuant).
SmoothQuant applies a smoothing factor which balances the scale of outliers in weights and activations. See SmoothQuant paper for more details.
Default config (JSON):
{ "method": "smoothquant", "alpha": 1.0 }
field alpha: float | None
This hyper-parameter controls the migration strength. The migration strength is within [0, 1]; a larger value migrates more quantization difficulty to weights.
Constraints:
- ge = 0.0
- le = 1.0
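The role of alpha can be illustrated with the smoothing factor from the SmoothQuant paper, s_j = max|X_j|^alpha / max|W_j|^(1 - alpha), computed per input channel j. The sketch below uses plain Python lists and hypothetical helper names; it is a minimal illustration, not the library's implementation.

```python
def smoothing_factors(activations, weight, alpha=1.0):
    """s_j = max|X_j| ** alpha / max|W_j| ** (1 - alpha) per input channel j."""
    n_in = len(weight[0])
    factors = []
    for j in range(n_in):
        act_max = max(abs(x[j]) for x in activations)
        w_max = max(abs(row[j]) for row in weight)
        factors.append(act_max ** alpha / w_max ** (1 - alpha))
    return factors


def linear(x, weight):
    """y_o = sum_j x_j * W[o][j] for a single input vector x."""
    return [sum(xj * wj for xj, wj in zip(x, row)) for row in weight]


activations = [[0.5, 40.0, -1.0], [-0.2, -32.0, 0.8]]  # channel 1 has outliers
weight = [[0.3, 0.02, -0.5], [-0.1, 0.04, 0.25]]        # shape (out, in)

s = smoothing_factors(activations, weight, alpha=0.5)
smooth_x = [[xj / sj for xj, sj in zip(x, s)] for x in activations]
smooth_w = [[wj * sj for wj, sj in zip(row, s)] for row in weight]

print(linear(activations[0], weight))
print(linear(smooth_x[0], smooth_w))  # same output, flatter activations
```

Dividing activations by s and multiplying weights by s leaves the layer output unchanged. With alpha = 0.5 the per-channel maxima of activations and weights become equal, while alpha = 1.0 (the default shown above) moves the full activation range into the weights.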
field method: Literal['smoothquant']
need_calibration(config)
Check if calibration is needed for the given config.