Quantization — coremltools API Reference 8.0b1 documentation

coremltools.optimize.coreml.linear_quantize_weights(*args, **kwargs)[source]

Utility function to convert a float-precision MLModel of type mlprogram, which uses float-precision weights, into a compressed MLModel that uses n-bit weights (currently only n=4 and n=8 are supported). This is achieved by converting the float weight values stored in the const op into a constexpr_affine_dequantize or constexpr_blockwise_shift_scale op (based on the model's minimum deployment target).

This function uses linear quantization on the float weights, providing up to 2x (for 8-bit) or 4x (for 4-bit) savings in storage compared to float 16, and up to 4x or 8x savings compared to float 32. All computation at runtime uses float precision; the precision of the intermediate tensors and the compute precision of the ops are not altered.

For each weight, this utility function converts the weight into the int4/8 or uint4/8 type using either linear interpolation ("linear" mode) or linear symmetric interpolation ("linear_symmetric" mode, the default).

Linear interpolation

The following description uses 8-bit quantization for illustration; 4-bit quantization is analogous.

Linear interpolation ("linear" mode) maps the min/max of the float range to the 8-bit integer range [low, high] using a zero point (also called quantization bias, or offset) and a scale factor. For int8 quantization, [low, high] = [-128, 127], while uint8 quantization uses the range [0, 255].

"linear" mode uses the quantization formula:

\[w_r = s * (w_q - z)\]

Where:

* \(w_r\): the original float (real-valued) weight.
* \(w_q\): the quantized weight.
* \(s\): the scale (a float).
* \(z\): the zero point (an integer in the quantized range), which ensures that zero is exactly representable.

Quantized weights are computed as follows:

\[w_q = cast\_to\_8\_bit\_integer(w_r / s + cast\_to\_float(z))\]

Note: \(cast\_to\_8\_bit\_integer\) is the process of clipping the input to range [low, high] followed by rounding and casting to 8-bit integer.

In "linear" mode, s and z are computed by mapping the original float range [A, B] into the 8-bit integer range [-128, 127] or [0, 255]. That is, you are solving the following linear equations:

\[A = s * (low - z)\]

\[B = s * (high - z)\]

The equations result in the following:

\[s = (B - A) / (high - low)\]

\[z = cast\_to\_8\_bit\_integer((low * B - high * A) / (B - A))\]

When the rank of the weight w is 1, s and z are both scalars. When the rank of the weight is greater than 1, s and z are both vectors; in that case, scales are computed per channel, where "channel" refers to the output channel dimension, which corresponds to the first dimension for ops such as conv and linear, and the second dimension for the conv_transpose op.

For "linear" mode, \(A = min(w_r)\), \(B = max(w_r)\).
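As a concrete illustration, the "linear" mode formulas above can be sketched in NumPy. This is a toy sketch of the math, not the coremltools implementation:

```python
import numpy as np

def linear_quantize(w_r, low=-128, high=127):
    # Map the float range [A, B] onto the integer range [low, high].
    A, B = float(w_r.min()), float(w_r.max())
    s = (B - A) / (high - low)                 # scale
    z = round((low * B - high * A) / (B - A))  # zero point
    w_q = np.clip(np.round(w_r / s + z), low, high).astype(np.int8)
    return w_q, s, z

def dequantize(w_q, s, z):
    # w_r = s * (w_q - z)
    return s * (w_q.astype(np.float32) - z)

w = np.array([-1.0, -0.25, 0.0, 0.5, 2.0], dtype=np.float32)
w_q, s, z = linear_quantize(w)
w_hat = dequantize(w_q, s, z)
# Round-trip error is at most half a quantization step.
assert np.all(np.abs(w - w_hat) <= s / 2 + 1e-6)
```

Note how the float extremes map exactly onto the integer extremes: min(w) lands on -128 and max(w) on 127.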

Linear symmetric interpolation

With linear symmetric interpolation ("linear_symmetric" mode, the default), rather than mapping the exact min/max of the float range to the quantized range, the function chooses the maximum absolute value between the min/max, which results in a floating-point range that is symmetric with respect to zero. This also makes the resulting zero point 0 for int8 weight and 127 for uint8 weight.

For "linear_symmetric" mode: \(A = -R\) and \(B = R\), where \(R = max(abs(w_r))\). The quantized range is restricted to [-127, 127] for int8 and [0, 254] for uint8, so that the zero point is exactly 0 for int8 and 127 for uint8.
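The symmetric mode can be sketched similarly; again a toy illustration of the math rather than the library's code path:

```python
import numpy as np

def linear_symmetric_quantize(w_r):
    # Symmetric range [-R, R] maps onto [-127, 127], so the zero point is 0.
    R = float(np.abs(w_r).max())
    s = R / 127.0
    w_q = np.clip(np.round(w_r / s), -127, 127).astype(np.int8)
    return w_q, s

w = np.array([-0.5, 0.0, 1.5], dtype=np.float32)
w_q, s = linear_symmetric_quantize(w)
assert w_q[1] == 0                  # zero stays exactly zero
assert np.isclose(s * w_q[2], 1.5)  # max magnitude is reproduced exactly
```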

Parameters:

mlmodel: MLModel

Model to be quantized. This MLModel should be of type mlprogram.

config: OptimizationConfig

An OptimizationConfig object that specifies the parameters for weight quantization.

joint_compression: bool

Specification of whether or not to further compress the already-compressed input MLModel into a jointly compressed MLModel. See the blockwise_quantize_weights graph pass for information about which compression schemas can be further jointly quantized.

Take "palettize + quantize" as an example of joint compression, where the input MLModel is already palettized and the palettization's lookup table will be further quantized. In such an example, the weight values are represented by constexpr_blockwise_shift_scale + constexpr_lut_to_dense ops:

    lut(int8) -> constexpr_blockwise_shift_scale -> lut(fp16) -> constexpr_lut_to_dense -> dense(fp16)
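A hypothetical invocation of such a joint pipeline might look as follows. This is a usage sketch only: it assumes a model that has already been palettized and is saved at a made-up path, and it requires coremltools to run:

```python
import coremltools as ct
import coremltools.optimize as cto

# Assumption: "palettized_model.mlpackage" is a hypothetical path to an
# MLModel that was already palettized, so its lookup table can be
# further quantized jointly.
palettized_model = ct.models.MLModel("palettized_model.mlpackage")
config = cto.coreml.OptimizationConfig(
    global_config=cto.coreml.OpLinearQuantizerConfig(mode="linear_symmetric")
)
jointly_compressed = cto.coreml.linear_quantize_weights(
    palettized_model, config, joint_compression=True
)
```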

Returns:

model: MLModel

The quantized MLModel instance.

Examples

```python
import coremltools as ct
import coremltools.optimize as cto

model = ct.models.MLModel("my_model.mlpackage")
config = cto.coreml.OptimizationConfig(
    global_config=cto.coreml.OpLinearQuantizerConfig(mode="linear_symmetric")
)
compressed_model = cto.coreml.linear_quantize_weights(model, config)
```

coremltools.optimize.coreml.experimental.linear_quantize_activations(mlmodel: MLModel, config: OptimizationConfig, sample_data: List[Dict[str | None, ndarray]])[source]

Utility function to convert a float-precision MLModel of type mlprogram, which uses float-precision activations, into a compressed MLModel that uses n-bit activations. Currently, only n=8 is supported.

This is achieved by feeding real sample data into the input MLModel, calibrating the resulting float activation values, converting the calibrated values into quantize and dequantize op pairs, and inserting those op pairs into the new MLModel instance where activations get quantized.

Use this function with linear_quantize_weights for 8-bit activation and 8-bit weight linear quantization. It is also compatible with other weight compression methods.

Parameters:

mlmodel: MLModel

Model to be quantized. This MLModel should be of type mlprogram.

config: OptimizationConfig

An OptimizationConfig object that specifies the parameters for activation quantization.

sample_data: List

Data used to characterize the statistics of the activation values of the original float-precision model. Expects a list of sample input dictionaries in the same format as the data used in the mlmodel's .predict method. More specifically, input names need to be specified in the data, unless it is a single-input model, in which case the name is inferred automatically.
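For instance, sample_data for a hypothetical single-input model whose input is named "input_image" could be built like this (the input name, shape, and sample count are assumptions; match your model's actual input description):

```python
import numpy as np

# A handful of representative samples; each dict maps input name -> array,
# exactly as for MLModel.predict. "input_image" and the shape are assumptions.
sample_data = [
    {"input_image": np.random.rand(1, 3, 224, 224).astype(np.float32)}
    for _ in range(10)
]
assert len(sample_data) == 10
assert sample_data[0]["input_image"].dtype == np.float32
```

Real calibration data drawn from the model's actual input distribution will characterize the activation ranges better than random arrays.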

Returns:

model: MLModel

The activation quantized MLModel instance.

Examples

```python
import coremltools as ct
import coremltools.optimize as cto

model = ct.models.MLModel("my_model.mlpackage")
activation_config = cto.coreml.OptimizationConfig(
    global_config=cto.coreml.experimental.OpActivationLinearQuantizerConfig(
        mode="linear_symmetric"
    )
)
compressed_model_a8 = cto.coreml.experimental.linear_quantize_activations(
    model, activation_config, sample_data
)
```

(Optional) It is recommended to combine this with linear_quantize_weights:

```python
weight_config = cto.coreml.OptimizationConfig(
    global_config=cto.coreml.OpLinearQuantizerConfig(mode="linear_symmetric")
)
compressed_model_w8a8 = cto.coreml.linear_quantize_weights(
    compressed_model_a8, weight_config
)
```

class coremltools.optimize.coreml.OpLinearQuantizerConfig(mode: str = 'linear_symmetric', dtype: str | type = types.int8, granularity: str | CompressionGranularity = CompressionGranularity.PER_CHANNEL, block_size: int | List[int] | Tuple[int, ...] = 32, weight_threshold: int | None = 2048)[source]

Parameters:

mode: str

Mode for linear quantization:

* "linear_symmetric" (default): the float range is symmetric about zero, [-R, R] with \(R = max(abs(w_r))\).
* "linear": the float range is the exact [min(w_r), max(w_r)].

dtype: str or np.generic or mil.type

Determines the quantized data type (int8/uint8/int4/uint4).

granularity: str

Granularity for quantization. One of "per_tensor", "per_channel" (the default), or "per_block".

block_size: int or List/Tuple of int

A tuple block_size gives full control over the block shape along each dimension. Here are some examples of how different granularities can be achieved:

Given the weight of a 2D conv, which has shape [C_out, C_in, KH, KW]:

| Granularity        | output_channel_block_size | input_channel_block_size | Weight Shape of Each Block |
|--------------------|---------------------------|--------------------------|----------------------------|
| Per Tensor         | 0                         | 0                        | [C_out, C_in, KH, KW]      |
| Per Input Channel  | 0                         | 1                        | [C_out, 1, KH, KW]         |
| Per Output Channel | 1                         | 0                        | [1, C_in, KH, KW]          |
| Per Block          | 1                         | 32                       | [1, 32, KH, KW]            |

Given the weight of a linear layer, which has shape [C_out, C_in]:

| Granularity        | output_channel_block_size | input_channel_block_size | Weight Shape of Each Block |
|--------------------|---------------------------|--------------------------|----------------------------|
| Per Tensor         | 0                         | 0                        | [C_out, C_in]              |
| Per Input Channel  | 0                         | 1                        | [C_out, 1]                 |
| Per Output Channel | 1                         | 0                        | [1, C_in]                  |
| Per Block          | 1                         | 32                       | [1, 32]                    |

Given the weight of matmul's y (transpose_y=False), which has shape […, C_in, C_out]:

| Granularity        | output_channel_block_size | input_channel_block_size | Weight Shape of Each Block |
|--------------------|---------------------------|--------------------------|----------------------------|
| Per Tensor         | 0                         | 0                        | […, C_in, C_out]           |
| Per Input Channel  | 0                         | 1                        | […, 1, C_out]              |
| Per Output Channel | 1                         | 0                        | […, C_in, 1]               |
| Per Block          | 1                         | 32                       | […, 32, 1]                 |
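The mapping in the tables above can be sketched as a small helper for a 2-D [C_out, C_in] linear weight. This is illustrative only; coremltools performs the blocking internally:

```python
def block_shape(C_out, C_in, out_block, in_block):
    # A block size of 0 means the block spans that whole axis.
    return (C_out if out_block == 0 else out_block,
            C_in if in_block == 0 else in_block)

# Mirrors the linear-layer table above for a 64 x 128 weight.
assert block_shape(64, 128, 0, 0) == (64, 128)   # per tensor
assert block_shape(64, 128, 0, 1) == (64, 1)     # per input channel
assert block_shape(64, 128, 1, 0) == (1, 128)    # per output channel
assert block_shape(64, 128, 1, 32) == (1, 32)    # per block
```

Each block gets its own scale (and zero point), so smaller blocks trade extra metadata for lower quantization error.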

weight_threshold: int

The size threshold above which weights are quantized. That is, a weight tensor is quantized only if its total number of elements is greater than weight_threshold. Defaults to 2048.

For example, if weight_threshold = 1024 and a weight tensor has shape [10, 20, 1, 1], hence 200 elements, it will not be quantized.
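The threshold rule itself is simple; a sketch, assuming the total element count is compared against weight_threshold:

```python
import math

def exceeds_weight_threshold(weight_shape, weight_threshold=2048):
    # A weight is quantized only if it has more elements than the threshold.
    return math.prod(weight_shape) > weight_threshold

assert not exceeds_weight_threshold((10, 20, 1, 1), weight_threshold=1024)  # 200 elements
assert exceeds_weight_threshold((64, 64))  # 4096 > 2048 default
```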