Pruning — coremltools API Reference 8.0b1 documentation

coremltools.optimize.coreml.prune_weights(*args, **kwargs)[source]

Utility function to convert a float precision MLModel of type mlprogram to a compressed MLModel using a sparse representation. The const ops storing weight values are replaced by constexpr_sparse_to_dense ops.

This function is useful if the model was trained with pruning techniques, so that many of its weights are zero. If a large percentage of weight values are zero, a sparse representation is more efficient than a dense one (the default).

The sparsified weights are stored with a bit mask. If the weight values are {0, 0, 0, 0, 0, 0, 0, 56.3}, the sparse representation contains a bit mask with ones in the locations where the value is non-zero: 00000001b. This is accompanied by the non-zero data, which is a size-1 vector of value {56.3}.

For example, given the following:

    weight = [0.3, 0, 0, 0.5, 0, 0]
    non_zero_data, bit_mask = sparsify(weight)

The indices of the non-zero elements are:

    non_zero_data = [0.3, 0.5]
    bit_mask = "100100"
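To make the representation concrete, here is a minimal numpy sketch of the dense-to-sparse conversion described above. The sparsify helper is hypothetical and for illustration only; it is not the coremltools implementation, which operates on MIL ops inside the mlprogram.

    import numpy as np

    # Hypothetical helper illustrating the sparse representation;
    # not the actual coremltools pass.
    def sparsify(weight):
        bit_mask = weight != 0            # ones where the value is non-zero
        non_zero_data = weight[bit_mask]  # packed non-zero values
        return non_zero_data, bit_mask

    weight = np.array([0.3, 0, 0, 0.5, 0, 0])
    non_zero_data, bit_mask = sparsify(weight)
    print(non_zero_data)             # [0.3 0.5]
    print(bit_mask.astype(int))      # [1 0 0 1 0 0]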

Parameters:

mlmodel: MLModel

Model to be sparsified. This MLModel should be of type mlprogram.

config: OptimizationConfig

An OptimizationConfig object that specifies the parameters for weight pruning.

joint_compression: bool

Specification of whether or not to further prune the already-compressed input MLModel into a jointly compressed MLModel. See the prune_weights graph pass for information about which compression schemes can be further pruned.

Take “quantize + prune” as an example of joint compression: the input MLModel is already quantized and will be further pruned. In that case, the weight values are represented by constexpr_sparse_blockwise_shift_scale + constexpr_sparse_to_dense ops: quantized(sparse) -> constexpr_sparse_blockwise_shift_scale -> weight(sparse) -> constexpr_sparse_to_dense -> weight(dense). A sketch of this flow appears at the end of the Examples section below.

Returns:

model: MLModel

The sparse MLModel instance.

Examples

    import coremltools as ct
    import coremltools.optimize as cto

    model = ct.models.MLModel("my_model.mlpackage")
    config = cto.coreml.OptimizationConfig(
        global_config=cto.coreml.OpThresholdPrunerConfig(threshold=1e-12)
    )
    compressed_model = cto.coreml.prune_weights(model, config)
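For the joint compression path described under joint_compression, the following is a hedged sketch of the “quantize + prune” flow: the model is first quantized with cto.coreml.linear_quantize_weights, and the quantized model is then pruned with joint_compression=True. The specific config values here (linear_symmetric mode, 50% target sparsity) are illustrative choices, not recommendations.

    import coremltools as ct
    import coremltools.optimize as cto

    model = ct.models.MLModel("my_model.mlpackage")

    # Step 1: quantize the weights.
    quant_config = cto.coreml.OptimizationConfig(
        global_config=cto.coreml.OpLinearQuantizerConfig(mode="linear_symmetric")
    )
    quantized_model = cto.coreml.linear_quantize_weights(model, quant_config)

    # Step 2: further prune the already-quantized model.
    prune_config = cto.coreml.OptimizationConfig(
        global_config=cto.coreml.OpMagnitudePrunerConfig(target_sparsity=0.5)
    )
    jointly_compressed_model = cto.coreml.prune_weights(
        quantized_model, prune_config, joint_compression=True
    )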

class coremltools.optimize.coreml.OpThresholdPrunerConfig(threshold: float = 1e-12, minimum_sparsity_percentile: float = 0.5, weight_threshold: int | None = 2048)[source]

All weights with absolute value smaller than threshold are changed to 0, and the tensor is stored in a sparse format.

For example, given the following:

    weight = [0.3, -0.2, -0.01, 0.05]
    threshold = 0.03

The sparsified weight would be [0.3, -0.2, 0, 0.05].
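The thresholding rule itself is simple; a minimal numpy equivalent (illustrative only, not the actual graph pass) is:

    import numpy as np

    weight = np.array([0.3, -0.2, -0.01, 0.05])
    threshold = 0.03
    # Zero out every value whose magnitude is below the threshold.
    sparsified = np.where(np.abs(weight) < threshold, 0.0, weight)
    print(sparsified)  # [ 0.3  -0.2   0.    0.05]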

Parameters:

threshold: float

All weight values with an absolute value smaller than this threshold are set to 0.

minimum_sparsity_percentile: float

The sparsity level must be above this value for the weight representation to be stored in the sparse format rather than the dense format.

For example, if minimum_sparsity_percentile = 0.6 and the sparsity level is 0.54 (that is, 54% of the weight values are exactly 0), the resulting weight tensor is stored as a dense const op and is not converted to a constexpr_sparse_to_dense op (which stores the weight values in a sparse format). A sketch of this decision appears after this parameter list.

weight_threshold: int

The size threshold above which weights are pruned. That is, a weight tensor is pruned only if its total number of elements is greater than weight_threshold.

For example, if weight_threshold = 1024 and a weight tensor has shape [10, 20, 1, 1], and hence 200 elements, it will not be pruned.
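As a minimal sketch of the storage decision described for minimum_sparsity_percentile (illustrative, not coremltools internals):

    import numpy as np

    def use_sparse_format(weight, minimum_sparsity_percentile):
        # Fraction of values that are exactly 0 after thresholding.
        sparsity = np.mean(weight == 0)
        return sparsity >= minimum_sparsity_percentile

    w = np.zeros(100)
    w[:46] = 1.0                          # 54% of values are 0
    print(use_sparse_format(w, 0.6))      # False -> stored as a dense const op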

class coremltools.optimize.coreml.OpMagnitudePrunerConfig(target_sparsity: float | None = None, block_size: int | None = None, n_m_ratio: Tuple[int, int] | None = None, dim: int | None = None, weight_threshold: int | None = 2048)[source]

Prune the weight with a constant sparsity percentile, which can be specified by either target_sparsity or n_m_ratio.

If target_sparsity is set, a constant n = floor(size_of_weight_tensor * target_sparsity) is computed, and the n weight values with the lowest absolute values are changed to 0. For example, given the following:

    weight = [0.3, -0.2, -0.01, 0.05]
    target_sparsity = 0.75

The sparsified weight would be [0.3, 0, 0, 0].
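A minimal numpy sketch of this unstructured magnitude pruning (illustrative only):

    import numpy as np

    weight = np.array([0.3, -0.2, -0.01, 0.05])
    target_sparsity = 0.75
    n = int(np.floor(weight.size * target_sparsity))   # n = 3
    lowest = np.argsort(np.abs(weight))[:n]            # indices of the n smallest magnitudes
    pruned = weight.copy()
    pruned[lowest] = 0.0
    print(pruned)  # [0.3 0.  0.  0. ]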

If block_size is set, weights are pruned in a block-structured manner: contiguous chunks of block_size weight values are set to 0 together. Block sparsity can only be applied to linear and conv layers. For example:

Given a 4 x 2 weight with the following values, block_size = 2, and dim = 0:

    [
        [1, 3],
        [-6, -7],
        [0, 3],
        [-9, 2],
    ]

We first flatten the matrix along axis = 0:

    [1, -6, 0, -9, 3, -7, 3, 2]

For block size 2, the L2 norm is computed over the first 2 elements, then the next 2 elements, and so on:

    [6.08, 9.00, 7.62, 3.61]

The blocks with the smallest L2 norms are then pruned. If target_sparsity = 0.5, the two blocks with L2 norms 6.08 and 3.61 are pruned; hence, the elements in the first and fourth blocks are set to 0, resulting in the following flattened pruned tensor:

    [0, 0, 0, -9, 3, -7, 0, 0]

The final pruned tensor is:

    [
        [0, 3],
        [0, -7],
        [0, 0],
        [-9, 0],
    ]
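The walkthrough above can be reproduced with a short numpy sketch (illustrative, not the actual pass; it assumes the flattened size is divisible by block_size):

    import numpy as np

    weight = np.array([[1, 3], [-6, -7], [0, 3], [-9, 2]], dtype=float)
    block_size, target_sparsity = 2, 0.5

    flat = weight.flatten(order="F")            # flatten along axis 0
    blocks = flat.reshape(-1, block_size)       # [[1 -6] [0 -9] [3 -7] [3 2]]
    norms = np.linalg.norm(blocks, axis=1)      # [6.08 9.   7.62 3.61]

    n_prune = int(len(blocks) * target_sparsity)
    blocks[np.argsort(norms)[:n_prune]] = 0.0   # zero the blocks with the smallest norms

    pruned = blocks.reshape(-1).reshape(weight.shape, order="F")
    print(pruned)
    # [[ 0.  3.]
    #  [ 0. -7.]
    #  [ 0.  0.]
    #  [-9.  0.]]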

The n_m_ratio triggers n:m pruning along the dim axis. In n:m pruning, out of every m elements, the n with the lowest magnitude are set to 0. For more information, see Learning N:M Fine-Grained Structured Sparse Neural Networks From Scratch.

n:m pruning can be applied only to linear and conv layers.

Example

Given a 4 x 4 weight of:

    [
        [3, 4, 7, 6],
        [1, 8, -3, -8],
        [-2, -3, -4, 0],
        [5, 4, -3, -2],
    ]

For n_m_ratio = (1, 2) with axis = 1 (the default), the resulting pruned weight is:

    [
        [0, 4, 7, 0],
        [0, 8, 0, -8],
        [0, -3, -4, 0],
        [5, 0, -3, 0],
    ]

For axis = 0, we get:

    [
        [3, 0, 7, 0],
        [0, 8, 0, -8],
        [0, 0, -4, 0],
        [5, 4, 0, -2],
    ]
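A hedged numpy sketch of n:m pruning that reproduces the axis = 1 example above (illustrative only; it assumes the pruned axis length is divisible by m):

    import numpy as np

    def nm_prune(weight, n, m, axis=1):
        # Work on rows for axis=1, on columns (via transpose) for axis=0.
        w = (weight if axis == 1 else weight.T).astype(float)
        groups = w.reshape(-1, m)                        # consecutive groups of m elements
        idx = np.argsort(np.abs(groups), axis=1)[:, :n]  # n smallest magnitudes per group
        np.put_along_axis(groups, idx, 0.0, axis=1)      # zero them in place
        return w if axis == 1 else w.T

    weight = np.array([[3, 4, 7, 6], [1, 8, -3, -8], [-2, -3, -4, 0], [5, 4, -3, -2]])
    print(nm_prune(weight, n=1, m=2, axis=1))
    # [[ 0.  4.  7.  0.]
    #  [ 0.  8.  0. -8.]
    #  [ 0. -3. -4.  0.]
    #  [ 5.  0. -3.  0.]]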

Parameters:

target_sparsity: float

The percentage of sparsity for compression, which must be in the range [0, 1]. A value of 0 means no sparsification; a value of 1 sets all weights to 0.

block_size: int

Block size for inducing block sparsity. This is applied on the dim dimension of the parameter. Having the zeros aligned in the parameter helps gain latency/memory performance on-device.

n_m_ratio: tuple[int, int]

A tuple of two integers that specifies the ratio for n:m pruning.

dim: int

Dimension where the block sparsity or n:m sparsity is applied.

weight_threshold: int

The size threshold above which weights are pruned. That is, a weight tensor is pruned only if its total number of elements is greater than weight_threshold.

For example, if weight_threshold = 1024 and a weight tensor has shape [10, 20, 1, 1], and hence 200 elements, it will not be pruned.
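As an end-to-end usage sketch (the config values here are illustrative, not recommendations): prune a model with 75% unstructured sparsity, leaving tensors with 2048 or fewer elements untouched.

    import coremltools as ct
    import coremltools.optimize as cto

    model = ct.models.MLModel("my_model.mlpackage")
    config = cto.coreml.OptimizationConfig(
        global_config=cto.coreml.OpMagnitudePrunerConfig(
            target_sparsity=0.75,
            weight_threshold=2048,
        )
    )
    compressed_model = cto.coreml.prune_weights(model, config)
    compressed_model.save("my_model_pruned.mlpackage")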