Pruning — coremltools API Reference 8.0b1 documentation
coremltools.optimize.coreml.prune_weights(*args, **kwargs)[source]
Utility function to convert a float-precision MLModel of type mlprogram to a compressed MLModel using a sparse representation. The const ops storing weight values are replaced by constexpr_sparse_to_dense ops.
This function is useful if the model was trained with pruning techniques, so that many of the weight values are zero. If a large percentage of the weight values are zero, a sparse representation is more efficient than a dense one (the default).
The sparsified weights are stored as a bit mask plus the non-zero values. For example, if the weight values are {0, 0, 0, 0, 0, 0, 0, 56.3}, the sparse representation contains a bit mask with a one at each location where the value is non-zero: 00000001b. This is accompanied by the non-zero data, which here is a size-1 vector holding the value {56.3}.
For example, given the following:
weight = [0.3, 0, 0, 0.5, 0, 0]
non_zero_data, bit_mask = sparsify(weight)
The resulting non-zero data and bit mask are:
non_zero_data = [0.3, 0.5]
bit_mask = "100100"
Parameters:
mlmodel: MLModel
Model to be sparsified. This MLModel should be of type mlprogram.
config: OptimizationConfig
An OptimizationConfig object that specifies the parameters for weight pruning.
joint_compression: bool
Whether to further prune an already-compressed input MLModel into a jointly compressed MLModel. See the prune_weights graph pass for details on which compression schemes can be further pruned.
Take "quantize + prune" as an example of joint compression: the input MLModel is already quantized and is then further pruned (see the joint compression example under Examples below). In this case, the weight values are represented by constexpr_sparse_blockwise_shift_scale + constexpr_sparse_to_dense ops:
quantized(sparse) -> constexpr_sparse_blockwise_shift_scale -> weight(sparse) -> constexpr_sparse_to_dense -> weight(dense)
Returns:
model: MLModel
The sparse MLModel instance.
Examples
import coremltools as ct
import coremltools.optimize as cto

model = ct.models.MLModel("my_model.mlpackage")
config = cto.coreml.OptimizationConfig(
    global_config=cto.coreml.OpThresholdPrunerConfig(threshold=1e-12)
)
compressed_model = cto.coreml.prune_weights(model, config)
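
As a hedged sketch of the "quantize + prune" joint compression flow described above: the model is first quantized with linear_quantize_weights and then further pruned with joint_compression=True. The quantization mode and the placeholder model path are assumptions for illustration.

import coremltools as ct
import coremltools.optimize as cto

model = ct.models.MLModel("my_model.mlpackage")

# Step 1: quantize the dense weights.
quant_config = cto.coreml.OptimizationConfig(
    global_config=cto.coreml.OpLinearQuantizerConfig(mode="linear_symmetric")
)
quantized_model = cto.coreml.linear_quantize_weights(model, quant_config)

# Step 2: further prune the already-quantized model, producing
# constexpr_sparse_blockwise_shift_scale + constexpr_sparse_to_dense ops.
prune_config = cto.coreml.OptimizationConfig(
    global_config=cto.coreml.OpThresholdPrunerConfig(threshold=1e-12)
)
jointly_compressed_model = cto.coreml.prune_weights(
    quantized_model, prune_config, joint_compression=True
)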
class coremltools.optimize.coreml.OpThresholdPrunerConfig(threshold: float = 1e-12, minimum_sparsity_percentile: float = 0.5, weight_threshold: int | None = 2048)[source]
All weights with an absolute value smaller than threshold are changed to 0, and the tensor is stored in a sparse format.
For example, given the following:
weight = [0.3, -0.2, -0.01, 0.05]
threshold = 0.03
The sparsified weight would be [0.3, -0.2, 0, 0.05].
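
The thresholding rule itself is easy to express directly; a minimal NumPy sketch of the behavior described above (illustrative, not coremltools internals):

import numpy as np

weight = np.array([0.3, -0.2, -0.01, 0.05])
threshold = 0.03

# Zero out every value whose magnitude falls below the threshold.
sparsified = np.where(np.abs(weight) < threshold, 0.0, weight)
print(sparsified)  # [ 0.3  -0.2   0.    0.05]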
Parameters:
threshold: float
All weight values whose magnitude is below this threshold are set to 0.
- Default value is 1e-12.
minimum_sparsity_percentile: float
The sparsity level must be above this value for the weight representation to be stored in the sparse format rather than the dense format (see the sketch after this parameter list).
For example, if minimum_sparsity_percentile = 0.6 and the sparsity level is 0.54 (that is, 54% of the weight values are exactly 0), then the resulting weight tensor is stored as a dense const op and not converted to the constexpr_sparse_to_dense op (which stores the weight values in a sparse format).
- Must be a value between 0 and 1.
- Default value is 0.5.
weight_threshold: int
The size threshold above which weights are pruned. That is, a weight tensor is pruned only if its total number of elements is greater than weight_threshold.
For example, if weight_threshold = 1024 and a weight tensor is of shape [10, 20, 1, 1] (hence 200 elements), it will not be pruned.
- If not provided, it defaults to 2048, in which case only weights with more than 2048 elements are compressed.
class coremltools.optimize.coreml.OpMagnitudePrunerConfig(target_sparsity: float | None = None, block_size: int | None = None, n_m_ratio: Tuple[int, int] | None = None, dim: int | None = None, weight_threshold: int | None = 2048)[source]
Prune the weight with a constant sparsity percentile, which can be specified by either target_sparsity or n_m_ratio.
If target_sparsity is set, the n lowest-magnitude weight values are changed to 0, where n = floor(size_of_weight_tensor * target_sparsity). For example, given the following:

weight = [0.3, -0.2, -0.01, 0.05]
target_sparsity = 0.75

The sparsified weight would be [0.3, 0, 0, 0].
If block_size is set, the weights are pruned in a block-structured manner; that is, chunks of weight values, as big as block_size, are set to 0. Block sparsity can only be applied to linear and conv layers. For example:

Given a 4 x 2 weight with the following values, and block_size = 2, dim = 0:

[
    [1, 3],
    [-6, -7],
    [0, 3],
    [-9, 2],
]

We first flatten the matrix along axis = 0:

[1, -6, 0, -9, 3, -7, 3, 2]

For block size 2, the L2 norm is computed over the first two elements, then the next two, and so on:

[6.08, 9.00, 7.62, 3.61]

The blocks with the smallest norms are then pruned. With target_sparsity = 0.5, the blocks with L2 norms 6.08 and 3.61 are pruned; that is, the elements in the first and last blocks. This results in the following flattened pruned tensor:

[0, 0, 0, -9, 3, -7, 0, 0]

The final pruned tensor is:

[
    [0, 3],
    [0, -7],
    [0, 0],
    [-9, 0],
]
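
The worked example above can be reproduced with a short NumPy sketch (an illustration of the documented behavior, not coremltools internals):

import numpy as np

weight = np.array([[1, 3], [-6, -7], [0, 3], [-9, 2]], dtype=float)
block_size, target_sparsity = 2, 0.5

# Flatten along dim 0 (column-major), group into blocks of block_size,
# and score each block by its L2 norm.
flat = weight.flatten(order="F")              # [ 1 -6  0 -9  3 -7  3  2]
blocks = flat.reshape(-1, block_size)
norms = np.linalg.norm(blocks, axis=1)        # [6.08 9.   7.62 3.61]

# Zero out the lowest-norm blocks until the target sparsity is reached.
n_prune = int(len(blocks) * target_sparsity)  # 2 blocks
blocks[np.argsort(norms)[:n_prune]] = 0.0
pruned = blocks.reshape(-1).reshape(weight.shape, order="F")
print(pruned)
# [[ 0.  3.]
#  [ 0. -7.]
#  [ 0.  0.]
#  [-9.  0.]]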
The n_m_ratio triggers n:m pruning along the dim axis. In n:m pruning, out of every m elements, the n with the lowest magnitude are set to 0. For more information, see Learning N:M Fine-Grained Structured Sparse Neural Networks From Scratch.

n:m pruning can be applied only to linear and conv layers.
Example
Given a 4 x 4 weight of:

[
    [3, 4, 7, 6],
    [1, 8, -3, -8],
    [-2, -3, -4, 0],
    [5, 4, -3, -2],
]

For n_m_ratio = (1, 2) with axis = 1 (the default), the resulting pruned weight is:

[
    [0, 4, 7, 0],
    [0, 8, 0, -8],
    [0, -3, -4, 0],
    [5, 0, -3, 0],
]

For axis = 0, we get:

[
    [3, 0, 7, 0],
    [0, 8, 0, -8],
    [0, 0, -4, 0],
    [5, 4, 0, -2],
]
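
A NumPy sketch of the n:m rule applied to the example above; nm_prune is a hypothetical helper written for illustration, not a coremltools API:

import numpy as np

def nm_prune(weight, n, m, dim=1):
    # In every group of m consecutive elements along `dim`,
    # zero the n values with the lowest magnitude.
    w = weight.T.copy() if dim == 0 else weight.copy()
    groups = w.reshape(-1, m)
    idx = np.argsort(np.abs(groups), axis=1)[:, :n]
    np.put_along_axis(groups, idx, 0.0, axis=1)
    out = groups.reshape(w.shape)
    return out.T if dim == 0 else out

weight = np.array(
    [[3, 4, 7, 6], [1, 8, -3, -8], [-2, -3, -4, 0], [5, 4, -3, -2]],
    dtype=float,
)
print(nm_prune(weight, n=1, m=2, dim=1))  # matches the axis = 1 result above
print(nm_prune(weight, n=1, m=2, dim=0))  # matches the axis = 0 result above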
Parameters:
target_sparsity: float
The percentage of sparsity for compression, which needs to be in the range [0, 1]. When 0, no sparsification occurs. For 1, all weights become 0.
block_size: int
Block size for inducing block sparsity. This is applied along the dim dimension of the parameter. Having the zeros aligned in the parameter helps gain latency/memory performance on-device.
- If set, must be greater than 1 to enable block sparsity.
- Block sparsity can be applied only to linear and conv layers.
- The channel will be padded with 0 if it is not divisible by block_size.
n_m_ratio: tuple[int, int]
A tuple of two integers which specifies the ratio for n:m pruning.
- n must be smaller than or equal to m.
- The channel will be padded with 0 if it is not divisible by m.
dim: int
Dimension along which the block sparsity or n:m sparsity is applied.
- Must be either 0 or 1.
- The default value for block sparsity is 0 (output channel).
- The default value for n:m sparsity is 1 (input channel).
weight_threshold: int
The size threshold above which weights are pruned. That is, a weight tensor is pruned only if its total number of elements is greater than weight_threshold.
For example, if weight_threshold = 1024 and a weight tensor is of shape [10, 20, 1, 1] (hence 200 elements), it will not be pruned.
- If not provided, it defaults to 2048, in which case only weights with more than 2048 elements are compressed.