Palettization — coremltools API Reference 8.3 documentation

coremltools.optimize.coreml.palettize_weights(*args, **kwargs)[source]

Utility function to convert a float precision MLModel of type mlprogram to a compressed MLModel by reducing the overall number of weights using one or more lookup tables (LUTs). A LUT contains a list of float values. An n-bit LUT has \(2^n\) entries.

For example, a float weight vector such as {0.3, 0.3, 0.5, 0.5} can be compressed using a 1-bit LUT: {0.3, 0.5}. In this case the float vector can be replaced with a 1-bit vector {0, 0, 1, 1}.

This function iterates over all the weights in the mlprogram, discretizes their values, and constructs the LUT according to the algorithm specified in mode. The float values are then converted to the n-bit values, and the LUT is saved alongside each weight. The const ops storing weight values are replaced by constexpr_lut_to_dense ops.

At runtime, the LUT and the n-bit values are used to reconstruct the float weight values, which are then used to perform the float operation the weight is feeding into.
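This encode/decode round trip can be sketched in plain NumPy (illustrative names only, not part of the coremltools API):

```python
import numpy as np

# Hypothetical illustration of LUT compression and runtime reconstruction.
weight = np.array([0.3, 0.3, 0.5, 0.5], dtype=np.float32)

# Compress: the unique values become a 1-bit LUT.
lut = np.unique(weight)                 # [0.3, 0.5]
indices = np.searchsorted(lut, weight)  # [0, 0, 1, 1]

# Decompress (what happens at runtime): index back into the LUT
# to rebuild the float weights the downstream op consumes.
reconstructed = lut[indices]
assert np.array_equal(reconstructed, weight)
```

Because every weight value here appears in the LUT, the reconstruction is lossless; in general palettization is lossy and each weight maps to its closest LUT entry.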

Consider the following example of "uniform" mode (a linear histogram):

The data range [0.0, 0.3] is divided into four partitions linearly, which is [0.0, 0.1, 0.2, 0.3].

The weight can then be converted to a palette with indices [0, 1, 2, 3] (2 bits). The indices are stored as a byte array.
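A minimal NumPy sketch of this "uniform" construction (an illustration under assumed values; the real pass operates on the MIL graph):

```python
import numpy as np

nbits = 2
weight = np.array([0.0, 0.08, 0.12, 0.22, 0.3], dtype=np.float32)

# Divide the data range linearly into 2**nbits palette entries.
lut = np.linspace(weight.min(), weight.max(), 2**nbits)  # [0.0, 0.1, 0.2, 0.3]

# Map each weight to the index of its nearest palette entry.
indices = np.abs(weight[:, None] - lut[None, :]).argmin(axis=1)
# indices == [0, 1, 1, 2, 3]
```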

Parameters:

mlmodel: MLModel

Model to be compressed with LUTs. This MLModel should be of type mlprogram.

config: OptimizationConfig

An OptimizationConfig object that specifies the parameters for weight palettization.

joint_compression: bool

Whether to further compress an already-compressed input MLModel into a jointly compressed MLModel. See the channelwise_palettize_weights graph pass for information about which compression schemes can be further jointly palettized.

Take “prune + palettize” as an example of joint compression, where the input MLModel is already pruned and the non-zero entries will be further palettized. In such an example, the weight values are represented by constexpr_lut_to_sparse + constexpr_sparse_to_dense ops:

lut(sparse) -> constexpr_lut_to_sparse -> weight(sparse) -> constexpr_sparse_to_dense -> weight(dense)
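The two-stage decompression can be sketched in plain NumPy (hypothetical storage layout; these are not the actual op implementations):

```python
import numpy as np

# Hypothetical "prune + palettize" storage: a mask of non-zero positions,
# palettized indices for the non-zero entries, and a small LUT.
mask = np.array([1, 0, 1, 1, 0, 1], dtype=bool)  # sparsity pattern
lut = np.array([0.25, 0.75], dtype=np.float32)   # 1-bit palette
nz_indices = np.array([0, 1, 1, 0])              # one index per non-zero entry

# Stage 1 (cf. constexpr_lut_to_sparse): LUT lookup yields the sparse values.
sparse_values = lut[nz_indices]                  # [0.25, 0.75, 0.75, 0.25]

# Stage 2 (cf. constexpr_sparse_to_dense): scatter into the dense weight.
dense = np.zeros(mask.shape, dtype=np.float32)
dense[mask] = sparse_values
```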

Returns:

model: MLModel

The palettized MLModel instance.

Examples

import coremltools as ct
import coremltools.optimize as cto

model = ct.models.MLModel("my_model.mlpackage")
config = cto.coreml.OptimizationConfig(
    global_config=cto.coreml.OpPalettizerConfig(mode="kmeans", nbits=4)
)
compressed_model = cto.coreml.palettize_weights(model, config)

class coremltools.optimize.coreml.OpPalettizerConfig(mode: str = 'kmeans', nbits: int | None = None, lut_function: Callable | None = None, granularity: str | CompressionGranularity = CompressionGranularity.PER_TENSOR, group_size: int = 32, channel_axis: int | None = None, cluster_dim: int = 1, enable_per_channel_scale: bool = False, num_kmeans_workers: int = 1, weight_threshold: int | None = 2048)[source]

Parameters:

nbits: int

Number of bits per weight. Required for kmeans or uniform mode, but must not be set for unique or custom mode. A LUT has \(2^{nbits}\) entries, where nbits can be {1, 2, 3, 4, 6, 8}.

mode: str

Determine how the LUT is constructed by specifying one of the following:

"kmeans" (default): The LUT is generated by k-means clustering of the weight values.

"uniform": The LUT is generated by uniformly partitioning the weight range.

"unique": The LUT is generated from the unique values in the weights.

"custom": The LUT is generated by the user-provided lut_function.

lut_function: callable

A callable function which computes the weight palettization parameters. This must be provided if the mode is set to "custom".

weight: np.ndarray

A float precision numpy array.

Returns:

lut: list[float]

The lookup table.

indices: list[int]

A list of indices for each element.

The following is an example that extracts the top-k elements as the LUT. Given weight = [0.1, 0.5, 0.3, 0.3, 0.5, 0.6, 0.7], the lut_function produces lut = [0, 0.5, 0.6, 0.7], indices = [0, 1, 0, 0, 1, 2, 3].

import numpy as np


def lut_function(weight):
    # In this example, we assume all elements in the weight are >= 0.
    weight = weight.flatten()
    nbits = 2

    # Build the LUT from the top-k largest unique elements in the weight.
    # Note that k = (1 << nbits) - 1, so the first LUT entry can be 0.
    unique_elements = np.unique(weight)
    k = (1 << nbits) - 1
    top_k = np.sort(np.partition(unique_elements, -k)[-k:])
    lut = [0.0] + top_k.tolist()

    # Compute the indices: map each element to its LUT entry, or 0 if absent.
    mapping = {v: idx for idx, v in enumerate(lut)}
    indices = [mapping[v] if v in mapping else 0 for v in weight]

    return lut, indices

granularity: str

Granularity for palettization. One of the following:

"per_tensor" (default): A single LUT is shared by the entire weight tensor.

"per_grouped_channel": Each group of group_size channels along channel_axis has its own LUT.

group_size: int

The number of channels in each group when granularity is "per_grouped_channel"; each group shares a single LUT. Defaults to 32.

channel_axis: Optional[int] = None

The channel axis for "per_grouped_channel" granularity. If not specified, it is chosen automatically based on the weight.

cluster_dim: int

The dimension of each LUT entry. When cluster_dim > 1, each LUT entry is a vector of length cluster_dim and contiguous blocks of cluster_dim weight elements share one entry (vector palettization). Defaults to 1.

enable_per_channel_scale: bool

Whether to normalize the weight using per-channel scales before palettization. Defaults to False.

num_kmeans_workers: int

The number of worker processes used to run k-means in parallel. Defaults to 1.

weight_threshold: int

The size threshold, above which weights are palettized. That is, a weight tensor is palettized only if its total number of elements is greater than weight_threshold.

For example, if weight_threshold = 1024 and a weight tensor is of shape [10, 20, 1, 1], hence 200 elements, it will not be palettized.
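As a rough illustration of the "per_grouped_channel" granularity described above, the following NumPy sketch builds one LUT per group of channels (a hypothetical helper using unique values, not the coremltools k-means implementation):

```python
import numpy as np

def luts_per_grouped_channel(weight, group_size, nbits):
    # Hypothetical sketch: build one "unique-values" LUT per group of
    # output channels (axis 0). A real palettizer would cluster instead.
    num_channels = weight.shape[0]
    luts = []
    for start in range(0, num_channels, group_size):
        group = weight[start:start + group_size]
        values = np.unique(group)[: 2**nbits]  # truncate to 2**nbits entries
        luts.append(values)
    return luts

weight = np.arange(8, dtype=np.float32).reshape(4, 2)  # 4 channels
luts = luts_per_grouped_channel(weight, group_size=2, nbits=2)
# Two groups of 2 channels -> two LUTs, each with up to 2**nbits entries.
```

With "per_tensor" granularity, by contrast, the loop would collapse to a single LUT covering the whole tensor.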