Quantization - Neural Network Distiller

Quantization Algorithms

Note:
For any of the methods below that require quantization-aware training, please see here for details on how to invoke it using Distiller's scheduling mechanism.

Range-Based Linear Quantization

Let's break down the terminology we use here:

- Linear: the float value is quantized by multiplying with a numeric constant (the scale factor).
- Range-Based: in order to calculate the scale factor, we look at the actual range of the tensor's values. In the most naive implementation, we use the actual min/max values of the tensor. This is in contrast to the other methods described below, which we could call clipping-based, as they impose an explicit clipping function on the tensors (using either a hard-coded or a learned value).

Asymmetric vs. Symmetric

In this method we can use two modes - asymmetric and symmetric.

Asymmetric Mode

In asymmetric mode, we map the min/max in the float range to the min/max of the integer range. This is done by using a zero-point (also called quantization bias, or offset) in addition to the scale factor.

Let us denote the original floating-point tensor by \(x_f\), the quantized tensor by \(x_q\), the scale factor by \(q_x\), the zero-point by \(zp_x\) and the number of bits used for quantization by \(n\). Then, we get:

\[x_q = round\left( (x_f - min_{x_f}) \underbrace{\frac{2^n - 1}{max_{x_f} - min_{x_f}}}_{q_x} \right) = round(q_x x_f - \underbrace{min_{x_f} q_x}_{zp_x}) = round(q_x x_f - zp_x)\]

In practice, we actually use \(zp_x = round(min_{x_f} q_x)\). This means that zero is exactly representable by an integer in the quantized range. This is important, for example, for layers that have zero-padding. By rounding the zero-point, we effectively "nudge" the min/max values in the float range a little bit, in order to gain this exact quantization of zero.

Note that in the derivation above we use unsigned integer to represent the quantized range. That is, \(x_q \in [0, 2^n - 1]\). One could use signed integer if necessary (perhaps due to HW considerations). This can be achieved by subtracting \(2^{n-1}\).
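To make the mapping concrete, here is a minimal PyTorch sketch of asymmetric quantization and the matching de-quantization. The function names are illustrative and are not part of Distiller's API.

```python
import torch

def asymmetric_quantize(x_f, n_bits=8):
    """Map a float tensor to unsigned integers in [0, 2^n - 1] (illustrative helper)."""
    min_f, max_f = x_f.min(), x_f.max()
    # Scale factor maps the float range onto the integer range
    scale = (2 ** n_bits - 1) / (max_f - min_f)
    # Round the zero-point so that 0.0 is exactly representable
    zero_point = torch.round(min_f * scale)
    x_q = torch.clamp(torch.round(scale * x_f - zero_point), 0, 2 ** n_bits - 1)
    return x_q, scale, zero_point

def asymmetric_dequantize(x_q, scale, zero_point):
    """Recover an approximation of the original float tensor."""
    return (x_q + zero_point) / scale

x_f = torch.empty(4, 4).uniform_(-0.8, 2.3)
x_q, scale, zp = asymmetric_quantize(x_f)
x_hat = asymmetric_dequantize(x_q, scale, zp)
print((x_f - x_hat).abs().max())  # small reconstruction error, on the order of 1/scale
```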

Let's see how a convolution or fully-connected (FC) layer is quantized in asymmetric mode: (we denote input, output, weights and bias with \(x\), \(y\), \(w\) and \(b\) respectively)

\[y_f = \sum{x_f w_f} + b_f = \sum{\frac{x_q + zp_x}{q_x} \cdot \frac{w_q + zp_w}{q_w}} + \frac{b_q + zp_b}{q_b} = \frac{1}{q_x q_w} \left( \sum{(x_q + zp_x)(w_q + zp_w)} + \frac{q_x q_w}{q_b}(b_q + zp_b) \right)\]

Therefore:

\[y_q = round(q_y y_f) = round\left( \frac{q_y}{q_x q_w} \left( \sum{(x_q + zp_x)(w_q + zp_w)} + \frac{q_x q_w}{q_b}(b_q + zp_b) \right) \right)\]

Notes:

- We can see that the bias has to be re-scaled to match the scale of the summation.
- In a proper integer-only HW pipeline, we would like our main accumulation term to simply be \(\sum{x_q w_q}\). In order to achieve this, one needs to further develop the expression derived above. For further details, refer to the gemmlowp documentation on quantized matrix multiplication.

Symmetric Mode

In symmetric mode, instead of mapping the exact min/max of the float range to the quantized range, we choose the maximum absolute value between min/max. In addition, we don't use a zero-point. So, the floating-point range we're effectively quantizing is symmetric with respect to zero, and so is the quantized range.

There's a nuance in the symmetric case with regards to the quantized range. Assuming \(n\) bits, we can use either a "full" or "restricted" quantized range:

|                 | Full Range                          | Restricted Range                       |
| --------------- | ----------------------------------- | -------------------------------------- |
| Quantized Range | \([-2^{n-1}, 2^{n-1}-1]\)           | \([-(2^{n-1}-1), 2^{n-1}-1]\)          |
| 8-bit example   | \([-128, 127]\)                     | \([-127, 127]\)                        |
| Scale Factor    | \(q_x = \frac{2^n - 1}{2\,max|x_f|}\) | \(q_x = \frac{2^{n-1} - 1}{max|x_f|}\) |

The restricted range is less accurate on-paper, and is usually used when specific HW considerations require it. Implementations of quantization "in the wild" that use a full range include PyTorch's native quantization (from v1.3 onwards) and ONNX. Implementations that use a restricted range include TensorFlow, NVIDIA TensorRT and Intel DNNL (aka MKL-DNN). Distiller can emulate both modes.

Using the same notations as above, we get (regardless of full/restricted range):

\[x_q = round(q_x x_f)\]
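A minimal sketch of symmetric quantization, using the scale factor conventions from the table above for the full and restricted ranges (illustrative helper, not Distiller's API):

```python
import torch

def symmetric_quantize(x_f, n_bits=8, restricted=False):
    """Symmetric range-based quantization: no zero-point, range centered at 0."""
    max_abs = x_f.abs().max()
    if restricted:
        # Restricted range, e.g. [-127, 127] for 8 bits
        scale = (2 ** (n_bits - 1) - 1) / max_abs
        q_min, q_max = -(2 ** (n_bits - 1) - 1), 2 ** (n_bits - 1) - 1
    else:
        # Full range, e.g. [-128, 127] for 8 bits
        scale = (2 ** n_bits - 1) / (2 * max_abs)
        q_min, q_max = -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1
    x_q = torch.clamp(torch.round(scale * x_f), q_min, q_max)
    return x_q, scale

x_f = torch.randn(4, 4)
for restricted in (False, True):
    x_q, scale = symmetric_quantize(x_f, restricted=restricted)
    print(restricted, float(scale), float((x_f - x_q / scale).abs().max()))
```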

Again, let's see how a convolution or fully-connected (FC) layer is quantized, this time in symmetric mode:

\[y_f = \sum{x_f w_f} + b_f = \sum{\frac{x_q}{q_x} \cdot \frac{w_q}{q_w}} + \frac{b_q}{q_b} = \frac{1}{q_x q_w} \left( \sum{x_q w_q} + \frac{q_x q_w}{q_b} b_q \right)\]

Therefore:

\[y_q = round(q_y y_f) = round\left( \frac{q_y}{q_x q_w} \left( \sum{x_q w_q} + \frac{q_x q_w}{q_b} b_q \right) \right)\]
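As a sanity check on this derivation, the following sketch computes a small FC layer both directly in float and via the integer-domain accumulation plus a single re-scaling step (symmetric, restricted range; all names are illustrative, not Distiller's code):

```python
import torch

def sym_quant(t, n_bits=8):
    """Symmetric (restricted-range) quantization helper -- illustrative only."""
    scale = (2 ** (n_bits - 1) - 1) / t.abs().max()
    return torch.round(t * scale), scale  # integer values stored in a float tensor

torch.manual_seed(0)
x_f = torch.randn(16)
w_f = torch.randn(8, 16)
b_f = torch.randn(8)
y_f = w_f @ x_f + b_f                 # float reference

x_q, q_x = sym_quant(x_f)
w_q, q_w = sym_quant(w_f)
b_q, q_b = sym_quant(b_f, n_bits=16)  # bias typically gets more bits
_, q_y = sym_quant(y_f)               # in practice q_y comes from calibration stats

# Integer-domain accumulation followed by a single re-scaling, mirroring the formula above
acc = w_q @ x_q + (q_x * q_w / q_b) * b_q
y_q = torch.round(q_y / (q_x * q_w) * acc)

print((y_q - torch.round(q_y * y_f)).abs().max())  # agrees to within a few quantization steps
```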

Comparing the Two Modes

The main trade-off between these two modes is simplicity vs. utilization of the quantized range.

- When using asymmetric quantization, the quantized range is fully utilized, since the exact min/max of the float range are mapped to the min/max of the quantized range. In symmetric mode, if the float range is biased towards one side, part of the quantized range is dedicated to values that never appear. The most extreme example is after a ReLU, where the entire tensor is positive; quantizing it in symmetric mode effectively wastes one bit.
- On the other hand, as the derivations above show, symmetric mode is much simpler to implement. In asymmetric mode, the zero-points require additional logic in HW, and the cost of this extra logic in terms of latency, power and/or area depends on the exact implementation.

Other Features

Implementation in Distiller

Post-Training

For post-training quantization, this method is implemented by wrapping existing modules with quantization and de-quantization operations. The wrapper implementations are in range_linear.py.
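Conceptually, such a wrapper can be sketched as below. This is a simplified, hypothetical stand-in (symmetric, per-tensor, 8-bit fake quantization), not the actual code in range_linear.py:

```python
import torch
import torch.nn as nn

class QuantDequantWrapper(nn.Module):
    """Simulates post-training linear quantization of a wrapped module's
    input and output (simplified: symmetric, per-tensor, 8-bit)."""
    def __init__(self, wrapped, n_bits=8):
        super().__init__()
        self.wrapped = wrapped
        self.n_bits = n_bits

    def _fake_quant(self, t):
        scale = (2 ** (self.n_bits - 1) - 1) / t.abs().max().clamp(min=1e-8)
        return torch.round(t * scale) / scale  # quantize, then immediately de-quantize

    def forward(self, x):
        y = self.wrapped(self._fake_quant(x))
        return self._fake_quant(y)

# Example: wrap a single FC layer
layer = QuantDequantWrapper(nn.Linear(64, 10))
out = layer(torch.randn(2, 64))
```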

Quantization-Aware Training

To apply range-based linear quantization in training, use the QuantAwareTrainRangeLinearQuantizer class. As it is now, it will apply weights quantization to convolution, FC and embedding modules. For activations quantization, it will insert instances of the FakeLinearQuantization module after ReLUs. This module follows the methodology described in Benoit et al., 2018 and uses exponential moving averages to track activation ranges.
Note that the current implementation of QuantAwareTrainRangeLinearQuantizer supports training with a single GPU only.
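The idea behind FakeLinearQuantization can be sketched roughly as follows, assuming asymmetric quantization, EMA range tracking and a straight-through estimator for the rounding; range initialization and bias correction are omitted, and this is not Distiller's actual module:

```python
import torch
import torch.nn as nn

class EMAFakeQuant(nn.Module):
    """Tracks activation min/max with exponential moving averages and applies
    quantize -> de-quantize in the forward pass (simplified sketch)."""
    def __init__(self, n_bits=8, ema_decay=0.999):
        super().__init__()
        self.n_bits = n_bits
        self.ema_decay = ema_decay
        # Buffers, so the tracked ranges are saved when a checkpoint is created
        self.register_buffer('tracked_min', torch.zeros(1))
        self.register_buffer('tracked_max', torch.zeros(1))

    def forward(self, x):
        if self.training:
            self.tracked_min.mul_(self.ema_decay).add_(x.min().detach() * (1 - self.ema_decay))
            self.tracked_max.mul_(self.ema_decay).add_(x.max().detach() * (1 - self.ema_decay))
        scale = (2 ** self.n_bits - 1) / (self.tracked_max - self.tracked_min).clamp(min=1e-8)
        zero_point = torch.round(self.tracked_min * scale)
        x_q = torch.clamp(torch.round(x * scale - zero_point), 0, 2 ** self.n_bits - 1)
        x_dq = (x_q + zero_point) / scale
        # Straight-through estimator: gradients flow as if quantization were the identity
        return x + (x_dq - x).detach()
```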

Similarly to post-training, the calculated quantization parameters (scale factors, zero-points, tracked activation ranges) are stored as buffers within their respective modules, so they're saved when a checkpoint is created.

Note that converting from a quantization-aware training model to a post-training quantization model is not yet supported. Such a conversion will use the activation ranges tracked during training, so additional offline or online calculation of quantization parameters will not be required.

DoReFa

(As proposed in DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients)

In this method, we first define the quantization function \(quantize_k\), which takes a real value \(a_f \in [0, 1]\) and outputs a discrete-valued \(a_q \in \left\{ \frac{0}{2^k - 1}, \frac{1}{2^k - 1}, \dots, \frac{2^k - 1}{2^k - 1} \right\}\), where \(k\) is the number of bits used for quantization:

\[a_q = \frac{1}{2^k - 1} round\left( \left(2^k - 1\right) a_f \right)\]

Activations are clipped to the \([0, 1]\) range and then quantized as follows:

\[x_q = quantize_k(x_f)\]

For weights, we define the following function \(f\), which takes an unbounded real valued input and outputs a real value in \([0, 1]\):

\[f(w) = \frac{tanh(w)}{2\, max(|tanh(w)|)} + \frac{1}{2}\]

Now we can use \(quantize_k\) to get quantized weight values, as follows:

\[w_q = 2\, quantize_k\left( f(w_f) \right) - 1\]
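The three equations above can be sketched in PyTorch as follows (gradient handling is omitted; these helpers are illustrative and differ from DorefaQuantizer's internals):

```python
import torch

def quantize_k(a_f, k):
    """DoReFa quantizer: maps a value in [0, 1] to one of 2^k discrete levels in [0, 1]."""
    n = 2 ** k - 1
    return torch.round(a_f * n) / n

def dorefa_activation(x_f, k):
    # Clip activations to [0, 1], then quantize
    return quantize_k(torch.clamp(x_f, 0, 1), k)

def dorefa_weight(w_f, k):
    # f(w) maps unbounded weights into [0, 1]; quantize, then rescale back to [-1, 1]
    w = torch.tanh(w_f)
    f_w = w / (2 * w.abs().max()) + 0.5
    return 2 * quantize_k(f_w, k) - 1

w_q = dorefa_weight(torch.randn(3, 3), k=2)
print(torch.unique(w_q))  # a small set of discrete levels in [-1, 1]
```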

This method requires training the model with quantization-aware training, as discussed here. Use the DorefaQuantizer class to transform an existing model to a model suitable for training with quantization using DoReFa.

Notes

PACT

(As proposed in PACT: Parameterized Clipping Activation for Quantized Neural Networks)

This method is similar to DoReFa, but the upper clipping values, \(\alpha\), of the activation functions are learned parameters instead of hard coded to 1. Note that, per the paper's recommendation, \(\alpha\) is shared per layer.
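A rough sketch of a PACT activation with a learnable, per-layer clipping value (the regularization of \(\alpha\) recommended in the paper is omitted; this is not the PACTQuantizer implementation):

```python
import torch
import torch.nn as nn

class PACTActivation(nn.Module):
    """Clipped ReLU with a learnable upper clipping value alpha (shared per layer),
    followed by linear quantization to k bits."""
    def __init__(self, k=4, alpha_init=6.0):
        super().__init__()
        self.k = k
        self.alpha = nn.Parameter(torch.tensor(alpha_init))  # learned during training

    def forward(self, x):
        # y = clip(x, 0, alpha); written so that gradients w.r.t. alpha flow where x > alpha
        y = torch.clamp(x, min=0.0) - torch.clamp(x - self.alpha, min=0.0)
        # Quantize the clipped activation to k bits over [0, alpha]
        n = 2 ** self.k - 1
        scale = (n / self.alpha).detach()
        y_q = torch.round(y * scale) / scale
        # Straight-through estimator for the rounding
        return y + (y_q - y).detach()
```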

This method requires training the model with quantization-aware training, as discussed here. Use the PACTQuantizer class to transform an existing model to a model suitable for training with quantization using PACT.

WRPN

(As proposed in WRPN: Wide Reduced-Precision Networks)

In this method, activations are clipped to \([0, 1]\) and quantized as follows (\(k\) is the number of bits used for quantization):

\[x_q = \frac{1}{2^k - 1} round\left( \left(2^k - 1\right) x_f \right)\]

Weights are clipped to \([-1, 1]\) and quantized as follows:

\[w_q = \frac{1}{2^{k-1} - 1} round\left( \left(2^{k-1} - 1\right) w_f \right)\]

Note that \(k-1\) bits are used to quantize weights, leaving one bit for sign.
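These two equations can be sketched as follows (illustrative helpers, not the WRPNQuantizer implementation):

```python
import torch

def wrpn_quantize_activation(x_f, k):
    # Clip to [0, 1], then quantize to k bits
    n = 2 ** k - 1
    return torch.round(torch.clamp(x_f, 0, 1) * n) / n

def wrpn_quantize_weight(w_f, k):
    # Clip to [-1, 1]; k-1 bits for magnitude, leaving one bit for sign
    n = 2 ** (k - 1) - 1
    return torch.round(torch.clamp(w_f, -1, 1) * n) / n

print(torch.unique(wrpn_quantize_weight(torch.randn(100), k=3)))
```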

This method requires training the model with quantization-aware training, as discussed here. Use the WRPNQuantizer class to transform an existing model to a model suitable for training with quantization using WRPN.

Notes