tensor_quant — Model Optimizer 0.27.1

Basic tensor quantization functions.

Classes

DynamicBlockQuantizationFunction Dynamic block quantization functional.
FakeAffineTensorQuantFunction Fake version of affine quantization.
FakeTensorQuantFunction Fake version of TensorQuantFunction using the CUDA extension.
LegacyFakeTensorQuantFunction Fake version of TensorQuantFunction.
ScaledE4M3Function E4M3fy input with scale.
StaticBlockQuantizationFunction Static block quantization functional.
TensorQuantFunction A universal tensor quantization function.

Functions

fake_quant_impl Implementation of fake quantizing input according to number of bits.
scaled_e4m3_impl Implementation of fake quantizing input to FP8.

class DynamicBlockQuantizationFunction

Bases: Function

Dynamic block quantization functional.

static backward(ctx, grad_outputs)

Implements straight through estimation with clipping.

static forward(ctx, inputs, block_size, amax, num_bits, scale_bits, trt_high_precision_dtype='Half', onnx_quantizer_type='dynamic')

Forward method.

static symbolic(g, inputs, block_size, amax, num_bits, scale_bits, trt_high_precision_dtype='Half', onnx_quantizer_type='dynamic')

ONNX symbolic function.
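The behavior described above can be sketched in plain PyTorch. This is a hypothetical illustration, not the library's actual implementation: the function name and the reshape-based blocking are assumptions; "dynamic" means each block's scale is derived on the fly from that block's absolute maximum.

```python
import torch

def dynamic_block_fake_quant(inputs, block_size, num_bits=8):
    """Sketch of dynamic block fake quantization (hypothetical helper).

    Each contiguous group of `block_size` elements gets its own scale,
    computed on the fly from the block's absolute maximum.
    """
    orig_shape = inputs.shape
    x = inputs.reshape(-1, block_size)            # one row per block
    amax = x.abs().amax(dim=1, keepdim=True)      # per-block absolute max
    bound = 2.0 ** (num_bits - 1) - 1             # e.g. 127 for num_bits=8
    scale = bound / amax.clamp(min=1e-12)         # guard against zero blocks
    q = torch.round(x * scale).clamp(-bound, bound)
    return (q / scale).reshape(orig_shape)        # dequantize back to float

x = torch.tensor([0.1, -0.5, 2.0, 1.0])
y = dynamic_block_fake_quant(x, block_size=2)    # two blocks, two scales
```

Because the scales are recomputed from the inputs on every call, no calibration pass is needed, at the cost of extra runtime work per block.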

class FakeAffineTensorQuantFunction

Bases: Function

Fake version of affine quantization.

gemmlowp-style scale+shift quantization. See https://github.com/google/gemmlowp/blob/master/doc/quantization.md for details.

We DO NOT recommend affine quantization of weights, for performance reasons. There may be value in affine-quantizing activations, since the shift can be cancelled by the bias and comes with no performance penalty. This functionality is added for experimental purposes only.

static backward(ctx, grad_outputs)

Implements straight through estimation with clipping.

Parameters:

Returns:

A tensor of gradient

Return type:

grad_inputs

static forward(ctx, inputs, min_range, max_range, num_bits=8)

As it is only applied to activations with per-tensor granularity, broadcasting is not needed.

Parameters:

Returns:

A Tensor of type output_dtype

Return type:

outputs
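The scale+shift scheme referenced above can be sketched as follows. This is a minimal illustration of gemmlowp-style asymmetric fake quantization, assuming the standard scale/zero-point formulation; the function name is hypothetical and the real forward may differ in details such as rounding mode.

```python
import torch

def fake_affine_quant(inputs, min_range, max_range, num_bits=8):
    """Sketch of affine (asymmetric) fake quantization (hypothetical).

    Real values in [min_range, max_range] are mapped onto the unsigned
    integer grid [0, 2**num_bits - 1] via a scale and a zero-point.
    """
    qmax = 2 ** num_bits - 1                      # 255 for 8 bits
    scale = (max_range - min_range) / qmax
    zero_point = round(-min_range / scale)        # integer position of real 0
    q = torch.clamp(torch.round(inputs / scale) + zero_point, 0, qmax)
    return (q - zero_point) * scale               # dequantize
```

Note that real zero maps exactly onto an integer grid point (the zero-point), which is the gemmlowp property that lets the shift be absorbed into a bias term.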

class FakeTensorQuantFunction

Bases: Function

Fake version of TensorQuantFunction using the CUDA extension.

static backward(ctx, grad_outputs)

Implements straight through estimation with clipping.

static forward(ctx, inputs, amax, num_bits=8, unsigned=False, narrow_range=True, trt_high_precision_dtype='Float')

Forward method.

static symbolic(g, inputs, amax, num_bits=8, unsigned=False, narrow_range=True, trt_high_precision_dtype='Float')

ONNX symbolic function.
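The forward/backward pair described above (fake quantization with straight-through estimation and clipping) can be sketched as a plain autograd Function. This is a simplified stand-in, not the CUDA-extension implementation: the class name is hypothetical, and only the symmetric, narrow-range path is shown.

```python
import torch

class FakeQuantSTE(torch.autograd.Function):
    """Sketch of symmetric fake quantization with a straight-through
    estimator (hypothetical): forward quantizes then dequantizes;
    backward passes gradients through unchanged inside [-amax, amax]
    and zeroes them outside (clipping)."""

    @staticmethod
    def forward(ctx, inputs, amax, num_bits=8):
        ctx.save_for_backward(inputs, amax)
        bound = 2.0 ** (num_bits - 1) - 1          # 127 for 8 bits
        scale = bound / amax
        return torch.round(inputs * scale).clamp(-bound, bound) / scale

    @staticmethod
    def backward(ctx, grad_outputs):
        inputs, amax = ctx.saved_tensors
        inside = (inputs.abs() <= amax).to(grad_outputs.dtype)
        # No gradients for amax or num_bits in this sketch.
        return grad_outputs * inside, None, None
```

The STE is what makes quantization-aware training possible: `torch.round` has zero gradient almost everywhere, so the backward pass pretends the quantizer is the identity within the clipping range.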

class LegacyFakeTensorQuantFunction

Bases: Function

Fake version of TensorQuantFunction.

See the comments of TensorQuantFunction; the arguments are the same.

static backward(ctx, grad_outputs)

Implements straight through estimation.

static forward(ctx, inputs, amax, num_bits=8, unsigned=False, narrow_range=True)

Forward method.

class ScaledE4M3Function

Bases: Function

E4M3fy input with scale.

static backward(ctx, grad_outputs)

Implements straight through estimation with clipping.

static forward(ctx, inputs, amax, E, M, trt_high_precision_dtype='Float')

Forward method.

static symbolic(g, inputs, amax=None, E=4, M=3, trt_high_precision_dtype='Float')

ONNX symbolic function.

class StaticBlockQuantizationFunction

Bases: FakeTensorQuantFunction

Static block quantization functional.

static backward(ctx, grad_outputs)

Implements straight through estimation with clipping.

static forward(ctx, inputs, amax, num_bits=8, unsigned=False, narrow_range=True, trt_high_precision_dtype='Float', block_size=None)

Forward method.
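The static variant differs from the dynamic one in where the scales come from: the per-block amax is precomputed (e.g. during calibration) and passed in, rather than derived from the inputs at call time. A hypothetical sketch, with the same assumptions as the dynamic example above:

```python
import torch

def static_block_fake_quant(inputs, amax, block_size, num_bits=8):
    """Sketch of static block fake quantization (hypothetical helper).

    `amax` holds one precomputed absolute maximum per block, so the
    quantization grid is fixed across calls.
    """
    x = inputs.reshape(-1, block_size)             # one row per block
    bound = 2.0 ** (num_bits - 1) - 1              # 127 for 8 bits
    scale = bound / amax.reshape(-1, 1)            # one scale per block
    q = torch.round(x * scale).clamp(-bound, bound)
    return (q / scale).reshape(inputs.shape)
```

Fixed scales make the operation cheaper at inference time and exportable with constant scale tensors, at the cost of a calibration step.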

class TensorQuantFunction

Bases: Function

A universal tensor quantization function.

Takes an input tensor and outputs a quantized tensor. The granularity of the scale can be interpreted from the shape of amax. output_dtype indicates whether the quantized value is stored as an integer or a float. Storing it as a float is useful because the PyTorch functions that consume the quantized value, e.g. Conv2D, may not accept integer input.

It uses 2^num_bits - 1 values instead of 2^num_bits, e.g., for num_bits=8, it uses [-127, 127] instead of [-128, 127].

static backward(ctx, grad_outputs, grad_scale)

Implements straight through estimation with clipping.

For -amax <= input <= amax the gradient passes straight through, otherwise the gradient is zero.

Parameters:

Returns:

A tensor of gradient.

Return type:

grad_inputs

static forward(ctx, inputs, amax, num_bits=8, unsigned=False, narrow_range=True, trt_high_precision_dtype='Float')

Forward method.

Following the TensorFlow convention, the max value is passed in and used to derive the scale, instead of passing in the scale directly, even though passing the scale directly might be more natural.

Parameters:

Returns:

outputs: A Tensor of type output_dtype. scale: A Tensor of type float32; outputs / scale dequantizes the outputs tensor.

Return type:

outputs

Raises:

ValueError

static symbolic(g, inputs, amax, num_bits=8, unsigned=False, narrow_range=True, trt_high_precision_dtype='Float')

ONNX symbolic function.
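The outputs/scale contract described above can be sketched as follows. This is a hypothetical simplification (the real forward also handles narrow_range and output_dtype): it shows the signed grid of 2^num_bits - 1 values and how the returned scale dequantizes the outputs.

```python
import torch

def tensor_quant(inputs, amax, num_bits=8, unsigned=False):
    """Sketch of a universal tensor quantizer (hypothetical helper).

    Returns the quantized values together with the scale, so that
    outputs / scale recovers the approximate real values.  The signed
    grid uses 2**num_bits - 1 values, i.e. [-127, 127] for 8 bits.
    """
    if unsigned:
        bound = 2.0 ** num_bits - 1                # [0, 255] for 8 bits
        min_bound = 0.0
    else:
        bound = 2.0 ** (num_bits - 1) - 1          # [-127, 127] for 8 bits
        min_bound = -bound
    scale = bound / amax                           # granularity follows amax's shape
    outputs = torch.round(inputs * scale).clamp(min_bound, bound)
    return outputs, scale

x = torch.tensor([1.0, -1.0, 0.5])
out, scale = tensor_quant(x, torch.tensor(1.0))   # out holds integer values in float
dequant = out / scale                             # approximate reconstruction of x
```

Because amax broadcasts against the inputs, the same code expresses per-tensor, per-channel, or finer granularities simply by the shape of amax.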

fake_quant_impl(inputs, amax, num_bits=8, unsigned=False, narrow_range=True)

Implementation of fake quantizing input according to number of bits.

Parameters:

scaled_e4m3_impl(inputs, amax, disable_fused_kernel=True)

Implementation of fake quantizing input to FP8.

Parameters:

Returns:

Input tensors faked quantized to FP8.

Return type:

Tensor