Quantization aware training

Maintained by TensorFlow Model Optimization

There are two forms of quantization: post-training quantization and quantization aware training. Start with post-training quantization since it's easier to use, though quantization aware training is often better for model accuracy.

This page provides an overview on quantization aware training to help you determine how it fits with your use case.

Overview

Quantization aware training emulates inference-time quantization, creating a model that downstream tools will use to produce actually quantized models. The quantized models use lower precision (e.g. 8-bit instead of 32-bit float), leading to benefits during deployment.
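
As a minimal sketch of what this looks like in practice, the snippet below wraps a toy Keras model (a stand-in for your own model) with the quantize_model API from the tensorflow_model_optimization package:

    import tensorflow as tf
    import tensorflow_model_optimization as tfmot

    # A toy model standing in for your own. quantize_model wraps it with
    # fake-quantization ops so training emulates 8-bit inference.
    model = tf.keras.Sequential([
        tf.keras.layers.InputLayer(input_shape=(28, 28)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(10)
    ])
    q_aware_model = tfmot.quantization.keras.quantize_model(model)

    # The quantization-aware model trains like any other Keras model.
    q_aware_model.compile(
        optimizer='adam',
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=['accuracy'])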

Deploy with quantization

Quantization brings improvements via model compression and latency reduction. With the API defaults, the model size shrinks by 4x, and we typically see between 1.5 - 4x improvements in CPU latency on the tested backends. In the future, latency improvements can also be seen on compatible machine learning accelerators, such as the EdgeTPU and NNAPI.
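
As an illustrative sketch (assuming the q_aware_model from the overview snippet above), producing an actually quantized TFLite model uses the standard converter with default optimizations:

    import tensorflow as tf

    # Convert the quantization-aware model into an actually quantized
    # TFLite model; weights and activations become 8-bit integers,
    # which is where the ~4x size reduction comes from.
    converter = tf.lite.TFLiteConverter.from_keras_model(q_aware_model)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    quantized_tflite_model = converter.convert()

    with open('quantized_model.tflite', 'wb') as f:
        f.write(quantized_tflite_model)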

The technique is used in production in speech, vision, text, and translation use cases. The code currently supports a subset of these models.

Experiment with quantization and associated hardware

Users can configure the quantization parameters (e.g. the number of bits) and, to some degree, the underlying algorithms. Note that with these changes from the API defaults, there is currently no supported path for deployment to a backend. For instance, TFLite conversion and kernel implementations only support 8-bit quantization.
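
As a sketch of this kind of configuration, the snippet below uses the experimental QuantizeConfig API to quantize one Dense layer to 4 bits; the 4-bit settings are purely illustrative and, per the note above, have no supported deployment path:

    import tensorflow as tf
    import tensorflow_model_optimization as tfmot

    LastValueQuantizer = tfmot.quantization.keras.quantizers.LastValueQuantizer
    MovingAverageQuantizer = tfmot.quantization.keras.quantizers.MovingAverageQuantizer

    class Dense4BitQuantizeConfig(tfmot.quantization.keras.QuantizeConfig):
        # Quantize the Dense layer's kernel and activation to 4 bits
        # instead of the 8-bit default.
        def get_weights_and_quantizers(self, layer):
            return [(layer.kernel, LastValueQuantizer(
                num_bits=4, symmetric=True, narrow_range=False, per_axis=False))]

        def get_activations_and_quantizers(self, layer):
            return [(layer.activation, MovingAverageQuantizer(
                num_bits=4, symmetric=False, narrow_range=False, per_axis=False))]

        def set_quantize_weights(self, layer, quantize_weights):
            layer.kernel = quantize_weights[0]

        def set_quantize_activations(self, layer, quantize_activations):
            layer.activation = quantize_activations[0]

        def get_output_quantizers(self, layer):
            return []

        def get_config(self):
            return {}

    # Annotate the layer to configure, then apply quantization inside a
    # scope that lets Keras deserialize the custom config.
    annotated_model = tfmot.quantization.keras.quantize_annotate_model(
        tf.keras.Sequential([
            tfmot.quantization.keras.quantize_annotate_layer(
                tf.keras.layers.Dense(20, input_shape=(10,)),
                Dense4BitQuantizeConfig()),
            tf.keras.layers.Dense(10)
        ]))

    with tfmot.quantization.keras.quantize_scope(
            {'Dense4BitQuantizeConfig': Dense4BitQuantizeConfig}):
        quant_aware_model = tfmot.quantization.keras.quantize_apply(annotated_model)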

APIs specific to this configuration are experimental and not subject to backward compatibility.

API compatibility

Users can apply quantization with the following APIs:

It is on our roadmap to add support in the following areas:

General support matrix

Support is available in the following areas:

It is on our roadmap to add support in the following areas:

Results

Image classification with tools

Model             Non-quantized Top-1 Accuracy   8-bit Quantized Accuracy
MobilenetV1 224   71.03%                         71.06%
Resnet v1 50      76.3%                          76.1%
MobilenetV2 224   70.77%                         70.01%

The models were tested on Imagenet and evaluated in both TensorFlow and TFLite.

Image classification for technique

Model             Non-quantized Top-1 Accuracy   8-bit Quantized Accuracy
Nasnet-Mobile     74%                            73%
Resnet-v2 50      75.6%                          75%

The models were tested on Imagenet and evaluated in both TensorFlow and TFLite.
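
For the TFLite side of such evaluations, the general pattern is to run the converted model through the TFLite interpreter. Below is a minimal sketch, assuming the quantized_tflite_model bytes from the deployment snippet above and hypothetical test_images/test_labels arrays standing in for an evaluation set:

    import numpy as np
    import tensorflow as tf

    # Run the converted model with the TFLite interpreter and measure
    # top-1 accuracy; test_images/test_labels are placeholders.
    interpreter = tf.lite.Interpreter(model_content=quantized_tflite_model)
    interpreter.allocate_tensors()
    input_index = interpreter.get_input_details()[0]['index']
    output_index = interpreter.get_output_details()[0]['index']

    correct = 0
    for image, label in zip(test_images, test_labels):
        interpreter.set_tensor(input_index, image[np.newaxis, ...].astype(np.float32))
        interpreter.invoke()
        correct += int(interpreter.get_tensor(output_index).argmax() == label)
    accuracy = correct / len(test_images)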

Examples

In addition to the quantization aware training example, see the following examples:

For background, see the Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference paper. This paper introduces some of the concepts that this tool uses. The implementation is not exactly the same, and the tool uses additional concepts (e.g. per-axis quantization).
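
As a hedged illustration of the per-axis idea (a NumPy sketch, not the tool's implementation): per-tensor quantization shares one scale and zero point across an entire weight tensor, while per-axis quantization computes an independent pair per output channel, tracking each channel's range more tightly:

    import numpy as np

    def affine_quantize(x, num_bits=8):
        # Affine quantization as in the paper: q = round(x / scale) + zero_point.
        qmin, qmax = 0, 2 ** num_bits - 1
        scale = (x.max() - x.min()) / (qmax - qmin)
        zero_point = int(qmin - round(x.min() / scale))
        q = np.clip(np.round(x / scale) + zero_point, qmin, qmax)
        return q.astype(np.uint8), scale, zero_point

    weights = np.random.randn(3, 3, 16, 32).astype(np.float32)  # HWIO conv kernel

    # Per-tensor: a single (scale, zero_point) for the whole tensor.
    q_per_tensor, scale, zp = affine_quantize(weights)

    # Per-axis: an independent (scale, zero_point) per output channel.
    q_per_axis = [affine_quantize(weights[..., c]) for c in range(weights.shape[-1])]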