Ease-of-use quantization for PyTorch with Intel® Neural Compressor — PyTorch Tutorials 2.7.0+cu126 documentation


Created On: Jan 11, 2022 | Last Updated: Aug 27, 2024 | Last Verified: Not Verified

Overview

Most deep learning applications use 32-bit floating-point precision for inference, but low-precision data types, especially int8, are receiving growing attention because of the significant performance boost they provide. A key concern when adopting low precision is how to easily mitigate the possible accuracy loss and reach a predefined accuracy requirement.

Intel® Neural Compressor aims to address this concern by extending PyTorch with accuracy-driven automatic tuning strategies that help users quickly find the best quantized model on Intel hardware, including hardware with Intel Deep Learning Boost (Intel DL Boost) and Intel Advanced Matrix Extensions (Intel AMX).

Intel® Neural Compressor has been released as an open-source project on GitHub.

Features

This tutorial mainly focuses on the quantization part. For how to use Intel® Neural Compressor for pruning and distillation, please refer to the corresponding documents in the Intel® Neural Compressor GitHub repo.

Getting Started

Installation

install stable version from pip

pip install neural-compressor

install nightly version from pip

pip install -i https://test.pypi.org/simple/ neural-compressor

install stable version from conda

conda install neural-compressor -c conda-forge -c intel

Supported Python versions are 3.6, 3.7, 3.8, and 3.9.

Usages

Only minor code changes are required for users to get started with the Intel® Neural Compressor quantization API. Both PyTorch FX graph mode and eager mode are supported.

Intel® Neural Compressor takes an FP32 model and a YAML configuration file as inputs. To construct the quantization process, users can specify the settings below either via the YAML configuration file or via Python APIs:

  1. Calibration Dataloader (Needed for static quantization)
  2. Evaluation Dataloader
  3. Evaluation Metric

Intel® Neural Compressor supports some popular dataloaders and evaluation metrics. For how to configure them in the YAML configuration file, users can refer to Built-in Datasets.

If users want to use a self-developed dataloader or evaluation metric, Intel® Neural Compressor supports this through the registration of a customized dataloader/metric in Python code.
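To make the customized-dataloader idea concrete, below is a minimal, self-contained sketch of what such an object typically looks like: an iterable that yields (input, label) batches and exposes a batch_size attribute. The class name and the exact interface Intel® Neural Compressor expects are assumptions here; consult the project's dataloader documentation for the authoritative contract.

```python
# Hypothetical customized dataloader sketch (CustomDataloader is not an
# Intel® Neural Compressor class). The key idea: the object is iterable
# and yields (input, label) batches.
class CustomDataloader:
    def __init__(self, dataset, batch_size=1):
        self.dataset = dataset          # list of (input, label) samples
        self.batch_size = batch_size

    def __iter__(self):
        batch_in, batch_label = [], []
        for inp, label in self.dataset:
            batch_in.append(inp)
            batch_label.append(label)
            if len(batch_in) == self.batch_size:
                yield batch_in, batch_label
                batch_in, batch_label = [], []
        if batch_in:                    # trailing partial batch
            yield batch_in, batch_label

samples = [([0.1, 0.2], 0), ([0.3, 0.4], 1), ([0.5, 0.6], 0)]
loader = CustomDataloader(samples, batch_size=2)
batches = list(loader)
print(len(batches))  # 2: one full batch and one partial batch
```

Such an object would then be assigned to the quantizer's calib_dataloader or eval_dataloader attribute in place of a built-in loader.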

For the YAML configuration file format, please refer to the yaml template.

The code changes required for Intel® Neural Compressor are highlighted with a comment on the line above them.

Model

In this tutorial, the LeNet model is used to demonstrate how to work with Intel® Neural Compressor.

main.py

import torch
import torch.nn as nn
import torch.nn.functional as F

LeNet Model definition

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
        self.conv2_drop = nn.Dropout2d()
        self.fc1 = nn.Linear(320, 50)
        self.fc1_drop = nn.Dropout()
        self.fc2 = nn.Linear(50, 10)

    def forward(self, x):
        x = F.relu(F.max_pool2d(self.conv1(x), 2))
        x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2))
        x = x.reshape(-1, 320)
        x = F.relu(self.fc1(x))
        x = self.fc1_drop(x)
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)

model = Net()
model.load_state_dict(torch.load('./lenet_mnist_model.pth', weights_only=True))

The pretrained model weight lenet_mnist_model.pth comes from here.

Accuracy-driven quantization

Intel® Neural Compressor supports accuracy-driven automatic tuning to generate the optimal int8 model which meets a predefined accuracy goal.

Below is an example of how to quantize a simple network in PyTorch FX graph mode by auto-tuning.

conf.yaml

model:
  name: LeNet
  framework: pytorch_fx

evaluation:
  accuracy:
    metric:
      topk: 1

tuning:
  accuracy_criterion:
    relative: 0.01

main.py

model.eval()

from torchvision import datasets, transforms
test_loader = torch.utils.data.DataLoader(
    datasets.MNIST('./data', train=False, download=True,
                   transform=transforms.Compose([
                       transforms.ToTensor(),
                   ])),
    batch_size=1)

launch code for Intel® Neural Compressor

from neural_compressor.experimental import Quantization
quantizer = Quantization("./conf.yaml")
quantizer.model = model
quantizer.calib_dataloader = test_loader
quantizer.eval_dataloader = test_loader
q_model = quantizer()
q_model.save('./output')

In the conf.yaml file, the built-in metric top1 of Intel® Neural Compressor is specified as the evaluation method, and a 1% relative accuracy loss is set as the accuracy target for auto-tuning. Intel® Neural Compressor will traverse all possible quantization configuration combinations at the per-op level to find the optimal int8 model that reaches the predefined accuracy target.
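To picture what traversing per-op configuration combinations means, here is a purely illustrative sketch (not Intel® Neural Compressor internals): if each quantizable op can independently stay in fp32 or be converted to int8, the candidate configurations form a Cartesian product that the tuner explores until one meets the accuracy criterion. The op names below are taken from the LeNet example for illustration.

```python
# Illustrative search-space sketch: each op independently gets a precision,
# so candidates grow as choices ** num_ops.
from itertools import product

ops = ['conv1', 'conv2', 'fc1', 'fc2']
choices = ['int8', 'fp32']

# Every assignment of one precision per op is one candidate config.
configs = list(product(choices, repeat=len(ops)))
print(len(configs))  # 2**4 = 16 candidate configurations
```

In practice the real tuning space also covers per-op settings such as quantization scheme and calibration parameters, and the tuner uses strategies rather than brute-force enumeration, so this sketch only conveys the combinatorial idea.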

Besides those built-in metrics, Intel® Neural Compressor also supports customized metrics through Python code:

conf.yaml

model:
  name: LeNet
  framework: pytorch_fx

tuning:
  accuracy_criterion:
    relative: 0.01

main.py

model.eval()

from torchvision import datasets, transforms
test_loader = torch.utils.data.DataLoader(
    datasets.MNIST('./data', train=False, download=True,
                   transform=transforms.Compose([
                       transforms.ToTensor(),
                   ])),
    batch_size=1)

define a customized metric

class Top1Metric(object):
    def __init__(self):
        self.correct = 0
    def update(self, output, label):
        pred = output.argmax(dim=1, keepdim=True)
        self.correct += pred.eq(label.view_as(pred)).sum().item()
    def reset(self):
        self.correct = 0
    def result(self):
        return 100. * self.correct / len(test_loader.dataset)

launch code for Intel® Neural Compressor

from neural_compressor.experimental import Quantization
quantizer = Quantization("./conf.yaml")
quantizer.model = model
quantizer.calib_dataloader = test_loader
quantizer.eval_dataloader = test_loader
quantizer.metric = Top1Metric()
q_model = quantizer()
q_model.save('./output')

In the above example, a class that implements update(), reset(), and result() methods is defined to record the per-mini-batch results and compute the final accuracy at the end.

Quantization aware training

Besides post-training static quantization and post-training dynamic quantization, Intel® Neural Compressor supports quantization-aware training with an accuracy-driven automatic tuning mechanism.

Below is an example of how to do quantization-aware training on a simple network in PyTorch FX graph mode.

conf.yaml

model:
  name: LeNet
  framework: pytorch_fx

quantization:
  approach: quant_aware_training

evaluation:
  accuracy:
    metric:
      topk: 1

tuning:
  accuracy_criterion:
    relative: 0.01

main.py

model.eval()

from torchvision import datasets, transforms
train_loader = torch.utils.data.DataLoader(
    datasets.MNIST('./data', train=True, download=True,
                   transform=transforms.Compose([
                       transforms.ToTensor(),
                       transforms.Normalize((0.1307,), (0.3081,))
                   ])),
    batch_size=64, shuffle=True)
test_loader = torch.utils.data.DataLoader(
    datasets.MNIST('./data', train=False, download=True,
                   transform=transforms.Compose([
                       transforms.ToTensor(),
                       transforms.Normalize((0.1307,), (0.3081,))
                   ])),
    batch_size=1)

import torch.optim as optim
optimizer = optim.SGD(model.parameters(), lr=0.0001, momentum=0.1)

def training_func(model):
    model.train()
    for epoch in range(1, 3):
        for batch_idx, (data, target) in enumerate(train_loader):
            optimizer.zero_grad()
            output = model(data)
            loss = F.nll_loss(output, target)
            loss.backward()
            optimizer.step()
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_idx * len(data), len(train_loader.dataset),
                100. * batch_idx / len(train_loader), loss.item()))

launch code for Intel® Neural Compressor

from neural_compressor.experimental import Quantization
quantizer = Quantization("./conf.yaml")
quantizer.model = model
quantizer.q_func = training_func
quantizer.eval_dataloader = test_loader
q_model = quantizer()
q_model.save('./output')

Performance-only quantization

Intel® Neural Compressor supports directly yielding an int8 model from a dummy dataset for performance benchmarking purposes.

Below is an example of how to quantize a simple network in PyTorch FX graph mode with a dummy dataset.

conf.yaml

model:
  name: lenet
  framework: pytorch_fx

main.py

model.eval()

launch code for Intel® Neural Compressor

from neural_compressor.experimental import Quantization, common
from neural_compressor.experimental.data.datasets.dummy_dataset import DummyDataset
quantizer = Quantization("./conf.yaml")
quantizer.model = model
quantizer.calib_dataloader = common.DataLoader(DummyDataset([(1, 1, 28, 28)]))
q_model = quantizer()
q_model.save('./output')

Quantization outputs

Users can see how many ops were quantized from the log printed by Intel® Neural Compressor, like the one below:

2021-12-08 14:58:35 [INFO] |*****Mixed Precision Statistics*****|
2021-12-08 14:58:35 [INFO] +------------------------+--------+-------+
2021-12-08 14:58:35 [INFO] |        Op Type         | Total  | INT8  |
2021-12-08 14:58:35 [INFO] +------------------------+--------+-------+
2021-12-08 14:58:35 [INFO] |  quantize_per_tensor   |   2    |   2   |
2021-12-08 14:58:35 [INFO] |         Conv2d         |   2    |   2   |
2021-12-08 14:58:35 [INFO] |       max_pool2d       |   1    |   1   |
2021-12-08 14:58:35 [INFO] |          relu          |   1    |   1   |
2021-12-08 14:58:35 [INFO] |       dequantize       |   2    |   2   |
2021-12-08 14:58:35 [INFO] |       LinearReLU       |   1    |   1   |
2021-12-08 14:58:35 [INFO] |         Linear         |   1    |   1   |
2021-12-08 14:58:35 [INFO] +------------------------+--------+-------+

The quantized model will be generated under the ./output directory, which contains two files:

  1. best_configure.yaml
  2. best_model_weights.pt

The first file contains the quantization configuration of each op; the second contains the int8 weights along with the zero point and scale information of the activations.

Deployment

Users can use the code below to load the quantized model and then run inference or a performance benchmark.

from neural_compressor.utils.pytorch import load
int8_model = load('./output', model)