Quantizing Convolutional Neural Networks for Low-Power High-Throughput Inference Engines

Low-bit Quantization of Neural Networks for Efficient Inference

2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), 2019

Recent machine learning methods use increasingly large deep neural networks to achieve state-of-the-art results in various tasks. The gains in performance come at the cost of a substantial increase in computation and storage requirements. This makes real-time implementations on resource-limited hardware a challenging task. One popular approach to address this challenge is to perform low-bit precision computations via neural network quantization. However, aggressive quantization generally entails a severe penalty in terms of accuracy, and often requires retraining of the network or resorting to higher-bit-precision quantization. In this paper, we formalize the linear quantization task as a Minimum Mean Squared Error (MMSE) problem for both weights and activations, allowing low-bit-precision inference without full network retraining. The main contributions of our approach are the optimization of the constrained MSE problem at each layer of the network, the hardware-aware partitioning of the network parameters, and the use of multiple low-precision quantized tensors for poorly approximated layers. The proposed approach enables 4-bit integer (INT4) quantization for deployment of pretrained models on limited hardware resources. Multiple experiments on various network architectures show that the suggested method yields state-of-the-art results with minimal loss of task accuracy.
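A minimal sketch of the per-layer idea: for a fixed bit width, search over clipping scales and keep the one that minimizes the reconstruction MSE. The symmetric linear quantizer and the grid search below are illustrative assumptions; the paper's exact optimizer and hardware-aware partitioning are not reproduced here.

```python
import numpy as np

def mmse_linear_quantize(x, n_bits=4, n_grid=100):
    """Pick the clipping scale that minimizes the MSE between a tensor and
    its symmetric linear quantization (an illustrative formulation)."""
    q_max = 2 ** (n_bits - 1) - 1                    # 7 for symmetric INT4
    max_abs = np.abs(x).max()
    best_mse, best_scale = np.inf, max_abs / q_max
    for frac in np.linspace(0.2, 1.0, n_grid):       # candidate clip points
        scale = frac * max_abs / q_max
        q = np.clip(np.round(x / scale), -q_max - 1, q_max)
        mse = np.mean((x - q * scale) ** 2)
        if mse < best_mse:
            best_mse, best_scale = mse, scale
    return best_scale, best_mse

w = np.random.randn(256, 256).astype(np.float32)
scale, mse = mmse_linear_quantize(w, n_bits=4)
print(f"INT4 scale={scale:.5f}, reconstruction MSE={mse:.6f}")
```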

Accelerating Deep Learning Model Inference on Arm CPUs with Ultra-Low Bit Quantization and Runtime

arXiv, 2022

Deep Learning has been one of the most disruptive technological advancements in recent times. The high performance of deep learning models comes at the expense of high computational, storage, and power requirements. Sensing the immediate need for accelerating and compressing these models to improve on-device performance, we introduce Deeplite Neutrino for production-ready optimization of the models and Deeplite Runtime for deployment of ultra-low-bit quantized models on Arm-based platforms. We implement low-level quantization kernels for the Armv7 and Armv8 architectures, enabling deployment on the vast array of 32-bit and 64-bit Arm-based devices. With efficient implementations using vectorization, parallelization, and tiling, we realize speedups of up to 2x and 2.2x compared to TensorFlow Lite with the XNNPACK backend on classification and detection models, respectively. We also achieve significant speedups of up to 5x and 3.2x compared to ONNX Runtime for classification and detection models, respectively.
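The kernels themselves are hand-tuned Arm NEON code, which Python cannot reproduce, but the storage-side trick is easy to illustrate: ultra-low-bit codes are bit-packed so several weights share one byte. The 2-bit layout below is an assumed example, not Deeplite's actual format.

```python
import numpy as np

def pack_2bit(q):
    """Pack 2-bit codes (0..3) four per byte, as ultra-low-bit kernels do
    for weight storage (the layout here is an illustrative assumption)."""
    q = np.asarray(q, dtype=np.uint8).reshape(-1, 4)
    return q[:, 0] | (q[:, 1] << 2) | (q[:, 2] << 4) | (q[:, 3] << 6)

def unpack_2bit(packed):
    """Recover the four 2-bit codes stored in each byte."""
    packed = np.asarray(packed, dtype=np.uint8)
    return np.stack([(packed >> s) & 0b11 for s in (0, 2, 4, 6)], axis=1).ravel()

codes = np.random.randint(0, 4, size=64)
assert np.array_equal(unpack_2bit(pack_2bit(codes)), codes)
print(codes.size // 4, "bytes packed vs", codes.size * 4, "bytes as FP32")
```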

Low-Precision Floating-Point for Efficient On-Board Deep Neural Network Processing

2023

One of the major bottlenecks in high-resolution Earth Observation (EO) space systems is the downlink between the satellite and the ground. Due to hardware limitations, on-board power limitations, or ground-station operation costs, there is a strong need to reduce the amount of data transmitted. Various processing methods can be used to compress the data. One of them is the use of on-board deep learning to extract relevant information in the data. However, most ground-based deep neural network parameters and computations are performed using single-precision floating-point arithmetic, which is not adapted to the context of on-board processing. We propose to rely on quantized neural networks and study how to combine low-precision (mini) floating-point arithmetic with a Quantization-Aware Training methodology. We evaluate our approach with a semantic segmentation task for ship detection using satellite images from the Airbus Ship dataset. Our results show that 6-bit floating-point quantization for both weights and activations can compete with single-precision without significant accuracy degradation. Using a Thin U-Net 32 model, only a 0.3% accuracy degradation is observed with 6-bit minifloat quantization (a 6-bit equivalent integer-based approach leads to a 0.5% degradation). An initial hardware study also confirms the potential impact of such low-precision floating-point designs, but further investigation at the scale of a full inference accelerator is needed before concluding whether they are relevant in a practical on-board scenario.
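A sketch of the rounding step behind such a 6-bit minifloat. The 1 sign / 3 exponent / 2 mantissa split is an assumed example split; the paper's exact format and rounding details may differ.

```python
import numpy as np

def minifloat_quantize(x, exp_bits=3, man_bits=2):
    """Round values to the nearest number representable in a small
    (sign, exponent, mantissa) float format; a sketch, not the paper's code."""
    bias = 2 ** (exp_bits - 1) - 1
    x = np.asarray(x, dtype=np.float64)
    mag = np.abs(x)
    # per-value exponent, clamped to the format's range (low end = subnormals)
    e = np.floor(np.log2(np.maximum(mag, np.finfo(float).tiny)))
    e = np.clip(e, 1 - bias, (2 ** exp_bits - 2) - bias)
    step = 2.0 ** (e - man_bits)              # spacing of representable values
    q = np.sign(x) * np.round(mag / step) * step
    max_val = (2 - 2.0 ** -man_bits) * 2.0 ** ((2 ** exp_bits - 2) - bias)
    return np.clip(q, -max_val, max_val)

w = np.array([0.037, -0.8, 1.31, 5.0])
print(minifloat_quantize(w))   # 6-bit minifloat reconstruction of the inputs
```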

Bit Efficient Quantization for Deep Neural Networks

2019 Fifth Workshop on Energy Efficient Machine Learning and Cognitive Computing - NeurIPS Edition (EMC2-NIPS), 2019

Quantization for deep neural networks has afforded models for edge devices that use less on-board memory and enable efficient low-power inference. In this paper, we present a comparison of model-parameter-driven quantization approaches that can achieve as low as 3-bit precision without affecting accuracy. The post-training quantization approaches are data-free, and the resulting weight values are closely tied to the dataset distribution on which the model converged to optimality. We show quantization results for a number of state-of-the-art deep neural networks (DNNs) using large datasets like ImageNet. To better analyze the quantization results, we describe the overall range and local sparsity of values afforded through various quantization schemes. We also show methods to lower bit precision beyond quantization limits with object class clustering.
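One common data-free, model-parameter-driven scheme of the kind compared here is per-channel symmetric quantization, sketched below at 3 bits. This is an illustrative baseline, not necessarily the paper's best-performing variant.

```python
import numpy as np

def per_channel_quantize(w, n_bits=3):
    """Data-free post-training quantization of a weight matrix:
    one symmetric scale per output channel (an illustrative scheme)."""
    q_max = 2 ** (n_bits - 1) - 1                      # 3 for 3-bit symmetric
    scale = np.abs(w).max(axis=1, keepdims=True) / q_max
    scale[scale == 0] = 1.0                            # guard all-zero channels
    q = np.clip(np.round(w / scale), -q_max - 1, q_max)
    return q.astype(np.int8), scale

w = np.random.randn(8, 64).astype(np.float32)
q, s = per_channel_quantize(w, n_bits=3)
print("max abs reconstruction error:", np.abs(w - q * s).max())
```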

Short Floating-Point Representation for Convolutional Neural Network Inference

IEICE Electronics Express

Convolutional neural networks (CNNs) are being widely used in computer vision tasks, and there have been many efforts to implement CNNs in ASICs or FPGAs for power-constrained environments. Instead of the previously common fixed-point representation, this letter proposes a short floating-point representation for CNNs. The short floating-point representation is based on the normal floating-point representation, but has a much smaller width and does not have special cases such as Not-a-Number and infinity. Contrary to common belief, the short floating-point representation can produce low-complexity computation logic, because the operands of the multiplier logic can be shortened by the exponent concept of the floating-point representation. The exponent can also reduce the total length, which reduces the SRAM area. The experimental results show that the short floating-point representation with 8-bit total width achieves less than 1 percentage point of top-5 accuracy degradation, without the aid of retraining, on very deep CNNs of up to 152 layers, and gives more than a 60% area reduction in the ASIC implementation.
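A sketch of how such an 8-bit short float can be encoded and decoded. The 1-4-3 sign/exponent/mantissa split is an assumption for illustration, and, following the letter's point about dropping special cases, every exponent code encodes an ordinary value and out-of-range inputs saturate instead of becoming NaN/infinity.

```python
import math

EXP_BITS, MAN_BITS = 4, 3                       # assumed 1-4-3 split
BIAS = 2 ** (EXP_BITS - 1) - 1
E_MAX = (2 ** EXP_BITS - 1) - BIAS              # top code is a normal value
MAX_VAL = (2 - 2.0 ** -MAN_BITS) * 2.0 ** E_MAX

def encode_shortfp(x):
    """Encode into an 8-bit code; no NaN/infinity encodings exist."""
    sign = 1 if x < 0 else 0
    mag = abs(float(x))
    if mag >= MAX_VAL:                           # saturate at the largest value
        e_code, m = E_MAX + BIAS, 2 ** MAN_BITS - 1
    elif mag < 2.0 ** (1 - BIAS):                # subnormal range (incl. zero)
        m = round(mag * 2.0 ** (BIAS - 1 + MAN_BITS))
        e_code, m = (0, m) if m < 2 ** MAN_BITS else (1, 0)
    else:                                        # normal: implicit leading 1
        e = math.floor(math.log2(mag))
        m = round(mag / 2.0 ** e * 2 ** MAN_BITS) - 2 ** MAN_BITS
        if m == 2 ** MAN_BITS:                   # rounded up into next binade
            m, e = 0, e + 1
        e_code = e + BIAS
    return (sign << (EXP_BITS + MAN_BITS)) | (e_code << MAN_BITS) | m

def decode_shortfp(code):
    sign = -1.0 if code >> (EXP_BITS + MAN_BITS) else 1.0
    e_code = (code >> MAN_BITS) & (2 ** EXP_BITS - 1)
    m = code & (2 ** MAN_BITS - 1)
    if e_code == 0:                              # subnormal: no implicit 1
        return sign * m * 2.0 ** (1 - BIAS - MAN_BITS)
    return sign * (1 + m / 2 ** MAN_BITS) * 2.0 ** (e_code - BIAS)

for v in (0.8125, -3.3, 1e-4, 1e6):
    print(v, "->", decode_shortfp(encode_shortfp(v)))
```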

Hybrid 8-bit Floating Point (HFP8) Training and Inference for Deep Neural Networks

2019

Reducing the numerical precision of data and computation is extremely effective in accelerating deep learning training workloads. Towards this end, 8-bit floating-point representations (FP8) were recently proposed for DNN training. However, their applicability was demonstrated on only a few selected models, and significant degradation is observed when popular networks such as MobileNet and Transformer are trained using FP8. This degradation is due to the inherent difference in precision requirements between the forward and backward passes of DNN training. Using theoretical insights, we propose a hybrid FP8 (HFP8) format and a DNN end-to-end distributed training procedure. We demonstrate, using HFP8, the successful training of deep learning models across a whole spectrum of applications including image classification, object detection, language and speech without accuracy degradation. Finally, we demonstrate that, by using the new 8-bit format, we can directly quantize a pre-trained model down to ...
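A sketch of the hybrid idea: round forward-pass tensors with one FP8 format and gradients with a wider-range one. The (1-4-3, bias 4) forward and (1-5-2, bias 15) backward parameters below follow my reading of the paper's format description, and the nearest-value rounding helper is a simplification.

```python
import numpy as np

def fp_round(x, exp_bits, man_bits, bias):
    """Round to the nearest value of a small float format (simplified:
    ignores special encodings, saturates at the largest finite value)."""
    x = np.asarray(x, dtype=np.float64)
    mag = np.abs(x)
    e = np.floor(np.log2(np.maximum(mag, np.finfo(float).tiny)))
    e = np.clip(e, 1 - bias, (2 ** exp_bits - 1) - bias)
    step = 2.0 ** (e - man_bits)
    max_val = (2 - 2.0 ** -man_bits) * 2.0 ** ((2 ** exp_bits - 1) - bias)
    return np.clip(np.sign(x) * np.round(mag / step) * step, -max_val, max_val)

# HFP8: 1-4-3 with a shifted bias for weights/activations (forward),
# 1-5-2 for gradients (backward), which need more dynamic range.
forward_q  = lambda t: fp_round(t, exp_bits=4, man_bits=3, bias=4)
backward_q = lambda t: fp_round(t, exp_bits=5, man_bits=2, bias=15)

g = np.array([3.1e-5, -0.02, 0.7])
print("fwd:", forward_q(g))      # the tiny value flushes to zero
print("bwd:", backward_q(g))     # it survives the wider-range format
```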

Benchmarking Convolutional Neural Network Inference on Low-Power Edge Devices

ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

The massive adoption of IoT devices, the recent improvements in the efficiency of AI systems, and the increase in edge computational power have accelerated the deployment of edge AI systems. Implementing these systems on low-power embedded devices scattered across the edges of a network reduces latency and cost compared to traditional cloud-based AI computing systems. Given the availability of low-complexity AI models and low-power embedded systems on the market, this paper provides a comparative study of the inference performance of convolutional neural networks on different edge devices, exploiting low-power GPUs and dedicated AI hardware. The benchmarks achieved 864 inferences/s on the Jetson AGX Xavier board with a pre-trained SqueezeNet, while reaching a high power efficiency of 52.6 inferences/s per watt. The dedicated Movidius neural stick requires only 1.5 W to process 24.2 inferences/s.
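A sketch of how such throughput figures are typically measured: warm up, time a fixed number of single-image inferences, and divide an externally measured board power draw into the rate. The `run_model` stand-in and the 5 W figure below are placeholders, not values from the paper.

```python
import time
import numpy as np

def benchmark(run_model, n_warmup=50, n_iters=500, power_w=None):
    """Time single-image inferences and report inferences/s (and per watt,
    if a measured board power draw is supplied)."""
    for _ in range(n_warmup):                 # let clocks and caches settle
        run_model()
    t0 = time.perf_counter()
    for _ in range(n_iters):
        run_model()
    ips = n_iters / (time.perf_counter() - t0)
    if power_w:
        print(f"{ips:.1f} inf/s, {ips / power_w:.1f} inf/s per watt")
    return ips

# toy stand-in for a real network's forward pass
a = np.random.randn(128, 128).astype(np.float32)
benchmark(lambda: np.tanh(a @ a), power_w=5.0)
```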

Training Deep Neural Networks with 8-bit Floating Point Numbers

arXiv, 2018

The state-of-the-art hardware platforms for training Deep Neural Networks (DNNs) are moving from traditional single-precision (32-bit) computations towards 16 bits of precision, in large part due to the high energy efficiency and smaller bit storage associated with using reduced-precision representations. However, unlike inference, training with numbers represented with less than 16 bits has been challenging due to the need to maintain fidelity of the gradient computations during back-propagation. Here we demonstrate, for the first time, the successful training of DNNs using 8-bit floating point numbers while fully maintaining the accuracy on a spectrum of Deep Learning models and datasets. In addition to reducing the data and computation precision to 8 bits, we also successfully reduce the arithmetic precision for additions (used in partial product accumulation and weight updates) from 32 bits to 16 bits through the introduction of a number of key ideas including chunk-based accumulation and floating point stochastic rounding. The use of these novel techniques lays the foundation for a new generation of hardware training platforms with the potential for 2-4× improved throughput over today's systems.
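Two of the key ideas lend themselves to a short sketch: chunk-based accumulation, which keeps any single FP16 accumulator from swamping its addends, and stochastic rounding, shown here on a uniform grid for clarity (the paper applies it to FP16 values).

```python
import numpy as np

def naive_fp16_sum(x):
    """One long FP16 running sum: small addends get swamped once the
    accumulator grows, so the result stalls well below the true sum."""
    acc = np.float16(0)
    for v in np.asarray(x, dtype=np.float16):
        acc = np.float16(acc + v)
    return acc

def chunked_fp16_sum(x, chunk=64):
    """Chunk-based accumulation (a sketch of the paper's idea): sum short
    chunks separately, then sum the per-chunk partials."""
    x = np.asarray(x, dtype=np.float16)
    partials = [naive_fp16_sum(x[i:i + chunk]) for i in range(0, len(x), chunk)]
    return naive_fp16_sum(partials)

def stochastic_round(x, step, rng):
    """Stochastic rounding on a uniform grid: round up with probability
    equal to the remainder, so tiny updates are unbiased in expectation
    instead of always rounding to zero."""
    scaled = np.asarray(x, dtype=np.float64) / step
    lo = np.floor(scaled)
    return (lo + (rng.random(np.shape(x)) < scaled - lo)) * step

vals = np.full(10000, 1e-3)
print("naive FP16 sum:  ", naive_fp16_sum(vals))     # stalls far below 10
print("chunked FP16 sum:", chunked_fp16_sum(vals))   # close to the true 10.0
rng = np.random.default_rng(0)
u = 0.3 * 2.0 ** -10                                 # update below half an ulp
print("nearest rounding:", np.round(u / 2 ** -10) * 2 ** -10)  # always 0.0
print("stochastic mean: ", stochastic_round(np.full(100000, u), 2 ** -10, rng).mean())
```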

BMPQ: Bit-Gradient Sensitivity-Driven Mixed-Precision Quantization of DNNs from Scratch

2022 Design, Automation & Test in Europe Conference & Exhibition (DATE)

Large DNNs with mixed-precision quantization can achieve ultra-high compression while retaining high classification performance. However, because of the challenge of finding an accurate metric to guide the optimization process, these methods either sacrifice significant performance compared to the 32-bit floating-point (FP-32) baseline or rely on a compute-expensive, iterative training policy that requires the availability of a pre-trained baseline. To address this issue, this paper presents BMPQ, a training method that uses bit gradients to analyze layer sensitivities and yield mixed-precision quantized models. BMPQ requires a single training iteration but does not need a pre-trained baseline. It uses an integer linear program (ILP) to dynamically adjust the precision of layers during training, subject to a fixed hardware budget. To evaluate the efficacy of BMPQ, we conduct extensive experiments with VGG16 and ResNet18 on the CIFAR-10, CIFAR-100, and Tiny-ImageNet datasets. Compared to the baseline FP-32 models, BMPQ can yield models that have 15.4× fewer parameter bits with a negligible drop in accuracy. Compared to the SOTA "during training" mixed-precision training scheme, our models are 2.1×, 2.2×, and 2.9× smaller on CIFAR-10, CIFAR-100, and Tiny-ImageNet, respectively, with an improved accuracy of up to 14.54%.
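The ILP itself needs a solver, but the allocation idea can be sketched with a greedy stand-in: keep lowering the precision of the layer with the lowest accuracy penalty per parameter bit saved until the model fits the bit budget. The sensitivity scores and layer sizes below are hypothetical, and the greedy rule is a simplification of the paper's ILP.

```python
def allocate_bits(sensitivity, n_params, budget_bits, choices=(2, 4, 8)):
    """Greedy stand-in for BMPQ's ILP: start every layer at the highest
    precision and repeatedly step down the layer with the lowest
    sensitivity-per-parameter until the bit budget is met."""
    bits = [max(choices)] * len(sensitivity)
    while sum(b * n for b, n in zip(bits, n_params)) > budget_bits:
        candidates = [i for i, b in enumerate(bits) if b > min(choices)]
        if not candidates:
            break                              # infeasible even at min precision
        # sensitivity per parameter ~ accuracy penalty per bit saved
        i = min(candidates, key=lambda i: sensitivity[i] / n_params[i])
        bits[i] = max(c for c in choices if c < bits[i])
    return bits

sens = [0.9, 0.2, 0.05, 0.6]                   # hypothetical bit-gradient scores
size = [30_000, 250_000, 500_000, 60_000]      # parameters per layer
budget = int(0.15 * 32 * sum(size))            # ~15% of the FP-32 footprint
print(allocate_bits(sens, size, budget))       # -> [8, 8, 2, 8]
```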