Method for Hybrid Precision Convolutional Neural Network Representation (original) (raw)

Short Floating-Point Representation for Convolutional Neural Network Inference

IEICE Electronics Express

Convolutional neural networks (CNNs) are being widely used in computer vision tasks, and there have been many efforts to implement CNNs in ASIC or FPGA for power-hungry environments. Instead of the previous common representation, the fixed-point representation, this letter proposes a short floating-point representation for CNNs. The short floating-point representation is based on the normal floating-point representation, but has much less width and does not have complex cases like Not-a-Number and infinity cases. The short floating-point representation, contrary to the common belief, can produce a low-complexity computation logic because the operands of the multiplier logic can be shortened by the exponent concept of the floating-point representation. The exponent can also reduce the total length to reduce the SRAM area. The experimental results show that the short floating-point representation with 8-bit total width achieves less-than-1-percentage-point degradation without the aid of retraining in the top-5 accuracy on very deep CNNs of up to 152 layers and gives more than a 60% area reduction in the ASIC implementation.

Accuracy to Throughput Trade-Offs for Reduced Precision Neural Networks on Reconfigurable Logic

Applied Reconfigurable Computing. Architectures, Tools, and Applications, 2018

Modern Convolutional Neural Networks (CNNs) are typically based on floating point linear algebra based implementations. Recently, reduced precision Neural Networks (NNs) have been gaining popularity as they require significantly less memory and computational resources compared to floating point. This is particularly important in power constrained compute environments. However, in many cases a reduction in precision comes at a small cost to the accuracy of the resultant network. In this work, we investigate the accuracy-throughput trade-off for various parameter precision applied to different types of NN models. We firstly propose a quantization training strategy that allows reduced precision NN inference with a lower memory footprint and competitive model accuracy. Then, we quantitatively formulate the relationship between data representation and hardware efficiency. Our experiments finally provide insightful observation. For example, one of our tests show 32-bit floating point is more hardware efficient than 1-bit parameters to achieve 99% MNIST accuracy. In general, 2-bit and 4-bit fixed point parameters show better hardware trade-off on small-scale datasets like MNIST and CIFAR-10 while 4-bit provide the best trade-off in large-scale tasks like AlexNet on ImageNet dataset within our tested problem domain.

FxP-QNet: A Post-Training Quantizer for the Design of Mixed Low-Precision DNNs With Dynamic Fixed-Point Representation

IEEE Access, 2022

Deep neural networks (DNNs) have demonstrated their effectiveness in a wide range of computer vision tasks, with the state-of-the-art results obtained through complex and deep structures that require intensive computation and memory. In the past, graphic processing units enabled these breakthroughs because of their greater computational speed. Now-a-days, efficient model inference is crucial for consumer applications on resource-constrained platforms. As a result, there is much interest in the research and development of dedicated deep learning (DL) hardware to improve the throughput and energy efficiency of DNNs. Low-precision representation of DNN data-structures through quantization would bring great benefits to specialized DL hardware especially when expensive floating-point operations can be avoided and replaced by more efficient fixed-point operations. However, the rigorous quantization leads to a severe accuracy drop. As such, quantization opens a large hyper-parameter space at bit-precision levels, the exploration of which is a major challenge. In this paper, we propose a novel framework referred to as the Fixed-Point Quantizer of deep neural Networks (FxP-QNet) that flexibly designs a mixed low-precision DNN for integer-arithmeticonly deployment. Specifically, the FxP-QNet gradually adapts the quantization level for each data-structure of each layer based on the trade-off between the network accuracy and the low-precision requirements. Additionally, it employs post-training self-distillation and network prediction error statistics to optimize the quantization of floating-point values into fixed-point numbers. Examining FxP-QNet on state-of-theart architectures and the benchmark ImageNet dataset, we empirically demonstrate the effectiveness of FxP-QNet in achieving the accuracy-compression trade-off without the need for training. The results show that FxP-QNet-quantized AlexNet, VGG-16, and ResNet-18 reduce the overall memory requirements of their full-precision counterparts by 7.16×, 10.36×, and 6.44× with less than 0.95%, 0.95%, and 1.99% accuracy drop, respectively.

Towards a Stable Quantized Convolutional Neural Networks: An Embedded Perspective

Proceedings of the 10th International Conference on Agents and Artificial Intelligence, 2018

Nowadays, convolutional neural network (CNN) plays a major role in the embedded computing environment. Ability to enhance the CNN implementation and performance for embedded devices is an urgent demand. Compressing the network layers parameters and outputs into a suitable precision formats would reduce the required storage and computation cycles in embedded devices. Such enhancement can drastically reduce the consumed power and the required resources, and ultimately reduces cost. In this article, we propose several quantization techniques for quantizing several CNN networks. With a minor degradation of the floating-point performance, the presented quantization methods are able to produce a stable performance fixed-point networks. A precise fixed point calculation for coefficients, input/output signals and accumulators are considered in the quantization process.

FxpNet: Training a deep convolutional neural network in fixed-point representation

2017 International Joint Conference on Neural Networks (IJCNN), 2017

We introduce FxpNet, a framework to train deep convolutional neural networks with low bit-width arithmetics in both forward pass and backward pass. During training procedure, FxpNet further reduces the bit-width of stored parameters (also known as primal parameters) by adaptively updating their fixed-point formats. These primal parameters are ususally represented in full resolution of floating-point values in previous binarzied and quantized neural networks. In FxpNet, during forward pass fixed-point primal weights and activations will first be binarized before computation, while in backward pass all gradients are represented as low resolution fixed-point values and then accumulated to corresponding fixed-point primal parameters. To have highly efficient implementations in FPGAs, ASICs and other dedicated devices, FxpNet introduces Integer Batch Normalization (IBN) and Fixed-point ADAM (FxpADAM) methods to further reduce the required floating-point operations, which will save considerable power and chip area. The evaluation on CIFAR-10 dataset indicates the effectiveness that FxpNet with 12-bit primal parameters and 12-bit gradients achieves comparable prediction accuracy with state-of-art binarized and quantized neural networks. The corresponding training code implemented in TensorFlow is available online.

A Fixed-Point Quantization Technique for Convolutional Neural Networks Based on Weight Scaling

2019 IEEE International Conference on Image Processing (ICIP)

In order to make convolutional neural networks (CNNs) usable on smaller or mobile devices, it is necessary to reduce the computing, energy and storage requirements of these networks. One can achieved this by a fixed-point quantization of weights and activations of a CNN, which are usually represented by 32-bit floating-point. In this paper, we present an adaption of convolutional and fully connected layers in order to obtain a high usage of the given value range of activations and weights. Therefore, we introduce scaling factors obtained by moving average to limit the weights and activations. Our model, quantized to 8 bit, outperforms the 7-layer baseline model from which it is derived and the naive quantization by several percentage points. Our method does not require any additional operations in the inference and both the weights and activations have a fixed radix point.

Low-Precision Floating-Point for Efficient On-Board Deep Neural Network Processing

2023

One of the major bottlenecks in high-resolution Earth Observation (EO) space systems is the downlink between the satellite and the ground. Due to hardware limitations, onboard power limitations or ground-station operation costs, there is a strong need to reduce the amount of data transmitted. Various processing methods can be used to compress the data. One of them is the use of on-board deep learning to extract relevant information in the data. However, most ground-based deep neural network parameters and computations are performed using single-precision floating-point arithmetic, which is not adapted to the context of on-board processing. We propose to rely on quantized neural networks and study how to combine low precision (mini) floating-point arithmetic with a Quantization-Aware Training methodology. We evaluate our approach with a semantic segmentation task for ship detection using satellite images from the Airbus Ship dataset. Our results show that 6-bit floating-point quantization for both weights and activations can compete with single-precision without significant accuracy degradation. Using a Thin U-Net 32 model, only a 0.3% accuracy degradation is observed with 6-bit minifloat quantization (a 6-bit equivalent integer-based approach leads to a 0.5% degradation). An initial hardware study also confirms the potential impact of such lowprecision floating-point designs, but further investigation at the scale of a full inference accelerator is needed before concluding whether they are relevant in a practical on-board scenario.

Multi-precision convolutional neural networks on heterogeneous hardware

2018 Design, Automation & Test in Europe Conference & Exhibition (DATE), 2018

Fully binarised convolutional neural networks (CNNs) deliver very high inference performance using singlebit weights and activations, together with XNOR type operators for the kernel convolutions. Current research shows that full binarisation results in a degradation of accuracy and different approaches to tackle this issue are being investigated such as using more complex models as accuracy reduces. This paper proposes an alternative based on a multi-precision CNN framework that combines a binarised and a floating point CNN in a pipeline configuration deployed on heterogeneous hardware. The binarised CNN is mapped onto an FPGA device and used to perform inference over the whole input set while the floating point network is mapped onto a CPU device and performs reinference only when the classification confidence level is low. A lightweight confidence mechanism enables a flexible trade-off between accuracy and throughput. To demonstrate the concept, we choose a Zynq 7020 device as the hardware target and show that the multi-precision network is able to increase the BNN accuracy from 78.5% to 82.5% and the CPU inference speed from 29.68 to 90.82 images/sec.

Scaling Neural Network Performance through Customized Hardware Architectures on Reconfigurable Logic

2017 IEEE International Conference on Computer Design (ICCD), 2017

Convolutional Neural Networks have dramatically improved in recent years, surpassing human accuracy on certain problems and performance exceeding that of traditional computer vision algorithms. While the compute pattern in itself is relatively simple, significant compute and memory challenges remain as CNNs may contain millions of floating-point parameters and require billions of floating-point operations to process a single image. These computational requirements, combined with storage footprints that exceed typical cache sizes, pose a significant performance and power challenge for modern compute architectures. One of the promising opportunities to scale performance and power efficiency is leveraging reduced precision representations for all activations and weights as this allows to scale compute capabilities, reduce weight and feature map buffering requirements as well as energy consumption. While a small reduction in accuracy is encountered, these Quantized Neural Networks have been shown to achieve stateof-the-art accuracy on standard benchmark datasets, such as MNIST, CIFAR-10, SVHN and even ImageNet, and thus provide highly attractive design trade-offs. Current research has focused mainly on the implementation of extreme variants with full binarization of weights and or activations, as well typically smaller input images. Within this paper, we investigate the scalability of dataflow architectures with respect to supporting various precisions for both weights and activations, larger image dimensions, and increasing numbers of feature map channels. Key contributions are a formalized approach to understanding the scalability of the existing hardware architecture with cost models and a performance prediction as a function of the target device size. We provide validating experimental results for an ImageNet classification on a serverclass platform, namely the AWS F1 node.