A streaming accelerator of Convolutional Neural Networks for resource-limited applications

Extensible Embedded Processor for Convolutional Neural Networks

Sci. Program., 2021

Convolutional neural networks (CNNs) require significant computing power during inference. Smartphones, for example, may not run a facial recognition system or search algorithm smoothly due to the lack of resources and supporting hardware. Methods for reducing memory size and increasing execution speed have been explored, but choosing effective techniques for an application requires extensive knowledge of the network architecture. This paper proposes a general approach to preparing a compressed deep neural network processor for inference with minimal additions to existing microprocessor hardware. To show the benefits of the proposed approach, an example CNN for synthetic aperture radar target classification is modified and complementary custom processor instructions are designed. The modified CNN is examined to show the effects of the modifications, and the custom processor instructions are profiled to illustrate the potential performance increase from the new extended instructions.
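
To make the nature of such an extension concrete, the sketch below contrasts a plain C multiply-accumulate inner loop with one built around a hypothetical packed-MAC intrinsic; the intrinsic name (cnn_mac4) and the int8 quantization are illustrative assumptions, not the instructions designed in the paper.

```c
/* Sketch: the multiply-accumulate inner loop of a quantized convolution,
 * first in plain C and then using a hypothetical custom MAC intrinsic.
 * The intrinsic name (cnn_mac4) and the int8 quantization are illustrative
 * assumptions, not the extended instructions proposed in the paper. */
#include <stdint.h>

/* Plain C baseline: one K-tap dot product of a quantized kernel. */
int32_t dot_product_ref(const int8_t *w, const int8_t *x, int k)
{
    int32_t acc = 0;
    for (int i = 0; i < k; i++)
        acc += (int32_t)w[i] * (int32_t)x[i];   /* candidate for a fused MAC instruction */
    return acc;
}

#ifdef HAS_CUSTOM_MAC
/* Hypothetical extended instruction: multiply-accumulate four int8 pairs
 * packed into 32-bit operands in a single issue. */
extern int32_t cnn_mac4(int32_t acc, uint32_t packed_w, uint32_t packed_x);

int32_t dot_product_ext(const uint32_t *w4, const uint32_t *x4, int k4)
{
    int32_t acc = 0;
    for (int i = 0; i < k4; i++)
        acc = cnn_mac4(acc, w4[i], x4[i]);      /* 4 MACs per instruction */
    return acc;
}
#endif
```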

Optimization of Convolutional Neural Networks on Resource Constrained Devices

Implementing convolutional neural networks (CNNs) on resource-constrained devices such as FPGAs (for example, the Zynq) is important for bringing intelligence to edge computing. This paper presents and discusses different hardware optimization methods employed to design a CNN model that is amenable to such devices in general. Adaptive processing and exploitation of parallelism are employed to show the superior performance of the proposed methods over the state of the art.

Lowering Dynamic Power of a Stream-based CNN Hardware Accelerator

2019 IEEE 21st International Workshop on Multimedia Signal Processing (MMSP), 2019

Custom hardware accelerators of Convolutional Neural Networks (CNN) provide a promising solution to meet real-time constraints for a wide range of applications on low-cost embedded devices. In this work, we aim to lower the dynamic power of a stream-based CNN hardware accelerator by reducing the computational redundancies in the CNN layers. In particular, we investigate the redundancies due to the downsampling effect of max pooling layers which are prevalent in state-of-the-art CNNs, and propose an approximation method to reduce the overall computations. The experimental results show that the proposed method leads to lower dynamic power without sacrificing accuracy.
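
One way to picture the redundancy being exploited: with 2x2, stride-2 max pooling, three of every four convolution outputs in a pooling window are discarded. The sketch below evaluates only one candidate position per pooling window; this particular skipping rule is an illustrative assumption, not the approximation method proposed in the paper.

```c
/* Sketch: skipping convolution outputs that a following 2x2, stride-2
 * max-pooling layer would mostly discard.  Computing only one of the four
 * positions per pooling window is an illustrative approximation, not the
 * method from the paper. */
void conv_then_pool_approx(const float *in, const float *kern,
                           float *pooled, int H, int W, int K)
{
    int outH = H - K + 1, outW = W - K + 1;
    for (int py = 0; py + 1 < outH; py += 2) {
        for (int px = 0; px + 1 < outW; px += 2) {
            /* only 1 of the 4 convolutions in this pooling window */
            float acc = 0.0f;
            for (int ky = 0; ky < K; ky++)
                for (int kx = 0; kx < K; kx++)
                    acc += in[(py + ky) * W + (px + kx)] * kern[ky * K + kx];
            pooled[(py / 2) * (outW / 2) + (px / 2)] = acc; /* stands in for the max */
        }
    }
}
```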

CARLA: A Convolution Accelerator With a Reconfigurable and Low-Energy Architecture

IEEE Transactions on Circuits and Systems I: Regular Papers

Convolutional Neural Networks (CNNs) have proven to be extremely accurate for image recognition, even outperforming human recognition capability. When deployed on battery-powered mobile devices, efficient computer architectures are required to enable fast and energy-efficient computation of costly convolution operations. Despite recent advances in hardware accelerator design for CNNs, two major problems have not yet been addressed effectively, particularly when the convolution layers have highly diverse structures: (1) minimizing energy-hungry off-chip DRAM data movements; (2) maximizing the utilization factor of processing resources to perform convolutions. This work thus proposes an energy-efficient architecture equipped with several optimized dataflows to support the structural diversity of modern CNNs. The proposed approach is evaluated by implementing convolutional layers of VGGNet-16 and ResNet-50. Results show that the architecture achieves a Processing Element (PE) utilization factor of 98% for the majority of 3×3 and 1×1 convolutional layers, while limiting latency to 396.9 ms and 92.7 ms when performing convolutional layers of VGGNet-16 and ResNet-50, respectively. In addition, the proposed architecture benefits from the structured sparsity in ResNet-50 to reduce the latency to 42.5 ms when half of the channels are pruned.
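
The utilization figure quoted above can be grounded with a back-of-the-envelope model: useful MAC operations divided by the MAC slots the PE array offers while it is busy. The array size and cycle overhead below are illustrative assumptions, not CARLA's actual mapping.

```c
/* Sketch: estimating PE utilization for one convolutional layer mapped onto
 * a PE array.  The array size and the fixed pipeline overhead are
 * illustrative assumptions, not the dataflow of the paper. */
#include <stdio.h>

double pe_utilization(long useful_macs, long num_pes, long busy_cycles)
{
    /* Each PE retires one MAC per cycle in this simple model. */
    return (double)useful_macs / ((double)num_pes * (double)busy_cycles);
}

int main(void)
{
    /* Example: a 3x3 conv with a 56x56x64 output and 64 input channels. */
    long macs = 56L * 56 * 64 * 64 * 3 * 3;
    long pes = 256;                 /* assumed PE array size */
    long cycles = macs / pes + 500; /* assumed pipeline fill/drain overhead */
    printf("utilization = %.1f%%\n", 100.0 * pe_utilization(macs, pes, cycles));
    return 0;
}
```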

An Energy-Efficient Accelerator Architecture with Serial Accumulation Dataflow for Deep CNNs

2020 18th IEEE International New Circuits and Systems Conference (NEWCAS), 2020

Convolutional Neural Networks (CNNs) have shown outstanding accuracy for many vision tasks during recent years. When deploying CNNs on portable devices and embedded systems, however, the large number of parameters and computations result in long processing time and low battery life. An important factor in designing CNN hardware accelerators is to efficiently map the convolution computation onto hardware resources. In addition, to save battery life and reduce energy consumption, it is essential to reduce the number of DRAM accesses since DRAM consumes orders of magnitude more energy compared to other operations in hardware. In this paper, we propose an energy-efficient architecture which maximally utilizes its computational units for convolution operations while requiring a low number of DRAM accesses. The implementation results show that the proposed architecture performs one image recognition task using the VGGNet model with a latency of 393 ms and only 251.5 MB of DRAM accesses.
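
The mapping idea of keeping partial sums inside the compute units so they never travel back to memory can be sketched as an output-stationary loop order, where each output is fully accumulated before being written once. The loop order below is an illustrative assumption, not the exact serial-accumulation dataflow of the paper.

```c
/* Sketch: output-stationary accumulation.  Each output value is fully
 * accumulated over all input channels and kernel taps in a local register
 * before being written out once, so no partial sums return to memory.
 * The loop order is an illustrative assumption, not the paper's dataflow. */
void conv_output_stationary(const float *in,   /* [C][H][W]          */
                            const float *w,    /* [M][C][K][K]       */
                            float *out,        /* [M][H-K+1][W-K+1]  */
                            int C, int M, int H, int W, int K)
{
    int oH = H - K + 1, oW = W - K + 1;
    for (int m = 0; m < M; m++)
        for (int y = 0; y < oH; y++)
            for (int x = 0; x < oW; x++) {
                float acc = 0.0f;                    /* stays local to the PE */
                for (int c = 0; c < C; c++)
                    for (int ky = 0; ky < K; ky++)
                        for (int kx = 0; kx < K; kx++)
                            acc += in[(c * H + y + ky) * W + x + kx]
                                 * w[((m * C + c) * K + ky) * K + kx];
                out[(m * oH + y) * oW + x] = acc;    /* single write per output */
            }
}
```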

MulNet: A Flexible CNN Processor With Higher Resource Utilization Efficiency for Constrained Devices

IEEE Access, 2019

Leveraging deep convolutional neural networks (DCNNs) for various application areas has become a recent inclination of many machine learning practitioners due to their impressive performance. Research trends show that state-of-the-art networks are getting deeper and deeper, and such networks have shown significant performance increases. Deeper and larger neural networks imply an increase in computational intensity and memory footprint. This is particularly a problem for inference-based applications on resource-constrained computing platforms. On the other hand, field-programmable gate arrays (FPGAs) are becoming a promising choice for hardware implementations of most deep learning workloads due to their high-performance and low-power features. With the rapid emergence of various state-of-the-art CNN architectures, a flexible CNN hardware processor that can handle different CNN architectures and yet customize itself to achieve higher resource efficiency and optimum performance is critically important. In this paper, a novel and highly flexible DCNN processor, MulNet, is proposed. MulNet can be used to process most regular state-of-the-art CNN variants while maximizing resource utilization of a target device. Processing cores with and without multipliers are employed to achieve this. We formulated an optimum fixed-point quantization format for MulNet by analyzing layer-by-layer quantization error, and created a power-of-2 quantization for the multiplier-free (MF) processing core of MulNet. Both quantizations significantly reduced the memory space needed and the logic consumption in the target device. We utilized Xilinx Zynq SoCs to leverage the single-die hybrid (CPU and FPGA) architecture, devising a scheme that uses the Zynq processing system (PS) for memory-intensive layers and the Zynq programmable logic (PL) for computationally intensive layers. We implemented modified LeNet, CIFAR-10 full, ConvNet processor (CNP), MPCNN, and AlexNet to evaluate MulNet. Our architecture with MF processing cores shows promising results, saving 36%-72% of on-chip memory and 10%-44% of DSP48 IPs compared to the architecture with multiplier-based cores. Comparison with the state of the art showed a very promising 25-40× DSP48 and 25-29× on-chip memory reduction, with up to 136.9 GOP/s performance and 88.49 GOP/s/W power efficiency. Hence, our results demonstrate that the proposed architecture can be very expedient for resource-constrained devices.
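
The multiplier-free core rests on a simple observation: if every weight is quantized to a signed power of two, each multiplication collapses to an arithmetic shift. A minimal sketch of that idea follows; the rounding rule and bit handling are illustrative assumptions, not MulNet's exact quantization format.

```c
/* Sketch: power-of-2 weight quantization turning multiplies into shifts.
 * The nearest-power rounding and bit handling are illustrative assumptions,
 * not the exact quantization format derived in the paper. */
#include <math.h>
#include <stdint.h>

/* Quantize a real weight to sign * 2^exp. */
void quantize_pow2(float w, int *sign, int *exp)
{
    *sign = (w < 0.0f) ? -1 : 1;
    float a = fabsf(w);
    *exp = (a > 0.0f) ? (int)lroundf(log2f(a)) : -31;   /* clamp zero weights */
}

/* Multiplier-free MAC: activation * (sign * 2^exp) as a shift and add. */
int32_t mac_pow2(int32_t acc, int32_t act, int sign, int exp)
{
    int32_t p = (exp >= 0) ? (act << exp) : (act >> -exp);  /* shift replaces multiply */
    return acc + (sign < 0 ? -p : p);
}
```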

AoCStream: All-on-Chip CNN Accelerator with Stream-Based Line-Buffer Architecture and Accelerator-Aware Pruning

Sensors

Convolutional neural networks (CNNs) play a crucial role in many EdgeAI and TinyML applications, but their implementation usually requires external memory, which degrades their feasibility in such resource-constrained environments. To solve this problem, this paper proposes memory-reduction methods at the algorithm and architecture levels, implementing a reasonable-performance CNN with the on-chip memory of a practical device. At the algorithm level, accelerator-aware pruning is adopted to reduce the amount of weight memory. For activation memory reduction, a stream-based line-buffer architecture is proposed. In the proposed architecture, each layer is implemented by a dedicated block, and the layer blocks operate in a pipelined way. Each block has a line buffer to store a few rows of input data instead of a frame buffer to store the whole feature map, reducing the intermediate data-storage size. The experimental results show that the object-detection CNNs of MobileNetV1/V2 and an SSDLite variant,...
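
The line-buffer idea can be sketched in a few lines of C: a KxK convolution only needs the most recent K rows of its input, so a block keeps K line buffers instead of a frame buffer and emits one output per incoming pixel once the buffers are primed. The fixed 3x3 window, single channel, and maximum width below are illustrative simplifications.

```c
/* Sketch: stream-based line buffer for a 3x3 convolution.  Only the last
 * 3 input rows are kept on chip; a full frame buffer is never allocated.
 * Single channel and a fixed window size are illustrative simplifications. */
#define W_MAX 640   /* assumed maximum line width */

typedef struct {
    float line[3][W_MAX];   /* 3 rows instead of a whole feature map */
    int   width, row, col;  /* zero-initialize, then set width before streaming */
} LineBuf3x3;

/* Push one pixel; returns 1 and writes *out when a 3x3 window is complete. */
int linebuf_push(LineBuf3x3 *lb, float px, const float k[3][3], float *out)
{
    lb->line[lb->row % 3][lb->col] = px;
    int ready = (lb->row >= 2 && lb->col >= 2);
    if (ready) {
        float acc = 0.0f;
        for (int ky = 0; ky < 3; ky++)
            for (int kx = 0; kx < 3; kx++)
                acc += lb->line[(lb->row - 2 + ky) % 3][lb->col - 2 + kx] * k[ky][kx];
        *out = acc;
    }
    if (++lb->col == lb->width) { lb->col = 0; lb->row++; }
    return ready;
}
```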

Implementing Convolutional Neural Networks on FPGA: A Survey and Research

ITM Web of Conferences

The implementation of CNNs on FPGAs is of increasing importance due to the growing demand for low-power, high-performance edge AI applications. This paper presents a comprehensive survey and study of the topic, with a focus on comparing and evaluating the performance of two main FPGA architectures: streaming and single-unit computing. The study includes a detailed evaluation of the state-of-the-art CNNs LeNet-5 and YOLOv2 on both FPGA architectures. The results provide useful insights into the trade-offs involved, limitations, challenges, and the complexity of implementing CNNs on FPGAs. The paper highlights the difficulties and intricacies involved in implementing CNNs on FPGAs and provides potential solutions for improving performance and efficiency.

Optimizing the Convolution Operation to Accelerate Deep Neural Networks on FPGA

IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2018

As convolution contributes most operations in a convolutional neural network (CNN), the convolution acceleration scheme significantly affects the efficiency and performance of a hardware CNN accelerator. Convolution involves multiply and accumulate (MAC) operations with four levels of loops, which results in a large design space. Prior works either employ limited loop optimization techniques, e.g. loop unrolling, tiling and interchange, or only tune some of the design variables after the accelerator architecture and dataflow are already fixed. Without fully studying the convolution loop optimization before the hardware design phase, the resulting accelerator can hardly exploit the data reuse and manage data movement efficiently. This work overcomes these barriers by quantitatively analyzing and optimizing the design objectives (e.g. memory access) of the CNN accelerator based on multiple design variables. Then, we propose a specific dataflow of hardware CNN acceleration to minimize the data communication while maximizing the resource utilization to achieve high performance. The proposed CNN acceleration scheme and architecture are demonstrated by implementing end-to-end CNNs including NiN, VGG-16 and ResNet-50/ResNet-152 for inference. For VGG-16 CNN, the overall throughputs achieve 348 GOPS and 715 GOPS on Intel Stratix V and Arria 10 FPGAs, respectively.
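
The four loop levels referred to above (kernel window, input feature maps, output pixels, and output feature maps) and the tiling and interchange applied to them can be sketched as follows; the tile sizes are placeholders, not the design variables optimized in the paper.

```c
/* Sketch: the four convolution loop levels with output-channel and output-pixel
 * tiling marked.  Tile sizes (TM, TY, TX) are placeholders, not the design
 * variables derived in the paper.  out is assumed zero-initialized. */
#define TM 8    /* output-channel tile: unrolled across parallel MAC units */
#define TY 4    /* output-row tile    */
#define TX 4    /* output-column tile */

void conv_tiled(const float *in, const float *w, float *out,
                int M, int C, int H, int W, int K)
{
    int oH = H - K + 1, oW = W - K + 1;
    for (int m0 = 0; m0 < M; m0 += TM)          /* loop 4: output feature maps (tiled) */
      for (int y0 = 0; y0 < oH; y0 += TY)       /* loop 3: output rows (tiled)         */
        for (int x0 = 0; x0 < oW; x0 += TX)     /* loop 3: output columns (tiled)      */
          for (int c = 0; c < C; c++)           /* loop 2: input feature maps          */
            for (int m = m0; m < m0 + TM && m < M; m++)
              for (int y = y0; y < y0 + TY && y < oH; y++)
                for (int x = x0; x < x0 + TX && x < oW; x++)
                  for (int ky = 0; ky < K; ky++)        /* loop 1: kernel window */
                    for (int kx = 0; kx < K; kx++)      /*         (unroll here) */
                      out[(m * oH + y) * oW + x] +=
                          in[(c * H + y + ky) * W + x + kx]
                        * w[((m * C + c) * K + ky) * K + kx];
}
```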

A Hardware Accelerator for the Inference of a Convolutional Neural Network

Ciencia e Ingeniería Neogranadina, 2019

Convolutional Neural Networks (CNNs) are becoming increasingly popular in deep learning applications, e.g. image classification, speech recognition, and medicine, to name a few. However, CNN inference is computationally intensive and demands a large amount of memory resources. In this work, a CNN inference hardware accelerator implemented in a co-processing scheme is proposed. The aim is to reduce the hardware resources and achieve the best possible throughput. The design was implemented on the Digilent Arty Z7-20 development board, which is based on the Xilinx Zynq-7000 System on Chip (SoC). Our implementation achieved an accuracy of … for the MNIST database using only a 12-bit fixed-point format. The results show that the co-processing scheme, operating at a conservative speed of 100 MHz, can identify around 441 images per second, which is about 17% faster than a 650 MHz software implementation. It is difficult to compare our results against other implementations ba...
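
The 12-bit fixed-point format mentioned above can be illustrated with a small conversion and MAC helper; the Q4.8 split (4 integer bits, 8 fractional bits) is an assumption for illustration, since the abstract does not state where the binary point sits.

```c
/* Sketch: 12-bit fixed-point arithmetic for CNN inference.  The Q4.8 split
 * (4 integer + 8 fractional bits) is an illustrative assumption; the abstract
 * only states that 12 bits were used, not where the binary point sits. */
#include <stdint.h>

#define FRAC_BITS 8
#define FX_MAX    ((1 << 11) - 1)   /* 12-bit signed range:  2047 */
#define FX_MIN    (-(1 << 11))      /*                      -2048 */

typedef int16_t fx12_t;             /* 12-bit value stored in 16 bits */

static fx12_t fx_from_float(float x)
{
    long v = (long)(x * (1 << FRAC_BITS) + (x >= 0 ? 0.5f : -0.5f));
    if (v > FX_MAX) v = FX_MAX;     /* saturate instead of wrapping */
    if (v < FX_MIN) v = FX_MIN;
    return (fx12_t)v;
}

/* Multiply-accumulate: 12x12-bit product kept in 32 bits. */
static int32_t fx_mac(int32_t acc, fx12_t a, fx12_t b)
{
    return acc + ((int32_t)a * (int32_t)b);      /* accumulator in Q8.16 */
}

static fx12_t fx_result(int32_t acc)
{
    return (fx12_t)(acc >> FRAC_BITS);           /* back to Q4.8 (no saturation shown) */
}
```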