Accelerating Deep Neural Networks Implementation: A Survey

Accelerating Deep Neural Networks Using FPGA

2018 30th International Conference on Microelectronics (ICM), 2018

Deep Convolutional Neural Networks (CNNs) are the state-of-the-art systems for image classification and scene understanding. They are widely used for their superior accuracy, but at the cost of high computational complexity. The current goal in this field is to accelerate CNNs so that they can be used in real-time applications. One solution is to use Graphics Processing Units (GPUs), but their high power consumption prevents their use in everyday equipment. The Field Programmable Gate Array (FPGA) is an alternative platform for CNN implementations due to its low power consumption and flexible architecture. This work addresses this problem and provides a solution that balances the speed of the CNN against the power consumption of the FPGA. The solution relies on two main techniques for speeding up computation: parallelism of layer resources and pipelining inside some layers.
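The abstract names its two acceleration levers, parallelism across layer resources and pipelining inside layers, without further detail. The sketch below is a software analogue of coarse-grained pipelining, not the paper's hardware design: a convolution stage and a pooling stage run concurrently on different images, the way adjacent hardware stages overlap in time. All names and dimensions are illustrative.

```python
import queue
import threading

import numpy as np


def conv_stage(in_q, out_q, kernel):
    """Stage 1: a simple 2-D correlation on each incoming image."""
    while True:
        img = in_q.get()
        if img is None:                      # sentinel: shut the pipeline down
            out_q.put(None)
            return
        h, w = img.shape
        k = kernel.shape[0]
        out = np.zeros((h - k + 1, w - k + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = np.sum(img[i:i + k, j:j + k] * kernel)
        out_q.put(out)


def pool_stage(in_q, results):
    """Stage 2: 2x2 max pooling; runs concurrently with stage 1."""
    while True:
        fmap = in_q.get()
        if fmap is None:
            return
        h, w = fmap.shape[0] // 2 * 2, fmap.shape[1] // 2 * 2
        pooled = fmap[:h, :w].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))
        results.append(pooled)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    images = [rng.standard_normal((8, 8)) for _ in range(4)]
    kernel = rng.standard_normal((3, 3))

    q1, q2, results = queue.Queue(maxsize=1), queue.Queue(maxsize=1), []
    t1 = threading.Thread(target=conv_stage, args=(q1, q2, kernel))
    t2 = threading.Thread(target=pool_stage, args=(q2, results))
    t1.start(); t2.start()

    for img in images:                       # feed images; the two stages overlap
        q1.put(img)
    q1.put(None)                             # sentinel
    t1.join(); t2.join()
    print(len(results), "pooled feature maps, each of shape", results[0].shape)
```

In hardware the same idea is realized with dedicated logic per stage and registers between them, so a new image can enter the first stage before the previous one has left the last.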

Acceleration of Deep Neural Network Training Using Field Programmable Gate Arrays

Computational Intelligence and Neuroscience, 2022

Convolutional neural network (CNN) training often requires a considerable amount of computational resources. In recent years, several studies have proposed CNN inference and training accelerators in which FPGAs have demonstrated good performance and energy efficiency. To speed up processing, CNN training demands additional resources such as memory bandwidth, FPGA platform resources, time, power, and large training datasets, and it is constrained by the need for improved hardware acceleration to scale beyond existing data and model sizes. This paper proposes a procedure for energy-efficient CNN training in collaboration with an FPGA-based accelerator. We employed optimizations such as quantization, a common model compression technique, to speed up the CNN training process. Additionally, a gradient accumulation buffer is used to ensure maximum operating efficiency while preserving the gradient descent of the learning algorithm. To validate the design, we implemented the AlexNet and VGG-16 models on an FPGA board and on a laptop CPU alongside a GPU. The design achieves 203.75 GOPS on the Terasic DE1-SoC with the AlexNet model and 196.50 GOPS with the VGG-16 model on the same board. Our results also show that the FPGA accelerator is more energy efficient than the other platforms.
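The abstract mentions quantization and a gradient accumulation buffer without giving formulas. The snippet below is a generic illustration of both ideas, not the paper's actual design: weights are fake-quantized to 8-bit levels for the forward pass, and gradients are accumulated over several micro-batches before a single SGD update is applied. The toy layer, loss gradient, and hyperparameters are assumptions.

```python
import numpy as np


def fake_quantize(w, num_bits=8):
    """Uniform symmetric quantization: round to num_bits levels, return float values."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = np.max(np.abs(w)) / qmax if np.any(w) else 1.0
    return np.round(w / scale) * scale


rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8))               # toy weight matrix
grad_buffer = np.zeros_like(w)                # gradient accumulation buffer
accum_steps, lr = 4, 0.1

for step in range(8):
    x = rng.standard_normal((16, 8))          # one micro-batch of activations
    y = x @ fake_quantize(w).T                # forward pass with quantized weights
    grad_y = y - 1.0                          # stand-in for a real loss gradient
    grad_w = grad_y.T @ x / x.shape[0]

    grad_buffer += grad_w                     # accumulate instead of updating
    if (step + 1) % accum_steps == 0:
        w -= lr * grad_buffer / accum_steps   # one update per accumulated window
        grad_buffer[:] = 0.0

print("trained weight norm:", np.linalg.norm(w))
```

On an accelerator the buffer keeps partial gradients on-chip so that weight memory is touched only once per accumulation window, which is where the efficiency gain comes from.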

Optimizing Accelerator on FPGA for Deep Convolutional Neural Networks

Algorithms and Architectures for Parallel Processing, 2020

With the development of deep learning, traditional neural network architectures have gradually hit a performance bottleneck. Convolutional neural networks (CNNs) have attracted wide attention because of their high accuracy, but they are usually computationally expensive. Beyond the widely used GPUs, which consume considerable energy, FPGAs are increasingly used to accelerate CNNs thanks to their high performance, high concurrency, fast development cycle, and reconfigurability. Although previous works have made considerable progress, few studies have addressed data dependence in the data structures of CNNs, even though data dependence greatly affects accelerator performance. In this paper, we present a way to greatly improve the read efficiency of the accelerator hardware by restructuring the original dataset and using a preload mechanism, which effectively reduces the data-dependence problem. In this way, pipelining can speed up the CNN computation more effectively. We implemented the accelerator architecture on the XC7Z045 board, and the proposed accelerator shows a clear advantage over previous studies in improving the efficiency and processing speed of CNNs.
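The core idea, laying data out so it can be fetched in bursts and preloading the next block while the current one is being consumed, is easiest to see as a ping-pong (double) buffer. The sketch below is an illustrative software analogue under assumed tile sizes and a made-up compute function, not the paper's hardware mechanism.

```python
import numpy as np


def load_tile(data, i, tile):
    """Model an external-memory burst read of one contiguously stored tile."""
    return data[i * tile:(i + 1) * tile].copy()


def compute(tile_data, weights):
    """Model the accelerator's compute stage on an already-loaded tile."""
    return float(tile_data @ weights)


rng = np.random.default_rng(0)
tile, n_tiles = 64, 8
data = rng.standard_normal(tile * n_tiles)     # restructured, tile-contiguous layout
weights = rng.standard_normal(tile)

# Ping-pong buffers: preload tile i+1 while computing on tile i.
buffers = [load_tile(data, 0, tile), None]
results = []
for i in range(n_tiles):
    cur = buffers[i % 2]
    if i + 1 < n_tiles:
        buffers[(i + 1) % 2] = load_tile(data, i + 1, tile)   # preload next tile
    results.append(compute(cur, weights))                      # compute never waits on a read

print("tile results:", np.round(results, 3))
```

In hardware the two buffers are separate BRAM banks, so the load and the compute genuinely overlap instead of being interleaved as they are here.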

FPGA-Based Accelerators of Deep Learning Networks for Learning and Classification: A Review

IEEE Access, 2018

Due to recent advances in digital technologies and the availability of credible data, an area of artificial intelligence, deep learning, has emerged and demonstrated its ability and effectiveness in solving complex learning problems not possible before. In particular, convolutional neural networks (CNNs) have demonstrated their effectiveness in image detection and recognition applications. However, they require intensive CPU operations and memory bandwidth that make general-purpose CPUs fail to achieve the desired performance levels. Consequently, hardware accelerators that use application-specific integrated circuits, field-programmable gate arrays (FPGAs), and graphics processing units have been employed to improve the throughput of CNNs. More precisely, FPGAs have recently been adopted for accelerating the implementation of deep learning networks due to their ability to maximize parallelism and their energy efficiency. In this paper, we review recent techniques for accelerating deep learning networks on FPGAs. We highlight the key features employed by the various techniques to improve acceleration performance. In addition, we provide recommendations for enhancing the utilization of FPGAs for CNN acceleration. The techniques investigated in this paper represent the recent trends in FPGA-based accelerators for deep learning networks. Thus, this paper is expected to guide future advances in efficient hardware accelerators and to be useful for deep learning researchers.
Index Terms: Adaptable architectures, convolutional neural networks (CNNs), deep learning, dynamic reconfiguration, energy-efficient architecture, field programmable gate arrays (FPGAs), hardware accelerator, machine learning, neural networks, optimization, parallel computer architecture, reconfigurable computing.

Field-programmable gate array implementation of efficient deep neural network architecture

International Journal of Electrical and Computer Engineering (IJECE), 2024

A deep neural network (DNN) comprises multiple stages of data-processing subsystems, one of the primary subsystems being a fully connected neural network (FCNN) model. This fully connected model has multiple layers of neurons that must be implemented with arithmetic units and a suitable number representation to optimize area, power, and speed. In this work, the network parameters are analyzed and redundancy in the weights is eliminated. A pipelined and parallel structure is designed for the fully connected network. The proposed FCNN structure has 16 inputs, 3 hidden layers, and an output layer. Each hidden layer consists of 4 neurons, and the design describes how the inputs are connected to the hidden-layer neurons to process the raw data. A hardware description language (HDL) model is developed for the proposed structure, and the verified model is implemented on a Xilinx field-programmable gate array (FPGA). The modified structure comprises registers, demultiplexers, weight registers, multipliers, adders, and a read-only-memory lookup table (ROM/LUT). The modified architecture implemented on the FPGA is estimated to reduce area by 87.5% and improve timing by 3x compared with direct implementation methods.
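The abstract fixes the topology (16 inputs, three hidden layers of 4 neurons each, plus an output layer) and mentions fixed-point weights stored in ROM/LUTs. A minimal behavioral model of that structure is sketched below; the Q-format, the ReLU activation, and the single-neuron output layer are assumptions for illustration, not details given in the paper.

```python
import numpy as np

FRAC_BITS = 8                                  # assumed Q8 fixed-point fraction width


def to_fixed(x):
    """Quantize to signed fixed-point with FRAC_BITS fractional bits (ROM-style storage)."""
    return np.round(x * (1 << FRAC_BITS)).astype(np.int32)


def fixed_layer(x_fp, w_fp, relu=True):
    """Integer multiply-accumulate then rescale; mirrors a multiplier/adder/LUT datapath."""
    acc = w_fp @ x_fp                           # integer MAC
    y = acc >> FRAC_BITS                        # drop the extra fractional bits
    return np.maximum(y, 0) if relu else y


rng = np.random.default_rng(0)
# Topology from the abstract: 16 -> 4 -> 4 -> 4 -> output (assumed width 1).
shapes = [(4, 16), (4, 4), (4, 4), (1, 4)]
weights_fp = [to_fixed(rng.standard_normal(s) * 0.5) for s in shapes]

x = to_fixed(rng.standard_normal(16))
for i, w_fp in enumerate(weights_fp):
    x = fixed_layer(x, w_fp, relu=(i < len(weights_fp) - 1))
print("fixed-point network output:", x)
```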

Deep Learning on FPGAs: Past, Present, and Future

2016

The rapid growth of data size and accessibility in recent years has instigated a shift of philosophy in algorithm design for artificial intelligence. Instead of engineering algorithms by hand, the ability to learn composable systems automatically from massive amounts of data has led to ground-breaking performance in important domains such as computer vision, speech recognition, and natural language processing. The most popular class of techniques used in these domains is called deep learning, and it is seeing significant attention from industry. However, these models require enormous amounts of data and compute power to train, and they are limited by the need for better hardware acceleration to accommodate scaling beyond current data and model sizes. While the current solution has been to use clusters of graphics processing units (GPUs) as general-purpose processors (GPGPU), field-programmable gate arrays (FPGAs) provide an interesting alternative. Current trends in design tools ...

Optimizing the Deep Neural Networks by Layer-Wise Refined Pruning and the Acceleration on FPGA

Computational Intelligence and Neuroscience

To accelerate practical applications of artificial intelligence, this paper proposes a highly efficient layer-wise refined pruning method for deep neural networks at the software level and accelerates the inference process at the hardware level on a field-programmable gate array (FPGA). The refined pruning operation is based on the channel-wise importance indexes of each layer and the layer-wise input sparsity of the convolutional layers. The method exploits the characteristics of the native networks without introducing any extra workload into the training phase. In addition, the operation is easily extended to various state-of-the-art deep neural networks. The effectiveness of the method is verified on ResNet architectures and VGG networks using the CIFAR10, CIFAR100, and ImageNet100 datasets. Experimental results show that for ResNet50 on CIFAR10 and ResNet101 on CIFAR100, more than 85% of parameters and floating-point operations are pruned with only 0.35% and 0.40% accur...
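The abstract does not spell out the importance index it uses, so the sketch below illustrates only the generic channel-pruning mechanics with an L1-norm importance score and per-layer keep ratios; both are common choices but are assumptions here, not necessarily the paper's criterion.

```python
import numpy as np


def prune_conv_channels(weights, keep_ratio):
    """Keep the most important output channels of one convolutional layer.

    weights:    array of shape (out_channels, in_channels, k, k)
    keep_ratio: fraction of output channels to retain in this layer
    Returns the pruned weights and the indices of the kept channels.
    """
    importance = np.abs(weights).sum(axis=(1, 2, 3))        # L1 norm per output channel
    n_keep = max(1, int(round(keep_ratio * weights.shape[0])))
    keep = np.sort(np.argsort(importance)[-n_keep:])         # strongest channels, in order
    return weights[keep], keep


rng = np.random.default_rng(0)
layer_weights = [rng.standard_normal((c, 16, 3, 3)) for c in (32, 64, 128)]

# Layer-wise ratios: later, more redundant layers are pruned more aggressively here.
keep_ratios = [0.8, 0.5, 0.3]
for i, (w, r) in enumerate(zip(layer_weights, keep_ratios)):
    pruned, kept = prune_conv_channels(w, r)
    print(f"layer {i}: {w.shape[0]} -> {pruned.shape[0]} channels "
          f"({100 * (1 - pruned.shape[0] / w.shape[0]):.0f}% pruned)")
```

In a full pipeline the kept-channel indices of one layer also determine which input channels survive in the next layer, which is what makes the pruning structured and directly exploitable by an FPGA accelerator.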

Comparative analysis of the specialized software and hardware for deep learning algorithms

Computer systems and network

Convolutional neural networks (CNNs) have become a practical means of performing vision tasks, particularly in the area of image classification. FPGAs are well known for performing convolutions efficiently; however, most recent efforts to run CNNs on FPGAs have shown limited advantages over other devices such as GPUs. Previous FPGA approaches have often been memory-bound due to the limited external memory bandwidth of the FPGA device. We show a novel architecture written in OpenCL, which we refer to as a Deep Learning Accelerator (DLA), that maximizes data reuse and minimizes external memory bandwidth. Furthermore, we show how the Winograd transform can significantly boost the performance of the FPGA. As a result, when running our DLA on Intel's Arria 10 device, we achieve a performance of 1020 img/s, or 23 img/s/W, on the AlexNet CNN benchmark. This comes to 1382 GFLOPs and is 10x faster, with 8.4x more GFLOPS and 5.8x better efficiency, than the state of the art on FPGAs. Additionally, 23 img/s/W is competitive with the best publicly known implementation of AlexNet on NVIDIA's Titan X GPU.
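The abstract credits much of the speedup to the Winograd transform without showing it. As a reference point (not the DLA's actual kernel), the 1-D F(2,3) case below computes two outputs of a 3-tap filter with 4 multiplications instead of 6; the 2-D F(2x2,3x3) form used for 3x3 convolutions nests the same transforms along both axes.

```python
import numpy as np

# Winograd F(2,3) transform matrices (standard minimal-filtering form).
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])
Bt = np.array([[1.0,  0.0, -1.0,  0.0],
               [0.0,  1.0,  1.0,  0.0],
               [0.0, -1.0,  1.0,  0.0],
               [0.0,  1.0,  0.0, -1.0]])
At = np.array([[1.0, 1.0,  1.0,  0.0],
               [0.0, 1.0, -1.0, -1.0]])


def winograd_f23(d, g):
    """Two outputs of a 3-tap correlation over 4 inputs, using 4 multiplications."""
    return At @ ((G @ g) * (Bt @ d))


rng = np.random.default_rng(0)
d = rng.standard_normal(4)                     # input tile
g = rng.standard_normal(3)                     # filter taps

direct = np.array([d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
                   d[1] * g[0] + d[2] * g[1] + d[3] * g[2]])
print("winograd:", winograd_f23(d, g))
print("direct:  ", direct)                     # matches up to floating-point error
```

The multiplication savings translate directly into fewer DSP blocks per output on an FPGA, which is why the transform is attractive for bandwidth- and resource-bound designs.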

Evaluation of FPGA Acceleration of Neural Networks

2023

This paper explores real-time Convolutional Neural Network (CNN) inference on Field Programmable Gate Arrays (FPGAs) implemented in Synchronous Message Exchange (SME). We compare SME to the widely used FPGA design flow, High-Level Synthesis (HLS), and compare both the SME and HLS implementations of CNNs with a PyTorch CNN implementation on CPU/GPU. We find that the SME implementation is more flexible than the HLS implementation, as it allows more customization of the hardware. Programming with SME is more difficult than with HLS, although easier than with traditional hardware description languages. Finally, for a test use case, we find that the SME implementation on FPGA is approximately 2.8/1.4/2.0 times more energy efficient than CPU/GPU/ARM at larger batch sizes, with the HLS implementation on FPGA falling between CPU/ARM and GPU in terms of energy efficiency. At a batch size of 1, appropriate for edge-device inference, the gap in energy efficiency between the FPGA and CPU/GPU/ARM implementations becomes more pronounced, with the SME implementation on FPGA being approximately 83/47/8 times more energy efficient than the CPU/GPU/ARM implementations, and the HLS implementation on FPGA being approximately 40/23/4 times more energy efficient than the CPU/GPU/ARM implementations.

Optimizing the Convolution Operation to Accelerate Deep Neural Networks on FPGA

IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2018

As convolution contributes most of the operations in a convolutional neural network (CNN), the convolution acceleration scheme significantly affects the efficiency and performance of a hardware CNN accelerator. Convolution involves multiply-and-accumulate (MAC) operations with four levels of loops, which results in a large design space. Prior works either employ limited loop-optimization techniques, e.g., loop unrolling, tiling, and interchange, or only tune some of the design variables after the accelerator architecture and dataflow are already fixed. Without fully studying convolution loop optimization before the hardware design phase, the resulting accelerator can hardly exploit data reuse and manage data movement efficiently. This work overcomes these barriers by quantitatively analyzing and optimizing the design objectives (e.g., memory access) of the CNN accelerator based on multiple design variables. We then propose a specific dataflow for hardware CNN acceleration that minimizes data communication while maximizing resource utilization to achieve high performance. The proposed CNN acceleration scheme and architecture are demonstrated by implementing end-to-end CNNs, including NiN, VGG-16, and ResNet-50/ResNet-152, for inference. For the VGG-16 CNN, the overall throughput reaches 348 GOPS and 715 GOPS on Intel Stratix V and Arria 10 FPGAs, respectively.
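The paper's central step, analyzing memory traffic as a function of the loop-tiling variables before fixing the dataflow, can be approximated with a simple access-count model. The sketch below follows the common tiled-convolution analysis (input, weight, and output buffers sized by the tile dimensions); the layer shape, the budget, the cost formulas, and the tiling names Tm, Tn, Tr, Tc are the usual notation and rough approximations, not the paper's exact model.

```python
import itertools

# Toy layer: M output maps, N input maps, R x C output pixels, K x K kernel, stride 1.
M, N, R, C, K = 64, 64, 32, 32, 3


def dram_accesses(Tm, Tn, Tr, Tc):
    """Approximate external-memory word accesses for one layer under a given tiling.

    Each (Tm, Tn, Tr, Tc) tile loads an input tile and a weight tile; output
    tiles are written once after all N input maps have been accumulated.
    """
    tiles = ((M + Tm - 1) // Tm) * ((N + Tn - 1) // Tn) \
            * ((R + Tr - 1) // Tr) * ((C + Tc - 1) // Tc)
    inp = tiles * Tn * (Tr + K - 1) * (Tc + K - 1)        # input tile loads
    wts = tiles * Tm * Tn * K * K                          # weight tile loads
    out = M * R * C                                        # each output written once
    return inp + wts + out


def buffer_words(Tm, Tn, Tr, Tc):
    """On-chip buffer requirement (input + weight + output tiles), in words."""
    return Tn * (Tr + K - 1) * (Tc + K - 1) + Tm * Tn * K * K + Tm * Tr * Tc


# Search the tiling space under an assumed on-chip buffer budget and report the best point.
budget = 16 * 1024                                         # assumed buffer budget (words)
candidates = [1, 2, 4, 8, 16, 32, 64]
best = min(
    (t for t in itertools.product(candidates, repeat=4)
     if t[0] <= M and t[1] <= N and t[2] <= R and t[3] <= C
     and buffer_words(*t) <= budget),
    key=lambda t: dram_accesses(*t),
)
print("best tiling (Tm, Tn, Tr, Tc):", best, "-> DRAM accesses:", dram_accesses(*best))
```

Sweeping such a model over the design variables is what lets the dataflow be chosen before the hardware is fixed, rather than tuned afterwards.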