Pouya Haghi | Boston University

Papers by Pouya Haghi

FPGA-Accelerated Range-Limited Molecular Dynamics

IEEE Transactions on Computers, 2024

SmartFuse: Reconfigurable Smart Switches to Accelerate Fused Collectives in HPC Applications

A Survey of Potential MPI Complex Collectives: Large-Scale Mining and Analysis of HPC Applications

arXiv (Cornell University), May 31, 2023

FASDA: An FPGA-Aided, Scalable and Distributed Accelerator for Range-Limited Molecular Dynamics

Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis

Software-Hardware Co-design of Heterogeneous SmartNIC System for Recommendation Models Inference and Training

Proceedings of the 37th International Conference on Supercomputing

FLASH: FPGA-Accelerated Smart Switches with GCN Case Study

Proceedings of the 37th International Conference on Supercomputing

Optimized Mappings for Symmetric Range-Limited Molecular Force Calculations on FPGAs

2022 32nd International Conference on Field-Programmable Logic and Applications (FPL)

A Framework for Neural Network Inference on FPGA-Centric SmartNICs

2022 32nd International Conference on Field-Programmable Logic and Applications (FPL)

Distributed Hardware Accelerated Secure Joint Computation on the COPA Framework

2022 IEEE High Performance Extreme Computing Conference (HPEC)

The Viability of Using Online Prediction to Perform Extra Work while Executing BSP Applications

2022 IEEE High Performance Extreme Computing Conference (HPEC)

FCsN: An FPGA-Centric SmartNIC Framework for Neural Networks

2022 IEEE 30th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)

COPA Use Case: Distributed Secure Joint Computation

2022 IEEE 30th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)

O⁴-DNN: A Hybrid DSP-LUT-Based Processing Unit With Operation Packing and Out-of-Order Execution for Efficient Realization of Convolutional Neural Networks on FPGA Devices

IEEE Transactions on Circuits and Systems I: Regular Papers, 2020

In this paper, we propose O⁴-DNN, a high-performance FPGA-based architecture for convolutional neural network (CNN) accelerators relying on operation packing and out-of-order (OoO) execution for DSP blocks augmented with LUT-based glue logic. The high-level architecture comprises a systolic array of processing elements (PEs) supporting output-stationary dataflow. In this architecture, the computational unit of each PE is realized using a DSP block and a small number of LUTs. Given the limited number of DSP blocks in FPGAs, this combination increases the computational power obtainable from each DSP block. The proposed computational unit performs eight convolutional operations on five input operands, where one is an 8-bit weight and the other four are 8-bit input feature (IF) maps. In addition, to improve the energy efficiency of the proposed computational unit, we present an approximate form of the unit suitable for neural network applications. To reduce memory bandwidth and increase the utilization of the computational units, a data-reuse technique based on weight sharing is also presented. To further improve performance, an addressing approach for computing the partial sums out of order is proposed. The efficacy of the architecture is assessed using two FPGA devices executing four state-of-the-art neural networks. Experimental results show that this architecture leads to, on average (up to), 2.5× (3.44×) higher throughput compared to a baseline structure. In addition, on average (maximum of), 12% (40%) energy-efficiency improvement is achievable by employing O⁴-DNN compared to the baseline structure.
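The operation-packing idea, using one wide DSP multiplier to produce several narrow products in a single pass, can be illustrated in plain Python. This is a hedged sketch of the general technique only: the guard-bit spacing and operand widths below are illustrative choices, not the paper's exact DSP configuration.

```python
# Pack two 8-bit activations into one wide operand so that a single wide
# multiplication (one DSP pass) yields two independent 8x8-bit products.
SHIFT = 18  # each product needs 16 bits; 2 guard bits keep them from overlapping

def packed_mac(w, a_hi, a_lo):
    """Compute w*a_hi and w*a_lo with ONE multiplication."""
    assert all(0 <= v < 256 for v in (w, a_hi, a_lo))
    packed = (a_hi << SHIFT) | a_lo       # pack both activations
    product = w * packed                  # the single wide multiply
    lo = product & ((1 << SHIFT) - 1)     # w * a_lo sits in the low bits
    hi = product >> SHIFT                 # w * a_hi sits in the high bits
    return hi, lo

print(packed_mac(200, 123, 45))  # (24600, 9000)
```

Because `w * a_lo` can never exceed the guard gap, the two partial products land in disjoint bit fields and can be sliced back out exactly; this is what lets LUT-based glue logic multiply the effective throughput of each scarce DSP block.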


Distributed Hardware Accelerated Secure Joint Computation on the COPA Framework

Performance of distributed data-center applications can be improved through the use of FPGA-based SmartNICs, which provide additional functionality and enable higher-bandwidth communication. Until recently, however, the lack of a simple approach for customizing SmartNICs to application requirements has limited the potential benefits. Intel's Configurable Network Protocol Accelerator (COPA) provides a customizable FPGA framework that integrates both hardware and software development to improve computation and communication performance. In this first case study, we demonstrate the capabilities of the COPA framework with an application from cryptography, secure Multi-Party Computation (MPC), that utilizes hardware accelerators connected directly to host memory and the COPA network. We find that using the COPA framework gives significant improvements to both computation and communication compared to traditional implementations of MPC that use CPUs and NICs. A single MPC accelerator running on COPA sustains more than 17 Gbps of communication bandwidth while using only 1% of Stratix 10 resources. We show that the COPA framework enables multiple MPC accelerators running in parallel to fully saturate a 100 Gbps link, delivering higher performance than traditional NICs.
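Secure multi-party computation of the kind accelerated here typically builds on secret sharing. The following is a minimal additive-secret-sharing sketch in Python, purely illustrative: the modulus, share count, and protocol details below are assumptions, not COPA's actual MPC configuration.

```python
import secrets

P = 2**61 - 1  # a Mersenne prime modulus (illustrative choice)

def share(x, n=3):
    """Split secret x into n additive shares that sum to x mod P."""
    shares = [secrets.randbelow(P) for _ in range(n - 1)]
    shares.append((x - sum(shares)) % P)
    return shares

def reconstruct(shares):
    return sum(shares) % P

# Each party holds one share; no single share reveals the secret.
s = share(42)
print(reconstruct(s))  # 42

# Addition of two secrets is purely local: parties add their shares pointwise.
t = share(100)
combined = [(a + b) % P for a, b in zip(s, t)]
print(reconstruct(combined))  # 142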


FPGAs in the Network and Novel Communicator Support Accelerate MPI Collectives

MPI collective operations can often be performance killers in HPC applications; we seek to remove this bottleneck by offloading them to reconfigurable hardware within the switch itself, rather than, e.g., the NIC. We have designed a hardware accelerator, MPI-FPGA, to implement six MPI collectives in the network. Preliminary results show that MPI-FPGA achieves on average a 3.9× speedup over conventional clusters in the most likely scenarios. Essential to this work is providing support for sub-communicator collectives. We introduce a novel mechanism that enables the hardware to support a large number of communicators of arbitrary shape and that is scalable to very large systems. We show how communicator support can be integrated easily into an in-switch hardware accelerator to implement MPI communicators and so enable full offload of MPI collectives. While this mechanism is universally applicable, we implement it in an FPGA cluster; FPGAs provide the ability to couple communication and computation, making them an ideal testbed, and they offer a number of other architectural benefits. MPI-FPGA is fully integrated into MPICH and so is transparently usable by MPI applications.
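The in-switch offload can be pictured as the switch aggregating contributions as they pass through it, then returning the reduced result to every rank. A small Python model of that dataflow for a SUM allreduce follows; this is a sketch of the general idea, not MPI-FPGA's actual pipeline.

```python
def switch_allreduce(contribs):
    """Model an in-switch MPI_Allreduce(SUM): each rank sends a vector,
    the switch reduces element-wise, and every rank receives the result."""
    # Reduction stage: aggregate contributions as they traverse the switch.
    reduced = [sum(col) for col in zip(*contribs)]
    # Broadcast stage: the same reduced vector goes back to all ranks.
    return [list(reduced) for _ in contribs]

# Three ranks contribute partial sums; all three receive the global sums.
out = switch_allreduce([[1, 2], [3, 4], [5, 6]])
print(out[0])  # [9, 12]
```

Doing this reduction in the switch means the data crosses the network once per rank instead of bouncing between hosts, which is the source of the reported speedups.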


Reconfigurable switches for high performance and flexible MPI collectives

Concurrency and Computation: Practice and Experience

There has been much effort in offloading MPI collective operations into hardware. But while NIC-based collective acceleration is well studied, offloading their processing into the switching fabric, despite numerous advantages, has been much more limited. A major problem with fixed-logic implementations is that either only a fraction of the possible collective communication is accelerated or logic is wasted in applications that do not need a particular capability. Using reconfigurable logic has numerous advantages: exactly the required operations can be implemented; the level of desired performance can be specified; and new, possibly complex, operations can be defined and implemented. We have designed an in-switch collective accelerator, MPI-FPGA, and demonstrated its use with seven MPI collectives over a set of benchmarks and proxy applications (MiniApps). The accelerator uses a novel two-level switch design containing fully pipelined, vectorized aggregation logic units. Essential to this work is support for sub-communicator collectives, which enables communicators of arbitrary shape and scales to large systems. A streaming interface improves performance for long messages. While this reconfigurable design is generally applicable, we prototype it with an FPGA-centric cluster. A sample MPI-FPGA design in a direct network achieves considerable speedups over conventional clusters in the most likely scenarios. We also present results for indirect networks with reconfigurable high-radix switches and show that this approach is competitive with SHArP technology for the subset of operations that SHArP supports. MPI-FPGA is fully integrated into MPICH and is transparent to MPI applications.
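One simple way hardware can track many sub-communicators of arbitrary shape is a per-communicator membership bitmask over ranks. The Python sketch below illustrates that concept only; it is a hypothetical table, not the paper's actual in-switch mechanism.

```python
class CommunicatorTable:
    """Map communicator IDs to rank-membership bitmasks, as an
    in-switch lookup table might (illustrative model)."""
    def __init__(self):
        self.masks = {}

    def register(self, comm_id, ranks):
        mask = 0
        for r in ranks:
            mask |= 1 << r          # one bit per participating rank
        self.masks[comm_id] = mask

    def is_member(self, comm_id, rank):
        return bool(self.masks[comm_id] >> rank & 1)

    def size(self, comm_id):
        return bin(self.masks[comm_id]).count("1")

table = CommunicatorTable()
table.register(comm_id=7, ranks=[0, 2, 5])   # an arbitrarily shaped subgroup
print(table.is_member(7, 2), table.is_member(7, 3))  # True False
print(table.size(7))  # 3
```

A bitmask makes membership checks and participant counts constant-time in hardware, which is why shape-agnostic representations like this scale well to large systems.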


Accelerating MPI Collectives with FPGAs in the Network and Novel Communicator Support

2020 IEEE 28th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)

MPI collective operations can often be performance killers in HPC applications; we seek to remove this bottleneck by offloading them to reconfigurable hardware within the switch itself, rather than, e.g., the NIC. We have designed a hardware accelerator, MPI-FPGA, to implement six MPI collectives in the network. Preliminary results show that MPI-FPGA achieves a 10× speedup over conventional clusters in the most likely scenarios. We introduce a novel mechanism that enables the hardware to support a large number of communicators of arbitrary shape and that is scalable to very large systems. MPI-FPGA is fully integrated into MPICH and so is transparent to MPI applications.


FP-AMG: FPGA-Based Acceleration Framework for Algebraic Multigrid Solvers

2020 IEEE 28th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)

