Pouya Haghi | Boston University
Papers by Pouya Haghi
IEEE Transactions on Computers, 2024
arXiv preprint, May 31, 2023
Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
Proceedings of the 37th International Conference on Supercomputing
Proceedings of the 37th International Conference on Supercomputing
2022 32nd International Conference on Field-Programmable Logic and Applications (FPL)
2022 32nd International Conference on Field-Programmable Logic and Applications (FPL)
2022 IEEE High Performance Extreme Computing Conference (HPEC)
2022 IEEE High Performance Extreme Computing Conference (HPEC)
2022 IEEE 30th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)
2022 IEEE 30th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)
IEEE Transactions on Circuits and Systems I: Regular Papers, 2020
In this paper, we propose O⁴-DNN, a high-performance FPGA-based architecture for convolutional neural network (CNN) accelerators relying on operation packing and out-of-order (OoO) execution for DSP blocks augmented with LUT-based glue logic. The high-level architecture comprises a systolic array of processing elements (PEs) supporting output-stationary dataflow. In this architecture, the computational unit of each PE is realized using a DSP block and a small number of LUTs. Given the limited number of DSP blocks in FPGAs, this combination extracts more computational power from each DSP block. The proposed computational unit performs eight convolutional operations on five input operands, where one is an 8-bit weight and the other four are 8-bit input feature (IF) map values. In addition, to improve the energy efficiency of the proposed computational unit, we present an approximate form of the unit suitable for neural network applications. To reduce memory bandwidth and increase the utilization of the computational units, a data-reuse technique based on weight sharing is also presented. To further improve the performance of the computational unit, an addressing approach for computing the partial sums out of order is proposed. The efficacy of the architecture is assessed using two FPGA devices executing four state-of-the-art neural networks. Experimental results show that this architecture leads to, on average (up to), 2.5× (3.44×) higher throughput compared to a baseline structure. In addition, on average (maximum of), 12% (40%) energy efficiency improvement is achievable by employing O⁴-DNN compared to the baseline structure.
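The operation-packing idea behind this abstract can be illustrated in software. A minimal sketch (the function name, field width, and unsigned-operand restriction are assumptions for illustration, not the paper's exact scheme): two 8-bit multiplications that share one weight are packed into a single wide multiply, the way such schemes squeeze extra products out of one DSP-block multiplier.

```python
# Sketch (assumption): pack two 8-bit multiplications sharing one weight
# into a single wide multiply, mimicking DSP-block operation packing.

def packed_mul(a, b, w, guard_bits=18):
    """Compute (a*w, b*w) with one wide multiplication.

    a, b: unsigned 8-bit feature-map values; w: unsigned 8-bit weight.
    guard_bits leaves enough zero padding that the two products
    (each at most 255*255 < 2**16) cannot overlap in the wide result.
    """
    packed = (a << guard_bits) | b            # both operands in one word
    wide = packed * w                         # single wide multiply (DSP block on FPGA)
    prod_b = wide & ((1 << guard_bits) - 1)   # low field: b*w
    prod_a = wide >> guard_bits               # high field: a*w
    return prod_a, prod_b
```

The same trick generalizes to more packed operands as long as each product field stays within its guard-band, which is why LUT-based glue logic is needed once the packing gets denser.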
Performance of distributed data center applications can be improved through use of FPGA-based SmartNICs, which provide additional functionality and enable higher-bandwidth communication. Until recently, however, the lack of a simple approach for customizing SmartNICs to application requirements has limited the potential benefits. Intel's Configurable Network Protocol Accelerator (COPA) provides a customizable FPGA framework that integrates both hardware and software development to improve computation and communication performance. In this first case study, we demonstrate the capabilities of the COPA framework with an application from cryptography -- secure Multi-Party Computation (MPC) -- that utilizes hardware accelerators connected directly to host memory and the COPA network. We find that using the COPA framework gives significant improvements to both computation and communication as compared to traditional implementations of MPC that use CPUs and NICs. A single MPC accelerator running on COPA sustains more than 17 Gbps of communication bandwidth while using only 1% of Stratix 10 resources. We show that the COPA framework enables multiple MPC accelerators running in parallel to fully saturate a 100 Gbps link, delivering higher performance than traditional NICs.
MPI collective operations can often be performance killers in HPC applications; we seek to solve this bottleneck by offloading them to reconfigurable hardware within the switch itself, rather than, e.g., the NIC. We have designed a hardware accelerator, MPI-FPGA, to implement six MPI collectives in the network. Preliminary results show that MPI-FPGA achieves on average a 3.9× speedup over conventional clusters in the most likely scenarios. Essential to this work is providing support for sub-communicator collectives. We introduce a novel mechanism that enables the hardware to support a large number of communicators of arbitrary shape, and that is scalable to very large systems. We show how communicator support can be integrated easily into an in-switch hardware accelerator to implement MPI communicators and so enable full offload of MPI collectives. While this mechanism is universally applicable, we implement it in an FPGA cluster; FPGAs provide the ability to couple communication and computation and so are an ideal testbed, and they have a number of other architectural benefits. MPI-FPGA is fully integrated into MPICH and so is transparently usable by MPI applications.
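The in-switch offload described in this abstract can be modeled in a few lines. A hedged sketch (function and variable names are illustrative, not MPI-FPGA's actual interface): the switch reduces one contribution per member endpoint and broadcasts the result back, which matches the semantics of MPI_Allreduce with a sum operator.

```python
# Minimal software model (assumption: names are illustrative) of in-switch
# collective aggregation: the switch sums one vector contribution per
# endpoint, then returns the reduced vector to every endpoint.

def switch_allreduce(contributions):
    """contributions: list of equal-length numeric vectors, one per endpoint.

    Returns the same reduced (element-wise sum) vector for each endpoint,
    mirroring MPI_Allreduce(MPI_SUM) semantics.
    """
    n = len(contributions[0])
    reduced = [sum(vec[i] for vec in contributions) for i in range(n)]
    return [list(reduced) for _ in contributions]   # broadcast back to all
```

Doing this reduction inside the switch halves the traversals of the network compared with a host-side reduce-then-broadcast, which is where the speedup over conventional clusters comes from.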
Concurrency and Computation: Practice and Experience
There has been much effort in offloading MPI collective operations into hardware. But while NIC-based collective acceleration is well studied, offloading their processing into the switching fabric, despite numerous advantages, has been much more limited. A major problem with fixed-logic implementations is that either only a fraction of the possible collective communication is accelerated or logic is wasted in applications that do not need a particular capability. Using reconfigurable logic has numerous advantages: exactly the required operations can be implemented; the level of desired performance can be specified; and new, possibly complex, operations can be defined and implemented. We have designed an in-switch collective accelerator, MPI-FPGA, and demonstrated its use with seven MPI collectives over a set of benchmarks and proxy applications (MiniApps). The accelerator uses a novel two-level switch design containing fully pipelined vectorized aggregation logic units. Essential to this work is providing support for sub-communicator collectives that enables communicators of arbitrary shape, and that is scalable to large systems. A streaming interface improves the performance for long messages. While this reconfigurable design is generally applicable, we prototype it with an FPGA-centric cluster. A sample MPI-FPGA design in a direct network achieves considerable speedups over conventional clusters in the most likely scenarios. We also present results for indirect networks with reconfigurable high-radix switches and show that this approach is competitive with SHArP technology for the subset of operations that SHArP supports. MPI-FPGA is fully integrated into MPICH and is transparent to MPI applications.
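The sub-communicator support this abstract emphasizes amounts to scalable membership bookkeeping in the switch. A sketch of that kind of bookkeeping (the class name, encoding, and completion rule are assumptions for illustration, not the paper's actual mechanism): each communicator maps to a set of member ports, and a reduction completes once every member port has contributed.

```python
# Sketch (assumption: illustrative design, not MPI-FPGA's mechanism) of
# per-communicator state in an in-switch collective accelerator.

class CommTable:
    def __init__(self):
        self.members = {}                     # comm_id -> frozenset of member ports

    def register(self, comm_id, ports):
        """Install a communicator of arbitrary shape (any subset of ports)."""
        self.members[comm_id] = frozenset(ports)

    def start(self, comm_id):
        """Begin one sum-reduction on this communicator."""
        return {"pending": set(self.members[comm_id]), "acc": 0}

    def contribute(self, state, port, value):
        """Fold in one port's contribution; True once all members arrived."""
        state["pending"].discard(port)
        state["acc"] += value
        return not state["pending"]
```

Because each in-flight reduction only needs a member set and an accumulator, many communicators of arbitrary shape can coexist, which is the scalability property the paper argues fixed-logic designs lack.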
2020 IEEE 28th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)
MPI collective operations can often be performance killers in HPC applications; we seek to solve this bottleneck by offloading them to reconfigurable hardware within the switch itself, rather than, e.g., the NIC. We have designed a hardware accelerator, MPI-FPGA, to implement six MPI collectives in the network. Preliminary results show that MPI-FPGA achieves a 10× speedup over conventional clusters in the most likely scenarios. We introduce a novel mechanism that enables the hardware to support a large number of communicators of arbitrary shape, and that is scalable to very large systems. MPI-FPGA is fully integrated into MPICH and so is transparent to MPI applications.
2020 IEEE 28th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)
There has been much effort in offloading MPI collective operations into hardware. But while NIC-based collective acceleration is well studied, offloading their processing into the switching fabric, despite numerous advantages, has been much more limited. A major problem with fixed-logic implementations is that either only a fraction of the possible collective communication is accelerated or logic is wasted in applications that do not need a particular capability. Using reconfigurable logic has numerous advantages: exactly the required operations can be implemented; the level of desired performance can be specified; and new, possibly complex, operations can be defined and implemented. We have designed an in-switch collective accelerator, MPI-FPGA, and demonstrated its use with seven MPI collectives over a set of benchmarks and proxy applications (MiniApps). The accelerator uses a novel two-level switch design containing fully pipelined vectorized aggregation logic units. Essential to this work is providing support for sub-communicator collectives that enables communicators of arbitrary shape, and that is scalable to large systems. A streaming interface improves the performance for long messages. While this reconfigurable design is generally applicable, we prototype it with an FPGA-centric cluster. A sample MPI-FPGA design in a direct network achieves considerable speedups over conventional clusters in the most likely scenarios. We also present results for indirect networks with reconfigurable high-radix switches and show that this approach is competitive with SHArP technology for the subset of operations that SHArP supports. MPI-FPGA is fully integrated into MPICH and is transparent to MPI applications.