Gourav Modi - Academia.edu
Papers by Gourav Modi
2016 26th International Conference on Field Programmable Logic and Applications (FPL), 2016
We can exploit application-specific sparse structure and distribution of non-zero coefficients in Discrete Wavelet Transform (DWT) matrices to significantly improve the performance of 1-D DWT mapped to FPGA-based soft vector processors. We reformulate DWT computations specifically in terms of sparse matrix operations, where the transformation matrices have a repeating block with a fixed non-zero pattern, which we refer to as a skeleton. We exploit this property to transform the original DWT matrix into a Modified-Matrix-Form to expose abundant soft vector parallelism in the dot products. The resulting form can also be readily compiled into low-level DMA routines for boosting memory throughput. We autogenerate vector routines and memory access sequences tailored for parametric combinations of DWT filter sizes and decomposition levels, as required by the application domain. When compared to embedded ARMv7 32b CPU implementations using optimized OpenBLAS routines, soft vector implementations on the Xilinx Zedboard and Altera DE2/DE4 platforms demonstrate speedups of 12-103x.
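The repeating-block idea in the abstract above can be illustrated with a minimal Python sketch. This is not the authors' Modified-Matrix-Form or their generated vector routines; it only shows, under stated assumptions, why a fixed non-zero pattern lets a dense matrix-vector product be replaced by a few strided operations. The Haar filter pair, the periodic boundary handling, and the function names `dwt_matrix`, `dwt_dense`, and `dwt_skeleton` are all assumptions chosen for illustration.

```python
import math

def dwt_matrix(n, lo, hi):
    """Dense one-level DWT matrix (assumed layout): each filter's
    coefficients repeat on every row, shifted right by 2 columns."""
    rows = []
    for filt in (lo, hi):
        for k in range(n // 2):
            row = [0.0] * n
            for j, c in enumerate(filt):
                row[(2 * k + j) % n] += c  # periodic wrap at the edge (assumption)
            rows.append(row)
    return rows

def dwt_dense(x, lo, hi):
    """Reference: full matrix-vector product, mostly multiplying by zeros."""
    return [sum(a * b for a, b in zip(row, x))
            for row in dwt_matrix(len(x), lo, hi)]

def dwt_skeleton(x, lo, hi):
    """Same result using only the non-zero pattern: one scaled,
    stride-2 pass over x per filter coefficient."""
    n = len(x)
    out = []
    for filt in (lo, hi):
        acc = [0.0] * (n // 2)
        for j, c in enumerate(filt):
            # On a vector processor this inner loop would be a single
            # vector multiply-accumulate over a strided slice of x.
            for k in range(n // 2):
                acc[k] += c * x[(2 * k + j) % n]
        out.extend(acc)
    return out

s = 1 / math.sqrt(2)
HAAR_LO, HAAR_HI = [s, s], [s, -s]  # Haar filters, chosen for illustration
x = [4.0, 2.0, 6.0, 8.0, 1.0, 3.0, 5.0, 7.0]
dense = dwt_dense(x, HAAR_LO, HAAR_HI)
fast = dwt_skeleton(x, HAAR_LO, HAAR_HI)
```

Both functions produce the approximation coefficients followed by the detail coefficients. The skeleton version performs work proportional to the number of filter taps per output rather than to the full row length, which is the property a vectorized implementation would build on.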
We design a 120-core 94 MHz MIPS processor FPGA overlay interconnected with a lightweight message-passing fabric that fits on a Stratix V GX FPGA (5SGXEA7N2F45C2). We use silicon-tested RTL source code for the microAptiv MIPS processor made available under the Imagination Technologies Academic Program. We augment the processor with suitable custom instruction extensions for moving data between the cores via explicit message passing. We support these instructions with a communication scratchpad that is optimized for high-throughput injection of network traffic. We also demonstrate an end-to-end proof-of-concept flow that compiles C code for message-passing workloads using MIPS UDI (user-defined instruction) support, and we stress-test it with synthetic workloads.
Thesis by Gourav Modi
FPGA-based token dataflow graph (DFG) architectures are an increasingly important design choice for accelerating many hard computational problems where parallelism is sparse and irregular. Prior work on raw DFGs applies a series of optimizations, such as substitution, re-association, and fanout decomposition, in a software compiler. For optimized DFG architectures, a collection of nodes is mapped to a processing element (PE) in custom hardware, which communicates with the other nodes using a Network on Chip (NoC). In this thesis, we aim to reduce memory access time by designing a multi-pumped, multi-ported simple dual-port RAM that operates at twice the system clock speed, thereby enabling 2 writes and 2 reads in one system clock cycle (≈ 250 MHz). Furthermore, we propose tile-based physical design techniques for partitioning and floorplanning to boost the performance and user programmability of the overlay design on Arria 10 FPGAs. Our flow partitions the overlay design into tiles and then fits the design into a rectangular coarse-grain floorplan to maximize the achievable Fmax. This floorplanning strategy improves achievable Fmax over naive floorplanning by 1.3-1.6x for overlay design sizes of 1x1 to 6x40. It also gives us spatial information about the memory blocks, thereby improving the packet/data transmission time across processors in the network and assisting timing closure. This automated flow is also extended to the floorplanning of the MIPS processor [27] on Stratix V FPGAs for overlay sizes of 1x1 to 2x30. As part of this thesis, we have also developed algorithms for the 1D and 2D Discrete Wavelet Transform (DWT) on parallel processors and compared their performance across different embedded boards.
When compared to embedded ARMv7 32b CPU implementations using optimized OpenBLAS routines, soft vector implementations on the Xilinx Zedboard and Altera DE2/DE4 platforms demonstrate speedups of 490x for various problem sizes. We have exploited application-specific sparse structure and distribution of non-zero coefficients in Discrete Wavelet Transform (DWT) matrices to significantly improve the efficiency and performance of 1-D and 2-D transforms implemented on FPGA-based custom soft vector processors.