Hadi Esmaeilzadeh | Georgia Institute of Technology

Papers by Hadi Esmaeilzadeh

Research paper thumbnail of Error correction for approximate computing

2016 54th Annual Allerton Conference on Communication, Control, and Computing (Allerton), 2016

Approximate computing, which sacrifices accuracy during computation, is a promising technology for saving energy. However, a large number of computation errors may violate the accuracy requirements of certain applications and should be corrected. Consider a Graphics Processing Unit (GPU) with multiple Streaming Multiprocessors (SMs), where some of these SMs perform accurate computation while the others perform approximate computation. Provided the approximate outputs are correlated with the accurate outputs, we exploit this relation and model the approximate computation process as a communication process. The problem of error correction then transforms into a problem of decoding, which we solve with a suitable error correction code. Unlike the classical communication process, approximate computing places additional constraints on the code design. In this paper, we propose a semi-regular LDPC code satisfying these constraints and prove that this code can be perfectly decoded....
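
The mapping from error correction to decoding can be illustrated with a far simpler code than the paper's semi-regular LDPC construction. The sketch below is our own toy example, assuming single-bit errors and a Hamming(7,4) parity-check matrix: each approximate output bit is treated as a noisy observation of the accurate value, and the syndrome pinpoints which bit to flip.

```python
import numpy as np

# Parity-check matrix H of the Hamming(7,4) code: column i (1-based) is the
# binary representation of i, so the syndrome directly names the flipped bit.
H = np.array([[1, 0, 1, 0, 1, 0, 1],
              [0, 1, 1, 0, 0, 1, 1],
              [0, 0, 0, 1, 1, 1, 1]])

def correct(received):
    """Syndrome-decode: flip the single bit the syndrome points at, if any."""
    r = np.array(received) % 2
    syndrome = H @ r % 2
    pos = int("".join(map(str, syndrome[::-1])), 2)  # 1-based bit position
    if pos:
        r[pos - 1] ^= 1
    return r

codeword = np.array([1, 0, 1, 1, 0, 1, 0])
assert np.all(H @ codeword % 2 == 0)      # a valid codeword has zero syndrome
noisy = codeword.copy()
noisy[4] ^= 1                             # one "approximation error"
assert np.array_equal(correct(noisy), codeword)
```

The paper's real setting adds constraints a textbook Hamming code does not satisfy, which is why a purpose-built semi-regular LDPC code is needed; this sketch only shows the decoding framing itself.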

Research paper thumbnail of The impact of 3D stacking on GPU-accelerated deep neural networks: An experimental study

In this work, we present a two-tier air-cooled thermal testbed composed of an NVIDIA Tesla K40 GPU and a heater/thermometer top die. The top die has four independently controllable heaters, which can emulate a wide range of components, ranging from low-power memory to high-performance multi-core processor cores. The performance and temperature of the bottom-tier GPU on several deep neural network workloads are investigated as a function of increasing top-die power dissipation, and the implications for 3DIC cooling are discussed.

Research paper thumbnail of Reinforcement Learning and Adaptive Sampling for Optimized DNN Compilation

Achieving faster execution with shorter compilation time can enable further diversity and innovation in neural networks. However, the current paradigm of executing neural networks relies on hand-optimized libraries, traditional compilation heuristics, or, very recently, simulated annealing and genetic algorithms. Our work takes a unique approach by formulating compiler optimizations for neural networks as a reinforcement learning problem, whose solution takes fewer steps to converge. This solution, dubbed ReLeASE, comes with a sampling algorithm that leverages clustering to focus the costly samples (real hardware measurements) on representative points, each subsuming an entire subspace. Our adaptive sampling not only reduces the number of samples but also improves their quality for better exploration in a shorter time. As such, experimentation with real hardware shows that reinforcement learning with adaptive sampling provides a 4.45× speedup in optimization time over AutoTV...
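
The clustering-based sampling idea can be sketched in a few lines. This is our own simplification, not ReLeASE's algorithm: candidate compiler configurations (names like tile/unroll/vector width are our invention) are clustered, and only one representative per cluster is sent for a costly hardware measurement.

```python
import numpy as np

rng = np.random.default_rng(0)
# 200 candidate configurations, e.g. (tile size, unroll factor, vector width).
configs = rng.integers(1, 64, size=(200, 3)).astype(float)

def kmeans(points, k, iters=20):
    """Minimal k-means; returns cluster centers and per-point labels."""
    centers = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((points[:, None] - centers) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return centers, labels

centers, labels = kmeans(configs, k=8)
# Measure only the config nearest each centroid: 8 hardware runs, not 200.
reps = [int(np.argmin(((configs - c) ** 2).sum(-1))) for c in centers]
assert len(reps) == 8 and all(0 <= r < len(configs) for r in reps)
```

Each measured representative then stands in for its whole cluster, which is what lets the reinforcement learner explore the space with far fewer real measurements.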

Research paper thumbnail of From Tensors to FPGAs: Accelerating Deep Learning

Deep Neural Networks (DNNs) are compute-intensive learning models with growing applicability in a wide range of domains, such as vision, robotics, video analytics, speech recognition, natural language processing, targeted advertising, and web search. With diminishing benefits from technology scaling, the research community is increasingly turning to specialized accelerators for DNNs. Even though ASICs provide significant gains in performance and efficiency for DNNs, they may not cope with ever-evolving DNN models. Furthermore, ASICs and customized cores come at the price of high non-recurring engineering costs over long design periods. FPGAs are an attractive choice for DNNs since they represent an intermediate point between the efficiency of ASICs and the programmability of general-purpose processors, and are becoming available across different market segments. However, obtaining both performance and energy efficiency with FPGAs is a laborious task even for expert hardware desi...

Research paper thumbnail of ExpAX: A Framework for Automating Approximate Programming

We present ExpAX, a framework for automating approximate programming. ExpAX consists of three components: (1) a programming model based on a new kind of program specification, which we refer to as error expectations; our programming model enables programmers to implicitly relax accuracy constraints without explicitly marking operations as approximate; (2) an approximation safety analysis that automatically infers a safe-to-approximate set of program operations; and (3) an optimization that automatically marks a subset of the safe-to-approximate operations as approximate while statistically adhering to the error expectations. We evaluate ExpAX on a diverse set of Java applications. The results show that ExpAX provides significant energy savings (up to 35%) with a large reduction in programmer effort (between 3× and 113×) while providing formal safety and statistical quality-of-result guarantees.
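
One operational reading of an error expectation (our own sketch, not ExpAX's actual machinery) is a one-sided statistical test: sample the relative error of approximate runs and accept the configuration only if a Hoeffding confidence bound places the true expected error under the specified budget.

```python
import math
import random

random.seed(1)
# Sampled relative errors of approximate runs, assumed bounded in [0, 1].
errors = [random.uniform(0.0, 0.06) for _ in range(10000)]

def meets_expectation(errors, bound, confidence=0.95):
    """True iff the expected error provably stays under `bound`
    with the given confidence (Hoeffding bound for [0, 1] variables)."""
    n = len(errors)
    mean = sum(errors) / n
    # With prob >= confidence, true mean <= sample mean + slack.
    slack = math.sqrt(math.log(1.0 / (1.0 - confidence)) / (2.0 * n))
    return mean + slack <= bound

assert meets_expectation(errors, bound=0.05)      # budget of 5% is certified
assert not meets_expectation(errors, bound=0.02)  # 2% budget cannot be
```

The same check naturally gates the optimizer: an operation stays marked approximate only while the accumulated evidence keeps the expectation certified.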

Research paper thumbnail of A Principled Approach to Learning Stochastic Representations for Privacy in Deep Neural Inference

INFerence-as-a-Service (INFaaS) in the cloud has enabled the prevalent use of Deep Neural Networks (DNNs) in home automation, targeted advertising, machine vision, etc. The cloud receives the inference request as a raw input containing a rich set of private information that can be misused or leaked, possibly inadvertently. This prevalent setting can compromise the privacy of users during the inference phase. This paper sets out to provide a principled approach, dubbed Cloak, that finds optimal stochastic perturbations to obfuscate private data before it is sent to the cloud. To this end, Cloak reduces the information content of the transmitted data while conserving the essential pieces that enable the request to be serviced accurately. The key idea is formulating the discovery of this stochasticity as an offline gradient-based optimization problem that reformulates a pre-trained DNN (with optimized known weights) as an analytical function of the stochastic perturbations. Using...
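
A drastically simplified version of this optimization (our own toy, not Cloak's method) uses a frozen linear "model" with weights w and learns per-feature noise scales s by gradient descent: the objective trades expected output distortion E[(wᵀ(s∘z))²], z ~ N(0, I), against the total noise budget Σs.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=8)        # frozen, pre-trained "model" weights
s = np.full(8, 0.1)           # learnable per-feature noise scales
lam, lr = 0.05, 0.01
for _ in range(2000):
    # d/ds_i of E[(w . (s*z))^2] - lam * sum(s)  =  2 w_i^2 s_i - lam
    grad = 2 * (w ** 2) * s - lam
    s = np.maximum(s - lr * grad, 0.0)   # scales stay nonnegative

# Features the model barely uses (small |w_i|) end up with the most noise,
# which is the obfuscation intuition: blur what inference does not need.
order = np.argsort(np.abs(w))
assert s[order[0]] > s[order[-1]]
```

The closed-form fixed point here is s_i = lam / (2 w_i²), making the inverse relationship between a feature's importance and its noise budget explicit; Cloak solves the analogous problem through a full DNN rather than a linear map.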

Research paper thumbnail of FastStereoNet: A Fast Neural Architecture Search for Improving the Inference of Disparity Estimation on Resource-Limited Platforms

IEEE Transactions on Systems, Man, and Cybernetics: Systems

Research paper thumbnail of In-DRAM near-data approximate acceleration for GPUs

Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques

Research paper thumbnail of Not All Features Are Equal: Discovering Essential Features for Preserving Prediction Privacy

Proceedings of the Web Conference 2021

Research paper thumbnail of FlexiGAN: An End-to-End Solution for FPGA Acceleration of Generative Adversarial Networks

2018 IEEE 26th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)

Generative Adversarial Networks (GANs) are among the frontiers of deep networks. GANs consist of two models, a generative model and a discriminative model. While the discriminative model uses the conventional convolution operator, the generative model is fundamentally different due to its use of the transposed convolution operator. Unlike the conventional convolution, the transposed convolution initially inserts a large number of zeros in its input. This zero-insertion leads to a large number of inconsequential operations and creates different patterns of computation across the sliding windows. The inconsequential operations, along with the variation in computation patterns, lead to significant resource underutilization when evaluated using conventional convolution hardware. This paper introduces FlexiGAN, an end-to-end solution from high-level GAN specification to an optimized synthesizable FPGA accelerator. The FlexiGAN framework is coupled with a novel architecture that aims to harness the benefits of both MIMD and SIMD execution models. The proposed architecture separates data retrieval and data processing units at the finest granularity of each compute engine. Leveraging this separation between data retrieval and data processing units in the compute engines, we introduce a succinct set of operations that enables us to significantly reduce on-chip memory usage, which is generally scarce in FPGAs. We evaluate our end-to-end solution across various GANs from the machine learning literature. FlexiGAN provides 2.4× higher performance than an optimized conventional convolution design. In addition, FlexiGAN, on average, yields 2.8× (up to 3.7×) improvement in Performance-per-Watt over a high-end GPU. These results indicate that FlexiGAN is an effective initial step towards providing an end-to-end solution for accelerating GANs.
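
The zero-insertion described above is easy to see in one dimension (our own toy example, not FlexiGAN's hardware): a stride-2 transposed convolution is equivalent to interleaving zeros into the input and then running an ordinary convolution, so a large share of the multiply-accumulates touch zeros and do no useful work.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])   # input feature map (1-D)
k = np.array([1.0, 0.5, 0.25])       # convolution kernel

# Stride-2 transposed conv step 1: insert a zero between input elements.
up = np.zeros(2 * len(x) - 1)
up[::2] = x                           # [1, 0, 2, 0, 3, 0, 4]

# Step 2: run an ordinary convolution over the zero-inserted signal.
pad = np.pad(up, len(k) - 1)
out = np.array([pad[i:i + len(k)] @ k[::-1]
                for i in range(len(pad) - len(k) + 1)])

# Count multiply-accumulates whose input operand is zero (inserted or pad).
zero_macs = sum(int(np.sum(pad[i:i + len(k)] == 0)) for i in range(len(out)))
total_macs = len(out) * len(k)
assert zero_macs / total_macs > 0.5   # most MACs are inconsequential here
```

In this tiny example 15 of 27 MACs hit a zero operand; in real 2-D generators the waste compounds across both spatial dimensions, which is the underutilization FlexiGAN's architecture is designed to avoid.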

Research paper thumbnail of Shredder

Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems

Research paper thumbnail of ReLeQ : A Reinforcement Learning Approach for Automatic Deep Quantization of Neural Networks

Research paper thumbnail of SiMul: An Algorithm-Driven Approximate Multiplier Design for Machine Learning

Research paper thumbnail of In-RDBMS hardware acceleration of advanced analytics

Proceedings of the VLDB Endowment

The data revolution is fueled by advances in machine learning, databases, and hardware design. Programmable accelerators are making their way into each of these areas independently. As such, there is a void of solutions that enable hardware acceleration at the intersection of these disjoint fields. This paper sets out to be the initial step towards a unifying solution for in-Database Acceleration of Advanced Analytics (DAnA). Deploying specialized hardware, such as FPGAs, for in-database analytics currently requires hand-designing the hardware and manually routing the data. Instead, DAnA automatically maps a high-level specification of advanced analytics queries to an FPGA accelerator. The accelerator implementation is generated for a User-Defined Function (UDF), expressed as part of a SQL query using a Python-embedded Domain-Specific Language (DSL). To realize an efficient in-database integration, DAnA accelerators contain a novel hardware structure, Striders, that directl...
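
To make the UDF idea concrete, here is a hypothetical illustration, not DAnA's actual DSL or syntax: the kind of analytics routine (a logistic-regression training step over in-database tuples) a user might express in Python and attach to a SQL query such as `SELECT * FROM train_logreg('patients', 'label');` (invented syntax).

```python
import numpy as np

def logreg_grad(w, X, y):
    """Gradient of the logistic loss over one batch of tuples."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    return X.T @ (p - y) / len(y)

# Stand-in for rows fetched from a table: 64 tuples with 3 features each.
rng = np.random.default_rng(0)
X = rng.normal(size=(64, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = (X @ w_true > 0).astype(float)          # labels from a ground-truth rule

w = np.zeros(3)
for _ in range(300):
    w -= 0.5 * (logreg_grad(w, X, y) + 0.01 * w)  # small ridge for stability

preds = (X @ w > 0).astype(float)
assert (preds == y).mean() > 0.9            # the UDF fits the training tuples
```

DAnA's contribution is that a function like `logreg_grad` would be compiled to an FPGA accelerator and fed directly by the database engine, rather than executed row-by-row in the Python runtime.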

Research paper thumbnail of Machine Learning Acceleration

Research paper thumbnail of AxGames

Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems - ASPLOS '16, 2016

Research paper thumbnail of Towards Statistical Guarantees in Controlling Quality Tradeoffs for Approximate Acceleration

2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), 2016

Conventionally, an approximate accelerator replaces every invocation of a frequently executed region of code without considering the final quality degradation. However, there is a vast decision space in which each invocation can either be delegated to the accelerator---improving performance and efficiency---or run on the precise core---maintaining quality. In this paper we introduce Mithra, a co-designed hardware-software solution that navigates these tradeoffs to deliver high performance and efficiency while lowering the final quality loss. Mithra seeks to identify whether each individual accelerator invocation will lead to an undesirable quality loss and, if so, directs the processor to run the original precise code. This identification is cast as a binary classification task that requires a cohesive co-design of hardware and software. The hardware component performs the classification at runtime and exposes a knob to the software mechanism to control quality tradeoffs. The software tunes this knob by solving a statistical optimization problem that maximizes the benefits from approximation while providing statistical guarantees that the final quality level will be met with high confidence. The software uses this knob to tune and train the hardware classifiers. We devise two distinct hardware classifiers, one table-based and one neural-network-based. To understand the efficacy of these mechanisms, we compare them with an ideal but infeasible design, the oracle. Results show that, with 95% confidence, the table-based design can restrict the final output quality loss to 5% for 90% of unseen input sets while providing 2.5× speedup and 2.6× energy efficiency. The neural design shows similar speedup; however, it improves the efficiency by 13%. Compared to the table-based design, the oracle improves speedup by 26% and efficiency by 36%.
These results show that Mithra performs within a close range of the oracle and can effectively navigate the quality tradeoffs in approximate acceleration.
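
The "95% confidence that 90% of input sets stay under 5% loss" style of guarantee can be sketched with an exact binomial test. This is our own toy of the software-side check, not Mithra's actual tuning algorithm: a knob setting is accepted only if its observed pass rate on validation input sets certifies the target rate at the required confidence.

```python
import math

def binom_tail(n, k, p):
    """P[X >= k] for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(k, n + 1))

def certified(n_sets, n_passing, target=0.90, alpha=0.05):
    """One-sided test: certify true pass rate >= target at confidence
    1 - alpha, i.e. seeing this many passes is improbable (<= alpha)
    even if the true rate were exactly `target`."""
    return binom_tail(n_sets, n_passing, target) <= alpha

assert certified(100, 97)        # 97/100 sets pass: certify >=90% at 95% conf.
assert not certified(100, 85)    # 85/100 pass: cannot certify the target
```

The software would sweep the classifier knob and keep the most aggressive setting that still passes `certified`, trading speedup against the statistical quality guarantee.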

Research paper thumbnail of Scalable Neural-Network Stream Processor

New computation technology demands high performance and scalability on the hardware side. Traditional shared-bus architectures cannot fulfill the demand for higher performance because of their shared nature. Alternative bus architectures, such as multi-layer buses, do not satisfy the scalability needs of such systems. The Network-on-Chip (NoC) structure is a new paradigm that addresses these two requirements. The NnSP architecture originally has

Research paper thumbnail of GRATER: An Approximation Workflow for Exploiting Data-Level Parallelism in FPGA Acceleration

Proceedings of the 2016 Design, Automation & Test in Europe Conference & Exhibition (DATE), 2016

Research paper thumbnail of Axilog: Language support for approximate hardware design

2015 Design, Automation & Test in Europe Conference & Exhibition (DATE), Mar 9, 2015
