Hadi Esmaeilzadeh - Profile on Academia.edu
Papers by Hadi Esmaeilzadeh
Error correction for approximate computing
2016 54th Annual Allerton Conference on Communication, Control, and Computing (Allerton), 2016
Approximate computing, which sacrifices accuracy during computation, is a promising technology for saving energy. However, a large number of computation errors may violate the accuracy requirements of certain applications and should be corrected. Consider a Graphics Processing Unit (GPU) with multiple Streaming Multiprocessors (SMs), where some of the SMs perform accurate computation while the others perform approximate computation. Provided the approximate outputs are correlated with the accurate outputs, we exploit this relation and model the approximate computation process as a communication process. The problem of error correction then becomes a decoding problem that can be solved with a suitable error-correcting code. Unlike the classical communication process, approximate computing places additional constraints on the code design. In this paper, we propose a semi-regular LDPC code satisfying these constraints and prove that this code can be perfectly decoded....
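As a rough illustration of the decoding view described above, the sketch below models approximate SM outputs as a valid codeword corrupted by bit errors and corrects them with a classic bit-flipping parity-check decoder. The (7,4) Hamming-style matrix is a toy stand-in; the paper's semi-regular LDPC construction is far larger and purpose-built for its constraints.

```python
import numpy as np

# Toy (7,4) Hamming parity-check matrix; a stand-in for the paper's
# semi-regular LDPC code, used only to illustrate the channel framing.
H = np.array([[1, 0, 1, 0, 1, 0, 1],
              [0, 1, 1, 0, 0, 1, 1],
              [0, 0, 0, 1, 1, 1, 1]])

def bit_flip_decode(received, H, max_iters=20):
    """Classic bit-flipping decoder: repeatedly flip the bit involved
    in the most unsatisfied parity checks until the syndrome is zero."""
    r = received.copy()
    for _ in range(max_iters):
        syndrome = H @ r % 2
        if not syndrome.any():
            return r, True            # all parity checks satisfied
        # per bit, count how many failing checks it participates in
        fail_counts = H[syndrome == 1].sum(axis=0)
        r[np.argmax(fail_counts)] ^= 1
    return r, False

codeword = np.array([0, 1, 1, 0, 0, 1, 1])  # satisfies H @ c = 0 (mod 2)
approx_out = codeword.copy()
approx_out[2] ^= 1                          # one "approximation error"
decoded, ok = bit_flip_decode(approx_out, H)
print(ok, decoded)                          # True [0 1 1 0 0 1 1]
```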
The impact of 3D stacking on GPU-accelerated deep neural networks: An experimental study
In this work, we present a two-tier air-cooled thermal testbed composed of an NVIDIA Tesla K40 GPU and a heater/thermometer top die. The top die has four independently controllable heaters, which can emulate a wide range of components, from low-power memory to high-performance multi-core processor cores. The performance and temperature of the bottom-tier GPU on several deep neural network workloads are investigated as a function of increasing top-die power dissipation, and the implications for 3DIC cooling are discussed.
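A minimal sketch of the GPU-side measurement loop such an experiment needs, assuming only the standard `nvidia-smi` query interface; the top-die heater control in the paper's testbed is bench equipment and is not modeled here.

```python
import csv
import subprocess
import time

def sample_gpu(log_path="gpu_thermal_log.csv", period_s=1.0, duration_s=60):
    """Poll GPU temperature and board power via nvidia-smi while a DNN
    workload runs on the bottom-tier GPU, logging one CSV row per sample."""
    query = ["nvidia-smi",
             "--query-gpu=timestamp,temperature.gpu,power.draw",
             "--format=csv,noheader"]
    with open(log_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["timestamp", "temp_c", "power_w"])
        t_end = time.time() + duration_s
        while time.time() < t_end:
            out = subprocess.check_output(query, text=True).strip()
            writer.writerow([s.strip() for s in out.split(",")])
            time.sleep(period_s)

if __name__ == "__main__":
    sample_gpu(duration_s=10)
```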
Achieving faster execution with shorter compilation time can enable further diversity and innovation in neural networks. However, the current paradigm of executing neural networks relies on hand-optimized libraries, traditional compilation heuristics, or, very recently, simulated annealing and genetic algorithms. Our work takes a unique approach by formulating compiler optimizations for neural networks as a reinforcement learning problem, whose solution takes fewer steps to converge. This solution, dubbed ReLeASE, comes with a sampling algorithm that leverages clustering to focus the costly samples (real hardware measurements) on representative points, subsuming an entire subspace. Our adaptive sampling not only reduces the number of samples, but also improves the quality of samples for better exploration in shorter time. As such, experimentation with real hardware shows that reinforcement learning with adaptive sampling provides a 4.45x speedup in optimization time over AutoTV...
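A sketch of the adaptive-sampling idea under stated assumptions: candidate knob configurations (e.g., tiling and unrolling factors an agent proposes) are clustered, and only one representative per cluster is measured on hardware. The knob encoding below is illustrative, not ReLeASE's actual state representation.

```python
import numpy as np
from sklearn.cluster import KMeans

def representative_samples(configs, k=8, seed=0):
    """Cluster candidate configurations and return one representative
    per cluster, so costly hardware measurements cover the space with
    few runs; each representative is the member nearest its centroid."""
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(configs)
    reps = []
    for c in range(k):
        members = np.where(km.labels_ == c)[0]
        d = np.linalg.norm(configs[members] - km.cluster_centers_[c], axis=1)
        reps.append(members[np.argmin(d)])
    return np.array(reps)

rng = np.random.default_rng(0)
candidates = rng.integers(1, 64, size=(500, 4)).astype(float)  # 4 toy knobs
measure_these = representative_samples(candidates)
print(measure_these)   # indices of configurations to run on hardware
```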
From Tensors to FPGAs: Accelerating Deep Learning
Deep Neural Networks (DNNs) are compute-intensive learning models with growing applicability in a wide range of domains, such as vision, robotics, video analytics, speech recognition, natural language processing, targeted advertising, and web search. With diminishing benefits from technology scaling, the research community is increasingly turning to specialized accelerators for DNNs. Even though ASICs provide significant gains in performance and efficiency for DNNs, they may not cope with ever-evolving DNN models. Furthermore, ASICs and customized cores come at the price of high non-recurring engineering costs over long design periods. FPGAs are an attractive choice for DNNs since they represent an intermediate point between the efficiency of ASICs and the programmability of general-purpose processors, and are becoming available across different market segments. However, obtaining both performance and energy efficiency with FPGAs is a laborious task even for expert hardware desi...
We present ExpAX, a framework for automating approximate programming. ExpAX consists of three components: (1) a programming model based on a new kind of program specification, which we refer to as error expectations; our programming model enables programmers to implicitly relax accuracy constraints without explicitly marking operations as approximate; (2) an approximation safety analysis that automatically infers a safe-to-approximate set of program operations; and (3) an optimization that automatically marks a subset of the safe-to-approximate operations as approximate while statistically adhering to the error expectations. We evaluate ExpAX on a diverse set of Java applications. The results show that ExpAX provides significant energy savings (up to 35%) with a large reduction in programmer effort (between 3× and 113×) while providing formal safety and statistical quality-of-result guarantees.
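ExpAX itself targets Java; the following Python sketch only conveys the flavor of an error expectation: the programmer states a bound on expected error, and a statistical check validates a relaxed implementation against the precise one over sampled inputs. All names here are illustrative.

```python
import random
import statistics

def check_error_expectation(approx_fn, precise_fn, inputs, metric, bound):
    """Sketch of an ExpAX-style error expectation: validate that the
    mean error of the relaxed version stays within the stated bound."""
    errs = [metric(approx_fn(x), precise_fn(x)) for x in inputs]
    mean_err = statistics.mean(errs)
    return mean_err, mean_err <= bound

# Relaxed variant: compute the mean over every 4th element only.
precise_mean = lambda xs: sum(xs) / len(xs)
approx_mean = lambda xs: sum(xs[::4]) / len(xs[::4])

random.seed(0)
data = [[random.random() for _ in range(400)] for _ in range(200)]
mean_err, ok = check_error_expectation(
    approx_mean, precise_mean, data,
    metric=lambda a, p: abs(a - p), bound=0.05)
print(f"mean error {mean_err:.4f}, expectation met: {ok}")
```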
A Principled Approach to Learning Stochastic Representations for Privacy in Deep Neural Inference
INFerence-as-a-Service (INFaaS) in the cloud has enabled the prevalent use of Deep Neural Networks (DNNs) in home automation, targeted advertising, machine vision, and other domains. The cloud receives the inference request as a raw input containing a rich set of private information that can be misused or leaked, possibly inadvertently. This prevalent setting can compromise the privacy of users during the inference phase. This paper sets out to provide a principled approach, dubbed Cloak, that finds optimal stochastic perturbations to obfuscate private data before it is sent to the cloud. To this end, Cloak reduces the information content of the transmitted data while conserving the essential pieces that enable the request to be serviced accurately. The key idea is formulating the discovery of this stochasticity as an offline gradient-based optimization problem that reformulates a pre-trained DNN (with optimized known weights) as an analytical function of the stochastic perturbations. Using...
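A minimal PyTorch sketch of the offline optimization framing, under assumptions: the frozen pre-trained model is treated as a function of learnable per-feature noise scales, reparameterized sampling keeps the objective differentiable, and a hypothetical weight `lam` trades privacy (larger noise) against task accuracy. This is not Cloak's exact objective.

```python
import torch
import torch.nn.functional as F

def train_cloak_noise(model, loader, steps=100, lam=0.1, lr=0.05):
    """Learn log-scales of Gaussian noise per input feature against a
    frozen model: reward large scales (less information sent) while a
    task loss keeps the obfuscated input serviceable."""
    for p in model.parameters():
        p.requires_grad_(False)                  # model weights stay fixed
    x0, _ = next(iter(loader))                   # assumes (x, y) batches
    log_scale = torch.zeros_like(x0[0], requires_grad=True)
    opt = torch.optim.Adam([log_scale], lr=lr)
    it = iter(loader)
    for _ in range(steps):
        try:
            x, y = next(it)
        except StopIteration:
            it = iter(loader)
            x, y = next(it)
        eps = torch.randn_like(x)
        noisy = x + eps * log_scale.exp()        # reparameterized sampling
        task = F.cross_entropy(model(noisy), y)
        loss = task - lam * log_scale.mean()     # bigger noise => lower loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    return log_scale.exp().detach()              # learned noise scales
```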
IEEE Transactions on Systems, Man, and Cybernetics: Systems
Convolutional neural networks (CNNs) provide the best accuracy for disparity estimation. However, CNNs are computationally expensive, making them unfavorable for resource-limited devices with real-time constraints. Recent advances in neural architecture search (NAS) promise opportunities for automated optimization in disparity estimation. However, the main challenge of NAS methods is the significant amount of computing time needed to explore a vast search space (e.g., 1.6 × 10^29) and the cost of training candidates. To reduce the NAS computational demand, many proxy-based NAS methods have been proposed. Despite their success, most of them are designed for comparatively small-scale learning tasks. In this article, we propose a fast NAS method, called FastStereoNet, to enable resource-aware NAS within an intractably large search space. FastStereoNet automatically searches for hardware-friendly CNN architectures based on late acceptance hill climbing (LAHC), followed by simulated annealing (SA). FastStereoNet also employs fine-tuning with a transferred-weights mechanism to improve the convergence of the search process. The collection of these ideas provides competitive results in terms of search time and strikes a balance between accuracy and efficiency. Compared to the state of the art, FastStereoNet provides a 5.25× reduction in search time and a 44.4× reduction in model size. These benefits are attained while yielding comparable accuracy that enables seamless deployment of disparity estimation on resource-limited devices. Finally, FastStereoNet significantly improves the perception quality of disparity estimation deployed on field-programmable gate arrays.
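For concreteness, here is a minimal sketch of late acceptance hill climbing, the first stage of the search described above; the cost function and "architecture" encoding are toy stand-ins, not FastStereoNet's actual search space.

```python
import random

def lahc(init, neighbor, cost, history_len=50, iters=2000):
    """Late Acceptance Hill Climbing: accept a candidate if it beats the
    cost recorded `history_len` iterations ago, not just the current
    cost, letting the search escape shallow local minima. The paper
    refines the result with a simulated-annealing stage afterwards."""
    cur_cost = cost(init)
    cur, best, best_cost = init, init, cur_cost
    hist = [cur_cost] * history_len
    for i in range(iters):
        cand = neighbor(cur)
        c = cost(cand)
        if c <= hist[i % history_len] or c <= cur_cost:
            cur, cur_cost = cand, c
            if c < best_cost:
                best, best_cost = cand, c
        hist[i % history_len] = cur_cost
    return best, best_cost

# Toy stand-in: an "architecture" is a vector of layer widths; the cost
# trades a size penalty against a fake accuracy proxy.
def cost(arch):
    return sum(arch) / 256 + 1.0 / (1 + sum(w * w for w in arch) ** 0.5)

def neighbor(arch):
    a = list(arch)
    j = random.randrange(len(a))
    a[j] = max(1, a[j] + random.choice([-8, 8]))
    return a

random.seed(0)
print(lahc([64, 64, 64, 64], neighbor, cost))
```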
In-DRAM near-data approximate acceleration for GPUs
Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques
Proceedings of the Web Conference 2021
When receiving machine learning services from the cloud, the provider does not need to receive all features; in fact, only a subset of the features is necessary for the target prediction task. Discerning this subset is the key problem of this work. We formulate this problem as a gradient-based perturbation maximization method that discovers the subset in the input feature space with respect to the functionality of the prediction model used by the provider. After identifying the subset, our framework, Cloak, suppresses the rest of the features using utility-preserving constant values that are discovered through a separate gradient-based optimization process. We show that Cloak does not necessarily require collaboration from the service provider beyond its normal service, and can be applied in scenarios where we only have black-box access to the service provider's model. We theoretically guarantee that Cloak's optimizations reduce the upper bound of the Mutual Information (MI) between the data and the sifted representations that are sent out. Experimental results show that Cloak reduces the mutual information between the input and the sifted representations by 85.01% with only a negligible reduction in utility (1.42%). In addition, we show that Cloak greatly diminishes adversaries' ability to learn and infer non-conducive features.
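A simplified PyTorch sketch of the sifting pipeline, with stated substitutions: features are scored by input-gradient saliency and the suppressed ones are filled with feature means, whereas the paper discovers utility-preserving constants with a second gradient-based optimization.

```python
import torch

def sift_features(model, x, y, keep_frac=0.25):
    """Score each input feature by the gradient of the task loss w.r.t.
    the input, keep the top fraction, and suppress the rest with
    constant fill values (feature means here, as a stand-in)."""
    x = x.clone().requires_grad_(True)
    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()
    saliency = x.grad.abs().mean(dim=0)          # per-feature importance
    k = max(1, int(keep_frac * saliency.numel()))
    keep = torch.zeros(saliency.numel(), dtype=torch.bool)
    keep[saliency.flatten().topk(k).indices] = True
    keep = keep.view_as(saliency)
    fill = x.detach().mean(dim=0)                # stand-in constants
    return torch.where(keep, x.detach(), fill), keep

model = torch.nn.Linear(20, 2)                   # toy black-box stand-in
x, y = torch.randn(32, 20), torch.randint(0, 2, (32,))
sifted, mask = sift_features(model, x, y)
print(mask.sum().item(), "of", mask.numel(), "features kept")
```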
FlexiGAN: An End-to-End Solution for FPGA Acceleration of Generative Adversarial Networks
2018 IEEE 26th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)
Generative Adversarial Networks (GANs) are among the frontiers of deep networks. GANs consist of two models: a generative model and a discriminative model. While the discriminative model uses the conventional convolution operator, the generative model is fundamentally different due to its use of the transposed convolution operator. Unlike conventional convolution, transposed convolution initially inserts a large number of zeros in its input. This zero-insertion leads to a large number of inconsequential operations and creates different patterns of computation across the sliding windows. The inconsequential operations, along with the variation in computation patterns, lead to significant resource underutilization when evaluated using conventional convolution hardware. This paper introduces FlexiGAN, an end-to-end solution, from high-level GAN specification to an optimized synthesizable FPGA accelerator. The FlexiGAN framework is coupled with a novel architecture that aims to harness the benefits of both MIMD and SIMD execution models. The proposed architecture separates data retrieval and data processing units at the finest granularity of each compute engine. Leveraging this separation, we introduce a succinct set of operations that enable us to significantly reduce on-chip memory usage, which is generally scarce in FPGAs. We evaluate our end-to-end solution across various GANs from the machine learning literature. FlexiGAN provides 2.4× higher performance than an optimized conventional convolution design. In addition, FlexiGAN, on average, yields 2.8× (up to 3.7×) improvement in Performance-per-Watt over a high-end GPU. These results indicate that FlexiGAN is an effective initial step towards providing an end-to-end solution for accelerating GANs.
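The zero-insertion problem is easy to see in one dimension. The sketch below implements a 1D transposed convolution as zero-insertion followed by ordinary convolution and counts the multiplies that touch inserted zeros; those are exactly the inconsequential operations FlexiGAN's architecture avoids.

```python
import numpy as np

def zero_insert(x, stride):
    """Insert (stride - 1) zeros between input elements, the first step
    of a transposed convolution."""
    up = np.zeros(stride * (len(x) - 1) + 1)
    up[::stride] = x
    return up

x = np.arange(1.0, 6.0)          # 1D input, length 5
k = np.array([1.0, 2.0, 3.0])    # kernel, length 3
up = zero_insert(x, stride=2)    # [1 0 2 0 3 0 4 0 5]
out = np.convolve(up, k)         # transposed conv via zero-insert + conv

# Count multiplies that touch an inserted zero across all windows.
zeros = (up == 0)
total = wasted = 0
for i in range(len(up) + len(k) - 1):
    for j in range(len(k)):
        t = i - j
        if 0 <= t < len(up):
            total += 1
            wasted += zeros[t]
print(out)
print(f"{wasted}/{total} multiplies are inconsequential")
```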
Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems
A wide variety of deep neural applications increasingly rely on the cloud to perform their compute-heavy inference. This common practice requires sending private and privileged data over the network to remote servers, exposing it to the service provider and potentially compromising its privacy. Even if the provider is trusted, the data can still be vulnerable over communication channels or via side-channel attacks in the cloud. To that end, this paper aims to reduce the information content of the communicated data, with as little compromise on inference accuracy as possible, by making the sent data noisy. An undisciplined addition of noise can significantly reduce the accuracy of inference, rendering the service unusable. To address this challenge, this paper devises Shredder, an end-to-end framework that, without altering the topology or the weights of a pre-trained network, learns additive noise distributions that significantly reduce the information content of communicated data while maintaining the inference accuracy. The key idea is finding the additive noise distributions by casting this as a disjoint offline learning process with a loss function that strikes a balance between accuracy and information degradation. The loss function also exposes a knob for a disciplined and controlled asymmetric trade-off between privacy and accuracy. While keeping the DNN intact, Shredder divides inference between the cloud and the edge device, striking a balance between computation and communication. In the separate phase of inference, the edge device takes samples from the Laplace distributions that were collected during the proposed offline learning phase and populates a noise tensor with these sampled elements. Then, the edge device merely adds this populated noise tensor to the intermediate results to be sent to the cloud. As such, Shredder enables accurate inference on...
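A minimal sketch of the deployment-time path described above, assuming hypothetical (loc, scale) Laplace parameters stand in for those collected offline, and a one-layer stand-in for the edge partition of the network.

```python
import numpy as np

def shredder_edge_pass(edge_fn, x, loc, scale, rng=None):
    """The edge device runs the early layers, draws a noise tensor from
    Laplace distributions learned offline, adds it to the intermediate
    activation, and ships the result to the cloud."""
    rng = rng or np.random.default_rng()
    inter = edge_fn(x)                                 # edge-side layers
    noise = rng.laplace(loc, scale, size=inter.shape)  # sampled per element
    return inter + noise                               # sent to the cloud

# Toy prefix: one linear layer standing in for the edge partition.
W = np.random.default_rng(0).normal(size=(16, 8))
edge_fn = lambda x: x @ W
x = np.ones((1, 16))
noisy_activation = shredder_edge_pass(edge_fn, x, loc=0.0, scale=1.5)
print(noisy_activation.shape)    # (1, 8) tensor sent to the cloud
```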
IEEE Micro
Deep quantization (below eight bits) can significantly reduce DNN computation and storage by decreasing the bitwidth of network encodings. However, without arduous manual effort, this deep quantization can lead to significant accuracy loss, leaving it in a position of questionable utility. We propose a systematic approach to tackle this problem by automating the process of discovering the bitwidths through an end-to-end deep reinforcement learning framework (RELEQ). This framework utilizes the sample efficiency of Proximal Policy Optimization (PPO) to explore the exponentially large space of possible assignments of bitwidths to layers. We show how RELEQ can balance speed and quality, and provide a heterogeneous bitwidth assignment for quantization of a large variety of deep networks with minimal accuracy loss (≤ 0.3%) while minimizing computation and storage costs. With these DNNs, RELEQ enables conventional hardware and custom DNN accelerators to achieve 2.2× speedup over 8-bit execution.
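To make the per-layer knob concrete, here is a sketch of uniform symmetric quantization at a given bitwidth, the kind of assignment RELEQ's agent would evaluate; the PPO loop itself and the reward shaping are omitted.

```python
import numpy as np

def quantize(w, bits):
    """Uniform symmetric quantization of a weight tensor to `bits`;
    the agent's reward would combine post-quantization accuracy with
    the compute/storage saved at each layer."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax if qmax else 1.0
    return np.round(w / scale).clip(-qmax - 1, qmax) * scale

rng = np.random.default_rng(0)
layers = [rng.normal(size=(64, 64)) for _ in range(4)]
bitwidths = [8, 4, 3, 6]                 # a heterogeneous assignment
for w, b in zip(layers, bitwidths):
    err = np.abs(quantize(w, b) - w).mean()
    print(f"{b}-bit: mean abs quantization error {err:.4f}")
```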
SiMul: An Algorithm-Driven Approximate Multiplier Design for Machine Learning
IEEE Micro
Proceedings of the VLDB Endowment
The data revolution is fueled by advances in machine learning, databases, and hardware design. Programmable accelerators are making their way into each of these areas independently. As such, there is a void of solutions that enable hardware acceleration at the intersection of these disjoint fields. This paper sets out to be the initial step towards a unifying solution for in-Database Acceleration of Advanced Analytics (DAnA). Deploying specialized hardware, such as FPGAs, for in-database analytics currently requires hand-designing the hardware and manually routing the data. Instead, DAnA automatically maps a high-level specification of advanced analytics queries to an FPGA accelerator. The accelerator implementation is generated for a User Defined Function (UDF), expressed as part of an SQL query using a Python-embedded Domain-Specific Language (DSL). To realize an efficient in-database integration, DAnA accelerators contain a novel hardware structure, Striders, that directl...
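The following is an entirely hypothetical sketch of what a Python UDF invoked from SQL could look like; DAnA's real DSL and SQL syntax are not reproduced here, and the decorator and `TRAIN` construct are illustrative only.

```python
def dana_udf(fn):
    """Hypothetical marker for a Python function that a DAnA-like
    compiler would translate into an FPGA accelerator, with Strider
    units streaming rows directly from the database buffer pool."""
    fn.is_dana_udf = True
    return fn

@dana_udf
def sgd_step(w, features, label, lr=0.01):
    # One gradient step of linear regression over a streamed row.
    pred = sum(wi * xi for wi, xi in zip(w, features))
    grad = [(pred - label) * xi for xi in features]
    return [wi - lr * gi for wi, gi in zip(w, grad)]

# The UDF would then be invoked from SQL, roughly:
query = """
SELECT * FROM TRAIN(sgd_step, 'SELECT features, label FROM readings');
"""
print(sgd_step.is_dana_udf, query)
```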
Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems - ASPLOS '16, 2016
Approximate computing trades quality of application output for higher efficiency and performance. Approximation is useful only if its impact on application output quality is acceptable to the users. However, there is a lack of systematic solutions and studies that explore users' perspective on the effects of approximation. In this paper, we seek to provide one such solution for developers to probe and discover the boundary of quality loss that most users will deem acceptable. We propose AxGames, a crowdsourced solution that enables developers to readily infer a statistical common ground from the general public through three entertaining games. The users engage in these games by betting on their opinion about the quality loss of the final output while the AxGames framework collects statistics about their perceptions. The framework then statistically analyzes the results to determine the acceptable levels of quality for a pair of (application, approximation technique). The three games are designed such that they effectively capture quality requirements with various tradeoffs and contexts. To evaluate AxGames, we examine seven diverse applications that produce user-perceptible outputs and cover a wide range of domains, including image processing, optical character recognition, speech-to-text conversion, and audio processing. We recruit 700 participants/users through Amazon's Mechanical Turk to play the games, which collect statistics about their perception of different levels of quality. Subsequently, the AxGames framework uses the Clopper-Pearson exact method, which computes a binomial proportion confidence interval, to analyze the collected statistics for each level of quality. Using this analysis, AxGames can statistically project the quality level that satisfies a given percentage of users. The developers can use these statistical projections to tune the level of approximation based on the user experience. We find that the level of acceptable quality loss varies significantly across applications. For instance, to satisfy 90% of users, the level of acceptable quality loss is 2% for one application (image processing) and 26% for another (audio processing). Moreover, the pattern with which the crowd responds to approximation takes a significantly different shape and form depending on the class of application. These results confirm the necessity of solutions that systematically explore the effect of approximation on the end-user experience.
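The Clopper-Pearson exact interval named above has a standard closed form via the beta distribution; the sketch below computes it, with the example counts chosen purely for illustration.

```python
from scipy.stats import beta

def clopper_pearson(k, n, alpha=0.05):
    """Exact binomial confidence interval: k of n users judged a given
    quality level acceptable; returns the two-sided (1 - alpha) bounds
    on the true acceptance proportion."""
    lo = beta.ppf(alpha / 2, k, n - k + 1) if k > 0 else 0.0
    hi = beta.ppf(1 - alpha / 2, k + 1, n - k) if k < n else 1.0
    return lo, hi

# E.g., suppose 630 of 700 participants accept a 2% quality loss:
lo, hi = clopper_pearson(630, 700)
print(f"acceptance in [{lo:.3f}, {hi:.3f}] with 95% confidence")
```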
Towards Statistical Guarantees in Controlling Quality Tradeoffs for Approximate Acceleration
2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), 2016
Conventionally, an approximate accelerator replaces every invocation of a frequently executed region of code without considering the final quality degradation. However, there is a vast decision space in which each invocation can either be delegated to the accelerator (improving performance and efficiency) or run on the precise core (maintaining quality). In this paper we introduce Mithra, a co-designed hardware-software solution that navigates these tradeoffs to deliver high performance and efficiency while lowering the final quality loss. Mithra seeks to identify whether each individual accelerator invocation will lead to an undesirable quality loss and, if so, directs the processor to run the original precise code. This identification is cast as a binary classification task that requires a cohesive co-design of hardware and software. The hardware component performs the classification at runtime and exposes a knob to the software mechanism to control quality tradeoffs. The software tunes this knob by solving a statistical optimization problem that maximizes the benefits from approximation while providing statistical guarantees that the final quality level will be met with high confidence. The software uses this knob to tune and train the hardware classifiers. We devise two distinct hardware classifiers, one table-based and one neural-network-based. To understand the efficacy of these mechanisms, we compare them with an ideal but infeasible design, the oracle. Results show that, with 95% confidence, the table-based design can restrict the final output quality loss to 5% for 90% of unseen input sets while providing 2.5× speedup and 2.6× energy efficiency. The neural design shows similar speedup; however, it improves efficiency by 13%. Compared to the table-based design, the oracle improves speedup by 26% and efficiency by 36%. These results show that Mithra performs within a close range of the oracle and can effectively navigate the quality tradeoffs in approximate acceleration.
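A sketch of the software side's knob tuning under stated assumptions: a threshold on the classifier's predicted quality loss is swept, and the most permissive setting whose Clopper-Pearson lower bound on held-out invocations still clears the target fraction is kept. The classifier and loss distributions below are synthetic stand-ins.

```python
import numpy as np
from scipy.stats import beta

def tune_knob(pred_loss, true_loss, target=0.05, want=0.90, conf=0.95):
    """Sweep the classifier threshold (the hardware knob); count
    held-out invocations whose final quality loss stays under `target`
    (invocations kept on the precise core always do), and keep the most
    permissive threshold whose exact lower confidence bound clears
    the `want` fraction."""
    best = None
    for thr in np.linspace(0, target, 50):
        to_accel = pred_loss <= thr                         # delegated
        ok = np.where(to_accel, true_loss <= target, True)  # precise is exact
        k, n = int(ok.sum()), ok.size
        lower = beta.ppf((1 - conf) / 2, k, n - k + 1) if k > 0 else 0.0
        if lower >= want:
            best = thr                  # more delegation, still safe
    return best

rng = np.random.default_rng(1)
true = rng.exponential(0.02, 5000)                # per-invocation loss
pred = true + rng.normal(0, 0.005, 5000)          # imperfect classifier
print(tune_knob(pred, true))
```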
Scalable Neural-Network Stream Processor
New computation technology demands high performance and scalability on the hardware side. Traditional shared-bus architectures cannot fulfill the demand for higher performance because of their shared nature. Other bus architectures, such as multi-layer buses, do not satisfy the scalability needs of such systems. The Network-on-Chip (NoC) structure is a new paradigm that addresses these two requirements. The NnSP architecture originally has...
Proceedings of the 2016 Design, Automation & Test in Europe Conference & Exhibition (DATE), 2016
Modern applications, including graphics, multimedia, web search, and data analytics, not only can benefit from acceleration, but also exhibit significant degrees of tolerance to imprecise computation. This amenability to approximation provides an opportunity to trade quality of results for higher performance and better resource utilization. Exploiting this opportunity is particularly important for FPGA accelerators, which are inherently subject to many resource constraints. To better utilize the FPGA resources, we devise GRATER, an automated design workflow for FPGA accelerators that leverages imprecise computation to increase data-level parallelism and achieve higher computational throughput. The core of our workflow is a source-to-source compiler that takes in an input kernel and applies a novel optimization technique that selectively reduces the precision of the kernel's data and operations. By selectively reducing precision, the area required to synthesize the kernels on the FPGA decreases, allowing a larger number of operations and parallel kernels to be integrated into the fixed area of the FPGA. The larger number of integrated kernels provides more hardware context to better exploit data-level parallelism in the target applications. To effectively explore the design space of approximate kernels, we exploit a genetic algorithm to find a subset of safe-to-approximate operations and data elements and then tune their precision levels until the desired output quality is achieved. GRATER is a fully software technique and does not require any changes to the underlying FPGA hardware. We evaluate GRATER on a diverse set of data-intensive OpenCL benchmarks from the AMD SDK. The synthesis results on a modern Altera FPGA show that our approximation workflow yields 1.4×-3.0× higher throughput with less than 1% quality loss.
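A toy sketch of the genetic search over precision assignments, under assumptions: each individual assigns a bitwidth per kernel variable, total bits stand in for FPGA area, and a synthetic quality model replaces real kernel evaluation.

```python
import random

def ga_precision_search(n_vars, quality, max_bits=32, pop=20, gens=40,
                        quality_floor=0.99):
    """Evolve per-variable bitwidth vectors: fitness favors lower total
    precision (an area stand-in), and candidates whose output quality
    falls below the floor are rejected outright."""
    def random_ind():
        return [random.randint(4, max_bits) for _ in range(n_vars)]
    def fitness(ind):
        return -sum(ind) if quality(ind) >= quality_floor else float("-inf")
    population = [random_ind() for _ in range(pop)]
    for _ in range(gens):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop // 2]
        children = []
        while len(children) < pop - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, n_vars)        # one-point crossover
            child = a[:cut] + b[cut:]
            j = random.randrange(n_vars)             # point mutation
            child[j] = max(4, min(max_bits, child[j] + random.choice([-2, 2])))
            children.append(child)
        population = parents + children
    return max(population, key=fitness)

# Synthetic quality model: quality degrades as bits shrink below 16.
quality = lambda ind: min(1.0, sum(min(b, 16) for b in ind) / (16 * len(ind)) + 0.02)
random.seed(0)
print(ga_precision_search(n_vars=6, quality=quality))
```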
2015 Design, Automation & Test in Europe Conference & Exhibition (DATE), Mar 9, 2015
Relaxing the traditional abstraction of "near-perfect" accuracy in hardware design can lead to significant gains in energy efficiency, area, and performance. To exploit this opportunity, there is a need for design abstractions that can systematically incorporate approximation in hardware design. We introduce Axilog, a set of language annotations that provides the necessary syntax and semantics for approximate hardware design and reuse in Verilog. Axilog enables the designer to relax the accuracy requirements in certain parts of the design, while keeping the critical parts strictly precise. Axilog is coupled with a Relaxability Inference Analysis that automatically infers the relaxable gates and connections from the designer's annotations. The analysis provides formal safety guarantees that approximation will only affect the parts that the designer intended to approximate, referred to as relaxable elements. Finally, the paper describes a synthesis flow that approximates only the relaxable elements. Axilog enables applying approximation in the synthesis process while abstracting away the details of approximate synthesis from the designer. We evaluate Axilog, its analysis, and the synthesis flow using a diverse set of benchmark designs. The results show that the intuitive nature of the language extensions, coupled with the automated analysis, enables safe approximation of designs even with thousands of lines of code. Applying our approximate synthesis flow to these designs yields, on average, 54% energy savings and 1.9× area reduction with 10% output quality loss.
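The safety property behind the relaxability analysis can be sketched as reverse reachability over the netlist: a gate is safe to approximate only if every primary output in its transitive fanout was annotated relaxable. The Python sketch below illustrates that inference on a toy netlist; Axilog's actual analysis operates on Verilog with richer annotation semantics.

```python
def relaxable_gates(fanout, outputs, relaxed_outputs):
    """A gate is relaxable only if every primary output it can reach
    was annotated relaxable by the designer. `fanout` maps each
    gate/net to whatever it drives."""
    def reachable_outputs(gate):
        seen, stack, found = set(), [gate], set()
        while stack:
            for nxt in fanout.get(stack.pop(), []):
                if nxt not in seen:
                    seen.add(nxt)
                    if nxt in outputs:
                        found.add(nxt)
                    else:
                        stack.append(nxt)
        return found
    result = set()
    for g in fanout:
        ro = reachable_outputs(g)
        if ro and ro <= relaxed_outputs:
            result.add(g)
    return result

# Toy netlist: g1 feeds only the relaxed output; g2 also feeds a precise one.
fanout = {"g1": ["out_lsb"], "g2": ["out_lsb", "g3"], "g3": ["out_msb"]}
outputs = {"out_lsb", "out_msb"}
print(relaxable_gates(fanout, outputs, relaxed_outputs={"out_lsb"}))  # {'g1'}
```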