Amir Ashouri | University of Toronto
Papers by Amir Ashouri
2022 International Conference on Compilers, Architecture, and Synthesis for Embedded Systems (CASES)
Cornell University - arXiv, Jul 18, 2022
For the past 25 years, we have witnessed an extensive application of Machine Learning to the compiler space, namely the optimization-selection and phase-ordering problems. However, few works have been upstreamed into state-of-the-art compilers such as LLVM to seamlessly integrate the former into the optimization pipeline of a compiler and be readily deployed by the user. MLGO was among the first of such projects, and it strives only to reduce the code size of a binary with an ML-based inliner using Reinforcement Learning. This paper presents MLGOPerf, the first end-to-end framework capable of optimizing performance using LLVM's ML-Inliner. It employs a secondary ML model to generate rewards for training a retargeted Reinforcement Learning agent, previously used as the primary model by MLGO. It does so by predicting the post-inlining speedup of a function under analysis, and it enables a fast training framework for the primary model which otherwise would not be practical. The experimental results show MLGOPerf is able to gain up to 1.8% and 2.2% with respect to LLVM's optimization at O3 when trained for performance on the SPEC CPU2006 and Cbench benchmarks, respectively. Furthermore, the proposed approach provides up to 26% increased opportunities to autotune code regions for our benchmarks, which can be translated into an additional 3.7% speedup. CCS CONCEPTS • Software and its engineering → Compilers; • Computing methodologies → Machine learning.
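As a rough illustration of the two-model training loop the abstract describes (not the actual MLGO/MLGOPerf code: the reward model, features, and agent below are invented for the sketch), a secondary model predicting post-inlining speedup can stand in for slow real measurements when training a simple inlining agent:

```python
import random

# Toy stand-in for MLGOPerf's two-model setup: a "secondary" reward model
# predicts the post-inlining speedup of a call site, and a simple
# epsilon-greedy agent (the "primary" model) learns an inlining policy
# from those predicted rewards instead of real, slow measurements.

def predicted_speedup(callee_size, call_count):
    """Secondary model (here a hand-made heuristic): small, hot callees
    are predicted to speed up after inlining; large ones to slow down."""
    return call_count / 100.0 - callee_size / 50.0

class InlineAgent:
    def __init__(self):
        self.q = {True: 0.0, False: 0.0}    # action-value estimates
        self.n = {True: 0, False: 0}        # visit counts

    def act(self, rng):
        if rng.random() < 0.2:              # explore
            return rng.choice([True, False])
        return max(self.q, key=self.q.get)  # exploit

    def learn(self, action, reward):
        self.n[action] += 1
        self.q[action] += (reward - self.q[action]) / self.n[action]

rng = random.Random(0)
agent = InlineAgent()
# Train on synthetic call sites where inlining is usually profitable.
for _ in range(500):
    size, hotness = rng.randint(1, 20), rng.randint(50, 500)
    inline = agent.act(rng)
    reward = predicted_speedup(size, hotness) if inline else 0.0
    agent.learn(inline, reward)

print(agent.q[True] > agent.q[False])  # agent learned to prefer inlining
```

The point of the sketch is only the data flow: the predicted reward replaces a compile-and-run measurement, which is what makes training the primary model fast.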
Automatic Tuning of Compilers Using Machine Learning, 2017
This chapter presents the first of two methods to tackle the phase-ordering problem of compiler optimizations. Here, we present an intermediate speedup prediction approach, followed by a full-sequence prediction approach in the next chapter, and we discuss the pros and cons of each approach in detail. Today's compilers offer a vast number of transformation options to choose among, and this choice can significantly impact the performance of the code being optimized. Not only does the selection of compiler options represent a hard problem to solve, but the ordering of the phases adds further complexity, making it a long-standing problem in compilation research. This chapter presents an innovative approach to tackling the compiler phase-ordering problem using predictive modeling. The proposed methodology enables (i) efficient exploration of the compiler optimization space, including optimization permutations and repetitions, and (ii) extraction of dynamic application features to predict the next-best optimization to apply to maximize performance given the current status. We assess the proposed methodology using two different search heuristics on the compiler optimization space and demonstrate its effectiveness on the selected set of applications. Using the proposed methodology, we observed up to 4% execution speedup with respect to the LLVM standard baseline.
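The next-best-optimization idea can be sketched as a greedy loop over a speedup predictor. Everything below is invented for illustration: a small lookup table stands in for the chapter's trained model, and the pass names and speedup values are made up:

```python
# Predicted speedup of applying a pass, given which pass ran before it.
# In the real approach this would come from a model over dynamic features.
PREDICTED = {
    ("start", "inline"): 1.10,
    ("start", "unroll"): 1.02,
    ("inline", "unroll"): 1.08,   # unrolling pays off more after inlining
    ("inline", "gvn"):    1.05,
    ("unroll", "gvn"):    1.01,
    ("unroll", "inline"): 1.04,
}

def next_best(state, candidates):
    """Pick the candidate pass with the highest predicted speedup."""
    return max(candidates, key=lambda p: PREDICTED.get((state, p), 1.0))

sequence, state, speedup = [], "start", 1.0
candidates = {"inline", "unroll", "gvn"}
while candidates:
    p = next_best(state, candidates)
    speedup *= PREDICTED.get((state, p), 1.0)
    sequence.append(p)
    candidates.remove(p)
    state = p

print(sequence, round(speedup, 3))
```

The greedy loop never enumerates full orderings; it only asks the predictor "which pass helps most from here?", which is what makes the intermediate-prediction approach cheap.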
The diversity of today’s architectures has forced programmers and compiler researchers to port their applications across many different platforms. Compiler auto-tuning plays a major role within that process, as it involves levels of complexity at which the standard pre-defined optimization levels fail to deliver the best results, since they are tuned for average performance. To address the problem, different optimization techniques have been used for traversing and pruning the huge space and for adaptability and portability. In this short paper, we propose our different approaches, including the use of Design-Space-Exploration (DSE) techniques and machine learning, to further address the problem. It has been demonstrated and assessed that utilizing these techniques has positive effects on the performance metrics of the given applications and can bring up to 60% performance improvement with respect to standard optimization levels (e.g., -O2).
SpringerBriefs in Applied Sciences and Technology, Dec 23, 2017
Proceedings of the 15th ACM International Conference on Computing Frontiers, 2018
Designing and optimizing applications for energy-efficient High Performance Computing systems up to the Exascale era is an extremely challenging problem. This paper presents the toolbox developed in the ANTAREX European project for autotuning and adaptivity in energy-efficient HPC systems. In particular, the modules of the ANTAREX toolbox are described, as well as some preliminary results of its application to two target use cases.
Automatic Tuning of Compilers Using Machine Learning, 2017
Very Long Instruction Word (VLIW) processors represent an attractive solution for embedded computing, offering significant computational power with reduced hardware complexity. However, they impose higher compiler complexity, since instructions are executed in parallel based on the static compiler schedule. Therefore, finding a promising set of compiler transformations and defining their effects have a significant impact on overall system performance. In this chapter, we provide a methodology with an integrated framework to automatically (i) generate optimized application-specific VLIW architectural configurations and (ii) analyze compiler-level transformations, enabling application-specific compiler tuning over customized VLIW system architectures. We base the analysis on a Design of Experiments (DoE) procedure that statistically captures the higher-order effects among different sets of activated compiler transformations. Applying the proposed methodology to real-case embedded application scenarios, we show that (i) only a limited set of compiler transformations affects performance with a high confidence level (over 95%) and (ii) using them we can achieve gains of 16–23% in comparison to the default optimization levels. In the next chapters, we go deeper into building machine-learning models to tackle the problem.
2.1 VLIW
Embedded system design traditionally exploits the knowledge of the target domain, e.g., telecommunication, multimedia, home automation, etc., to customize the HW/SW coefficients found on the deployed computing devices. Although the functionalities of these devices differ, the computational structure and design are tightly connected with the platform they rely on.
Platform-based designs have been proposed as a promising alternative for designing complex systems by redefining the problem as tuning specific design parameters of a platform template. The scientific and commercial urge to use VLIW technology seems to have risen again after three decades of existence [1]; VLIW processor templates are being used particularly in embedded processors designed to perform special-purpose functions, usually for real-time processing or hardware acceleration. Being able to use VLIW power-saving
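The DoE-style analysis of compiler flags can be sketched as a two-level full-factorial design with main-effect estimation. The flags and the response function below are hypothetical stand-ins for real measurements on a VLIW target:

```python
from itertools import product

# Two-level full-factorial Design of Experiments over three on/off
# compiler flags, with main effects estimated from a made-up response
# function standing in for measured cycle counts (lower is better).

FLAGS = ["unroll", "if_convert", "sw_pipeline"]

def measure(cfg):
    """Stand-in for compiling and running the benchmark."""
    cycles = 1000
    if cfg["unroll"]:      cycles -= 120
    if cfg["sw_pipeline"]: cycles -= 80
    if cfg["unroll"] and cfg["if_convert"]: cycles += 30  # interaction term
    return cycles

runs = []
for bits in product([0, 1], repeat=len(FLAGS)):
    cfg = dict(zip(FLAGS, bits))
    runs.append((cfg, measure(cfg)))

def main_effect(flag):
    """Mean response at level 1 minus mean response at level 0."""
    on  = [y for cfg, y in runs if cfg[flag]]
    off = [y for cfg, y in runs if not cfg[flag]]
    return sum(on) / len(on) - sum(off) / len(off)

for f in FLAGS:
    print(f, main_effect(f))
```

In a real DoE analysis the effects would be tested for statistical significance; here the negative main effects of `unroll` and `sw_pipeline` simply flag them as the transformations worth activating.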
The diversity of today’s architectures has forced programmers and compiler researchers to port their applications across many different platforms. Compiler auto-tuning plays a major role within that process, as it involves levels of complexity at which the standard optimization levels fail to deliver the best results, since they are tuned for average performance. To address the problem, different optimization techniques have been used for traversing and pruning the huge space and for adaptability and portability. In this paper, we evaluate our different autotuning approaches, including the use of Design Space Exploration (DSE) techniques and machine learning, to tackle both the selection and the phase-ordering of compiler optimizations. It has been experimentally demonstrated that using these techniques has positive effects on the performance metrics of the applications under analysis and can bring up to 60% performance improvement with respect to standard optimization levels...
The diversity of today’s architectures has forced programmers and compiler researchers to port their applications across many different platforms. Compiler auto-tuning plays a major role within that process, as it involves levels of complexity at which the standard pre-defined optimization levels fail to deliver the best results, since they are tuned for average performance. To address the problem, different optimization techniques have been used for traversing and pruning the huge space and for adaptability and portability. In this short paper, we propose our different approaches, including the use of Design Space Exploration (DSE) techniques and Machine Learning, to tackle both the selection and the phase-ordering of compiler optimizations. It has been demonstrated and assessed that utilizing these techniques has positive effects on the performance metrics of the given applications and can bring up to 60% performance improvement with respect to standard optimization ...
Since the mid-1990s, researchers have been trying to use machine-learning-based approaches to solve a number of different compiler optimization problems. These techniques primarily enhance the quality of the obtained results and, more importantly, make it feasible to tackle two main compiler optimization problems: optimization selection (choosing which optimizations to apply) and phase-ordering (choosing the order of applying optimizations). The compiler optimization space continues to grow due to the advancement of applications, the increasing number of compiler optimizations, and new target architectures. Generic optimization passes in compilers cannot fully leverage newly introduced optimizations and, therefore, cannot keep up with the pace of increasing options. This survey summarizes and classifies the recent advances in using machine learning for the compiler optimization field, particularly on the two major problems of (1) selecting the best optimizations and (2) the phase-ordering ...
This chapter proposes our second approach to tackling the phase-ordering problem. We already showed our intermediate speedup prediction method in Chap. 4. Here, we present our full-sequence speedup prediction method called MiCOMP. MiCOMP (Mitigating the Compiler Phase-ordering problem using optimization sub-sequences and machine learning) is an autotuning framework that effectively mitigates the compiler phase-ordering problem using machine-learning techniques. The idea is to group the optimization passes of the LLVM O3 setting into different clusters and to predict the speedup of a complete sequence over those clusters. The predictive model uses (i) dynamic features, (ii) an encoded version of the compiler sequence, and (iii) an exploration heuristic to tackle the problem. Experimental results using the LLVM compiler framework and the Cbench suite show the effectiveness of the encoding technique for application-based reordering of passes while using a number of predictive ...
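The sub-sequence idea can be illustrated with a toy example: searching over orderings of a few pass clusters instead of orderings of individual passes shrinks the space dramatically. The cluster contents below are invented for the sketch, not LLVM's actual O3 pipeline:

```python
import math
from itertools import permutations

# Group individual passes into a few "clusters" (sub-sequences), then
# search over cluster orderings rather than single-pass orderings.
CLUSTERS = {
    "A": ["simplifycfg", "sroa", "early-cse"],
    "B": ["inline", "function-attrs"],
    "C": ["loop-rotate", "licm", "loop-unroll"],
}

def expand(cluster_order):
    """Turn an ordering of clusters into a concrete pass sequence."""
    return [p for c in cluster_order for p in CLUSTERS[c]]

num_passes = sum(len(v) for v in CLUSTERS.values())  # 8 individual passes
cluster_orderings = list(permutations(CLUSTERS))

print(len(cluster_orderings))       # 6 cluster orderings to explore
print(math.factorial(num_passes))   # vs 40320 orderings of single passes
print(expand(("B", "A", "C")))
```

A predictive model then only has to score the handful of cluster orderings, each of which expands deterministically into a full pass sequence.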
After presenting our DSE approach for finding good compiler optimizations, we present our autotuning framework to tackle the problem of selecting the best compiler passes. It leverages machine learning and an application characterization to find the most promising optimization passes for a given application. This chapter proposes COBAYN (Compiler autotuning framework using Bayesian Networks), an autotuning methodology based on machine learning to speed up application performance and to reduce the cost of the compiler optimization phases. The proposed framework is based on application characterization done dynamically using micro-architecture-independent features and Bayesian networks. The chapter also presents an evaluation based on static analysis and hybrid feature-collection approaches. In addition, we compare our approach against several state-of-the-art machine-learning models. Experiments are carried out on an ARM embedded platform and the GCC compiler by considering two benchmark...
2018 Design, Automation & Test in Europe Conference & Exhibition (DATE), 2018
Configuring program parallelism and selecting optimal compiler options according to the underlying platform architecture is a difficult task. Typically, this task is either assigned to the programmer or handled by a standard one-fits-all policy generated by the compiler or runtime system. A runtime selection of the best configuration requires the insertion of a lot of glue code for profiling and runtime selection, which represents a programming wall for application developers. This paper presents a structured approach, called SOCRATES, based on an aspect-oriented language (LARA) and a runtime autotuner (mARGOt) to mitigate this problem. LARA has been used to hide the glue-code insertion, thus separating the pure functional application description from extra-functional requirements. mARGOt has been used for the automatic selection of the best configuration according to the runtime evolution of the application.
Embedded systems can be considered specialized computing systems used for multi-purpose applications ranging from mobile phones to military and home-automation devices. Although the functionalities of these devices differ, the computational structure and design are tightly connected with the platform and programmability on which they rely. During the design phase, Design Space Exploration (DSE) plays a major role in pruning the large design space and supporting the designer during the analysis phase. To address the complex solution space, there is a need to extend conventional exploration approaches by applying data mining to extract knowledge from statistical results. The goal of this work is to develop a statistical exploration and analysis framework for compiler-architecture co-design in VLIW processors to tackle the aforementioned problem, proposing an automatic methodology based on a tool-chain including the MOST tool (M...
2014 IEEE 12th Symposium on Embedded Systems for Real-time Multimedia (ESTIMedia), 2014
The complexity and diversity of today's architectures require an additional effort from programmers in porting and tuning application code across different platforms. The problem is even more complex considering that the compiler also requires some tuning, since standard optimization options have been customized for specific architectures or designed for the average case. This paper proposes a machine-learning approach for reducing the cost of the compiler autotuning phase and for speeding up application performance in embedded architectures. The proposed framework is based on an application characterization done dynamically with micro-architecture-independent features and on the usage of Bayesian Networks. The main characteristic of the Bayesian Network approach is that it does not describe the solution as a strict set of compiler transformations to be applied, but as a complex probability distribution function to be sampled. Experimental results, carried out on an ARM platform and the GCC transformation space, prove the effectiveness of the proposed methodology for the selected benchmarks. The selected set of solutions (less than 10% of the search space) proved to be very close to the optimal sequence of transformations, also showing an application performance speedup of up to 2.8× (1.5× on average) with respect to -O2 and -O3 for the cBench suite. Additionally, the proposed method demonstrated a 3× speedup in terms of search time with respect to an iterative compilation approach, given the same quality of the solutions.
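The "distribution to be sampled" idea can be sketched with a toy two-edge network; the probabilities, flags, and the single program feature below are invented for illustration and are not the learned COBAYN model:

```python
import random

# A tiny hand-built network: P(unroll | program has hot loops) and
# P(vectorize | unroll). Promising flag configurations are *sampled*
# from the distribution rather than fixed in advance.
P_UNROLL = {True: 0.9, False: 0.2}   # keyed by "program has hot loops"
P_VEC    = {1: 0.8, 0: 0.3}          # keyed by the sampled unroll flag

def sample_config(hot_loops, rng):
    """Draw one candidate compiler configuration from the network."""
    unroll = int(rng.random() < P_UNROLL[hot_loops])
    vec    = int(rng.random() < P_VEC[unroll])
    return {"unroll": unroll, "vectorize": vec}

rng = random.Random(42)
draws = [sample_config(hot_loops=True, rng=rng) for _ in range(1000)]
unroll_rate = sum(d["unroll"] for d in draws) / len(draws)
print(round(unroll_rate, 1))  # empirically close to P(unroll | hot) = 0.9
```

Because the output is a distribution, repeated sampling yields a small, diverse pool of candidate configurations to evaluate, which is what keeps the explored fraction of the search space small.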
2013 IFIP/IEEE 21st International Conference on Very Large Scale Integration (VLSI-SoC), 2013
Very Long Instruction Word (VLIW) application-specific processors represent an attractive solution for embedded computing, offering significant computational power with reduced hardware complexity. However, they impose higher compiler complexity, since instructions are executed in parallel based on the static compiler schedule. Therefore, finding a promising set of compiler transformations and defining their effects have a significant impact on overall system performance. The proposed methodology provides the designer with an integrated framework to automatically (i) generate optimized application-specific VLIW architectural configurations and (ii) analyze compiler-level transformations, enabling application-specific compiler tuning over customized VLIW system architectures. We based the aforementioned analysis on a Design of Experiments (DoE) procedure that captures in a statistical manner the higher-order effects among different sets of activated compiler transformations. Applying the proposed methodology to real-case embedded application scenarios, we show that (i) only a limited set of compiler transformations affects performance with a high confidence level (over 95%) and (ii) using them we can achieve gains of 16–23% in comparison to the default optimization levels.
Neurocomputing
Modern Convolutional Neural Networks (CNNs) are complex, encompassing millions of parameters. Their deployment exerts computational, storage, and energy demands, particularly on embedded platforms. Existing approaches to prune or sparsify CNNs require retraining to maintain inference accuracy, and such retraining is not feasible in some contexts. In this paper, we explore the sparsification of CNNs by proposing three model-independent methods. Our methods are applied on-the-fly and require no retraining. We show that state-of-the-art models' weights can be reduced by up to 73% (a compression factor of 3.7×) without incurring more than a 5% loss in Top-5 accuracy. Additional fine-tuning gains only 8% in sparsity, which indicates that our fast on-the-fly methods are effective.
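A minimal sketch of retraining-free magnitude pruning, which is the generic idea behind such methods rather than the paper's three specific techniques (the layer weights below are made up):

```python
# Zero out the smallest-magnitude weights until a target sparsity is
# reached, with no further training. Ties at the threshold may prune
# slightly more than the requested fraction.

def prune_by_magnitude(weights, sparsity):
    """Zero the fraction `sparsity` of weights with the smallest |w|."""
    k = int(len(weights) * sparsity)
    threshold = sorted(abs(w) for w in weights)[k - 1] if k else float("-inf")
    return [0.0 if abs(w) <= threshold else w for w in weights]

layer = [0.01, -0.5, 0.03, 0.9, -0.02, 0.4, -0.05, 0.7]
pruned = prune_by_magnitude(layer, sparsity=0.5)

kept = [w for w in pruned if w != 0.0]
print(pruned)
print(len(kept) / len(layer))  # fraction of weights surviving
```

Applied on-the-fly, this needs only one pass over each layer's weights, which is why no retraining step is involved; accuracy preservation then depends on choosing the sparsity level per model.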
SpringerBriefs in Applied Sciences and Technology
ACM Transactions on Architecture and Code Optimization
Recent compilers offer a vast number of multilayered optimizations targeting different code segments of an application. Choosing among these optimizations can significantly impact the performance of the code being optimized. The selection of the right set of compiler optimizations for a particular code segment is a very hard problem, and finding the best ordering of these optimizations adds further complexity. Finding the best ordering represents a long-standing problem in compilation research, named the phase-ordering problem. The traditional approach of constructing compiler heuristics to solve this problem simply cannot cope with the enormous complexity of choosing the right ordering of optimizations for every code segment in an application. This article proposes an automatic optimization framework we call MiCOMP, which Mitigates the Compiler Phase-ordering problem. We perform phase ordering of the optimizations in LLVM's highest optimization level using optimization sub-sequence...
2022 International Conference on Compilers, Architecture, and Synthesis for Embedded Systems (CASES)
Cornell University - arXiv, Jul 18, 2022
For the past 25 years, we have witnessed an extensive application of Machine Learning to the Comp... more For the past 25 years, we have witnessed an extensive application of Machine Learning to the Compiler space; the selection and the phase-ordering problem. However, limited works have been upstreamed into the state-of-the-art compilers, i.e., LLVM, to seamlessly integrate the former into the optimization pipeline of a compiler to be readily deployed by the user. MLGO was among the first of such projects and it only strives to reduce the code size of a binary with an ML-based Inliner using Reinforcement Learning. This paper presents MLGOPerf; the first end-to-end framework capable of optimizing performance using LLVM's ML-Inliner. It employs a secondary ML model to generate rewards used for training a retargeted Reinforcement learning agent, previously used as the primary model by MLGO. It does so by predicting the post-inlining speedup of a function under analysis and it enables a fast training framework for the primary model which otherwise wouldn't be practical. The experimental results show MLGOPerf is able to gain up to 1.8% and 2.2% with respect to LLVM's optimization at O3 when trained for performance on SPEC CPU2006 and Cbench benchmarks, respectively. Furthermore, the proposed approach provides up to 26% increased opportunities to autotune code regions for our benchmarks which can be translated into an additional 3.7% speedup value. CCS CONCEPTS • Software and its engineering → Compilers; • Computing methodologies → Machine learning.
Automatic Tuning of Compilers Using Machine Learning, 2017
This chapter presents the first of two methods to tackle the phase-ordering problem of compiler o... more This chapter presents the first of two methods to tackle the phase-ordering problem of compiler optimizations. Here, we present an intermediate speedup prediction approach followed by a full-sequence prediction approach in the next chapter and we show pros and cons of each approach in detail. Today’s compilers offer a vast number of transformation options to choose among, and this choice can significantly impact on the performance of the code being optimized. Not only the selection of compiler options represents a hard problem to be solved, but also the ordering of the phases is adding further complexity, making it a long-standing problem in compilation research. This chapter presents an innovative approach to tackling the compiler phase-ordering problem by using predictive modeling. The proposed methodology enables (i) to efficiently explore compiler exploration space including optimization permutations and repetitions and (ii) to extract the application dynamic features to predict the next-best optimization to be applied to maximize the performance given the current status. Experimental results are done by assessing the proposed methodology with utilizing two different search heuristics on the compiler optimization space and it demonstrates the effectiveness of the methodology on the selected set of applications. Using the proposed methodology on average we observed up to 4% execution speedup with respect to LLVM standard baseline.
Diversity of today’s architectures have forced programmers and compiler researchers to port their... more Diversity of today’s architectures have forced programmers and compiler researchers to port their application across many different platforms. Compiler auto-tuning itself plays a major role within that process as it has certain levels of complexities that simply the standard pre-defined optimization levels fail to bring the best results due to their average performance output.To address the problem, different optimization techniques has been used for traversing, pruning the huge space, adaptability and portability. In this short paper, we propose our different approaches including the use of Design-Space-Exploration (DSE) techniques and machine learning to further address the problem. It has been demonstrated and assessed that utilizing these techniques have positive effects on the performance metrics of the given applications and can bring up to 60% performance improvement with respect to standard optimization levels (e.g. -O2
SpringerBriefs in Applied Sciences and Technology, Dec 23, 2017
Proceedings of the 15th ACM International Conference on Computing Frontiers, 2018
Designing and optimizing applications for energy-efficient High Performance Computing systems up ... more Designing and optimizing applications for energy-efficient High Performance Computing systems up to the Exascale era is an extremely challenging problem. This paper presents the toolbox developed in the ANTAREX European project for autotuning and adaptivity in energy efficient HPC systems. In particular, the modules of the ANTAREX toolbox are described as well as some preliminary results of the application to two target use cases. 1
Automatic Tuning of Compilers Using Machine Learning, 2017
Very Long Instruction Word (VLIW) processors represent an attractive solution for embedded comput... more Very Long Instruction Word (VLIW) processors represent an attractive solution for embedded computing, offering significant computational power with reduced hardware complexity. However, they impose higher compiler complexity since the instructions are executed in parallel based on the static compiler schedule. Therefore, finding a promising set of compiler transformations and defining their effects have a significant impact on the overall system performance. In this chapter, we provide a methodology with an integrated framework to automatically (i) generate optimized application-specific VLIW architectural configurations and (ii) analyze compiler level transformations, enabling application-specific compiler tuning over customized VLIW system architectures. We based the analysis on a Design of Experiments (DoEs) procedure that statistically captures the higher order effects among different sets of activated compiler transformations. Applying the proposed methodology onto real-case embedded application scenarios, we show that (i) only a limited set of compiler transformations exposes high confidence level (over 95%) in affecting the performance and (ii) using them we could be able to achieve gains between 16-23% in comparison to the default optimization levels. In the next chapters, we go deeper in building machine learning models to tackle the problem. 2.1 VLIW Embedded system design traditionally exploits the knowledge of the target domain, e.g., telecommunication, multimedia, home automation etc., to customize the HW/SW coefficients found onto the deployed computing devices. Although the functionalities of these devices differ, the computational structure and design are tightly connected with the platform they rely on. 
Platform-based designs have been proposed as a promising alternative for designing complex systems by redefining the problem of tuning specific design parameters of the platform template. The scientific and commercial urge to use VLIW technology seems to be raised again after three decades of existence [1]; VLIW processor templates are being used particularly in embedded processors, designed to perform special-purpose functions, usually for real-time or hardware acceleration. Being able to use VLIW power-saving
Diversity of today’s architectures have forced programmers and compiler researchers to port their... more Diversity of today’s architectures have forced programmers and compiler researchers to port their application across many different platforms. Compiler auto-tuning itself plays a major role within that process as it has certain levels of complexities that the standard optimization levels fail to bring the best results due to their average performance output. To address the problem, different optimization techniques has been used for traversing, pruning the huge space, adaptability and portability. In this paper, we evaluate our different autotuning approaches including the use of Design Space Exploration (DSE) techniques and machine learning to further tackle the both problems of selecting and the phase-ordering of the compiler optimizations. It has been experimentally demonstrated that using these techniques have positive effects on the performance metrics of the applications under analysis and can bring up to 60% performance improvement with respect to standard optimization levels...
Diversity of today’s architectures have forced programmers and compiler researchers to port their... more Diversity of today’s architectures have forced programmers and compiler researchers to port their application across many different platforms. Compiler auto-tuning itself plays a major role within that process as it has certain levels of complexities that simply the standard pre-defined optimization levels fail to bring the best results due to their average performance output. To address the problem, different optimization techniques has been used for traversing, pruning the huge space, adaptability and portability. In this short paper, we propose our different approaches including the use of Design Space Exploration (DSE) techniques and Machine Learning to further tackle the both problems of selection and the phase-ordering of the compiler optimizations. It has been demonstrated and assessed that utilizing these techniques have positive effects on the performance metrics of the given applications and can bring up to 60% performance improvement with respect to standard optimization ...
Since the mid-1990s, researchers have been trying to use machine-learning based approaches to sol... more Since the mid-1990s, researchers have been trying to use machine-learning based approaches to solve a number of di erent compiler optimization problems. These techniques primarily enhance the quality of the obtained results and, more importantly, make it feasible to tackle two main compiler optimization problems: optimization selection (choosing which optimizations to apply) and phase-ordering (choosing the order of applying optimizations). The compiler optimization space continues to grow due to the advancement of applications, increasing number of compiler optimizations, and new target architectures. Generic optimization passes in compilers cannot fully leverage newly introduced optimizations and, therefore, cannot keep up with the pace of increasing options. This survey summarizes and classi es the recent advances in using machine learning for the compiler optimization eld, particularly on the two major problems of (1) selecting the best optimizations, and (2) the phase-ordering ...
This chapter proposes our second approach to tackle the phase-ordering problem. We already showed our intermediate speedup prediction method in Chap. 4. Here, we present our full-sequence speedup prediction method called MiCOMP. MiCOMP, Mitigating the Compiler Phase-ordering problem using optimization sub-sequences and machine learning, is an autotuning framework that effectively mitigates the compiler phase-ordering problem using machine-learning techniques. The idea is to cluster the optimization passes of the LLVM O3 setting into different clusters and to predict the speedup of the complete sequence of all the optimization clusters. The predictive model uses (i) dynamic features, (ii) an encoded version of the compiler sequence, and (iii) an exploration heuristic to tackle the problem. Experimental results using the LLVM compiler framework and the Cbench suite show the effectiveness of the encoding technique for application-based reordering of passes while using a number of predictive ...
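The MiCOMP recipe above — encode a sequence of pass clusters, score it with a learned model, and let an exploration heuristic pick the best ordering — can be sketched in miniature. Everything here is illustrative: the cluster names, the positional one-hot encoding, and the random linear "predictor" are stand-ins for the paper's actual clustering and trained model, not a reconstruction of them.

```python
import itertools
import random

# Hypothetical clusters of LLVM passes (names are illustrative, not the
# actual MiCOMP clustering of the O3 pipeline).
CLUSTERS = {
    "A": ["inline", "sroa"],
    "B": ["licm", "loop-unroll"],
    "C": ["gvn", "instcombine"],
}

def encode(sequence, clusters=CLUSTERS):
    """Positional one-hot encoding of a cluster sequence."""
    names = sorted(clusters)
    vec = []
    for pos in range(len(sequence)):
        for name in names:
            vec.append(1.0 if sequence[pos] == name else 0.0)
    return vec

def predicted_speedup(features, encoded, weights):
    """Toy linear model standing in for the learned speedup predictor:
    it scores dynamic features plus the encoded sequence."""
    x = features + encoded
    return sum(w * v for w, v in zip(weights, x))

# Exhaustive search over the 3! = 6 orderings plays the role of the
# exploration heuristic at this toy scale.
random.seed(0)
features = [0.3, 0.7]                      # stand-in dynamic features
dim = len(features) + len(CLUSTERS) ** 2
weights = [random.uniform(-1, 1) for _ in range(dim)]
best = max(itertools.permutations(CLUSTERS),
           key=lambda s: predicted_speedup(features, encode(s), weights))
print(best)
```

At realistic scale the sequence space is far too large to enumerate, which is why the predictive model and a guided search heuristic replace the brute-force `max` above.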
After presenting our DSE approach for finding good compiler optimizations, we present our autotuning framework to tackle the problem of selecting the best compiler passes. It leverages machine learning and an application characterization to find the most promising optimization passes for a given application. This chapter proposes COBAYN: Compiler autotuning framework using Bayesian Networks, an autotuning methodology based on machine learning to speed up application performance and to reduce the cost of the compiler optimization phases. The proposed framework is based on the application characterization done dynamically by using micro-architecture-independent features and Bayesian networks. The chapter also presents an evaluation based on static analysis and hybrid feature-collection approaches. Besides, we compare our approach against several state-of-the-art machine-learning models. Experiments are carried out on an ARM embedded platform and the GCC compiler by considering two benchmark...
2018 Design, Automation & Test in Europe Conference & Exhibition (DATE), 2018
Configuring program parallelism and selecting optimal compiler options according to the underlying platform architecture is a difficult task. Typically, this task is either assigned to the programmer or done by a standard one-fits-all policy generated by the compiler or runtime system. A runtime selection of the best configuration requires the insertion of a lot of glue code for profiling and runtime selection. This represents a programming wall for application developers. This paper presents a structured approach, called SOCRATES, based on an aspect-oriented language (LARA) and a runtime autotuner (mARGOt) to mitigate this problem. LARA has been used to hide the glue-code insertion, thus separating the pure functional application description from extra-functional requirements. mARGOt has been used for the automatic selection of the best configuration according to the runtime evolution of the application.
Embedded systems can be considered specialized computing systems used for multi-purpose applications ranging from mobile phones to military and home-automation devices. Although the functionalities of these devices differ, their computational structure and design are tightly connected with the platform and programmability on which they rely. During the design phase, Design Space Exploration (DSE) plays a major role in pruning the large design space and supporting the designer during the analysis phase. To address the complex solution space, there is a necessity to extend conventional exploration approaches by applying data mining to extract knowledge from statistical results. The goal of this work is to develop a statistical exploration and analysis framework for compiler-architecture co-design in VLIW processors, tackling the aforementioned problem by proposing an automatic methodology based on a tool-chain including the MOST tool (M...
2014 IEEE 12th Symposium on Embedded Systems for Real-time Multimedia (ESTIMedia), 2014
The complexity and diversity of today's architectures require an additional effort from programmers in porting and tuning application code across different platforms. The problem is even more complex when considering that the compiler also requires some tuning, since standard optimization options have been customized for specific architectures or designed for the average case. This paper proposes a machine-learning approach for reducing the cost of the compiler autotuning phase and for speeding up application performance on embedded architectures. The proposed framework is based on an application characterization done dynamically with micro-architecture-independent features and on the usage of Bayesian Networks. The main characteristic of the Bayesian Network approach is that it does not describe the solution as a strict set of compiler transformations to be applied, but as a complex probability distribution function to be sampled. Experimental results, carried out on an ARM platform and the GCC transformation space, proved the effectiveness of the proposed methodology for the selected benchmarks. The selected set of solutions (less than 10% of the search space) demonstrated to be very close to the optimal sequence of transformations, showing an application performance speedup of up to 2.8 (1.5 on average) with respect to -O2 and -O3 for the cBench suite. Additionally, the proposed method demonstrated a 3x speedup in terms of search time with respect to an iterative compilation approach, given the same quality of the solutions.
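The key idea in the abstract — describing good compiler configurations as a probability distribution to sample rather than a fixed flag set — can be illustrated with a two-flag toy. The flag names and conditional probabilities below are purely illustrative assumptions, not values learned by COBAYN; a real Bayesian network would condition on the application's dynamic features as well.

```python
import random

# Toy joint distribution over two GCC flags: unrolling is assumed more
# likely to help when vectorization is enabled. Probabilities are made up
# for illustration, not learned parameters.
P_VEC = 0.6                                 # P(tree-vectorize = on)
P_UNROLL_GIVEN = {True: 0.8, False: 0.3}    # P(unroll | vectorize)

def sample_config(rng):
    """Draw one candidate flag configuration from the distribution."""
    vec = rng.random() < P_VEC
    unroll = rng.random() < P_UNROLL_GIVEN[vec]
    flags = []
    if vec:
        flags.append("-ftree-vectorize")
    if unroll:
        flags.append("-funroll-loops")
    return flags

rng = random.Random(42)
# Sampling yields a population of candidate configurations to compile and
# measure, instead of committing to a single transformation set up front.
candidates = [sample_config(rng) for _ in range(5)]
for c in candidates:
    print(c or ["(baseline)"])
```

Because correlated flags are sampled jointly, the candidates concentrate in the promising region of the space — which is how a small sample (under 10% of the space in the abstract) can land close to the optimum.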
2013 IFIP/IEEE 21st International Conference on Very Large Scale Integration (VLSI-SoC), 2013
Very Long Instruction Word (VLIW) application-specific processors represent an attractive solution for embedded computing, offering significant computational power with reduced hardware complexity. However, they impose higher compiler complexity, since the instructions are executed in parallel based on the static compiler schedule. Therefore, finding a promising set of compiler transformations and defining their effects have a significant impact on the overall system performance. The proposed methodology provides the designer with an integrated framework to automatically (i) generate optimized application-specific VLIW architectural configurations and (ii) analyze compiler-level transformations, enabling application-specific compiler tuning over customized VLIW system architectures. We base this analysis on a Design of Experiments (DoE) procedure that captures, in a statistical manner, the higher-order effects among different sets of activated compiler transformations. Applying the proposed methodology to real-case embedded application scenarios, we show that (i) only a limited set of compiler transformations exposes a high confidence level (over 95%) in affecting performance and (ii) using them we can achieve gains between 16% and 23% in comparison to the default optimization levels.
Neurocomputing
Modern Convolutional Neural Networks (CNNs) are complex, encompassing millions of parameters. Their deployment exerts computational, storage, and energy demands, particularly on embedded platforms. Existing approaches to prune or sparsify CNNs require retraining to maintain inference accuracy. Such retraining is not feasible in some contexts. In this paper, we explore the sparsification of CNNs by proposing three model-independent methods. Our methods are applied on-the-fly and require no retraining. We show that the state-of-the-art models' weights can be reduced by up to 73% (a compression factor of 3.7x) without incurring more than a 5% loss in Top-5 accuracy. Additional fine-tuning gains only 8% in sparsity, which indicates that our fast on-the-fly methods are effective.
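The abstract does not spell out its three sparsification methods, but the simplest on-the-fly scheme in this family is magnitude thresholding: zero the smallest-magnitude weights until a target sparsity is reached, with no retraining. The sketch below shows that scheme on a synthetic weight vector; it is one plausible instance of the approach, not the paper's specific methods.

```python
import random

def prune_by_magnitude(weights, target_sparsity):
    """Zero out the smallest-magnitude weights so that `target_sparsity`
    of them become zero -- applied on the fly, no retraining."""
    k = int(len(weights) * target_sparsity)
    threshold = sorted(abs(w) for w in weights)[k - 1] if k else 0.0
    return [0.0 if abs(w) <= threshold else w for w in weights]

random.seed(1)
layer = [random.gauss(0, 0.1) for _ in range(1000)]  # stand-in conv layer
pruned = prune_by_magnitude(layer, 0.73)             # the 73% figure above
sparsity = sum(1 for w in pruned if w == 0.0) / len(pruned)
print(f"sparsity: {sparsity:.2f}")
```

Whether a given layer tolerates this much pruning without an accuracy drop is exactly what the paper measures empirically; the code only performs the weight reduction itself.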
SpringerBriefs in Applied Sciences and Technology
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
ACM Transactions on Architecture and Code Optimization
Recent compilers offer a vast number of multilayered optimizations targeting different code segments of an application. Choosing among these optimizations can significantly impact the performance of the code being optimized. The selection of the right set of compiler optimizations for a particular code segment is a very hard problem, but finding the best ordering of these optimizations adds further complexity. Finding the best ordering represents a long-standing problem in compilation research, named the phase-ordering problem. The traditional approach of constructing compiler heuristics to solve this problem simply cannot cope with the enormous complexity of choosing the right ordering of optimizations for every code segment in an application. This article proposes an automatic optimization framework we call MiCOMP, which Mitigates the Compiler Phase-ordering problem. We perform phase ordering of the optimizations in LLVM's highest optimization level using optimization sub-sequence...