Harish Patil - Academia.edu
Papers by Harish Patil
2021 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)
We address the challenge faced in characterizing long-running workloads, namely how to reliably focus the detailed analysis on interesting execution regions. We present a set of tools that allows users to precisely capture any region of interest in program execution and create a stand-alone executable, called an ELFie, from it. An ELFie starts with the same program state captured at the beginning of the region of interest and then executes natively. With ELFies, there is no need to fast-forward to the region of interest and no uncertainty about reaching it. ELFies can be fed to dynamic program-analysis tools or simulators that work with regular program binaries. Our tool-chain is based on the PinPlay framework and requires no special hardware, operating system changes, recompilation, or re-linking of test programs. This paper describes the design of our ELFie generation tool-chain and the application of ELFies in performance analysis and simulation of regions of interest in popular long-running single- and multi-threaded benchmarks.
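To make the capture step concrete: a minimal, Linux-only sketch of dumping a process's writable memory regions, the raw material a region-of-interest checkpoint would embed, might look like the following. This is an illustration of the general idea, not the ELFie tool-chain itself; the file name, record layout, and the decision to keep only private writable mappings are assumptions.

```cpp
// capture_regions.cpp - illustrative only; not the ELFie tool-chain.
// Walks /proc/self/maps and dumps the contents of private writable
// mappings to a checkpoint file, the kind of raw memory image a
// region-of-interest checkpoint would need to embed.
#include <cstdint>
#include <cstdio>
#include <fstream>
#include <sstream>
#include <string>

int main() {
    std::ifstream maps("/proc/self/maps");
    std::ofstream ckpt("regions.ckpt", std::ios::binary);  // assumed file name/format
    std::string line;
    while (std::getline(maps, line)) {
        std::istringstream ls(line);
        std::string range, perms;
        ls >> range >> perms;
        // Keep only private, writable data (heap, stack, .data/.bss).
        if (perms.size() < 4 || perms[1] != 'w' || perms[3] != 'p') continue;
        unsigned long start = 0, end = 0;
        sscanf(range.c_str(), "%lx-%lx", &start, &end);
        uint64_t len = end - start;
        // Record: start address, length, then the raw bytes of the region.
        // (A sketch: the stream I/O itself perturbs the heap being dumped.)
        ckpt.write(reinterpret_cast<const char*>(&start), sizeof start);
        ckpt.write(reinterpret_cast<const char*>(&len), sizeof len);
        ckpt.write(reinterpret_cast<const char*>(start), (std::streamsize)len);
    }
    return 0;
}
```

A real region-of-interest executable would presumably also need register and thread state plus a restore stub, which this sketch does not attempt.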
2015 IEEE International Symposium on Workload Characterization, 2015
As computational applications become common for graphics processing units, new hardware designs must be developed to meet the unique needs of these workloads. Performance simulation is an important step in appraising how well a candidate design will serve these needs, but unfortunately, computational GPU programs are so large that simulating them in detail is prohibitively slow. This work addresses the need to understand very large computational GPU programs in three ways. First, it introduces a fast tracing tool that uses binary instrumentation for in-depth analyses of native executions on existing architectures. Second, it characterizes 25 commercial and benchmark OpenCL applications, which average 308 billion GPU instructions apiece and are by far the largest benchmarks that have been natively profiled at this level of detail. Third, it accelerates simulation of future hardware by pinpointing small subsets of OpenCL applications that can be simulated as representative surrogates in lieu of full-length programs. Our fast selection method requires no simulation itself and allows the user to navigate the accuracy/simulation-speed trade-off space, from extremely accurate with reasonable speedups (35X increase in simulation speed for 0.3% error) to reasonably accurate with extreme speedups (223X simulation speedup for 3.0% error).
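The surrogate-selection idea is easiest to see with a SimPoint-style sketch: treat each execution interval as a feature vector, cluster the vectors, and simulate only the interval nearest each cluster centroid, weighted by cluster size. The toy vectors, the choice of k, and plain k-means are assumptions for illustration; the paper's own selection method (which avoids simulation) is not reproduced here.

```cpp
// pick_reps.cpp - SimPoint-style illustration, not the paper's exact method.
// Given one feature vector per execution interval (e.g., basic-block or kernel
// execution counts), cluster the intervals with k-means and keep the interval
// closest to each centroid as a weighted representative.
#include <cstdio>
#include <limits>
#include <vector>

using Vec = std::vector<double>;

static double dist2(const Vec& a, const Vec& b) {
    double d = 0;
    for (size_t i = 0; i < a.size(); ++i) { double t = a[i] - b[i]; d += t * t; }
    return d;
}

int main() {
    // Toy input: 8 intervals, 3 features each (stand-in for real profiles).
    std::vector<Vec> intervals = {
        {9,1,0},{8,2,0},{1,9,1},{0,8,2},{1,1,9},{0,2,8},{9,0,1},{2,9,0}};
    const int k = 3, iters = 20;
    std::vector<Vec> centroids(intervals.begin(), intervals.begin() + k);
    std::vector<int> assign(intervals.size(), 0);

    for (int it = 0; it < iters; ++it) {
        for (size_t i = 0; i < intervals.size(); ++i) {        // assignment step
            double best = std::numeric_limits<double>::max();
            for (int c = 0; c < k; ++c) {
                double d = dist2(intervals[i], centroids[c]);
                if (d < best) { best = d; assign[i] = c; }
            }
        }
        for (int c = 0; c < k; ++c) {                          // update step
            Vec sum(intervals[0].size(), 0.0);
            int n = 0;
            for (size_t i = 0; i < intervals.size(); ++i)
                if (assign[i] == c) {
                    ++n;
                    for (size_t j = 0; j < sum.size(); ++j) sum[j] += intervals[i][j];
                }
            if (n) for (double& x : sum) x /= n;
            centroids[c] = sum;
        }
    }
    for (int c = 0; c < k; ++c) {                              // pick representatives
        int rep = -1, n = 0;
        double best = std::numeric_limits<double>::max();
        for (size_t i = 0; i < intervals.size(); ++i) {
            if (assign[i] != c) continue;
            ++n;
            double d = dist2(intervals[i], centroids[c]);
            if (d < best) { best = d; rep = (int)i; }
        }
        if (rep >= 0)
            printf("cluster %d: simulate interval %d, weight %.3f\n",
                   c, rep, (double)n / intervals.size());
    }
    return 0;
}
```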
International Symposium on Code Generation and Optimization, 2004. CGO 2004.
Ispike is a post-link optimizer developed for the Intel® Itanium Processor Family (IPF) processors. The IPF architecture poses both opportunities and challenges to post-link optimization. IPF offers a rich set of performance counters to collect detailed profile information at a low cost, which is essential to making post-link optimization practical. At the same time, the predication and bundling features of IPF make post-link code transformation more challenging than on other architectures. In Ispike, we have implemented optimizations like code layout, instruction prefetching, data layout, and data prefetching that exploit the IPF advantages, and strategies that cope with the IPF-specific challenges. Using SPEC CINT2000 as benchmarks, we show that Ispike improves performance by as much as 40% on the Itanium® 2 processor, with average improvements of 8.5% and 9.9% over executables generated by the Intel® Electron compiler and by the GCC compiler, respectively. We also demonstrate that statistical profiles collected via IPF performance counters and complete profiles collected via instrumentation produce equal performance benefit, but the profiling overhead is significantly lower for performance counters.
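As one concrete flavor of profile-guided code layout, a simplified chaining pass in the spirit of Pettis and Hansen can be sketched as below: merge blocks along the heaviest branch edges so that likely-taken control flow becomes fall-through. The toy edge profile and the merge rule are illustrative assumptions, not Ispike's actual layout algorithm.

```cpp
// layout_sketch.cpp - simplified, illustrative profile-guided layout
// (in the spirit of Pettis-Hansen chaining; not Ispike's implementation).
#include <algorithm>
#include <cstdio>
#include <map>
#include <vector>

struct Edge { int src, dst; long count; };

int main() {
    // Toy profile: (source block, destination block, taken count).
    std::vector<Edge> edges = {{0,1,900},{1,2,850},{0,3,100},{3,4,90},{2,4,800}};
    std::sort(edges.begin(), edges.end(),
              [](const Edge& a, const Edge& b) { return a.count > b.count; });

    std::map<int, std::vector<int>> chains;   // chain id -> blocks in layout order
    std::map<int, int> chain_of;              // block -> chain id
    int next_chain = 0;
    auto chain_for = [&](int b) {
        if (!chain_of.count(b)) { chain_of[b] = next_chain; chains[next_chain++] = {b}; }
        return chain_of[b];
    };

    for (const Edge& e : edges) {
        int cs = chain_for(e.src), cd = chain_for(e.dst);
        // Merge only if src ends its chain and dst starts its chain,
        // so the hot edge becomes a fall-through in the final layout.
        if (cs != cd && chains[cs].back() == e.src && chains[cd].front() == e.dst) {
            for (int b : chains[cd]) { chains[cs].push_back(b); chain_of[b] = cs; }
            chains.erase(cd);
        }
    }
    for (auto& [id, blocks] : chains) {
        printf("chain %d:", id);
        for (int b : blocks) printf(" B%d", b);
        printf("\n");
    }
    return 0;
}
```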
Proceedings of the 16th international conference on Supercomputing, 2002
Data prefetching is an effective approach to addressing the memory latency problem. While a few processors have implemented hardware-based data prefetching, the majority of modern processors support data-prefetch instructions and rely on compilers to automatically insert prefetches. However, most prefetching schemes in commercial compilers suffer from two limitations: (1) the source code must be available before prefetching can be applied, and (2) these prefetching schemes target only loops with statically known strided accesses. In this study, we broaden the scope of software-controlled prefetching by addressing the above two limitations. We use profiling to discover strided accesses that frequently occur during program execution but are not determinable by the compiler. We then use the strides discovered to insert prefetches into the executable directly, without the need for re-compilation. Performance evaluation was done on an Alpha 21264-based system with a 64KB data cache and an 8MB secondary cache. We find that even with such large caches, our technique offers speedups ranging from 3% to 56% in 11 out of the 26 SPEC2000 benchmarks. Our technique has been incorporated into Pixie and Spike, two products in Compaq's Tru64 Unix.
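The profiling step can be illustrated with a small sketch: for every load PC, histogram the deltas between consecutive effective addresses, and flag the load as a prefetch candidate when one stride dominates. The toy trace, the 80% dominance threshold, and the data layout are assumptions for illustration, not the Pixie/Spike implementation.

```cpp
// find_strides.cpp - illustrative stride detection over a load-address profile,
// the kind of analysis used to decide where to insert prefetches.
#include <cstdint>
#include <cstdio>
#include <map>
#include <vector>

struct Access { uint64_t pc; uint64_t addr; };

int main() {
    // Toy trace: one load PC walking an array with a 64-byte stride,
    // another load touching addresses with no stable stride.
    std::vector<Access> trace;
    for (int i = 0; i < 100; ++i) {
        trace.push_back({0x400100, 0x10000000ull + 64ull * i});
        trace.push_back({0x400200, 0x20000000ull + (uint64_t)(i * i) * 8});
    }

    std::map<uint64_t, uint64_t> last_addr;                   // pc -> previous address
    std::map<uint64_t, std::map<int64_t, int>> stride_hist;   // pc -> stride -> count
    std::map<uint64_t, int> samples;

    for (const Access& a : trace) {
        auto it = last_addr.find(a.pc);
        if (it != last_addr.end()) {
            stride_hist[a.pc][(int64_t)(a.addr - it->second)]++;
            samples[a.pc]++;
        }
        last_addr[a.pc] = a.addr;
    }

    // A load is a prefetch candidate if one stride dominates its history.
    for (const auto& [pc, hist] : stride_hist) {
        int64_t best_stride = 0; int best = 0;
        for (const auto& [stride, n] : hist)
            if (n > best) { best = n; best_stride = stride; }
        double frac = (double)best / samples.at(pc);
        if (frac > 0.8)
            printf("pc 0x%lx: dominant stride %ld bytes (%.0f%% of samples) -> prefetch\n",
                   (unsigned long)pc, (long)best_stride, 100 * frac);
        else
            printf("pc 0x%lx: no dominant stride -> skip\n", (unsigned long)pc);
    }
    return 0;
}
```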
2008 IEEE International Symposium on Workload Characterization, 2008
As multiprocessors become mainstream, techniques to address efficient simulation of multi-threaded workloads are needed. Multi-threaded simulation presents a new challenge: non-determinism across simulations for different architecture configurations. If the execution paths between two simulation runs of the same benchmark with the same input are too different, the simulation results cannot be used to compare the configurations. In this paper we focus on a simulation technique to efficiently collect simulation checkpoints for multi-threaded workloads and to compare simulation runs while addressing this non-determinism problem. We focus on user-level simulation of multi-threaded workloads for multiprocessor architectures. We present an approach, based on binary instrumentation, to collect checkpoints for simulation. Our checkpoints allow reproducible execution of the samples across different architecture configurations by controlling the sources of non-determinism during simulation. This control, however, introduces stalls that would not naturally occur in execution. We propose techniques that allow us to accurately compare performance across architecture configurations in the presence of these stalls.
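The kind of artificial stall the abstract refers to can be seen in a small sketch that forces threads to follow a recorded order of synchronization events: a thread whose event is not next in the schedule simply waits. The schedule contents and the use of a single condition variable are illustrative assumptions, not the paper's checkpointing mechanism.

```cpp
// replay_order.cpp - conceptual sketch of replaying a recorded ordering of
// shared-memory synchronization events. Each thread must wait until the
// global schedule says it is its turn, which introduces stalls that would
// not occur in a free-running execution.
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <thread>
#include <vector>

std::vector<int> schedule = {0, 0, 1, 0, 1, 1};  // recorded order of thread ids
std::mutex m;
std::condition_variable cv;
size_t next_event = 0;

void sync_point(int tid) {
    std::unique_lock<std::mutex> lk(m);
    // Stall until the recorded schedule reaches an event owned by this thread.
    cv.wait(lk, [&] { return next_event < schedule.size() && schedule[next_event] == tid; });
    printf("event %zu executed by thread %d\n", next_event, tid);
    ++next_event;
    cv.notify_all();
}

void worker(int tid, int events) {
    for (int i = 0; i < events; ++i) sync_point(tid);  // e.g., a lock acquire or shared access
}

int main() {
    std::thread t0(worker, 0, 3), t1(worker, 1, 3);
    t0.join(); t1.join();
    return 0;
}
```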
Proceedings of the 8th annual IEEE/ACM international symposium on Code generation and optimization, 2010
Analysis of parallel programs is hard mainly because their behavior changes from run to run. We present an execution capture and deterministic replay system that enables repeatable analysis of parallel programs. Our goal is to provide an easy-to-use framework for capturing, deterministically replaying, and analyzing execution of large programs with reasonable runtime and disk usage. Our system, called PinPlay, is ...
Proceedings of the joint international conference on Measurement and modeling of computer systems, 2006
Modern architecture research relies heavily on application-level detailed pipeline simulation. A time-consuming part of building a simulator is correctly emulating the operating system effects, which is required even if the goal is to simulate just the application code, in order to achieve functional correctness of the application's execution. Existing application-level simulators require manually hand-coding the emulation of each and every possible system effect (e.g., system call, interrupt, DMA transfer) that can impact the application's execution. Developing such an emulator for a given operating system is a tedious exercise, and it can also be costly to maintain it to support newer versions of that operating system. Furthermore, porting the emulator to a completely different operating system might involve building it altogether from scratch. In this paper, we describe a tool that can automatically log operating system effects to guide architecture simulation of application code. The benefits of our approach are: (a) we do not have to build or maintain any infrastructure for emulating the operating system effects, (b) we can support simulation of more complex applications on our application-level simulator, including those applications that use asynchronous interrupts, DMA transfers, etc., and (c) using the system-effect logs collected by our tool, we can deterministically re-execute the application to guide architecture simulation that has reproducible results.
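A conceptual sketch of such a log is shown below: each record carries the system call's return value and the bytes the kernel wrote into application memory, so that replay can skip the call and inject the logged effect instead. The record layout, function names, and the toy read() example are assumptions for illustration; the paper's actual log format is not reproduced here.

```cpp
// syseffect_log.cpp - conceptual record format for operating system effects.
// During native execution each system call's result and the memory it wrote
// are recorded; during simulation the call is skipped and the log is injected.
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <vector>

struct SysEffect {
    uint64_t syscall_no;            // e.g., read()
    int64_t  return_value;          // value to hand back to the application
    uint64_t dest_addr;             // simulated address the kernel wrote (0 if none)
    std::vector<uint8_t> bytes;     // data the kernel deposited there
};

std::vector<SysEffect> log_records;

// Record side: called right after a real system call returns.
void record_effect(uint64_t no, int64_t ret, uint64_t addr, const void* data, size_t len) {
    SysEffect e{no, ret, addr, {}};
    if (data && len) {
        e.bytes.resize(len);
        std::memcpy(e.bytes.data(), data, len);
    }
    log_records.push_back(std::move(e));
}

// Replay side: instead of emulating the call, copy the logged bytes into the
// simulated memory image and return the logged result.
int64_t replay_effect(size_t idx, uint8_t* sim_memory) {
    const SysEffect& e = log_records[idx];
    if (!e.bytes.empty())
        std::memcpy(sim_memory + e.dest_addr, e.bytes.data(), e.bytes.size());
    return e.return_value;
}

int main() {
    // Toy example: pretend a read() returned 5 bytes written at address 0x100.
    const char data[] = "hello";
    record_effect(0 /* read */, 5, 0x100, data, 5);

    std::vector<uint8_t> sim_memory(0x200, 0);
    long ret = (long)replay_effect(0, sim_memory.data());
    printf("replayed return=%ld bytes=%.5s\n", ret, (char*)&sim_memory[0x100]);
    return 0;
}
```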
37th International Symposium on Microarchitecture (MICRO-37'04)
Detailed modeling of the performance of commercial applications is difficult. The applications can take a very long time to run on real hardware, and it is impractical to simulate them to completion on performance models. Furthermore, these applications have complex execution environments that cannot easily be reproduced on a simulator, making porting the applications to simulators difficult. We attack these problems using the well-known SimPoint methodology to find representative portions of an application to simulate, and a dynamic instrumentation framework called Pin to avoid porting altogether. Our system uses dynamic instrumentation instead of simulation to find representative portions, called PinPoints, for simulation. We have developed a toolkit that automatically detects PinPoints, validates whether they are representative using hardware performance counters, and generates traces for large Itanium® programs. We compared SimPoint-based selection to random selection of simulation points. We found that for 95% of the SPEC2000 programs we tested, the PinPoints prediction was within 8% of the actual whole-program CPI, as opposed to 18% for random selection. We measure the end-to-end error, comparing real hardware to a performance model, and have a simple and efficient methodology to determine the step that introduced the error. Finally, we evaluate the system in the context of multiple configurations of real hardware, commercial applications, and industrial-strength performance models to understand the behavior of a complete and practical workload collection system. We have successfully used our system with many commercial Itanium® programs, some running for trillions of instructions, and have used the resulting traces for predicting performance of those applications on future Itanium processors.
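The validation arithmetic is straightforward and worth seeing once: the whole-program CPI is predicted as the weighted average of the simulation points' CPIs and compared against CPI measured with hardware counters. The weights and CPI values below are made-up illustrations, not results from the paper.

```cpp
// cpi_check.cpp - how a SimPoint/PinPoints-style CPI prediction is assembled
// and validated. Each selected region contributes its simulated CPI weighted
// by the fraction of whole-program execution its cluster represents.
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    struct Point { double weight; double simulated_cpi; };
    std::vector<Point> points = {{0.55, 0.92}, {0.30, 1.40}, {0.15, 2.10}};  // weights sum to 1

    double predicted_cpi = 0.0;
    for (const Point& p : points) predicted_cpi += p.weight * p.simulated_cpi;

    double measured_cpi = 1.10;  // whole-program CPI from hardware performance counters
    double error = std::fabs(predicted_cpi - measured_cpi) / measured_cpi;

    printf("predicted CPI %.3f, measured CPI %.3f, error %.1f%%\n",
           predicted_cpi, measured_cpi, 100 * error);
    return 0;
}
```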
2007 IEEE International Symposium on Performance Analysis of Systems & Software, 2007
Architectures are usually compared by running the same workload on each architecture and comparing performance. When a single compiled binary of a program is executed on many different architectures, techniques like SimPoint can be used to find a small set of samples that represent the majority of the program's execution. Architectures can be compared by simulating their behavior on the code samples selected by SimPoint, to quickly determine which architecture has the best performance. Architectural design space exploration becomes more difficult when different binaries must be used for the same program. These cases arise when evaluating architectures that include ISA extensions, and when evaluating compiler optimizations. This problem domain is the focus of our paper. When multiple binaries are used to evaluate a program, one approach is to create a separate set of simulation points for each binary. This approach works reasonably well for many applications, but breaks down when the simulation points chosen for the different binaries emphasize different parts of the program's execution. This problem can be avoided if simulation points are selected consistently across the different binaries, to ensure that the same parts of program execution are represented in all binaries. In this paper we present an approach that finds a single set of simulation points to be used across all binaries for a single program. This allows for simulation of the same parts of program execution despite changes in the binary due to ISA changes or compiler optimizations.
Sigplan Notices, 2005
Robust and powerful software instrumentation tools are essential for program analysis tasks such as profiling, performance evaluation, and bug detection. To meet this need, we have developed a new instrumentation system called Pin to instrument executables while they are running. For efficiency, Pin uses several techniques, including inlining, register re-allocation, liveness analysis, and instruction scheduling to optimize instrumentation. This fully automated ...
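A Pin tool in its classic form is a short C++ program that registers instrumentation and analysis callbacks; the canonical instruction-counting example looks roughly like this. It requires the Pin kit to build, and the exact build and run invocations vary by Pin version.

```cpp
// icount.cpp - the classic flavor of Pin tool: count every executed instruction.
// Builds against the Pin kit (pin.H); typically run as: pin -t icount.so -- ./app
#include "pin.H"
#include <iostream>

static UINT64 icount = 0;

// Analysis routine: executed once per dynamic instruction (Pin can inline it).
VOID DoCount() { icount++; }

// Instrumentation routine: called the first time each instruction is seen.
VOID Instruction(INS ins, VOID* v) {
    INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR)DoCount, IARG_END);
}

VOID Fini(INT32 code, VOID* v) {
    std::cerr << "Executed " << icount << " instructions" << std::endl;
}

int main(int argc, char* argv[]) {
    if (PIN_Init(argc, argv)) return 1;        // parse Pin's command line
    INS_AddInstrumentFunction(Instruction, 0); // register instrumentation callback
    PIN_AddFiniFunction(Fini, 0);              // report at program exit
    PIN_StartProgram();                        // never returns
    return 0;
}
```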
Proc. 4th Workshop on …, 2001
Spike is an executable optimizer that uses profile information to place application code for improved fetch efficiency and reduced cache footprint. This placement reduces the number of cache misses and the latencies they add to program execution time. This paper ...