A. Lokhmotov - Academia.edu
Papers by A. Lokhmotov
Advances in GPU Research and Practice, 2017
Graphics processing unit (GPU)-accelerated computing is being adopted increasingly in a number of areas, ranging from high-end scientific computing to mobile and embedded computing. While GPU programs routinely provide high computational throughput in a number of areas, they also prove to be notoriously difficult to write and optimize correctly, largely because of the subtleties of GPU concurrency. This chapter discusses several issues that make GPU programming difficult and examines recent progress on rigorous methods for formal analysis of GPU software. Our key observation is that given the fast-paced advances in GPU programming, the use of rigorous specification and verification methods must be an integral part of the culture of programming and training, and not an afterthought.
2015 International Conference on Parallel Architecture and Compilation (PACT), 2015
Programming accelerators such as GPUs with low-level APIs and languages such as OpenCL and CUDA is difficult, error-prone, and not performance-portable. Automatic parallelization and domain-specific languages (DSLs) have been proposed to hide complexity and regain performance portability. We present PENCIL, a rigorously-defined subset of GNU C99, enriched with additional language constructs, that enables compilers to exploit parallelism and produce highly optimized code when targeting accelerators. PENCIL aims to serve both as a portable implementation language for libraries, and as a target language for DSL compilers. We implemented a PENCIL-to-OpenCL backend using a state-of-the-art polyhedral compiler. The polyhedral compiler, extended to handle data-dependent control flow and non-affine array accesses, generates optimized OpenCL code. To demonstrate the potential and performance portability of PENCIL and the PENCIL-to-OpenCL compiler, we consider a number of image processing kernels, a set of benchmarks from the Rodinia and SHOC suites, and DSL embedding scenarios for linear algebra (BLAS) and signal processing radar applications (SpearDE), and present experimental results for four GPU platforms: AMD Radeon HD 5670 and R9 285, NVIDIA GTX 470, and ARM Mali-T604.
Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units, 2011
Lecture Notes in Computer Science, 2009
We describe compiler and run-time optimisations for effective auto-parallelisation of C++ programs on the Cell BE architecture. Auto-parallelisation is made easier by annotating sieve scopes, which abstract the "read in, compute in parallel, write out" processing paradigm. We show that the semantics of sieve scopes enables data movement optimisations, such as re-organising global memory reads to minimise DMA transfers and streaming reads from uniformly accessed arrays. We also describe run-time optimisations for committing side-effects to main memory. We provide experimental results showing the benefits of our optimisations, and compare the Sieve-Cell system with IBM's OpenMP implementation for Cell.
Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures, 2007
Lecture Notes in Computer Science
We discuss two alternative strategies for vectorisation of a Finite Impulse Response (FIR) filter and show that optimising the innermost loop does not always produce the best results despite this being a common belief.
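The abstract does not say which two strategies were compared, so the following is only an illustrative sketch of how an FIR filter admits two loop orders, each putting a different loop innermost. The function names are hypothetical, and the filter is written in correlation form (`h[k] * x[i + k]`) over the fully overlapping samples only.

```python
def fir_taps_innermost(x, h):
    """Loop order A: for each output sample, reduce over the taps."""
    n_out = len(x) - len(h) + 1
    y = []
    for i in range(n_out):
        acc = 0
        for k in range(len(h)):      # innermost loop is a reduction over taps
            acc += h[k] * x[i + k]
        y.append(acc)
    return y


def fir_outputs_innermost(x, h):
    """Loop order B: for each tap, update every output sample."""
    n_out = len(x) - len(h) + 1
    y = [0] * n_out
    for k in range(len(h)):
        for i in range(n_out):       # innermost loop updates independent outputs
            y[i] += h[k] * x[i + k]
    return y


x = [1, 2, 3, 4, 5, 6]
h = [1, 0, -1]
# Both loop orders compute the same filter output.
assert fir_taps_innermost(x, h) == fir_outputs_innermost(x, h)
```

In order A the innermost loop is a reduction, while in order B its iterations are independent of one another; which of the two maps better onto vector hardware depends on the target, which is the kind of trade-off the abstract alludes to.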
Performance Computing, 2008
Lecture Notes in Computer Science, 2012
We present a framework for representing image processing kernels based on decoupled access/execute metadata, which allow the programmer to specify both execution constraints and memory access pattern of a kernel. The framework performs source-to-source translation of kernels expressed in high-level framework-specific C++ classes into low-level CUDA or OpenCL code with effective device-dependent optimizations such as global memory padding for memory coalescing and optimal memory bandwidth utilization. We evaluate the framework on several image filters, comparing generated code against highly-optimized CPU and GPU versions in the popular OpenCV library.
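Global memory padding, mentioned above, is commonly done by rounding each image row up to an alignment boundary so that every row starts at an aligned address. The sketch below shows only that arithmetic; it is not the framework's code, the function names are hypothetical, and the 128-byte alignment and image dimensions are assumed values chosen for illustration.

```python
def padded_pitch(row_bytes, alignment):
    """Round a row length in bytes up to the next multiple of `alignment`."""
    return ((row_bytes + alignment - 1) // alignment) * alignment


def byte_offset(row, col, pitch, bytes_per_pixel):
    """Byte offset of pixel (row, col) when consecutive rows are `pitch` bytes apart."""
    return row * pitch + col * bytes_per_pixel


# Assumed example: a 1000-pixel-wide image, 3 bytes per pixel, 128-byte alignment.
pitch = padded_pitch(1000 * 3, 128)

# With padding, the first pixel of every row sits on an aligned offset.
assert byte_offset(1, 0, pitch, 3) % 128 == 0
```

The cost is a few unused bytes at the end of each row; the benefit is that accesses to the start of a row no longer straddle an alignment boundary.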
Scaling to a large number of processing elements (PEs) mandates that at least some of the PEs operate on their local memories. Orchestrating data movement between distributed memories, however, is tedious and error-prone, because the programmer needs to ...
Lecture Notes in Computer Science, 2010
We demonstrate that the performance of commodity parallel systems significantly depends on low-level details, such as storage layout and iteration space mapping, which motivates the need for tools and techniques that separate a high-level algorithm description from low-level mapping and tuning. We propose to build a tool based on the concept of decoupled Access/Execute metadata, which allow the programmer to specify both execution constraints and memory access pattern of a computation kernel.
Lecture Notes in Computer Science, 2009
On multi-core architectures with software-managed memories, effectively orchestrating data movement is essential to performance, but is tedious and error-prone. In this paper we show that when the programmer can explicitly specify both the memory access pattern and the ...
Lecture Notes in Computer Science, 2008
Revisiting SIMD Programming. Anton Lokhmotov (1), Benedict R. Gaster (2), Alan Mycroft (1), Neil Hickey (2), and David Stuttard (2). (1) Computer Laboratory, University of Cambridge, 15 JJ Thomson Avenue, Cambridge, CB3 ...
Scientific Programming, 2014
Empirical auto-tuning and machine learning techniques have shown high potential to improve execution time, power consumption, code size, reliability and other important metrics of various applications for more than two decades. However, they are still far from widespread production use due to the lack of native support for auto-tuning in an ever-changing and complex software and hardware stack, large and multi-dimensional optimization spaces, excessively long exploration times, and the lack of unified mechanisms for preserving and sharing optimization knowledge and research material. We present a possible collaborative approach to solving the above problems using the Collective Mind knowledge management system. In contrast with the previous cTuning framework, this modular infrastructure makes it possible to preserve and share through the Internet whole auto-tuning setups with all related artifacts and their software and hardware dependencies, rather than just performance data. It also makes it possible to gradually structure, systematize and describe all available research material, including tools, benchmarks, data sets, search strategies and machine learning models. Researchers can take advantage of shared components and data with extensible meta-description to quickly and collaboratively validate and improve existing auto-tuning and benchmarking techniques or prototype new ones. The community can now gradually learn and improve the complex behavior of all existing computer systems while exposing behavior anomalies or model mispredictions to an interdisciplinary community in a reproducible way for further analysis. We present several practical, collaborative and model-driven auto-tuning scenarios. We also decided to release all material at http://c-mind.org/repo to set an example for collaborative and reproducible research, as well as for our new publication model in computer engineering, where experimental results are continuously shared and validated by the community.