A. Lokhmotov - Academia.edu

Papers by A. Lokhmotov

Formal analysis techniques for reliable GPU programming

Advances in GPU Research and Practice, 2017

Graphics processing units (GPU)-accelerated computing is being adopted increasingly in a number of areas, ranging from high-end scientific computing to mobile and embedded computing. While GPU programs routinely provide high computational throughput in a number of areas, they also prove to be notoriously difficult to write and optimize correctly, largely because of the subtleties of GPU concurrency. This chapter discusses several issues that make GPU programming difficult and examines recent progress on rigorous methods for formal analysis of GPU software. Our key observation is that given the fast-paced advances in GPU programming, the use of rigorous specification and verification methods must be an integral part of the culture of programming and training, and not an afterthought.

PENCIL: A Platform-Neutral Compute Intermediate Language for Accelerator Programming

2015 International Conference on Parallel Architecture and Compilation (PACT), 2015

Programming accelerators such as GPUs with low-level APIs and languages such as OpenCL and CUDA is difficult, error-prone, and not performance-portable. Automatic parallelization and domain-specific languages (DSLs) have been proposed to hide complexity and regain performance portability. We present PENCIL, a rigorously defined subset of GNU C99, enriched with additional language constructs, that enables compilers to exploit parallelism and produce highly optimized code when targeting accelerators. PENCIL aims to serve both as a portable implementation language for libraries and as a target language for DSL compilers. We implemented a PENCIL-to-OpenCL backend using a state-of-the-art polyhedral compiler. The polyhedral compiler, extended to handle data-dependent control flow and non-affine array accesses, generates optimized OpenCL code. To demonstrate the potential and performance portability of PENCIL and the PENCIL-to-OpenCL compiler, we consider a number of image processing kernels, a set of benchmarks from the Rodinia and SHOC suites, and DSL embedding scenarios for linear algebra (BLAS) and signal processing radar applications (SpearDE), and present experimental results for four GPU platforms: AMD Radeon HD 5670 and R9 285, NVIDIA GTX 470, and ARM Mali-T604.

Automatically generating and tuning GPU code for sparse matrix-vector multiplication from a high-level representation

Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units, 2011

Compile-Time and Run-Time Issues in an Auto-Parallelisation System for the Cell BE Processor

Lecture Notes in Computer Science, 2009

We describe compiler and run-time optimisations for effective auto-parallelisation of C++ programs on the Cell BE architecture. Auto-parallelisation is made easier by annotating sieve scopes, which abstract the "read in, compute in parallel, write out" processing paradigm. We show that the semantics of sieve scopes enables data movement optimisations, such as re-organising global memory reads to minimise DMA transfers and streaming reads from uniformly accessed arrays. We also describe run-time optimisations for committing side-effects to main memory. We provide experimental results showing the benefits of our optimisations, and compare the Sieve-Cell system with IBM's OpenMP implementation for Cell.

Optimal bit-reversal using vector permutations

Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures, 2007

Auto-parallelisation of Sieve C++ Programs

Lecture Notes in Computer Science

Nested loop vectorisation

We discuss two alternative strategies for vectorisation of a Finite Impulse Response (FIR) filter and show that, contrary to common belief, optimising the innermost loop does not always produce the best results.

Strict and Relaxed Sieving for Multi-Core Programming

Optimising component composition using indexed dependence metadata

Generating CUDA Code at Runtime: Specializing Accelerator Code to Runtime Data

Performance Computing, 2008

Dynamic Data Structures for Taskgraph Scheduling Policies with Applications in OpenCL Accelerators

Generating GPU Code from a High-Level Representation for Image Processing Kernels

Lecture Notes in Computer Science, 2012

We present a framework for representing image processing kernels based on decoupled access/execute metadata, which allow the programmer to specify both the execution constraints and the memory access pattern of a kernel. The framework performs source-to-source translation of kernels expressed in high-level, framework-specific C++ classes into low-level CUDA or OpenCL code with effective device-dependent optimizations, such as global memory padding for memory coalescing and optimal memory bandwidth utilization. We evaluate the framework on several image filters, comparing the generated code against highly optimized CPU and GPU versions in the popular OpenCV library.

Automating generation of data movement code for parallel architectures with distributed memories

Scaling to a large number of processing elements (PEs) mandates that at least some of the PEs operate on their local memories. Orchestrating data movement between distributed memories, however, is tedious and error-prone, because the programmer needs to ...

Decoupled Access/Execute metaprogramming for GPU-accelerated systems

Towards Metaprogramming for Parallel Systems on a Chip

Lecture Notes in Computer Science, 2010

We demonstrate that the performance of commodity parallel systems significantly depends on low-level details, such as storage layout and iteration space mapping, which motivates the need for tools and techniques that separate a high-level algorithm description from low-level mapping and tuning. We propose to build a tool based on the concept of decoupled Access/Execute metadata, which allow the programmer to specify both the execution constraints and the memory access pattern of a computation kernel.

Deriving Efficient Data Movement from Decoupled Access/Execute Specifications

Lecture Notes in Computer Science, 2009

On multi-core architectures with software-managed memories, effectively orchestrating data movement is essential to performance, but is tedious and error-prone. In this paper we show that when the programmer can explicitly specify both the memory access pattern and the ...

Revisiting SIMD Programming

Lecture Notes in Computer Science, 2008

Anton Lokhmotov, Benedict R. Gaster, Alan Mycroft, Neil Hickey, and David Stuttard. Computer Laboratory, University of Cambridge, 15 JJ Thomson Avenue, Cambridge, CB3 ...

Collective Mind: Towards Practical and Collaborative Auto-Tuning

Scientific Programming, 2014

Empirical auto-tuning and machine learning techniques have shown high potential to improve execution time, power consumption, code size, reliability, and other important metrics of various applications for more than two decades. However, they are still far from widespread production use due to the lack of native support for auto-tuning in an ever-changing and complex software and hardware stack, large and multi-dimensional optimization spaces, excessively long exploration times, and the lack of unified mechanisms for preserving and sharing optimization knowledge and research material. We present a possible collaborative approach to solving these problems using the Collective Mind knowledge management system. In contrast with the previous cTuning framework, this modular infrastructure makes it possible to preserve and share over the Internet whole auto-tuning setups, with all related artifacts and their software and hardware dependencies, rather than just performance data. It also allows users to gradually structure, systematize, and describe all available research material, including tools, benchmarks, data sets, search strategies, and machine learning models. Researchers can take advantage of shared components and data with extensible meta-descriptions to quickly and collaboratively validate and improve existing auto-tuning and benchmarking techniques, or to prototype new ones. The community can now gradually learn and improve the complex behavior of existing computer systems, while exposing behavior anomalies and model mispredictions to an interdisciplinary community in a reproducible way for further analysis. We present several practical, collaborative, model-driven auto-tuning scenarios.
We have also released all material at http://c-mind.org/repo to set an example of collaborative and reproducible research, as well as of our new publication model in computer engineering, in which experimental results are continuously shared and validated by the community.
