W. Gaudin | University of Bristol

Papers by W. Gaudin

Performance Analysis of a High-Level Abstractions-Based Hydrocode on Future Computing Systems

Lecture Notes in Computer Science, 2015

In this paper we present research on applying a domain-specific high-level abstractions (HLA) development strategy with the aim of "future-proofing" a key class of high performance computing (HPC) applications that simulate hydrodynamics computations at AWE plc. We build on an existing high-level abstraction framework, OPS, which is being developed for the solution of multi-block structured mesh-based applications at the University of Oxford. OPS uses an "active library" approach in which a single application code written using the OPS API can be transformed into different highly optimized parallel implementations, each of which can then be linked against the appropriate parallel library, enabling execution on different back-end hardware platforms. The target application in this work is the CloverLeaf mini-app from Sandia National Laboratories' Mantevo suite of codes, which consists of algorithms of interest from hydrodynamics workloads. Specifically, we present (1) the lessons learnt in re-engineering an industrially representative hydrodynamics application to utilize the OPS high-level framework and the subsequent code generation to obtain a range of parallel implementations, and (2) the performance of the auto-generated OPS versions of CloverLeaf compared to that of the hand-coded original CloverLeaf implementations on a range of platforms. Benchmarked systems include Intel multi-core CPUs and NVIDIA GPUs, the Archer (Cray XC30) CPU cluster and the Titan (Cray XK7) GPU cluster, with different parallelizations (OpenMP, OpenACC, CUDA, OpenCL and MPI). Our results show that the development of parallel HPC applications using a high-level framework such as OPS is no more time-consuming or difficult than writing a one-off parallel program targeting only a single parallel implementation. However, the OPS strategy pays off with a highly maintainable single application source through which multiple parallelizations can be realized, without compromising performance portability on a range of parallel systems.
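
The "active library" idea is central to the abstract above: the application is written once against a loop-level API, and the framework regenerates that single source for each back end. The fragment below is a minimal, self-contained analogue of that pattern in plain C, not the real OPS API; the driver name par_loop, the kernel ideal_gas and the grid sizes are illustrative assumptions, and where OPS would generate OpenMP, CUDA, OpenCL or MPI variants of the driver, this sketch simply runs a serial double loop.

    #include <stdio.h>

    #define NX 8
    #define NY 8

    /* Signature of a per-point user kernel: everything a generated back end
       needs to know about data access is visible at the call site. */
    typedef void (*point_kernel)(double *out, const double *a,
                                 const double *b, int idx);

    /* par_loop-style driver: applies the kernel over the whole grid. In an
       active library, this is the call that gets regenerated per platform. */
    static void par_loop(point_kernel k, double *out, const double *a,
                         const double *b, int x0, int x1, int y0, int y1) {
      for (int j = y0; j < y1; j++)
        for (int i = x0; i < x1; i++)
          k(out, a, b, j * NX + i);
    }

    /* User kernel: ideal-gas style pressure update at a single grid point. */
    static void ideal_gas(double *p, const double *rho, const double *e,
                          int idx) {
      p[idx] = (1.4 - 1.0) * rho[idx] * e[idx];
    }

    int main(void) {
      double rho[NX * NY], e[NX * NY], p[NX * NY];
      for (int i = 0; i < NX * NY; i++) { rho[i] = 1.0; e[i] = 2.5; }

      /* Single application source: one call site, many possible back ends. */
      par_loop(ideal_gas, p, rho, e, 0, NX, 0, NY);
      printf("p[0] = %f\n", p[0]);
      return 0;
    }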

Achieving Portability and Performance through OpenACC

2014 First Workshop on Accelerator Programming using Directives, 2014

OpenACC is a directive-based programming model designed to allow easy access to emerging advanced architecture systems for existing production codes based on Fortran, C and C++. It also provides an approach to coding for contemporary technologies without the need to learn complex vendor-specific languages or to understand the hardware at the deepest level. Portability and performance are the key features of this programming model, and both are essential to productivity in real scientific applications. OpenACC support is provided by a number of vendors and is defined by an open standard. However, the standard is relatively new and the implementations are relatively immature. This paper experimentally evaluates the currently available compilers by assessing two approaches to the OpenACC programming model: the "parallel" and "kernels" constructs. The implementation of both of these constructs is compared, for each vendor, showing performance differences of up to 84%. Additionally, we observe performance differences of up to 13% between the best vendor implementations. OpenACC features which appear to cause performance issues in certain compilers are identified and linked to differing default vector length clauses between vendors. These studies are carried out over a range of hardware including GPU, APU, Xeon and Xeon Phi based architectures. Finally, OpenACC performance and productivity are compared against the alternative native programming approaches on each targeted platform, including CUDA, OpenCL, OpenMP 4.0 and Intel Offload, in addition to MPI and OpenMP.
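
The contrast between the two constructs evaluated in the paper can be made concrete with a short example. The sketch below (array names, the loop body and the vector length of 128 are illustrative assumptions, not taken from the paper) shows the same update loop written both ways: with "parallel" the programmer asserts the loop is parallel and can pin the vector length explicitly, while with "kernels" the compiler analyses the region and applies its own defaults, which is where vendor-to-vendor differences such as the default vector length show up.

    /* "parallel" construct: explicit parallelism, vector length pinned by hand. */
    void update_parallel(int n, double dt, const double *a,
                         const double *b, double *c) {
      #pragma acc parallel loop vector_length(128) copyin(a[0:n], b[0:n]) copyout(c[0:n])
      for (int i = 0; i < n; i++)
        c[i] = a[i] + dt * b[i];
    }

    /* "kernels" construct: the compiler decides how (and whether) to
       parallelise, and picks its own schedule and vector length. */
    void update_kernels(int n, double dt, const double *a,
                        const double *b, double *c) {
      #pragma acc kernels copyin(a[0:n], b[0:n]) copyout(c[0:n])
      for (int i = 0; i < n; i++)
        c[i] = a[i] + dt * b[i];
    }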

Parallel Block Structured Adaptive Mesh Refinement on Graphics Processing Units

Accelerating Hydrocodes with OpenACC, OpenCL and CUDA

2012 SC Companion: High Performance Computing, Networking Storage and Analysis, 2012

Hardware accelerators such as GPGPUs are becoming increasingly common in HPC platforms and their use is widely recognised as one of the most promising approaches to reaching exascale levels of performance. Large HPC centres, such as AWE, have made huge investments in maintaining their existing scientific software codebases, the vast majority of which were not designed to effectively utilise accelerator devices. Consequently, HPC centres will have to decide how to develop their existing applications to take best advantage of future HPC system architectures. Given limited development and financial resources, it is unlikely that all potential approaches will be evaluated for each application. We are interested in how this decision making can be improved, and this work seeks to directly evaluate three candidate technologies (OpenACC, OpenCL and CUDA) in terms of performance, programmer productivity and portability, using a recently developed Lagrangian-Eulerian explicit hydrodynamics mini-application. We find that OpenACC is an extremely viable programming model for accelerator devices, improving programmer productivity and achieving better performance than OpenCL and CUDA.
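
The productivity argument in this abstract rests on how little code the directive-based route requires relative to CUDA or OpenCL, which both need explicit device memory management, separately written kernels and launch configuration. A hedged illustration, with array names and the loop body assumed rather than taken from the mini-application, is a 2D cell update offloaded with a single data region plus one combined directive:

    /* One directive pair offloads the whole nested loop; the CUDA and OpenCL
       equivalents would additionally require device allocation, host-device
       copies, a separate kernel definition and an explicit launch configuration. */
    void advance_field(int nx, int ny, double dt,
                       const double *rate, double *field) {
      #pragma acc data copyin(rate[0:nx*ny]) copy(field[0:nx*ny])
      {
        #pragma acc parallel loop collapse(2)
        for (int j = 0; j < ny; j++)
          for (int i = 0; i < nx; i++)
            field[j * nx + i] += dt * rate[j * nx + i];
      }
    }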

Experiences at scale with PGAS versions of a Hydrodynamics application

Proceedings of the 8th International Conference on Partitioned Global Address Space Programming Models - PGAS '14, 2014

Benchmarking and modelling of POWER7, Westmere, BG/P, and GPUs

ACM SIGMETRICS Performance Evaluation Review, 2011

This paper introduces an industry-strength, multi-purpose benchmark: Shamrock. Developed at the Atomic Weapons Establishment (AWE), Shamrock is a two-dimensional (2D) structured hydrocode; one of its aims is to assess the impact of a change in hardware and, in conjunction with a larger HPC benchmark suite, to provide guidance in the procurement of future systems.

Optimising Hydrodynamics applications for the Cray XC30 with the application tool suite

High-level Abstractions for Performance, Portability and Continuity of Scientific Software on Future Computing Systems

In this report we present research on applying a domain-specific high-level abstractions development strategy with the aim of "future-proofing" a key class of high performance computing (HPC) applications that simulate hydrodynamics computations at AWE plc. We build on an existing high-level abstraction framework, OPS, which is being developed for the solution of multi-block structured mesh-based applications at the University of Oxford. The target application is CloverLeaf, an unclassified benchmark application that consists of algorithms of interest from the hydrodynamics workload at AWE plc.

Optimisation of Patch Distribution Strategies for AMR Applications

As core counts increase in the world's most powerful supercomputers, applications are becoming limited not only by computational power but also by data availability. In the race to exascale, efficient and effective communication policies are key to achieving optimal application performance. Applications using adaptive mesh refinement (AMR) trade off communication against computational load balancing in order to focus computation on specific areas of interest. This class of application is particularly susceptible to the communication performance of the underlying architecture and is inherently difficult to scale efficiently. In this paper we present a study of the effect of patch distribution strategies on the scalability of an AMR code. We demonstrate the significance of patch placement for communication overheads and, by balancing the computation and communication costs of patches, we develop a scheme to optimise the performance of a specific, industry-strength benchmark application.
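
As a concrete illustration of the kind of trade-off the abstract describes, the sketch below (not the paper's actual scheme; the patch sizes, the communication weight and the greedy heuristic are all assumptions made for the example) weighs each patch by a compute cost plus a halo-exchange cost and places the heaviest remaining patch on the least-loaded rank:

    #include <stdio.h>

    #define NPATCH 6
    #define NRANK  3

    typedef struct { int cells; int halo_cells; } Patch;

    int main(void) {
      /* Illustrative patch sizes: interior cells and boundary (halo) cells. */
      Patch p[NPATCH] = {{4096, 256}, {1024, 128}, {2048, 192},
                         {512, 96}, {8192, 384}, {256, 64}};
      double comm_weight = 4.0;   /* assumed cost of a halo cell vs a compute cell */
      double load[NRANK] = {0};
      int owner[NPATCH];

      /* Order patches by descending weighted cost (simple selection sort). */
      int order[NPATCH];
      for (int i = 0; i < NPATCH; i++) order[i] = i;
      for (int i = 0; i < NPATCH; i++)
        for (int j = i + 1; j < NPATCH; j++) {
          double ci = p[order[i]].cells + comm_weight * p[order[i]].halo_cells;
          double cj = p[order[j]].cells + comm_weight * p[order[j]].halo_cells;
          if (cj > ci) { int t = order[i]; order[i] = order[j]; order[j] = t; }
        }

      /* Greedy placement: heaviest remaining patch -> least-loaded rank. */
      for (int k = 0; k < NPATCH; k++) {
        int i = order[k], best = 0;
        for (int r = 1; r < NRANK; r++) if (load[r] < load[best]) best = r;
        owner[i] = best;
        load[best] += p[i].cells + comm_weight * p[i].halo_cells;
      }

      for (int r = 0; r < NRANK; r++) printf("rank %d load %.0f\n", r, load[r]);
      for (int i = 0; i < NPATCH; i++) printf("patch %d -> rank %d\n", i, owner[i]);
      return 0;
    }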
