David A Beckingsale | Lawrence Livermore National Lab
Papers by David A Beckingsale
2014 First Workshop on Accelerator Programming using Directives, 2014
OpenACC is a directive-based programming model designed to allow easy access to emerging advanced architecture systems for existing production codes based on Fortran, C and C++. It also provides an approach to coding contemporary technologies without the need to learn complex vendor-specific languages, or understand the hardware at the deepest level. Portability and performance are the key features of this programming model, which are essential to productivity in real scientific applications. OpenACC support is provided by a number of vendors and is defined by an open standard. However, the standard is relatively new and the implementations are relatively immature. This paper experimentally evaluates the currently available compilers by assessing two approaches to the OpenACC programming model: the "parallel" and "kernels" constructs. The implementations of both of these constructs are compared, for each vendor, showing performance differences of up to 84%. Additionally, we observe performance differences of up to 13% between the best vendor implementations. OpenACC features which appear to cause performance issues in certain compilers are identified and linked to differing default vector length clauses between vendors. These studies are carried out over a range of hardware including GPU, APU, Xeon and Xeon Phi based architectures. Finally, OpenACC performance and productivity are compared against the alternative native programming approaches on each targeted platform, including CUDA, OpenCL, OpenMP 4.0 and Intel Offload, in addition to MPI and OpenMP.
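The difference between the two constructs is easiest to see side by side. The sketch below is illustrative only (the loop, the array names and the vector_length value are assumptions, not taken from the paper's benchmarks): the parallel construct lets the programmer assert parallelism and pin the vector length, while kernels leaves the scheduling decision, including the default vector length, to the compiler.

```cpp
// Illustrative SAXPY written with both OpenACC constructs compared in the paper.
void saxpy_parallel(int n, float a, const float* x, float* y) {
  // "parallel loop": the programmer asserts the loop is parallel and can
  // pin the vector length explicitly.
  #pragma acc parallel loop vector_length(128) copyin(x[0:n]) copy(y[0:n])
  for (int i = 0; i < n; ++i)
    y[i] = a * x[i] + y[i];
}

void saxpy_kernels(int n, float a, const float* x, float* y) {
  // "kernels": the compiler analyses the region and chooses the schedule,
  // including a default vector length that can differ between vendors.
  #pragma acc kernels copyin(x[0:n]) copy(y[0:n])
  for (int i = 0; i < n; ++i)
    y[i] = a * x[i] + y[i];
}
```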
2013 International Conference on High Performance Computing & Simulation (HPCS), 2013
Lecture Notes in Computer Science, 2013
Performance modelling is an important tool utilised by the High Performance Computing industry to accurately predict the runtime of science applications on a variety of different architectures. Performance models aid in procurement decisions and help to highlight areas for possible code optimisations. This paper presents a performance model for a magnetohydrodynamics physics application, Lare. We demonstrate that this model is capable of accurately predicting the runtime of Lare across multiple platforms with an accuracy of 90% (for both strong and weak scaled problems). We then utilise this model to evaluate the performance of future optimisations. The model is generated using SST/macro, the machine level component of the Structural Simulation Toolkit (SST) from Sandia National Laboratories, and is validated on both a commodity cluster located at the University of Warwick and a large scale capability resource located at Lawrence Livermore National Laboratory.
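The paper's model is built inside the SST/macro simulator, but the quantity being predicted can be illustrated with a far simpler analytic sketch. Everything below (the compute-plus-Hockney-communication formula and all parameter values) is an assumption for illustration, not the paper's model:

```cpp
#include <cmath>
#include <cstdio>

// Illustrative analytic runtime model (NOT the paper's SST/macro model).
// T_step = W / F + messages * (alpha + bytes / beta), where W is per-rank
// work (FLOPs), F is achieved FLOP rate, alpha is network latency and
// beta is bandwidth. All values are made up.
double predicted_step_time(double flops_per_rank, double flop_rate,
                           int messages, double bytes_per_message,
                           double alpha, double beta) {
  double compute = flops_per_rank / flop_rate;
  double comms   = messages * (alpha + bytes_per_message / beta);
  return compute + comms;
}

int main() {
  double predicted = predicted_step_time(2.0e9, 5.0e9,   // 2 GFLOP at 5 GFLOP/s
                                         8, 1.0e6,       // 8 x 1 MB halo messages
                                         2.0e-6, 5.0e9); // 2 us latency, 5 GB/s
  double measured = 0.45;  // hypothetical measurement in seconds
  double accuracy = 100.0 * (1.0 - std::fabs(predicted - measured) / measured);
  std::printf("predicted %.3f s, measured %.3f s, accuracy %.1f%%\n",
              predicted, measured, accuracy);
  return 0;
}
```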
2012 SC Companion: High Performance Computing, Networking Storage and Analysis, 2012
Hardware accelerators such as GPGPUs are becoming increasingly common in HPC platforms and their use is widely recognised as being one of the most promising approaches for reaching exascale levels of performance. Large HPC centres, such as AWE, have made huge investments in maintaining their existing scientific software codebases, the vast majority of which were not designed to effectively utilise accelerator devices. Consequently, HPC centres will have to decide how to develop their existing applications to take best advantage of future HPC system architectures. Given limited development and financial resources, it is unlikely that all potential approaches will be evaluated for each application. We are interested in how this decision making can be improved, and this work seeks to directly evaluate three candidate technologies (OpenACC, OpenCL and CUDA) in terms of performance, programmer productivity, and portability using a recently developed Lagrangian-Eulerian explicit hydrodynamics mini-application. We find that OpenACC is an extremely viable programming model for accelerator devices, improving programmer productivity and achieving better performance than OpenCL and CUDA.
In the approach to exascale, scalable tools are becoming increasingly necessary to support parallel applications. Evaluating an application's call stack is a vital technique for a wide variety of profilers and debuggers, and can create a significant performance overhead. In this paper we present a heuristic technique to reduce the overhead of frequent call stack evaluations. We use this technique to estimate the similarity between successive call stacks, removing the need for full call stack traversal and eliminating a significant portion of the performance overhead. We demonstrate this technique applied to a parallel memory tracing toolkit, WMTools, and analyse the performance gains and accuracy.
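A minimal sketch of the heuristic idea, assuming a return-address-based walk; the cache layout, the depth compared and the function names are assumptions for illustration, not details of WMTools:

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Sketch of a call-stack similarity heuristic (not WMTools' actual code).
// Rather than walking the whole stack on every traced event, compare the
// first few return addresses against the previous capture; if they match,
// assume the deeper frames are unchanged and reuse the cached walk.
struct StackCache {
  std::vector<uintptr_t> frames;   // full walk from the last cache miss
};

const std::vector<uintptr_t>& capture(
    StackCache& cache,
    const uintptr_t* shallow, std::size_t shallow_depth,            // cheap partial walk
    const std::function<std::vector<uintptr_t>()>& full_unwind) {   // expensive full walk
  bool similar = cache.frames.size() >= shallow_depth;
  for (std::size_t i = 0; similar && i < shallow_depth; ++i)
    similar = (cache.frames[i] == shallow[i]);

  if (!similar)                 // heuristic miss: pay for the full traversal
    cache.frames = full_unwind();
  return cache.frames;          // heuristic hit: reuse the previous frames
}
```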
The Computer Journal, 2013
The importance of memory performance and capacity is a growing concern for high performance computing laboratories around the world. It has long been recognized that improvements in processor speed exceed the rate of improvement in dynamic random access memory speed and, as a result, memory access times can be the limiting factor in high performance scientific codes. The use of multi-core processors exacerbates this problem with the rapid growth in the number of cores not being matched by similar ...
The latteMPI project is a pure Java implementation of a subset of the Message Passing Interface (MPI) Standard. The MPI Standard defines a number of methods that facilitate writing programs using the message-passing model of computing, where processes share data using explicit calls to send and receive functions. Performance of parallel programs running on both latteMPI and a C implementation of the MPI Standard is investigated. The performance of latteMPI is slower than the equivalent C implementation, ...
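The message-passing model referred to here is the explicit send/receive exchange defined by the MPI Standard; a minimal example against the C bindings is sketched below (the payload, tag and ranks are arbitrary, and this is plain MPI rather than latteMPI itself):

```cpp
#include <mpi.h>
#include <cstdio>

// Minimal illustration of the explicit message-passing model defined by the
// MPI Standard: rank 0 sends a value, rank 1 receives it.
int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);

  int rank = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  const int tag = 0;
  if (rank == 0) {
    double payload = 3.14;
    MPI_Send(&payload, 1, MPI_DOUBLE, /*dest=*/1, tag, MPI_COMM_WORLD);
  } else if (rank == 1) {
    double payload = 0.0;
    MPI_Recv(&payload, 1, MPI_DOUBLE, /*source=*/0, tag, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);
    std::printf("rank 1 received %f\n", payload);
  }

  MPI_Finalize();
  return 0;
}
```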
As core counts increase in the world’s most powerful supercomputers, applications are becoming limited not only by computational power, but also by data availability. In the race to exascale, efficient and effective communication policies are key to achieving optimal application performance. Applications using adaptive mesh refinement (AMR) trade off communication for computational load balancing, to enable the focused computation of specific areas of interest. This class of application is particularly susceptible to the communication performance of the underlying architectures, and is inherently difficult to scale efficiently. In this paper we present a study of the effect of patch distribution strategies on the scalability of an AMR code. We demonstrate the significance of patch placement on communication overheads, and by balancing the computation and communication costs of patches, we develop a scheme to optimise the performance of a specific, industry-strength benchmark application.
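A generic greedy placement scheme, sketched below, illustrates the kind of compute-versus-communication trade-off such a strategy makes; it is not the scheme developed in the paper, and the cost model (a fixed communication penalty for placing a patch away from its locality-preferred rank) is an assumption.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Generic greedy patch placement sketch (illustrative, not the paper's scheme).
// Each patch has a compute cost and a "preferred" rank (e.g. the owner of its
// parent or nearest neighbour). Assigning it elsewhere adds an assumed fixed
// communication penalty to that rank's load.
struct Patch {
  double compute_cost;
  int preferred_rank;   // rank whose locality avoids the comm penalty
};

std::vector<int> place_patches(const std::vector<Patch>& patches,
                               int num_ranks, double comm_penalty) {
  std::vector<double> load(num_ranks, 0.0);
  std::vector<int> assignment(patches.size(), 0);

  // Place the most expensive patches first so the cheap ones fill gaps.
  std::vector<std::size_t> order(patches.size());
  for (std::size_t i = 0; i < order.size(); ++i) order[i] = i;
  std::sort(order.begin(), order.end(), [&](std::size_t a, std::size_t b) {
    return patches[a].compute_cost > patches[b].compute_cost;
  });

  for (std::size_t idx : order) {
    const Patch& p = patches[idx];
    int best_rank = 0;
    double best_load = -1.0;
    for (int r = 0; r < num_ranks; ++r) {
      // Effective cost of this patch on rank r: compute, plus a comm
      // penalty if r is not the locality-preferred rank.
      double cost = p.compute_cost +
                    (r == p.preferred_rank ? 0.0 : comm_penalty);
      double candidate = load[r] + cost;
      if (best_load < 0.0 || candidate < best_load) {
        best_load = candidate;
        best_rank = r;
      }
    }
    assignment[idx] = best_rank;
    load[best_rank] += p.compute_cost +
                       (best_rank == p.preferred_rank ? 0.0 : comm_penalty);
  }
  return assignment;
}
```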
inproceedings by David A Beckingsale
techreports by David A Beckingsale
articles by David A Beckingsale
In the march towards exascale, supercomputer architectures are undergoing a significant change. Limited by power consumption and heat dissipation, future supercomputers are likely to be built around a lower-power many-core model. This shift in supercomputer design will require sweeping code changes in order to take advantage of the highly-parallel architectures. Evolving or rewriting legacy applications to perform well on these machines is a significant challenge.