High-level Abstractions for Performance, Portability and Continuity of Scientific Software on Future Computing Systems

Performance Analysis of a High-Level Abstractions-Based Hydrocode on Future Computing Systems

Lecture Notes in Computer Science, 2015

In this paper we present research on applying a domain-specific high-level abstractions (HLA) development strategy with the aim of "future-proofing" a key class of high performance computing (HPC) applications that simulate hydrodynamics computations at AWE plc. We build on an existing high-level abstraction framework, OPS, that is being developed for the solution of multi-block structured mesh-based applications at the University of Oxford. OPS uses an "active library" approach in which a single application code written using the OPS API can be transformed into different highly optimized parallel implementations, which can then be linked against the appropriate parallel library, enabling execution on different back-end hardware platforms. The target application in this work is the CloverLeaf mini-app from Sandia National Laboratories' Mantevo suite of codes, which consists of algorithms of interest from hydrodynamics workloads. Specifically, we present (1) the lessons learnt in re-engineering an industrially representative hydrodynamics application to utilize the OPS high-level framework and subsequent code generation to obtain a range of parallel implementations, and (2) the performance of the auto-generated OPS versions of CloverLeaf compared to that of the hand-coded original CloverLeaf implementations on a range of platforms. Benchmarked systems include Intel multi-core CPUs and NVIDIA GPUs, the Archer (Cray XC30) CPU cluster and the Titan (Cray XK7) GPU cluster with different parallelizations (OpenMP, OpenACC, CUDA, OpenCL and MPI). Our results show that developing parallel HPC applications using a high-level framework such as OPS is no more time-consuming or difficult than writing a one-off parallel program targeting only a single parallel implementation. However, the OPS strategy pays off with a highly maintainable single application source, through which multiple parallelizations can be realized, without compromising performance portability on a range of parallel systems.
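To make the "active library" idea concrete, the sketch below shows the general shape of an OPS application: a per-gridpoint elemental kernel plus an ops_par_loop call that declares the data accessed and the stencils used, from which the OPS translator generates MPI, OpenMP, CUDA and OpenCL variants. The calls follow the publicly documented OPS interface (github.com/OP-DSL/OPS) rather than code from the paper, and the field, stencil and kernel names are invented for illustration.

```cpp
// Minimal OPS-style sketch (not taken from the paper): a 5-point stencil update
// expressed once, then executed through ops_par_loop.  Names are illustrative;
// signatures follow the public OPS documentation.
#define OPS_2D
#include "ops_seq.h"   // developer (sequential) back-end; code generation swaps this out

// Elemental kernel: OPS_ACCn(i,j) addresses argument n at offset (i,j) from the
// current grid point (the macro-based kernel syntax of 2015-era OPS releases).
void diffuse_kernel(double *u_new, const double *u) {
  u_new[OPS_ACC0(0,0)] = 0.25 * (u[OPS_ACC1(1,0)] + u[OPS_ACC1(-1,0)]
                               + u[OPS_ACC1(0,1)] + u[OPS_ACC1(0,-1)]);
}

int main(int argc, char **argv) {
  ops_init(argc, argv, 1);

  ops_block grid = ops_decl_block(2, "grid");

  int size[] = {100, 100}, base[] = {0, 0}, d_m[] = {-1, -1}, d_p[] = {1, 1};
  ops_dat u     = ops_decl_dat(grid, 1, size, base, d_m, d_p, (double *)NULL, "double", "u");
  ops_dat u_new = ops_decl_dat(grid, 1, size, base, d_m, d_p, (double *)NULL, "double", "u_new");

  int pt0[]  = {0, 0};
  int pts5[] = {0, 0, 1, 0, -1, 0, 0, 1, 0, -1};
  ops_stencil S2D_00  = ops_decl_stencil(2, 1, pt0,  "00");
  ops_stencil S2D_5PT = ops_decl_stencil(2, 5, pts5, "5pt");

  int range[] = {0, 100, 0, 100};   // iteration range in x and y
  ops_par_loop(diffuse_kernel, "diffuse", grid, 2, range,
               ops_arg_dat(u_new, 1, S2D_00,  "double", OPS_WRITE),
               ops_arg_dat(u,     1, S2D_5PT, "double", OPS_READ));

  ops_exit();
  return 0;
}
```

Because the loop body describes only what happens at a single grid point, the same source can be specialized for each back-end without hand-editing, which is the maintainability argument the abstract makes.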

Development of large scale high performance applications with a parallelizing compiler

High-level environments such as High Performance Fortran (HPF), which support the development of parallel applications and the porting of legacy codes to parallel architectures, have not yet gained broad acceptance and diffusion. Common objections cite the difficulty of performance tuning, the restriction of the approach to regular, data-parallel computations, and the lack of robustness of parallelizing HPF compilers when handling large codes.

Using simple abstraction to guide the reinvention of computing for parallelism

The sudden shift from single-processor computer systems to many-processor parallel ones requires reinventing much of Computer Science (CS): how to actually build and program the new parallel systems. CS urgently requires convergence to a robust parallel general-purpose platform that provides good performance and is easy enough for at least all CS majors to program. Unfortunately, even lesser ease-of-programming objectives have eluded decades of parallel computing research. The idea of starting with an established, easy parallel programming model and building an architecture for it has been treated as radical by vendors. This article advocates a more radical idea: start with a minimalist stepping-stone, a simple abstraction that encapsulates the desired interface between programmers and system builders. An Immediate Concurrent Execution (ICE) abstraction proposal is followed by two specific contributions: (i) a general-purpose many-core Explicit Multi-Threaded (XMT) computer architecture. XMT was designed from the ground up to capitalize on the huge on-chip resources becoming available in order to support the formidable body of knowledge known as PRAM (parallel random-access machine, or model) algorithmics, and the latent, though not widespread, familiarity with it. (ii) A programmer's workflow that links ICE, PRAM algorithmics and XMT programming. The synchronous PRAM provides ease of algorithm design, and ease of reasoning about correctness and complexity. Multi-threaded programming relaxes this synchrony for implementation. Directly reasoning about the soundness and performance of multi-threaded code is generally known to be error-prone. To circumvent that, the workflow incorporates multiple levels of abstraction: the programmer must only establish that the multi-threaded program's behavior matches the PRAM-like algorithm it implements, a much simpler task. Current XMT hardware and software prototypes, and demonstrated ease of programming and strong speedups, suggest that we may be much better prepared for the challenges ahead than many realize.
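To give a flavor of the lock-step reasoning the ICE/PRAM workflow relies on, the sketch below expresses an inclusive prefix sum as a sequence of synchronous rounds in which every iteration of a round may execute concurrently. It uses OpenMP in C++ purely as an analogue; the paper's own XMTC notation and XMT-specific constructs are not reproduced here.

```cpp
// PRAM-style inclusive prefix sum (Hillis-Steele scan): O(log n) synchronous
// rounds, each round reading the previous round's values and writing new ones.
// OpenMP stands in for the "all iterations of a step run concurrently" view.
#include <cstdio>
#include <vector>

int main() {
  std::vector<double> a = {1, 2, 3, 4, 5, 6, 7, 8};
  const int n = static_cast<int>(a.size());

  for (int stride = 1; stride < n; stride *= 2) {  // one synchronous PRAM round per pass
    std::vector<double> next = a;                  // read old values, write fresh ones
    #pragma omp parallel for
    for (int i = stride; i < n; ++i)
      next[i] = a[i] + a[i - stride];
    a.swap(next);                                  // implicit barrier between rounds
  }

  for (double x : a) std::printf("%g ", x);        // prints 1 3 6 10 15 21 28 36
  std::printf("\n");
  return 0;
}
```

The point of the workflow described above is that correctness is argued at the level of these synchronous rounds, while the relaxation into asynchronous multi-threaded code is handled separately.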

Guest Editorial: High-Level Parallel Programming and Applications

International Journal of Parallel Programming, 2016

The arrival of multi-/many-core systems has produced a game-changing event for the computing industry, which today, much more than a few years ago, relies on parallel processing as a means of improving application performance. Although a wide gap still exists between the maturity of parallel architectures and that of parallel programming, the only way forward to keep increasing performance and reducing power consumption is through parallelism. Any program must become a parallel program in order to exploit the capabilities of modern computers at any scale. In industrial practice, parallel programming is still dominated by low-level, machine-centric, unstructured approaches based on tools and specialized libraries that originate from high performance computing. Parallel programming at this level of abstraction is difficult, error-prone, time-consuming and, hence, economically infeasible in most application domains. Now, more than ever, it is crucial that the research community makes significant progress toward making the development of parallel code accessible to all programmers, rather than allowing parallel programming to remain the domain of specialized expert programmers. Achieving a proper trade-off among performance, programmability and portability is becoming a must. Parallel and distributed programming methodologies are currently dominated by low-level techniques such as send/receive message passing or data sharing coordinated by locks. These abstractions are not a good fit for reasoning about parallelism. In this evolution/revolution phase, a fundamental role is played by high-level and portable programming tools as well as application development frameworks. They may offer
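One concrete instance of the structured, high-level style the editorial argues for is the use of parallel algorithm skeletons instead of explicit threads, locks or message passing. The sketch below uses the C++17 parallel algorithms library purely as an illustration; the editorial does not prescribe any particular tool.

```cpp
// A parallel dot product written as a high-level algorithm call rather than
// hand-managed threads: the runtime decides how to partition and schedule the
// work, the programmer states only what to compute.
#include <cstdio>
#include <execution>
#include <numeric>
#include <vector>

int main() {
  std::vector<double> x(1'000'000, 1.5), y(1'000'000, 2.0);

  double dot = std::transform_reduce(std::execution::par,
                                     x.begin(), x.end(), y.begin(), 0.0);

  std::printf("dot = %f\n", dot);
  return 0;
}
```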

Exploring Traditional and Emerging Parallel Programming Models Using a Proxy Application

2013 IEEE 27th International Symposium on Parallel and Distributed Processing, 2013

Parallel computing architectures are becoming more complex, with increasing core counts and more heterogeneous designs. However, the most commonly used programming models, C/C++ with MPI and/or OpenMP, make it very difficult to write source code that is easily tuned for many targets. Newer language approaches attempt to ease this burden by providing optimization features such as computation-communication overlap, message-driven execution, automatic load balancing and implicit data layout optimizations. In this paper, we compare multiple implementations of LULESH, a proxy application for shock hydrodynamics, to determine the strengths and weaknesses of four traditional (OpenMP, MPI, MPI+OpenMP, CUDA) and four emerging (Chapel, Charm++, Liszt, Loci) programming models for parallel computation. In evaluating these programming models, we focus on programmer productivity, performance and the ease of applying optimizations.
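For context, the sketch below shows the general shape of the "traditional" MPI+OpenMP combination the paper uses as one of its baselines: explicit halo exchange between ranks followed by a thread-parallel update of the local subdomain. It is not LULESH code; the field names, sizes and the 3-point stencil are invented for illustration.

```cpp
// Minimal MPI+OpenMP sketch of the traditional style: the programmer manages
// neighbours, buffers and message tags explicitly, then parallelizes the local
// loop with OpenMP.  Boundary ranks use MPI_PROC_NULL so the exchange is a no-op.
#include <mpi.h>
#include <vector>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank, nranks;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nranks);

  const int n = 1000;                                   // local points plus 2 halo cells
  std::vector<double> u(n + 2, 1.0), u_new(n + 2, 0.0);

  int left  = (rank > 0)          ? rank - 1 : MPI_PROC_NULL;
  int right = (rank < nranks - 1) ? rank + 1 : MPI_PROC_NULL;

  // Explicit halo exchange with the two neighbouring ranks.
  MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, left, 0,
               &u[n + 1], 1, MPI_DOUBLE, right, 0,
               MPI_COMM_WORLD, MPI_STATUS_IGNORE);
  MPI_Sendrecv(&u[n], 1, MPI_DOUBLE, right, 1,
               &u[0], 1, MPI_DOUBLE, left, 1,
               MPI_COMM_WORLD, MPI_STATUS_IGNORE);

  // Thread-parallel local update (3-point stencil).
  #pragma omp parallel for
  for (int i = 1; i <= n; ++i)
    u_new[i] = 0.5 * (u[i - 1] + u[i + 1]);

  MPI_Finalize();
  return 0;
}
```

Every detail here (decomposition, neighbours, tags, threading) must be rewritten for each target, which is the tuning burden the paper's comparison of emerging models is meant to probe.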

Scalable Execution of Legacy Scientific Codes

Lecture Notes in Computer Science, 2006

This paper presents Weaves, a language-neutral framework for scalable execution of legacy parallel scientific codes. Weaves supports scalable threads of control and multiple namespaces with selective sharing of state within a single address space. We use two examples to illustrate different aspects of the framework and to stress the diversity of its application domains: collaborating partial differential equation (PDE) solvers, the more expressive example, exemplify developmental aspects, while the freely available Sweep3D is used for performance results. We outline the framework in the context of shared-memory systems, where its benefits are apparent. We also contrast Weaves with existing programming paradigms, present use cases, and outline its implementation. Preliminary performance tests show significantly better scalability than process-based implementations of Sweep3D.

High-Throughput Computing on High-Performance Platforms: A Case Study

2017 IEEE 13th International Conference on e-Science (e-Science), 2017

The computing systems used by LHC experiments have historically consisted of a federation of hundreds to thousands of distributed resources, ranging from small to mid-size. In spite of the impressive scale of the existing distributed computing solutions, the federation of small to mid-size resources will be insufficient to meet projected future demands. This paper is a case study of how the ATLAS experiment has embraced Titan, a DOE leadership facility, in conjunction with traditional distributed high-throughput computing to reach sustained production scales of approximately 52M core-hours a year. The three main contributions of this paper are: (i) a critical evaluation of the design and operational considerations needed to support the sustained, scalable and production usage of Titan; (ii) a preliminary characterization of a next-generation executor for PanDA to support new workloads and advanced execution modes; and (iii) early lessons on how current and future experimental and observational systems can be integrated with production supercomputers and other platforms in a general and extensible manner.

SCE Toolboxes for the development of high-level parallel applications

… Science–ICCS 2006, 2006

Users of Scientific Computing Environments (SCE) benefit from faster high-level software development at the cost of longer run times due to the interpreted environment. For time-consuming SCE applications, dividing the workload among several computers can be a cost-effective ...