Runtime and Architecture Support for Efficient Data Exchange in Multi-Accelerator Applications

Unified Development for Mixed Multi-GPU and Multi-coprocessor Environments Using a Lightweight Runtime Environment

2014 IEEE 28th International Parallel and Distributed Processing Symposium, 2014

Many of the heterogeneous resources available to modern computers are designed for different workloads. To use GPU resources efficiently, a workload must have a greater degree of parallelism than a workload designed for multicore CPUs; conceptually, the Intel Xeon Phi coprocessors can handle workloads somewhere in between the two. This multitude of applicable workloads will likely lead to mixing multicore CPUs, GPUs, and Intel coprocessors in multi-user environments that must offer adequate computing facilities for a wide range of workloads. In this work, we use a lightweight runtime environment to manage the resource-specific workload and to control the dataflow and parallel execution in two-way hybrid systems. The lightweight runtime environment uses task superscalar concepts to enable the developer to write serial code while providing parallel execution. In addition, our task abstractions enable unified algorithmic development across all the heterogeneous resources. We provide performance results for dense linear algebra applications, demonstrating the effectiveness of our approach and full utilization of a wide variety of accelerator hardware.
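
For readers unfamiliar with the task-superscalar style the authors build on, the sketch below illustrates the general idea using standard OpenMP task dependences rather than the paper's runtime: the algorithm is written as a serial loop nest, and parallelism is extracted from the declared data dependences. The factor/update kernels and the one-block-per-tile layout are hypothetical placeholders, not the paper's interface.

```c
#include <omp.h>

/* Hypothetical tile kernels; only their data dependences matter here. */
void factor(double *tile);
void update(const double *factored, double *tile);

/* Serial-looking sweep over tiles; the OpenMP runtime executes independent
   tasks in parallel, in the spirit of task-superscalar execution. */
void tiled_sweep(double *tiles[], int ntiles)
{
    #pragma omp parallel
    #pragma omp single
    for (int k = 0; k < ntiles; ++k) {
        #pragma omp task depend(inout: tiles[k][0])
        factor(tiles[k]);

        for (int j = k + 1; j < ntiles; ++j) {
            #pragma omp task depend(in: tiles[k][0]) depend(inout: tiles[j][0])
            update(tiles[k], tiles[j]);
        }
    }
}
```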

Performance and Portability with OpenCL for Throughput-Oriented HPC Workloads across Accelerators, Coprocessors, and Multicore Processors

2014

Ever since accelerators and coprocessors became mainstream hardware for throughput-oriented HPC workloads, various programming techniques have been proposed to increase productivity in terms of both performance and ease of use. We evaluate these aspects of OpenCL on a number of hardware platforms for an important subset of dense linear algebra operations that are relevant to a wide range of scientific applications. Our findings indicate that OpenCL portability has improved since our previous publication and that many new and surprising usage scenarios are possible that rival those available after decades of software development on CPUs. The combined performance-portability metric, even though not promised by the OpenCL standard, reflects the need for tuning performance-critical operations during the porting process, and we show how a large portion of the available efficiency is lost if the tuning is not done correctly.
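
As an illustration of the tuning the abstract refers to, here is a hedged host-side sketch in plain OpenCL: it assumes a command queue created with CL_QUEUE_PROFILING_ENABLE and a 1-D kernel whose arguments are already set, and it simply times a few candidate work-group sizes and keeps the fastest. The candidate list is illustrative, not taken from the paper.

```c
#include <CL/cl.h>

/* Time one launch of a 1-D kernel using event profiling (seconds). */
static double time_launch(cl_command_queue q, cl_kernel k,
                          size_t global, size_t local)
{
    cl_event ev;
    cl_ulong t0, t1;
    clEnqueueNDRangeKernel(q, k, 1, NULL, &global, &local, 0, NULL, &ev);
    clWaitForEvents(1, &ev);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START, sizeof t0, &t0, NULL);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END, sizeof t1, &t1, NULL);
    clReleaseEvent(ev);
    return (t1 - t0) * 1e-9;
}

/* Pick the fastest work-group size from a small illustrative candidate set. */
size_t pick_local_size(cl_command_queue q, cl_kernel k, size_t global)
{
    const size_t candidates[] = { 32, 64, 128, 256 };
    size_t best = 64;
    double best_t = 1e30;
    for (int i = 0; i < 4; ++i) {
        if (global % candidates[i] != 0)   /* keep global divisible by local */
            continue;
        double t = time_launch(q, k, global, candidates[i]);
        if (t < best_t) { best_t = t; best = candidates[i]; }
    }
    return best;
}
```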

MPI-ACC: An Integrated and Extensible Approach to Data Movement in Accelerator-based Systems

2012 IEEE 14th International Conference on High Performance Computing and Communications & 2012 IEEE 9th International Conference on Embedded Software and Systems, 2012

Data movement in high-performance computing systems accelerated by graphics processing units (GPUs) remains a challenging problem. Data communication in popular parallel programming models, such as the Message Passing Interface (MPI), is currently limited to the data stored in the CPU memory space. Auxiliary memory systems, such as GPU memory, are not integrated into such data movement frameworks, thus providing applications with no direct mechanism to perform end-to-end data movement. We introduce MPI-ACC, an integrated and extensible framework that allows end-to-end data movement in accelerator-based systems. MPI-ACC provides productivity and performance benefits by integrating support for auxiliary memory spaces into MPI. MPI-ACC's runtime system enables several key optimizations, including pipelining of data transfers and balancing of communication based on accelerator and node architecture. We demonstrate the extensible design of MPI-ACC by using the popular CUDA and OpenCL accelerator programming interfaces. We examine the impact of MPI-ACC on communication performance and evaluate application-level benefits on a large-scale epidemiology simulation.
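
MPI-ACC's own interface is not reproduced here; the sketch below shows the end-to-end movement it argues for using a CUDA-aware MPI implementation, in which a device pointer is passed straight to MPI with no explicit staging through host memory. The ranks, sizes, and the kernel that would fill the buffer are illustrative.

```c
#include <mpi.h>
#include <cuda_runtime.h>

/* Rank 0 sends a GPU-resident buffer directly to a peer rank; a CUDA-aware
   MPI library handles the device-side transfer path internally. */
void exchange_device_buffer(int rank, int peer, int n)
{
    float *dbuf = NULL;
    cudaMalloc((void **)&dbuf, n * sizeof(float));
    /* ... a kernel would fill dbuf on the sending rank ... */

    if (rank == 0)
        MPI_Send(dbuf, n, MPI_FLOAT, peer, 0, MPI_COMM_WORLD);
    else
        MPI_Recv(dbuf, n, MPI_FLOAT, peer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    cudaFree(dbuf);
}
```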

The FPGA High-Performance Computing Alliance Parallel Toolkit

Second NASA/ESA Conference on Adaptive Hardware and Systems (AHS 2007), 2007

We describe the FPGA HPC Alliance's Parallel Toolkit (PTK), an initial step towards the standardization of high-level configuration and APIs for high-performance reconfigurable computing (HPRC). We discuss the motivation and challenges of reaping the performance benefits of FPGAs for memory-bound HPC codes and describe the approach we have taken on the FHPCA supercomputer Maxwell.

High-performance computing using accelerators

Parallel Computing, 2007

A recent trend in high-performance computing is the development and use of heterogeneous architectures that combine fine-grain and coarse-grain parallelism using tens or hundreds of disparate processing cores. These processing cores are available as accelerators or many-core processors, which are designed with the goal of achieving higher parallel-code performance. This is in contrast with traditional multicore CPUs that effectively replicate serial CPU cores. The recent demand for these accelerators comes primarily from consumer applications, including computer gaming and multimedia. Examples of such accelerators include graphics processing units (GPUs), Cell Broadband Engines (Cell BEs), field-programmable gate arrays (FPGAs), and other data-parallel or streaming processors. Compared to conventional CPUs, the accelerators can offer an order-of-magnitude improvement in performance per dollar as well as per watt. Moreover, some recent industry announcements point towards the design of heterogeneous processors and computing environments that are scalable from a system with a single homogeneous processor to a high-end computing platform with tens, or even hundreds, of thousands of heterogeneous processors. This special issue on "High-Performance Computing Using Accelerators" includes many papers on such commodity, many-core processors, including GPUs, Cell BEs, and FPGAs.

GPGPUs: Current top-of-the-line GPUs have tens or hundreds of fragment processors and high memory bandwidth, i.e., 10x more than current CPUs. This processing power of GPUs has been successfully exploited for scientific, database, geometric, and imaging applications (GPGPU, short for general-purpose computation on GPUs). The significant increase in parallelism within a processor can also lead to other benefits, including higher power efficiency and better memory-latency tolerance. In many cases, an order-of-magnitude performance improvement was shown compared to top-of-the-line CPUs. For example, GPUTeraSort used the GPU interface to drive memory more efficiently and achieved a threefold improvement in records/second/CPU. Similarly, some of the fastest algorithms for many numerical computations, including FFT, dense matrix multiplication, linear solvers, and collision and proximity computations, use GPUs to achieve tremendous speed-ups.

Cell Broadband Engines: The Cell Broadband Engine is a joint venture between Sony, Toshiba, and IBM. It appears in consumer products such as Sony's PlayStation 3 computer entertainment system and Toshiba's Cell Reference Set, a development tool for Cell Broadband Engine applications. When viewed as a processor, the Cell can exploit the orthogonal dimensions of task and data parallelism on a single chip. The Cell processor consists of a symmetric multi-threaded (SMT) Power Processing Element (PPE) and eight Synergistic Processing Elements (SPEs) with pipelined SIMD capabilities. The processor achieves a theoretical peak performance of over 200 Gflops for single-precision floating-point calculations and has a peak memory bandwidth of over 25 GB/s. Actual speed-up factors achieved when automatically parallelizing sequential code kernels via the Cell's pipelined SIMD capabilities reach as high as 26-fold.

Field-Programmable Gate Arrays (FPGAs): FPGAs support the notion of reconfigurable computing and offer a high degree of on-chip parallelism that can be mapped directly from the dataflow characteristics of an application's parallel algorithm. Their recent emergence in the high-performance computing arena can be attributed to a hybrid approach that combines the logic blocks and interconnects of traditional FPGAs with

PQE HPF-a library for exploiting the capabilities of a PQE-1 heterogeneous parallel architecture

Proceedings 8th Euromicro Workshop on Parallel and Distributed Processing, 1999

Heterogeneous Computing is a special form of parallel and distributed computing where computations are performed using a single autonomous computer operating in both SIMD and MIMD modes, or using a number of connected autonomous computers. In Multimode System Heterogeneous Computing, tasks can be executed in both SIMD and MIMD modes simultaneously.

Supporting multiple accelerators in high-level programming models

Proceedings of the Sixth International Workshop on Programming Models and Applications for Multicores and Manycores - PMAM '15, 2015

Computational accelerators, such as manycore NVIDIA GPUs, Intel Xeon Phi and FPGAs, are becoming common in workstations, servers and supercomputers for scientific and engineering applications. Efficiently exploiting the massive parallelism these accelerators provide requires the design and implementation of productive programming models. In this paper, we explore support for multiple accelerators in high-level programming models. We design novel language extensions to OpenMP to support offloading data and computation regions to multiple accelerators (devices). These extensions allow for distributing data and computation among a list of devices via easy-to-use interfaces, including specifying the distribution of multi-dimensional arrays and declaring shared data regions among accelerators. Computation distribution is realized by partitioning a loop iteration space among accelerators. We implement mechanisms to marshal/unmarshal and to move data of non-contiguous array subregions and shared regions between accelerators without involving CPUs. We design reduction techniques that work across multiple accelerators. Combined compiler and runtime support is designed to manage multiple GPUs using asynchronous operations and threading mechanisms. We implement our solutions for NVIDIA GPUs and demonstrate their effectiveness for performance improvement through example OpenMP codes.
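
The paper's OpenMP extensions are not standardized, but the effect of its computation distribution can be approximated with stock OpenMP 4.5 target constructs; the minimal sketch below block-partitions a saxpy iteration space across all available devices. The block distribution and the saxpy kernel are illustrative choices, not the paper's interface, and at least one device is assumed to be present.

```c
#include <omp.h>

/* Split y = a*x + y across every available device,
   one contiguous block of iterations per device. */
void multi_device_saxpy(int n, float a, const float *x, float *y)
{
    int ndev = omp_get_num_devices();   /* assumes ndev >= 1 */
    int chunk = (n + ndev - 1) / ndev;

    #pragma omp parallel for num_threads(ndev)
    for (int d = 0; d < ndev; ++d) {
        int lo = d * chunk;
        int hi = (lo + chunk < n) ? lo + chunk : n;
        int len = hi - lo;
        if (len <= 0)
            continue;
        #pragma omp target teams distribute parallel for \
                device(d) map(to: x[lo:len]) map(tofrom: y[lo:len])
        for (int i = lo; i < hi; ++i)
            y[i] = a * x[i] + y[i];
    }
}
```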

Exploiting heterogeneous parallelism with the Heterogeneous Programming Library

Journal of Parallel and Distributed Computing, 2013

While recognition of the advantages of heterogeneous computing is steadily growing, the issues of programmability and portability hinder its exploitation. The introduction of the OpenCL standard was a major step forward in that it provides code portability, but its interface is even more complex than that of other approaches. In this paper, we present the Heterogeneous Programming Library (HPL), which permits the development of heterogeneous applications addressing both portability and programmability while not sacrificing high performance. This is achieved by means of an embedded language and data types provided by the library, with which generic computations to be run on heterogeneous devices can be expressed. A comparison with OpenCL in terms of programmability and performance shows that both approaches offer very similar performance, while highlighting the programmability advantages of HPL.
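
To make the programmability comparison concrete, the block below shows the kind of host-side boilerplate a library such as HPL hides behind its embedded language; it is plain OpenCL for a vector addition, not HPL's API, and error checking and resource releases are omitted for brevity.

```c
#include <CL/cl.h>

static const char *src =
    "__kernel void vadd(__global const float *a, __global const float *b,"
    "                   __global float *c) {"
    "    int i = get_global_id(0);"
    "    c[i] = a[i] + b[i];"
    "}";

/* Plain OpenCL host code for c = a + b on one default device. */
void vadd(const float *a, const float *b, float *c, size_t n)
{
    cl_platform_id plat;
    cl_device_id dev;
    cl_int err;
    size_t bytes = n * sizeof(float);

    clGetPlatformIDs(1, &plat, NULL);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_DEFAULT, 1, &dev, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, &err);
    cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, &err);

    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
    clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "vadd", &err);

    cl_mem da = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, bytes, (void *)a, &err);
    cl_mem db = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, bytes, (void *)b, &err);
    cl_mem dc = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, bytes, NULL, &err);

    clSetKernelArg(k, 0, sizeof(cl_mem), &da);
    clSetKernelArg(k, 1, sizeof(cl_mem), &db);
    clSetKernelArg(k, 2, sizeof(cl_mem), &dc);

    clEnqueueNDRangeKernel(q, k, 1, NULL, &n, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(q, dc, CL_TRUE, 0, bytes, c, 0, NULL, NULL);
    /* clRelease* calls omitted for brevity */
}
```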

Towards performance portability through runtime adaptation for high-performance computing applications

2010

The Abstract Data and Communication Library (ADCL) is an adaptive communication library that optimizes application-level collective communication operations at runtime. For a given communication pattern, the library provides a large number of implementations and incorporates runtime selection logic to choose the implementation leading to the highest performance. In this paper, we demonstrate for the first time how an application utilizing ADCL is deployed on a wide range of HPC architectures, including an IBM Blue Gene, an NEC SX-8, an IBM PowerPC cluster using an IBM Federation switch, an AMD Opteron cluster utilizing a 4x InfiniBand and a Gigabit Ethernet network, and an Intel EM64T cluster using a hierarchical Gigabit Ethernet network with reduced uplink bandwidth. We demonstrate how different implementations of the three-dimensional neighborhood communication lead to the minimal execution time of the application on different architectures. ADCL gives the user the advantage of maintaining only a single version of the source code while still achieving close to optimal performance for the application on all architectures.
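
ADCL's actual API is not shown here; the sketch below only captures the selection idea, under the assumption that several functionally equivalent neighborhood-exchange implementations are registered: each is timed for a few iterations at runtime, the ranks agree on a single winner, and the fastest implementation is used for the rest of the run.

```c
#include <mpi.h>

typedef void (*exchange_fn)(double *halo, int n, MPI_Comm comm);

/* Empirical runtime selection among equivalent implementations
   (illustrative pattern only, not ADCL's interface). */
exchange_fn select_fastest(exchange_fn impls[], int nimpls,
                           double *halo, int n, MPI_Comm comm)
{
    int best = 0;
    double best_t = 1e30;

    for (int i = 0; i < nimpls; ++i) {
        MPI_Barrier(comm);
        double t0 = MPI_Wtime();
        for (int rep = 0; rep < 5; ++rep)      /* short measurement phase */
            impls[i](halo, n, comm);
        double t = MPI_Wtime() - t0;

        /* All ranks must pick the same winner: use the slowest rank's time. */
        MPI_Allreduce(MPI_IN_PLACE, &t, 1, MPI_DOUBLE, MPI_MAX, comm);
        if (t < best_t) { best_t = t; best = i; }
    }
    return impls[best];
}
```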

Optimizing a High Energy Physics (HEP) Toolkit on Heterogeneous Architectures

2011

A desired trend within high energy physics is to increase particle accelerator luminosities, leading to the production of more collision data and higher probabilities of finding interesting physics results. A central data analysis technique used to determine whether results are interesting is the maximum likelihood method and the corresponding evaluation of the negative log-likelihood, which can be computationally expensive. As the amount of data grows, it is important to benefit from the parallelism available in modern computers. In essence, this means exploiting vector registers and all available cores on CPUs, as well as utilizing co-processors such as GPUs.
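
For concreteness, the quantity in question is -ln L(theta) = -sum_i ln f(x_i; theta) over the observed events x_i. The sketch below shows the kind of core- and vector-level parallel evaluation the abstract advocates, with a Gaussian pdf standing in for the actual physics model (an illustrative assumption, not the toolkit's code).

```c
#include <math.h>

/* Negative log-likelihood of n observations under a Gaussian model,
   evaluated with an OpenMP worksharing + SIMD reduction. */
double gaussian_nll(const double *x, long n, double mu, double sigma)
{
    const double norm = log(sigma) + 0.5 * log(2.0 * M_PI);
    double nll = 0.0;

    #pragma omp parallel for simd reduction(+:nll)
    for (long i = 0; i < n; ++i) {
        double z = (x[i] - mu) / sigma;
        nll += 0.5 * z * z + norm;   /* -ln of the Gaussian pdf at x[i] */
    }
    return nll;
}
```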