Entering the petaflop era: the architecture and performance of Roadrunner (original) (raw)

Multicore surprises: Lessons learned from optimizing sweep3d on the cellbe

Symposium on Parallel and Distributed Processing, 2007

The Cell Broadband Engine (BE) processor provides the potential to achieve an impressive level of performance for scientific applications. This level of performance can be reached by exploiting several dimensions of parallelism, such as thread-level parallelism using several Synergistic Processing Elements, data streaming parallelism, vector parallelism in the form of 128-bit SIMD operations, and pipeline parallelism by issuing multiple instructions in the same clock cycle. In our exploration to achieve the optimum level of performance for Sweep3D, we have enjoyed many pleasant surprises, such as a very high floating point performance, reaching 64% of the theoretical peak in double precision, and an overall performance speedup ranging from 4.5 times when compared with "heavy iron" processors, up to over 20 times with conventional processors.

Multicore Surprises: Lessons Learned from Optimizing Sweep3D on the Cell Broadband Engine

2007

Runtime and Architecture Support for Efficient Data Exchange in Multi-Accelerator Applications

IEEE Transactions on Parallel and Distributed Systems, 2000

Heterogeneous parallel computing applications often process large data sets that require multiple GPUs to jointly meet their needs for physical memory capacity and compute throughput. However, the lack of high-level abstractions in previous heterogeneous parallel programming models force programmers to resort to multiple code versions, complex data copy steps and synchronization schemes when exchanging data between multiple GPU devices, which results in high software development cost, poor maintainability, and even poor performance. This paper describes the HPE runtime system, and the associated architecture support, which enables a simple, efficient programming interface for exchanging data between multiple GPUs through either interconnects or cross-node network interfaces. The runtime and architecture support presented in this paper can also be used to support other types of accelerators. We show that the simplified programming interface reduces programming complexity. The research presented in this paper started in 2009. It has been implemented and tested extensively in several generations of HPE runtime systems as well as adopted into the NVIDIA GPU hardware and drivers for CUDA 4.0 and beyond since 2011. The availability of real hardware that support key HPE features gives rise to a rare opportunity for studying the effectiveness of the hardware support by running important benchmarks on real runtime and hardware. Experimental results show that in a exemplar heterogeneous system, peer DMA and double-buffering, pinned buffers, and software techniques can improve the inter-accelerator data communication bandwidth by 2×. They can also improve the execution speed by 1.6× for a 3D finite difference, 2.5× for 1D FFT, and 1.6× for merge sort, all measured on real hardware. The proposed architecture support enables the HPE runtime to transparently deploy these optimizations under simple portable user code, allowing system designers to freely employ devices of different capabilities. We further argue that simple interfaces such as HPE are needed for most applications to benefit from advanced hardware features in practice.

Petascale computing with accelerators

ACM SIGPLAN Notices, 2009

A trend is developing in high performance computing in which commodity processors are coupled to various types of computational accelerators. Such systems are commonly called hybrid systems. In this paper, we describe our experience developing an implementation of the Linpack benchmark for a petascale hybrid system, the LANL Roadrunner cluster built by IBM for Los Alamos National Laboratory. This system combines traditional x86-64 host processors with IBM PowerXCell™ 8i accelerator processors. The implementation of Linpack we developed was the first to achieve a performance result in excess of 1.0 PFLOPS, and made Roadrunner the #1 system on the Top500 list in June 2008. We describe the design and implementation of hybrid Linpack, including the special optimizations we developed for this hybrid architecture. We then present actual results for single node and multi-node executions. From this work, we conclude that it is possible to achieve high performance for certain applications on hybrid architectures when careful attention is given to efficient use of memory bandwidth, scheduling of data movement between the host and accelerator memories, and proper distribution of work between the host and accelerator processors.

High-Performance Computing with Desktop Workstations

The performance of modeling and simulation tools is inherently tied to the platform on which they are implemented. In most cases, this platform is a microprocessor, either in a desktop PC, PC cluster, or supercomputer. Microprocessors are used because of their familiarity to developers, not necessarily their performance on the problems of interest. We have developed the underlying techniques and technologies to produce supercomputer performance from a standard desktop workstation for a variety of applications. This is accomplished through the combined use of graphics processing units (GPUs), field-programmable gate arrays (FPGAs), Cell processors, and standard microprocessors. Each of these platforms has unique strengths and weaknesses but can be used in concert to rival the computational power of a high-performance computer (HPC). In this paper, we discuss the relative advantages and disadvantages of each platform and how they can be combined in order to achieve high performance on...

Using advanced compiler technology to exploit the performance of the Cell Broadband Engine™ architecture

IBM Systems Journal, 2000

The continuing importance of game applications and other numerically intensive workloads has generated an upsurge in novel computer architectures tailored for such functionality. Game applications feature highly parallel code for functions such as game physics, which have high computation and memory requirements, and scalar code for functions such as game artificial intelligence, for which fast response times and a full-featured programming environment are critical. The Cell Broadband Enginee architecture targets such applications, providing both flexibility and high performance by utilizing a 64-bit multithreaded PowerPCt processor element (PPE) with two levels of globally coherent cache and eight synergistic processor elements (SPEs), each consisting of a processor designed for streaming workloads, a local memory, and a globally coherent DMA (direct memory access) engine. Growth in processor complexity is driving a parallel need for sophisticated compiler technology. In this paper, we present a variety of compiler techniques designed to exploit the performance potential of the SPEs and to enable the multilevel heterogeneous parallelism found in the Cell Broadband Engine architecture. Our goal in developing this compiler has been to enhance programmability while continuing to provide high performance. We review the Cell Broadband Engine architecture and present the results of our compiler techniques, including SPE optimization, automatic code generation, single source parallelization, and partitioning. Ó

Report of the Purdue Workshop on Grand Challenges in Computer Architecture for the Support of High Performance Computing

Journal of Parallel and Distributed Computing, 1992

The “Purdue Workshop on Grand Challenges in Computer Architecture for the Support of High Performance Computing” was sponsored by the National Science Foundation to identify critical research topics in computer architecture as they relate to high performance computing. Following a wide-ranging discussion of the computational characteristics and requirements of the grand challenge applications, the workshop identified four major computer architecture grand challenges as crucial to advancing the state of the art of high performance computation in the coming decade. These are: (1) idealized parallel computer models; (2) usable peta-ops (1015 ops) performance; (3) computers in an era of HDTV, gigabyte networks, and visualization; and (4) infrastructure for prototyping architectures. This report overviews some of the demands of the grand challenge applications and presents the above four grand challenges for computer architecture.

Understanding Application Performance via Micro-benchmarks on Three Large Supercomputers: Intrepid, Ranger and Jaguar

2010

Emergence of new parallel architectures presents new challenges for application developers. Supercomputers vary in processor speed, network topology, interconnect communication characteristics and memory subsystems. This paper presents a performance comparison of three of the fastest machines in the world: IBM's Blue Gene/P installation at ANL (Intrepid), the SUN-Infiniband cluster at TACC (Ranger) and Cray's XT4 installation at ORNL (Jaguar). Comparisons are based on three applications selected by NSF for the Track 1 proposal to benchmark the Blue Waters system: NAMD, MILC and a turbulence code, DNS. We present a comprehensive overview of the architectural details of each of these machines and a comparison of their basic performance parameters. Application performance is presented for multiple problem sizes and the relative performance on the selected machines is explained through micro-benchmarking results. We hope that insights from this work will be useful to managers making buying decisions for supercomputers and application users trying to decide on a machine to run on. Based on the performance analysis techniques used in the paper, we also suggest a step-by-step procedure for estimating the suitability of a given architecture for a highly parallel application.

Cutting-edge computing: Using new commodity architectures

Proceedings of the IEEE, 2008

The recent trends in commodity processor architectures exploit multiple cores to achieve higher performance. Some examples include multicore processors that replicate identical serial CPU cores on a single chip, e.g., quad-core CPU chips available from Intel and AMD in 2007. The current trend seems to indicate that the number of cores is growing at a rate governed by the Moore's law. An ongoing and contrasting trend has been the development of heterogeneous processor architectures that combine fine-grain and coarse-grain parallelism using tens or hundreds of disparate processing cores. Examples of such processors include the Cell BE processor, which is used as a CPU in workstations, game consoles, and manycore accelerators (e.g., GPUs), which are designed with the goal achieving higher parallel-code performance for a class of applications.

Using advanced compiler technology to exploit the performance of the Cell Broadband EngineTM architecture

Ibm Systems Journal, 2006

The continuing importance of game applications and other numerically intensive workloads has generated an upsurge in novel computer architectures tailored for such functionality. Game applications feature highly parallel code for functions such as game physics, which have high computation and memory requirements, and scalar code for functions such as game artificial intelligence, for which fast response times and a full-featured programming environment are critical. The Cell Broadband Engine™ architecture targets such applications, providing both flexibility and high performance by utilizing a 64-bit multithreaded PowerPC® processor element (PPE) with two levels of globally coherent cache and eight synergistic processor elements (SPEs), each consisting of a processor designed for streaming workloads, a local memory, and a globally coherent DMA (direct memory access) engine. Growth in processor complexity is driving a parallel need for sophisticated compiler technology. In this paper, we present a variety of compiler techniques designed to exploit the performance potential of the SPEs and to enable the multilevel heterogeneous parallelism found in the Cell Broadband Engine architecture. Our goal in developing this compiler has been to enhance programmability while continuing to provide high performance. We review the Cell Broadband Engine architecture and present the results of our compiler techniques, including SPE optimization, automatic code generation, single source parallelization, and partitioning.

Entering the petaflop era: the architecture and performance of Roadrunner (original) (raw)

Related papers