Multicore surprises: Lessons learned from optimizing sweep3d on the cellbe (original) (raw)

Multicore Surprises: Lessons Learned from Optimizing Sweep3D on the Cell Broadband Engine

2007

The Cell Broadband Engine (BE) processor provides the potential to achieve an impressive level of performance for scientific applications. This level of performance can be reached by exploiting several dimensions of parallelism, such as thread-level parallelism using several Synergistic Processing Elements, data streaming parallelism, vector parallelism in the form of 128-bit SIMD operations, and pipeline parallelism by issuing multiple instructions in the same clock cycle. In our exploration to achieve the optimum level of performance for Sweep3D, we have enjoyed many pleasant surprises, such as a very high floating point performance, reaching 64% of the theoretical peak in double precision, and an overall performance speedup ranging from 4.5 times when compared with "heavy iron" processors, up to over 20 times with conventional processors.

Implementation of scientific computing applications on the Cell Broadband Engine

Scientific Programming, 2009

The Cell Broadband Engine architecture is a revolutionary processor architecture well suited for many scientific codes. This paper reports on an effort to implement several traditional high-performance scientific computing applications on the Cell Broadband Engine processor, including molecular dynamics, quantum chromodynamics and quantum chemistry codes. The paper discusses data and code restructuring strategies necessary to adapt the applications to the intrinsic properties of the Cell processor and demonstrates performance improvements achieved on the Cell architecture. It concludes with the lessons learned and provides practical recommendations on optimization techniques that are believed to be most appropriate.

Dynamic multigrain parallelization on the cell broadband engine

Proceedings of the …, 2007

This paper addresses the problem of orchestrating and scheduling parallelism at multiple levels of granularity on heterogeneous multicore processors. We present mechanisms and policies for adaptive exploitation and scheduling of layered parallelism on the Cell Broadband Engine. Our policies combine event-driven task scheduling with malleable loop-level parallelism, which is exploited from the runtime system whenever task-level parallelism leaves idle cores. We present a scheduler for applications with layered parallelism on Cell and investigate its performance with RAxML, an application which infers large phylogenetic trees, using the Maximum Likelihood (ML) method. Our experiments show that the Cell benefits significantly from dynamic methods that selectively exploit the layers of parallelism in the system, in response to workload fluctuation. Our scheduler outperforms the MPI version of RAxML, scheduled by the Linux kernel, by up to a factor of 2.6. We are able to execute RAxML on one Cell four times faster than on a dual-processor system with Hyperthreaded Xeon processors, and 5-10% faster than on a single-processor system with a dual-core, quad-thread IBM Power5 processor.

Using advanced compiler technology to exploit the performance of the Cell Broadband Engine™ architecture

IBM Systems Journal, 2000

The continuing importance of game applications and other numerically intensive workloads has generated an upsurge in novel computer architectures tailored for such functionality. Game applications feature highly parallel code for functions such as game physics, which have high computation and memory requirements, and scalar code for functions such as game artificial intelligence, for which fast response times and a full-featured programming environment are critical. The Cell Broadband Enginee architecture targets such applications, providing both flexibility and high performance by utilizing a 64-bit multithreaded PowerPCt processor element (PPE) with two levels of globally coherent cache and eight synergistic processor elements (SPEs), each consisting of a processor designed for streaming workloads, a local memory, and a globally coherent DMA (direct memory access) engine. Growth in processor complexity is driving a parallel need for sophisticated compiler technology. In this paper, we present a variety of compiler techniques designed to exploit the performance potential of the SPEs and to enable the multilevel heterogeneous parallelism found in the Cell Broadband Engine architecture. Our goal in developing this compiler has been to enhance programmability while continuing to provide high performance. We review the Cell Broadband Engine architecture and present the results of our compiler techniques, including SPE optimization, automatic code generation, single source parallelization, and partitioning. Ó

FFTC: Fastest Fourier Transform for the IBM Cell Broadband Engine

The Fast Fourier Transform (FFT) is of primary importance and a fundamental kernel in many computationally intensive scientific applications. In this paper we investigate its performance on the Sony-Toshiba-IBM Cell Broadband Engine, a heterogeneous multicore chip architected for intensive gaming applications and high performance computing. The Cell processor consists of a traditional microprocessor (called the PPE) that controls eight SIMD co-processing units called synergistic processor elements (SPEs). We exploit the architectural features of the Cell processor to design an efficient parallel implementation of Fast Fourier Transform (FFT). While there have been several attempts to develop a fast implementation of FFT on the Cell, none have been able to achieve high performance for input series with several thousand complex points. We use an iterative out-of-place approach to design our parallel implementation of FFT with 1K to 16K complex input samples and attain a single precision performance of 18.6 GFLOP/s on the Cell. Our implementation beats FFTW on Cell by several GFLOP/s for these input sizes and outperforms Intel Duo Core (Woodcrest) for inputs of greater than 2K samples. To our knowledge we have the fastest FFT for this range of complex inputs.

Using advanced compiler technology to exploit the performance of the Cell Broadband EngineTM architecture

Ibm Systems Journal, 2006

The continuing importance of game applications and other numerically intensive workloads has generated an upsurge in novel computer architectures tailored for such functionality. Game applications feature highly parallel code for functions such as game physics, which have high computation and memory requirements, and scalar code for functions such as game artificial intelligence, for which fast response times and a full-featured programming environment are critical. The Cell Broadband Engine™ architecture targets such applications, providing both flexibility and high performance by utilizing a 64-bit multithreaded PowerPC® processor element (PPE) with two levels of globally coherent cache and eight synergistic processor elements (SPEs), each consisting of a processor designed for streaming workloads, a local memory, and a globally coherent DMA (direct memory access) engine. Growth in processor complexity is driving a parallel need for sophisticated compiler technology. In this paper, we present a variety of compiler techniques designed to exploit the performance potential of the SPEs and to enable the multilevel heterogeneous parallelism found in the Cell Broadband Engine architecture. Our goal in developing this compiler has been to enhance programmability while continuing to provide high performance. We review the Cell Broadband Engine architecture and present the results of our compiler techniques, including SPE optimization, automatic code generation, single source parallelization, and partitioning.

High-performance computing using accelerators

Parallel Computing, 2007

High-performance computing using accelerators A recent trend in high-performance computing is the development and use of heterogeneous architectures that combine fine-grain and coarse-grain parallelism using tens or hundreds of disparate processing cores. These processing cores are available as accelerators or many-core processors, which are designed with the goal of achieving higher parallel-code performance. This is in contrast with traditional multicore CPUs that effectively replicate serial CPU cores. The recent demand for these accelerators comes primarily from consumer applications, including computer gaming and multimedia. Examples of such accelerators include graphics processing units (GPUs), Cell Broadband Engines (Cell BEs), field-programmable gate arrays (FPGAs), and other data-parallel or streaming processors. Compared to conventional CPUs, the accelerators can offer an order-of-magnitude improvement in performance per dollar as well as per watt. Moreover, some recent industry announcements are pointing towards the design of heterogeneous processors and computing environments, which are scalable from a system with a single homogeneous processor to a high-end computing platform with tens, or even hundreds, of thousands of heterogeneous processors. This special issue on ''High-Performance Computing Using Accelerators'' includes many papers on such commodity, many-core processors, including GPUs, Cell BEs, and FPGAs. GPGPUs: Current top-of-the-line GPUs have tens or hundreds of fragment processors and high memory bandwidth, i.e. 10• more than current CPUs. This processing power of GPUs has been successfully exploited for scientific, database, geometric and imaging applications (i.e. GPGPUs, short for General-Purpose computation on GPUs). The significant increase in parallelism within a processor can also lead to other benefits including higher power-efficiency and better memory-latency tolerance. In many cases, an order-of-magnitude performance was shown, as compared to top-of-the-line CPUs. For example, GPUTeraSort used the GPU interface to drive memory more efficiently and resulted in a threefold improvement in records/second/CPU. Similarly, some of the fastest algorithms for many numerical computations-including FFT, dense matrix multiplications and linear solvers, and collision and proximity computations-use GPUs to achieve tremendous speed-ups. Cell Broadband Engines: The Cell Broadband Engine is a joint venture between Sony, Toshiba, and IBM. It appears in consumer products such as Sony's PlayStation 3 computer entertainment system and Toshiba's Cell Reference Set, a development tool for Cell Broadband Engine applications. When viewed as a processor, the Cell can exploit the orthogonal dimensions of task and data parallelism on a single chip. The Cell processor consists of a symmetric multi-threaded (SMT) Power Processing Element (PPE) and eight Synergistic Processing Elements (SPEs) with pipelined SIMD capabilities. The processor achieves a theoretical peak performance of over 200 Gflops for single-precision floating-point calculations and has a peak memory bandwidth of over 25 GB/s. Actual speed-up factors achieved when automatically parallelizing sequential code kernels via the Cell's pipelined SIMD capabilities reach as high as 26-fold. Field-Programmable Gate Arrays (FPGAS): FPGAs support the notion of reconfigurable computing and offer a high degree of on-chip parallelism that can be mapped directly from the dataflow characteristics of an application's parallel algorithm. Their recent emergence in the high-performance computing arena can be attributed to a hybrid approach that combines the logic blocks and interconnects of traditional FPGAs with

The potential of the cell processor for scientific computing

Proceedings of the 3rd conference on Computing frontiers - CF '06, 2006

The slowing pace of commodity microprocessor performance improvements combined with ever-increasing chip power demands has become of utmost concern to computational scientists. As a result, the high performance computing community is examining alternative architectures that address the limitations of modern cache-based designs. In this work, we examine the potential of using the forthcoming STI Cell processor as a building block for future high-end computing systems. Our work contains several novel contributions. First, we introduce a performance model for Cell and apply it to several key scientific computing kernels: dense matrix multiply, sparse matrix vector multiply, stencil computations, and 1D/2D FFTs. The difficulty of programming Cell, which requires assembly level intrinsics for the best performance, makes this model useful as an initial step in algorithm design and evaluation. Next, we validate the accuracy of our model by comparing results against published hardware results, as well as our own implementations on the Cell full system simulator. Additionally, we compare Cell performance to benchmarks run on leading superscalar (AMD Opteron), VLIW (Intel Itanium2), and vector (Cray X1E) architectures. Our work also explores several different mappings of the kernels and demonstrates a simple and effective programming model for Cell's unique architecture. Finally, we propose modest microarchitectural modifications that could significantly increase the efficiency of double-precision calculations. Overall results demonstrate the tremendous potential of the Cell architecture for scientific computations in terms of both raw performance and power efficiency.

Performance analysis of Sweep3D on Blue Gene/P with the Scalasca toolset

2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW), 2010

In studying the scalability of the Scalasca performance analysis toolset to several hundred thousand MPI processes on IBM Blue Gene/P, we investigated a progressive execution performance deterioration of the well-known ASCI Sweep3D compact application. Scalasca runtime summarization analysis quantified MPI communication time that correlated with computational imbalance, and automated trace analysis confirmed growing amounts of MPI waiting times. Further instrumentation, measurement and analyses pinpointed a conditional section of highly imbalanced computation which amplified waiting times inherent in the associated wavefront communication that seriously degraded overall execution efficiency at very large scales. By employing effective data collation, management and graphical presentation, Scalasca was thereby able to demonstrate performance measurements and analyses with 294,912 processes for the first time.

Runtime and Architecture Support for Efficient Data Exchange in Multi-Accelerator Applications

IEEE Transactions on Parallel and Distributed Systems, 2000

Heterogeneous parallel computing applications often process large data sets that require multiple GPUs to jointly meet their needs for physical memory capacity and compute throughput. However, the lack of high-level abstractions in previous heterogeneous parallel programming models force programmers to resort to multiple code versions, complex data copy steps and synchronization schemes when exchanging data between multiple GPU devices, which results in high software development cost, poor maintainability, and even poor performance. This paper describes the HPE runtime system, and the associated architecture support, which enables a simple, efficient programming interface for exchanging data between multiple GPUs through either interconnects or cross-node network interfaces. The runtime and architecture support presented in this paper can also be used to support other types of accelerators. We show that the simplified programming interface reduces programming complexity. The research presented in this paper started in 2009. It has been implemented and tested extensively in several generations of HPE runtime systems as well as adopted into the NVIDIA GPU hardware and drivers for CUDA 4.0 and beyond since 2011. The availability of real hardware that support key HPE features gives rise to a rare opportunity for studying the effectiveness of the hardware support by running important benchmarks on real runtime and hardware. Experimental results show that in a exemplar heterogeneous system, peer DMA and double-buffering, pinned buffers, and software techniques can improve the inter-accelerator data communication bandwidth by 2×. They can also improve the execution speed by 1.6× for a 3D finite difference, 2.5× for 1D FFT, and 1.6× for merge sort, all measured on real hardware. The proposed architecture support enables the HPE runtime to transparently deploy these optimizations under simple portable user code, allowing system designers to freely employ devices of different capabilities. We further argue that simple interfaces such as HPE are needed for most applications to benefit from advanced hardware features in practice.

Multicore surprises: Lessons learned from optimizing sweep3d on the cellbe (original) (raw)

Related papers