Cutting-edge computing: Using new commodity architectures

High-performance computing using accelerators

Parallel Computing, 2007

A recent trend in high-performance computing is the development and use of heterogeneous architectures that combine fine-grain and coarse-grain parallelism using tens or hundreds of disparate processing cores. These processing cores are available as accelerators or many-core processors, which are designed with the goal of achieving higher parallel-code performance, in contrast with traditional multicore CPUs that effectively replicate serial CPU cores. The recent demand for these accelerators comes primarily from consumer applications, including computer gaming and multimedia. Examples of such accelerators include graphics processing units (GPUs), Cell Broadband Engines (Cell BEs), field-programmable gate arrays (FPGAs), and other data-parallel or streaming processors. Compared to conventional CPUs, accelerators can offer an order-of-magnitude improvement in performance per dollar as well as per watt. Moreover, recent industry announcements point towards heterogeneous processors and computing environments that scale from a system with a single homogeneous processor to a high-end computing platform with tens, or even hundreds, of thousands of heterogeneous processors. This special issue on "High-Performance Computing Using Accelerators" includes many papers on such commodity, many-core processors, including GPUs, Cell BEs, and FPGAs.

GPGPUs: Current top-of-the-line GPUs have tens or hundreds of fragment processors and high memory bandwidth, roughly 10× that of current CPUs. This processing power has been successfully exploited for scientific, database, geometric, and imaging applications (hence GPGPU, short for General-Purpose computation on GPUs). The significant increase in parallelism within a processor can also bring other benefits, including higher power efficiency and better memory-latency tolerance. In many cases, an order-of-magnitude performance improvement over top-of-the-line CPUs has been shown. For example, GPUTeraSort used the GPU interface to drive memory more efficiently and achieved a threefold improvement in records/second/CPU. Similarly, some of the fastest algorithms for many numerical computations, including FFT, dense matrix multiplication, linear solvers, and collision and proximity computations, use GPUs to achieve tremendous speed-ups.

Cell Broadband Engines: The Cell Broadband Engine is a joint venture between Sony, Toshiba, and IBM. It appears in consumer products such as Sony's PlayStation 3 computer entertainment system and Toshiba's Cell Reference Set, a development tool for Cell Broadband Engine applications. Viewed as a processor, the Cell can exploit the orthogonal dimensions of task and data parallelism on a single chip. It consists of a simultaneous multi-threaded (SMT) Power Processing Element (PPE) and eight Synergistic Processing Elements (SPEs) with pipelined SIMD capabilities. The processor achieves a theoretical peak performance of over 200 Gflops for single-precision floating-point calculations and has a peak memory bandwidth of over 25 GB/s. Actual speed-up factors achieved when automatically parallelizing sequential code kernels via the Cell's pipelined SIMD capabilities reach as high as 26-fold.

Field-Programmable Gate Arrays (FPGAs): FPGAs support the notion of reconfigurable computing and offer a high degree of on-chip parallelism that can be mapped directly from the dataflow characteristics of an application's parallel algorithm. Their recent emergence in the high-performance computing arena can be attributed to a hybrid approach that combines the logic blocks and interconnects of traditional FPGAs with
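
To make the data-parallel programming model concrete, here is a minimal CUDA sketch of the kind of kernel these accelerators execute well: a SAXPY over a large array, with one lightweight thread per element. The kernel and variable names are illustrative and not drawn from any of the papers in this issue.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// One thread per element: y[i] = a * x[i] + y[i].
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;                      // 1M elements
    size_t bytes = n * sizeof(float);

    float *hx = (float *)malloc(bytes), *hy = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { hx[i] = 1.0f; hy[i] = 2.0f; }

    float *dx, *dy;
    cudaMalloc(&dx, bytes);
    cudaMalloc(&dy, bytes);
    cudaMemcpy(dx, hx, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy, bytes, cudaMemcpyHostToDevice);

    // Launch enough 256-thread blocks to cover all n elements.
    int block = 256, grid = (n + block - 1) / block;
    saxpy<<<grid, block>>>(n, 3.0f, dx, dy);
    cudaMemcpy(hy, dy, bytes, cudaMemcpyDeviceToHost);

    printf("y[0] = %f (expected 5.0)\n", hy[0]);
    cudaFree(dx); cudaFree(dy); free(hx); free(hy);
    return 0;
}
```

The in-kernel bounds check handles array sizes that are not a multiple of the block size; apart from that, the entire computation is expressed as independent per-element work, which is exactly the parallelism fragment processors exploit.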

A comparison of three commodity-level parallel architectures: Multi-core CPU, cell BE and GPU

2010

The CPU has traditionally been the computational workhorse in scientific computing, but we have seen a tremendous increase in the use of accelerators, such as Graphics Processing Units (GPUs), in the last decade. These architectures are used because they consume less power and offer higher performance than equivalent CPU solutions. They are typically also far less expensive, since multiple CPUs, or even clusters, would be required to match their performance. Even though these accelerators are powerful in terms of floating-point operations per second, they are considerably more primitive in terms of capabilities. For example, they cannot even open a file on disk without the help of the CPU. Thus, most applications can benefit from using accelerators to perform heavy computation whilst running complex tasks on the CPU. This use of different compute resources is often referred to as heterogeneous computing, and we explore the use of heterogeneous architectures for scientific computing in this thesis. Through six papers, we present qualitative and quantitative comparisons of different heterogeneous architectures, the use of GPUs to accelerate linear algebra operations in MATLAB, and efficient shallow-water simulation on GPUs. Our results show that the use of heterogeneous architectures can give large performance gains.
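
As a minimal sketch of this division of labour, assuming a hypothetical input file and a toy kernel, the CPU below handles the task the accelerator cannot do on its own (opening and reading a file) while the GPU performs the heavy, embarrassingly parallel arithmetic.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// The heavy, embarrassingly parallel work runs on the accelerator.
__global__ void square(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * data[i];
}

int main() {
    // The CPU performs the "complex" serial tasks: file I/O, setup, control.
    FILE *f = fopen("input.bin", "rb");          // hypothetical input file
    if (!f) { fprintf(stderr, "missing input.bin\n"); return 1; }
    std::vector<float> host(1 << 20);
    size_t n = fread(host.data(), sizeof(float), host.size(), f);
    fclose(f);

    float *dev;
    cudaMalloc(&dev, n * sizeof(float));
    cudaMemcpy(dev, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    int block = 256, grid = (int)((n + block - 1) / block);
    square<<<grid, block>>>(dev, (int)n);

    cudaMemcpy(host.data(), dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    printf("processed %zu values on the GPU\n", n);
    return 0;
}
```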

The GPU Computing Revolution: From Multi-Core CPUs to Many-Core Graphics Processors

2011

The author is head of the Microelectronics Research Group at the University of Bristol and chair of the Many-Core and Reconfigurable Supercomputing Conference (MRSC), Europe's largest conference dedicated to the use of massively parallel computer architectures. Prior to joining the university he spent fifteen years in industry, where he designed massively parallel hardware and software at companies such as Inmos, STMicroelectronics and Pixelfusion, before co-founding ClearSpeed as Vice-President of Architecture and Applications. In 2006 ClearSpeed's many-core processors enabled the creation of one of the fastest and most energy-efficient supercomputers: TSUBAME at Tokyo Tech. He has given many invited talks on how massively parallel, heterogeneous processors are revolutionising high-performance computing, and has published numerous papers on how to exploit the speed of graphics processors for scientific applications. He holds eight patents in parallel hardware and software and sits on the steering and programme committees of most of the well-known international high-performance computing conferences.

Evaluated Design of High-Performance Processing Architectures

2020

I present the design and evaluation of two new processing elements for reconfigurable computing, together with a circuit-level implementation of the data paths in static and dynamic design styles to explore the various performance-power trade-offs involved. When implemented in an IBM 90-nm CMOS process, the 8-b data paths achieve operating frequencies of over 1 GHz for both static and dynamic implementations, with each data path supporting single-cycle computational capability. A novel single-precision floating-point processing element (FPPE) using a 24-b variant of the proposed data paths is also presented. The fully dynamic implementation of the FPPE operates at a frequency of 1 GHz with 6.5-mW average power consumption. Comparison with competing architectures shows that the FPPE provides two orders of magnitude higher throughput. Furthermore, to evaluate its feasibility as a soft-processing solution, we also map the floating-point unit onto Virtex 4 and 5 devices and observe that the unit requires less than 1% of the total logic slices while utilising only around 4% of the available DSP blocks. Compared against popular field-programmable gate array (FPGA)-based floating-point units, our design on Virtex 5 showed significantly lower resource utilisation while achieving a comparable peak operating frequency.

3D integration of solid-state memories and logic, as demonstrated by the Hybrid Memory Cube (HMC), offers major opportunities for revisiting near-memory computation and gives new hope for mitigating the power and performance losses caused by the "memory wall". Several publications in the past few years demonstrate this renewed interest. In this paper we present the first exploration steps towards the design of the Smart Memory Cube (SMC), a new Processor-in-Memory (PIM) architecture that enhances the capabilities of the logic base (LoB) die in the HMC. An accurate simulation environment called SMCSim has been developed, along with a full-featured software stack. The key contribution of this work is a full-system analysis of near-memory computation, from high-level software down to low-level firmware and hardware layers, considering offloading and dynamic overheads caused by the operating system (OS), cache coherence, and memory management. A zero-copy pointer passing mechanism has been devised to allow low-overhead data sharing between the host and the PIM. Benchmarking results demonstrate up to 2X performance improvement in comparison with the host System-on-Chip (SoC), and around 1.5X against a similar host-side accelerator. Moreover, by scaling down the voltage and frequency of the PIM's processor it is possible to reduce energy by around 70% and 55% in comparison with the host and the accelerator, respectively.
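
The SMC's zero-copy pointer passing lives in its firmware and driver stack and is not reproduced here; as a loose analogy on commodity hardware, the CUDA sketch below uses mapped pinned memory so that host and device share one buffer through a passed pointer rather than an explicit copy. All names are illustrative.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// The device reads and writes the host buffer directly over the interconnect.
__global__ void increment(int *buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] += 1;
}

int main() {
    const int n = 1024;
    int *host_buf, *dev_alias;

    // Enable mapping of pinned host memory before any allocation.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    // Pinned, mapped allocation: one physical buffer, two address spaces.
    cudaHostAlloc((void **)&host_buf, n * sizeof(int), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&dev_alias, host_buf, 0);

    for (int i = 0; i < n; ++i) host_buf[i] = i;

    // The kernel receives a pointer instead of a copied buffer.
    increment<<<(n + 255) / 256, 256>>>(dev_alias, n);
    cudaDeviceSynchronize();                 // make results visible to the host

    printf("host_buf[10] = %d (expected 11)\n", host_buf[10]);
    cudaFreeHost(host_buf);
    return 0;
}
```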

Accelerating High Performance Computing Applications: Using CPUs, GPUs, Hybrid CPU/GPU, and FPGAs

2012 13th International Conference on Parallel and Distributed Computing, Applications and Technologies, 2012

Most modern scientific research requires significant advanced modeling, simulation, and visualization. Due to the growing complexity of physical models, these research activities increasingly require more High Performance Computing (HPC) resources, and this trend is predicted to grow even stronger. Considering this growth in HPC applications, the traditional parallel computing model based solely on Central Processing Units (CPUs) is unable to meet the scientific needs of researchers. HPC requirements are expected to reach exascale in this decade. There are several approaches to enhancing and speeding up HPC; some of the most promising involve hybrid solutions. In this paper, we describe the existing state of hardware and accelerators for HPC. Such components include CPUs, Graphics Processing Units (GPUs), and Field-Programmable Gate Arrays (FPGAs). Various hybrid implementations of these accelerators are presented and compared. Examples of the top supercomputers are included as well, together with their hardware configurations. We conclude by discussing our predictions of further HPC hardware trends in support of advanced modeling, simulation, and visualization.

Computational density of fixed and reconfigurable multi-core devices for application acceleration

Proceedings of …

As on-chip transistor counts increase, the computing landscape has shifted to multi- and many-core devices. Computational accelerators have adopted this trend by incorporating both fixed and reconfigurable many-core and multi-core devices. As more disparate devices enter the market, there is an increasing need for concepts, terminology, and classification techniques to understand the device trade-offs. Additionally, performance and power metrics are needed to objectively compare accelerators. These metrics will assist application scientists in selecting the appropriate device early in the development cycle. This paper presents a hierarchical taxonomy of computing devices, concepts and terminology describing reconfigurability, and computational density metrics to compare devices.
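
As a rough illustration of what a computational-density style metric looks like (the paper's actual definitions are more detailed and distinguish bit widths and fixed versus reconfigurable logic), the sketch below compares two hypothetical devices by operations per second and operations per second per watt. The numbers are invented for illustration only.

```cuda
#include <cstdio>

// A device characterised by the generic quantities a density metric needs.
struct Device {
    const char *name;
    double units;          // parallel operation units
    double ops_per_cycle;  // operations each unit retires per cycle
    double clock_ghz;      // operating frequency
    double watts;          // typical power draw
};

// Throughput in giga-operations per second: units * ops/cycle * clock.
double gops(const Device &d) { return d.units * d.ops_per_cycle * d.clock_ghz; }

int main() {
    // Illustrative figures, not measurements from the paper.
    Device devices[] = {
        {"multi-core CPU", 4.0,   4.0, 3.0,  95.0},
        {"many-core GPU",  240.0, 2.0, 1.3, 180.0},
    };
    for (const Device &d : devices)
        printf("%-15s %8.1f GOPS  %6.2f GOPS/W\n",
               d.name, gops(d), gops(d) / d.watts);
    return 0;
}
```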

GP-GPU Accelerated Supercomputing

2012

The latest trend in supercomputing processor technology is the introduction of the General-Purpose Graphics Processing Unit, or GP-GPU. Since their introduction in the late 1980s, this variety of non-standard processors has undergone tremendous change in internal architecture, memory interface, hardware abstraction layer (HAL), and feature set. The technology grew rapidly with the emerging gaming market and better interconnect standards, and soon surpassed the processing strength of general-purpose x86 processors with the introduction of DirectX 9 shader technology. By the mid-2000s it was realized that this power could be harnessed, with the introduction of the DirectX 11 API and compatible shader hardware. These cheap consumer-level processors were capable of a combined processing power in excess of 1 teraflop in 2007, which reached 4.7 teraflops by 2012. This was achieved within ordinary computers, with power requirements of less than 800 watts and costs under $800. GPU manufacturers soon released their own SDKs to allow programmers to make use of this computational power within their own applications, two notable examples being the AMD Stream/APP SDK and the NVIDIA CUDA SDK [3].
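
The teraflop figures quoted above are theoretical peaks of the form cores × clock × FLOPs per cycle. The sketch below, written against the CUDA runtime mentioned in the abstract, queries the clock and multiprocessor count of the installed GPU; the cores-per-multiprocessor value is architecture specific and is supplied here as an assumption.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // CUDA cores per multiprocessor depend on the GPU generation; this is an
    // assumption the user must supply (8 on the earliest CUDA GPUs, far more
    // on later architectures).
    const int cores_per_sm = 8;
    double cores = (double)prop.multiProcessorCount * cores_per_sm;
    double clock_ghz = prop.clockRate / 1.0e6;   // clockRate is reported in kHz

    // Peak single-precision rate, assuming one fused multiply-add
    // (2 FLOPs) per core per cycle.
    double peak_gflops = cores * clock_ghz * 2.0;

    printf("%s: %d SMs, %.2f GHz -> ~%.0f GFLOP/s peak (single precision)\n",
           prop.name, prop.multiProcessorCount, clock_ghz, peak_gflops);
    return 0;
}
```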

High performance computing on graphics processing units

Pollack Periodica, 2008

The evolution of GPUs (graphics processing units) has been enormous in the past few years. Their calculation power has improved exponentially, while the range of tasks computable on GPUs has grown significantly wider. The milestone of GPU development in recent years is the appearance of unified-architecture devices. These GPUs implement a massively parallel design, which makes them capable not only of processing common computer graphics tasks but also of performing highly parallel mathematical algorithms effectively. Recognizing this potential, GPU vendors have issued developer platforms that let programmers manage computations on the GPU as a data-parallel computing device without having to map them to a graphics API. Researchers salute this initiative, and the application of the new technology is quickly spreading in various branches of science.

On the Efficacy of a Fused CPU+GPU Processor (or APU) for Parallel Computing

2011 Symposium on Application Accelerators in High-Performance Computing, 2011

The graphics processing unit (GPU) has made significant strides as an accelerator in parallel computing. However, because the GPU has resided out on PCIe as a discrete device, the performance of GPU applications can be bottlenecked by data transfers between the CPU and GPU over PCIe. Emerging heterogeneous computing architectures that "fuse" the functionality of the CPU and GPU, e.g., AMD Fusion and Intel Knights Ferry, hold the promise of addressing the PCIe bottleneck. In this paper, we empirically characterize and analyze the efficacy of AMD Fusion, an architecture that combines general-purpose x86 cores and programmable accelerator cores on the same silicon die. We characterize its performance via a set of micro-benchmarks (e.g., PCIe data transfer), kernel benchmarks (e.g., reduction), and actual applications (e.g., molecular dynamics). Depending on the benchmark, our results show that Fusion produces a 1.7- to 6.0-fold improvement in data-transfer time when compared to a discrete GPU. In turn, this improvement in data-transfer performance can significantly enhance application performance. For example, running a reduction benchmark on AMD Fusion with its mere 80 GPU cores improves performance 3.5-fold over the discrete AMD Radeon HD 5870 GPU, despite the latter's 1600 more powerful GPU cores.
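
Reduction is one of the kernel benchmarks named above. The sketch below is a generic shared-memory tree reduction in CUDA, not the benchmark code used in the paper; it shows why the operation involves both device computation and a final transfer back to the host, exactly the kind of traffic a fused CPU+GPU design is meant to cheapen.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each block reduces 256 elements in shared memory; the per-block sums are
// then added on the host (a second kernel pass would also work).
__global__ void block_sum(const float *in, float *out, int n) {
    __shared__ float s[256];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    s[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) s[tid] += s[tid + stride];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = s[0];
}

int main() {
    const int n = 1 << 20, block = 256, grid = (n + block - 1) / block;
    float *h_in = new float[n];
    for (int i = 0; i < n; ++i) h_in[i] = 1.0f;

    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, grid * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    block_sum<<<grid, block>>>(d_in, d_out, n);

    float *h_out = new float[grid], total = 0.0f;
    cudaMemcpy(h_out, d_out, grid * sizeof(float), cudaMemcpyDeviceToHost);
    for (int b = 0; b < grid; ++b) total += h_out[b];

    printf("sum = %.0f (expected %d)\n", total, n);
    cudaFree(d_in); cudaFree(d_out); delete[] h_in; delete[] h_out;
    return 0;
}
```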