Modeling Multigrain Parallelism on Heterogeneous Multi-core Processors: A Case Study of the Cell BE

Fine-grain parallelism using multi-core, Cell/BE, and GPU Systems

Parallel Computing, 2012

We are currently faced with a situation where applications have increasing computational demands and there is a wide selection of parallel processor systems. In this paper we focus on exploiting fine-grain parallelism for a demanding bioinformatics application, MrBayes, and its Phylogenetic Likelihood Functions (PLF), using different architectures. Our experiments compare side by side the scalability and performance achieved using general-purpose multi-core processors, the Cell/BE, and Graphics Processing Units (GPUs). The results indicate that all processors scale well for larger computation and data sets. GPU and Cell/BE processors also achieve the best improvement for the parallel code section. Nevertheless, data transfers and the execution of the serial portion of the code are the reasons for their poor overall performance. The general-purpose multi-core processors prove to be simpler to program and provide the best balance between efficient parallel and serial execution, resulting in the largest speedup.
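The abstract's point that a large gain in the parallel section can still yield a poor overall result follows from Amdahl's law. The sketch below is only an illustration of that reasoning: the function names and the `transfer_fraction` term are assumptions made here, and the numbers in the usage lines are hypothetical, not measurements from the paper.

```python
def overall_speedup(parallel_fraction: float, section_speedup: float,
                    transfer_fraction: float = 0.0) -> float:
    """Amdahl-style estimate of whole-program speedup.

    parallel_fraction: share of the original runtime spent in the parallel section.
    section_speedup:   how much faster that section runs on the accelerator.
    transfer_fraction: extra time spent moving data, expressed as a share of the
                       original runtime (an assumption added for illustration).
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if section_speedup <= 0.0:
        raise ValueError("section_speedup must be positive")
    if transfer_fraction < 0.0:
        raise ValueError("transfer_fraction must be non-negative")
    serial = 1.0 - parallel_fraction
    new_time = serial + parallel_fraction / section_speedup + transfer_fraction
    return 1.0 / new_time


if __name__ == "__main__":
    # Hypothetical numbers: 90% of the runtime is parallel and runs 10x faster.
    print(overall_speedup(0.9, 10.0))
    # Same section speedup, but data transfers add 10% of the original runtime.
    print(overall_speedup(0.9, 10.0, transfer_fraction=0.1))
```

Under these assumed values, the serial remainder and the transfer term dominate the new runtime, which is the effect the abstract attributes to the accelerators.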

Dynamic multigrain parallelization on the cell broadband engine

Proceedings of the …, 2007

This paper addresses the problem of orchestrating and scheduling parallelism at multiple levels of granularity on heterogeneous multicore processors. We present mechanisms and policies for adaptive exploitation and scheduling of layered parallelism on the Cell Broadband Engine. Our policies combine event-driven task scheduling with malleable loop-level parallelism, which is exploited from the runtime system whenever task-level parallelism leaves idle cores. We present a scheduler for applications with layered parallelism on Cell and investigate its performance with RAxML, an application which infers large phylogenetic trees, using the Maximum Likelihood (ML) method. Our experiments show that the Cell benefits significantly from dynamic methods that selectively exploit the layers of parallelism in the system, in response to workload fluctuation. Our scheduler outperforms the MPI version of RAxML, scheduled by the Linux kernel, by up to a factor of 2.6. We are able to execute RAxML on one Cell four times faster than on a dual-processor system with Hyperthreaded Xeon processors, and 5-10% faster than on a single-processor system with a dual-core, quad-thread IBM Power5 processor.
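The policy described above, where loop-level parallelism fills the cores that task-level parallelism leaves idle, can be pictured with a small allocation routine. This is a minimal sketch under stated assumptions, not the paper's scheduler: the function name `allocate_spes` is hypothetical, the default of eight accelerator cores (SPEs) is assumed for a single Cell, and an even split is just one possible policy.

```python
from typing import List


def allocate_spes(active_tasks: int, total_spes: int = 8) -> List[int]:
    """Return the number of accelerator cores given to each running task.

    With at least as many tasks as cores, every running task gets one core
    (pure task-level parallelism) and the remaining tasks wait. With fewer
    tasks than cores, the idle cores are split among the tasks so that each
    task can run its loops in parallel.
    """
    if total_spes <= 0:
        raise ValueError("total_spes must be positive")
    if active_tasks <= 0:
        return []
    running = min(active_tasks, total_spes)
    base, extra = divmod(total_spes, running)
    # The first `extra` tasks receive one additional core.
    return [base + (1 if i < extra else 0) for i in range(running)]


if __name__ == "__main__":
    for tasks in (12, 8, 3, 1):
        print(tasks, allocate_spes(tasks))
```

A real runtime would re-evaluate this split whenever a task arrives or finishes, which is where the event-driven part of the described scheduler comes in.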

Architectural Requirements of Parallel Computational Biology Applications With Explicit Instruction Level Parallelism

2008

With the tremendous growth of the information culture, efficient digital searches are needed to extract and identify information from huge data sets. The prevailing notion has been that the evolution of silicon technology, and the processing speed it delivers, could keep up with the exponentially increasing demand for knowledge processing in molecular biology. Processing power has jumped from 2.3 MHz processors to 2.3 GHz processors. Yet contemporary compute- and data-intensive computational biology applications have raised the issue of a suitable architecture design that meets their requirements. In the same vein, the trend of porting such applications to handheld devices has grown over the last decades. To achieve such goals, the expression profile is a critical performance metric in high-end genomic data processing. In such applications, massive data take the form of raw bio-sequence files and multi-dimensional structure images. In most cases, they contain highly irregular phylogenetic trees. A priori knowledge of the software application would be very useful for exploiting algorithm-to-silicon matching. This work explores trace-driven simulation of bioinformatics applications on a high-end multimedia processor. Using an Energy Cycle Aware Compilation Framework (ECACF) for application expression profile extraction, we connect legacy applications to available off-the-shelf parallel processors and compare their performance in terms of architectural parameters as well as their static expressions. For common bioinformatics applications, we find that burst-mode applications perform better than bulk-mode applications. Similarly, flat applications perform well on our parallel architectures compared to branch-dominated applications, with the added advantage that component reusability is very high. We present the details of the proposed scheme for 10 widely used bioinformatics applications.

Runtime scheduling of dynamic parallelism on accelerator-based multi-core systems

Parallel Computing, 2007

We explore runtime mechanisms and policies for scheduling dynamic multi-grain parallelism on heterogeneous multi-core processors. Heterogeneous multi-core processors integrate conventional cores that run legacy codes with specialized cores that serve as computational accelerators. The term multi-grain parallelism refers to the exposure of multiple dimensions of parallelism from within the runtime system, so as to best exploit a parallel architecture with heterogeneous computational capabilities between its cores and execution units. We investigate user-level schedulers that dynamically "rightsize" the dimensions and degrees of parallelism on the Cell Broadband Engine. The schedulers address the problem of mapping application-specific concurrency to an architecture with multiple hardware layers of parallelism, without requiring programmer intervention or sophisticated compiler support. We evaluate recently introduced schedulers for event-driven execution and utilization-driven dynamic multi-grain parallelization on Cell. We also present a new scheduling scheme for dynamic multi-grain parallelism, S-MGPS, which uses sampling of dominant execution phases to converge to the optimal scheduling algorithm. We evaluate S-MGPS on an IBM Cell BladeCenter with two realistic bioinformatics applications that infer large phylogenies. S-MGPS performs within 2-10% of the optimal scheduling algorithm in these applications, while exhibiting low overhead and little sensitivity to application-dependent parameters.
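The idea of sampling execution phases to choose a degree of parallelism can be sketched as follows. This is a simplified illustration, not S-MGPS itself: the abstract does not specify the algorithm, so the exhaustive sampling loop, the names `candidate_configs` and `pick_by_sampling`, the eight-core default, and the `sample_phase` callback are all assumptions made here.

```python
from typing import Callable, Dict, List, Tuple

# (concurrent tasks, accelerator cores per task); naming is hypothetical.
Config = Tuple[int, int]


def candidate_configs(total_cores: int = 8) -> List[Config]:
    """Enumerate the ways to split the cores evenly between task-level
    and loop-level parallelism."""
    return [(total_cores // width, width)
            for width in range(1, total_cores + 1)
            if total_cores % width == 0]


def pick_by_sampling(configs: List[Config],
                     sample_phase: Callable[[Config], float]) -> Config:
    """Run a short sample of the dominant phase under each configuration
    and keep the one with the lowest measured time."""
    if not configs:
        raise ValueError("no configurations to sample")
    timings: Dict[Config, float] = {cfg: sample_phase(cfg) for cfg in configs}
    return min(timings, key=lambda cfg: timings[cfg])


if __name__ == "__main__":
    # Stand-in cost model (purely illustrative): two cores per task is best.
    def fake_sample(cfg: Config) -> float:
        return 1.0 + abs(cfg[1] - 2)

    print(candidate_configs())
    print(pick_by_sampling(candidate_configs(), fake_sample))
```

In a real runtime, `sample_phase` would time a short slice of the application's dominant phase rather than evaluate a formula.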

Approaches to architecture-aware parallel scientific computation

2005

Modern large-scale scientific computation problems must execute in a parallel computational environment to achieve acceptable performance. Target parallel environments range from the largest tightly-coupled supercomputers to heterogeneous clusters of workstations. Grid technologies make Internet execution more likely. Hierarchical and heterogeneous systems are increasingly common. Processing and communication capabilities can be nonuniform, non-dedicated, transient or unreliable. Even when targeting homogeneous computing environments, each environment may differ in the number of processors per node, the relative costs of computation, communication, and memory access, and the availability of programming paradigms and software tools. Architecture-aware computation requires knowledge of the computing environment and software performance characteristics, and tools to make use of this knowledge. These challenges may be addressed by compilers, low-level tools, dynamic load balancing or solution procedures, middleware layers, high-level software development techniques, and choice of programming languages and paradigms. Computation and communication may be reordered. Data or computation may be replicated or a load imbalance may be tolerated to avoid costly communication. This paper samples a variety of approaches to architecture-aware parallel computation.

Available Task-level Parallelism on the Cell BE

There is a clear industrial trend towards chip multiprocessors (CMP) as the most power efficient way of further increasing performance. Heterogeneous CMP architectures take one more step along this power efficiency trend by using multiple types of processors, tailored to the workloads they will execute. Programming these CMP architectures has been identified as one of the main challenges in the near future, and programming heterogeneous systems is even more challenging. High-level programming models which allow the programmer to identify parallel tasks, and the runtime management of the inter-task dependencies, have been identified as a suitable model for programming such heterogeneous CMP architectures.

Fine-grain parallelism using multi-core, cell/BE, and GPU systems: Accelerating the phylogenetic likelihood function

Proceedings of the International Conference on Parallel Processing, 2009

We are currently faced with a situation where applications have increasing computational demands and there is a wide selection of parallel processor systems. In this paper we focus on exploiting fine-grain parallelism for a demanding bioinformatics application, MrBayes, and its Phylogenetic Likelihood Functions (PLF), using different architectures. Our experiments compare side by side the scalability and performance achieved using general-purpose multi-core processors, the Cell/BE, and Graphics Processing Units (GPUs). The results indicate that all processors scale well for larger computation and data sets. GPU and Cell/BE processors also achieve the best improvement for the parallel code section. Nevertheless, data transfers and the execution of the serial portion of the code are the reasons for their poor overall performance. The general-purpose multi-core processors prove to be simpler to program and provide the best balance between efficient parallel and serial execution, resulting in the largest speedup.

Design and Optimization of Scientific Applications for Highly Heterogeneous and Hierarchical HPC Platforms Using Functional Computation Performance Models

High-Performance Computing on Complex Environments, 2014

HPC platforms are becoming increasingly heterogeneous and hierarchical. The main source of heterogeneity in many individual computing nodes is the use of specialized accelerators such as GPUs alongside general-purpose CPUs. Heterogeneous many-core processors will be another source of intra-node heterogeneity in the near future. As modern HPC clusters become more heterogeneous, due to an increasing number of different processing devices, a hierarchical approach needs to be taken with respect to memory and communication interconnects to reduce complexity. In recent years, many scientific codes have been ported to multi-core and GPU architectures. To achieve optimum performance of these applications on hybrid CPU/GPU platforms, software heterogeneity needs to be accounted for. Therefore, the design and implementation of data-parallel scientific applications for such highly heterogeneous and hierarchical platforms represent a significant scientific and engineering challenge. This chapter presents the state of the art in the solution of this problem, based on functional performance models of computing devices and nodes.
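A functional performance model describes a device's speed as a function of the problem size it is given, and a partitioner can use such functions to balance work across unequal devices. The sketch below is a simple greedy heuristic written for illustration only; it is not the chapter's algorithm, and the name `partition_units` and the example speed functions are hypothetical.

```python
from typing import Callable, List, Sequence

# Maps a workload size (in units) to a speed in units per second.
SpeedFunction = Callable[[int], float]


def partition_units(total_units: int,
                    speeds: Sequence[SpeedFunction]) -> List[int]:
    """Distribute `total_units` of work over the devices one unit at a time.

    Each unit goes to the device whose finishing time would be smallest
    after receiving it, where time = size / speed(size).
    """
    if total_units < 0:
        raise ValueError("total_units must be non-negative")
    if not speeds:
        raise ValueError("at least one device is required")
    shares = [0] * len(speeds)

    def time_with_one_more(device: int) -> float:
        size = shares[device] + 1
        return size / speeds[device](size)

    for _ in range(total_units):
        best = min(range(len(speeds)), key=time_with_one_more)
        shares[best] += 1
    return shares


if __name__ == "__main__":
    # Hypothetical devices: one is three times faster at every size.
    fast = lambda size: 3.0
    slow = lambda size: 1.0
    print(partition_units(8, [fast, slow]))
```

With constant speed functions this reduces to a proportional split. The functional model matters when a device's speed changes with size, for example once the workload no longer fits in accelerator memory.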

A comparison of three commodity-level parallel architectures: Multi-core CPU, cell BE and GPU

2010

The CPU has traditionally been the computational workhorse in scientific computing, but we have seen a tremendous increase in the use of accelerators, such as Graphics Processing Units (GPUs), in the last decade. These architectures are used because they consume less power and offer higher performance than equivalent CPU solutions. They are typically also far less expensive, as more CPUs, and even clusters, are required to match their performance. Even though these accelerators are powerful in terms of floating-point operations per second, they are considerably more primitive in terms of capabilities. For example, they cannot even open a file on disk without the use of the CPU. Thus, most applications can benefit from using accelerators to perform heavy computation while running complex tasks on the CPU. This use of different compute resources is often referred to as heterogeneous computing, and we explore the use of heterogeneous architectures for scientific computing in this thesis. Through six papers, we present qualitative and quantitative comparisons of different heterogeneous architectures, the use of GPUs to accelerate linear algebra operations in MATLAB, and efficient shallow water simulation on GPUs. Our results show that the use of heterogeneous architectures can give large performance gains.