Task-Parallel versus Data-Parallel Library-Based Programming in Multicore Systems

Hierarchically Tiled Arrays Vs. Intel Threading Building Blocks for Programming Multicore Systems

2008

Multicore systems are now the norm. Programmers can no longer rely on faster clock rates to speed up their applications. Thus, software developers are increasingly forced to face the complexities of parallel programming. The Intel Threading Building Blocks (TBBs) library was designed to facilitate parallel programming. The key notion is to separate logical task patterns, which are easy to understand, from physical threads, and delegate the scheduling of the tasks to the system. On the other hand, Hierarchically Tiled Arrays (HTAs) are data structures that facilitate locality and parallelism of array intensive computations with a block-recursive nature. The model underlying HTAs provides programmers with a data parallel, single-threaded view of the execution. The HTA implementation in C++ has been recently extended to support multicore machines. In this work we implement several algorithms using both libraries in order to compare ease of programming and performance.
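The separation of logical tasks from physical threads described above can be illustrated with a minimal sketch in standard C++. This is not TBB code and does not use the TBB API: `parallel_sum` and its `grain` parameter are hypothetical names chosen for illustration, and `std::async` stands in for a real task scheduler, which would typically map tasks onto a fixed pool of worker threads instead of possibly starting one thread per task.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <numeric>
#include <vector>

// Hypothetical helper: sums data[first, last) by recursively splitting the
// range into logical tasks. `grain` bounds how small a task may become.
long long parallel_sum(const std::vector<int>& data, std::size_t first,
                       std::size_t last, std::size_t grain = 1024) {
    assert(first <= last && last <= data.size() && grain > 0);
    if (last - first <= grain) {
        // Small enough: run sequentially inside the current task.
        return std::accumulate(data.begin() + first, data.begin() + last, 0LL);
    }
    const std::size_t mid = first + (last - first) / 2;
    // The left half becomes a separate logical task. The code only says
    // "this work may run concurrently"; it never manages a thread itself.
    auto left = std::async(std::launch::async, [&data, first, mid, grain] {
        return parallel_sum(data, first, mid, grain);
    });
    // The right half is processed by the current task in the meantime.
    const long long right = parallel_sum(data, mid, last, grain);
    return left.get() + right;
}
```

A call such as `parallel_sum(v, 0, v.size())` reads like a sequential function call; how the pieces are distributed over cores is left to the runtime.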

Hierarchically Tiled Array Vs. Intel Thread Building Blocks for Multicore Systems Programming

Multicore systems are becoming common, and programmers can no longer rely on rising clock rates to speed up their applications. Thus, software developers are increasingly exposed to the complexity of programming parallel shared-memory environments. Intel Threading Building Blocks (TBBs) is a library that facilitates the programming of this kind of system. The key notion is to separate logical task patterns, which are easy to understand, from physical threads, and delegate the scheduling of the tasks to the system. On the other hand, Hierarchically Tiled Arrays (HTAs) are data structures that facilitate locality and parallelism of array-intensive computations with a block-recursive nature. The model underlying HTAs provides programmers with a single-threaded view of the execution. The HTA implementation in C++ has recently been extended to support multicore machines. In this work we implement several algorithms using both libraries in order to compare the ease of programming and the relative performance of the two approaches.

Programming for parallelism and locality with hierarchically tiled arrays

Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming - PPoPP '06, 2006

Tiling has proven to be an effective mechanism to develop high performance implementations of algorithms. Tiling can be used to organize computations so that communication costs in parallel programs are reduced and locality in sequential codes or sequential components of parallel programs is enhanced.
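As a concrete instance of tiling for locality in sequential code, the following sketch multiplies two square matrices tile by tile. It is a generic textbook-style illustration and is not taken from the paper; the function name `tiled_matmul` and the `tile` parameter are assumptions made for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Computes C += A * B for n x n row-major matrices, walking the iteration
// space in tile x tile blocks so that each block of A, B, and C is reused
// while it is still likely to be in cache. `tile` is a tuning parameter.
void tiled_matmul(const std::vector<double>& a, const std::vector<double>& b,
                  std::vector<double>& c, std::size_t n, std::size_t tile) {
    assert(tile > 0);
    assert(a.size() == n * n && b.size() == n * n && c.size() == n * n);
    for (std::size_t ii = 0; ii < n; ii += tile) {
        for (std::size_t kk = 0; kk < n; kk += tile) {
            for (std::size_t jj = 0; jj < n; jj += tile) {
                // Clamp tile bounds so n need not be a multiple of tile.
                const std::size_t i_end = std::min(ii + tile, n);
                const std::size_t k_end = std::min(kk + tile, n);
                const std::size_t j_end = std::min(jj + tile, n);
                for (std::size_t i = ii; i < i_end; ++i)
                    for (std::size_t k = kk; k < k_end; ++k)
                        for (std::size_t j = jj; j < j_end; ++j)
                            c[i * n + j] += a[i * n + k] * b[k * n + j];
            }
        }
    }
}
```

The result is the same as that of the untiled triple loop; only the order in which elements are visited changes. In a parallel setting the same decomposition gives natural units of work and communication.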

A Library-Based Approach to Task Parallelism in a Data-Parallel Language

Journal of Parallel and Distributed Computing, 1997

Pure data-parallel languages such as High Performance Fortran version 1 (HPF) do not allow efficient expression of mixed task/data-parallel computations or the coupling of separately compiled data-parallel modules. In this paper, we show how these common parallel program structures can be represented, with only minor extensions to the HPF model, by using a coordination library based on the Message Passing Interface (MPI). This library allows data-parallel tasks to exchange distributed data structures using calls to simple communication functions. We present microbenchmark results that characterize the performance of this library and that quantify the impact of optimizations that allow reuse of communication schedules in common situations. In addition, results from two-dimensional FFT, convolution, and multiblock programs demonstrate that the HPF/MPI library can provide performance superior to that of pure HPF. We conclude that this synergistic combination of two parallel programming standards represents a useful approach to task parallelism in a data-parallel framework, increasing the range of problems addressable in HPF without requiring complex compiler technology.

Design Issues in Parallel Array Languages for Shared Memory

Lecture Notes in Computer Science, 2008

The Hierarchically Tiled Array (HTA) is a data type that facilitates the definition and manipulation of arrays partitioned into tiles. The data type allows those tiles to be exploited to attain both locality and parallelism. Parallel programs written with HTAs are based on data parallelism, and provide the programmer with a single-threaded view of the execution. In our experience, HTAs help to develop parallel codes in a much more productive way than other parallel programming approaches. While we have worked extensively with HTAs in distributed memory environments, only recently have we begun to consider their adaptation to shared memory environments such as those found in multicore systems. In this paper we review the design issues, opportunities and challenges that this migration raises.
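The idea of a tiled array whose operations run in parallel behind a single-threaded interface can be sketched as follows. This is a simplified, one-level, one-dimensional stand-in written in standard C++; it is not the HTA library and does not reproduce its API. The class name `TiledArray` and its members are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <vector>

// Hypothetical stand-in for a tiled array: a 1-D array split into equally
// sized tiles, each stored contiguously.
class TiledArray {
public:
    TiledArray(std::size_t num_tiles, std::size_t tile_size, double init)
        : tiles_(num_tiles, std::vector<double>(tile_size, init)) {
        assert(num_tiles > 0 && tile_size > 0);
    }

    std::size_t num_tiles() const { return tiles_.size(); }
    const std::vector<double>& tile(std::size_t t) const { return tiles_.at(t); }

    // Applies f to every element. Tiles are processed concurrently, but the
    // caller sees one blocking call, i.e. a single-threaded view.
    template <typename F>
    void map(F f) {
        std::vector<std::future<void>> pending;
        pending.reserve(tiles_.size());
        for (auto& t : tiles_) {
            // Each tile is an independent unit of work with no shared state.
            pending.push_back(std::async(std::launch::async, [&t, f] {
                for (double& x : t) x = f(x);
            }));
        }
        for (auto& p : pending) p.get();  // Wait until every tile is done.
    }

private:
    std::vector<std::vector<double>> tiles_;
};
```

With this sketch, `h.map([](double x) { return 2.0 * x; })` is written as one ordinary statement; the tile structure determines both the data layout and the granularity of parallel work.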

Hierarchically tiled arrays for parallelism and locality

Proceedings 20th IEEE International Parallel & Distributed Processing Symposium, 2006

Parallel programming is facilitated by constructs which, unlike the widely used SPMD paradigm, provide programmers with a global view of the code and data structures. These constructs could be compiler directives containing information about data and task distribution, language extensions specifically designed for parallel computation, or classes that encapsulate parallelism. In this paper, we describe a class developed at Illinois and its MATLAB implementation. This class can be used to conveniently express both parallelism and locality. A C++ implementation is now underway. Its characteristics will be reported in a future paper. We have implemented most of the NAS benchmarks using our HTA MATLAB extensions and found that HTAs enable the fast prototyping of parallel algorithms and produce programs that are easy to understand and maintain.

Programming for Locality and Parallelism with Hierarchically Tiled Arrays

Lecture Notes in Computer Science, 2004

Tiling has proven to be an effective mechanism to develop high performance implementations of algorithms. Tiling can be used to organize computations so that communication costs in parallel programs are reduced and locality in sequential codes or sequential components of parallel programs is enhanced.

SAC: off-the-shelf support for data-parallelism on multicores

2007

The advent of multicore processors has raised new demand for harnessing concurrency in the software mass market. We summarise our previous work on the data parallel, functional array processing language SaC. Its compiler technology is geared towards highly runtime-efficient support for shared memory multiprocessors and, thus, is readily applicable to today's off-the-shelf multicore systems.

IT – A Simple Parallel Language for Hierarchical Parallel Architectures

2014

After several decades of continuous research and development of hundreds of parallel programming languages, the dominant mechanism of parallelism unfortunately remains low-level threading or message-passing libraries attached to sequential language cores. We are investigating an alternative parallel programming paradigm that strives to strike a balance between low-level, platform-specific programming, such as in MPI, Pthreads, or CUDA, and high-level, hardware-agnostic, language-based approaches, such as in X10 or Chapel. The result is the IT programming language. IT is a language for high performance scientific computing where expression of parallelism in a program is inseparable from reasoning about the capabilities of its execution platform, but the reasoning is done over an abstract machine model that enables portable high performance without sacrificing programmer productivity. This report describes IT's programming model, syntax, core features, and results of some early performance experiments with IT sample programs on an NVIDIA GPGPU platform.

AceMesh: a structured data driven programming language for high performance computing

CCF Trans. High Perform. Comput., 2020

Asynchronous task-based programming models are gaining popularity to address the programmability and performance challenges of contemporary large scale high performance computing systems. In this paper we present AceMesh, a task-based, data-driven language extension targeting legacy MPI applications. Its language features include a data-centric parallelizing template and aggregated task dependence for parallel loops. These features not only relieve the programmer from tedious refactoring details but also make possible the structured execution of complex task graphs, the exploitation of data locality through data tile templates, and a reduction of the system complexity incurred by complex array sections. We present the prototype implementation, including task shifting, data management and communication-related analysis and transformations. The language extension is evaluated on two supercomputing platforms. We compare the performance of AceMesh with existing programming models, and the results show that NPB/MG achieves speedups of up to 1.2X and 1.85X on TaihuLight and TH-2, respectively, and that the Tend_lin benchmark attains more than 2X speedup on average, with maximum speedups of 3.0X and 2.2X on the two platforms, respectively.