On the parallelization of irregular and dynamic programs

Optimization techniques for irregular and pointer-based programs

12th Euromicro Conference on Parallel, Distributed and Network-Based Processing, Proceedings, 2004

Current compilers show inefficiencies when optimizing complex applications, both in analyzing dependences and in exploiting critical performance issues such as data locality and instruction/thread-level parallelism. Complex applications usually present irregular and/or dynamic (pointer-based) computational/data structures. By irregular we mean applications that arrange data as multi-dimensional arrays and issue memory references through array indirections. Pointer-based applications, on the other hand, organize data as pointer-based structures (lists, trees, ...).
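To make the distinction concrete, here is a minimal C sketch contrasting an irregular, indirection-based access with a pointer-based traversal (the names, sizes, and loops are invented for illustration and are not taken from the paper):

    #include <stddef.h>

    /* Irregular access: data lives in arrays, but the element written in each
       iteration is selected through an indirection array known only at run time. */
    void irregular_update(double a[], const int idx[], const double x[], int n)
    {
        for (int i = 0; i < n; i++)
            a[idx[i]] += x[i];
    }

    /* Pointer-based (dynamic) structure: a linked list traversed by chasing pointers. */
    struct node { double val; struct node *next; };

    double list_sum(const struct node *head)
    {
        double s = 0.0;
        for (const struct node *p = head; p != NULL; p = p->next)
            s += p->val;
        return s;
    }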

Automatic parallelization of irregular applications

Parallel Computing, 2000

Parallel computers are present in a variety of fields, having reached a high degree of architectural maturity. However, there is still a lack of convenient software support for implementing efficient parallel applications. This is especially true for the class of irregular applications, whose computational constructs hardly fit current parallel architectures. In fact, contemporary automatic parallelizers generally produce poor parallel code from these applications. This paper discusses techniques and methods to help improve the quality of automatically parallelized programs. We focus on two issues: parallelism detection and parallelism implementation. The first issue refers to the detection of specific irregular computation constructs or data access patterns. The second issue considers the case where some frequent construct has been detected but has been sub-optimally parallelized. Both issues are dealt with in depth and in the context of sparse computations (for the first issue) and irregular histogram reductions (for the second issue).
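The two constructs the abstract names can be illustrated with a short C sketch (an invented example assuming a compressed sparse row layout; it is not code from the paper):

    /* Sparse computation: y = A*x with A stored in compressed sparse row (CSR) form.
       The access x[col_idx[k]] is an array indirection that a parallelizer must
       recognize as a sparse access pattern. */
    void spmv_csr(int nrows, const int row_ptr[], const int col_idx[],
                  const double val[], const double x[], double y[])
    {
        for (int r = 0; r < nrows; r++) {
            double s = 0.0;
            for (int k = row_ptr[r]; k < row_ptr[r + 1]; k++)
                s += val[k] * x[col_idx[k]];
            y[r] = s;
        }
    }

    /* Irregular histogram reduction: the entry of hist updated in each iteration
       depends on run-time data, so different iterations may write the same entry. */
    void histogram(double hist[], const int bin[], const double w[], int n)
    {
        for (int i = 0; i < n; i++)
            hist[bin[i]] += w[i];
    }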

How much parallelism is there in irregular applications?

2009

Irregular programs are programs organized around pointer-based data structures such as trees and graphs. Recent investigations by the Galois project have shown that many irregular programs have a generalized form of data-parallelism called amorphous data-parallelism. However, in many programs, amorphous data-parallelism cannot be uncovered using static techniques, and its exploitation requires runtime strategies such as optimistic parallel execution. This raises a natural question: how much amorphous data-parallelism actually exists in irregular programs?

Supporting irregular and dynamic computations in data parallel languages

Lecture Notes in Computer Science, 1996

Data-parallel languages support a single instruction flow; the parallelism is expressed at the instruction level. In practice, data-parallel languages have chosen arrays to support parallelism. This regular data structure allows a natural development of regular parallel algorithms. The implementation of irregular algorithms requires a programming effort to project the irregular data structures onto regular structures. In this article we present the different techniques used to manage irregularity in data-parallel languages. Each of them is illustrated with standard or experimental data-parallel language constructs.
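As a hypothetical illustration of such a projection (not taken from the article), an irregular neighbour list can be flattened onto regular arrays so that a data-parallel language can operate on it:

    /* A ragged structure (each vertex has a different number of neighbours)
       projected onto two regular arrays: a flat neighbour array plus offsets.
       Flattening is one common projection; padding to a fixed width is another. */
    struct graph_flat {
        int nverts;
        int *offset;     /* offset[v] .. offset[v+1]-1 index into nbr[] */
        int *nbr;        /* concatenated neighbour lists */
    };

    /* A data-parallel-style operation over the flattened structure:
       the out-degree of every vertex, computed independently per vertex. */
    void degrees(const struct graph_flat *g, int deg[])
    {
        for (int v = 0; v < g->nverts; v++)
            deg[v] = g->offset[v + 1] - g->offset[v];
    }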

Scalable Automatic Parallelization of Irregular Reductions on Shared Memory Multiprocessors

This paper presents a new parallelization method for reductions of arrays with subscripted subscripts on scalable shared memory multiprocessors. The mapping of computations is based on grouping reduction loop iterations into sets that are further distributed across processors. Iterations belonging to the same set are chosen in such a way that they update different entries in the reduction array. That is, the loop distribution implies a conflict-free write distribution of the reduction array. The iteration sets are set up by building a loop-index prefetching array that allows the loop iterations to be reordered properly. The proposed method is general, scalable, and easy to implement in a compiler. In addition, it deals in a uniform way with one and multiple subscript arrays. In the case of multiple indirection arrays, writes on the reduction vector affecting different sets are resolved by defining conflict-free supersets. A performance evaluation and comparison with other existing techniques is presented. From the experimental results and performance analysis, the proposed method appears as a clear alternative to the array expansion and privatized buffer techniques usual in state-of-the-art parallelizing compilers such as Polaris or SUIF. The scalability problem that those techniques exhibit does not arise in our method, as its memory overhead does not depend on the number of processors. This work was supported by the Ministry of Education and Science (CICYT) of Spain (TIC96-1125-C03).

However, many of these codes exhibit irregular access patterns to the data. Current commercial compilers [17, 18] are insufficiently developed to deal with these data accesses, leading to low parallel efficiencies when they are used on such programs. Reduction operations are frequently found in the core of these applications, as in the following simple loop:

    do i = 1, N
      A(f(i)) = A(f(i)) opr expr
    end do
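A simplified reconstruction of the grouping idea described above, in C, could look as follows; iterations in the same set write different entries of the reduction array A, so each set can be split across processors without write conflicts (this is an assumption-laden sketch for a single indirection array, not the authors' actual algorithm or code):

    #include <stdlib.h>
    #include <string.h>

    /* Partition the N iterations of  A(f(i)) = A(f(i)) opr expr  into conflict-free
       sets.  On return, order[] lists the iteration indices reordered set by set
       (the "loop-index prefetching array"), set_start[s] marks where set s begins,
       and the number of sets is returned.  set_start must hold n+1 entries
       (worst case: every iteration writes the same entry of A, of size m). */
    int build_conflict_free_sets(int n, int m, const int f[],
                                 int order[], int set_start[])
    {
        int *count = calloc(m, sizeof *count);   /* writes seen so far per entry of A */
        int *setid = malloc(n * sizeof *setid);  /* set assigned to each iteration    */
        int nsets = 0;

        for (int i = 0; i < n; i++) {            /* iteration i joins the set equal   */
            setid[i] = count[f[i]]++;            /* to its occurrence rank for f(i)   */
            if (setid[i] + 1 > nsets)
                nsets = setid[i] + 1;
        }

        int *size = calloc(nsets, sizeof *size);
        for (int i = 0; i < n; i++)
            size[setid[i]]++;

        set_start[0] = 0;                        /* prefix sums give set boundaries   */
        for (int s = 1; s <= nsets; s++)
            set_start[s] = set_start[s - 1] + size[s - 1];

        int *next = malloc(nsets * sizeof *next);
        memcpy(next, set_start, nsets * sizeof *next);
        for (int i = 0; i < n; i++)              /* scatter iterations into their set */
            order[next[setid[i]]++] = i;

        free(count); free(setid); free(size); free(next);
        return nsets;
    }

Within one set, no two iterations share a value of f(i), so the set's iterations can be distributed statically across processors; with multiple indirection arrays, the conflict-free supersets mentioned in the abstract would play the analogous role.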

On the Scalability of an Automatically Parallelized Irregular Application

Lecture Notes in Computer Science, 2008

Irregular applications, i.e., programs that manipulate pointer-based data structures such as graphs and trees, constitute a challenging target for parallelization because the amount of parallelism is input dependent and changes dynamically. Traditional dependence analysis techniques are too conservative to expose this parallelism. Even manual parallelization is difficult, time consuming, and error prone. The Galois system parallelizes such applications using an optimistic approach that exploits higher-level semantics of abstract data types.

Exploiting locality in the run-time parallelization of irregular loops

Proceedings International Conference on Parallel Processing, 2002

The goal of this work is the efficient parallel execution of loops with indirect array accesses, so that the techniques can be embedded in a parallelizing compiler framework. In this kind of loop pattern, dependences cannot always be determined at compile time because, in many cases, they involve input data that are only known at run time and/or the access pattern is too complex to be analyzed. In this paper we propose run-time strategies for the parallelization of these loops. Our approaches focus not only on extracting parallelism among iterations of the loop, but also on exploiting data access locality to improve memory hierarchy behavior and, thus, the overall program speedup. Two strategies are proposed: one based on graph partitioning techniques and the other based on a block-cyclic distribution. Experimental results show that both strategies are complementary and that the choice of the best alternative depends on features of the loop pattern.
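As a rough sketch of this kind of run-time (inspector/executor) strategy, assuming a simple block distribution of the data array (a generic illustration in C, not the paper's graph-partitioning or block-cyclic algorithms):

    #include <stdlib.h>

    /* Inspector for a loop  A(f(i)) = A(f(i)) + x(i):  reorder the iteration space
       so that iterations touching the same block of A are executed together,
       improving locality and keeping writes of different buckets disjoint. */
    void inspector(int n, const int f[], int block_size, int nblocks,
                   int bucket_start[] /* nblocks+1 */, int order[] /* n */)
    {
        int *cursor = calloc(nblocks, sizeof *cursor);

        for (int b = 0; b <= nblocks; b++)
            bucket_start[b] = 0;
        for (int i = 0; i < n; i++)                   /* histogram of blocks touched */
            bucket_start[f[i] / block_size + 1]++;
        for (int b = 1; b <= nblocks; b++)            /* prefix sums -> bucket bounds */
            bucket_start[b] += bucket_start[b - 1];

        for (int i = 0; i < n; i++) {                 /* place iteration i in bucket  */
            int b = f[i] / block_size;
            order[bucket_start[b] + cursor[b]++] = i;
        }
        free(cursor);
    }

    /* Executor: whole buckets are assigned to threads/processors, so each one's
       writes to A stay within its own blocks. */
    void executor(int nblocks, const int bucket_start[], const int order[],
                  const int f[], double A[], const double x[])
    {
        for (int b = 0; b < nblocks; b++)
            for (int j = bucket_start[b]; j < bucket_start[b + 1]; j++)
                A[f[order[j]]] += x[order[j]];
    }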

Runtime support to parallelize adaptive irregular programs

1994

This paper describes how a runtime support library can be used as compiler run-time support in irregular applications. The CHAOS runtime support library carries out optimizations designed to reduce communication costs by performing software caching, communication coalescing, and inspector/executor preprocessing. CHAOS also supplies special-purpose routines to support specific types of irregular reduction, and runtime support for partitioning data and work between processors. A number of adaptive irregular codes have been parallelized using the CHAOS library, and performance results from these codes are also presented in this paper.
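A highly simplified sketch of the inspector/executor idea with communication coalescing (a generic C illustration; it does not use or reproduce the CHAOS API):

    #include <stdlib.h>

    /* Inspector: translate the global indices g[] used by a loop into slots of a
       local buffer, recording each distinct off-processor element only once, so a
       single coalesced gather can fetch all remote data before the executor runs. */
    int inspect(int n, const int g[], int global_size, int local[], int remote[])
    {
        int *slot = malloc(global_size * sizeof *slot);
        int nremote = 0;
        for (int j = 0; j < global_size; j++) slot[j] = -1;

        for (int i = 0; i < n; i++) {
            if (slot[g[i]] < 0) {            /* first reference: new buffer slot */
                slot[g[i]] = nremote;
                remote[nremote++] = g[i];
            }
            local[i] = slot[g[i]];
        }
        free(slot);
        return nremote;                      /* elements to fetch in one gather */
    }

    /* Executor: after buf[] has been filled by a single gather of remote[],
       the loop body runs entirely on locally buffered data. */
    void execute(int n, const int local[], const double buf[], double y[])
    {
        for (int i = 0; i < n; i++)
            y[i] += buf[local[i]];
    }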

Dynamic and speculative polyhedral parallelization using compiler-generated skeletons

Speculative parallelization is a classic strategy for automatically parallelizing codes that cannot be handled at compile time due to the use of dynamic data and control structures. Another motivation for being speculative is to adapt the code to the current execution context by selecting an efficient parallel schedule at run time. However, since this parallelization scheme requires on-the-fly semantics verification, it is in general difficult to perform advanced transformations for optimization and parallelism extraction. We propose a framework dedicated to speculative parallelization of scientific nested loop kernels, able to transform the code at runtime by re-scheduling the iterations to exhibit parallelism and data locality. The run-time process includes a transformation selection guided by profiling phases on short samples, using an instrumented version of the code. During this phase, the accessed memory addresses are interpolated to build a predictor of the forthcoming accesses. The collected addresses are also used to compute on-the-fly dependence distance vectors by tracking accesses to common addresses. Interpolating functions and distance vectors are then employed in dynamic dependence analysis and in selecting a parallelizing transformation that, if the prediction is correct, does not induce any rollback during execution. In order to ensure that the rollback time overhead stays low, the code is executed in successive slices of the outermost original loop of the nest. Each slice can be either a parallelized version, a sequential original version, or an instrumented version. Moreover, such slicing of the execution provides the opportunity of transforming the code differently to adapt to the observed execution phases. Parallel code generation is achieved at almost no cost by using binary code patterns that are generated at compile time and simply patched at run time to produce the transformed code.
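To give an idea of the profiling step described above, here is a minimal C sketch (an assumed, simplified reconstruction, not the framework's actual code) of interpolating sampled memory addresses into a linear predictor and deriving a dependence distance from accesses to a common address:

    #include <stdint.h>

    /* Fit addr(i) ~= base + stride*i from two profiled samples of one memory
       instruction; a real implementation would verify that every collected sample
       matches the interpolation before trusting the predictor. */
    typedef struct { intptr_t base; intptr_t stride; } linear_pred;

    linear_pred interpolate(long i0, intptr_t a0, long i1, intptr_t a1)
    {
        linear_pred p;
        p.stride = (a1 - a0) / (i1 - i0);
        p.base   = a0 - p.stride * i0;
        return p;
    }

    intptr_t predict(const linear_pred *p, long i)   /* forthcoming access at iter i */
    {
        return p->base + p->stride * i;
    }

    /* Iteration distance ir - iw at which a write stream pw and a read stream pr
       touch the same address, assuming equal non-zero strides (the simplest case);
       returns -1 when that case does not apply or the streams never collide. */
    long distance_same_stride(const linear_pred *pw, const linear_pred *pr)
    {
        if (pw->stride != pr->stride || pw->stride == 0)
            return -1;
        intptr_t diff = pw->base - pr->base;
        if (diff % pw->stride != 0)
            return -1;
        return (long)(diff / pw->stride);
    }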