Uncovering Hidden Loop Level Parallelism in Sequential Applications

Speculatively Exploiting Cross-Invocation Parallelism

Proceedings of the 2016 International Conference on Parallel Architectures and Compilation - PACT '16, 2016

Automatic parallelization has shown promise in producing scalable multi-threaded programs for multi-core architectures. Most existing automatic techniques parallelize independent loops and insert global synchronization between loop invocations. For programs with many loop invocations, frequent synchronization often becomes the performance bottleneck. Some techniques exploit cross-invocation parallelism to overcome this problem. Using static analysis, they partition iterations among threads to avoid cross-thread dependences. However, this approach may fail if dependence pattern information is not available at compile time. To address this limitation, this work proposes SpecCross, the first automatic parallelization technique to exploit cross-invocation parallelism using speculation. With speculation, iterations from different loop invocations can execute concurrently, and the program synchronizes only on misspeculation. This allows SpecCross to adapt to dependence patterns that only manifest on particular inputs at runtime. Evaluation on eight programs shows that SpecCross achieves a geomean speedup of 3.43× over parallel execution without cross-invocation parallelization.

Automatically exploiting cross-invocation parallelism using runtime information

Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), 2013

Automatic parallelization is a promising approach to producing scalable multi-threaded programs for multicore architectures. Many existing automatic techniques only parallelize iterations within a loop invocation and synchronize threads at the end of each loop invocation. When parallel code contains many loop invocations, synchronization can easily become a performance bottleneck. Some automatic techniques address this problem by exploiting cross-invocation parallelism. These techniques use static analysis to partition iterations among threads to avoid cross-thread dependences. However, this partitioning is not always achievable at compile time, because program input determines dependence patterns at runtime. By contrast, this paper proposes DOMORE, the first automatic parallelization technique that uses runtime information to exploit additional cross-invocation parallelism. Instead of partitioning iterations statically, DOMORE dynamically detects cross-thread dependences and synchronizes only when necessary. DOMORE consists of a compiler and a runtime library. At compile time, DOMORE automatically parallelizes loops and inserts a custom runtime engine into programs. At runtime, the engine observes dependences and synchronizes iterations only when necessary. For six programs, DOMORE achieves a geomean loop speedup of 2.1× over parallel execution without cross-invocation parallelization and of 3.2× over sequential execution on eight cores.
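
The idea of synchronizing only on observed cross-thread dependences, rather than at a barrier after every loop invocation, can be illustrated with a small sketch. This is not DOMORE's runtime engine: the function name `schedule`, the round-robin assignment of iterations to threads, and the simplification that every access is a write are all assumptions made for illustration. The sketch only computes which waits would be needed; it does not run threads.

```python
from typing import Dict, List, Tuple

def schedule(invocations: List[List[List[int]]], num_threads: int):
    """Plan iterations across loop invocations with minimal waits.

    invocations[k][i] lists the addresses touched by iteration i of loop
    invocation k. Every access is treated as a write (an assumption that
    keeps the sketch short). Iterations are numbered globally and assigned
    round-robin to threads (also an assumption).

    Returns a list of (iteration_id, thread, waits), where waits is a
    sorted list of (other_thread, iteration_id) pairs that must finish
    before this iteration may start.
    """
    # address -> (thread, iteration id) of the most recent access
    last_access: Dict[int, Tuple[int, int]] = {}
    plan = []
    next_id = 0
    for invocation in invocations:
        for addresses in invocation:
            thread = next_id % num_threads
            waits = set()
            for addr in addresses:
                if addr in last_access:
                    prev_thread, prev_iter = last_access[addr]
                    # Same-thread accesses are already ordered by program
                    # order on that thread, so only cross-thread ones wait.
                    if prev_thread != thread:
                        waits.add((prev_thread, prev_iter))
                last_access[addr] = (thread, next_id)
            plan.append((next_id, thread, sorted(waits)))
            next_id += 1
    return plan

# Two invocations that touch the same addresses on the same threads:
# no waits are needed, although a barrier would still synchronize.
independent = schedule([[[0], [1]], [[0], [1]]], num_threads=2)

# Two invocations whose access pattern is swapped: each iteration of the
# second invocation waits for exactly one earlier iteration.
swapped = schedule([[[0], [1]], [[1], [0]]], num_threads=2)
```

In the first call no iteration has to wait on another thread, so a global barrier between the two invocations would be pure overhead. In the second call, each iteration of the second invocation waits on a single earlier iteration instead of on all threads.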

Profiling Data-Dependence to Assist Parallelization: Framework, Scope, and Optimization

2012 45th Annual IEEE/ACM International Symposium on Microarchitecture, 2012

This paper describes a tool using one or more executions of a sequential program to detect parallel portions of the program. The tool, called Parwiz, uses dynamic binary instrumentation, targets various forms of parallelism, and suggests distinct parallelization actions, ranging from simple directive tagging to elaborate loop transformations.

Extraction of massive instruction level parallelism

ACM SIGARCH Computer Architecture News, 1993

Our goal is to dramatically increase the performance of uniprocessors through the exploitation of instruction level parallelism, i.e. the parallelism that exists amongst the machine instructions of a program. Speculative execution may help a lot, but, it is argued, both branch prediction and eager execution are insufficient to achieve speedup factors in the tens (with respect to sequential execution) at reasonable hardware cost. A new form of code execution, Disjoint Eager Execution (DEE), is proposed which uses less hardware than pure eager execution and performs better than pure branch prediction; DEE is a continuum between branch prediction and eager execution. DEE is shown to be optimal when processing resources are constrained. Branches are predicted in DEE, but the predictions should be made in parallel in order to obtain high performance. This is not allowed, however, by the use of the standard instruction stream model, the dynamic model (the orde...

Detecting the existence of coarse-grain parallelism in general-purpose programs

2008

With the rise of chip-multiprocessors, the problem of parallelizing general-purpose programs has once again been placed on the research agenda. In the 1980s and early 1990s, great success was achieved in extracting parallelism from the inner loops of scientific computations. General-purpose programs, however, stayed out of reach due to the complexity of their control flow and data dependences. More recently, thread-level speculation (TLS) has been touted as the definitive solution for general-purpose programs. TLS again targets inner loops. The program complexity issue is handled by checking and resolving dependences at runtime using complex hardware support. However, results so far have been disappointing and limit studies predict very low potential speedups, in one study just 18%. In this paper we advocate a completely different approach. We show that significant amounts of coarse-grain parallelism exist in the outer program loops, even in general-purpose programs. This coarse-grain parallelism can be exploited efficiently on CMPs without additional hardware support. This paper presents a technique to extract coarse-grain parallelism from the outer program loops. Application of this technique to the MiBench and SPEC CPU2000 benchmarks shows that significant amounts of outer-loop parallelism exist. This leads to a speedup of 5.18 for bzip2 compression and 11.8 for an MPEG2 encoder on a Sun UltraSPARC T1 CMP. The parallelization effort was limited to 10 to 20 person-hours per benchmark while we had no prior knowledge of the programs.

Using thread-level speculation to simplify manual parallelization

ACM SIGPLAN Notices, 2003

In this paper, we provide examples of how thread-level speculation (TLS) simplifies manual parallelization and enhances its performance. A number of techniques for manual parallelization using TLS are presented and results are provided that indicate the performance contribution of each technique on seven SPEC CPU2000 benchmark applications. We also provide indications of the programming effort required to parallelize each benchmark. TLS parallelization yielded a 110% speedup on our four floating point applications and a 70% speedup on our three integer applications, while requiring only approximately 80 programmer hours and 150 lines of non-template code per application. These results support the idea that manual parallelization using TLS is an efficient way to extract fine-grain thread-level parallelism.

Factoring out ordered sections to expose thread-level parallelism

2009

With the rise of multi-core processors, researchers are taking a new look at extending the applicability of auto-parallelization techniques. In this paper, we identify a dependence pattern on which auto-parallelization currently fails. This dependence pattern occurs for ordered sections, i.e. code fragments in a loop that must be executed atomically and in original program order. We discuss why these ordered sections prohibit current auto-parallelizers from working and we present a technique to deal with them. We experimentally demonstrate the efficacy of the technique, yielding significant overall program speedups.
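
To make the notion of an ordered section concrete, here is a minimal sketch of a loop whose body is split into an independent part, run in parallel, and an ordered part, run one iteration at a time in original program order. It is one plausible illustration of the pattern, not the technique presented in the paper; the names `run_loop`, `parallel_part`, and `ordered_part` are hypothetical. It relies on `Executor.map` from Python's standard `concurrent.futures` module, which yields results in input order.

```python
from concurrent.futures import ThreadPoolExecutor

def run_loop(items, parallel_part, ordered_part, workers=4):
    """Run parallel_part concurrently, then ordered_part in program order.

    parallel_part(item) must be free of cross-iteration dependences.
    ordered_part(result) is the ordered section: it runs on the calling
    thread only, one iteration at a time, in the original iteration order.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the order of `items`, regardless of
        # which worker finishes first.
        for result in pool.map(parallel_part, items):
            ordered_part(result)

# Usage: square the numbers in parallel, append them in program order.
output = []
run_loop(range(5), lambda x: x * x, output.append)
```

Because the ordered part runs on a single thread, it is trivially atomic with respect to other iterations. The cost is that it is fully serialized, so the split only pays off when the independent part dominates the running time.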

Automatic Detection of Parallelism: A grand challenge for high performance computing

IEEE Parallel & Distributed Technology: Systems & Applications, 1994

The limited ability of compilers to find the parallelism in programs is a significant barrier to the use of high-performance computers. However, a combination of static and runtime techniques can improve compilers to the extent that a significant group of scientific programs can be parallelized automatically.

Detection of Function-level Parallelism

2007

While the chip multiprocessor (CMP) has quickly become the predominant processor architecture, its continuing success largely depends on the parallelizability of complex programs. We present a framework that is able to extract coarse-grain function-level parallelism that can exploit the parallel resources of the CMP.