Extraction of massive instruction level parallelism
Related papers
Exploiting instruction level parallelism in the presence of conditional branches
Cache prefetching with the assistance of an optimizing compiler is an effective means of reducing the penalty of long memory access times beyond the primary cache. However, cache prefetching can cause cache pollution, and its benefit can be unpredictable. A new architectural support for preloading, the preload buffer, is proposed in this paper. Unlike previously proposed methods of nonbinding cache loads, the preload is a binding access to the memory system. The preload buffer is simple in design and predictable in performance. With simple interleaving, accesses to the preload buffer are independent of the access pattern and processor issue rate, and are therefore free of bank conflicts. Trace-driven simulation shows that preloading hides memory latency better than either no prefetching or cache prefetching. In addition, both the bus traffic rate and the miss rate are reduced.
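A minimal C sketch of the interleaving idea described above (the bank count and word size are illustrative assumptions, not values from the paper): with low-order interleaving, consecutive word addresses map to consecutive banks, so a unit-stride stream of preloads touches each bank once per round and cannot collide, regardless of issue rate.

```c
#include <stdio.h>

#define NUM_BANKS  4    /* hypothetical bank count */
#define WORD_BYTES 8    /* hypothetical word size */

/* Low-order interleaving: consecutive words land in consecutive banks. */
static unsigned bank_of(unsigned long addr) {
    return (addr / WORD_BYTES) % NUM_BANKS;
}

int main(void) {
    /* A unit-stride preload stream: each group of NUM_BANKS consecutive
     * accesses hits every bank exactly once, so there are no conflicts. */
    for (unsigned long a = 0x1000; a < 0x1000 + 8 * WORD_BYTES; a += WORD_BYTES)
        printf("addr 0x%lx -> bank %u\n", a, bank_of(a));
    return 0;
}
```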
An Approach for Compiler Optimization to Exploit Instruction Level Parallelism
Instruction Level Parallelism (ILP) is not a new idea. Unfortunately, ILP architectures are not well suited to all conventional high-level-language compilers and their optimization techniques. Instruction Level Parallelism is a technique that allows a sequence of instructions derived from a sequential program (without rewriting it) to be parallelized for execution on multiple pipelined functional units. As a result, performance increases while existing software continues to run unmodified. At the implicit level this is achieved by modifying the compiler; at the explicit level it is done by exploiting the parallelism available in the hardware. To achieve a high degree of instruction level parallelism, it is necessary to analyze and evaluate the techniques of speculative execution and control dependence analysis, and to follow multiple flows of control. Researchers continue to discover ways to increase parallelism by an order of magnitude beyond current approaches. In this paper we present the impact of control-flow support on highly parallel 2-core and 4-core architectures. We also investigate the scope of parallelism both explicitly and implicitly. For our experiments we used the Trimaran simulator; the benchmarks are tested on abstract machine models created through it.
Exploiting instruction level parallelism in the presence of conditional branches
Speculative execution is an important source of parallelism for VLIW and superscalar processors. A serious challenge with compiler-controlled speculative execution is to efficiently handle exceptions for speculative instructions. In this paper, a set of architectural features and compile-time scheduling support collectively referred to as sentinel scheduling is introduced. Sentinel scheduling provides an effective framework for both compiler-controlled speculative execution and exception handling. All program exceptions are accurately detected and reported in a timely manner with sentinel scheduling. Recovery from exceptions is also ensured with the model. Experimental results show the effectiveness of sentinel scheduling for exploiting instruction-level parallelism and the overhead associated with exception handling.
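A minimal functional sketch of the sentinel idea (my illustration, not the paper's ISA): a speculative load records a would-be exception in a poison bit on its destination register instead of faulting immediately; a non-speculative sentinel in the instruction's home basic block reports the exception only if the speculated path is actually reached.

```c
#include <stdio.h>
#include <stdbool.h>

typedef struct { long value; bool poison; } Reg;  /* register + exception tag */

/* Speculative load: defer any exception by setting the poison bit. */
static void spec_load(Reg *dst, const long *addr) {
    if (addr == NULL) {              /* would-be exception: record, don't fault */
        dst->poison = true;
        return;
    }
    dst->value = *addr;
    dst->poison = false;
}

/* Sentinel in the home block: report the deferred exception, if any. */
static void sentinel_check(const Reg *r) {
    if (r->poison)
        fprintf(stderr, "deferred exception reported by sentinel\n");
}

int main(void) {
    long x = 42;
    Reg r1;
    spec_load(&r1, &x);      /* hoisted above its guarding branch */
    bool branch_taken = true;
    if (branch_taken)        /* home basic block reached */
        sentinel_check(&r1);
    return 0;
}
```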
Disjoint eager execution: an optimal form of speculative execution
Proceedings of the 28th Annual International Symposium on Microarchitecture, 1995
Instruction Level Parallelism (ILP) speedups of an order of magnitude or greater may be possible using the techniques described herein. Traditional speculative code execution is the execution of code down one path of a branch (branch prediction) or both paths of a branch (eager execution), before the condition of the branch has been evaluated, thereby executing code ahead of time and improving performance. A third, optimal, method of speculative execution, Disjoint Eager Execution (DEE), is described herein. A restricted form of DEE, easier to implement than pure DEE, is developed and evaluated. An implementation of both DEE and minimal control dependencies is described. DEE is shown both theoretically and experimentally to yield more parallelism than both branch prediction and eager execution when the same, finite, execution resources are assumed. ILP speedups of factors in the tens are demonstrated with constrained resources.
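A worked sketch of the allocation rule behind DEE (the uniform per-branch accuracy p and slot count are my assumptions; p is set artificially low so the mixture is visible): each unit of execution resource goes to the unexplored option with the highest cumulative probability. Deepening the predicted line to level k has probability p^k, while the alternate side of the branch at depth i has probability p^(i-1)(1-p); greedy selection over these values reproduces DEE's mix of mostly-predicted-path work with a few high-probability alternates.

```c
#include <stdio.h>
#include <math.h>

int main(void) {
    const double p = 0.7;   /* hypothetical per-branch prediction accuracy */
    const int slots = 8;    /* execution resources to allocate */
    int depth = 0;          /* levels of the predicted line explored so far */
    int alt = 1;            /* next alternate path to consider (at this depth) */

    for (int s = 0; s < slots; s++) {
        double go_deeper = pow(p, depth + 1);          /* extend predicted line */
        double take_alt  = pow(p, alt - 1) * (1.0 - p);/* wrong-path alternate  */
        if (go_deeper >= take_alt) {
            depth++;
            printf("slot %d: predicted path, depth %d (prob %.3f)\n",
                   s, depth, go_deeper);
        } else {
            printf("slot %d: alternate at depth %d (prob %.3f)\n",
                   s, alt, take_alt);
            alt++;
        }
    }
    return 0;
}
```

With realistic accuracies (p above roughly 0.9), the same rule spends nearly all slots on the predicted line, which is why DEE degenerates gracefully toward plain branch prediction when the predictor is good.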
Characterizing the Impact of Predicated Execution on Branch Prediction
Branch instructions are recognized as a major impediment to exploiting instruction level parallelism. Even with sophisticated branch prediction techniques, many frequently executed branches remain difficult to predict. An architecture supporting predicated execution may allow the compiler to remove many of these hard-to-predict branches, reducing the number of branch mispredictions and thereby improving performance. We present an in-depth analysis of the characteristics of those branches which are frequently mispredicted and examine the effectiveness of an advanced compiler at eliminating these branches. Over the benchmarks studied, an average of 27% of the dynamic branches and 56% of the dynamic branch mispredictions are eliminated with predicated execution support.
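A minimal C sketch of the transformation involved (my example, not from the paper): the data-dependent branch in branchy() is hard to predict, while the equivalent predicated() form expresses both outcomes as data flow, which a compiler can lower to a conditional move or a predicated instruction, eliminating the misprediction entirely.

```c
#include <stdio.h>

/* Branchy form: the if tests data-dependent, hard-to-predict values. */
static long branchy(const int *a, int n) {
    long sum = 0;
    for (int i = 0; i < n; i++) {
        if (a[i] < 0)
            sum -= a[i];
        else
            sum += a[i];
    }
    return sum;
}

/* If-converted form: the predicate selects between the two results. */
static long predicated(const int *a, int n) {
    long sum = 0;
    for (int i = 0; i < n; i++) {
        int neg = a[i] < 0;                 /* predicate define */
        sum += neg ? -(long)a[i] : a[i];    /* both paths as data flow */
    }
    return sum;
}

int main(void) {
    int a[] = { 3, -1, 4, -1, 5, -9, 2, 6 };
    int n = sizeof a / sizeof a[0];
    printf("%ld %ld\n", branchy(a, n), predicated(a, n));
    return 0;
}
```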
Single Instruction Fetch Does Not Inhibit Instruction-Level Parallelism
Superscalar machines fetch multiple scalar instructions per cycle from the instruction cache. However, machines that fetch no more than one instruction per cycle from the instruction cache, such as Dynamic Trace Scheduled VLIW (DTSVLIW) machines, have shown performance comparable to that of superscalars. In this paper, we present experiments showing that fetching a single instruction from the instruction cache per cycle allows the same performance as fetching multiple instructions per cycle, thanks to the execution locality present in programs. We also present the first direct comparison between the Superscalar, Trace Cache, and DTSVLIW architectures. Our results show that a DTSVLIW machine capable of executing up to 16 instructions per cycle can perform 21.9% better than a Superscalar and 6.6% better than a Trace Cache with equivalent hardware.
Branch classification to control instruction fetch in simultaneous multithreaded architectures
International Workshop on Innovative Architecture for Future Generation High-Performance Processors and Systems, 2002
In Simultaneous Multithreaded architectures, many separate threads run concurrently, sharing processor resources and thereby realizing a high utilization rate of the available hardware. However, this also implies that threads compete for resources, and in many cases this competition can actually degrade overall performance. There are two major causes for this: first, instructions that, because of a long-latency data cache miss, cause dependent instructions to stall for many cycles, thereby wasting space in the instruction queues; and second, the execution of instructions that belong to a mispredicted path. Both of these have a harmful effect on throughput, and the second moreover wastes energy.
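An illustrative C sketch of fetch gating driven by branch classification (the threshold, data structures, and the confidence test are my assumptions): a thread's fetch is suspended once it has too many unresolved branches classified as low-confidence in flight, so likely wrong-path instructions do not clog the shared queues.

```c
#include <stdio.h>
#include <stdbool.h>

#define MAX_LOWCONF 2   /* hypothetical gating threshold */

typedef struct {
    int id;
    int lowconf_branches;   /* unresolved branches classified low-confidence */
} Thread;

/* Gate fetch for a thread with too many risky branches outstanding. */
static bool may_fetch(const Thread *t) {
    return t->lowconf_branches < MAX_LOWCONF;
}

int main(void) {
    Thread threads[] = { {0, 0}, {1, 3}, {2, 1} };
    for (int i = 0; i < 3; i++)
        printf("thread %d: fetch %s\n", threads[i].id,
               may_fetch(&threads[i]) ? "allowed" : "gated");
    return 0;
}
```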
Speculative Parallelization in Decoupled Look-ahead
2011
While a canonical out-of-order engine can effectively exploit implicit parallelism in sequential programs, its effectiveness is often hindered by instruction- and data-supply imperfections manifested as branch mispredictions and cache misses. Accurate and deep look-ahead guided by a slice of the executed program is a simple yet effective approach to mitigating the performance impact of branch mispredictions and cache misses. Unfortunately, program-slice-guided look-ahead is often limited by the speed of the look-ahead code slice, especially for irregular programs. In this paper, we attempt to speed up the look-ahead agent using speculative parallelization, which is especially suited to the task. First, slicing for look-ahead tends to reduce the data dependences that prohibit successful speculative parallelization. Second, the look-ahead task is not correctness-critical and thus naturally tolerates dependence violations. This enables an implementation to forgo violation detection altogether, simplifying architectural support tremendously. In a straightforward implementation, incorporating speculative parallelization into the look-ahead agent further improves system performance by up to 1.39x, with an average of 1.13x.
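A simplified, single-threaded C illustration of why the look-ahead task tolerates inaccuracy (my construction, not the paper's hardware): a look-ahead "slice" runs over the data first and records branch outcomes as hints, skipping some of the work; the main computation then consumes the hints as predictions. A wrong hint costs performance, never correctness.

```c
#include <stdio.h>
#include <stdbool.h>

#define N 8

int main(void) {
    int a[N] = { 5, -2, 7, 0, -4, 9, 1, -6 };
    bool hint[N];

    /* Look-ahead slice: does only half the real work, defaulting the
     * skipped branches to "taken" -- imprecision here is harmless. */
    for (int i = 0; i < N; i++)
        hint[i] = (i % 2 == 0) ? (a[i] >= 0) : true;

    /* Main execution: computes the real result and merely counts how
     * often the hint would have steered speculation correctly. */
    int correct = 0;
    long sum = 0;
    for (int i = 0; i < N; i++) {
        bool taken = a[i] >= 0;        /* actual branch outcome */
        if (taken) sum += a[i];
        if (taken == hint[i]) correct++;
    }
    printf("sum=%ld, hints correct: %d/%d\n", sum, correct, N);
    return 0;
}
```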
Improving branch prediction and predicate execution in out-of-order processors
HPCA'07: Proceedings of the …, 2007
If-conversion is a compiler technique that reduces the misprediction penalties caused by hard-to-predict branches by transforming control dependencies into data dependencies. Although it is globally beneficial, it has a negative side effect: the removal of branches eliminates useful correlation information needed by conventional branch predictors, so the remaining branches may become harder to predict. However, in predicated ISAs with a compare-branch model, the correlation information resides not only in branches but also in the compare instructions that compute their guarding predicates. When a branch is removed, its correlation information is still available in its compare instruction. We propose a branch prediction scheme based on predicate prediction. It has three advantages. First, since prediction is done not on a branch basis but on a predicate-define basis, branch removal after if-conversion does not lose any correlation information, so accuracy is not degraded. Second, the mechanism we propose permits using the computed value of the branch predicate when available, instead of the predicted value, thus effectively achieving 100% accuracy on such early-resolved branches. Third, as shown in previous work, selective predicate prediction is a very effective technique for implementing if-conversion on out-of-order processors, since it avoids the problem of multiple register definitions and reduces the unnecessary resource consumption of nullified instructions. Hence, our approach enables a very efficient implementation of if-conversion for an out-of-order processor with almost no additional hardware cost, because the same hardware is used to predict the predicates of if-converted code and to predict branches without accuracy degradation.
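A minimal C sketch of the scheme's two key behaviors (table size, counter width, and the example PC are my assumptions): prediction is keyed on the predicate-defining compare rather than the branch, and whenever the compare has already executed, its computed value overrides the prediction, giving 100% accuracy on early-resolved branches.

```c
#include <stdio.h>
#include <stdbool.h>

#define TABLE_SIZE 256

static unsigned char counters[TABLE_SIZE];  /* 2-bit saturating counters */

/* Predict the predicate keyed by the PC of its defining compare;
 * prefer the real computed value if it is already available. */
static bool predict_predicate(unsigned define_pc,
                              bool value_ready, bool computed_value) {
    if (value_ready)
        return computed_value;              /* early-resolved: exact */
    return counters[define_pc % TABLE_SIZE] >= 2;
}

/* Train the counter for this predicate define on the actual outcome. */
static void update(unsigned define_pc, bool outcome) {
    unsigned char *c = &counters[define_pc % TABLE_SIZE];
    if (outcome  && *c < 3) (*c)++;
    if (!outcome && *c > 0) (*c)--;
}

int main(void) {
    unsigned pc = 0x400a10;                 /* hypothetical compare PC */
    for (int i = 0; i < 4; i++) update(pc, true);   /* training */
    printf("predicted: %d\n", predict_predicate(pc, false, false));
    /* Early-resolved case: the computed value overrides the table. */
    printf("early-resolved: %d\n", predict_predicate(pc, true, false));
    return 0;
}
```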
Achieving high levels of instruction-level parallelism with reduced hardware complexity
1997
Over the past two and a half decades, the computer industry has grown accustomed to, and has come to take for granted, the spectacular rate of increase of microprocessor performance, all of this without requiring a fundamental rewriting of the program in a parallel form, using a different algorithm or language, and often without even recompiling the program. The benefits of this have been enormous.