Modelling instruction-level parallelism for software pipelining

SIRA: Schedule Independent Register Allocation for Software Pipelining

2001

Register allocation for loops is generally carried out during or after the software pipelining process, because performing it as a first step, without assuming a schedule, lacks information about the interferences between value live ranges; the register allocator then introduces extra false dependences that dramatically reduce the original ILP (Instruction Level Parallelism). In this paper, we give a new formulation that carries out register allocation before scheduling, directly on the data dependence graph, by inserting anti-dependence arcs (reuse edges). This graph extension is constrained, first, to minimize the critical cycle and hence the ILP loss due to register pressure, and second, to ensure that a cyclic register allocation with the set of available registers always exists, for any software pipelining of the new graph. We give an exact formulation of this problem with integer linear programming.
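The critical cycle that the formulation minimizes is the standard recurrence bound on the initiation interval: the maximum, over all dependence cycles, of total latency divided by total loop-carried distance. A brute-force Python sketch of that bound follows (the function name and edge encoding are illustrative, not from the paper; real compilers use minimum cost-to-time-ratio cycle algorithms instead of enumeration):

```python
# Sketch: recurrence-constrained initiation interval (MII) of a loop's
# data dependence graph, by enumerating simple cycles. Fine for tiny
# graphs; exponential in general.
from itertools import permutations
from math import ceil

def recurrence_mii(nodes, edges):
    """edges: dict (u, v) -> (latency, distance). Returns the max over
    dependence cycles of ceil(sum of latencies / sum of distances)."""
    best = 1
    for r in range(1, len(nodes) + 1):
        for perm in permutations(nodes, r):
            cycle = list(perm) + [perm[0]]
            pairs = list(zip(cycle, cycle[1:]))
            if all(p in edges for p in pairs):
                lat = sum(edges[p][0] for p in pairs)
                dist = sum(edges[p][1] for p in pairs)
                if dist > 0:
                    best = max(best, ceil(lat / dist))
    return best
```

For example, a two-node recurrence with total latency 5 spanning one iteration forces an initiation interval of at least 5; the extra anti-dependence arcs SIRA inserts add cycles to this graph, which is why the paper constrains them to keep this bound small.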

Enhanced co-scheduling: A software pipelining method using modulo-scheduled pipeline theory

2000

Instruction scheduling methods which use the concepts developed by the classical pipeline theory have been proposed for architectures involving deeply pipelined function units. These methods rely on the construction of state diagrams (or automata) to (i) efficiently represent the complex resource usage pattern, and (ii) analyze legal initiation sequences, i.e., those which do not cause a structural hazard. In this paper, we propose a state-diagram based approach for modulo scheduling or software pipelining, an instruction scheduling method for loops. Our approach adapts the classical pipeline theory for modulo scheduling, and, hence, the resulting theory is called Modulo-Scheduled pipeline (MS-pipeline) theory. The state diagram, called the Modulo-Scheduled (MS) state diagram, is helpful in identifying legal initiation or latency sequences that improve the number of instructions initiated in a pipeline. An efficient method, called Co-scheduling, which uses the legal initiation sequences as guidelines for constructing software pipelined schedules, is proposed in this paper. However, the complexity of the constructed MS-state diagram limits the usefulness of our Co-scheduling method.
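The classical-pipeline-theory ingredient the MS-state diagram builds on is the forbidden latency set: given a reservation table, a latency L is forbidden when two initiations L cycles apart would collide on some pipeline stage. A minimal sketch (the representation of the table as one set of busy cycles per stage is an assumption for illustration):

```python
def forbidden_latencies(table):
    """table: one set of busy cycles per pipeline stage.
    A latency L is forbidden if some stage is busy at two cycles
    exactly L apart, i.e. two initiations L apart would collide."""
    forbidden = set()
    for busy_cycles in table:
        for c1 in busy_cycles:
            for c2 in busy_cycles:
                if c1 > c2:
                    forbidden.add(c1 - c2)
    return forbidden
```

For a three-stage unit where stage 0 is busy at cycles 0 and 4, stage 1 at cycles 1 and 3, and stage 2 at cycle 2, the forbidden latencies are {2, 4}; legal initiation sequences avoid these gaps, and the state diagram enumerates which sequences remain legal as initiations accumulate.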

Reconciling repeatable timing with pipelining and memory hierarchy

… of the Workshop …, 2009

This paper argues that repeatable timing is more important and more achievable than predictable timing. It describes microarchitecture approaches to pipelining and memory hierarchy that deliver repeatable timing and promise performance comparable to or better than established techniques. Specifically, threads are interleaved in a pipeline to eliminate pipeline hazards, and a hierarchical memory architecture is outlined that hides memory latencies.

Heuristics for register-constrained software pipelining

Proceedings of the 29th Annual IEEE/ACM International Symposium on Microarchitecture. MICRO 29

Software pipelining is a loop scheduling technique that extracts parallelism from loops by overlapping the execution of several consecutive iterations. There has been a significant effort to produce throughput-optimal schedules under resource constraints, and more recently to produce throughput-optimal schedules with minimum register requirements. Unfortunately, even a throughput-optimal schedule with minimum register requirements is useless if it requires more registers than those available in the target machine. This paper evaluates several techniques for producing register-constrained modulo schedules: increasing the initiation interval (II) and adding spill code. We show that, in general, increasing the II performs poorly and might not converge for some loops. The paper also presents an iterative spilling mechanism that can be applied to any software pipelining technique, and proposes several heuristics in order to speed up the scheduling process.
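The first strategy the paper evaluates, raising the II until the register requirement fits, can be sketched as a simple search loop. Here `max_live` is a hypothetical model of the schedule's register requirement (MaxLive) as a function of II, standing in for actually rescheduling the loop at each II:

```python
def min_ii_for_registers(mii, regs, max_live, ii_limit=64):
    """Sketch of register fitting by II increase: starting from the
    minimum II, grow the II until the register requirement fits.
    Returns the first feasible II, or None if the search hits the
    limit -- mirroring the paper's observation that this strategy
    may not converge for some loops."""
    for ii in range(mii, ii_limit + 1):
        if max_live(ii) <= regs:
            return ii
    return None
```

With a toy model where MaxLive shrinks roughly as 60/II and 16 registers available, the search settles on II = 4; when MaxLive has a floor above the register file size (e.g. a long loop-invariant lifetime), no II fits and spill code is the only option, which motivates the paper's iterative spilling mechanism.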

On reducing misspeculations in a pipelined scheduler

2009 IEEE International Symposium on Parallel & Distributed Processing, 2009

Pipelining the scheduling logic, which exposes and exploits the instruction level parallelism, degrades processor performance. In a 4-issue processor, our evaluations show that pipelining the scheduling logic over two cycles degrades performance by 10% in SPEC-2000 integer benchmarks. Such a performance degradation is due to sacrificing the ability to execute dependent instructions in consecutive cycles. Speculative selection is a previously proposed technique that boosts the performance of a processor with a pipelined scheduling logic. However, this new speculation source increases the overall number of misspeculated instructions, and this useless work wastes energy. In this work we introduce a non-speculative mechanism named Dependence Level Scheduler (DLS) which not only tolerates the scheduling-logic latency but also reduces the number of misspeculated instructions with respect to a scheduler with speculative selection. In DLS, the selection of a group of one-cycle instructions (producer level) is overlapped with the early wakeup of its group of dependent instructions. DLS is not speculative because the group of instructions woken up in advance will compete for selection only after all producer-level instructions have issued. On average, DLS reduces the number of misspeculated instructions with respect to a speculative scheduler by 17.9%. From the IPC point of view, the speculative scheduler outperforms DLS by 0.3%. Moreover, we propose two non-speculative improvements to DLS.
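The level structure DLS relies on is dependence depth: an instruction's level is one more than the deepest of its producers, so each level contains only instructions whose operands come from earlier levels. A small sketch of that grouping (the function and its encoding are illustrative, not from the paper):

```python
def dependence_levels(deps):
    """deps: dict instruction -> set of producer instructions.
    Returns dict level -> list of instructions, where
    level(i) = 1 + max level over i's producers (1 if none).
    In DLS terms, while level L is being selected, level L+1 can be
    woken up in advance, non-speculatively."""
    level = {}
    def lv(i):
        if i not in level:
            level[i] = 1 + max((lv(p) for p in deps[i]), default=0)
        return level[i]
    for i in deps:
        lv(i)
    groups = {}
    for i, l in sorted(level.items()):
        groups.setdefault(l, []).append(i)
    return groups
```

For a diamond dependence (a feeds b and c, which both feed d), the levels are {a}, {b, c}, {d}: selection of {b, c} can overlap with waking {d}, and {d} only competes for selection once both b and c have issued.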

Register constrained modulo scheduling

IEEE Transactions on Parallel and Distributed Systems, 2004

Software pipelining is an instruction scheduling technique that exploits the instruction level parallelism (ILP) available in loops by overlapping operations from various successive loop iterations. The main drawback of aggressive software pipelining techniques is their high register requirements. If the requirements exceed the number of registers available in the target architecture, some steps need to be applied to reduce the register pressure (incurring some performance degradation): reducing iteration overlapping or spilling some lifetimes to memory. In the first part of this paper, we propose a set of heuristics to improve the spilling process and to better decide between adding spill code or directly decreasing the execution rate of iterations. The experimental evaluation, over a large number of representative loops and for a processor configuration, reports an increase in performance by a factor of 1.29 and a reduction of memory traffic by a factor of 1.36. In the second part of this paper, we analyze the use of backtracking and propose a novel approach for simultaneous instruction scheduling and register spilling in modulo scheduling: MIRS (Modulo Scheduling with Integrated Register Spilling). The experimental evaluation reports an increase in performance by a factor of 1.46 and a reduction of the memory traffic by a factor of 1.66 (or an additional 1.13 and 1.22 with regard to the proposal in the first part of the paper). These improvements are achieved at the expense of a reasonable increase in the compilation time.

On Improving a Pipelined Scheduling Logic

Pipelining the scheduling logic, which exposes and exploits the instruction level parallelism, degrades processor performance. Our evaluations show that pipelining the scheduling logic over two cycles degrades performance by 14% in SPEC-2000 integer benchmarks. Such a performance degradation is due to sacrificing the ability to execute dependent instructions in consecutive cycles. In this work we introduce a non-speculative mechanism named Dependence Level Scheduler (DLS) which tolerates the scheduling-logic latency. In DLS, the selection of a group of one-cycle instructions (producer level) is overlapped with the early wakeup of its group of dependent instructions. DLS is not speculative because the group of instructions woken up in advance will compete for selection only after all producer-level instructions have issued. Moreover, we compare it with a speculative mechanism. In SPEC-2000 integer benchmarks, DLS performs within 4.0% of an ideal (unpipelined) scheduler and, on average, outperforms the speculative mechanism.

Efficient instruction scheduling for a pipelined architecture

Sigplan Notices, 1986

As part of an effort to develop an optimizing compiler for a pipelined architecture, a code reorganization algorithm has been developed that significantly reduces the number of runtime pipeline interlocks. In a pass after code generation, the algorithm uses a dag representation to heuristically schedule the instructions in each basic block.
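The core of such a reorganization pass is a greedy list scheduler over the basic block's DAG: at each step, among the instructions whose predecessors have all been scheduled, prefer one whose operands are already available so the pipeline does not interlock. A single-issue Python sketch (the data encoding and tie-breaking rule are assumptions for illustration, not the paper's algorithm):

```python
def list_schedule(deps, latency):
    """deps: dict instr -> set of predecessor instrs in the DAG.
    latency: dict instr -> cycles until its result is available.
    Returns (issue order, issue cycle per instr), single-issue."""
    done, issue_cycle, order = set(), {}, []
    cycle = 0
    while len(done) < len(deps):
        ready = [i for i in deps if i not in done and deps[i] <= done]
        # Prefer an instruction whose operands are already available,
        # so no interlock occurs this cycle.
        no_stall = [i for i in ready
                    if all(issue_cycle[p] + latency[p] <= cycle
                           for p in deps[i])]
        if no_stall:
            pick = min(no_stall)          # arbitrary deterministic tie-break
        else:                             # unavoidable stall: pick earliest-ready
            pick = min(ready, key=lambda i: max(
                (issue_cycle[p] + latency[p] for p in deps[i]),
                default=cycle))
        cycle = max(cycle, max((issue_cycle[p] + latency[p]
                                for p in deps[pick]), default=cycle))
        issue_cycle[pick] = cycle
        order.append(pick)
        done.add(pick)
        cycle += 1
    return order, issue_cycle
```

With a 2-cycle load feeding an add, plus an independent mul, the scheduler issues the mul into the load's delay slot (load, mul, add in cycles 0-2), where source order (load, add, mul) would interlock for one cycle.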

Resource-constrained software pipelining

IEEE Transactions on Parallel and Distributed Systems, 1995

This paper presents a software pipelining algorithm for the automatic extraction of fine-grain parallelism in general loops. The algorithm accounts for machine resource constraints in a way that smoothly integrates the management of resource constraints with software pipelining. Furthermore, generality in the software pipelining algorithm is not sacrificed to handle resource constraints, and scheduling choices are made with truly global information. Proofs of correctness and the results of experiments with an implementation are also presented.
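The standard device for integrating resource constraints into software pipelining is the modulo reservation table: because the schedule repeats every II cycles, an operation using a resource at cycle c occupies that resource at c mod II in every iteration, so no two uses of the same resource may map to the same slot. A minimal legality check (the encoding of placements is an assumption for illustration):

```python
def mrt_conflict_free(ii, placements):
    """placements: list of (resource, cycle) uses in the flat schedule.
    The schedule repeats every `ii` cycles, so two uses of the same
    resource conflict iff they share (resource, cycle mod ii)."""
    seen = set()
    for resource, cycle in placements:
        slot = (resource, cycle % ii)
        if slot in seen:
            return False
        seen.add(slot)
    return True
```

For example, with II = 2, memory-port uses at cycles 0 and 1 coexist, but uses at cycles 0 and 2 fold onto the same slot and conflict, forcing the scheduler to move an operation or raise the II.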

Instruction Scheduling in the Presence of Structural Hazards: An Integer Programming Approach to Software Pipelining

Software pipelining is an efficient instruction scheduling method to exploit the multiple-instruction issue capability of modern VLIW architectures. In this paper we develop a precise mathematical formulation based on ILP (Integer Linear Programming) for the software pipelining problem for architectures involving structural hazards. Compared to other heuristic methods as well as an ILP-based method [1], a distinct feature of the proposed formulation is that it uses classical pipeline theory, in particular the results relating to the forbidden latency set.