Context Switching with Multiple Register Windows: A RISC Performance Study

Reducing Instruction Fetch Cost by Packing Instructions into Register Windows

38th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'05), 2005

Instruction packing is a combination compiler/architectural approach that allows for decreased code size, reduced power consumption and improved performance. The packing is obtained by placing frequently occurring instructions into an Instruction Register File (IRF). Multiple IRF entries can then be accessed using special packed instructions. Previous IRF efforts focused on using a single 32-entry register file for the duration of an application. This paper presents software and hardware extensions to the IRF supporting multiple instruction register windows to allow a greater number of relevant instructions to be available for packing in each function. Windows are shared among similar functions to reduce the overall costs involved in such an approach. The results indicate that significant improvements in instruction fetch cost can be obtained by using this simple architectural enhancement. We also show that using an IRF with a loop cache, which is also used to reduce energy consumption, results in much less energy consumption than using either feature in isolation.

Reducing Context Switch Overhead with Compiler-Assisted Threading

2008 IEEE/IFIP International Conference on Embedded and Ubiquitous Computing, 2008

Multithreading is an important software modularization technique. However, it can incur substantial overheads, especially in processors where the amount of architecturally visible state is large.

Minimizing register usage penalty at procedure calls

ACM SIGPLAN Notices, 1988

Inter-procedural register allocation can minimize the register usage penalty at procedure calls by reducing the saving and restoring of registers at procedure boundaries. A one-pass inter-procedural register allocation scheme based on processing the procedures in a depth-first traversal of the call graph is presented. This scheme can be overlaid on top of intra-procedural register allocation via a simple extension to the priority-based coloring algorithm. Using two different usage conventions for the registers, the scheme can distribute register saves/restores throughout the call graph even in the presence of recursion, indirect calls or separate compilation. A natural and efficient way to pass parameters emerges from this scheme. A separate technique uses data flow analysis to optimize the placement of the save/restore code for registers within individual procedures. The techniques described have been implemented in a production compiler suite. Measurements of the effects of these techniques on a set of practical programs are presented and the results analysed.
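The depth-first idea in this abstract can be illustrated with a small sketch. It is not the paper's algorithm: it assumes an acyclic call graph, ignores the priority-based coloring step, and uses hypothetical names (`allocate`, `call_graph`, `need`, `num_regs`). Each procedure is processed after its callees and prefers registers that nothing below it in the call graph uses, so those registers need no save/restore around calls.

```python
def allocate(call_graph, need, num_regs):
    """Toy bottom-up allocation over an acyclic call graph (illustrative only).

    call_graph: dict mapping a procedure to the list of procedures it calls.
    need:       dict mapping a procedure to how many registers it wants.
    Returns a dict: procedure -> (conflict-free registers, count that
    would still require save/restore code).
    """
    used_in_subtree = {}  # registers used by a procedure and everything it calls
    assignment = {}

    def visit(proc):
        if proc in used_in_subtree:
            return used_in_subtree[proc]
        below = set()
        for callee in call_graph.get(proc, []):
            below |= visit(callee)  # depth-first: callees are processed first
        free = [r for r in range(num_regs) if r not in below]
        mine = free[:need[proc]]
        needs_save = need[proc] - len(mine)  # shortfall must be saved/restored
        assignment[proc] = (mine, needs_save)
        used_in_subtree[proc] = below | set(mine)
        return used_in_subtree[proc]

    for proc in call_graph:
        visit(proc)
    return assignment
```

With four registers and a graph where `main` calls `f` and `g`, and `f` calls `h`, the leaf procedures get conflict-free registers and only `main` is left with a register that would need save/restore code.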

Instruction window size trade-offs and characterization of program parallelism

IEEE Transactions on Computers, 1994

Detecting independent operations is a prime objective for computers that are capable of issuing and executing multiple operations simultaneously. The number of instructions that are simultaneously examined for detecting those that are independent is the scope of concurrency detection. This paper presents an analytical model for predicting the performance impact of varying the scope of concurrency detection as a function of available resources, such as number of pipelines in a superscalar architecture. The model developed can show where a performance bottleneck might be: insufficient resources to exploit discovered parallelism, insufficient instruction stream parallelism, or insufficient scope of concurrency detection.

Improving Program Efficiency by Packing Instructions into Registers

ACM SIGARCH Computer Architecture News, 2005

New processors, both embedded and general purpose, often have conflicting design requirements involving space, power, and performance. Architectural features and compiler optimizations often target one or more design goals at the expense of the others. This paper presents a novel architectural and compiler approach to simultaneously reduce power requirements, decrease code size, and improve performance by integrating an instruction register file (IRF) into the architecture. Frequently occurring instructions are placed in the IRF. Multiple entries in the IRF can be referenced by a single packed instruction in ROM or L1 instruction cache. Unlike conventional code compression, our approach allows the frequent instructions to be referenced in arbitrary combinations. The experimental results show significant improvements in space and power, as well as some improvement in execution time when using only 32 entries. These advantages make packing instructions into registers an effective approach for improving overall efficiency.
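As a rough illustration of the packing idea described in this abstract, the sketch below picks the most frequent instructions for a register file and then replaces runs of those instructions with a single packed reference. The function names, the tuple encoding, and the `max_per_pack` limit are assumptions made for the example; the abstract does not specify the paper's instruction formats or selection heuristics.

```python
from collections import Counter


def build_irf(program, irf_size=32):
    """Choose the irf_size most frequent instructions by static count."""
    counts = Counter(program)
    return [ins for ins, _ in counts.most_common(irf_size)]


def pack(program, irf, max_per_pack=5):
    """Replace runs of IRF-resident instructions with packed references.

    Returns a list of ("pack", (irf indices...)) and ("plain", instruction)
    entries; each entry stands for one fetched instruction word.
    """
    index = {ins: i for i, ins in enumerate(irf)}
    packed, run = [], []

    def flush():
        if run:
            packed.append(("pack", tuple(run)))
            run.clear()

    for ins in program:
        if ins in index:
            run.append(index[ins])
            if len(run) == max_per_pack:  # packed word is full
                flush()
        else:
            flush()  # a non-IRF instruction ends the current run
            packed.append(("plain", ins))
    flush()
    return packed
```

For the six-instruction sequence `add, add, lw, add, add, sw` with a one-entry register file holding `add` and `max_per_pack=2`, the packed form has four entries instead of six, which is the kind of fetch reduction the abstract refers to.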

Scalar Program Performance on Multiple-Instruction-Issue Processors with a Limited Number of Registers

In this paper the performance of multiple-instruction-issue processors with variable register file sizes is examined for a set of scalar programs. We make several important observations. First, multiple-instruction-issue processors can perform effectively without a large number of registers. In fact, the register files of many existing architectures (16-32 registers) are capable of sustaining a high instruction execution rate. Second, even for small register files (8-12 registers), substantial performance gains can be obtained by increasing the issue rate of a processor. In general, the percentage increase in performance achieved by increasing the issue rate is relatively constant for all register file sizes. Finally, code transformations designed for multiple-instruction-issue processors are found to be effective for all register file sizes; however, for small register files, the performance improvement is limited due to the excessive spill code introduced by the transformations.

Multiple register window file for lisp-oriented RISC architectures

Microprocessors and Microsystems, 1988

This paper proposes a multiple register window organization suitable for LISP-oriented architectures. Various Lisp programs are studied to determine the statistics of free and bound variables, as well as the statistics of depth of nesting of LISP procedures. On the basis of this study a multiple register window organization consisting of 64 registers is determined which is suitable for LISP programs. Various strategies to manage register windows in a LISP environment are analysed to determine the best strategy. The results obtained indicate that the depth of nesting of any program can be predicted from its behaviour. The VLSI hardware implementation of the proposed multiple register file for a LISP-oriented architecture is described.

Keywords: RISCs, LISP, register windows

Much current research is directed at specialized hardware for artificial intelligence (AI) applications, with the main emphasis being on the realization of high-performance AI machines, but usually based on conventional approaches: the Lisp machine at MIT [1], the Symbolics 3600 computer [2], a LISP architecture [3] at the University of Illinois, USA, and others. A multiprocessor workstation called SPUR for parallel processing in LISP has been developed at the University of California at Berkeley [4, 5], USA, and its design is based on RISC principles. In this paper we also apply the RISC philosophy [6] to the architecture and implementation of a LISP processor. One of the techniques used commonly in RISC architectures is a multiple register window file. The Berkeley RISC processors [7-9] and the Pyramid computer [10] use the
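To make the role of nesting depth concrete, here is a minimal sketch of a generic register-window file, not the 64-register organization proposed in the paper. It counts how often a call/return trace would force a window to be spilled to memory or refilled from it. The function name, the `+1`/`-1` trace encoding, and the simplification that every window is usable are assumptions for illustration.

```python
def count_window_traps(trace, num_windows):
    """Count window overflows and underflows for a call/return trace.

    trace:       iterable of +1 (procedure call) and -1 (return), assumed balanced.
    num_windows: number of windows the register file can hold at once.
    Returns (overflows, underflows).
    """
    resident = 1  # windows currently held in the register file
    overflows = underflows = 0
    for step in trace:
        if step == 1:
            if resident == num_windows:
                overflows += 1   # oldest window must be spilled to memory
            else:
                resident += 1
        else:
            if resident == 1:
                underflows += 1  # caller's window must be reloaded from memory
            else:
                resident -= 1
    return overflows, underflows
```

Under these assumptions, a trace that nests three calls deep and then returns causes two overflows and two underflows with two windows, and none with four, which is why the nesting-depth statistics mentioned in the abstract matter when sizing the window file.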

Fast context switches: compiler and architectural support for preemptive scheduling

Microprocessors and Microsystems, 1995

This paper addresses the possibility of reducing the overhead due to preemptive context switching in real-time systems that use preemptive scheduling. The method introduced in this paper attempts to avoid saving and restoring registers by performing context switches at points in the program where only a small subset of the registers are live. When context switches occur frequently, which is the case in some real-time systems, performing context switches at fast context switch points is found to significantly reduce the total number of memory references. A new technique, known as register remapping, is introduced which increases the number of these fast context switch points without degrading the efficiency of the code.
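The notion of a fast context switch point can be sketched with a standard backward liveness pass over straight-line code. This is an illustration under simplifying assumptions (no branches, a hypothetical `(defs, uses)` instruction encoding, and a hypothetical `max_live` threshold), and it does not model the paper's register remapping technique.

```python
def live_before(instrs, live_out=frozenset()):
    """Return the set of live registers immediately before each instruction.

    instrs: list of (defs, uses) pairs, each a set of register names.
    """
    live = set(live_out)
    result = [None] * len(instrs)
    for i in range(len(instrs) - 1, -1, -1):  # walk backwards from the end
        defs, uses = instrs[i]
        live = (live - set(defs)) | set(uses)
        result[i] = frozenset(live)
    return result


def fast_switch_points(instrs, max_live, live_out=frozenset()):
    """Indices where at most max_live registers would need to be saved."""
    return [i for i, live in enumerate(live_before(instrs, live_out))
            if len(live) <= max_live]
```

For a four-instruction sequence (two loads, an add of the two loaded values, and a store of the sum), the point just before the add has two live registers, while every other point has at most one, so a switch at those other points would save less state.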

High-level control flow transformations for performance improvement of address-dominated multimedia applications

This paper describes a set of novel high-level control flow transformations for performance improvement of typical address-dominated multimedia applications. We show that these transformations applied at the source code level can have a very large impact on execution time at the cost of limited overhead in code size for a broad range of instruction set processor families (i.e., CISC, RISC, DSP, VLIW, ...). For a profound evaluation, all transformations are applied to the C-codes of two real-life applications selected from the video and image processing domains. A detailed analysis of the effect of the transformations is done by compiling and executing the transformed programs on seven different programmable processors. The measured runtimes indicate quite significant improvements in all processor families when comparing the performance of the transformed codes to their initial version even when these are compiled using their native optimizing compilers with their most aggressive optimization features enabled. The average gains in execution time range from 40.2% to 87.7% depending on the driver, with an average overhead in code size between 21.1% and 100.9%.

Ruby B. Lee, A. Murat Fiskiran, Zhijie Shi and Xiao Yang, "Refining Instruction Set Architecture for

Multimedia processing in software has been significantly accelerated by the addition of subword-parallel instructions to the instruction set architectures (ISAs) of modern microprocessors. While some of these multimedia instructions are simple and effective, others are very complex, requiring large, special-purpose functional units that are not practical for constrained environments such as handheld multimedia information appliances. For such environments, low-power and low-cost are as important as the high performance required for real-time multimedia processing and the general-purpose programmability required to support an ever growing range of applications. In this paper, we introduce PLX, a concise ISA that selects the most useful features from the first two generations of multimedia instructions added to microprocessors, and explores new ISA features for high-performance yet low-cost multimedia processing with small footprint processors. PLX is unique in that it is designed from scratch as a fully subword-parallel architecture with novel features like datapath scalability from 32-bit to 128-bit words, and a new definition of predication for reducing conditional branches. We illustrate the use of PLX's architectural features with four frequently used multimedia kernels: discrete cosine transform, pixel padding, clip test and median filter. Our performance results show that a 64-bit PLX implementation achieves significant speedups compared to a basic 64-bit RISC processor and to IA-32 processors with MMX and SSE multimedia extensions. PLX's datapath scalability feature often provides an additional 2x speedup in a cost-effective way.