In-depth analysis of x86 instruction set condition codes influence on superscalar execution

Evaluating x86 condition codes impact on superscalar execution

2006

The design of instruction sets is a fundamental aspect of computer architecture. A critical requirement of instruction set design is to allow for concurrent execution, avoiding constructs that may produce data dependencies among instructions. It is therefore important to have methods and tools for evaluating the behavior of instruction sets and quantifying the influence of particular...

Evaluation of Instruction Sets for Superscalar Execution

Instruction set design is a fundamental aspect of computer architecture. A critical requirement of instruction set design is to allow for concurrent execution, avoiding constructs that may produce data dependencies. It is therefore important to have methods and tools for evaluating the behavior of instruction sets and quantifying the influence of particular architectural features on the overall available parallelism. We propose an analysis method that applies graph theory to gather metrics and evaluate the impact of different characteristics of instruction sets as sources of coupling, quantifying available parallelism. We present a case study using the x86 instruction set and measure the influence of condition flags on code coupling.
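As a rough illustration of the kind of measurement this abstract describes, the sketch below treats the condition flags as one more resource in a dependency graph and compares the longest dependency chain with and without them. It is not the authors' method: the function name `critical_path`, the `(writes, reads)` encoding, and the use of "instructions divided by critical path length" as a parallelism figure are all assumptions made for the example.

```python
def critical_path(block, ignore=frozenset()):
    """Length of the longest chain of true (read-after-write) dependencies.

    block  : list of (writes, reads) tuples of resource names (hypothetical encoding)
    ignore : resources left out of the analysis, e.g. {"flags"}
    """
    last_writer = {}   # resource name -> index of its most recent writer
    depth = []         # depth[i] = longest dependency chain ending at instruction i
    for i, (writes, reads) in enumerate(block):
        preds = [last_writer[r] for r in reads
                 if r not in ignore and r in last_writer]
        depth.append(1 + max((depth[p] for p in preds), default=0))
        for w in writes:
            if w not in ignore:
                last_writer[w] = i
    return max(depth, default=0)


# Multi-word addition: each adc reads the carry produced by the previous one.
block = [
    (("eax", "flags"), ("eax", "ebx")),           # add eax, ebx
    (("ecx", "flags"), ("ecx", "edx", "flags")),  # adc ecx, edx
    (("esi", "flags"), ("esi", "edi", "flags")),  # adc esi, edi
    (("ebp",), ()),                               # mov ebp, 1
]

with_flags = critical_path(block)                        # 3: add -> adc -> adc
without_flags = critical_path(block, ignore={"flags"})   # 1: no register coupling remains
parallelism_with = len(block) / with_flags
parallelism_without = len(block) / without_flags
```

Ignoring the flags here is a measurement device only: it shows how much of the coupling is due to the flags, not that the `adc` chain could legally run in parallel.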

Superscalar instruction issue

IEEE Micro, 1997

Clearly, instruction issue and execution are closely related: The more parallel the instruction execution, the higher the requirements for the parallelism of instruction issue. Thus, we see the continuous and harmonized increase of parallelism in instruction issue and execution. This article focuses on superscalar instruction issue, tracing the way parallel instruction execution and issue have increased performance. It also spans the design space of instruction issue, identifying important design aspects and available design choices. The article also demonstrates a concise way to represent the design space using DS trees, reviews the most frequently used issue schemes, and highlights trends for each design aspect of instruction issue.

Software Exploits of Instruction-level parallelism for Supercomputers

For decades, hardware algorithms have dominated the field of parallel processing. But with Moore's law reaching its limit, the need for software pipelining is being felt. This area has long eluded researchers. A significant measure of success has been obtained in graphics processing using software approaches to pipelining. This project aims to develop software that detects the various kinds of data dependencies (flow dependency, anti-dependency, and output dependency) in a basic code block. Graphs would be generated for each kind of dependency present in the block and combined into a single data dependency graph. This graph would be further processed to obtain a transitive closure graph and finally an ILP graph. The ILP graph can be used to predict which combinations of instructions may be executed in parallel. A scheduling algorithm would be developed to obtain an instruction schedule in the form of instruction execution start times. The schedule obtained would be used to compute performance metrics such as speed-up factor, efficiency, and throughput.
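The pipeline described above (classify dependencies, schedule, compute speed-up) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the project's software: the `Instr` record, the function names, the unit latency, and the unlimited number of functional units are all hypothetical choices, and the pairwise comparison is deliberately conservative (it also reports dependencies that an intervening write would make redundant).

```python
from collections import namedtuple

# Hypothetical instruction record: a name, registers written, registers read.
Instr = namedtuple("Instr", "name writes reads")


def dependencies(block):
    """Return (i, j, kind) edges, i < j, for every dependent pair in a basic block."""
    edges = []
    for j, later in enumerate(block):
        for i in range(j):
            earlier = block[i]
            if set(earlier.writes) & set(later.reads):
                edges.append((i, j, "flow"))     # read after write
            if set(earlier.reads) & set(later.writes):
                edges.append((i, j, "anti"))     # write after read
            if set(earlier.writes) & set(later.writes):
                edges.append((i, j, "output"))   # write after write
    return edges


def asap_schedule(block, edges, latency=1):
    """Earliest start time of each instruction, assuming unlimited functional units."""
    start = [0] * len(block)
    for j in range(len(block)):          # edges always point forward, so one pass suffices
        for i, k, _kind in edges:
            if k == j:
                start[j] = max(start[j], start[i] + latency)
    return start


def speedup(start, latency=1):
    """Sequential time divided by the length of the parallel schedule."""
    return (len(start) * latency) / (max(start) + latency)


block = [
    Instr("i0", ("r1",), ("r2", "r3")),  # r1 = r2 + r3
    Instr("i1", ("r4",), ("r1",)),       # r4 = r1 * 2   flow dependency on r1
    Instr("i2", ("r2",), ("r5",)),       # r2 = r5       anti-dependency on r2
    Instr("i3", ("r4",), ("r6",)),       # r4 = r6       output dependency on r4
]

edges = dependencies(block)
start = asap_schedule(block, edges)      # [0, 1, 1, 2]
```

Here `i1` and `i2` share start time 1, so they are candidates for parallel execution, and `speedup(start)` gives 4/3 for this block. Register renaming would remove the anti and output edges; the sketch keeps them because the text lists all three kinds.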

The impact of cache organisation on the instruction issue rate of a superscalar processor

Proceedings of the Seventh Euromicro Workshop on Parallel and Distributed Processing. PDP'99, 1999

Much of the research on multiple-instruction-issue processor architecture assumes a perfect memory hierarchy and concentrates on increasing the instruction issue rate of the processor either through aggressive out-of-order instruction issue or through static instruction scheduling. In this paper we describe a trace driven simulation tool that we have developed to quantify the impact of the memory hierarchy on the performance of a superscalar processor that we have developed to support static instruction scheduling. We describe some initial studies performed using our simulator. As well as examining the more conventional split cache configurations, we also quantify the performance impact of using a unified cache. Finally, we examine the benefits of using two-level caches and victim caches.

Instruction window size trade-offs and characterization of program parallelism

IEEE Transactions on Computers, 1994

Detecting independent operations is a prime objective for computers that are capable of issuing and executing multiple operations simultaneously. The number of instructions that are simultaneously examined for detecting those that are independent is the scope of concurrency detection. This paper presents an analytical model for predicting the performance impact of varying the scope of concurrency detection as a function of available resources, such as number of pipelines in a superscalar architecture. The model developed can show where a performance bottleneck might be: insufficient resources to exploit discovered parallelism, insufficient instruction stream parallelism, or insufficient scope of concurrency detection.

Extraction of massive instruction level parallelism

ACM SIGARCH Computer Architecture News, 1993

Our goal is to dramatically increase the performance of uniprocessors through the exploitation of instruction level parallelism, i.e. that parallelism which exists amongst the machine instructions of a program. Speculative execution may help a lot, but, it is argued, both branch prediction and eager execution are insufficient to achieve performances in speedup factors in the tens (with respect to sequential execution), with reasonable hardware costs. A new form of code execution, Disjoint Eager Execution (DEE), is proposed which uses less hardware than pure eager execution, and has more performance than pure branch prediction; DEE is a continuum between branch prediction and eager execution. DEE is shown to be optimal, when processing resources are constrained. Branches are predicted in DEE, but the predictions should be made in parallel in order to obtain high performance. This is not allowed, however, by the use of the standard instruction stream model, the dynamic model (the orde...

Single Instruction Fetch Does Not Inhibit Instruction-Level Parallelism

Superscalar machines fetch multiple scalar instructions per cycle from the instruction cache. However, machines that fetch no more than one instruction per cycle from the instruction cache, such as Dynamic Trace Scheduled VLIW (DTSVLIW) machines, have shown performances comparable to that of Superscalars. In this paper, we present experiments that show that fetching a single instruction from the instruction cache per cycle allows the same performance achieved fetching multiple instructions per cycle thanks to the execution locality present in programs. We also present the first direct comparison between the Superscalar, Trace Cache and DTSVLIW architectures. Our results show that a DTSVLIW machine capable of executing up to 16 instructions per cycle can perform 21.9% better than a Superscalar and 6.6% better than a Trace Cache with equivalent hardware.

Adding static data dependence collapsing to a high-performance instruction scheduler

Journal of Systems Architecture, 2001

State-of-the-art processors achieve high performance by executing multiple instructions in parallel. However, the parallel execution of instructions is ultimately limited by true data dependencies between individual instructions. The objective of this paper is to present and quantify the benefits of static data dependence collapsing, a non-speculative technique for reducing the impact of true data dependencies on program execution time. Data dependence collapsing involves combining a pair of instructions when the second instruction is directly dependent on the first. The two instructions are then treated as a single entity and are executed together in a single functional unit that is optimised to handle functions with three input operands instead of the traditional two inputs. Dependence collapsing can be accomplished either dynamically at run time or statically at compile time. Since dynamic dependence collapsing has been studied extensively elsewhere, this paper concentrates on static dependence collapsing. To quantify the benefits of static dependence collapsing, we added a new dependence collapsing option to the Hatfield Superscalar Scheduler (HSS), a state-of-the-art instruction scheduler that targets the Hatfield Superscalar Architecture (HSA). We demonstrate that the addition of dependence collapsing to HSS delivers a significant performance increase of up to 15%. Furthermore, since HSA already executes over four instructions in each processor cycle without dependence collapsing, dependence collapsing enables 0.4 additional instructions to be executed in each processor cycle.