Single instruction stream parallelism is greater than two

Achieving high levels of instruction-level parallelism with reduced hardware complexity

1997

Over the past two and a half decades, the computer industry has grown accustomed to, and has come to take for granted, the spectacular rate of increase of microprocessor performance, all of this without requiring a fundamental rewriting of the program in a parallel form, the use of a different algorithm or language, and often without even recompiling the program. The benefits of this have been enormous.

Instruction window size trade-offs and characterization of program parallelism

IEEE Transactions on Computers, 1994

Detecting independent operations is a prime objective for computers that are capable of issuing and executing multiple operations simultaneously. The number of instructions that are simultaneously examined for detecting those that are independent is the scope of concurrency detection. This paper presents an analytical model for predicting the performance impact of varying the scope of concurrency detection as a function of available resources, such as the number of pipelines in a superscalar architecture. The model developed can show where a performance bottleneck might be: insufficient resources to exploit discovered parallelism, insufficient instruction stream parallelism, or insufficient scope of concurrency detection.
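
As a rough illustration of the trade-off this abstract describes, the sketch below simulates greedy issue from a finite instruction window. It is not the paper's analytical model; the random dependence trace and all parameters (make_trace, simulate, the window and width values) are invented for illustration. It shows how IPC saturates when either the scope of concurrency detection (window size) or the issue resources become the bottleneck.

```python
# Toy experiment: IPC as a function of the scope of concurrency detection
# (window size W) and issue width, on a synthetic dependence trace.
import random

def make_trace(n, max_deps=2, span=8, seed=0):
    """Each instruction depends on up to `max_deps` earlier ones within `span`."""
    rng = random.Random(seed)
    trace = []
    for i in range(n):
        lo = max(0, i - span)
        k = min(i - lo, rng.randint(0, max_deps))
        trace.append(rng.sample(range(lo, i), k))
    return trace

def simulate(trace, window, width):
    """Each cycle, scan the oldest `window` unissued instructions and issue
    up to `width` whose dependences completed in an earlier cycle."""
    n = len(trace)
    done_cycle = [None] * n          # cycle in which instruction i completes
    head, cycles = 0, 0
    while head < n:
        cycles += 1
        issued = 0
        i = head
        while i < min(head + window, n) and issued < width:
            if done_cycle[i] is None and all(
                done_cycle[d] is not None and done_cycle[d] < cycles
                for d in trace[i]
            ):
                done_cycle[i] = cycles   # unit latency
                issued += 1
            i += 1
        while head < n and done_cycle[head] is not None:
            head += 1                    # retire the completed prefix
    return n / cycles                    # IPC

trace = make_trace(10_000)
for window in (4, 16, 64, 256):
    for width in (2, 4, 8):
        print(f"W={window:3d} width={width}: IPC={simulate(trace, window, width):.2f}")
```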

Measuring the Parallelism Available for Very Long Instruction Word Architectures

IEEE Transactions on Computers, 1984

Long instruction word architectures, such as attached scientific processors and horizontally microcoded CPUs, are a popular means of obtaining code speedup via fine-grained parallelism. The falling cost of hardware holds out the hope of using these architectures for much more parallelism. But this hope has been diminished by experiments measuring how much parallelism is available in the code to start with. These experiments implied that even if we had infinite hardware, long instruction word architectures could not provide a speedup of more than a factor of 2 or 3 on real programs. These experiments measured only the parallelism within basic blocks. Given the machines that prompted them, it made no sense to measure anything else. Now it does. A recently developed code compaction technique, called trace scheduling [9], could exploit parallelism in operations even hundreds of blocks apart. Does such parallelism exist? In this paper we show that it does. We did analogous experiments, but we disregarded basic block boundaries. We found huge amounts of parallelism available. Our measurements were made on standard Fortran programs in common use. The actual programs tested averaged about a factor of 90 parallelism. It ranged from about a factor of 4 to virtually unlimited amounts, restricted only by the size of the data. An important question is how much of this parallelism can actually be found and used by a real code generator. In the experiments, an oracle is used to resolve dynamic questions at compile time. It tells us which way jumps went and whether indirect references are to the same or different locations. Trace scheduling attempts to get the effect of the oracle at compile time with static index analysis and dynamic estimates of jump probabilities. We argue that most scientific code is so static that the oracle is fairly realistic. A real trace-scheduling code generator [7] might very well be able to find and use much of this parallelism.
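
A minimal sketch of the kind of oracle measurement this abstract describes, under the assumption that the dynamic trace is available as a list of true data dependences (branches and memory aliases already resolved by the oracle): with every operation scheduled as early as possible, available parallelism is the trace length divided by the dependence-graph critical path.

```python
# Oracle parallelism sketch: only true data dependences remain, so each
# operation executes as soon as its operands are ready (unit latency).
# The trace format (list of dependence lists) is an assumption for illustration.

def oracle_parallelism(trace):
    """trace[i] = indices of instructions whose results instruction i reads."""
    ready = [0] * len(trace)            # earliest cycle each result is ready
    for i, deps in enumerate(trace):
        ready[i] = 1 + max((ready[d] for d in deps), default=0)
    critical_path = max(ready, default=1)
    return len(trace) / critical_path

# A dependence chain of length 3 plus seven independent ops:
# 10 operations / 3-cycle critical path = parallelism of about 3.3.
trace = [[], [0], [1]] + [[] for _ in range(7)]
print(f"available parallelism = {oracle_parallelism(trace):.1f}")
```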

Converting thread-level parallelism to instruction-level parallelism via simultaneous multithreading

ACM Transactions on Computer Systems, 1997

To achieve high performance, contemporary computer systems rely on two forms of parallelism: instruction-level parallelism (ILP) and thread-level parallelism (TLP). Wide-issue superscalar processors exploit ILP by executing multiple instructions from a single program in a single cycle. Multiprocessors (MP) exploit TLP by executing different threads in parallel on different processors. Unfortunately, both parallel processing styles statically partition processor resources, thus preventing them from adapting to dynamically changing levels of ILP and TLP in a program. With insufficient TLP, processors in an MP will be idle; with insufficient ILP, multiple-issue hardware on a superscalar is wasted. This article explores parallel processing on an alternative architecture, simultaneous multithreading (SMT), which allows multiple threads to compete for and share all of the processor's resources every cycle. The most compelling reason for running parallel applications on an SMT processor is its ability to use thread-level parallelism and instruction-level parallelism interchangeably.
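
A toy model of the resource-sharing argument, with invented per-thread ILP distributions: each cycle an SMT core fills its issue slots from whichever threads have ready instructions, while a superscalar is limited to one thread's ILP. All numbers below are illustrative assumptions, not measurements from the article.

```python
# Toy SMT issue-slot sharing model: a thread exposes a random amount of
# ILP each cycle; a superscalar uses only one thread's ILP, while SMT
# fills its slots from all threads, converting TLP into usable ILP.
import random

rng = random.Random(1)
WIDTH, THREADS, CYCLES = 8, 4, 100_000

superscalar = smt = 0
for _ in range(CYCLES):
    ilp = [rng.randint(0, 4) for _ in range(THREADS)]  # ready insts per thread
    superscalar += min(WIDTH, ilp[0])   # single thread only
    smt += min(WIDTH, sum(ilp))         # slots shared across all threads

print(f"superscalar IPC = {superscalar / CYCLES:.2f}")
print(f"SMT IPC         = {smt / CYCLES:.2f}")
```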

Extraction of massive instruction level parallelism

ACM SIGARCH Computer Architecture News, 1993

Our goal is to dramatically increase the performance of uniprocessors through the exploitation of instruction level parallelism, i.e., the parallelism that exists among the machine instructions of a program. Speculative execution may help a lot, but, it is argued, both branch prediction and eager execution are insufficient to achieve speedups by factors in the tens (with respect to sequential execution) with reasonable hardware costs. A new form of code execution, Disjoint Eager Execution (DEE), is proposed which uses less hardware than pure eager execution and achieves higher performance than pure branch prediction; DEE is a continuum between branch prediction and eager execution. DEE is shown to be optimal when processing resources are constrained. Branches are predicted in DEE, but the predictions should be made in parallel in order to obtain high performance. This is not allowed, however, by the use of the standard instruction stream model, the dynamic model (the order…
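
The sketch below captures one reading of DEE's allocation rule; it is not the paper's implementation, and the parameters are invented. Limited speculative resources are always spent on the not-yet-executed path with the highest cumulative probability. With highly predictable branches this reduces to following the single predicted path (pure branch prediction); as accuracy approaches 0.5 it approaches breadth-first eager execution, making the continuum concrete.

```python
# Disjoint Eager Execution allocation sketch: extend the pending branch
# path with the highest cumulative probability until resources run out.
import heapq

def dee_allocate(units, accuracy):
    """Return (depth, cumulative probability) of each speculative unit spent."""
    heap = [(-1.0, 0)]                  # root path: probability 1, depth 0
    spent = []
    while heap and len(spent) < units:
        neg_p, depth = heapq.heappop(heap)
        p = -neg_p
        spent.append((depth, round(p, 3)))
        heapq.heappush(heap, (-(p * accuracy), depth + 1))        # predicted side
        heapq.heappush(heap, (-(p * (1 - accuracy)), depth + 1))  # other side
    return spent

# accuracy 0.95: all 8 units follow the predicted path (= branch prediction);
# accuracy 0.7: alternate paths start to receive resources (toward eager).
for acc in (0.95, 0.7):
    print(f"accuracy={acc}:", dee_allocate(units=8, accuracy=acc))
```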

Single Instruction Fetch Does Not Inhibit Instruction-Level Parallelism

Superscalar machines fetch multiple scalar instructions per cycle from the instruction cache. However, machines that fetch no more than one instruction per cycle from the instruction cache, such as Dynamic Trace Scheduled VLIW (DTSVLIW) machines, have shown performance comparable to that of Superscalars. In this paper, we present experiments showing that fetching a single instruction from the instruction cache per cycle allows the same performance as fetching multiple instructions per cycle, thanks to the execution locality present in programs. We also present the first direct comparison between the Superscalar, Trace Cache, and DTSVLIW architectures. Our results show that a DTSVLIW machine capable of executing up to 16 instructions per cycle can perform 21.9% better than a Superscalar and 6.6% better than a Trace Cache with equivalent hardware.
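
A back-of-envelope model (an assumption for illustration, not taken from the paper) of why single-instruction fetch need not hurt: if a fraction h of dynamic instructions is served from the already-scheduled VLIW cache at full width, only the cold one-instruction-per-cycle path is fetch-limited, so high execution locality hides the narrow fetch almost entirely.

```python
# Amdahl-style model of DTSVLIW fetch: cold path fetches 1 inst/cycle from
# the I-cache; re-executed code hits the VLIW cache and sustains hot_ipc.
# The hit rates and IPC values below are hypothetical.

def effective_ipc(h, width, hot_ipc):
    """h: fraction of dynamic instructions served from the VLIW cache."""
    cpi = (1 - h) * 1.0 + h / min(hot_ipc, width)
    return 1 / cpi

for h in (0.9, 0.99, 0.999):
    print(f"VLIW-cache hit rate {h:5.3f}: effective IPC = {effective_ipc(h, 16, 8):.2f}")
```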

Quantifying the benefits of SPECint distant parallelism in simultaneous multithreading architectures

1999 International Conference on Parallel Architectures and Compilation Techniques

In this paper we exploit the existence of distant parallelism that future compilers could detect, and we characterise its performance under simultaneous multithreading architectures. By distant parallelism we mean parallelism that cannot be captured by the processor instruction window and that can produce threads suitable for parallel execution in a multithreaded processor. We show that distant parallelism can make wider-issue processors feasible by providing more instructions from the distant threads, thus better exploiting the processor's resources when speeding up single integer applications. We also investigate the necessity of out-of-order processors in the presence of multiple threads of the same program. It is important to notice at this point that the benefits described herein are totally orthogonal to any other architectural techniques targeting a single thread.

Forcing Some Architectural Ceilings of the Actual Processor Paradigm

In our previously published research we discovered some branches that are very difficult to predict, called unbiased branches, which exhibit a "random" dynamic behavior. We developed state-of-the-art branch predictors to predict them, yet even these powerful predictors obtained very modest average prediction accuracies on the unbiased branches, whereas their global average prediction accuracies are high. The unbiased branches thus still restrict the ceiling of dynamic branch prediction, and accurately predicting them remains an open problem. Since the overall performance of modern superscalar processors is seriously affected by misprediction recovery, these difficult branches in particular represent a source of important performance penalties. Our statistics show that about 28.68% of branches are dependent on critical Load instructions. Moreover, 5.61% of branches are unbiased and also depend on critical Loads. These dependences involve high-penalty mispredictions that become serious performance obstacles and cause significant performance degradation. The negative impact of (unbiased) branches on global performance could be substantially attenuated by anticipating the results of long-latency instructions, including critical Loads. On the other hand, hiding instructions' long latencies in a pipelined superscalar processor represents an important challenge in itself.
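
A worked example of the penalty arithmetic this abstract alludes to, using its branch statistics; the misprediction rates, recovery penalty, base CPI, and branch frequency below are hypothetical values chosen for illustration.

```python
# Misprediction-cost model: a small population of hard-to-predict branches
# contributes an outsized share of the total penalty.
branch_freq         = 0.20    # assumed: branches per instruction
unbiased_fraction   = 0.0561  # from the abstract: unbiased + critical-Load dependent
mpred_rate_normal   = 0.03    # assumed: misprediction rate on ordinary branches
mpred_rate_unbiased = 0.25    # assumed: "random" behavior resists prediction
penalty             = 15      # assumed: recovery penalty in cycles
base_cpi            = 0.5     # assumed: CPI with perfect branch prediction

mpreds_per_inst = branch_freq * (
    unbiased_fraction * mpred_rate_unbiased
    + (1 - unbiased_fraction) * mpred_rate_normal
)
cpi = base_cpi + mpreds_per_inst * penalty
print(f"mispredictions/inst = {mpreds_per_inst:.4f}, CPI = {cpi:.3f}")
# With these numbers, the 5.6% of branches that are unbiased cause about
# a third of all mispredictions, hence the focus on them.
```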

A survey of processors with explicit multithreading

ACM Computing Surveys, 2003

Hardware multithreading is becoming a generally applied technique in the next generation of microprocessors. Several multithreaded processors have been announced by industry or are already in production in the areas of high-performance microprocessors, media processors, and network processors.