Measuring the Parallelism Available for Very Long Instruction Word Architectures

Instruction window size trade-offs and characterization of program parallelism

IEEE Transactions on Computers, 1994

Detecting independent operations is a prime objective for computers that are capable of issuing and executing multiple operations simultaneously. The number of instructions that are simultaneously examined for detecting those that are independent is the scope of concurrency detection. This paper presents an analytical model for predicting the performance impact of varying the scope of concurrency detection as a function of available resources, such as number of pipelines in a superscalar architecture. The model developed can show where a performance bottleneck might be: insufficient resources to exploit discovered parallelism, insufficient instruction stream parallelism, or insufficient scope of concurrency detection.
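The interaction between detection scope and available resources can be illustrated with a toy cycle-by-cycle simulation. This is only a sketch, not the paper's analytical model: the function name `cycles_to_issue`, the unit-latency assumption, and the in-order window policy are all assumptions made for illustration.

```python
def cycles_to_issue(deps, n, window, pipelines):
    """Count cycles needed to issue n instructions (toy model, not the paper's).

    deps maps an instruction index to the set of EARLIER indices it depends on.
    Each cycle, only the `window` oldest unissued instructions are examined,
    and at most `pipelines` of the ready ones are issued. Every instruction is
    assumed to complete in one cycle.
    """
    done = set()
    pending = list(range(n))
    cycles = 0
    while pending:
        scope = pending[:window]  # scope of concurrency detection
        # Ready means all producers finished in an earlier cycle.
        ready = [i for i in scope if deps.get(i, set()) <= done]
        issued = ready[:pipelines]  # limited by available pipelines
        done.update(issued)
        pending = [i for i in pending if i not in done]
        cycles += 1
    return cycles


# Two independent dependence chains: 0->1->2->3 and 4->5->6->7.
chains = {1: {0}, 2: {1}, 3: {2}, 5: {4}, 6: {5}, 7: {6}}

print(cycles_to_issue(chains, 8, window=1, pipelines=2))  # 8: scope-limited
print(cycles_to_issue(chains, 8, window=2, pipelines=2))  # 7: scope still too small
print(cycles_to_issue(chains, 8, window=8, pipelines=2))  # 4: both chains overlap
print(cycles_to_issue(chains, 8, window=8, pipelines=1))  # 8: resource-limited
```

In this example the same instruction stream is limited first by the window, then by the number of pipelines. A single eight-instruction chain would stay at eight cycles regardless of either setting, which corresponds to insufficient instruction stream parallelism.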

Extraction of massive instruction level parallelism

ACM SIGARCH Computer Architecture News, 1993

Our goal is to dramatically increase the performance of uniprocessors through the exploitation of instruction level parallelism, i.e., the parallelism that exists among the machine instructions of a program. Speculative execution may help considerably, but it is argued that branch prediction and eager execution are each insufficient to achieve speedup factors in the tens (with respect to sequential execution) at reasonable hardware cost. A new form of code execution, Disjoint Eager Execution (DEE), is proposed which uses less hardware than pure eager execution and achieves higher performance than pure branch prediction; DEE is a continuum between branch prediction and eager execution. DEE is shown to be optimal when processing resources are constrained. Branches are predicted in DEE, but the predictions should be made in parallel in order to obtain high performance. This is not allowed, however, by the use of the standard instruction stream model, the dynamic model (the orde...

A VLIW architecture for a trace scheduling compiler

IEEE Transactions on Computers, 1988

Very long instruction word (VLIW) architectures were promised to deliver far more than the factor of two or three that current architectures achieve from overlapped execution. Using a new type of compiler which compacts ordinary sequential code into long instruction words, a large-scale VLIW machine was expected to provide from ten to thirty times the performance of a more conventional machine built of the same implementation technology. Multiflow Computer, Inc., has now built a VLIW called the TRACE, along with its companion Trace Scheduling compacting compiler. This machine has three hardware configurations, capable of executing 7, 14, or 28 operations simultaneously. The "seven-wide" achieves a performance improvement of a factor of five or six for a wide range of scientific code, compared to machines of higher cost and faster chip implementation technology (such as the VAX 8700). The TRACE extends some basic reduced-instruction-set precepts: the architecture is load/store, the microarchitecture is exposed to the compiler, there is no microcode, and there is almost no hardware devoted to synchronization, arbitration, or interlocking of any kind (the compiler has sole responsibility for run-time resource usage). This paper discusses the design of this machine and presents some initial performance results.

Measures of parallelism at compile time

1993 Euromicro Workshop on Parallel and Distributed Processing

This work was funded by the Ministerio de Educación under contracts TIC-392/89 and TIC-880/92.

The Challenges of Efficient Code-Generation for Massively Parallel Architectures

2006

Abstract. Overcoming the memory wall [15] may be achieved by increasing the bandwidth and reducing the latency of the processor-to-memory connection, for example by implementing cellular architectures such as the IBM Cyclops. Such massively parallel architectures have sophisticated memory models. In this paper we used DIMES (the Delaware Iterative Multiprocessor Emulation System), developed by CAPSL at the University of Delaware, as a hardware evaluation tool for cellular architectures. The authors contend that there is an open question regarding the ideal approach to parallelism from the programmer's perspective: at the language level, for example with UPC or HPF; using trace scheduling; or at the library level, for example with OpenMP or POSIX threads. To investigate this, we have chosen to use a threaded Mandelbrot-set generator with a work-stealing algorithm to evaluate the DIMES cthread programming model for writing a simple multi-threaded program.

An Approach for Compiler Optimization to Exploit Instruction Level Parallelism

Instruction Level Parallelism (ILP) is not a new idea. Unfortunately, ILP architectures are not well suited to all conventional high-level language compilers and compiler optimization techniques. ILP is the technique that allows a sequence of instructions derived from a sequential program (without rewriting) to be parallelized for execution on multiple pipelined functional units. As a result, performance increases while existing software continues to work. At the implicit level it is initiated by modifying the compiler, and at the explicit level it is done by exploiting the parallelism available in the hardware. To achieve a high degree of instruction level parallelism, it is necessary to analyze and evaluate the techniques of speculative execution and control dependence analysis, and to follow multiple flows of control. Researchers continue to look for ways to increase parallelism by an order of magnitude beyond current approaches. In this paper we present the impact of control flow support on highly parallel architectures with 2 and 4 cores. We also investigate the scope of parallelism both explicitly and implicitly. For our experiments we used the Trimaran simulator. The benchmarks are tested on abstract machine models created with Trimaran.

Automatic Detection of Parallelism: A grand challenge for high performance computing

IEEE Parallel & Distributed Technology: Systems & Applications, 1994

The limited ability of compilers to find the parallelism in programs is a significant barrier to the use of high-performance computers. However, a combination of static and runtime techniques can improve compilers to the extent that a significant group of scientific programs can be parallelized automatically.

Compile-time techniques for efficient utilization of parallel memories

ACM SIGPLAN Notices, 1988

The partitioning of shared memory into a number of memory modules is an approach to achieve high memory bandwidth for parallel processors. Memory access conflicts can occur when several processors simultaneously request data from the same memory module. Although work has been done to improve access performance for vectors, no work has been reported to improve the access performance of scalars. For systems in which the processors operate in a lock-step mode, a large percentage of memory access conflicts can be predicted at compile-time. These conflicts can be avoided by appropriate distribution of data among the memory modules at compile-time. A long instruction word machine is an example of a system in which the functional units operate in a lock-step mode performing operations on data fetched in parallel from multiple memory modules. In this paper, compile-time techniques for distribution of scalars to avoid memory access conflicts are presented. Furthermore, algorithms to schedule...
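The idea of placing scalars so that operands fetched together land in different memory modules can be sketched as a graph-coloring problem. The greedy heuristic below is an illustrative assumption, not the algorithm from the paper; the names `assign_modules` and `count_conflicts` and the input encoding (one tuple of scalar names per long instruction) are hypothetical.

```python
from itertools import combinations


def assign_modules(instructions, num_modules):
    """Greedily map each scalar to a memory module (illustrative heuristic).

    instructions: iterable of tuples, each listing the scalars that one long
    instruction fetches in the same lock-step cycle.
    """
    conflicts = {}
    for operands in instructions:
        for v in operands:
            conflicts.setdefault(v, set())
        # Scalars fetched together should live in different modules.
        for a, b in combinations(sorted(set(operands)), 2):
            conflicts[a].add(b)
            conflicts[b].add(a)

    module = {}
    # Place the most constrained scalars first; ties broken by name.
    for v in sorted(conflicts, key=lambda v: (-len(conflicts[v]), v)):
        used = {module[u] for u in conflicts[v] if u in module}
        free = [m for m in range(num_modules) if m not in used]
        # If no module is free, fall back to module 0 and accept a conflict.
        module[v] = free[0] if free else 0
    return module


def count_conflicts(instructions, module):
    """Number of extra accesses that collide on a module within one instruction."""
    total = 0
    for operands in instructions:
        mods = [module[v] for v in set(operands)]
        total += len(mods) - len(set(mods))
    return total


instrs = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
print(count_conflicts(instrs, assign_modules(instrs, 3)))  # 0
print(count_conflicts(instrs, assign_modules(instrs, 2)))  # 1
```

With three modules the example is conflict-free. With two, the scalars `a`, `b`, and `c` form a triangle in the conflict graph, so at least one collision is unavoidable.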

SOFTWARE EXPLOITS OF INSTRUCTION-LEVEL PARALLELISM FOR

For decades, hardware algorithms have dominated the field of parallel processing. But with Moore's law reaching its limit, the need for software pipelining is being felt. This area has long eluded researchers. A significant measure of success has been obtained in graphics processing using software approaches to pipelining. This project aims at developing software to detect the various kinds of data dependencies, namely flow dependency, anti-dependency, and output dependency, in a basic code block. Graphs would be generated for each kind of dependency present in a code block and would be combined to obtain a single data dependency graph. This graph would be further processed to obtain a transitive closure graph and finally an ILP graph. The ILP graph can be used to predict the possible combinations of instructions that may be executed in parallel. A scheduling algorithm would be developed to obtain an instruction schedule in the form of instruction execution start times. The schedule obtained would be used to compute various performance metrics such as speed-up factor, efficiency, and throughput.
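The dependency-detection and scheduling steps described above can be sketched as follows. This is a minimal illustration, not the project's implementation: the `Instr` encoding (one destination, a tuple of sources), the unit latency, and the unlimited number of functional units are all assumptions, and the transitive-closure step is omitted.

```python
from collections import namedtuple

# Hypothetical encoding: each instruction writes one register and reads some.
Instr = namedtuple("Instr", ["dest", "srcs"])


def find_dependencies(block):
    """Return {(i, j): kinds} for every dependent pair i < j in the block.

    The pairwise check is conservative: it ignores intervening redefinitions,
    so it may report edges that are already implied by others.
    """
    deps = {}
    for j, later in enumerate(block):
        for i in range(j):
            earlier = block[i]
            kinds = set()
            if earlier.dest in later.srcs:
                kinds.add("flow")    # read after write
            if later.dest in earlier.srcs:
                kinds.add("anti")    # write after read
            if earlier.dest == later.dest:
                kinds.add("output")  # write after write
            if kinds:
                deps[(i, j)] = kinds
    return deps


def asap_schedule(n, deps, latency=1):
    """Earliest start time per instruction, assuming unlimited functional units."""
    start = [0] * n
    for j in range(n):
        preds = [i for (i, k) in deps if k == j]
        start[j] = max((start[i] + latency for i in preds), default=0)
    return start


block = [
    Instr("a", ("b", "c")),  # 0: a = b + c
    Instr("d", ("a",)),      # 1: d = a * 2
    Instr("b", ("e",)),      # 2: b = e - 1
    Instr("a", ("d", "b")),  # 3: a = d + b
]
deps = find_dependencies(block)
start = asap_schedule(len(block), deps)
speedup = len(block) / (max(start) + 1)

print(start)    # [0, 1, 1, 2]
print(speedup)  # 4 instructions in 3 cycles, about 1.33
```

Here instructions 1 and 2 can start in the same cycle, so four instructions finish in three cycles. Treating anti and output dependencies with the same latency as flow dependencies is a simplification; register renaming could remove them entirely.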

Compiler Optimizations for High Performance Architectures

We describe two ongoing compiler projects for high performance architectures at the University of Maryland being developed using the Stanford SUIF compiler infrastructure. First, we are investigating the impact of compilation techniques for eliminating synchronization overhead in compiler-parallelized programs running on software distributed-shared-memory (DSM) systems. Second, we are evaluating data layout transformations to improve cache performance on uniprocessors by eliminating conflict misses through inter- and intra-variable padding. Our optimizations have been implemented in SUIF and tested on a number of programs. Preliminary results are encouraging.