Performance of a micro-threaded pipeline (original) (raw)

Instruction level parallelism through microthreading—A scalable approach to chip multiprocessors

The Computer Journal, 2006

Most microprocessor chips today use an out-of-order instruction execution mechanism. This mechanism allows superscalar processors to extract reasonably high levels of instruction level parallelism (ILP). The most significant problem with this approach is a large instruction window and the logic to support instruction issue from it. This includes generating wake-up signals to waiting instructions and a selection mechanism for issuing them. Wide-issue width also requires a large multi-ported register file, so that each instruction can read and write its operands simultaneously. Neither structure scales well with issue width leading to poor performance relative to the gates used. Furthermore, to obtain this ILP, the execution of instructions must proceed speculatively. An alternative, which avoids this complexity in instruction issue and eliminates speculative execution, is the microthreaded model. This model fragments sequential code at compile time and executes the fragments out of order while maintaining in-order execution within the fragments. The only constraints on the execution of fragments are the dependencies between them, which are managed in a distributed and scalable manner using synchronizing registers. The fragments of code are called microthreads and they capture ILP and loop concurrency. Fragments can be interleaved on a single processor to give tolerance to latency in operands or distributed to many processors to achieve speedup. The implementation of this model is fully scalable. It supports distributed instruction issue and a fully scalable register file, which implements a distributed, shared-register model of communication and synchronization between multiple processors on a single chip. This paper introduces the model, compares it with current approaches and presents an analysis of some of the implementation issues. It also presents results showing scalable performance with issue width over several orders of magnitude, from the same binary code.

A microthreaded architecture and its compiler

2006

A different approach to ILP based on code fragmentation, first proposed some 10 years ago, is being used for novel CMP processor designs. The technique, called microthreading, enables binary compatibility across arbitrary schedules. Chip architectures have been proposed that contain many simple pipelines with hardware support for ultra-fast context switching. The concurrency described in the binary code is parametric and a typical microthread is an iteration of a loop. The ISA contains instructions to create a family of micro-threads, i.e., the collection of all loop iterations. In case a microthread encounters a (possibly) long latency operation (e.g., a load that may miss in the cache) this thread is switched out and another thread is switched in under program control. In this way, latencies can effectively be hidden, if there are a sufficient number of threads available. The creation of families of threads is the responsibility of the compiler. In this presentation, we give an overview of the microthreaded model of computation and we show by some small examples that it provides an efficient way of executing loops. Moreover, we show that this model has excellent scaling properties. Finally, we discuss the compiler support required and propose some compiler transformations that can be used to expose large families of threads.

Microthreading: model and compiler

Journal of The Peripheral Nervous System, 2011

There are two ways to improve processor performance, either by increasing the number of instructions issued per cycle or by increasing the speed of the processor’s clock. However, the former increases circuit complexity for diminishing returns and the latter increases power dissipation. Our Microthreading model proposes an alternative approach to ILP based on code fragmentation. These code fragments are called

Implementation and evaluation of a microthread architecture

Journal of Systems Architecture, 2009

a b s t r a c t Future many-core processor systems require scalable solutions that conventional architectures currently do not provide. This paper presents a novel architecture that demonstrates the required scalability. It is based on a model of computation developed in the AETHER project to provide a safe and composable approach to concurrent programming. The model supports a dynamic approach to concurrency that enables self-adaptivity in any environment so the model is quite general. It is implemented here in the instruction set of a dynamically scheduled RISC processor and many such processors form a microgrid. Binary compatibility over arbitrary clusters of such processors and an inherent scalability in both area and performance with concurrency exploited make this a very promising development for the era of many-core chips. This paper introduces the model, the processor and chip architecture and its emulation on a range of computational kernels. It also estimates the area of the structures required to support this model in silicon.

Multithreaded Processors

The Computer Journal, 2002

The instruction-level parallelism found in a conventional instruction stream is limited. Studies have shown the limits of processor utilization even for today's superscalar microprocessors. One solution is the additional utilization of more coarse-grained parallelism. The main approaches are the (single) chip multiprocessor and the multithreaded processor which optimize the throughput of multiprogramming workloads rather than single-thread performance. The chip multiprocessor integrates two or more complete processors on a single chip. Every unit of a processor is duplicated and used independently of its copies on the chip. In contrast, the multithreaded processor is able to pursue two or more threads of control in parallel within the processor pipeline. Unused instruction slots, which arise from pipelined execution of single-threaded programs by a contemporary microprocessor, are filled by instructions of other threads within a multithreaded processor.

A survey of processors with explicit multithreading

ACM Computing Surveys, 2003

Hardware multithreading is becoming a generally applied technique in the next generation of microprocessors. Several multithreaded processors are announced by industry or already into production in the areas of high-performance microprocessors, media, and network processors.

Simultaneous multithreading–blending thread-level and instruction-level parallelism in advanced microprocessors

Proc. 5th Word Multiconf. on …, 2001

The paper discusses the reasons and possibilities of exploiting thread-level parallelism in modern microprocessors. The performance of a superscalar processor suffers when instruction-level parallelism is low. The underutilization due to missing instruction-level parallelism can be overcome by simultaneous multithreading, where a processor can issue multiple instructions from multiple threads each cycle. Simultaneous multithreaded processors combine the multithreading technique with a wide-issue superscalar processor such that the full issue bandwidth is utilized by potentially issuing instructions from different threads simultaneously. Depending on the specific simultaneous multithreaded processor design, only a single instruction pipeline is used, or a single issue unit issues instructions from different instruction buffers simultaneously.

On the Design and Performance of Pipelined Architectures TR 87-022 August , 1987

2010

Pipelining is a widely used technique for implementing architectures which have inherent temporal parallelism when there is an operational requirement for high throughput. Many variations on the basic theme have been proposed, with varying degrees of success. The aims of this paper are twofold. The first is to present a critical review of conventional pipelined architectures, and put some well known problems in sharp relief. It is argued that conventional pipelined architectures have underlying limitations which can only be dealt with by adopting a different view of pipelining. These limitations are explained in terms of discontinuities in the fiow of instructions and data, and representative machines are examined in support of this argument. The second aim is to introduce an alternative theory of pipelining, which we call Context Flow, and show how it can be used to construct efficient parallel systems.

Multiscalar Processors

2003

The revolution of semiconductor technology has continued to provide microprocessor architects with an ever increasing number of faster transistors with which to build microprocessors. Microprocessor architects have responded by using the available transistors to build faster microprocessors which exploit instruction-level parallelism (ILP) to attain their performance objectives. Starting with serial instruction processing in the 1970s microprocessors progressed to pipelined and superscalar instruction processing in the 1980s and eventually (mid 1990s) to the currently popular dynamically-scheduled instruction processing models. During this progression, microprocessor architects borrowed heavily from ideas that were initially developed for processors of mainframe computers and rapidly adopted them for their designs. In the late 1980s it was clear that most of the ideas developed for high-performance instruction processing were either already adopted, or were soon going to be adopted. New ideas would have to be developed to continue the march of microprocessor performance. The initial multi scalar ideas were developed with this background in the late 1980s at the University of Wisconsin. The objective was to develop an instruction processing paradigm for future microprocessors when transistors were abundant, but other constraints such as wire delays and design verification were important. The multiscalar research at Wisconsin started out small but quickly grew to a much larger effort as the ideas generated interest in the research community. Manoj Franklin's Ph.D thesis was the first to develop and study the initial ideas. This was followed by the Wisconsin Ph.D theses of Scott Breach, T.N. Vijaykumar, Andreas Moshovos, Quinn Jacobson and Eric Rotenberg which studied various aspects of the multi scalar execution models. A significant amount of research on processing models derived from multi scalar was also carried out at other universities and research labs in the 1990s. Today variants of the basic multiscalar paradigm and other follow-on models continue to be the focus of significant research activity as researchers continue to build the knowledge base that will be crucial to the design of future microprocessors. vi This book provides an excellent synopsis of a large body of research carried out on multiscalar processors in the 1990s. It will be a valuable resource for designers of future microprocessors as well as for students interested in learning about the concepts of speculative multithreading.