Instruction level parallelism through microthreading—A scalable approach to chip multiprocessors

Scalable Instruction-Level Parallelism

Lecture Notes in Computer Science, 2004

This paper presents a model for instruction-level distributed computing that allows the implementation of scalable chip multiprocessors. Based on explicit microthreading, it serves as a replacement for out-of-order instruction issue; the paper defines the model and explores implementation issues. The model results in a fully distributed implementation in which data is distributed to one register file per processor, which is scalable because the number of ports in each register file is constant. The only component with less than ideal scaling properties is the switching network between processors.

Intrathreads: Techniques for parallelizing sequential code

6th Workshop on Multithreaded Execution, …, 2002

The inthreads architecture enables low-level parallelization of serial computation. This paper describes the inthreads architecture and shows several code transformations that can be used for optimizing code with low instruction-level parallelism. Such code cannot be optimized with conventional techniques, due to its complex branching, nor with conventional concurrent programming, due to the very low granularity of the parallelizable code sequences.

Microthreading a model for distributed instruction-level concurrency

Parallel processing letters, 2006

This paper analyses the micro-threaded model of concurrency, making comparisons with both data- and instruction-level concurrency. The model is fine-grained and provides synchronisation in a distributed register file, making it a promising candidate for scalable chip multiprocessors. The micro-threaded model was first proposed in 1996 as a means to tolerate high latencies in data-parallel, distributed-memory multiprocessors. This paper explores the model's opportunity to provide the simultaneous issue of instructions required for chip multiprocessors, and discusses the scalability of the support structures implementing the model and of the communication supporting it. The model supports deterministic distribution of code fragments and dynamic scheduling of instructions from within those fragments. The hardware also recognises different classes of variables from the register specifiers, which allows it to manage locality and optimise communication so that it is both efficient and scalable.

A microthreaded architecture and its compiler

2006

A different approach to ILP based on code fragmentation, first proposed some 10 years ago, is being used for novel CMP processor designs. The technique, called microthreading, enables binary compatibility across arbitrary schedules. Chip architectures have been proposed that contain many simple pipelines with hardware support for ultra-fast context switching. The concurrency described in the binary code is parametric, and a typical microthread is an iteration of a loop. The ISA contains instructions to create a family of microthreads, i.e., the collection of all loop iterations. When a microthread encounters a (possibly) long-latency operation (e.g., a load that may miss in the cache), this thread is switched out and another thread is switched in under program control. In this way, latencies can effectively be hidden, provided a sufficient number of threads are available. The creation of families of threads is the responsibility of the compiler. In this presentation, we give an overview of the microthreaded model of computation and we show by some small examples that it provides an efficient way of executing loops. Moreover, we show that this model has excellent scaling properties. Finally, we discuss the compiler support required and propose some compiler transformations that can be used to expose large families of threads.
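The latency-hiding mechanism described above can be illustrated with a toy cycle-level simulation (our own sketch, not the authors' hardware): each loop iteration is a microthread that issues a load, is switched out for the duration of an assumed cache-miss latency, then issues its compute instruction. With enough threads, every cycle issues a useful instruction and the miss latency disappears from the total.

```python
import heapq

MISS_LATENCY = 10  # assumed cache-miss latency in cycles (illustrative)

def simulate(n_threads):
    """Toy single-issue pipeline: each microthread issues one load, is
    not ready again until the miss resolves, then issues one compute.
    Returns the total cycles to retire all computes."""
    cycle = 0
    ready = list(range(n_threads))   # threads ready to issue
    waiting = []                     # heap of (wake_cycle, thread)
    issued_load = set()
    done = 0
    while done < n_threads:
        # wake threads whose loads have completed
        while waiting and waiting[0][0] <= cycle:
            _, t = heapq.heappop(waiting)
            ready.append(t)
        if ready:
            t = ready.pop(0)
            if t in issued_load:
                done += 1            # compute instruction retires
            else:
                issued_load.add(t)   # load issues; switch thread out
                heapq.heappush(waiting, (cycle + MISS_LATENCY, t))
        cycle += 1
    return cycle

# One thread stalls on every miss; twelve threads overlap their misses.
single = simulate(1)    # 11 cycles for 2 instructions
many = simulate(12)     # 24 cycles for 24 instructions: latency hidden
```

With a single thread the pipeline idles for the full miss latency; with a dozen threads the scheduler always has another ready thread to switch in, so throughput approaches one instruction per cycle, matching the claim that latencies are hidden given sufficient threads.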

Exploiting Instruction-Level Parallelism in the Presence of Conditional Branches

Wide-issue superscalar and VLIW processors utilize instruction-level parallelism (ILP) to achieve high performance. However, if insufficient ILP is found, the performance potential of these processors suffers dramatically. Branch instructions, which are one of the major lim…

SOFTWARE EXPLOITS OF INSTRUCTION-LEVEL PARALLELISM FOR

For decades, hardware algorithms have dominated the field of parallel processing, but with Moore's law reaching its limits, the need for software pipelining is increasingly felt. This area has long eluded researchers. A significant measure of success has been obtained in graphics processing using software approaches to pipelining. This project aims at developing software to detect the various kinds of data dependencies, namely data-flow dependency, anti-dependency, and output dependency, within a basic code block. Graphs would be generated for each kind of dependency present in the block and combined to obtain a single data-dependency graph. This graph would be further processed to obtain a transitive closure graph and finally an ILP graph. The ILP graph can be used to predict the possible combinations of instructions that may be executed in parallel. A scheduling algorithm would be developed to obtain an instruction schedule in the form of instruction execution start times. The schedule obtained would be used to compute various performance metrics such as speed-up factor, efficiency, and throughput.
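The analysis described above can be sketched in a few lines (a minimal illustration under our own assumptions; the three-address instruction format and function names are ours, not the project's tool). It detects the three dependency kinds for a basic block and derives start times from the combined dependency graph.

```python
def find_dependencies(block):
    """block: list of (dest, src1, src2) register tuples.
    Returns edges (i, j, kind) meaning instruction j depends on i."""
    deps = []
    for j, (dj, *srcs_j) in enumerate(block):
        for i in range(j):
            di, *srcs_i = block[i]
            if di in srcs_j:
                deps.append((i, j, "flow"))    # read-after-write
            if dj in srcs_i:
                deps.append((i, j, "anti"))    # write-after-read
            if dj == di:
                deps.append((i, j, "output"))  # write-after-write
    return deps

def schedule(block, deps, latency=1):
    """ASAP start times: an instruction starts once every predecessor
    in the combined dependency graph has finished (uniform latency,
    a simplifying assumption)."""
    start = [0] * len(block)
    for i, j, _ in sorted(deps, key=lambda e: e[1]):
        start[j] = max(start[j], start[i] + latency)
    return start

block = [("r1", "r2", "r3"),   # r1 = r2 op r3
         ("r4", "r5", "r6"),   # r4 = r5 op r6
         ("r7", "r1", "r4"),   # r7 = r1 op r4  (flow-depends on 0 and 1)
         ("r8", "r2", "r5")]   # r8 = r2 op r5  (independent)
deps = find_dependencies(block)   # [(0, 2, 'flow'), (1, 2, 'flow')]
times = schedule(block, deps)     # [0, 0, 1, 0]
```

Here instructions 0, 1, and 3 may start in cycle 0 and instruction 2 in cycle 1, giving a speed-up factor of 2 over sequential issue of the four instructions.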

Sequential code parallelization for multi-core embedded systems: A survey of models, algorithms and tools

2014

In recent years the industry has experienced a shift in the design and manufacture of processors: multi-core processors on a single chip have begun replacing the commonly used single-core processors. This design trend has reached System-on-Chip devices, widely used in embedded systems, turning them into powerful Multiprocessor Systems-on-Chip. These multi-core systems offer improvements not only in performance but also in energy efficiency. Millions of lines of code have been developed over the years, most of them in sequential programming languages such as C, and the possible performance gains of legacy sequential code executed on multi-core systems are limited by the amount of parallelism that can be extracted and exploited from that code. For this reason, several tools have been developed to extract parallelism from sequential programs and produce a parallel version of the original code. Nevertheless, most of these tools have been designed for high-performance computing systems...

Microthreading: model and compiler

Journal of The Peripheral Nervous System, 2011

There are two ways to improve processor performance: increase the number of instructions issued per cycle, or increase the speed of the processor's clock. However, the former increases circuit complexity for diminishing returns and the latter increases power dissipation. Our microthreading model proposes an alternative approach to ILP based on code fragmentation. These code fragments are called microthreads.

Extraction of massive instruction level parallelism

ACM SIGARCH Computer Architecture News, 1993

Our goal is to dramatically increase the performance of uniprocessors through the exploitation of instruction-level parallelism, i.e. the parallelism that exists amongst the machine instructions of a program. Speculative execution may help a lot, but, it is argued, both branch prediction and eager execution are insufficient to achieve speedup factors in the tens (with respect to sequential execution) at reasonable hardware cost. A new form of code execution, Disjoint Eager Execution (DEE), is proposed which uses less hardware than pure eager execution and achieves higher performance than pure branch prediction; DEE is a continuum between branch prediction and eager execution. DEE is shown to be optimal when processing resources are constrained. Branches are predicted in DEE, but the predictions should be made in parallel in order to obtain high performance. This is not allowed, however, by the use of the standard instruction stream model, the dynamic model (the orde...
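The core resource-allocation idea behind DEE can be sketched greedily (an illustrative simplification of the abstract's description, not Uht's exact algorithm): repeatedly assign the next execution unit to the unexplored branch path with the highest cumulative probability. With a uniform per-branch prediction accuracy p, this naturally interpolates between the two extremes the abstract names.

```python
import heapq

def dee_allocate(units, p):
    """Assign `units` speculative execution resources to branch paths
    in order of cumulative probability. A path is a string of branch
    outcomes: 'P' = predicted direction (probability p), 'N' = the
    other direction (probability 1 - p). With p near 1 this follows
    the single predicted path (pure branch prediction); with p = 0.5
    it expands a balanced tree (pure eager execution)."""
    heap = [(-1.0, "")]          # (-cumulative probability, path)
    chosen = []
    for _ in range(units):
        neg, path = heapq.heappop(heap)
        chosen.append(path or "root")
        # both extensions of the path become future candidates
        heapq.heappush(heap, (neg * p, path + "P"))
        heapq.heappush(heap, (neg * (1 - p), path + "N"))
    return chosen

mainline = dee_allocate(4, 0.9)  # ['root', 'P', 'PP', 'PPP']
eager = dee_allocate(7, 0.5)     # a depth-2 balanced tree of paths
```

With p = 0.9, the first four units all go down the predicted path, and a not-predicted side path is only funded once the mainline's cumulative probability decays below 1 - p, which is the "continuum" behaviour the abstract claims.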

Advanced Compilers, Architectures and Parallel Systems

1994

Multithreaded node architectures have been proposed for future multiprocessor systems. However, some open issues remain: can efficient multithreading support be provided in a multiprocessor machine such that it is capable of tolerating the synchronization and communication latencies, without intruding on the performance of sequentially-executed code?