Jack Lo - Academia.edu
Papers by Jack Lo
… IEEE Transactions on, 1999
This paper proposes and evaluates software techniques that increase register file utilization for simultaneous multithreading (SMT) processors. SMT processors require large register files to hold multiple thread contexts that can issue instructions out of order every cycle. By supporting better inter-thread sharing and management of physical registers, an SMT processor can reduce the number of registers required and can improve performance for a given register file size. Our techniques specifically target register deallocation. While out-of-order processors with register renaming are effective at knowing when a new physical register must be allocated, they have limited knowledge of when physical registers can be deallocated. We propose architectural extensions that permit the compiler and operating system to (1) free registers immediately upon their last use, and (2) free registers allocated to idle thread contexts. Our results, based on detailed instruction-level simulations of an SMT processor, show that these techniques can increase performance significantly for register-intensive, multithreaded programs.
ACM SIGARCH …
Simultaneous multithreading (SMT) is an architectural technique in which the processor issues multiple instructions from multiple threads each cycle. While SMT has been shown to be effective on scientific workloads, its performance on database systems is still an open question. In ...
Proceedings of the 23rd annual international symposium on …
Simultaneous multithreading is a technique that permits multiple independent threads to issue multiple instructions each cycle. In previous work we demonstrated the performance potential of simultaneous multithreading, based on a somewhat idealized model. In this paper we show that the throughput gains from simultaneous multithreading can be achieved without extensive changes to a conventional wide-issue superscalar, either in hardware structures or sizes. We present an architecture for simultaneous multithreading that achieves three goals: (1) it minimizes the architectural impact on the conventional superscalar design, (2) it has minimal performance impact on a single thread executing alone, and (3) it achieves significant throughput gains when running multiple threads. Our simultaneous multithreading architecture achieves a throughput of 5.4 instructions per cycle, a 2.5-fold improvement over an unmodified superscalar with similar hardware resources. This speedup is enhanced by an advantage of multithreading previously unexploited in other architectures: the ability to favor for fetch and issue those threads most efficiently using the processor each cycle, thereby providing the "best" instructions to the processor.
... I would also like to thank Joel Emer and Rebecca Stamm from Digital Semiconductor, and Luiz Barroso and Kourosh Gharachorloo, from Digital's Western Research Laboratory. Special thanks to Luiz for hosting me during my summer internship at WRL. ...
ACM Transactions on Computer Systems, Aug 1, 1997
To achieve high performance, contemporary computer systems rely on two forms of parallelism: instruction-level parallelism (ILP) and thread-level parallelism (TLP). Wide-issue superscalar processors exploit ILP by executing multiple instructions from a single program in a single cycle. Multiprocessors (MP) exploit TLP by executing different threads in parallel on different processors. Unfortunately, both parallel processing styles statically partition processor resources, thus preventing them from adapting to dynamically changing levels of ILP and TLP in a program. With insufficient TLP, processors in an MP will be idle; with insufficient ILP, multiple-issue hardware on a superscalar is wasted. This article explores parallel processing on an alternative architecture, simultaneous multithreading (SMT), which allows multiple threads to compete for and share all of the processor's resources every cycle. The most compelling reason for running parallel applications on an SMT processor is its ability to use thread-level parallelism and instruction-level parallelism interchangeably. By permitting
ISCA, 1995
Simultaneous multithreading is a technique that permits multiple independent threads to issue multiple instructions each cycle. In previous work we demonstrated the performance potential of simultaneous multithreading, based on a somewhat idealized model. In this paper we show that the throughput gains from simultaneous multithreading can be achieved without extensive changes to a conventional wide-issue superscalar, either in hardware structures or sizes. We present an architecture for simultaneous multithreading that achieves three goals: (1) it minimizes the architectural impact on the conventional superscalar design, (2) it has minimal performance impact on a single thread executing alone, and (3) it achieves significant throughput gains when running multiple threads. Our simultaneous multithreading architecture achieves a throughput of 5.4 instructions per cycle, a 2.5-fold improvement over an unmodified superscalar with similar hardware resources. This speedup is enhanced by an advantage of multithreading previously unexploited in other architectures: the ability to favor for fetch and issue those threads most efficiently using the processor each cycle, thereby providing the "best" instructions to the processor.
[57] ABSTRACT A method and organization for implementing the registers required in a computer system supporting multithreading and dynamic out-of-order execution. Multithreaded computer systems are those in which the processor supports multiple contexts (threads), and either rapid context switching from thread to thread or scheduling of instructions from different threads within a single cycle. An important component of processors for such systems is the register file; the processor needs a large register file or ...
(57) ABSTRACT A system and a method is described for freeing renaming registers that have been allocated to architectural registers prior to another instruction redefining the architectural register. Renaming registers are used by a processor to dynamically execute instructions out-of-order. The present invention may be employed by any single or multi-threaded processor that executes instructions out-of-order. A mechanism is described for freeing renaming registers that consists of a set of instructions, used by a compiler, to ...
IEEE Transactions on Cloud Computing, 2015
IEEE Transactions on Parallel and Distributed Systems, 1999
This paper proposes and evaluates software techniques that increase register file utilization for simultaneous multithreading (SMT) processors. SMT processors require large register files to hold multiple thread contexts that can issue instructions out of order every cycle. By supporting better inter-thread sharing and management of physical registers, an SMT processor can reduce the number of registers required and can improve performance for a given register file size. Our techniques specifically target register deallocation. While out-of-order processors with register renaming are effective at knowing when a new physical register must be allocated, they have limited knowledge of when physical registers can be deallocated. We propose architectural extensions that permit the compiler and operating system to (1) free registers immediately upon their last use, and (2) free registers allocated to idle thread contexts. Our results, based on detailed instruction-level simulations of an SMT processor, show that these techniques can increase performance significantly for register-intensive, multithreaded programs.
... I would also like to thank Joel Emer and Rebecca Stamm from Digital Semiconductor, and Luiz Barroso and Kourosh Gharachorloo, from Digital's Western Research Laboratory. Special thanks to Luiz for hosting me during my summer internship at WRL. ...
hpl.hp.com
Simultaneous multithreading (SMT) is a processor design that allows the CPU to issue instructions from multiple threads each cycle. Using instruction-level and thread-level parallelism interchangeably, SMT addresses multiple sources of lost resource utilization in wide-issue superscalars. The result is better performance for a variety of workloads. For a mix of independent programs (multiprogramming), the overall throughput is improved: when one program has no instructions that are ready to issue, instructions can be used from one ...
International Journal of …, 1999
Simultaneous Multithreading (SMT) is a processor architectural technique that promises to significantly improve the utilization and performance of modern wide-issue superscalar processors. An SMT processor is capable of issuing multiple instructions from multiple threads to a ...
2014 IEEE Network Operations and Management Symposium (NOMS), 2014
This paper examines thread-sensitive scheduling for SMT processors. When more threads exist than hardware execution contexts, the operating system is responsible for selecting which threads to execute at any instant, inherently deciding which threads will compete for resources. ...
Proceedings of the 23rd annual international symposium on Computer architecture - ISCA '96, 1996
Simultaneous multithreading is a technique that permits multiple independent threads to issue multiple instructions each cycle. In previous work we demonstrated the performance potential of simultaneous multithreading, based on a somewhat idealized model. In this paper we show that the throughput gains from simultaneous multithreading can be achieved without extensive changes to a conventional wide-issue superscalar, either in hardware structures or sizes. We present an architecture for simultaneous multithreading that achieves three goals: (1) it minimizes the architectural impact on the conventional superscalar design, (2) it has minimal performance impact on a single thread executing alone, and (3) it achieves significant throughput gains when running multiple threads. Our simultaneous multithreading architecture achieves a throughput of 5.4 instructions per cycle, a 2.5-fold improvement over an unmodified superscalar with similar hardware resources. This speedup is enhanced by an advantage of multithreading previously unexploited in other architectures: the ability to favor for fetch and issue those threads most efficiently using the processor each cycle, thereby providing the "best" instructions to the processor.
ACM Transactions on Computer Systems, 1997
To achieve high performance, contemporary computer systems rely on two forms of parallelism: instruction-level parallelism (ILP) and thread-level parallelism (TLP). Wide-issue superscalar processors exploit ILP by executing multiple instructions from a single program in a single cycle. Multiprocessors (MP) exploit TLP by executing different threads in parallel on different processors. Unfortunately, both parallel processing styles statically partition processor resources, thus preventing them from adapting to dynamically changing levels of ILP and TLP in a program. With insufficient TLP, processors in an MP will be idle; with insufficient ILP, multiple-issue hardware on a superscalar is wasted. This article explores parallel processing on an alternative architecture, simultaneous multithreading (SMT), which allows multiple threads to compete for and share all of the processor's resources every cycle. The most compelling reason for running parallel applications on an SMT processor is its ability to use thread-level parallelism and instruction-level parallelism interchangeably. By permitting
Proceedings of the First SUIF Compiler Workshop, 1996
Jack L. Lo, Susan J. Eggers, Henry M. Levy, Dean M. Tullsen {jlo,eggers,levy,tullsen}@cs.washington.edu ... Department of Computer Science & Engineering University of Washington ... Simultaneous multithreading (SMT) is a technique that permits multiple independent threads to issue multi- ... Unlike conventional multithreaded architectures [LGH94][ALKK90][Smi81][ACC+90], which depend on fast ... This benefit of SMT can be realized without extensive changes to a conventional wide-issue superscalar, by ... Thus far, we have only evaluated simultaneous ...