Jack Lo - Academia.edu

Papers by Jack Lo

Software-directed register deallocation for simultaneous multithreaded processors

… IEEE Transactions on, 1999

This paper proposes and evaluates software techniques that increase register file utilization for simultaneous multithreading (SMT) processors. SMT processors require large register files to hold multiple thread contexts that can issue instructions out of order every cycle. By supporting better inter-thread sharing and management of physical registers, an SMT processor can reduce the number of registers required and can improve performance for a given register file size. Our techniques specifically target register deallocation. While out-of-order processors with register renaming are effective at knowing when a new physical register must be allocated, they have limited knowledge of when physical registers can be deallocated. We propose architectural extensions that permit the compiler and operating system to (1) free registers immediately upon their last use, and (2) free registers allocated to idle thread contexts. Our results, based on detailed instruction-level simulations of an SMT processor, show that these techniques can increase performance significantly for register-intensive, multithreaded programs.

An analysis of database workload performance on simultaneous multithreaded …

ACM SIGARCH …

Simultaneous multithreading (SMT) is an architectural technique in which the processor issues multiple instructions from multiple threads each cycle. While SMT has been shown to be effective on scientific workloads, its performance on database systems is still an open question. In ...

Exploiting choice: Instruction fetch and issue on an implementable simultaneous multithreading …

Proceedings of the 23rd annual international symposium on …

Simultaneous multithreading is a technique that permits multiple independent threads to issue multiple instructions each cycle. In previous work we demonstrated the performance potential of simultaneous multithreading, based on a somewhat idealized model. In this paper we show that the throughput gains from simultaneous multithreading can be achieved without extensive changes to a conventional wide-issue superscalar, either in hardware structures or sizes. We present an architecture for simultaneous multithreading that achieves three goals: (1) it minimizes the architectural impact on the conventional superscalar design, (2) it has minimal performance impact on a single thread executing alone, and (3) it achieves significant throughput gains when running multiple threads. Our simultaneous multithreading architecture achieves a throughput of 5.4 instructions per cycle, a 2.5-fold improvement over an unmodified superscalar with similar hardware resources. This speedup is enhanced by an advantage of multithreading previously unexploited in other architectures: the ability to favor for fetch and issue those threads most efficiently using the processor each cycle, thereby providing the "best" instructions to the processor.

Exploiting thread-level parallelism on simultaneous multithreaded processors

... I would also like to thank Joel Emer and Rebecca Stamm from Digital Semiconductor, and Luiz Barroso and Kourosh Gharachorloo, from Digital's Western Research Laboratory. Special thanks to Luiz for hosting me during my summer internship at WRL. ...

Converting thread-level parallelism to instruction-level parallelism via simultaneous multithreading

ACM Transactions on Computer Systems, Aug 1, 1997

To achieve high performance, contemporary computer systems rely on two forms of parallelism: instruction-level parallelism (ILP) and thread-level parallelism (TLP). Wide-issue superscalar processors exploit ILP by executing multiple instructions from a single program in a single cycle. Multiprocessors (MP) exploit TLP by executing different threads in parallel on different processors. Unfortunately, both parallel processing styles statically partition processor resources, thus preventing them from adapting to dynamically changing levels of ILP and TLP in a program. With insufficient TLP, processors in an MP will be idle; with insufficient ILP, multiple-issue hardware on a superscalar is wasted. This article explores parallel processing on an alternative architecture, simultaneous multithreading (SMT), which allows multiple threads to compete for and share all of the processor's resources every cycle. The most compelling reason for running parallel applications on an SMT processor is its ability to use thread-level parallelism and instruction-level parallelism interchangeably. By permitting

Examining the Interaction Between Balanced Scheduling and Other Compiler Optimizations

Exploiting choice: instruction fetch and issue on an implementable simultaneous multithreading proce

ISCA, 1996

Simultaneous multithreading is a technique that permits multiple independent threads to issue multiple instructions each cycle. In previous work we demonstrated the performance potential of simultaneous multithreading, based on a somewhat idealized model. In this paper we show that the throughput gains from simultaneous multithreading can be achieved without extensive changes to a conventional wide-issue superscalar, either in hardware structures or sizes. We present an architecture for simultaneous multithreading that achieves three goals: (1) it minimizes the architectural impact on the conventional superscalar design, (2) it has minimal performance impact on a single thread executing alone, and (3) it achieves significant throughput gains when running multiple threads. Our simultaneous multithreading architecture achieves a throughput of 5.4 instructions per cycle, a 2.5-fold improvement over an unmodified superscalar with similar hardware resources. This speedup is enhanced by an advantage of multithreading previously unexploited in other architectures: the ability to favor for fetch and issue those threads most efficiently using the processor each cycle, thereby providing the "best" instructions to the processor.

Shared register storage mechanisms for multithreaded computer systems with out-of-order execution

[57] ABSTRACT A method and organization for implementing the registers required in a computer system supporting multithreading and dynamic out-of-order execution. Multithreaded computer systems are those in which the processor supports multiple contexts (threads), and either rapid context switching from thread to thread or scheduling of instructions from different threads within a single cycle. An important component of processors for such systems is the register file; the processor needs a large register file or ...

Mechanism for Freeing Registers on Processors That Perform Dynamic Out-Of-Order Execution of Instructions Using Renaming Registers

(57) ABSTRACT A system and a method is described for freeing renaming registers that have been allocated to architectural registers prior to another instruction redefining the architectural register. Renaming registers are used by a processor to dynamically execute instructions out-of-order. The present invention may be employed by any single or multi-threaded processor that executes instructions out-of-order. A mechanism is described for freeing renaming registers that consists of a set of instructions, used by a compiler, to ...

Improving Flash Resource Utilization at Minimal Management Cost in Virtualized Flash-based Storage Systems

IEEE Transactions on Cloud Computing, 2015

Software-directed register deallocation for simultaneous multithreaded processors

IEEE Transactions on Parallel and Distributed Systems, 1999

This paper proposes and evaluates software techniques that increase register file utilization for simultaneous multithreading (SMT) processors. SMT processors require large register files to hold multiple thread contexts that can issue instructions out of order every cycle. By supporting better inter-thread sharing and management of physical registers, an SMT processor can reduce the number of registers required and can improve performance for a given register file size. Our techniques specifically target register deallocation. While out-of-order processors with register renaming are effective at knowing when a new physical register must be allocated, they have limited knowledge of when physical registers can be deallocated. We propose architectural extensions that permit the compiler and operating system to (1) free registers immediately upon their last use, and (2) free registers allocated to idle thread contexts. Our results, based on detailed instruction-level simulations of an SMT processor, show that these techniques can increase performance significantly for register-intensive, multithreaded programs.

Exploiting thread-level parallelism on simultaneous multithreaded processors

... I would also like to thank Joel Emer and Rebecca Stamm from Digital Semiconductor, and Luiz Barroso and Kourosh Gharachorloo, from Digital's Western Research Laboratory. Special thanks to Luiz for hosting me during my summer internship at WRL. ...

WRL Technical Note TN-52

hpl.hp.com

Simultaneous multithreading (SMT) is a processor design that allows the CPU to issue instructions from multiple threads each cycle. Using instruction-level and thread-level parallelism interchangeably, SMT addresses multiple sources of lost resource utilization in wide-issue superscalars. The result is better performance for a variety of workloads. For a mix of independent programs (multiprogramming), the overall throughput is improved: when one program has no instructions that are ready to issue, instructions can be used from one ...

Tuning compiler optimizations for simultaneous multithreading

International Journal of …, 1999

Simultaneous Multithreading (SMT) is a processor architectural technique that promises to significantly improve the utilization and performance of modern wide-issue superscalar processors. An SMT processor is capable of issuing multiple instructions from multiple threads to a ...

VFRM: Flash Resource Manager in VMware ESX Server

2014 IEEE Network Operations and Management Symposium (NOMS), 2014

Thread-Sensitive Scheduling for SMT Processors

This paper examines thread-sensitive scheduling for SMT processors. When more threads exist than hardware execution contexts, the operating system is responsible for selecting which threads to execute at any instant, inherently deciding which threads will compete for resources. ...

Exploiting choice

Proceedings of the 23rd annual international symposium on Computer architecture - ISCA '96, 1996

Simultaneous multithreading is a technique that permits multiple independent threads to issue multiple instructions each cycle. In previous work we demonstrated the performance potential of simultaneous multithreading, based on a somewhat idealized model. In this paper we show that the throughput gains from simultaneous multithreading can be achieved without extensive changes to a conventional wide-issue superscalar, either in hardware structures or sizes. We present an architecture for simultaneous multithreading that achieves three goals: (1) it minimizes the architectural impact on the conventional superscalar design, (2) it has minimal performance impact on a single thread executing alone, and (3) it achieves significant throughput gains when running multiple threads. Our simultaneous multithreading architecture achieves a throughput of 5.4 instructions per cycle, a 2.5-fold improvement over an unmodified superscalar with similar hardware resources. This speedup is enhanced by an advantage of multithreading previously unexploited in other architectures: the ability to favor for fetch and issue those threads most efficiently using the processor each cycle, thereby providing the "best" instructions to the processor.

Converting thread-level parallelism to instruction-level parallelism via simultaneous multithreading

ACM Transactions on Computer Systems, 1997

To achieve high performance, contemporary computer systems rely on two forms of parallelism: instruction-level parallelism (ILP) and thread-level parallelism (TLP). Wide-issue superscalar processors exploit ILP by executing multiple instructions from a single program in a single cycle. Multiprocessors (MP) exploit TLP by executing different threads in parallel on different processors. Unfortunately, both parallel processing styles statically partition processor resources, thus preventing them from adapting to dynamically changing levels of ILP and TLP in a program. With insufficient TLP, processors in an MP will be idle; with insufficient ILP, multiple-issue hardware on a superscalar is wasted. This article explores parallel processing on an alternative architecture, simultaneous multithreading (SMT), which allows multiple threads to compete for and share all of the processor's resources every cycle. The most compelling reason for running parallel applications on an SMT processor is its ability to use thread-level parallelism and instruction-level parallelism interchangeably. By permitting

Simultaneous multithreading: a platform for next-generation processors

Compilation issues for a simultaneous multithreading processor

Proceedings of the First SUIF Compiler Workshop, 1996

Jack L. Lo, Susan J. Eggers, Henry M. Levy, Dean M. Tullsen {jlo,eggers,levy,tullsen}@cs.washington.edu ... Department of Computer Science & Engineering University of Washington ... Simultaneous multithreading (SMT) is a technique that permits multiple independent threads to issue multi- ... Unlike conventional multithreaded architectures [LGH94][ALKK90][Smi81][ACC+90], which depend on fast ... This benefit of SMT can be realized without extensive changes to a conventional wide-issue superscalar, by ... Thus far, we have only evaluated simultaneous ...
