Andreas Moshovos | University of Toronto
Papers by Andreas Moshovos
Abstract An increasing number of architectural techniques have relied on hardware counting Bloom filters (CBFs) to improve upon the energy, delay, and complexity of various processor structures. CBFs improve the energy and speed of membership tests by maintaining an imprecise and compact representation of a large set to be searched. This paper studies the energy, delay, and area characteristics of two implementations for CBFs using full-custom layouts in a commercial 0.13-µm fabrication technology.
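The data structure itself is simple enough to sketch in software. Below is a minimal counting Bloom filter in Python; the sizing and the hash construction are illustrative assumptions, not the paper's hardware design.

```python
import hashlib

class CountingBloomFilter:
    """Imprecise, compact set representation: no false negatives,
    occasional false positives; counters (not bits) allow deletion."""
    def __init__(self, num_counters=1024, num_hashes=3):
        self.counters = [0] * num_counters
        self.num_hashes = num_hashes

    def _indexes(self, item):
        # Derive several counter indexes from one digest (illustrative only).
        digest = hashlib.sha256(str(item).encode()).digest()
        n = len(self.counters)
        return [int.from_bytes(digest[4 * i:4 * i + 4], "big") % n
                for i in range(self.num_hashes)]

    def insert(self, item):
        for i in self._indexes(item):
            self.counters[i] += 1

    def delete(self, item):       # deletions are what distinguish CBFs
        for i in self._indexes(item):
            self.counters[i] -= 1

    def may_contain(self, item):  # the cheap membership test
        return all(self.counters[i] > 0 for i in self._indexes(item))
```

A hardware CBF answers may_contain without touching the large structure it summarizes, which is where the energy and delay savings come from.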
Abstract As the existing techniques that empower modern high-performance processors are refined and as the underlying technology trade-offs change, new bottlenecks are exposed and new challenges arise. This thesis introduces a new tool, Memory Dependence Prediction, that can be useful in combating these bottlenecks and meeting the new challenges. Memory dependence prediction is a technique for guessing whether a load or a store will experience a dependence.
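As a rough illustration of the idea (not the thesis's specific predictor), a PC-indexed table of saturating counters can guess whether a load will depend on an in-flight store; table size and counter policy below are assumptions.

```python
class MemoryDependencePredictor:
    """Sketch: predict whether a static load will depend on an in-flight
    store, based on the dependences it exhibited in the past."""
    def __init__(self, entries=4096):
        self.counters = [0] * entries  # 2-bit saturating counters

    def _index(self, load_pc):
        return load_pc % len(self.counters)

    def predict(self, load_pc):
        # True means: delay/synchronize this load, a dependence is likely.
        return self.counters[self._index(load_pc)] >= 2

    def update(self, load_pc, had_dependence):
        i = self._index(load_pc)
        if had_dependence:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
```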
Abstract On-chip last-level caches are increasing to tens of megabytes to accommodate applications with large memory footprints and to compensate for high memory latencies and limited off-chip bandwidth. This paper reviews two ongoing research efforts that exploit such large caches: coarse-grain cache management and predictor virtualization. Coarse-grain cache management collects and stores cache information at a large memory-region granularity (e.g., 1 KB to 8 KB).
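A minimal sketch of the coarse-grain bookkeeping, assuming 4 KB regions and 64-byte blocks (within the range the abstract mentions); the record layout is a hypothetical simplification.

```python
REGION_SIZE = 4096   # one record per 4 KB region (paper: 1 KB to 8 KB)
BLOCK_SIZE = 64      # assumed cache block size

class RegionTracker:
    """Keeps one record per memory region: a bitmap of which cache
    blocks inside the region have been touched."""
    def __init__(self):
        self.regions = {}

    def record_access(self, addr):
        region = addr // REGION_SIZE
        block = (addr % REGION_SIZE) // BLOCK_SIZE
        self.regions[region] = self.regions.get(region, 0) | (1 << block)

    def touched_blocks(self, addr):
        # Region-level information like this can drive bulk prefetch
        # or bypass decisions.
        return bin(self.regions.get(addr // REGION_SIZE, 0)).count("1")
```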
Abstract Using full-custom layouts in a 130 nm technology, this work studies how the latency and energy of a checkpointed, CAM-based Register Alias Table (cRAT) vary as a function of the window size, the issue width, and the number of embedded global checkpoints (GCs). These results are compared to those of the SRAM-based RAT (sRAT). Understanding these variations is useful during the early stages of architectural exploration, where physical-level information is not yet available.
Abstract Register renaming is a performance-critical component of modern, dynamically-scheduled processors. Register renaming latency increases as a function of several architectural parameters (e.g., processor issue width, window size, and checkpoint count). Pipelining the register renaming logic can help avoid restricting the processor clock frequency. This work presents a full-custom, two-stage register renaming implementation in a 130 nm fabrication technology.
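For readers unfamiliar with the structure being pipelined, the sketch below shows what a register alias table does during rename; it omits the checkpoints and the two-stage pipelining the paper implements, and the sizes are illustrative.

```python
class RegisterAliasTable:
    """Minimal rename sketch: map architectural registers to physical
    ones, allocating a fresh physical register per destination."""
    def __init__(self, num_arch=32, num_phys=128):
        self.alias = list(range(num_arch))           # arch -> phys mapping
        self.free = list(range(num_arch, num_phys))  # free physical regs

    def rename(self, sources, dest):
        phys_sources = [self.alias[s] for s in sources]  # read old mappings
        phys_dest = self.free.pop(0)  # assumes a free register is available
        self.alias[dest] = phys_dest  # later readers of `dest` see this
        return phys_sources, phys_dest
```

Rename latency grows with issue width because all instructions renamed in the same cycle must also check for dependences among themselves, which is the pressure the two-stage design relieves.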
Abstract This work proposes a novel checkpoint store compression method for giga-scale, coarse-grain checkpoint/restore. This mechanism can be useful for debugging, post-mortem analysis, and error recovery. The effectiveness of our compression method lies in exploiting value locality in the memory data and address streams. Previously proposed dictionary-based hardware compressors exploit the same properties; however, they are expensive and relatively slow.
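The abstract does not spell out the compression scheme, so the base+delta sketch below only illustrates how value locality in a stream shrinks a checkpoint log; the encoding widths and tag format are assumptions.

```python
def compress_stream(values, base_bytes=8, delta_bytes=2):
    """Illustrative base+delta compression: values close to their
    predecessor are stored as short deltas, others in full."""
    out = []
    prev = None
    for v in values:
        if prev is not None and abs(v - prev) < 2 ** (8 * delta_bytes - 1):
            out.append(("delta", v - prev))   # small delta: short encoding
        else:
            out.append(("base", v))           # emit the full value
        prev = v
    # One tag byte per record plus the payload width.
    size = sum(1 + (delta_bytes if kind == "delta" else base_bytes)
               for kind, _ in out)
    return out, size

# Nearby addresses/values compress well: mostly "delta" records here.
records, size = compress_stream([1000, 1004, 1008, 1016, 9_999_999])
```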
Abstract L1 instruction-cache misses pose a critical performance bottleneck in commercial server workloads. Cache access latency constraints preclude L1 instruction caches large enough to capture the application, library, and OS instruction working sets of these workloads. To cope with capacity constraints, researchers have proposed instruction prefetchers that use branch predictors to explore future control flow.
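A sketch of the branch-predictor-directed prefetching idea: run the predictors ahead of the fetch unit and prefetch the instruction blocks on the predicted path. The btb, direction, and icache objects are hypothetical interfaces, not any specific proposal's.

```python
def prefetch_ahead(pc, btb, direction, icache, depth=8, block=64):
    """Walk `depth` fetch blocks down the predicted control-flow path,
    issuing a prefetch for each block before fetch gets there."""
    for _ in range(depth):
        icache.prefetch(pc // block * block)   # align to the cache block
        target = btb.lookup(pc)
        if target is not None and direction.predict_taken(pc):
            pc = target                        # follow the predicted branch
        else:
            pc += 4                            # fall through sequentially
```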
Abstract We revisit the idea of using small line buffers in front of caches. We propose ReCast, a tiny tag set cache that filters a significant number of tag probes to the L2 tag array, thus reducing power. The key contribution in ReCast is S-Shift, a simple indexing function (no logic involved, just wires) that greatly improves the utility of line buffers at no additional hardware cost.
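The filtering mechanism is easy to sketch: a tiny buffer of recently probed L2 tag sets answers lookups before the full tag array is read. The plain modulo index below is a placeholder assumption, not S-Shift itself, whose exact wiring the abstract does not give.

```python
class TagSetCache:
    """Holds complete tag sets for a few recently probed L2 sets; a hit
    here resolves the lookup without reading the large L2 tag array."""
    def __init__(self, entries=8):
        self.entries = entries
        self.set_ids = [None] * entries
        self.set_tags = [None] * entries    # full tag list of that L2 set

    def _slot(self, set_index):
        # Placeholder; S-Shift would rearrange these bits with wiring only.
        return set_index % self.entries

    def probe(self, set_index, tag):
        s = self._slot(set_index)
        if self.set_ids[s] == set_index:
            return tag in self.set_tags[s]  # definitive answer, no L2 probe
        return None                         # not cached: probe the L2 tags

    def fill(self, set_index, tags):
        s = self._slot(set_index)
        self.set_ids[s], self.set_tags[s] = set_index, list(tags)
```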
Abstract We identify that typical programs exhibit highly regular read-after-read (RAR) memory dependence streams. We exploit this regularity by introducing read-after-read (RAR) memory dependence prediction. We also present two RAR memory dependence prediction-based memory latency reduction techniques. In the first technique, a load can obtain a value by simply naming a preceding load with which a RAR dependence is predicted.
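A minimal sketch of the training side of such a predictor; the tables and the value-forwarding datapath below are illustrative assumptions, not the paper's structures.

```python
class RARPredictor:
    """Learn read-after-read pairs (two loads reading the same address)
    so the later load can name the earlier one and reuse its value."""
    def __init__(self):
        self.last_load_for_addr = {}  # addr -> PC of most recent load
        self.pairs = {}               # consumer load PC -> producer load PC

    def train(self, load_pc, addr):
        producer = self.last_load_for_addr.get(addr)
        if producer is not None and producer != load_pc:
            self.pairs[load_pc] = producer   # observed a RAR dependence
        self.last_load_for_addr[addr] = load_pc

    def predict_producer(self, load_pc):
        # If a producer is predicted, the load can grab its value
        # without waiting for address calculation and a cache access.
        return self.pairs.get(load_pc)
```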
Abstract We propose power optimizations for the register renaming unit. Our optimizations reduce power dissipation in two ways. First, they reduce the number of read and write ports needed at the register alias table. Second, they reduce the number of internal checkpoints required to allow highly aggressive control speculation and rapid recovery from control-flow mis-speculations.
Abstract Many hardware optimizations rely on collecting information about program behavior at runtime. This information is stored in lookup tables, and to be accurate and effective, these optimizations usually require large dedicated on-chip tables. Although technology advances offer an increased amount of on-chip resources, these resources are typically allocated to increasing the size of conventional on-chip cache hierarchies.
Abstract We investigate instruction distribution methods for quad-cluster, dynamically-scheduled superscalar processors. We study a variety of methods with different cost, performance, and complexity characteristics. We investigate both non-adaptive and adaptive methods and their sensitivity to both inter-cluster communication latencies and pipeline depth. Furthermore, we develop a set of models that allow us to identify how well each method attacks issue-bandwidth and inter-cluster communication restrictions.
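One representative adaptive method is dependence-based steering with a load-balancing fallback, sketched below; the instruction object, the register-home map, and the occupancy list are hypothetical structures, not the paper's exact mechanisms.

```python
def steer(instr, reg_home, occupancy):
    """Send an instruction to the cluster holding its source operands,
    falling back to the least-loaded cluster when that is ambiguous."""
    source_clusters = {reg_home[src] for src in instr.sources
                       if src in reg_home}
    if len(source_clusters) == 1:
        choice = source_clusters.pop()   # keep dependent chains local
    else:
        # No or conflicting producers: balance issue bandwidth instead.
        choice = min(range(len(occupancy)), key=occupancy.__getitem__)
    reg_home[instr.dest] = choice
    occupancy[choice] += 1
    return choice
```

The tension the models capture is visible even here: the first branch minimizes inter-cluster communication, while the second spreads work to avoid issue-bandwidth bottlenecks.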
Abstract Reconfigurable hardware has the potential for significant performance improvements by providing support for application-specific operations. We report our experience with Chimaera, a prototype system that integrates a small and fast reconfigurable functional unit (RFU) into the pipeline of an aggressive, dynamically-scheduled superscalar processor. Chimaera is capable of performing 9-input/1-output operations on integer data.
Abstract Modern processors use Branch Target Buffers (BTBs) to predict the target address of branches so that they can fetch ahead in the instruction stream, increasing concurrency and performance. Ideally, BTBs would be large enough to capture the entire working set of the application and small enough for fast access and practical on-chip dedicated storage. Depending on the application, these requirements are at odds. For example, commercial applications that exhibit large instruction footprints benefit from large BTBs.
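As background, a direct-mapped BTB can be sketched in a few lines; real BTBs use partial tags and set associativity, so treat the organization below as illustrative.

```python
class BranchTargetBuffer:
    """Direct-mapped BTB sketch: one (tag, predicted target) per entry."""
    def __init__(self, entries=512):
        self.entries = entries
        self.tags = [None] * entries
        self.targets = [0] * entries

    def lookup(self, pc):
        # A hit lets fetch redirect to the target this same cycle.
        i = pc % self.entries
        return self.targets[i] if self.tags[i] == pc else None

    def update(self, pc, target):
        i = pc % self.entries
        self.tags[i], self.targets[i] = pc, target
```

The capacity/latency tension in the abstract shows up directly in `entries`: too small and large-footprint workloads thrash the table; too large and the lookup no longer fits the fetch pipeline.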
Abstract We present a number of power-aware instruction front-end (fetch/decode) throttling methods for high-performance, dynamically-scheduled superscalar processors. Our methods reduce power dissipation by selectively turning instruction fetch and decode on and off. Moreover, they have a negligible impact on performance, as they deliver instructions just in time for exploiting the available parallelism. Previously proposed front-end throttling methods rely on branch prediction confidence estimation.
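A sketch of one plausible "just in time" gating rule; the heuristic, its inputs, and the slack term are illustrative assumptions, not the paper's specific methods.

```python
def front_end_enabled(in_flight, recent_issue_rate, decode_width, slack=8):
    """Keep fetching/decoding only while the backlog of un-issued
    instructions is small relative to what the back end has recently
    been able to consume; otherwise gate the front end off to save power."""
    demand = recent_issue_rate * slack   # instructions likely needed soon
    return in_flight < max(demand, decode_width)
```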
Abstract We introduce a dynamic scheme that captures the access patterns of linked data structures and can be used to predict future accesses with high accuracy. Our technique exploits the dependence relationships that exist between loads that produce addresses and loads that consume these addresses. By identifying producer-consumer pairs, we construct a compact internal representation for the associated structure and its traversal.
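The pair-detection step can be sketched as follows: a load is a producer if the value it loaded later shows up as another load's address. The structures below are illustrative, not the paper's tables.

```python
class PointerPairDetector:
    """Detect producer-consumer load pairs: a load whose loaded value is
    later used as the address of another load, the signature of a linked
    data structure traversal."""
    def __init__(self):
        self.value_to_pc = {}  # recently loaded value -> producer load PC
        self.pairs = set()     # (producer PC, consumer PC)

    def observe_load(self, pc, addr, value):
        producer = self.value_to_pc.get(addr)
        if producer is not None:
            self.pairs.add((producer, pc))  # addr came from an earlier load
        self.value_to_pc[value] = pc        # value may become an address
```

Once the pairs are known, the traversal can be replayed ahead of the program to prefetch the next nodes.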
Abstract A key challenge in architecting a CMP with many cores is maintaining cache coherence in an efficient manner. Directory-based protocols avoid the bandwidth overhead of snoop-based protocols and therefore scale to a large number of cores. Unfortunately, conventional directory structures incur significant area overheads in larger CMPs. The tagless coherence directory (TL) is a scalable coherence solution that uses an implicit, conservative representation of sharing information.
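To make "implicit, conservative" concrete, here is a heavily simplified sketch: one small hashed bit-vector per core instead of per-block tagged entries, so a lookup returns a superset of the true sharers. The organization below is an illustrative assumption, and eviction handling (which the real design keeps in sync with the caches) is omitted.

```python
class ConservativeDirectory:
    """Directory sketch with no per-block tags: each core gets a hashed
    bit-vector summarizing the blocks it may cache."""
    def __init__(self, cores=16, buckets=4096, block=64):
        self.vectors = [[False] * buckets for _ in range(cores)]
        self.block = block

    def _index(self, addr):
        return (addr // self.block) % len(self.vectors[0])

    def track(self, core, addr):
        self.vectors[core][self._index(addr)] = True

    def possible_sharers(self, addr):
        # A superset of the true sharers: correct but may over-invalidate.
        i = self._index(addr)
        return [c for c, vec in enumerate(self.vectors) if vec[i]]
```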
Abstract Graphics processors (GPUs) offer the promise of more than an order of magnitude speedup over conventional processors for certain non-graphics computations. Because the GPU is often presented as a C-like abstraction (e.g., Nvidia's CUDA), little is known about the characteristics of the GPU's architecture beyond what the manufacturer has documented. This work develops a microbenchmark suite and measures the CUDA-visible architectural characteristics of the Nvidia GT200 (GTX280) GPU.
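The core methodology of such suites is timing dependent accesses. The Python sketch below only illustrates the pointer-chase technique; the actual measurements are made in CUDA on the GPU, where latency jumps as the footprint crosses a cache capacity reveal that cache's size.

```python
import time

def pointer_chase(size_bytes, stride_bytes=64, iters=200_000):
    """Chase a pointer ring with the given footprint and stride; each
    access depends on the previous one, so accesses cannot overlap and
    the average time per step approximates the load-use latency."""
    n = max(1, size_bytes // 8)
    step = max(1, stride_bytes // 8)
    nxt = [(i + step) % n for i in range(n)]  # the pointer ring
    i = 0
    t0 = time.perf_counter()
    for _ in range(iters):
        i = nxt[i]                            # dependent "loads" serialize
    return (time.perf_counter() - t0) / iters
```

Sweeping size_bytes and plotting the per-step time is how cache sizes, line sizes, and TLB reach are inferred.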
Abstract We propose methods for reducing the energy consumed by snoop requests in snoopy, bus-based symmetric multiprocessor (SMP) systems. Observing that a large fraction of snoops do not find copies in many of the other caches, we introduce JETTY, a small, cache-like structure. A JETTY is placed between the bus and the L2 backside of each processor, where it filters the vast majority of snoops that would not find a locally cached copy.
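A sketch of one way such a filter can work: track a superset of the locally cached blocks so that a snoop missing in the filter provably cannot hit in the cache and the expensive tag lookup is skipped. The hashed-counter organization below is an illustrative assumption (JETTY has several variants).

```python
class SnoopFilter:
    """Per-node snoop filter: a small array of counters summarizing the
    blocks in the local cache hierarchy."""
    def __init__(self, buckets=4096, block=64):
        self.counts = [0] * buckets
        self.block = block

    def _index(self, addr):
        return (addr // self.block) % len(self.counts)

    def block_filled(self, addr):    # called on every cache fill
        self.counts[self._index(addr)] += 1

    def block_evicted(self, addr):   # called on every eviction
        self.counts[self._index(addr)] -= 1

    def snoop_may_hit(self, addr):
        # False is definitive: skip the cache tag lookup entirely.
        return self.counts[self._index(addr)] > 0
```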
Abstract Designers have invested much effort in developing accurate branch predictors with short learning periods. Such techniques rely on exploiting complex and relatively large structures. Although exploiting such structures is necessary to achieve high accuracy and fast learning, once the short learning phase is over, a simple structure can efficiently predict the branch outcome for the majority of branches.
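The kind of simple structure that suffices for most branches after training is a bimodal table of 2-bit saturating counters, sketched below with illustrative sizing.

```python
class BimodalPredictor:
    """One 2-bit saturating counter per entry: cheap, low-power, and
    accurate for the many branches that are strongly biased."""
    def __init__(self, entries=1024):
        self.counters = [2] * entries  # start weakly taken

    def predict(self, pc):
        return self.counters[pc % len(self.counters)] >= 2  # taken?

    def update(self, pc, taken):
        i = pc % len(self.counters)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
```

Steering stable branches to a table like this, and reserving the large structure for the hard ones, is the energy opportunity the abstract points at.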