Design and optimization of large size and low overhead off-chip caches
2004, IEEE Transactions on Computers
Large off-chip L3 caches can significantly improve the performance of memory-intensive applications. However, conventional L3 SRAM caches face two issues as those applications require increasingly large caches. First, an SRAM cache has a limited size due to the low density and high cost of SRAM and thus cannot hold the working sets of many memory-intensive applications. Second, since the tag-checking overhead of large caches is nontrivial, the existence of an L3 cache increases the cache miss penalty and may even harm the performance of some memory-intensive applications. To address these two issues, we present a new memory hierarchy design that uses cached DRAM to construct a large, low-overhead off-chip cache. The high-density DRAM portion of the cached DRAM can hold large working sets, while the small SRAM portion exploits the spatial locality in L2 miss streams to reduce access latency. The L3 tag array is placed off-chip with the data array, minimizing the L3 area overhead on the processor, while a small tag cache placed on-chip effectively removes the off-chip tag access overhead. A prediction technique accurately predicts the hit/miss status of an access to the cached DRAM, further reducing the access latency. Using execution-driven simulations of a 2GHz, 4-way issue processor running 11 memory-intensive programs from the SPEC 2000 benchmark suite, we show that a system with a cached DRAM of 64MB DRAM and a 128KB on-chip SRAM cache as the off-chip cache outperforms the same system with an 8MB SRAM L3 off-chip cache by up to 78 percent in total execution time. The average speedup of the system with the cached-DRAM off-chip cache is 25 percent over the system with the L3 SRAM cache.
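To make the tag-cache and predictor interaction concrete, the sketch below models a direct-mapped cached-DRAM L3 with an on-chip tag cache and a 2-bit saturating-counter hit/miss predictor. The structure sizes, the direct-mapped organization, and the predictor indexing scheme are illustrative assumptions, not the organization evaluated in the paper.

```python
# Illustrative sketch (not the paper's exact design): an on-chip tag cache
# backed by an off-chip L3 tag array, plus a simple saturating-counter
# hit/miss predictor indexed by block address.

BLOCK_BITS = 6            # 64-byte L3 blocks (assumed)
TAG_CACHE_SETS = 256      # small on-chip tag cache (assumed)
PREDICTOR_ENTRIES = 1024  # hit/miss predictor table size (assumed)

class CachedDramL3Model:
    def __init__(self):
        self.off_chip_tags = {}                    # set index -> tag kept with the DRAM data array
        self.tag_cache = {}                        # on-chip copies of recently used L3 tags
        self.predictor = [2] * PREDICTOR_ENTRIES   # 2-bit counters, initially weakly predict "hit"

    def _split(self, addr):
        block = addr >> BLOCK_BITS
        return block % TAG_CACHE_SETS, block // TAG_CACHE_SETS  # (set index, tag)

    def predict_hit(self, addr):
        """Consulted before the off-chip access is issued."""
        return self.predictor[(addr >> BLOCK_BITS) % PREDICTOR_ENTRIES] >= 2

    def access(self, addr):
        idx, tag = self._split(addr)
        if self.tag_cache.get(idx) == tag:
            # Fast path: tag found on-chip, no off-chip tag check needed.
            hit = True
        else:
            # Slow path: check the off-chip tag array and refill the on-chip tag cache
            # (tag-cache capacity management omitted for brevity).
            hit = self.off_chip_tags.get(idx) == tag
            if hit:
                self.tag_cache[idx] = tag
        # Train the 2-bit saturating counter with the actual outcome.
        p = (addr >> BLOCK_BITS) % PREDICTOR_ENTRIES
        self.predictor[p] = min(self.predictor[p] + 1, 3) if hit else max(self.predictor[p] - 1, 0)
        if not hit:
            self.off_chip_tags[idx] = tag          # allocate the block on a miss
            self.tag_cache[idx] = tag
        return hit

l3 = CachedDramL3Model()
print(l3.access(0x4000))   # False: cold miss, block allocated
print(l3.access(0x4000))   # True: now hits via the on-chip tag cache
```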
Related papers
2017
In computer systems, the cache memory architecture has a significant impact on both system performance and system cost. Further, the gap between processor performance and cache memory performance is widening, to the detriment of overall system performance. In this paper, we explore the important aspects that affect cache memory architecture performance and cost, including: (1) an overview of present state-of-the-art cache memory architectures; (2) the latest advances in cache controllers and energy management; (3) important aspects of cache memory organization, including cache mapping, spatial cache, and temporal cache techniques; (4) an analysis of the performance of state-of-the-art cache memory architecture implementations, including promising new memory technologies; and (5) future research areas that may prove promising in narrowing the performance gap between cache memory performance and processor performance. Overall, improv...
IEEE Transactions on Circuits and Systems I: Regular Papers, 2017
As 2.5D/3D die-stacking technology emerges, stacked dynamic random access memory (DRAM) has been proposed as a cache, due to its large capacity, to bridge the latency gap between off-chip memory and SRAM caches. The main problems in utilizing a DRAM cache are the high tag storage overhead and the high lookup latency. To address these, we propose storing tags in eDRAM (embedded DRAM), exploiting its higher density and lower latency. This paper presents an eTag DRAM cache architecture built around a novel tag-comparison-in-memory scheme to achieve direct data access: it eliminates tag-access latency and comparison power by pushing the tag comparison into the sense amplifiers. Furthermore, we propose a Merged Tag scheme that enhances the eTag DRAM cache by comparing last-level cache tags and DRAM cache tags in parallel. Simulation results show that the eTag DRAM cache improves energy efficiency by 15.4% and 33.9% in 4-core and 8-core workloads, respectively. Additionally, the Merged Tag achieves 32.1% and 48.7% energy efficiency improvements in 4-core and 8-core workloads, respectively.
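As a rough illustration of the merged-tag idea described above, the sketch below models a single lookup that compares a last-level-cache tag and a DRAM-cache tag for the same address; in hardware the two comparisons proceed in parallel, while the software model only captures the outcome. The set counts, direct-mapped indexing, and return encoding are assumptions for illustration, not the paper's implementation.

```python
# Conceptual sketch of a merged tag lookup: check LLC tags and DRAM-cache tags
# in the same step instead of serially.

BLOCK_BITS = 6
LLC_SETS = 512
DRAM_CACHE_SETS = 8192

def merged_tag_lookup(addr, llc_tags, dram_cache_tags):
    """Return 'llc', 'dram', or 'miss' from a single merged tag comparison."""
    block = addr >> BLOCK_BITS
    llc_hit = llc_tags.get(block % LLC_SETS) == block // LLC_SETS
    dram_hit = dram_cache_tags.get(block % DRAM_CACHE_SETS) == block // DRAM_CACHE_SETS
    # In hardware both comparisons happen in parallel; software just models the outcome.
    if llc_hit:
        return "llc"
    if dram_hit:
        return "dram"
    return "miss"

# Example: a line resident only in the DRAM cache.
block = 0xDEAD40 >> BLOCK_BITS
dram = {block % DRAM_CACHE_SETS: block // DRAM_CACHE_SETS}
print(merged_tag_lookup(0xDEAD40, {}, dram))   # -> "dram"
```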
An Overview of Hardware Based Cache Optimization Techniques
Cache memory is a high-speed semiconductor memory that acts as a buffer between the CPU and main memory. In current-generation processors, processor-memory bandwidth is the main bottleneck, because a number of processor cores share the same processor-memory interface or bus. The on-chip memory hierarchy is an important resource that should be managed efficiently against the rising performance gap between processor and memory. This paper provides a comprehensive survey of techniques to improve cache performance in terms of miss rate, hit rate, latency, efficiency, and cost.
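The metrics listed above are commonly tied together by the average memory access time (AMAT) relation, AMAT = hit time + miss rate × miss penalty, applied recursively per level. The sketch below shows the relation with assumed latencies and miss rates, purely for illustration.

```python
# Standard AMAT relation applied across two cache levels; all numbers are
# assumed values for illustration, not results from the survey.

def amat(hit_time, miss_rate, miss_penalty):
    return hit_time + miss_rate * miss_penalty

# L1 backed by L2 backed by DRAM (latencies in cycles, assumed).
l2_amat = amat(hit_time=12, miss_rate=0.30, miss_penalty=200)   # L2 misses go to DRAM
l1_amat = amat(hit_time=2,  miss_rate=0.05, miss_penalty=l2_amat)
print(f"L1-level AMAT ≈ {l1_amat:.1f} cycles")
```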
On the design of on-chip instruction caches
Microprocessors and Microsystems, 1988
In designing VLSI microprocessors, on-chip caching provides performance advantages but severely constrains layout flexibility. Carl McCrosky and Brian yen der Buhs examine the design tradeoffs involved and outline an appropriate caching strategy. Cache memories reduce memory latency and traffic in computing systems. Most existing caches are implemented as board-based systems. Advancing VLSI technology will soon permit significant caches to be integrated on chip with the processors they support. In designing on-chip caches, the constraints of VLSI become significant. The primary constraints are economic limitations on circuit area and off-chip communications. The paper explores the design of on-chip instruction-only caches in terms of these constraints. The primary contribution of this work is the development of a unified economic model of on-chip instruction-only cache design that integrates the points of view of the cache designer and of the floorplan architect. With suitable data, this model permits the rational allocation of constrained resources to achieve a desired cache performance. Specific conclusions are that random line replacement is superior to LRU replacement, due to the increased flexibility it allows in VLSI floorplan design; that variable set associativity can be an effective tool in regulating a chip's floorplan; and that sectoring permits area-efficient caches while avoiding high transfer widths. Results are reported on the economic function from chip area and transfer width to miss ratio. These results, or the underlying analysis, can be used by microprocessor architects to make intelligent decisions regarding appropriate cache organizations and resource allocations. Keywords: microprocessors, cache memory, VLSI.
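One of the conclusions mentioned above, that random line replacement can compete with LRU, is easy to probe with a toy simulator. The sketch below compares miss ratios of the two policies on a synthetic address trace; the cache geometry and the trace are assumptions and are not drawn from the paper.

```python
# Toy comparison of random versus LRU line replacement in a small
# set-associative instruction cache (all parameters assumed).

import random
from collections import OrderedDict

SETS, WAYS, BLOCK = 64, 2, 32

def simulate(trace, policy):
    sets = [OrderedDict() for _ in range(SETS)]
    misses = 0
    for addr in trace:
        block = addr // BLOCK
        s, tag = sets[block % SETS], block // SETS
        if tag in s:
            s.move_to_end(tag)               # maintain recency order for LRU
        else:
            misses += 1
            if len(s) >= WAYS:
                victim = next(iter(s)) if policy == "lru" else random.choice(list(s))
                del s[victim]
            s[tag] = True
    return misses / len(trace)

trace = [random.randrange(64 * 1024) for _ in range(20000)]   # synthetic addresses
print("LRU miss ratio:   ", simulate(trace, "lru"))
print("Random miss ratio:", simulate(trace, "random"))
```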
A Study of Reconfigurable Split Data Caches and Instruction Caches
2006
In this paper we show that cache memories for embedded applications can be designed to both increase performance and reduce energy consumption. We show that using separate (data) caches for indexed or stream data and for scalar data items can lead to substantial improvements in terms of cache misses. The sizes of the various cache structures should be customized to meet applications' needs. We show that reconfigurable split data caches can be designed to meet wide-ranging embedded applications' performance, energy, and silicon area budgets. The optimal cache organizations can lead to, on average, 62% and 49% reductions in overall cache size, 37% and 21% reductions in cache access time, and 47% and 52% reductions in power consumption for the instruction and data caches, respectively, when compared to an 8KB instruction cache and an 8KB unified data cache for media benchmarks from the MiBench suite.
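A minimal sketch of the split-data-cache idea follows: each access is classified as stream/indexed or scalar and routed to a separate structure. The address-region classification heuristic and the cache sizes are illustrative assumptions, not the paper's mechanism.

```python
# Hedged sketch: route indexed/stream accesses and scalar accesses to
# separate cache structures (classification and sizes are assumed).

STREAM_REGION_START = 0x1000_0000   # assumed: arrays/streams allocated here

class SplitDataCache:
    def __init__(self, stream_blocks=128, scalar_blocks=32, block=32):
        self.block = block
        self.stream = set()   # block addresses of stream/indexed data
        self.scalar = set()   # block addresses of scalar data
        self.limits = {"stream": stream_blocks, "scalar": scalar_blocks}

    def access(self, addr):
        kind = "stream" if addr >= STREAM_REGION_START else "scalar"
        cache = self.stream if kind == "stream" else self.scalar
        blk = addr // self.block
        hit = blk in cache
        if not hit:
            if len(cache) >= self.limits[kind]:
                cache.pop()                  # crude eviction, for illustration only
            cache.add(blk)
        return kind, hit

dc = SplitDataCache()
print(dc.access(0x1000_0040))   # ('stream', False) on the first touch
print(dc.access(0x0000_0040))   # ('scalar', False) routed to the scalar cache
```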
2014 47th Annual IEEE/ACM International Symposium on Microarchitecture, 2014
This paper analyzes the trade-offs in architecting stacked DRAM either as part of main memory or as a hardware-managed cache. Using stacked DRAM as part of main memory increases the effective capacity, but obtaining high performance from such a system requires Operating System (OS) support to migrate data at page granularity. Using stacked DRAM as a hardware cache has the advantages of being transparent to the OS and performing data management at line granularity, but it suffers from reduced main memory capacity because the stacked DRAM cache is not part of the memory address space. Ideally, we want the stacked DRAM to contribute to the capacity of main memory and still maintain the hardware-based fine granularity of a cache. We propose CAMEO, a hardware-based CAche-like MEmory Organization that not only makes stacked DRAM visible as part of the memory address space but also exploits data locality on a fine-grained basis. CAMEO retains recently accessed data lines in stacked DRAM and swaps out the victim line to off-chip memory. Since CAMEO can change the physical location of a line dynamically, we propose a low-overhead Line Location Table (LLT) that tracks the physical location of all data lines. We also propose an accurate Line Location Predictor (LLP) to avoid serializing the LLT lookup and the memory access. We evaluate a system that has 4GB of stacked memory and 12GB of off-chip memory. Using stacked DRAM as a cache improves performance by 50%, using it as part of main memory improves performance by 33%, whereas CAMEO improves performance by 78%. Our proposed design comes very close to an idealized memory system that uses the 4GB of stacked DRAM as a hardware-managed cache and also increases the main memory capacity by an additional 4GB.
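The sketch below illustrates a CAMEO-style Line Location Table in miniature: each line has exactly one physical home, and touching an off-chip line swaps it into its stacked-DRAM slot while the previous occupant moves off-chip. The direct-mapped slot choice and the tiny sizes are assumptions for demonstration only, not the paper's exact mechanism.

```python
# Miniature model of a Line Location Table: track where each line currently
# lives and swap lines between stacked DRAM and off-chip memory on access.

STACKED_SLOTS = 4      # tiny for demonstration; real hardware has millions

class LineLocationTable:
    def __init__(self, total_lines):
        # location[line] is "stacked" if the line currently resides in the
        # stacked DRAM slot (line % STACKED_SLOTS), else "offchip".
        self.location = {l: "offchip" for l in range(total_lines)}
        self.slot_owner = {}                       # stacked slot -> resident line
        for l in range(min(STACKED_SLOTS, total_lines)):
            self.location[l] = "stacked"
            self.slot_owner[l % STACKED_SLOTS] = l

    def access(self, line):
        if self.location[line] == "stacked":
            return "fast hit in stacked DRAM"
        # Miss in stacked DRAM: bring the line in and evict the current occupant.
        slot = line % STACKED_SLOTS
        victim = self.slot_owner.get(slot)
        if victim is not None:
            self.location[victim] = "offchip"      # victim now lives off-chip
        self.location[line] = "stacked"
        self.slot_owner[slot] = line
        return "served from off-chip memory, line swapped into stacked DRAM"

llt = LineLocationTable(total_lines=16)
print(llt.access(9))   # off-chip line -> swapped in
print(llt.access(9))   # now resident in stacked DRAM
```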
Analysis of the Effectiveness of a Third Level Cache
With the increasing availability of on-chip area for caches, most contemporary microprocessors have moved the L2 cache onto the processor chip and added an L3 cache as a way of providing faster access to system memory to meet the performance needs of advanced processors such as the Pentium and PowerPC. We want to determine whether the inclusion of an additional level of cache memory is a solution to the problem of slow memory access. In this paper, we analyze the effectiveness of an L3 cache in a uniprocessor. Simulation studies with the SPECint2006 and SPECfp2006 benchmarks show that a 4MB L3 with 8-way set associativity, running in shared mode on a uniprocessor, achieves over a 96% hit rate, decreasing the average memory access time (AMAT) by 6.5% and reducing the traffic by 95% relative to a system without an L3. There is a dramatic reduction in bus traffic and thus an improvement in AMAT.
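As a worked illustration of how an extra cache level changes AMAT, the sketch below compares a two-level hierarchy against one with a 96%-hit-rate L3; all latencies and miss rates are assumed values, not the paper's measurements.

```python
# Worked with/without-L3 AMAT comparison using assumed cycle counts.

def amat(hit_time, miss_rate, miss_penalty):
    return hit_time + miss_rate * miss_penalty

DRAM_LATENCY = 200                      # cycles, assumed
without_l3 = amat(2, 0.05, amat(12, 0.40, DRAM_LATENCY))
with_l3    = amat(2, 0.05, amat(12, 0.40, amat(40, 0.04, DRAM_LATENCY)))  # 96% L3 hit rate
print(f"AMAT without L3: {without_l3:.2f} cycles")
print(f"AMAT with L3:    {with_l3:.2f} cycles")
```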