Reducing set-associative cache energy via way-prediction and selective direct-mapping
Related papers
2008
Way-prediction is another effective approach that speculatively selects a way to access before making a normal cache access. Figures 1b and 1c illustrate the access patterns for phased and way-prediction n-way set-associative caches. Compared with the conventional implementation, the phased cache probes only one data subarray instead of n data subarrays (each way comprises a tag subarray and a data subarray). However, accessing tag and data sequentially increases the cache access latency. The way-prediction cache first accesses the tag and data subarrays of the predicted way; if the prediction is incorrect, it then probes the remaining tag and data subarrays simultaneously. An access in a phased cache consumes more energy and has longer latency than a correctly predicted access in a way-prediction cache, but consumes less energy than a mispredicted access. Hence, when the prediction accuracy is high, the way-prediction cache is more energy-efficient than the phased cache.
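A back-of-the-envelope model makes this trade-off concrete. The sketch below uses hypothetical per-subarray energy costs (not taken from the paper) to compare the expected energy of a conventional parallel-access cache, a phased cache, and a way-prediction cache as prediction accuracy varies.

```python
# Hypothetical per-subarray energy costs (arbitrary units), for illustration only.
E_TAG = 1.0    # energy to probe one tag subarray
E_DATA = 4.0   # energy to probe one data subarray

def conventional(n_ways):
    # All n tag and data subarrays are probed in parallel.
    return n_ways * (E_TAG + E_DATA)

def phased(n_ways):
    # Phase 1: probe all tag subarrays; phase 2: probe the single matching data subarray.
    return n_ways * E_TAG + E_DATA

def way_predicted(n_ways, accuracy):
    # Correct prediction: one tag + one data subarray.
    hit = E_TAG + E_DATA
    # Misprediction: the predicted way first, then the remaining n-1 ways in parallel.
    miss = hit + (n_ways - 1) * (E_TAG + E_DATA)
    return accuracy * hit + (1 - accuracy) * miss

if __name__ == "__main__":
    n = 4
    print(f"conventional  : {conventional(n):.1f}")
    print(f"phased        : {phased(n):.1f}")
    for acc in (0.70, 0.85, 0.95):
        print(f"way-pred @{acc:.0%}: {way_predicted(n, acc):.1f}")
```

With these illustrative numbers the way-prediction cache beats the phased cache only once the accuracy is high, which is the qualitative point made above.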
Improving Cache Power Efficiency with an Asymmetric Set-Associative Cache
2001
Data caches are widely used in general-purpose processors as a means to hide long memory latencies. Set-associativity in these caches helps programs avoid performance problems due to cache mapping conflicts. Current set-associative caches are symmetric in the sense that each way has the same number of cache lines. Moreover, each way is searched in parallel, so energy is consumed by all the ways even though at most one way will hit. With this in mind, this paper proposes an asymmetric cache structure in which the size of each way can be different. The ways of the cache are different powers of two, and allow for a "tree-structured" cache in which extra associativity can be shared. We accomplish this by having two cache blocks from the large ways align with individual cache blocks in the smaller ways. This structure achieves performance comparable to a conventional cache of similar size and equal associativity. Most notably, the asymmetric cache has the nice property that accesses that hit in the smaller ways can immediately terminate accesses to the larger ways so that power can be saved. For the SPEC2000 benchmarks, we found that cache energy per access was reduced by as much as 23% on average. The characteristics of the asymmetric set-associative design (low power, uncompromised performance, compact layout) make it particularly attractive for low-power processors.
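As an illustration of the tree-structured indexing described above, the following sketch models ways of different power-of-two sizes and counts how many ways' worth of probe energy are spent before a hit. The sizes and the smallest-first probe order are assumptions for clarity; in hardware all ways start in parallel and a hit in a smaller (faster) way cancels the accesses of the larger ones.

```python
# A minimal sketch of asymmetric way indexing (illustrative sizes, not the paper's).
# Each way holds a different power-of-two number of blocks, so the same block
# address maps to set (block_addr mod way_size) in each way.

BLOCK_BITS = 6  # 64-byte blocks (assumed)

class AsymmetricCache:
    def __init__(self, way_sizes=(256, 512, 1024)):  # blocks per way, powers of two
        self.ways = [dict() for _ in way_sizes]       # set index -> tag, per way
        self.way_sizes = way_sizes

    def _index_tag(self, addr, way):
        block = addr >> BLOCK_BITS
        size = self.way_sizes[way]
        return block % size, block // size

    def lookup(self, addr):
        # Probe ways from smallest to largest; a hit in a smaller way lets us
        # skip ("terminate") the probes of the larger ways, saving their energy.
        probes = 0
        for w in range(len(self.ways)):
            probes += 1
            idx, tag = self._index_tag(addr, w)
            if self.ways[w].get(idx) == tag:
                return True, probes
        return False, probes

    def fill(self, addr, way):
        idx, tag = self._index_tag(addr, way)
        self.ways[way][idx] = tag

if __name__ == "__main__":
    c = AsymmetricCache()
    c.fill(0x12345, way=0)
    print(c.lookup(0x12345))   # hit in the smallest way after a single probe
```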
2001
Caches are partitioned into subarrays for optimal timing. In a set-associative cache, if the way holding the data is known before an access, only the subarrays for that way need to be accessed. The reduction in cache switching activity results in energy savings. In this paper, we propose to extend the branch prediction framework to enable way-footprint prediction. The next fetch address and its way-footprint are predicted simultaneously for one-way instruction cache access. Because the way-footprint prediction shares some prediction hardware with the branch prediction, the additional hardware cost is small. To enlarge the number of one-way cache accesses, we have made modifications to the branch prediction. Specifically, we have investigated three BTB allocation policies. Each policy results in average energy savings of 29%, 33% and 62%, with normalized execution times of 1, 1, and 1.001, respectively.
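A minimal sketch of the shared-hardware idea, under the assumption that each BTB entry is simply widened with the I-cache way of the predicted fetch target; the paper's actual table organization and allocation policies may differ.

```python
# Assumed structures for illustration: a BTB entry extended with the I-cache way
# last used by the target fetch block, so the next fetch address and its
# "way footprint" are produced by the same lookup.

class BTBEntry:
    def __init__(self, target, way):
        self.target = target   # predicted next fetch address
        self.way = way         # predicted I-cache way for that fetch block

class WayFootprintBTB:
    def __init__(self, entries=512):
        self.entries = entries
        self.table = {}

    def predict(self, pc):
        e = self.table.get(pc % self.entries)
        if e is None:
            return None, None          # no prediction: fall back to an all-way access
        return e.target, e.way         # one-way I-cache access

    def update(self, pc, actual_target, actual_way):
        self.table[pc % self.entries] = BTBEntry(actual_target, actual_way)

btb = WayFootprintBTB()
btb.update(pc=0x400, actual_target=0x480, actual_way=2)
print(btb.predict(0x400))   # predicted target and the single I-cache way to enable
```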
Dynamic Associative Caches: Reducing Dynamic Energy of First Level Caches
We propose the Dynamic Associative Cache (DAC), a low-complexity design to improve the energy efficiency of data caches with negligible performance overhead. The key idea of DAC is to perform dynamic adaptation of cache associativity, switching the cache operation between direct-mapped and set-associative regimes during program execution. To monitor the program's needs in terms of cache associativity, the DAC design employs a subset of shadow tags: when the main cache operates in the set-associative mode, the shadow tags operate in the direct-mapped mode, and vice versa. The difference in hit rates between the main tags and the shadow tags is used as an indicator for cache mode switching. We show that DAC performs most of its accesses in the direct-mapped mode, resulting in significant energy savings, while maintaining performance close to that of a set-associative L1 D-cache.
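The mode-switching policy can be sketched as follows; the decision interval and hit-rate margin are illustrative assumptions, not values from the paper.

```python
# A minimal sketch of a DAC-style controller: the main tags run in one mode while
# sampled shadow tags run in the other, and the controller flips modes when the
# shadow configuration would have hit noticeably more often.

SWITCH_INTERVAL = 100_000   # accesses per decision window (assumed)
HIT_RATE_MARGIN = 0.01      # required hit-rate advantage to switch (assumed)

class DacController:
    def __init__(self):
        self.mode = "direct-mapped"   # main-cache mode; shadow tags run the other mode
        self.main_hits = self.shadow_hits = self.accesses = 0

    def record(self, main_hit, shadow_hit):
        self.accesses += 1
        self.main_hits += main_hit
        self.shadow_hits += shadow_hit
        if self.accesses >= SWITCH_INTERVAL:
            self._decide()

    def _decide(self):
        main_rate = self.main_hits / self.accesses
        shadow_rate = self.shadow_hits / self.accesses
        if shadow_rate > main_rate + HIT_RATE_MARGIN:
            self.mode = ("set-associative" if self.mode == "direct-mapped"
                         else "direct-mapped")
        self.main_hits = self.shadow_hits = self.accesses = 0
```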
Reducing Cache Hierarchy Energy Consumption by Predicting Forwarding and Disabling Associative Sets
Journal of Circuits, Systems and Computers, 2012
The first-level data cache in modern processors has become a major consumer of energy due to its increasing size and high frequency of access. In order to reduce this high energy consumption, we propose in this paper a straightforward filtering technique based on a highly accurate forwarding predictor. Specifically, a simple structure predicts whether a load instruction will obtain its corresponding data via forwarding from the load/store structure, thus avoiding the data cache access, or whether it will be provided by the data cache. This mechanism reduces data cache energy consumption by an average of 21.5% with a negligible performance penalty of less than 0.1%. Furthermore, in this paper we also address static cache energy consumption by disabling a portion of the sets of the associative L2 cache. Overall, when merging both proposals, the combined L1 and L2 total energy consumption is reduced by an average of 29.2% with a performance penalty of just 0.25%.
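A sketch of the filtering idea, assuming a table of 2-bit saturating counters indexed by the load's PC; the paper's exact predictor organization is not reproduced here.

```python
# Hypothetical load-forwarding filter: if the predictor says the load will be
# satisfied by store-to-load forwarding, the L1 D-cache access is gated off.

TABLE_SIZE = 1024  # predictor entries (assumed)

class ForwardingPredictor:
    def __init__(self):
        self.counters = [0] * TABLE_SIZE   # 2-bit saturating counters

    def _idx(self, pc):
        return (pc >> 2) % TABLE_SIZE

    def predict_forwarded(self, pc):
        # A counter value >= 2 means "will be forwarded from the load/store structure",
        # so the data cache is not accessed for this load.
        return self.counters[self._idx(pc)] >= 2

    def update(self, pc, was_forwarded):
        i = self._idx(pc)
        if was_forwarded:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
```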
Reducing power consumption for high-associativity data caches in embedded processors
2003 Design, Automation and Test in Europe Conference and Exhibition, 2003
Modern embedded processors use data caches with higher and higher degrees of associativity in order to increase performance. A set-associative data cache consumes a significant fraction of the total power budget in such embedded processors. This paper describes a technique for reducing D-cache power consumption and shows its impact on the power and performance of an embedded processor. The technique utilizes cache line address locality to determine (rather than predict) the cache way prior to the cache access. It thus allows only the desired way to be accessed for both tags and data. The proposed mechanism is shown to reduce the average L1 data cache power consumption when running the MiBench embedded benchmark suite for 8-, 16- and 32-way set-associative caches by, respectively, an average of 66%, 72% and 76%. The absolute power savings from this technique increase significantly with associativity. The design has no impact on performance and, given that it does not have misprediction penalties, it does not introduce any new non-deterministic behavior in program execution.
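The determination (rather than prediction) aspect can be sketched with a small table that remembers recently accessed line addresses together with the way they occupy: a match yields the exact way, while a table miss falls back to a normal all-way access. The table size and replacement policy below are assumptions.

```python
# A minimal sketch of way determination from cache-line address locality.

LINE_BITS = 6          # 64-byte cache lines (assumed)
TABLE_ENTRIES = 16     # recently seen line addresses (assumed)

class WayDeterminationTable:
    def __init__(self):
        self.entries = {}              # line address -> way it occupies
        self.order = []                # FIFO replacement for the small table

    def lookup(self, addr):
        line = addr >> LINE_BITS
        # Returns the exact way if known, or None to trigger a conventional
        # all-way access; there is no misprediction to recover from.
        return self.entries.get(line)

    def record(self, addr, way):
        line = addr >> LINE_BITS
        if line not in self.entries:
            if len(self.order) == TABLE_ENTRIES:
                self.entries.pop(self.order.pop(0))
            self.order.append(line)
        self.entries[line] = way
```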
Reducing L1 caches power by exploiting software semantics
Proceedings of the 2012 ACM/IEEE international symposium on Low power electronics and design - ISLPED '12, 2012
To access a set-associative L1 cache in a high-performance processor, all ways of the selected set are searched and fetched in parallel using physical address bits. Such a cache is oblivious of memory references' software semantics, such as the stack-heap bifurcation of the memory space and user-kernel ring levels. This constitutes a waste of energy since, for example, a user-mode instruction fetch will never hit a cache block that contains kernel code. Similarly, a stack access will not hit a cache line that contains heap data. We propose to exploit software semantics in cache design to avoid unnecessary associative searches, thus reducing dynamic power consumption. Specifically, we utilize virtual memory region properties to optimize the data cache and ring-level information to optimize the instruction cache. Our design does not impact performance and incurs very small hardware cost. Simulation results using SPEC CPU and SPECjapps indicate that the proposed designs help to reduce cache block fetches from DL1 and IL1 by 27% and 57% respectively, resulting in average savings of 15% of DL1 power and more than 30% of IL1 power compared to an aggressively clock-gated baseline.
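A sketch of semantics-aware way filtering, under the simplifying assumption that stack and heap lines are placed in disjoint groups of ways so that a stack access never needs to search the heap ways; the paper's actual partitioning policy may differ.

```python
# Hypothetical way partitioning by memory region: the way-select logic enables
# only the subarrays that could possibly hold the referenced line.

N_WAYS = 8
STACK_WAYS = {6, 7}                          # ways reserved for stack lines (assumed)
HEAP_WAYS = set(range(N_WAYS)) - STACK_WAYS

def ways_to_search(vaddr, stack_base, stack_limit):
    """Return the set of ways whose tag/data subarrays must be enabled."""
    is_stack = stack_limit <= vaddr < stack_base   # stack grows down (assumed)
    return STACK_WAYS if is_stack else HEAP_WAYS

# Example: a stack access enables only 2 of 8 ways; a heap access enables 6.
print(len(ways_to_search(0x7ffff000, stack_base=0x80000000, stack_limit=0x7fff0000)))
print(len(ways_to_search(0x00601000, stack_base=0x80000000, stack_limit=0x7fff0000)))
```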
Reducing set-associative L1 data cache energy by early load data dependence detection (ELD3)
Design, Automation & Test in Europe Conference & Exhibition (DATE), 2014, 2014
Fast set-associative level-one data caches (L1 DCs) access all ways in parallel during load operations for reduced access latency. This is required in order to resolve data dependencies as early as possible in the pipeline, which otherwise would suffer from stall cycles. A significant amount of energy is wasted due to this fast access, since the data can only reside in one of the ways. While it is possible to reduce L1 DC energy usage by accessing the tag and data memories sequentially, hence activating only one data way on a tag match, this approach significantly increases execution time due to an increased number of stall cycles. We propose an early load data dependency detection (ELD3) technique for in-order pipelines. This technique makes it possible to detect whether a load instruction has a data dependency with a subsequent instruction. If there is no such dependency, then the tag and data accesses for the load are performed sequentially so that only the data way in which the data resides is accessed. If there is a dependency, then the tag and data arrays are accessed in parallel to avoid introducing additional stall cycles. For the MiBench benchmark suite, the ELD3 technique enables about 49% of all load operations to access the L1 DC sequentially. Based on 65-nm data using commercial SRAM blocks, the proposed technique reduces L1 DC energy by 13%.
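The decision the ELD3 hardware makes can be sketched as follows; the size of the dependence window is an assumption for illustration.

```python
# A minimal sketch of the ELD3 decision for an in-order pipeline: if none of the
# next few instructions reads the load's destination register, the load can afford
# the extra latency of a sequential tag-then-data L1 access.

DEP_WINDOW = 2   # instructions after the load that would stall on its result (assumed)

def load_can_go_sequential(load_dst_reg, following_instrs):
    """following_instrs: list of (src_regs, dst_reg) tuples after the load."""
    for src_regs, dst_reg in following_instrs[:DEP_WINDOW]:
        if load_dst_reg in src_regs:
            return False   # a nearby consumer: access tag and data in parallel
        if dst_reg == load_dst_reg:
            return True    # register overwritten before any use
    return True            # no nearby dependence: sequential access, one data way only

# Example 1: load writes r3; neither of the next two instructions reads r3 -> sequential.
print(load_can_go_sequential(3, [({1, 2}, 4), ({4, 5}, 6)]))   # True
# Example 2: the very next instruction reads r3 -> parallel access to avoid a stall.
print(load_can_go_sequential(3, [({3, 1}, 4)]))                # False
```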
Design of low power L2 cache architecture using partial way tag information
2014 International Conference on Green Computing Communication and Electrical Engineering (ICGCCEE), 2014
Today's high-performance microprocessors often adopt a write-through policy for on-chip caches to improve performance and to achieve good tolerance to soft errors. However, the write-through policy incurs large power consumption because the lower-level (L2) cache is accessed on every write. In a previous method, a way-tagged cache was used under the write-through policy, maintaining the way tags of the L2 cache in the L1 cache during read operations, but it consumed more energy. The proposed technique enables the L2 cache to work in a direct-mapped manner during write hits and reduces tag comparisons through cache-miss prediction; if a cache miss is predicted, there is no need to access the L2 cache. As a result, a significant portion of energy is saved without performance degradation. Simulation results are obtained for both the L1 and L2 cache configurations. The proposed technique achieves a 70.7% energy saving in the L2 cache on average, with only 0.02% area overhead and no performance degradation, compared with existing methods.
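A sketch of the way-tag idea under the write-through policy described above; the structure and field names are illustrative, not the paper's RTL.

```python
# Hypothetical way-tag bookkeeping: when a line is brought from L2 into L1, its L2
# way number is remembered alongside the L1 line, so a later write-through to that
# line can enable only that single L2 way instead of searching all of them.

L2_WAYS = 8

class L1Line:
    def __init__(self, tag, data, l2_way):
        self.tag = tag
        self.data = data
        self.l2_way = l2_way        # way tag recorded at L1 fill time

def l2_write_hit_ways(l1_line):
    """Ways of L2 that must be activated for a write-through of an L1-resident line."""
    if l1_line.l2_way is not None:
        return [l1_line.l2_way]     # direct-mapped style access: one way only
    return list(range(L2_WAYS))     # fallback: conventional all-way write

line = L1Line(tag=0x1a2b, data=b"\x00" * 64, l2_way=3)
print(l2_write_hit_ways(line))      # -> [3]
```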