Effective algorithms for cache-level compression

Cache compression for microprocessor performance

Computer systems and microarchitecture researchers have proposed using hardware data compression units within the memory hierarchies of microprocessors in order to improve performance, energy efficiency, and functionality. However, most past work, and all work on cache compression, has made unsubstantiated assumptions about the performance, power consumption, and area overheads of the proposed compression algorithms and hardware. In this work, we present a lossless compression algorithm that has been designed for fast on-line data compression, and cache compression in particular. The algorithm has a number of novel features tailored for this application, including combining pairs of compressed lines into one cache line and allowing parallel compression of multiple words while using a single dictionary and without degradation in compression ratio. We reduced the proposed algorithm to a register transfer level hardware design, permitting performance, power consumption, and area estimation.
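The idea of combining pairs of compressed lines into one cache line can be illustrated with a minimal sketch. This is not the algorithm from the abstract: the 64-byte line size, the first-fit placement policy, and the names `fits_in_one_line` and `count_physical_lines` are all assumptions made for illustration only.

```python
LINE_BYTES = 64  # assumed physical cache line size (not stated in the abstract)


def fits_in_one_line(size_a, size_b, line_bytes=LINE_BYTES):
    """Two compressed lines can share a physical line if their sizes fit together."""
    return size_a + size_b <= line_bytes


def count_physical_lines(compressed_sizes, line_bytes=LINE_BYTES):
    """Count physical lines needed when at most two compressed lines share one.

    Uses a simple first-fit policy (an illustrative assumption): each new
    compressed line is paired with the first half-filled physical line that
    has enough free bytes; otherwise it opens a new physical line.
    """
    open_slots = []  # free bytes in physical lines that hold exactly one compressed line
    lines_used = 0
    for size in compressed_sizes:
        if size > line_bytes:
            raise ValueError("compressed size exceeds the physical line size")
        for i, free in enumerate(open_slots):
            if size <= free:
                open_slots.pop(i)  # this physical line now holds its pair
                break
        else:
            open_slots.append(line_bytes - size)
            lines_used += 1
    return lines_used
```

Under these assumptions, four lines that each compress to 32 bytes occupy two physical lines instead of four, while a line that does not compress at all still occupies a full physical line on its own.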

A Novel Approach for a High Performance Lossless Cache Compression Algorithm

2015

Speed is one of the major issues for any electronic component. The speed of a microprocessor-based system depends mainly on the speed of the microprocessor and on memory access time. Off-chip memory takes longer to access than on-chip memory. For these reasons, microprocessor system designers regard cache compression as a technique for increasing the speed of a microprocessor-based system, since it increases the effective cache capacity and off-chip bandwidth. Previous work on cache compression has made unsubstantiated assumptions about the performance, power consumption, and area overheads of the proposed compression algorithm and hardware. In this work we propose a lossless compression algorithm that has been designed for high performance, fast on-line data compression, and particularly for cache compression. This algorithm has a number of novel features tailored for this application, including combining pairs of compressed lines into one cache line and allowing parallel compression of multiple words whi...

Analysis of Compression Algorithms for Program Data

Insufficient available memory in an application-specific embedded system is a critical problem affecting the reliability and performance of the device. A novel solution for dealing with this issue is to compress blocks of memory that are infrequently accessed when the system runs out of memory, and to decompress the memory when it is needed again, thus freeing memory that can be reallocated. In order to determine an appropriate compression technique for this purpose, a variety of compression algorithms were studied, several of which were then implemented and evaluated on actual program data for both speed and compression ability.

A code compression scheme for improving SoC performance

Proceedings. 2003 International Symposium on System-on-Chip (IEEE Cat. No.03EX748), 2003

Code compression is an effective technique for reducing the instruction memory requirement in an embedded system. This paper presents a code compression approach in which the boundary between compressed and uncompressed space lies between the instruction cache (ICache) and the microprocessor core. The approach achieves better compression ratios (around 0.57) than other reported implementations, and, as the ICache holds compressed instructions, its effective size is increased and the hit ratio is improved. The implementation of branch prediction as part of the decompression hardware further improves the system's performance. The work has required the resolution of issues that arise both from memory and ICache data misalignment and from the compressed-to-uncompressed address mapping.

Compression-based program characterization for improving cache memory performance

IEEE Transactions on Computers, 1997

It is well known that compression and prediction are interrelated in that high compression implies good predictability, and vice versa. We use this correlation to find predictable properties of program behavior and apply them to appropriate cache management tasks. In particular, we look at two properties of program references: 1) Inter Reference Gaps: defined as the time interval between successive references to the same address by the processor, and 2) Cache Misses: references which access the next level of the memory hierarchy. Using compression, we show that these two properties are highly predictable and exploit them to improve Cache Replacement and Cache Prefetching, respectively. Using trace-driven simulations on SPEC and Dinero benchmarks, we demonstrate the performance of our predictive schemes, and compare them with other methods for doing the same. We show that, using our predictive replacement scheme, miss ratio in cache memories can be improved up to 43 percent over the well-known Least Recently Used (LRU) algorithm, which covers the gap between the LRU and the off-line optimal (MIN) miss ratios, by more than 84 percent. For cache prefetching, we show that our scheme eliminates up to 62 percent of the total misses in D-caches. An equivalent sequential prefetch scheme only removes up to 42 percent of the misses. For I-caches, our scheme performs almost the same as the sequential scheme and removes up to 78 percent of the misses.
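The inter-reference gap defined above can be computed directly from an address trace. The following is a minimal sketch, assuming time is measured as the number of references issued (the abstract only says "time interval"); the function name `inter_reference_gaps` and the trace format are hypothetical.

```python
def inter_reference_gaps(trace):
    """Return (address, gap) pairs for every repeated reference in the trace.

    The gap is the number of references between two successive accesses to
    the same address, using the position in the trace as the clock
    (an assumption made for this sketch).
    """
    last_seen = {}  # address -> index of its most recent reference
    gaps = []
    for t, addr in enumerate(trace):
        if addr in last_seen:
            gaps.append((addr, t - last_seen[addr]))
        last_seen[addr] = t
    return gaps


# Example trace of symbolic addresses.
print(inter_reference_gaps(["A", "B", "A", "C", "B", "A"]))
# prints [('A', 2), ('B', 3), ('A', 3)]
```

A gap sequence like this is the kind of input a compression-based predictor could model; how the paper actually encodes or predicts the gaps is not described in the abstract.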

The feasibility of using compression to increase memory system performance

We investigate the feasibility of using instruction compression at some level in a multi-level memory hierarchy to increase memory system performance. Compression effectively increases the memory size and the line size, reducing the miss rate at the expense of increased access latency due to decompression delays.
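The trade-off between a lower miss rate and added decompression delay can be made concrete with the standard average memory access time formula. This is a sketch under stated assumptions, not the paper's model: the decompression delay is charged on every access, and all numeric values below are invented for illustration.

```python
def average_access_time(hit_time, miss_rate, miss_penalty, decompress_delay=0.0):
    """Average access time in cycles: hit time, plus any decompression delay,
    plus the miss rate times the miss penalty.

    Charging decompress_delay on every access is a simplifying assumption.
    """
    return hit_time + decompress_delay + miss_rate * miss_penalty


# Hypothetical numbers: compression halves the miss rate but adds 2 cycles per access.
baseline = average_access_time(hit_time=1, miss_rate=0.25, miss_penalty=40)
compressed = average_access_time(hit_time=1, miss_rate=0.125, miss_penalty=40,
                                 decompress_delay=2)
print(baseline, compressed)  # prints 11.0 8.0
```

With these invented values compression wins, but a larger decompression delay or a smaller miss-rate reduction would reverse the result, which is the feasibility question the abstract raises.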

Compresso: Pragmatic Main Memory Compression

2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2018

Today, larger memory capacity and higher memory bandwidth are required for better performance and energy efficiency for many important client and datacenter applications. Hardware memory compression provides a promising direction to achieve this without increasing system cost. Unfortunately, current memory compression solutions face two significant challenges. First, keeping memory compressed requires additional memory accesses, sometimes on the critical path, which can cause performance overheads. Second, they require changing the operating system to take advantage of the increased capacity, and to handle incompressible data, which delays deployment. We propose Compresso, a hardware memory compression architecture that minimizes memory overheads due to compression, with no changes to the OS. We identify new data-movement trade-offs and propose optimizations that reduce additional memory movement to improve system efficiency. We propose a holistic evaluation for compressed systems. Our results show that Compresso achieves a 1.85x compression for main memory on average, with a 24% speedup over a competitive hardware compressed system for single-core systems and 27% for multi-core systems. As compared to competitive compressed systems, Compresso not only reduces performance overhead of compression, but also increases performance gain from higher memory capacity.

I. INTRODUCTION

Memory compression can improve performance and reduce cost for systems with high memory demands, such as those used for machine learning, graph analytics, databases, gaming, and autonomous driving. We present Compresso, the first compressed main-memory architecture that: (1) explicitly optimizes for new trade-offs between compression mechanisms and the additional data movement required for their implementation, and (2) can be used without any modifications to either applications or the operating system.
Compressing data in main memory increases its effective capacity, resulting in fewer accesses to secondary storage, thereby boosting performance. Fewer I/O accesses also improve tail latency [1] and decrease the need to partition tasks across nodes just to reduce I/O accesses [2, 3]. Additionally, transferring compressed cache lines from memory requires fewer bytes, thereby reducing memory bandwidth usage. The saved bytes may be used to prefetch other data [4, 5], or may

Transparent Dual Memory Compression Architecture

2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT), 2017

The increasing memory requirements of big data applications have been driving the precipitous growth of memory capacity in server systems. To maximize the efficiency of external memory, HW-based memory compression techniques have been proposed to increase effective memory capacity. Although such memory compression techniques can improve the memory efficiency significantly, a critical trade-off exists in the HW-based compression techniques. As the memory blocks need to be decompressed as quickly as possible to serve cache misses, latency-optimized techniques apply compression at the cacheline granularity, achieving a decompression latency of less than a few cycles. However, such latency-optimized techniques can lose the potential high compression ratios of capacity-optimized techniques, which compress larger memory blocks with longer latency algorithms. Considering the fundamental trade-off in the memory compression, this paper proposes a transparent dual memory compression (DMC) architecture, which selectively uses two compression algorithms with distinct latency and compression characteristics. Exploiting the locality of memory accesses, the proposed architecture compresses less frequently accessed blocks with a capacity-optimized compression algorithm, while keeping recently accessed blocks compressed with a latency-optimized one. Furthermore, instead of relying on the support from the virtual memory system to locate compressed memory blocks, the study advocates a HW-based translation between the uncompressed address space and compressed physical space. This OS-transparent approach eliminates conflicts between compression efficiency and large page support adopted to reduce TLB misses. The proposed compression architecture is applied to the Hybrid Memory Cube (HMC) with a logic layer under the stacked DRAMs.
The experimental results show that the proposed compression architecture provides 54% higher compression ratio than the state-of-the-art latency-optimized technique, with no performance degradation over the baseline system without compression.

Hardware compressed main memory: operating system support and performance evaluation

IEEE Transactions on Computers, 2001

A new memory subsystem, called Memory Xpansion Technology (MXT), has been built for compressing main memory contents. MXT effectively doubles the physically available memory transparently to the CPUs, input/output devices, device drivers, and application software. An average compression ratio of two or greater has been observed for many applications. Since compressibility of memory contents varies dynamically, the size of the memory managed by the operating system is not fixed. In this paper, we describe operating system techniques that can deal with such dynamically changing memory sizes. We also demonstrate the performance impact of memory compression using the SPEC CPU2000 and SPECweb99 benchmarks. Results show that the hardware compression of memory has a negligible performance penalty compared to a standard memory for many applications. For memory starved applications and benchmarks such as SPECweb99, memory compression improves the performance significantly. Results also show that the memory contents of many applications can be compressed, usually by a factor of two to one.