A Survey Of Architectural Approaches for Data Compression in Cache and Main Memory Systems

Cache Compression for Microprocessor Performance

Computer systems and microarchitecture researchers have proposed using hardware data compression units within the memory hierarchies of microprocessors in order to improve performance, energy efficiency, and functionality. However, most past work, and all work on cache compression, has made unsubstantiated assumptions about the performance, power consumption, and area overheads of the proposed compression algorithms and hardware. In this work, we present a lossless compression algorithm that has been designed for fast on-line data compression, and cache compression in particular. The algorithm has a number of novel features tailored for this application, including combining pairs of compressed lines into one cache line and allowing parallel compression of multiple words while using a single dictionary and without degradation in compression ratio. We reduced the proposed algorithm to a register transfer level hardware design, permitting performance, power consumption, and area estimation.
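The abstract above describes a single-dictionary, pattern-based scheme. The fragment below is a minimal sequential sketch of that style of word-level dictionary compression (zero words, full dictionary matches, and partial matches on the upper half of a word); all encodings, field widths, and names are chosen for illustration and are not taken from the paper's RTL design, which compresses several words in parallel.

```c
#include <stdint.h>
#include <stdio.h>

#define DICT_ENTRIES 16  /* small dictionary shared by all words of one line */

typedef enum { PAT_ZERO, PAT_FULL_MATCH, PAT_PARTIAL_MATCH, PAT_UNCOMPRESSED } pattern_t;

typedef struct {
    pattern_t pattern;
    uint8_t   dict_index;  /* valid for full/partial matches */
    uint32_t  literal;     /* low half for a partial match, whole word if uncompressed */
} code_word_t;

/* Classify one 32-bit word against the dictionary; on a miss, learn it. */
static code_word_t compress_word(uint32_t w, uint32_t dict[], int *fill)
{
    code_word_t c = { PAT_UNCOMPRESSED, 0, w };
    if (w == 0) { c.pattern = PAT_ZERO; return c; }
    for (int i = 0; i < *fill; i++) {
        if (dict[i] == w) {
            c.pattern = PAT_FULL_MATCH; c.dict_index = (uint8_t)i; return c;
        }
        if ((dict[i] >> 16) == (w >> 16)) {
            c.pattern = PAT_PARTIAL_MATCH; c.dict_index = (uint8_t)i;
            c.literal = w & 0xFFFF; return c;
        }
    }
    if (*fill < DICT_ENTRIES) dict[(*fill)++] = w;  /* insert new dictionary entry */
    return c;
}

int main(void)
{
    uint32_t line[8] = { 0, 0x12340000, 0x12340001, 0x12340002,
                         0xDEADBEEF, 0, 0xDEADBEEF, 7 };
    uint32_t dict[DICT_ENTRIES];
    int fill = 0;
    for (int i = 0; i < 8; i++) {
        code_word_t c = compress_word(line[i], dict, &fill);
        printf("word %d -> pattern %d\n", i, (int)c.pattern);
    }
    return 0;
}
```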

Effective algorithms for cache-level compression

Proceedings of the 11th Great Lakes symposium on VLSI, 2001

Compression at the cache level has the potential to increase microprocessor performance by decreasing the cache miss rate and increasing the effective bandwidth by transmitting compressed data. This paper presents four compression algorithms that would be suitable for use in a compressed cache architecture and shows the results of using them to compress SPEC95 benchmarks. These algorithms exhibit a 7.8% to 99.8% improvement in compression ratio over an algorithm known to be effective for cache compression.
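The four algorithms themselves are not reproduced in the abstract. As a stand-in, the sketch below shows how a cache-line compression ratio can be computed for a simple significance-based word encoding (zero word, sign-extended byte, sign-extended half-word, uncompressed); the tag widths and the encoding are assumptions, not the paper's algorithms.

```c
#include <stdint.h>
#include <stdio.h>

/* Returns the compressed size in bits of one 32-bit word under an assumed
 * significance-based encoding: a 2-bit tag plus the minimum sign-extended width. */
static int compressed_bits(uint32_t w)
{
    int32_t s = (int32_t)w;
    if (s == 0)                    return 2;        /* tag only: zero word      */
    if (s >= -128 && s <= 127)     return 2 + 8;    /* tag + sign-extended byte */
    if (s >= -32768 && s <= 32767) return 2 + 16;   /* tag + sign-extended half */
    return 2 + 32;                                  /* tag + uncompressed word  */
}

int main(void)
{
    /* One 64-byte line holding sixteen 32-bit words (values are arbitrary). */
    uint32_t line[16] = { 0, 1, (uint32_t)-5, 300, 0x7FFF, 0x12345678, 0, 2,
                          40, 0xFFFFFFFF, 0, 0, 9, 100, 0x8000, 3 };
    int bits = 0;
    for (int i = 0; i < 16; i++) bits += compressed_bits(line[i]);
    double ratio = (16 * 32.0) / bits;  /* uncompressed bits / compressed bits */
    printf("compression ratio = %.2f\n", ratio);
    return 0;
}
```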

Compresso: Pragmatic Main Memory Compression

2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2018

Today, larger memory capacity and higher memory bandwidth are required for better performance and energy efficiency for many important client and datacenter applications. Hardware memory compression provides a promising direction to achieve this without increasing system cost. Unfortunately, current memory compression solutions face two significant challenges. First, keeping memory compressed requires additional memory accesses, sometimes on the critical path, which can cause performance overheads. Second, they require changing the operating system to take advantage of the increased capacity, and to handle incompressible data, which delays deployment. We propose Compresso, a hardware memory compression architecture that minimizes memory overheads due to compression, with no changes to the OS. We identify new data-movement trade-offs and propose optimizations that reduce additional memory movement to improve system efficiency. We propose a holistic evaluation for compressed systems. Our results show that Compresso achieves a 1.85x compression for main memory on average, with a 24% speedup over a competitive hardware compressed system for single-core systems and 27% for multi-core systems. As compared to competitive compressed systems, Compresso not only reduces performance overhead of compression, but also increases performance gain from higher memory capacity.

I. INTRODUCTION

Memory compression can improve performance and reduce cost for systems with high memory demands, such as those used for machine learning, graph analytics, databases, gaming, and autonomous driving. We present Compresso, the first compressed main-memory architecture that: (1) explicitly optimizes for new trade-offs between compression mechanisms and the additional data movement required for their implementation, and (2) can be used without any modifications to either applications or the operating system. Compressing data in main memory increases its effective capacity, resulting in fewer accesses to secondary storage, thereby boosting performance. Fewer I/O accesses also improve tail latency [1] and decrease the need to partition tasks across nodes just to reduce I/O accesses [2, 3]. Additionally, transferring compressed cache lines from memory requires fewer bytes, thereby reducing memory bandwidth usage. The saved bytes may be used to prefetch other data [4, 5], or may
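As a rough illustration of the data-movement trade-off Compresso targets, the sketch below models compressed blocks stored in a few fixed size classes: a write that makes a block less compressible can overflow its slot and force a move to a larger class, which costs extra memory traffic. The size classes and the fit policy are assumptions for illustration, not Compresso's actual layout.

```c
#include <stdio.h>

/* Assumed size classes (bytes) for a 64-byte uncompressed line. */
static const int size_classes[] = { 8, 16, 32, 64 };

static int class_for(int compressed_bytes)
{
    for (int i = 0; i < 4; i++)
        if (compressed_bytes <= size_classes[i]) return i;
    return 3;
}

/* Returns 1 if the rewritten block no longer fits its slot and must move. */
static int needs_move(int old_bytes, int new_bytes)
{
    return class_for(new_bytes) > class_for(old_bytes);
}

int main(void)
{
    printf("36B -> 20B : move? %d\n", needs_move(36, 20));  /* shrinks, stays put */
    printf("20B -> 36B : move? %d\n", needs_move(20, 36));  /* grows, must move   */
    return 0;
}
```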

The feasibility of using compression to increase memory system performance

We investigate the feasibility of using instruction compression at some level in a multi-level memory hierarchy to increase memory system performance. Compression effectively increases the memory size and the line size, reducing the miss rate at the expense of increased access latency due to decompression delays.
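The trade-off stated above can be made concrete with a standard average-memory-access-time (AMAT) calculation: compression lowers the miss rate through the larger effective capacity but adds decompression delay to accesses. All latencies and miss rates below are invented for illustration.

```c
#include <stdio.h>

/* AMAT = hit time + miss rate * miss penalty (cycles). */
static double amat(double hit_time, double miss_rate, double miss_penalty)
{
    return hit_time + miss_rate * miss_penalty;
}

int main(void)
{
    /* Baseline: 2-cycle hit, 10% miss rate, 100-cycle miss penalty. */
    double base = amat(2.0, 0.10, 100.0);
    /* Compressed: 3 extra cycles of decompression per access, but the larger
     * effective capacity lowers the miss rate to 6% (assumed numbers). */
    double compressed = amat(2.0 + 3.0, 0.06, 100.0);
    printf("AMAT baseline   = %.1f cycles\n", base);
    printf("AMAT compressed = %.1f cycles\n", compressed);
    return 0;
}
```

With these assumed numbers the compressed configuration wins (11.0 versus 12.0 cycles); with a slower decompressor or a smaller miss-rate reduction it would lose, which is exactly the feasibility question the paper studies.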

Compacted CPU/GPU Data Compression via Modified Virtual Address Translation

Proc. ACM Comput. Graph. Interact. Tech., 2020

We propose a method to reduce the footprint of compressed data by using modified virtual address translation to permit random access to the data. This extends our prior work on using page translation to perform automatic decompression and deswizzling upon accesses to fixed rate lossy or lossless compressed data. Our compaction method allows a virtual address space the size of the uncompressed data to be used to efficiently access variable-size blocks of compressed data. Compression and decompression take place between the first and second level caches, which allows fast access to uncompressed data in the first level cache and provides data compaction at all other levels of the memory hierarchy. This improves performance and reduces power relative to compressed but uncompacted data. An important property of our method is that compression, decompression, and reallocation are automatically managed by the new hardware without operating system intervention and without storing compression...
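A minimal sketch of the translation step described above, assuming a per-page table of compressed block offsets: an address in the uncompressed-size virtual space is mapped to the start of a variable-size compressed block. The granularities and table layout are illustrative, not the paper's hardware design.

```c
#include <stdint.h>
#include <stdio.h>

#define BLOCK_SIZE       64   /* uncompressed block size in bytes */
#define BLOCKS_PER_PAGE  64   /* 4 KiB page of uncompressed data  */

typedef struct {
    uint32_t phys_page_base;                 /* base of the compacted page           */
    uint16_t block_offset[BLOCKS_PER_PAGE];  /* byte offset of each compressed block */
} page_entry_t;

/* Translate an address in the uncompressed-size space to the start of its
 * variable-size compressed block in compacted storage. */
static uint32_t translate(const page_entry_t *pe, uint32_t vaddr)
{
    uint32_t block = (vaddr / BLOCK_SIZE) % BLOCKS_PER_PAGE;
    return pe->phys_page_base + pe->block_offset[block];
}

int main(void)
{
    page_entry_t pe = { .phys_page_base = 0x100000 };
    /* Assume the first three blocks compressed to 20, 32, and 64 bytes. */
    pe.block_offset[0] = 0;
    pe.block_offset[1] = 20;
    pe.block_offset[2] = 52;
    printf("vaddr 0x80 -> compacted addr 0x%x\n", translate(&pe, 0x80));
    return 0;
}
```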

Transparent Dual Memory Compression Architecture

2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT), 2017

The increasing memory requirements of big data applications have been driving the precipitous growth of memory capacity in server systems. To maximize the efficiency of external memory, HW-based memory compression techniques have been proposed to increase effective memory capacity. Although such memory compression techniques can improve the memory efficiency significantly, a critical trade-off exists in the HW-based compression techniques. As the memory blocks need to be decompressed as quickly as possible to serve cache misses, latency-optimized techniques apply compression at the cacheline granularity, achieving the decompression latency of less than a few cycles. However, such latency-optimized techniques can lose the potential high compression ratios of capacity-optimized techniques, which compress larger memory blocks with longer latency algorithms. Considering the fundamental trade-off in the memory compression, this paper proposes a transparent dual memory compression (DMC) architecture, which selectively uses two compression algorithms with distinct latency and compression characteristics. Exploiting the locality of memory accesses, the proposed architecture compresses less frequently accessed blocks with a capacity-optimized compression algorithm, while keeping recently accessed blocks compressed with a latency-optimized one. Furthermore, instead of relying on the support from the virtual memory system to locate compressed memory blocks, the study advocates a HW-based translation between the uncompressed address space and compressed physical space. This OS-transparent approach eliminates conflicts between compression efficiency and large page support adopted to reduce TLB misses. The proposed compression architecture is applied to the Hybrid Memory Cube (HMC) with a logic layer under the stacked DRAMs. The experimental results show that the proposed compression architecture provides 54% higher compression ratio than the state-of-the-art latency-optimized technique, with no performance degradation over the baseline system without compression.
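The selection between the two algorithms can be illustrated with a simple hotness heuristic: recently accessed blocks stay latency-optimized, cold blocks are recompressed with the capacity-optimized algorithm. The access-counter policy below is an assumption for illustration; the abstract only states that DMC exploits access locality, not this exact mechanism.

```c
#include <stdio.h>

typedef enum { FMT_LATENCY_OPT, FMT_CAPACITY_OPT } fmt_t;

typedef struct {
    unsigned accesses;  /* accesses seen in the current epoch */
    fmt_t    fmt;       /* current compression format of the block */
} block_t;

#define HOT_THRESHOLD 4

/* Called periodically (e.g., at the end of an epoch) for each block:
 * hot blocks keep the latency-optimized format, cold blocks are packed
 * with the capacity-optimized algorithm.  The counter decays so that
 * hotness can age out over time. */
static void choose_format(block_t *b)
{
    b->fmt = (b->accesses >= HOT_THRESHOLD) ? FMT_LATENCY_OPT : FMT_CAPACITY_OPT;
    b->accesses /= 2;
}

int main(void)
{
    block_t hot  = { 9, FMT_CAPACITY_OPT };
    block_t cold = { 1, FMT_LATENCY_OPT };
    choose_format(&hot);
    choose_format(&cold);
    printf("hot  -> %s\n", hot.fmt  == FMT_LATENCY_OPT ? "latency-optimized" : "capacity-optimized");
    printf("cold -> %s\n", cold.fmt == FMT_LATENCY_OPT ? "latency-optimized" : "capacity-optimized");
    return 0;
}
```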

A Novel Approach for a High Performance Lossless Cache Compression Algorithm

2015

Speed is a major concern for any electronic component. The performance of a microprocessor-based system depends mainly on the speed of the microprocessor and on memory access time, and off-chip memory takes longer to access than on-chip memory. For these reasons, system designers view cache compression as a technique to speed up a microprocessor-based system, as it increases effective cache capacity and off-chip bandwidth. Previous work on cache compression has made unsubstantiated assumptions about the performance, power consumption, and area overheads of the proposed compression algorithms and hardware. In this work we propose a lossless compression algorithm designed for high-performance, fast on-line data compression, and cache compression in particular. This algorithm has a number of novel features tailored for this application, including combining pairs of compressed lines into one cache line and allowing parallel compression of multiple words whi...
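One of the features named above, packing pairs of compressed lines into a single physical cache line, reduces to a fit check at fill or replacement time. The sketch below assumes a 64-byte physical line and a simple sum test; the paper's actual companion-selection rule may differ.

```c
#include <stdio.h>

#define LINE_BYTES 64  /* physical cache line size, assumed */

/* Two compressed lines can share one physical line if their sizes fit together. */
static int can_pair(int compressed_a_bytes, int compressed_b_bytes)
{
    return compressed_a_bytes + compressed_b_bytes <= LINE_BYTES;
}

int main(void)
{
    printf("28B + 30B pair? %d\n", can_pair(28, 30));  /* fits: yes      */
    printf("40B + 30B pair? %d\n", can_pair(40, 30));  /* overflows: no  */
    return 0;
}
```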

Enabling technologies for memory compression: Metadata, mapping, and prediction

2016

Future systems dealing with big-data workloads will be severely constrained by the high performance and energy penalty imposed by data movement. This penalty can be reduced by storing datasets in DRAM or NVM main memory in compressed formats. Prior compressed memory systems have required significant changes to the operating system, thus limiting commercial viability. The first contribution of this paper is to integrate compression metadata with ECC metadata so that the compressed memory system can be implemented entirely in hardware with no OS involvement. We show that in such a system, read operations are unable to exploit the benefits of compression because the compressibility of the block is not known beforehand. To address this problem, we introduce a compressibility predictor that yields an accuracy of 97%. We also introduce a new data mapping policy that is able to maximize read/write parallelism and NVM endurance, when dealing with compressed blocks. Combined, our proposals a...
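The predictor itself is not described in the excerpt above. The sketch below shows one conventional way such a compressibility predictor could be organized: a table of 2-bit saturating counters indexed by a hash of the block address, consulted before a read to decide how many bytes to fetch. The table size, indexing, and update rule are all assumptions, not the paper's design.

```c
#include <stdint.h>
#include <stdio.h>

#define TABLE_SIZE 1024

/* 2-bit saturating counters; values >= 2 mean "predict compressed". */
static uint8_t counters[TABLE_SIZE];

static unsigned table_index(uint64_t block_addr)
{
    return (unsigned)((block_addr >> 6) % TABLE_SIZE);  /* drop 64B offset bits */
}

static int predict_compressed(uint64_t block_addr)
{
    return counters[table_index(block_addr)] >= 2;
}

/* Update the counter once the block's actual compressibility is known. */
static void train(uint64_t block_addr, int was_compressed)
{
    uint8_t *c = &counters[table_index(block_addr)];
    if (was_compressed) { if (*c < 3) (*c)++; }
    else                { if (*c > 0) (*c)--; }
}

int main(void)
{
    uint64_t addr = 0x1000;
    train(addr, 1);
    train(addr, 1);
    printf("predict compressed for 0x%llx: %d\n",
           (unsigned long long)addr, predict_compressed(addr));
    return 0;
}
```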

Custom parallel caching schemes for hardware-accelerated image compression

Journal of Real-Time Image Processing, 2008

In an effort to achieve lower bandwidth requirements, video compression algorithms have become increasingly complex. Consequently, the deployment of these algorithms on Field Programmable Gate Arrays (FPGAs) is becoming increasingly desirable, because of the computational parallelism on these platforms as well as the measure of flexibility afforded to designers. Typically, video data is stored in large and slow external memory arrays, but the impact of the memory access bottleneck may be reduced by buffering frequently used data in fast on-chip memories. The order of the memory accesses resulting from many compression algorithms is dependent on the input data [18]. These data-dependent memory accesses complicate the exploitation of data re-use, and subsequently reduce the extent to which an application may be accelerated. In this paper, we present a hybrid memory subsystem which is able to capture data re-use effectively in spite of data-dependent memory accesses. This memory subsystem is made up of a custom parallel cache and a scratchpad memory. Further, the framework is capable of exploiting two-dimensional spatial locality, which is frequently exhibited in the access patterns of image processing applications. In a case study involving the Quadtree Structured Difference Pulse Code Modulation (QSDPCM) application, the impact of data dependence on memory accesses is shown to be significant. In comparison with an implementation which only employs an SPM, performance improvements of up to 1.7× and 1.4× are observed through actual implementation on two modern FPGA platforms. These performance improvements are more pronounced for image sequences exhibiting greater inter-frame movements. In addition, reductions of on-chip memory resources by up to 3.2× are achievable using this framework. These results indicate that, on custom hardware platforms,
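Two-dimensional spatial locality of the kind mentioned above is often captured by tiling the image so that neighbouring pixels in both dimensions map to the same cache set. The set function and tile sizes below are illustrative assumptions, not the paper's custom parallel cache design.

```c
#include <stdio.h>

#define TILE_W            8   /* tile width in pixels, assumed  */
#define TILE_H            8   /* tile height in pixels, assumed */
#define NUM_SETS          64
#define IMG_TILES_PER_ROW 32  /* image width / TILE_W, assumed  */

/* Map a pixel coordinate to a cache set via its enclosing 2-D tile, so that
 * neighbours in both x and y fall in the same set. */
static int set_for_pixel(int x, int y)
{
    int tile_x = x / TILE_W;
    int tile_y = y / TILE_H;
    return (tile_y * IMG_TILES_PER_ROW + tile_x) % NUM_SETS;
}

int main(void)
{
    /* Neighbouring pixels within one 8x8 tile map to the same set. */
    printf("(10,  5) -> set %d\n", set_for_pixel(10, 5));
    printf("(11,  6) -> set %d\n", set_for_pixel(11, 6));
    /* A pixel in the next tile row maps to a different set. */
    printf("(10, 13) -> set %d\n", set_for_pixel(10, 13));
    return 0;
}
```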