The feasibility of using compression to increase memory system performance

Improving system performance with compressed memory

Proceedings 15th International Parallel and Distributed Processing Symposium. IPDPS 2001, 2001

This paper summarizes our research on implementing a compressed memory in computer systems. The basic premise is that the throughput of applications whose working set does not fit in main memory degrades significantly due to an increase in the number of page faults. Hence we propose compressing memory pages that need to be paged out and storing them in memory. This hides the large latencies associated with a disk access, since the page merely has to be decompressed when a page fault occurs. Our implementation is in the form of a device driver for Linux. We show results with some applications from the SPEC 2000 CPU benchmark suite and a computing kernel. Speed-ups ranging from 5% to 250% can be obtained, depending on the paging behavior of the application.
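
As a rough illustration of the mechanism (not the paper's actual Linux device driver), the user-space C sketch below compresses a "paged-out" page with zlib and keeps it in RAM, so that a later "fault" needs only a decompression rather than a disk access. The page size, the choice of zlib, and the data structures are assumptions for illustration.

```c
/* Sketch of the core idea: instead of writing an evicted page to disk,
 * compress it and keep it in RAM; a "page fault" then only needs a
 * decompression, not a disk access.  User-space illustration with zlib,
 * not the paper's Linux block-device driver.  Build with -lz. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <zlib.h>

#define PAGE_SIZE 4096

typedef struct {
    unsigned char *data;   /* compressed bytes, malloc'd */
    uLongf         len;    /* compressed length          */
} compressed_page_t;

/* "Page out": compress a page into the in-memory store. */
static int page_out(const unsigned char page[PAGE_SIZE], compressed_page_t *slot)
{
    uLongf bound = compressBound(PAGE_SIZE);
    slot->data = malloc(bound);
    if (!slot->data) return -1;
    slot->len = bound;
    if (compress2(slot->data, &slot->len, page, PAGE_SIZE, Z_BEST_SPEED) != Z_OK)
        return -1;
    return 0;
}

/* "Page fault": decompress the stored page back into a frame. */
static int page_in(const compressed_page_t *slot, unsigned char page[PAGE_SIZE])
{
    uLongf out = PAGE_SIZE;
    return uncompress(page, &out, slot->data, slot->len) == Z_OK ? 0 : -1;
}

int main(void)
{
    unsigned char page[PAGE_SIZE], restored[PAGE_SIZE];
    memset(page, 'A', PAGE_SIZE);          /* highly compressible content */

    compressed_page_t slot;
    if (page_out(page, &slot) == 0 && page_in(&slot, restored) == 0)
        printf("page of %d bytes kept in %lu compressed bytes\n",
               PAGE_SIZE, (unsigned long)slot.len);
    free(slot.data);
    return 0;
}
```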

Cache compression for microprocessor performance

Computer systems and microarchitecture researchers have proposed using hardware data compression units within the memory hierarchies of microprocessors in order to improve performance, energy efficiency, and functionality. However, most past work, and all work on cache compression, has made unsubstantiated assumptions about the performance, power consumption, and area overheads of the proposed compression algorithms and hardware. In this work, I present a lossless compression algorithm that has been designed for fast on-line data compression, and cache compression in particular. The algorithm has a number of novel features tailored for this application, including combining pairs of compressed lines into one cache line and allowing parallel compression of multiple words while using a single dictionary and without degradation in compression ratio. We reduced the proposed algorithm to a register-transfer-level hardware design, permitting performance, power consumption, and area estimation.
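
A minimal sketch of this style of scheme follows, assuming an illustrative two-bit code set, a four-entry FIFO dictionary, and 64-byte lines; none of these choices are claimed to match the paper's actual encoding. It also shows the pair-packing check for fitting two compressed neighbours into one physical line.

```c
/* Hedged sketch of dictionary-based cache-line compression: each 32-bit word
 * is encoded as "zero", "dictionary hit", or "literal", and two compressed
 * lines are paired into one physical line when they fit.  Codes, dictionary
 * size, and replacement policy are illustrative, not the paper's design. */
#include <stdio.h>
#include <stdint.h>

#define LINE_WORDS 16u                 /* 64-byte line of 32-bit words */
#define DICT_SIZE   4
#define LINE_BITS  (LINE_WORDS * 32u)

/* Returns the compressed size of one line in bits. */
static unsigned compress_line(const uint32_t line[LINE_WORDS])
{
    uint32_t dict[DICT_SIZE] = {0};
    unsigned head = 0, bits = 0;

    for (unsigned i = 0; i < LINE_WORDS; i++) {
        uint32_t w = line[i];
        int hit = -1;
        for (int d = 0; d < DICT_SIZE; d++)
            if (dict[d] == w) { hit = d; break; }

        if (w == 0)          bits += 2;            /* code: zero word   */
        else if (hit >= 0)   bits += 2 + 2;        /* code + dict index */
        else {                                     /* code + literal    */
            bits += 2 + 32;
            dict[head] = w;                        /* FIFO replacement  */
            head = (head + 1) % DICT_SIZE;
        }
    }
    return bits;
}

int main(void)
{
    uint32_t a[LINE_WORDS] = {0}, b[LINE_WORDS];
    for (unsigned i = 0; i < LINE_WORDS; i++) b[i] = 0xDEAD0000u;  /* repetitive */

    unsigned ca = compress_line(a), cb = compress_line(b);
    printf("line A: %u bits, line B: %u bits (raw %u each)\n", ca, cb, LINE_BITS);

    /* Pair packing: two compressed neighbours share one physical line. */
    if (ca + cb <= LINE_BITS)
        printf("A and B fit together in a single %u-bit physical line\n", LINE_BITS);
    return 0;
}
```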

Performance of Hardware Compressed Main Memory

2000

A novel memory subsystem called Memory Expansion Technology (MXT) has been built for compressing main memory contents. This effectively allows a memory expansion that presents a "real" memory larger than the physically available memory. This paper provides an overview of the architecture and OS support, and an in-depth analysis of the performance impact of memory compression using the SPEC2000 benchmarks.

Hardware compressed main memory: operating system support and performance evaluation

IEEE Transactions on Computers, 2001

A new memory subsystem, called Memory Expansion Technology (MXT), has been built for compressing main memory contents. MXT effectively doubles the physically available memory transparently to the CPUs, input/output devices, device drivers, and application software. An average compression ratio of two or greater has been observed for many applications. Since the compressibility of memory contents varies dynamically, the size of the memory managed by the operating system is not fixed. In this paper, we describe operating system techniques that can deal with such dynamically changing memory sizes. We also demonstrate the performance impact of memory compression using the SPEC CPU2000 and SPECweb99 benchmarks. Results show that the hardware compression of memory has a negligible performance penalty compared to standard memory for many applications. For memory-starved applications and benchmarks such as SPECweb99, memory compression improves performance significantly. Results also show that the memory contents of many applications can be compressed, usually by a ratio of two to one.
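
A hedged sketch of the kind of policy such OS support implies: watch physical (compressed) utilization and reclaim pages when a high watermark is crossed, even if the expanded memory still looks free. The watermark value, the reclaim hook, and the utilization numbers below are hypothetical, not the techniques described in the paper.

```c
/* Hedged sketch of OS pressure handling when the "real" memory it manages is
 * larger than physical memory and compressibility changes at run time:
 * reclaim before physical space runs out, regardless of apparent free memory. */
#include <stdio.h>

#define HIGH_WATERMARK 0.90   /* start reclaiming above 90% physical use (illustrative) */

static void reclaim_pages(void) { puts("reclaiming/zeroing pages..."); }

static void check_physical_pressure(double phys_used_mb, double phys_total_mb,
                                    double expanded_free_mb)
{
    double util = phys_used_mb / phys_total_mb;
    if (util > HIGH_WATERMARK) {
        /* Expanded memory may look free, but physical space is nearly gone
         * (e.g., data became less compressible), so reclaim anyway. */
        printf("physical util %.0f%%, expanded free %.0f MB -> reclaim\n",
               util * 100.0, expanded_free_mb);
        reclaim_pages();
    }
}

int main(void)
{
    check_physical_pressure(950.0, 1024.0, 700.0);  /* pressure despite "free" memory */
    check_physical_pressure(400.0, 1024.0, 700.0);  /* no action needed */
    return 0;
}
```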

Compresso: Pragmatic Main Memory Compression

2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2018

Today, larger memory capacity and higher memory bandwidth are required for better performance and energy efficiency for many important client and datacenter applications. Hardware memory compression provides a promising direction to achieve this without increasing system cost. Unfortunately, current memory compression solutions face two significant challenges. First, keeping memory compressed requires additional memory accesses, sometimes on the critical path, which can cause performance overheads. Second, they require changing the operating system to take advantage of the increased capacity, and to handle incompressible data, which delays deployment. We propose Compresso, a hardware memory compression architecture that minimizes memory overheads due to compression, with no changes to the OS. We identify new data-movement trade-offs and propose optimizations that reduce additional memory movement to improve system efficiency. We propose a holistic evaluation for compressed systems. Our results show that Compresso achieves a 1.85x compression ratio for main memory on average, with a 24% speedup over a competitive hardware compressed system for single-core systems and 27% for multi-core systems. As compared to competitive compressed systems, Compresso not only reduces the performance overhead of compression, but also increases the performance gain from higher memory capacity.

I. INTRODUCTION

Memory compression can improve performance and reduce cost for systems with high memory demands, such as those used for machine learning, graph analytics, databases, gaming, and autonomous driving. We present Compresso, the first compressed main-memory architecture that: (1) explicitly optimizes for new trade-offs between compression mechanisms and the additional data movement required for their implementation, and (2) can be used without any modifications to either applications or the operating system. Compressing data in main memory increases its effective capacity, resulting in fewer accesses to secondary storage, thereby boosting performance. Fewer I/O accesses also improve tail latency [1] and decrease the need to partition tasks across nodes just to reduce I/O accesses [2, 3]. Additionally, transferring compressed cache lines from memory requires fewer bytes, thereby reducing memory bandwidth usage. The saved bytes may be used to prefetch other data [4, 5], or may…
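
One way to picture the data-movement problem this work targets is the size-class overflow on a write, sketched below. The size classes and the repacking policy are generic assumptions for illustration, not Compresso's actual layout or metadata.

```c
/* Hedged illustration of a data-movement trade-off in compressed main memory:
 * a compressed line is stored in a fixed size class, and a write that makes
 * the line less compressible overflows its slot and forces extra memory
 * movement (repacking).  Classes and policy are generic, not Compresso's. */
#include <stdio.h>

static const int size_classes[] = {16, 32, 64};   /* bytes per compressed line */
#define NUM_CLASSES 3

/* Smallest class that can hold a compressed line of `len` bytes. */
static int class_for(int len)
{
    for (int c = 0; c < NUM_CLASSES; c++)
        if (len <= size_classes[c]) return c;
    return NUM_CLASSES - 1;        /* stored uncompressed in the largest class */
}

int main(void)
{
    int stored_class = class_for(14);      /* line compressed to 14 bytes      */
    int new_len = 40;                      /* after a write it needs 40 bytes  */

    if (new_len > size_classes[stored_class]) {
        int new_class = class_for(new_len);
        printf("overflow: class %d -> %d, line must be moved (extra traffic)\n",
               stored_class, new_class);
    } else {
        printf("write fits in place, no extra movement\n");
    }
    return 0;
}
```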

Linearly compressed pages

Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture - MICRO-46, 2013

Data compression is a promising approach for meeting the increasing memory capacity demands expected in future systems. Unfortunately, existing compression algorithms do not translate well when directly applied to main memory because they require the memory controller to perform non-trivial computation to locate a cache line within a compressed memory page, thereby increasing access latency and degrading system performance. Prior proposals for addressing this performance degradation problem are either costly or energy inefficient.
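
The framework's central idea, reflected in its title, is to keep all cache lines within a page at a single compressed size, so that locating a line reduces to a simple linear computation. The sketch below contrasts that with a variable-size layout; the line count, sizes, and the prefix-sum walk are illustrative assumptions, not the paper's exact metadata design.

```c
/* Sketch of the addressing problem and the linear fix: if every cache line
 * within a page is compressed to one fixed size, the i-th line sits at a
 * simple linear offset, so the memory controller avoids a variable-size
 * lookup on the access path.  Sizes and layout are illustrative only. */
#include <stdio.h>
#include <stdint.h>

#define LINE_SIZE      64u      /* uncompressed cache line (bytes) */
#define LINES_PER_PAGE 64u      /* 4 KB page                       */

/* Variable-size layout: must sum the sizes of all preceding lines. */
static uint32_t offset_variable(const uint8_t comp_size[], uint32_t line_idx)
{
    uint32_t off = 0;
    for (uint32_t i = 0; i < line_idx; i++)
        off += comp_size[i];            /* serial work on the access path */
    return off;
}

/* Linearly compressed layout: one multiply, no per-line metadata walk. */
static uint32_t offset_linear(uint32_t fixed_comp_size, uint32_t line_idx)
{
    return fixed_comp_size * line_idx;
}

int main(void)
{
    uint8_t comp_size[LINES_PER_PAGE];
    for (uint32_t i = 0; i < LINES_PER_PAGE; i++)
        comp_size[i] = (uint8_t)(16 + (i % 3) * 8);   /* 16/24/32-byte lines */

    printf("variable layout, line 37 at offset %u\n",
           (unsigned)offset_variable(comp_size, 37));
    printf("linear layout (16 B/line), line 37 at offset %u\n",
           (unsigned)offset_linear(16, 37));
    return 0;
}
```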

Compacted CPU/GPU Data Compression via Modified Virtual Address Translation

Proc. ACM Comput. Graph. Interact. Tech., 2020

We propose a method to reduce the footprint of compressed data by using modified virtual address translation to permit random access to the data. This extends our prior work on using page translation to perform automatic decompression and deswizzling upon accesses to fixed rate lossy or lossless compressed data. Our compaction method allows a virtual address space the size of the uncompressed data to be used to efficiently access variable-size blocks of compressed data. Compression and decompression take place between the first and second level caches, which allows fast access to uncompressed data in the first level cache and provides data compaction at all other levels of the memory hierarchy. This improves performance and reduces power relative to compressed but uncompacted data. An important property of our method is that compression, decompression, and reallocation are automatically managed by the new hardware without operating system intervention and without storing compression...
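
A rough sketch of the mapping idea follows, assuming a simple software table from uncompressed-size pages to the offset and length of each variable-size compressed block. The paper's hardware address translation, deswizzling, and cache placement between L1 and L2 are not modeled; the table format is a hypothetical stand-in.

```c
/* Hedged sketch of the address-mapping idea: the application sees a virtual
 * range as large as the uncompressed data, and a translation table maps each
 * uncompressed-size page to its variable-size compressed block.  The table
 * below is illustrative, not the paper's hardware page-table extension. */
#include <stdio.h>
#include <stdint.h>

#define UNCOMP_PAGE 4096u

typedef struct {
    uint64_t comp_offset;   /* where this page's compressed block starts */
    uint32_t comp_len;      /* its compressed size in bytes              */
} page_map_t;

/* Translate an "uncompressed" virtual offset to its compressed block. */
static page_map_t translate(const page_map_t *table, uint64_t virt_off)
{
    return table[virt_off / UNCOMP_PAGE];
}

int main(void)
{
    /* Three pages compressed to 1024, 3000, and 512 bytes, packed back to back. */
    page_map_t table[3] = {
        { .comp_offset = 0,    .comp_len = 1024 },
        { .comp_offset = 1024, .comp_len = 3000 },
        { .comp_offset = 4024, .comp_len = 512  },
    };

    page_map_t m = translate(table, 2 * UNCOMP_PAGE + 100);  /* access into page 2 */
    printf("virtual page 2 -> compressed block at %llu, %u bytes\n",
           (unsigned long long)m.comp_offset, m.comp_len);
    return 0;
}
```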

Memory link compression to speedup scientific workloads

Limited off-chip memory bandwidth poses a significant challenge for today's multicore systems. Link compression provides an effective solution; however, its efficacy varies due to the diversity of transferred data. This work focuses on scientific applications that typically transfer huge amounts of floating-point data, which are notoriously difficult to compress. We explore the potential of BFPC, a recently proposed software compression algorithm, by modeling it as a hardware compressor and comparing it to three state-of-the-art schemes. We find that it can reduce floating-point data traffic by up to 40%, which for memory-bound executions translates to a significant performance improvement.
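
For intuition about why floating-point traffic can still be compressed on a link, the sketch below uses a generic predecessor-XOR scheme in the spirit of FPC/Gorilla-style compressors, not the paper's BFPC algorithm: smooth scientific data leaves many leading zero bits in the XOR of consecutive values, and those bits need not be transferred.

```c
/* Generic illustration (not BFPC) of floating-point link compression:
 * consecutive doubles in scientific data are often close in value, so XORing
 * each value with its predecessor yields many leading zero bits that a
 * hardware encoder can elide.  Uses the GCC/Clang __builtin_clzll builtin. */
#include <stdio.h>
#include <stdint.h>
#include <string.h>

static uint64_t bits_of(double d)
{
    uint64_t u;
    memcpy(&u, &d, sizeof u);
    return u;
}

int main(void)
{
    double series[] = { 1.000, 1.001, 1.002, 1.004 };   /* smooth data */
    uint64_t prev = bits_of(series[0]);

    for (int i = 1; i < 4; i++) {
        uint64_t cur = bits_of(series[i]);
        uint64_t xored = cur ^ prev;
        /* Leading zeros of the XOR are bits a link compressor need not send. */
        int lz = xored ? __builtin_clzll(xored) : 64;
        printf("value %.3f: %d of 64 bits identical to predecessor\n",
               series[i], lz);
        prev = cur;
    }
    return 0;
}
```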

Reducing Memory Traffic Via Redundant Store Instructions

1999

Some memory writes have the particular behaviour of not modifying memory, since the value they write is equal to the value already stored. These kinds of stores are what we call redundant stores. In this paper we study the behaviour of these particular stores and show that a significant amount of memory traffic between the first- and second-level caches can be avoided by exploiting this feature. We show that with no additional hardware (just a simple comparator) and without increasing the cache latency, we can achieve on average a 10% reduction in memory traffic.
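
A minimal sketch of the comparator idea: compare each store's value with the word already cached and forward the store only if the value differs. The toy cache, word granularity, and counters below are illustrative assumptions, not the paper's evaluated hardware.

```c
/* Minimal sketch of redundant-store detection: before forwarding a store to
 * the next cache level, compare the new value with the value already held;
 * if they are equal, the store is redundant and generates no traffic. */
#include <stdio.h>
#include <stdint.h>

#define CACHE_WORDS 1024

static uint32_t l1[CACHE_WORDS];          /* toy first-level cache contents */
static unsigned stores_total, stores_forwarded;

static void store(uint32_t index, uint32_t value)
{
    stores_total++;
    if (l1[index] == value)
        return;                           /* redundant store: suppress traffic    */
    l1[index] = value;
    stores_forwarded++;                   /* would go to the second-level cache   */
}

int main(void)
{
    for (int i = 0; i < 1000; i++)
        store(i % CACHE_WORDS, 0);        /* many writes of an unchanged zero */
    store(7, 42);                         /* one genuinely new value          */

    printf("stores: %u, forwarded: %u, traffic saved: %.1f%%\n",
           stores_total, stores_forwarded,
           100.0 * (stores_total - stores_forwarded) / stores_total);
    return 0;
}
```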