Compacted CPU/GPU Data Compression via Modified Virtual Address Translation

Automatic GPU Data Compression and Address Swizzling for CPUs via Modified Virtual Address Translation

2020

We describe how to modify hardware page translation to enable CPU software access to compressed and swizzled GPU data arrays as if they were decompressed and stored in row-major order. In a shared memory system, this allows the CPU to access GPU data directly, without copying the data or losing the performance and bandwidth benefits of using compression and swizzling on the GPU. Our method is flexible enough to support a wide variety of existing and future swizzling and compression schemes, including block-based lossless compression that requires per-block metadata. Providing automatic compression can improve performance, even without considering the cost of copying data. In our experiments, we observed up to 33% reduction in CPU/memory energy use and up to 35% reduction in CPU computation time.

CCS Concepts: • Computing methodologies → Image compression; Graphics processors; • Computer systems organization → Processors and memory architectures.
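As an illustration of what a row-major view of swizzled data involves, the following is a minimal sketch that assumes a Morton (Z-order) layout on a square, power-of-two image. The abstract does not name a particular swizzle, so the layout and all function names here (`part1by1`, `morton_offset`, `swizzle`, `read_texel`) are hypothetical; the paper performs the remapping in hardware page translation, whereas this sketch only shows the index arithmetic in software.

```python
def part1by1(v: int) -> int:
    """Spread the low 16 bits of v so a zero bit sits between each original bit."""
    v &= 0x0000FFFF
    v = (v | (v << 8)) & 0x00FF00FF
    v = (v | (v << 4)) & 0x0F0F0F0F
    v = (v | (v << 2)) & 0x33333333
    v = (v | (v << 1)) & 0x55555555
    return v


def morton_offset(x: int, y: int) -> int:
    """Element offset of (x, y) in an assumed Morton (Z-order) layout."""
    return part1by1(x) | (part1by1(y) << 1)


def row_major_offset(x: int, y: int, width: int) -> int:
    """Element offset of (x, y) in a plain row-major layout."""
    return y * width + x


def swizzle(linear: list, width: int, height: int) -> list:
    """Reorder a row-major buffer into Morton order (square power-of-two sizes only)."""
    out = [None] * (width * height)
    for y in range(height):
        for x in range(width):
            out[morton_offset(x, y)] = linear[row_major_offset(x, y, width)]
    return out


def read_texel(swizzled: list, x: int, y: int):
    """Read (x, y) from the swizzled buffer as if it were stored row-major."""
    return swizzled[morton_offset(x, y)]


if __name__ == "__main__":
    linear = list(range(16))          # 4x4 image, row-major
    gpu_buffer = swizzle(linear, 4, 4)
    view = [read_texel(gpu_buffer, x, y) for y in range(4) for x in range(4)]
    print(gpu_buffer)
    print(view == linear)
```

The caller of `read_texel` uses only (x, y) coordinates and never sees the storage order, which is the property the abstract describes for CPU software.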

Compresso: Pragmatic Main Memory Compression

2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2018

Today, larger memory capacity and higher memory bandwidth are required for better performance and energy efficiency for many important client and datacenter applications. Hardware memory compression provides a promising direction to achieve this without increasing system cost. Unfortunately, current memory compression solutions face two significant challenges. First, keeping memory compressed requires additional memory accesses, sometimes on the critical path, which can cause performance overheads. Second, they require changing the operating system to take advantage of the increased capacity, and to handle incompressible data, which delays deployment. We propose Compresso, a hardware memory compression architecture that minimizes memory overheads due to compression, with no changes to the OS. We identify new data-movement trade-offs and propose optimizations that reduce additional memory movement to improve system efficiency. We propose a holistic evaluation for compressed systems. Our results show that Compresso achieves a 1.85x compression for main memory on average, with a 24% speedup over a competitive hardware compressed system for single-core systems and 27% for multi-core systems. As compared to competitive compressed systems, Compresso not only reduces performance overhead of compression, but also increases performance gain from higher memory capacity.

I. INTRODUCTION

Memory compression can improve performance and reduce cost for systems with high memory demands, such as those used for machine learning, graph analytics, databases, gaming, and autonomous driving. We present Compresso, the first compressed main-memory architecture that: (1) explicitly optimizes for new trade-offs between compression mechanisms and the additional data movement required for their implementation, and (2) can be used without any modifications to either applications or the operating system.
Compressing data in main memory increases its effective capacity, resulting in fewer accesses to secondary storage, thereby boosting performance. Fewer I/O accesses also improve tail latency [1] and decrease the need to partition tasks across nodes just to reduce I/O accesses [2, 3]. Additionally, transferring compressed cache lines from memory requires fewer bytes, thereby reducing memory bandwidth usage. The saved bytes may be used to prefetch other data [4, 5], or may

Enabling technologies for memory compression: Metadata, mapping, and prediction

2016

Future systems dealing with big-data workloads will be severely constrained by the high performance and energy penalty imposed by data movement. This penalty can be reduced by storing datasets in DRAM or NVM main memory in compressed formats. Prior compressed memory systems have required significant changes to the operating system, thus limiting commercial viability. The first contribution of this paper is to integrate compression metadata with ECC metadata so that the compressed memory system can be implemented entirely in hardware with no OS involvement. We show that in such a system, read operations are unable to exploit the benefits of compression because the compressibility of the block is not known beforehand. To address this problem, we introduce a compressibility predictor that yields an accuracy of 97%. We also introduce a new data mapping policy that is able to maximize read/write parallelism and NVM endurance, when dealing with compressed blocks. Combined, our proposals a...

The feasibility of using compression to increase memory system performance

We investigate the feasibility of using instruction compression at some level in a multi-level memory hierarchy to increase memory system performance. Compression effectively increases the memory size and the line size, reducing the miss rate at the expense of increased access latency due to decompression delays.
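The trade-off between a lower miss rate and added decompression latency can be made concrete with the standard average memory access time (AMAT) formula, hit time plus miss rate times miss penalty. The sketch below is not the paper's model: it assumes the decompression delay is paid on every hit at the compressed level, and the function names and all numbers are hypothetical.

```python
def amat(hit_time: float, miss_rate: float, miss_penalty: float) -> float:
    """Average memory access time: hit_time + miss_rate * miss_penalty (in cycles)."""
    return hit_time + miss_rate * miss_penalty


def compression_pays_off(hit_time: float,
                         miss_rate: float,
                         miss_penalty: float,
                         decompress_delay: float,
                         compressed_miss_rate: float) -> bool:
    """True if the compressed level has a lower AMAT than the uncompressed one.

    Assumption: decompression adds a fixed delay to every hit, and
    compression changes only the miss rate, not the miss penalty.
    """
    baseline = amat(hit_time, miss_rate, miss_penalty)
    compressed = amat(hit_time + decompress_delay, compressed_miss_rate, miss_penalty)
    return compressed < baseline


if __name__ == "__main__":
    # Hypothetical numbers: 1-cycle hit, 100-cycle miss penalty, 2-cycle decompression.
    print(compression_pays_off(1, 0.10, 100, 2, 0.07))  # miss rate drops from 10% to 7%
    print(compression_pays_off(1, 0.10, 100, 2, 0.09))  # miss rate drops from 10% to 9%
```

Under these assumptions, compression helps only when the miss-rate reduction multiplied by the miss penalty exceeds the decompression delay.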

Operating system support for fast hardware compression of main memory contents

2000

A novel computer system hardware has been built for compressing main memory contents. This presents to the operating systems an expanded real memory larger than the physically available memory. Two to one or better compression ratio has been observed for most applications. As the compression ratio of applications dynamically changes so does the real memory size that is managed by the OS. In this paper, we describe and evaluate the operating system techniques developed for compressed memory systems that can deal with such dynamically changing memory size conditions.

Cache compression for microprocessor performance

Computer systems and microarchitecture researchers have proposed using hardware data compression units within the memory hierarchies of microprocessors in order to improve performance, energy efficiency, and functionality. However, most past work, and all work on cache compression, has made unsubstantiated assumptions about the performance, power consumption, and area overheads of the proposed compression algorithms and hardware. In this work, I present a lossless compression algorithm that has been designed for fast on-line data compression, and cache compression in particular. The algorithm has a number of novel features tailored for this application, including combining pairs of compressed lines into one cache line and allowing parallel compression of multiple words while using a single dictionary and without degradation in compression ratio. We reduced the proposed algorithm to a register transfer level hardware design, permitting performance, power consumption, and area estimation.

Hardware compressed main memory: operating system support and performance evaluation

IEEE Transactions on Computers, 2001

A new memory subsystem, called Memory Xpansion Technology (MXT), has been built for compressing main memory contents. MXT effectively doubles the physically available memory transparently to the CPUs, input/output devices, device drivers, and application software. An average compression ratio of two or greater has been observed for many applications. Since compressibility of memory contents varies dynamically, the size of the memory managed by the operating system is not fixed. In this paper, we describe operating system techniques that can deal with such dynamically changing memory sizes. We also demonstrate the performance impact of memory compression using the SPEC CPU2000 and SPECweb99 benchmarks. Results show that the hardware compression of memory has a negligible performance penalty compared to a standard memory for many applications. For memory-starved applications and benchmarks such as SPECweb99, memory compression improves the performance significantly. Results also show that the memory contents of many applications can be compressed, usually by a factor of two to one.

Effective algorithms for cache-level compression

Proceedings of the 11th Great Lakes symposium on VLSI, 2001

Compression at the cache level has the potential to increase microprocessor performance by decreasing the cache miss rate and increasing the effective bandwidth by transmitting compressed data. This paper presents four compression algorithms that would be suitable for use in a compressed cache architecture and shows the results of using them to compress SPEC95 benchmarks. These algorithms exhibit a 7.8% to 99.8% improvement in compression ratio over an algorithm known to be effective for cache compression.

An analytical model for software-only main memory compression

Proceedings of the 3rd workshop on Memory performance issues in conjunction with the 31st international symposium on computer architecture - WMPI '04, 2004

Many applications with large data spaces that cannot run on a typical workstation (due to page faults) call for techniques to expand the effective memory size. One such technique is memory compression. Understanding which applications can benefit from main memory compression, and under which conditions, is complicated by various tradeoffs and the dynamic characteristics of applications. For instance, a large area to store compressed data increases the effective memory size considerably but also decreases the amount of memory that can hold uncompressed data. This paper presents an analytical model that states the conditions for a compressed-memory system to yield performance improvements. Parameters of the model are the compression algorithm efficiency, the amount of data being compressed, and the application memory access pattern. Such a model can be used by an operating system to compute the size of the compressed-memory level that can improve an application's performance.
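The capacity side of this tradeoff can be sketched with simple arithmetic. The example below is not the paper's analytical model, which also accounts for compression cost and access patterns. It assumes a fixed average compression ratio, and the names (`effective_capacity_mb`, `uncompressed_area_mb`) and the numbers are hypothetical.

```python
def uncompressed_area_mb(total_mb: float, compressed_region_mb: float) -> float:
    """Physical memory left for data held in uncompressed form."""
    return total_mb - compressed_region_mb


def effective_capacity_mb(total_mb: float,
                          compressed_region_mb: float,
                          compression_ratio: float) -> float:
    """Data that fits in memory when part of it is set aside for compressed pages.

    Assumption: every page in the compressed region shrinks by the same
    average ratio (e.g. 2.0 means pages shrink to half their size).
    """
    if not 0 <= compressed_region_mb <= total_mb:
        raise ValueError("compressed region must fit inside physical memory")
    uncompressed = uncompressed_area_mb(total_mb, compressed_region_mb)
    return uncompressed + compressed_region_mb * compression_ratio


if __name__ == "__main__":
    total = 1024.0  # hypothetical 1 GiB machine
    for region in (0.0, 256.0, 512.0, 768.0):
        print(region,
              uncompressed_area_mb(total, region),
              effective_capacity_mb(total, region, 2.0))
```

Growing the compressed region raises the effective capacity but shrinks the area that can be accessed without decompression, which is why the best region size depends on the application's access pattern.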