Custom parallel caching schemes for hardware-accelerated image compression
Related papers
Automatic On-chip Memory Minimization for Data Reuse
2007
FPGA-based computing engines have become a promising option for the implementation of computationally intensive applications due to their high flexibility and parallelism. However, one of the main obstacles to overcome when trying to accelerate an application on an FPGA is the bottleneck in off-chip communication, typically to large memories. Often it is known at compile time that the same data item is accessed many times, and as a result can be loaded once from large off-chip RAM onto scarce on-chip RAM, alleviating this bottleneck. This paper addresses how to automatically derive an address mapping that reduces the size of the required on-chip memory for a given memory access pattern. Experimental results demonstrate that, in practice, our approach reduces on-chip storage requirements to the minimum, corresponding to a reduction in on-chip memory size of up to 40× (average 10×) for some benchmarks compared to a naive approach. At the same time, no clock period penalty or increase in control logic area compared to this approach is observed for these benchmarks.
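The abstract does not reproduce the derivation, but the effect it describes can be pictured with a small sketch. The code below is a hypothetical illustration, not the paper's method: each off-chip input word of a sliding-window filter is fetched once, and a simple modulo address mapping keeps only the live window of K words in on-chip storage. The array sizes, the filter length and the mapping itself are assumptions made for the example.

```c
/* Illustrative sketch (not the paper's derivation): a sliding-window 1-D
 * filter in which each off-chip input element is reused K times.  A modulo
 * address mapping lets an on-chip buffer of only K words capture all reuse,
 * instead of buffering the whole input row. */
#include <stdio.h>

#define N 1024          /* off-chip input length (assumed)      */
#define K 8             /* filter taps = reuse factor (assumed) */

static int offchip_in[N];      /* stands in for external RAM     */
static int onchip_buf[K];      /* scarce on-chip RAM, K words    */

int main(void)
{
    for (int i = 0; i < N; i++)
        offchip_in[i] = i;                 /* dummy data */

    long checksum = 0;
    for (int i = 0; i < N; i++) {
        /* each input word is fetched from off-chip exactly once ... */
        onchip_buf[i % K] = offchip_in[i]; /* modulo address mapping */

        /* ... and reused K times from on-chip once the window is full */
        if (i >= K - 1) {
            int acc = 0;
            for (int t = 0; t < K; t++)
                acc += onchip_buf[(i - t) % K];
            checksum += acc;
        }
    }
    printf("checksum = %ld (on-chip storage: %d words, not %d)\n",
           checksum, K, N);
    return 0;
}
```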
Cache Compression for Microprocessor Performance
Computer systems and microarchitecture researchers have proposed using hardware data compression units within the memory hierarchies of microprocessors in order to improve performance, energy efficiency, and functionality. However, most past work, and all work on cache compression, has made unsubstantiated assumptions about the performance, power consumption, and area overheads of the proposed compression algorithms and hardware. In this work, I present a lossless compression algorithm that has been designed for fast on-line data compression, and cache compression in particular. The algorithm has a number of novel features tailored for this application, including combining pairs of compressed lines into one cache line and allowing parallel compression of multiple words while using a single dictionary and without degradation in compression ratio. We reduced the proposed algorithm to a register-transfer-level hardware design, permitting performance, power consumption, and area estimation.
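As a rough illustration of per-word dictionary compression of a cache line (full match, zero word, partial match, literal), the following software model may help. It is not the paper's algorithm or RTL design; the dictionary size, tag widths and line size are invented for the sketch.

```c
/* Rough software model of dictionary-based per-word cache-line compression
 * (full match / zero / partial match / literal).  All field widths and the
 * dictionary size are illustrative assumptions, not the paper's design. */
#include <stdint.h>
#include <stdio.h>

#define LINE_WORDS 16   /* 64-byte line of 32-bit words (assumed) */
#define DICT_WORDS 4    /* tiny shared dictionary (assumed)       */

/* Returns an estimate of the compressed size in bits for one cache line. */
static unsigned compress_line_bits(const uint32_t *line)
{
    uint32_t dict[DICT_WORDS] = {0};
    unsigned next = 0, bits = 0;

    for (int w = 0; w < LINE_WORDS; w++) {
        uint32_t v = line[w];
        int hit = -1, partial = -1;

        for (unsigned d = 0; d < DICT_WORDS; d++) {
            if (dict[d] == v)                    hit = (int)d;
            else if ((dict[d] >> 8) == (v >> 8)) partial = (int)d;
        }

        if (v == 0)            bits += 2;            /* zero-word tag      */
        else if (hit >= 0)     bits += 2 + 2;        /* tag + dict index   */
        else if (partial >= 0) bits += 2 + 2 + 8;    /* tag + index + byte */
        else                   bits += 2 + 32;       /* literal word       */

        dict[next] = v;                              /* shared dictionary  */
        next = (next + 1) % DICT_WORDS;
    }
    return bits;
}

int main(void)
{
    uint32_t line[LINE_WORDS] = {0, 0, 0x1000, 0x1004, 0x1008, 0x1000,
                                 0, 0x20, 0x20, 0x1008, 0, 0,
                                 0xDEADBEEF, 0x1000, 0, 0x24};
    printf("compressed: %u bits vs. %d bits raw\n",
           compress_line_bits(line), LINE_WORDS * 32);
    return 0;
}
```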
Transparent Dual Memory Compression Architecture
2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT), 2017
The increasing memory requirements of big data applications have been driving the precipitous growth of memory capacity in server systems. To maximize the efficiency of external memory, HW-based memory compression techniques have been proposed to increase effective memory capacity. Although such memory compression techniques can improve the memory efficiency significantly, a critical trade-off exists in the HW-based compression techniques. As the memory blocks need to be decompressed as quickly as possible to serve cache misses, latency-optimized techniques apply compression at the cacheline granularity, achieving the decompression latency of less than a few cycles. However, such latency-optimized techniques can lose the potential high compression ratios of capacity-optimized techniques, which compress larger memory blocks with longer latency algorithms. Considering the fundamental trade-off in the memory compression, this paper proposes a transparent dual memory compression (DMC) architecture, which selectively uses two compression algorithms with distinct latency and compression characteristics. Exploiting the locality of memory accesses, the proposed architecture compresses less frequently accessed blocks with a capacity-optimized compression algorithm, while keeping recently accessed blocks compressed with a latency-optimized one. Furthermore, instead of relying on the support from the virtual memory system to locate compressed memory blocks, the study advocates a HW-based translation between the uncompressed address space and compressed physical space. This OS-transparent approach eliminates conflicts between compression efficiency and large page support adopted to reduce TLB misses. The proposed compression architecture is applied to the Hybrid Memory Cube (HMC) with a logic layer under the stacked DRAMs. The experimental results show that the proposed compression architecture provides 54% higher compression ratio than the state-of-the-art latency-optimized technique, with no performance degradation over the baseline system without compression.
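A minimal sketch of the selection idea described above, assuming a simple per-block access counter and an invented hotness threshold; the actual DMC design is a hardware mechanism in the HMC logic layer, not this software model.

```c
/* Illustrative policy sketch only: classify memory blocks as "hot" or "cold"
 * from recent access counts and pick one of two compressors accordingly.
 * The block count and threshold are assumptions for the example. */
#include <stdio.h>

#define NUM_BLOCKS 64
#define HOT_THRESHOLD 4    /* accesses per epoch (assumed) */

enum compressor { LATENCY_OPT, CAPACITY_OPT };

static unsigned access_count[NUM_BLOCKS];

static void touch(unsigned block) { access_count[block % NUM_BLOCKS]++; }

static enum compressor choose(unsigned block)
{
    return access_count[block % NUM_BLOCKS] >= HOT_THRESHOLD
               ? LATENCY_OPT      /* fast decompression for hot data  */
               : CAPACITY_OPT;    /* denser encoding for cold data    */
}

int main(void)
{
    for (int i = 0; i < 100; i++) touch(i % 8);   /* a few hot blocks */
    touch(40);                                    /* one cold block   */

    printf("block 3  -> %s\n", choose(3)  == LATENCY_OPT ? "latency-optimized"
                                                         : "capacity-optimized");
    printf("block 40 -> %s\n", choose(40) == LATENCY_OPT ? "latency-optimized"
                                                         : "capacity-optimized");
    return 0;
}
```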
2017
Thread-level and data-level parallel architectures have become the design of choice in many of today's energy-efficient computing systems. However, these architectures put substantially higher requirements on the memory subsystem than scalar architectures, making memory latency and bandwidth critical in their overall efficiency. Data reuse exploration aims at reducing the pressure on the memory subsystem by exploiting the temporal locality in data accesses. In this paper, we investigate the effects on performance and energy from a data reuse methodology combined with parallelization and vectorization in multi- and many-core processors. As a test case, a full-search motion estimation kernel is evaluated on Intel® Core™ i7-4700K (Haswell) and i7-2600K (Sandy Bridge) multi-core processors, as well as on an Intel® Xeon Phi™ many-core processor (Knights Landing) with Streaming Single Instruction Multiple Data (SIMD) Extensions (SSE) and Advanced Vector Extensions (AVX) instruction ...
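For readers unfamiliar with the kernel, a plain scalar sketch of full-search motion estimation is given below. The frame size, block size and search range are assumptions, and the paper's evaluation additionally applies data reuse transformations and SSE/AVX vectorization that are not shown here.

```c
/* Scalar sketch of the full-search SAD kernel: the current 16x16 block is
 * copied once into a small local buffer and reused across every candidate
 * position in the search window.  Dimensions and search range are assumed. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define W 64
#define H 64
#define B 16           /* block size       */
#define R 8            /* +/- search range */

static uint8_t cur[H][W], ref[H][W];

int main(void)
{
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++) {
            cur[y][x] = (uint8_t)rand();
            ref[y][x] = (uint8_t)rand();
        }

    /* Copy the current block once: it is reused for every candidate. */
    uint8_t blk[B][B];
    for (int y = 0; y < B; y++)
        for (int x = 0; x < B; x++)
            blk[y][x] = cur[16 + y][16 + x];

    int best = 1 << 30, best_dx = 0, best_dy = 0;
    for (int dy = -R; dy <= R; dy++)
        for (int dx = -R; dx <= R; dx++) {
            int sad = 0;
            for (int y = 0; y < B; y++)
                for (int x = 0; x < B; x++)
                    sad += abs(blk[y][x] - ref[16 + y + dy][16 + x + dx]);
            if (sad < best) { best = sad; best_dx = dx; best_dy = dy; }
        }
    printf("best SAD %d at (%d,%d)\n", best, best_dx, best_dy);
    return 0;
}
```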
A survey of different approaches for overcoming the processor-memory bottleneck
International Journal of Computer Science and Information Technology, 2017
The growing rate of technology improvements has caused dramatic advances in processor performance, significantly speeding up processor clock frequencies and increasing the number of instructions that can be processed in parallel. This development in processor technology has brought performance improvements to computer systems, but not for all types of applications. The reason lies in the well-known von Neumann bottleneck, which occurs during communication between the processor and the main memory in a standard processor-centric system. This problem has been studied by many researchers, who have proposed different approaches for improving memory bandwidth and latency. This paper provides a brief review of these techniques and also gives a deep analysis of various memory-centric systems that implement different approaches to merging or placing memory near the processing elements. Within this analysis we discuss the advantages, disadvantages, and the application (purpose) of several well-known memory-centric systems.
International Symposium on System Synthesis (IEEE Cat. No.01EX526), 2001
The ever-increasing gap between processor and memory speeds has motivated the design of embedded systems with deeper cache hierarchies. To avoid excessive miss rates, instead of using bigger cache memories and more complex cache controllers, program transformations have been proposed to reduce the amount of capacity and conflict misses. This is achieved, however, by complicating the memory index arithmetic code, which results in performance degradation when executing the code on programmable processors with limited address capabilities. However, when these are complemented by high-level address code transformations, the overhead introduced can be largely eliminated at compile time. In this paper, the clear benefits of the combined approach are illustrated on two real-life applications of industrial relevance, using popular programmable processor architectures and showing important gains in energy (a factor of 2 less) with a relatively small penalty in execution time (8-25%) instead of factors of overhead without the address optimisation stage. The results of this paper lead to a systematic Pareto-optimal trade-off (supported by tools) between memory power and CPU cycles, which has up to now not been feasible for the targeted systems.
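The kind of index-arithmetic overhead and its removal can be pictured with a small hypothetical example that is not taken from the paper: a layout transformation introduces a modulo into the address expression, and an address code transformation strength-reduces it into an incrementally updated index so the inner loop carries no division.

```c
/* Hedged illustration of a high-level address code transformation: the
 * modulo introduced by a circular-buffer layout is replaced by an
 * incrementally updated index.  Buffer and trip counts are assumed. */
#include <stdio.h>

#define N 1000
#define BUF 64          /* circular buffer size from the layout transform */

static int buf[BUF];

int main(void)
{
    long s1 = 0, s2 = 0;
    for (int i = 0; i < BUF; i++) buf[i] = i;

    /* after the data transformation: modulo in the address expression */
    for (int i = 0; i < N; i++)
        s1 += buf[i % BUF];

    /* after the address optimisation: same access stream, no modulo    */
    for (int i = 0, idx = 0; i < N; i++) {
        s2 += buf[idx];
        if (++idx == BUF) idx = 0;
    }
    printf("%ld %ld (identical access streams)\n", s1, s2);
    return 0;
}
```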
Configurable parallel memory architecture for multimedia computers
Journal of Systems Architecture, 2002
This paper presents a novel parallel memory architecture for multimedia computers. By applying configurable or programmable addressing circuitry capable of parallel memory accesses, the memory management of multimedia applications can be enhanced. Necessary computer architecture changes to virtual address representation, paging, virtual memory, address computation circuitry and data permutation are discussed. These changes allow the memory to be partitioned for different access functions. In addition, the same memory area can be accessed by multiple access patterns. Therefore, a general-purpose computing system that is capable of exploiting the repeating memory access patterns in its applications can be built. Performance of the configurable parallel memory architecture (CPMA) is analyzed for a selection of algorithms from a video encoder. These motion estimation algorithms and zigzag scanning benefit from the multiple memory access functions, which is apparent from the comparisons to traditional sequential memory accesses.
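One classic way multiple access patterns can be served conflict-free is a skewed module-assignment function. The toy sketch below assumes four modules and a mapping m(i, j) = (i + j) mod M; it only illustrates the general idea and is not the CPMA addressing circuitry itself.

```c
/* Toy module-assignment sketch: the skewed mapping m(i, j) = (i + j) mod M
 * spreads both a row access and a column access over all M banks, so either
 * pattern can be serviced in one parallel cycle.  M is assumed. */
#include <stdio.h>

#define M 4     /* number of memory modules (assumed) */

static int module(int i, int j) { return (i + j) % M; }

static void show(const char *name, int fixed, int is_row)
{
    printf("%s:", name);
    for (int k = 0; k < M; k++)
        printf(" m=%d", is_row ? module(fixed, k) : module(k, fixed));
    printf("  (all distinct => conflict-free)\n");
}

int main(void)
{
    show("row 2   ", 2, 1);   /* elements (2,0)..(2,3) */
    show("column 1", 1, 0);   /* elements (0,1)..(3,1) */
    return 0;
}
```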
Pattern-driven prefetching for multimedia applications on embedded processors
Journal of Systems Architecture, 2006
Multimedia applications in general, and video processing such as MPEG-4 Visual stream decoders in particular, are increasingly popular and important workloads for future embedded systems. Due to the high computational requirements, the need for low-power, high-performance embedded processors for multimedia applications is growing very fast. This paper proposes a new data prefetch mechanism called pattern-driven prefetching (PDP). PDP inspects the sequence of data cache misses and detects recurring patterns within that sequence. The patterns that are observed are based on the notions of the inter-miss stride (memory address stride between two misses) and the inter-miss interval (number of cycles between two misses). According to the patterns being detected, PDP initiates prefetch actions to anticipate future accesses and hide memory access latencies. PDP includes a simple yet effective stop criterion to avoid cache pollution and to reduce the number of additional memory accesses. The additional hardware needed for PDP is very limited, making it an effective prefetch mechanism for embedded systems. In our experimental setup, we use cycle-level power/performance simulations of the MPEG-4 Visual stream decoders from the MoMuSys reference software with various video streams. Our results show that PDP increases performance by as much as 45%, 24% and 10% for 2KB, 4KB and 8KB data caches, respectively, while the increase in external memory accesses remains under 0.6%. In conjunction with these performance increases, system-level (on-chip plus off-chip) energy reductions of 20%, 11.5% and 8% are obtained for 2KB, 4KB and 8KB data caches, respectively. In addition, we report significant speedups (up to 160%) for various other multimedia applications. Finally, we also show that PDP outperforms stream buffers.
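A stripped-down software model of the stride part of this idea is sketched below; the confidence threshold is invented, and the inter-miss interval and the stop criterion mentioned in the abstract are omitted for brevity.

```c
/* Simplified model of miss-stream stride detection: when the same inter-miss
 * stride repeats, issue a prefetch for the next expected address.  The
 * threshold and the absence of timing checks are simplifications. */
#include <stdint.h>
#include <stdio.h>

#define CONFIDENCE 2    /* repeats needed before prefetching (assumed) */

static uint64_t last_miss, last_stride;
static unsigned streak;

static void on_cache_miss(uint64_t addr)
{
    uint64_t stride = addr - last_miss;
    streak = (stride == last_stride) ? streak + 1 : 0;
    last_stride = stride;
    last_miss = addr;

    if (streak >= CONFIDENCE)
        printf("miss 0x%llx -> prefetch 0x%llx (stride %llu)\n",
               (unsigned long long)addr,
               (unsigned long long)(addr + stride),
               (unsigned long long)stride);
}

int main(void)
{
    /* a strided miss stream, e.g. walking an array with 64-byte lines */
    for (uint64_t a = 0x1000; a < 0x1400; a += 64)
        on_cache_miss(a);
    return 0;
}
```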
The feasibility of using compression to increase memory system performance
We investigate the feasibility of using instruction compression at some level in a multi-level memory hierarchy to increase memory system performance. Compression effectively increases the memory size and the line size, reducing the miss rate at the expense of increased access latency due to decompression delays.
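The trade-off can be made concrete with a back-of-the-envelope average-access-time calculation; all numbers below are invented for illustration and are not results from the paper.

```c
/* Toy model of the trade-off: compression lowers the miss rate (larger
 * effective capacity/line size) but adds a decompression delay on each hit
 * to the compressed level.  All parameters are invented. */
#include <stdio.h>

int main(void)
{
    double hit = 10.0, miss_penalty = 100.0;        /* cycles (assumed) */
    double miss_plain = 0.10, miss_comp = 0.06;     /* miss rates       */
    double decomp = 5.0;                            /* extra cycles     */

    double amat_plain = hit + miss_plain * miss_penalty;
    double amat_comp  = (hit + decomp) + miss_comp * miss_penalty;

    printf("AMAT without compression: %.1f cycles\n", amat_plain);
    printf("AMAT with compression:    %.1f cycles\n", amat_comp);
    return 0;
}
```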
Towards an Efficient Memory Architecture for Video Decoding Systems
2012 Brazilian Symposium on Computing System Engineering, 2012
Multimedia applications are known to use large amounts of memory. The video modules also need a high-throughput memory port for coding and decoding high-resolution video sequences. The design of a multimedia System-on-Chip (SoC) could implement embedded block RAMs, but it is much more cost-effective to use a single external memory at the expense of a multichannel memory controller. This paper presents the design and implementation of an efficient memory hierarchy for a Set-Top Box (STB) SoC with a video decoder. To use the Double Data Rate (DDR) external memory efficiently, it must be accessed in burst mode whenever possible. In this paper we develop an analysis and implementation of a four-level memory hierarchy targeting data latency reduction and bandwidth optimization of the memory port. The case study is a DDR2 SDRAM memory used as the main system video memory in a digital television set-top box implemented on a Virtex-5 FPGA. This paper presents the architecture of the system and shows that the memory hierarchy efficiently uses the DDR characteristics while serving four client processes. The proposed memory architecture can reduce data latency by 78% when compared to a direct demand-access procedure.
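The benefit of burst-mode access that the abstract relies on can be pictured with a toy timing comparison; the DRAM timing parameters and burst length below are assumptions, not measurements from the paper's DDR2 setup.

```c
/* Toy comparison of word-by-word demand access versus burst-mode fetching:
 * a burst amortises the row-activation overhead over many words.  All
 * timing parameters are invented for illustration. */
#include <stdio.h>

int main(void)
{
    int words      = 64;   /* pixels fetched for one client request */
    int t_activate = 20;   /* cycles to open a DRAM row (assumed)   */
    int t_word     = 1;    /* cycles per word once streaming        */
    int burst_len  = 8;    /* DDR2 burst length (assumed)           */

    /* pessimistic demand model: pay the activation cost on every word */
    int demand_cycles = words * (t_activate + t_word);
    /* burst model: one activation per burst of burst_len words        */
    int burst_cycles  = (words / burst_len) * (t_activate + burst_len * t_word);

    printf("word-by-word: %d cycles, burst mode: %d cycles\n",
           demand_cycles, burst_cycles);
    return 0;
}
```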