Improving Data Prefetching Efficacy in Multimedia Applications

Hardware prefetching techniques for cache memories in multimedia applications

Proceedings Fifth IEEE International Workshop on Computer Architectures for Machine Perception, 2000

The workload of multimedia applications has a strong impact on cache memory performance, since the locality of memory references embedded in multimedia programs differs from that of traditional programs. In many cases, standard cache memory organization achieves poorer performance when used for multimedia. A widely explored approach to improving cache performance is hardware prefetching, which allows data to be pre-loaded into the cache before they are referenced. However, existing hardware prefetching approaches realize only part of the potential performance improvement, since they are not tailored to multimedia locality. In this paper we propose novel, effective approaches to hardware prefetching for image processing programs in multimedia. Experimental results are reported for a suite of multimedia image processing programs including convolutions with kernels, MPEG-2 decoding, and edge chain coding.

Temporal analysis of cache prefetching strategies for multimedia applications

2001

Prefetching is a widely adopted technique for improving the performance of cache memories. Performance is typically affected by design parameters, such as cache size and associativity, but also by the type of locality embodied in the programs. In particular, multimedia tools and programs handling images and video are characterized by a bi-dimensional spatial locality that could be greatly exploited by the inclusion of prefetching in the cache architecture. In this paper we compare several prefetching techniques for multimedia programs (such as MPEG compression, image processing, and visual object segmentation) by performing a detailed evaluation of the memory access time. The goal is to prove that a significant speedup can be achieved by using either standard prefetching techniques (such as OBL or adaptive prefetching) or some innovative, image-oriented prefetching methods, like the neighbor prefetching described in the paper. Performance is measured with the PRIMA trace-driven simulator.
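The OBL (one-block lookahead) scheme named in this abstract is simple enough to sketch: whenever block b is touched, block b+1 is pre-loaded as well. The toy model below is an illustration only, not the PRIMA simulator; the block size, cache capacity, and access trace are arbitrary assumptions.

```python
BLOCK = 64          # bytes per cache block (assumed)
NUM_BLOCKS = 16     # cache capacity in blocks (assumed)

def simulate(addresses, obl=False):
    """Count demand misses; with obl=True, every access to block b
    also pre-loads block b+1 (the prefetch-always OBL variant)."""
    cache = []                       # LRU list of resident block numbers
    misses = 0

    def touch(b, demand):
        nonlocal misses
        if b in cache:
            cache.remove(b)          # hit: refresh LRU position
        elif demand:
            misses += 1              # demand miss (prefetches don't count)
        cache.append(b)
        if len(cache) > NUM_BLOCKS:
            cache.pop(0)             # evict the least recently used block

    for a in addresses:
        b = a // BLOCK
        touch(b, demand=True)
        if obl:
            touch(b + 1, demand=False)
    return misses

# A purely sequential scan: OBL hides every miss after the first block.
trace = list(range(0, 4096, 4))
print(simulate(trace), simulate(trace, obl=True))   # → 64 1
```

On sequential traces like this one OBL is nearly perfect, which is exactly why image-oriented schemes such as neighbor prefetching are needed for workloads whose locality is two-dimensional rather than linear.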

Hardware versus hybrid data prefetching in multimedia processors: A case study

…, IPCCC'00, …, 2000

Data prefetching is a promising technique for hiding the penalties due to compulsory cache misses. In this paper, we present a case study on two types of data prefetching in the context of multimedia processing: a purely hardware-based technique and a lower-cost hybrid hardware/software technique. Moreover, we also propose a technique for increasing the so-called prefetch distance in hardware prefetching and a scheme to reduce thrashing in the data cache. Our results demonstrate that the low-cost hybrid prefetching scheme slightly outperforms hardware-based prefetching for the code segments to which both solutions have been applied, while hardware prefetching potentially allows more code to benefit from prefetching.

Neighbor Cache Prefetching for Multimedia Image and Video Processing

IEEE Transactions on Multimedia, 2004

Cache performance is strongly influenced by the type of locality embodied in programs. In particular, multimedia programs handling images and videos are characterized by a bidimensional spatial locality, which is not adequately exploited by standard caches. In this paper we propose novel cache prefetching techniques for image data, called neighbor prefetching, which improve the exploitation of bidimensional spatial locality. A performance comparison is provided against other established prefetching techniques on a multimedia workload (with MPEG-2 and MPEG-4 decoding, image processing, and visual object segmentation), including a detailed evaluation of both the miss rate and the memory access time. Results prove that neighbor prefetching achieves a significant reduction in the time due to delayed memory cycles (more than 97% on MPEG-4, compared with 75% for the second-best technique). This reduction leads to a substantial speedup in the overall memory access time (up to 140% for MPEG-4). Performance has been measured with the PRIMA trace-driven simulator, specifically devised to support cache prefetching.
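The core of the neighbor-prefetching idea can be made concrete with a short sketch: for an image stored row-major, a reference to one cache block also nominates the blocks holding its left/right neighbours and the blocks in the same column one image row above and below. The image width, block size, and candidate set below are assumptions for illustration, not the paper's exact policy.

```python
WIDTH = 512                        # image row width in bytes (assumed)
BLOCK = 64                         # cache block size in bytes (assumed)
BLOCKS_PER_ROW = WIDTH // BLOCK    # blocks spanning one image row

def neighbor_blocks(block):
    """Prefetch candidates for `block` under a row-major image layout:
    left and right neighbours, plus the same column one row up/down."""
    return [block - 1,                   # left
            block + 1,                   # right
            block - BLOCKS_PER_ROW,      # above
            block + BLOCKS_PER_ROW]      # below

print(neighbor_blocks(100))   # → [99, 101, 92, 108]
```

A linear scheme like OBL would only ever nominate block 101 here; the vertical candidates (92 and 108) are what capture the second dimension of spatial locality that image filters and motion-compensation loops exhibit.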

Data-type dependent cache prefetching for MPEG applications

2002

Data cache prefetching is an effective technique for improving the performance of cache memories, whenever the prefetching algorithm is able to correctly predict useful data to be prefetched. To this aim, adequate information on the program's data locality must be used by the prefetching algorithm. In particular, multimedia applications are characterized by a substantial amount of image and video processing, which exhibits spatial locality in both dimensions of the 2D data structures used for images and frames. However, in multimedia programs many memory references are also made to non-image data, characterized by standard spatial locality. In this work, we explore the adoption of different prefetching techniques depending on the data type (i.e., image and non-image), thus making it possible to tune the prefetching algorithms to the different forms of locality and to achieve overall performance optimization. In order to prevent interference between the two data types, a split cache with two separate caches for image and non-image data is also evaluated as an alternative to a standard unified cache. Results on a multimedia workload (MPEG-2 and MPEG-4 decoders) show that standard prefetching techniques such as one-block lookahead and stride prediction are effective for standard data, while novel 2D prefetching techniques perform best on image data. In addition, for equal total size, unified caches in general offer better performance than split caches, thanks to the more flexible allocation of the unified cache space.
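The stride prediction mentioned for non-image data usually works from a small per-instruction table: remember the last address and last stride seen at each load, and predict the next address once the same stride repeats. The table layout and twice-confirmed policy below are common textbook assumptions, not necessarily this paper's exact design.

```python
def make_predictor():
    table = {}   # pc -> (last_addr, last_stride)

    def access(pc, addr):
        """Record one access by instruction `pc`; return a predicted
        prefetch address once the stride has been seen twice, else None."""
        last_addr, last_stride = table.get(pc, (None, None))
        stride = None if last_addr is None else addr - last_addr
        table[pc] = (addr, stride)
        if stride is not None and stride == last_stride and stride != 0:
            return addr + stride       # stride confirmed: predict ahead
        return None

    return access

access = make_predictor()
# A load at pc=0x40 walking an array with a constant stride of 8 bytes:
print([access(0x40, a) for a in (1000, 1008, 1016, 1024)])
# → [None, None, 1024, 1032]
```

This style of predictor captures the regular, one-dimensional strides of non-image data well, which is exactly why the paper pairs it with a separate 2D scheme for image accesses.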

Pattern-driven prefetching for multimedia applications on embedded processors

Journal of Systems Architecture, 2006

Multimedia applications in general, and video processing such as MPEG-4 Visual stream decoders in particular, are increasingly popular and important workloads for future embedded systems. Due to the high computational requirements, the need for low-power, high-performance embedded processors for multimedia applications is growing very fast. This paper proposes a new data prefetch mechanism called pattern-driven prefetching (PDP). PDP inspects the sequence of data cache misses and detects recurring patterns within that sequence. The patterns that are observed are based on the notions of the inter-miss stride (memory address stride between two misses) and the inter-miss interval (number of cycles between two misses). According to the patterns being detected, PDP initiates prefetch actions to anticipate future accesses and hide memory access latencies. PDP includes a simple yet effective stop criterion to avoid cache pollution and to reduce the number of additional memory accesses. The additional hardware needed for PDP is very limited, making it an effective prefetch mechanism for embedded systems. In our experimental setup, we use cycle-level power/performance simulations of the MPEG-4 Visual stream decoders from the MoMuSys reference software with various video streams. Our results show that PDP increases performance by as much as 45%, 24% and 10% for 2KB, 4KB and 8KB data caches, respectively, while the increase in external memory accesses remains under 0.6%. In conjunction with these performance increases, system-level (on-chip plus off-chip) energy reductions of 20%, 11.5% and 8% are obtained for 2KB, 4KB and 8KB data caches, respectively. In addition, we report significant speedups (up to 160%) for various other multimedia applications. Finally, we also show that PDP outperforms stream buffers.
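The inter-miss-stride detection at the heart of PDP can be sketched loosely: scan the stream of miss addresses and report a stride only when it recurs often enough. The threshold and the function itself are invented for illustration; the real mechanism also tracks the inter-miss interval in cycles to time its prefetches and applies the stop criterion described above.

```python
from collections import Counter

def dominant_miss_stride(miss_addrs, min_share=0.5):
    """Return the most common inter-miss stride if it accounts for at
    least `min_share` of all observed strides, else None (no pattern)."""
    strides = [b - a for a, b in zip(miss_addrs, miss_addrs[1:])]
    if not strides:
        return None
    stride, count = Counter(strides).most_common(1)[0]
    return stride if count / len(strides) >= min_share else None

# Misses walking a frame buffer 64 bytes at a time, with one outlier:
print(dominant_miss_stride([0, 64, 128, 192, 500, 564, 628]))   # → 64
```

Working on the miss stream rather than the full access stream is what keeps the hardware cost low: the predictor only observes events the cache already has to handle.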

Application-specific file prefetching for multimedia programs

2000 IEEE International Conference on Multimedia and Expo. ICME2000. Proceedings. Latest Advances in the Fast Changing World of Multimedia (Cat. No.00TH8532)

This paper describes the design, implementation, and evaluation of an automatic application-specific file prefetching mechanism that is designed to improve the I/O performance of multimedia programs with complicated access patterns. The key idea of the proposed approach is to convert an application into two threads: a computation thread, which is the original program containing both computation and disk I/O, and a prefetch thread, which contains all the instructions in the original program that are related to disk accesses. At run time, the prefetch thread is scheduled to run far ahead of the computation thread, so that disk blocks can be prefetched and put in the file system buffer cache before the computation thread needs them. A source-to-source translator is developed to automatically generate the prefetch and computation threads from a given application program without any user intervention. We have successfully implemented a prototype of this automatic application-specific file prefetching mechanism under Linux. The prototype is shown to provide as much as 54% overall performance improvement for real-world multimedia applications.
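The two-thread structure described above can be rendered as a toy: a prefetch thread replays the program's access pattern doing only I/O, pulling blocks into a shared cache, so the computation thread later finds them resident. This is a deliberately simplified stand-in; the real system uses a source-to-source translator and the Linux buffer cache, and the block size and access pattern here are made up.

```python
import os
import tempfile
import threading

BLOCK = 4096                      # "disk block" size (assumed)
pattern = [3, 0, 2, 1]            # non-sequential access order (assumed)

# Create a scratch file whose i-th block is filled with the byte i.
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "wb") as f:
    for i in range(4):
        f.write(bytes([i]) * BLOCK)

cache = {}                         # stands in for the OS buffer cache
lock = threading.Lock()

def read_block(i):
    with lock:
        if i in cache:
            return cache[i]        # already prefetched: no disk access
    with open(path, "rb") as f:    # the slow "disk" path
        f.seek(i * BLOCK)
        data = f.read(BLOCK)
    with lock:
        cache[i] = data
    return data

def prefetch_thread():
    for i in pattern:              # same access pattern, I/O only
        read_block(i)

def computation_thread(out):
    for i in pattern:              # original program: I/O + compute
        out.append(read_block(i)[0])

t = threading.Thread(target=prefetch_thread)
t.start()
t.join()                           # joined here for determinism; in the
result = []                        # real system it runs far *ahead*
computation_thread(result)
os.unlink(path)
print(result)                      # → [3, 0, 2, 1]
```

The point of carving the prefetch thread out of the original program, rather than guessing in hardware, is that it reproduces arbitrarily complicated access patterns exactly, including data-dependent ones no stride predictor could follow.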