Systematic speed-power memory data-layout exploration for cache controlled embedded multimedia applications

Data and instruction memory exploration of embedded systems for multimedia applications

2001

A methodology for power optimization of the data memory hierarchy and the instruction memory is introduced. The effect of the methodology is demonstrated on a set of widely used multimedia application kernels, namely full search, hierarchical search, and parallel hierarchical one-dimensional search. Three different target architecture models are used. Data memory power reduction and instruction memory power reduction are tackled separately: the power-optimal data memory hierarchy is found by applying the appropriate data-reuse transformation, while instruction power is optimized through a suitable cache memory. Using data-reuse transformations, performance optimization techniques, and instruction-level transformations, we perform an exhaustive exploration of all possible alternatives to reach power-efficient solutions. The experimental results prove the efficiency of the methodology in terms of power for all the multimedia kernels.
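As an illustration of the data-reuse idea this abstract relies on, the following is a hedged Python sketch (not the paper's code; all names and sizes are invented): the search window of a full-search motion-estimation kernel is copied once into a small local buffer, standing in for an added on-chip memory layer, so the inner loops stop reading the large frame.

```python
# Hypothetical sketch of a data-reuse transformation for a full-search
# motion-estimation kernel. The counters model accesses to the large
# "main memory" frame; the transformed version reads the frame only once
# to fill a small reuse buffer (an assumed on-chip layer).

def full_search_direct(frame, block, bx, by, radius):
    """Baseline: every candidate comparison reads the frame directly."""
    reads = 0
    best = None
    n = len(block)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            sad = 0
            for y in range(n):
                for x in range(n):
                    sad += abs(frame[by + dy + y][bx + dx + x] - block[y][x])
                    reads += 1          # one main-memory read per pixel
            if best is None or sad < best[0]:
                best = (sad, dx, dy)
    return best, reads

def full_search_reuse(frame, block, bx, by, radius):
    """Transformed: copy the search window into a local buffer first."""
    n = len(block)
    side = 2 * radius + n
    # single pass over main memory fills the reuse buffer
    window = [[frame[by - radius + y][bx - radius + x] for x in range(side)]
              for y in range(side)]
    main_reads = side * side            # frame reads happen only here
    best = None
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            sad = 0
            for y in range(n):
                for x in range(n):
                    sad += abs(window[radius + dy + y][radius + dx + x]
                               - block[y][x])
            if best is None or sad < best[0]:
                best = (sad, dx, dy)
    return best, main_reads
```

Both variants find the same best match, but main-memory reads drop from (2r+1)² · n² to (2r+n)², which is the access-count reduction that a smaller, cheaper memory layer then turns into power savings.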

Memory hierarchy exploration for low power architectures in embedded multimedia applications

2001

Multimedia applications are characterized by an increased number of data transfer and storage operations due to real time requirements. Appropriate transformations can be applied at the algorithmic level to improve crucial implementation characteristics. In this paper, the effect of the data-reuse transformations on power consumption, area and performance of multimedia applications realized on embedded cores is examined. As demonstrators, widely applicable video processing algorithmic kernels, namely the row-column decomposition DCT and its fast implementation found in MPEG-X, are used. Experimental results prove that significant improvements in power consumption can be achieved without performance degradation by the application of data-reuse transformations in combination with the use of a custom memory hierarchy.

Cache conscious data layout organization for embedded multimedia applications

2001

Cache misses form a major bottleneck for real-time multimedia applications due to the off-chip accesses to the main memory. This results in a major access-bandwidth overhead (and related power consumption) as well as performance penalties. In this paper, we propose a new technique for organizing data in the main memory of data-dominated multimedia applications so as to remove the majority of the conflict cache misses. The focus of this paper is on the formal and heuristic algorithms we use to steer the data layout decisions and on the experimental results obtained using a prototype tool. Experiments on real-life demonstrators illustrate that we are able to remove a large fraction of the conflict misses for applications that have already been aggressively transformed at the source level. At the same time, we reduce the off-chip data accesses by up to 78%, and combined with address optimizations we are able to reduce the execution time. Thus our approach is complementary to the more conventional way of reducing misses by reorganizing the execution order.
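To make the conflict-miss mechanism behind this abstract concrete, here is an illustrative sketch (not the paper's algorithm; line size, set count, and array sizes are assumed) of why base-address placement matters in a direct-mapped cache: two arrays read in lockstep whose bases map to the same sets evict each other on every access, and one line of padding removes those conflicts.

```python
# Direct-mapped cache model: set index = (addr // LINE) % SETS.
# Two arrays accessed alternately thrash when their base addresses map
# to the same sets; shifting one base by a single line fixes the layout.

LINE = 16          # bytes per cache line (assumed)
SETS = 64          # sets in a direct-mapped cache (assumed)

def misses(base_a, base_b, n_elems, elem=4):
    """Count misses when a[i] and b[i] are read alternately."""
    cache = {}                       # set index -> resident tag
    miss = 0
    for i in range(n_elems):
        for base in (base_a, base_b):
            addr = base + i * elem
            s, tag = (addr // LINE) % SETS, addr // (LINE * SETS)
            if cache.get(s) != tag:
                miss += 1
                cache[s] = tag
    return miss

aligned = misses(0, SETS * LINE, 256)          # bases share every set
padded = misses(0, SETS * LINE + LINE, 256)    # one line of padding
```

With the conflicting layout every access misses (512 misses for 2 × 256 reads); after padding only the compulsory misses remain (128, one per cache line loaded).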

Memory Hierarchy Optimization of Multimedia Applications on Programmable Embedded Cores

2001

Data memory hierarchy optimization and partitioning for a widely used multimedia application kernel, the hierarchical motion estimation algorithm, is undertaken using global loop and data-reuse transformations for three different embedded processor architecture models. Exhaustive exploration of the obtained results clarifies the effect of the transformations on power, area, and performance, and also indicates a relation between the complexity of the application and the power savings obtained by this strategy. Furthermore, the significant contribution of the instruction memory to the total power budget, even after the application of performance optimizations, becomes evident, and a methodology is introduced to reduce this component.

Effective Cache Configuration for High Performance Embedded Systems

An embedded system contains both on-chip and off-chip memory modules with different access times. During system integration, the decision to map critical data onto faster memories is crucial. To obtain good performance with a limited amount of memory, the data buffers of the application need to be placed carefully in the different types of memory. There has been extensive research on improving the performance of the memory hierarchy, and recent advances in semiconductor technology have made power consumption a limiting factor for embedded system design as well. Since SRAM is faster than DRAM, a cache memory built from SRAM is placed between the CPU and the main memory, and the CPU accesses the main memory (DRAM) only via the cache. The cache size that can be included on a chip is limited by the large physical size and power consumption of SRAM cells, so configuring the cache effectively for small size and low power is crucial in embedded system design. We present a cache configuration technique for effective size reduction and high performance. The proposed methodology was tested in real-time hardware using an FPGA and validated with a matrix multiplication kernel over various workload sizes. Xilinx ISE 9.2i was used for simulation and synthesis, and the design was implemented in VHDL.
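The configuration search this abstract describes can be sketched as a simple simulation loop; the cache parameters, candidate sizes, and hit-rate target below are invented for illustration and are not the paper's method: replay the address trace of a small matrix multiplication through direct-mapped caches of candidate sizes and keep the smallest one meeting a target.

```python
# Hedged sketch of cache-configuration exploration: simulate candidate
# direct-mapped caches on a matrix-multiplication address trace and pick
# the smallest configuration that reaches a hit-rate target.

def matmul_trace(n, elem=4):
    """Yield read addresses of C = A*B for row-major n x n matrices."""
    base_a, base_b = 0, n * n * elem
    for i in range(n):
        for j in range(n):
            for k in range(n):
                yield base_a + (i * n + k) * elem   # A[i][k]
                yield base_b + (k * n + j) * elem   # B[k][j]

def hit_rate(trace, cache_bytes, line=16):
    sets = cache_bytes // line
    cache, hits, total = {}, 0, 0
    for addr in trace:
        s, tag = (addr // line) % sets, addr // (line * sets)
        total += 1
        if cache.get(s) == tag:
            hits += 1
        else:
            cache[s] = tag
    return hits / total

def smallest_config(n, candidates, target=0.80):
    """Return the smallest candidate size meeting the hit-rate target."""
    for size in sorted(candidates):
        if hit_rate(matmul_trace(n), size) >= target:
            return size
    return None
```

For a 16 × 16 multiplication, a 2 KB cache already holds both operand matrices, so the exploration settles on a small configuration rather than the largest one, which is the size/performance tradeoff the abstract targets.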

Power, Performance and Area Exploration for Data Memory Assignment of Multimedia Applications

2004

New embedded systems will feature more and more multimedia applications. In most multimedia applications, the dominant cost factor is related to organization of the memory architecture. One of the primary challenges in embedded system design is designing the memory hierarchy and restructuring the application to take advantage of it. Although in the past there has been extensive prior research on optimizing a system in terms of power or performance, this is, perhaps, the first technique that takes into consideration data reuse and limited lifetime of the arrays of a data dominated application, and performs a thorough exploration for different on-chip memory sizes, presenting not a single optimum, but a number of optimum implementations. We have developed a prototype tool that performs an automatic exploration and discovers all the performance, power consumption and on-chip memory size tradeoffs, which has been tested successfully on five applications.
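The "number of optimum implementations" this abstract reports is a Pareto front over power, performance, and on-chip memory size. A minimal sketch of that tradeoff extraction (the candidate numbers are invented; this is not the paper's tool):

```python
# Keep every non-dominated (power, execution time, on-chip memory) point
# instead of a single "best" implementation.

def dominates(p, q):
    """p dominates q if it is no worse in every metric and better in one."""
    return (all(a <= b for a, b in zip(p, q))
            and any(a < b for a, b in zip(p, q)))

def pareto_front(points):
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]

candidates = [
    (90, 12, 2048),   # (mW, ms, bytes of on-chip memory) -- all invented
    (70, 15, 4096),
    (95, 11, 2048),
    (70, 15, 8192),   # dominated: same power/time, more on-chip memory
    (60, 20, 1024),
]
front = pareto_front(candidates)
```

Every point on the returned front is a legitimate design choice; which one to build depends on whether the system is power-, speed-, or area-constrained.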

Address Bus Power Exploration in Programmable Processors for Realization of Multimedia Applications

Address-bus encoding schemes are used in this paper to reduce the address-bus power consumption in a general multimedia architecture executing four common motion estimation algorithms. The interaction of these encoding techniques with previously applied data-reuse transformations, which reduce the power consumed in both the data and the instruction memories of the programmable architecture, is thoroughly explored, and the results are extended to a multiprocessor environment.
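One well-known member of the encoding family this abstract explores is bus-invert coding (used here only to illustrate the class of techniques; the bus width and example words are assumptions): if more than half the bus lines would toggle, the complement of the word is sent and an extra invert line is raised, bounding the switching activity.

```python
# Bus-invert coding sketch: transmit w or ~w plus an invert bit,
# whichever causes fewer line transitions relative to the previous
# bus state. Transition count models dynamic switching power.

WIDTH = 8
MASK = (1 << WIDTH) - 1

def transitions(a, b):
    return bin(a ^ b).count("1")

def bus_invert(words):
    """Return encoded (value, invert_bit) pairs and total line transitions."""
    prev, prev_inv, total, out = 0, 0, 0, []
    for w in words:
        if transitions(prev, w) > WIDTH // 2:
            enc, inv = w ^ MASK, 1       # send the complement
        else:
            enc, inv = w, 0
        total += transitions(prev, enc) + (prev_inv ^ inv)
        out.append((enc, inv))
        prev, prev_inv = enc, inv
    return out, total

def raw_transitions(words):
    prev, total = 0, 0
    for w in words:
        total += transitions(prev, w)
        prev = w
    return total
```

The receiver recovers each word as `enc ^ MASK` when the invert bit is set, so the scheme is lossless while never toggling more than half the data lines plus the invert line per transfer.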

Power efficient instruction caches for embedded systems

2005

Instruction caches typically consume 27% of the total power in modern high-end embedded systems. We propose a compiler-managed instruction store architecture (K-store) that places the computation-intensive loops in a scratchpad-like SRAM memory and allocates the remaining instructions to a regular instruction cache. At runtime, execution switches dynamically between the instructions in the traditional instruction cache and the ones in the K-store via inserted jump instructions, which add 0.038% on average to the total dynamic instruction count. We compare the performance and energy consumption of the K-store with that of a conventional instruction cache of equal size: used in lieu of an 8KB, 4-way set-associative instruction cache, the K-store provides a 32% reduction in energy and a 7% reduction in execution time. Unlike loop caches, the K-store maps the frequent code in a reserved address space and hence can switch between the kernel memory and the instruction cache without any noticeable performance penalty.
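The core placement decision behind such a compiler-managed store can be sketched as a greedy knapsack over profile data; this is an illustrative stand-in, not the paper's actual allocator, and the loop names, sizes, and counts are made up:

```python
# Greedy selection of hot loops into a fixed-size instruction SRAM:
# rank loops by dynamic instructions per byte (fetch-energy saved per
# byte of scratchpad spent) and fill the SRAM in that order. Everything
# not selected stays in the regular instruction cache.

def select_loops(loops, sram_bytes):
    """loops: list of (name, size_bytes, dynamic_instr_count)."""
    ranked = sorted(loops, key=lambda l: l[2] / l[1], reverse=True)
    chosen, used = [], 0
    for name, size, count in ranked:
        if used + size <= sram_bytes:
            chosen.append(name)
            used += size
    return chosen, used

profile = [                         # invented profiling data
    ("fir_inner", 256, 900_000),
    ("idct_loop", 512, 400_000),
    ("init_tables", 1024, 2_000),
    ("huffman_decode", 768, 650_000),
]
chosen, used = select_loops(profile, 1024)
```

Rarely executed code such as `init_tables` never earns a scratchpad slot regardless of its size, which matches the intuition that only the computation-intensive loops belong in the K-store-style memory.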

Design space optimization of embedded memory systems via data remapping

Sigplan Notices, 2002

In this paper, we provide a novel compile-time data remapping algorithm that runs in linear time. This remapping algorithm is the first fully automatic approach applicable to pointer-intensive dynamic applications. We show that data remapping can be used to significantly reduce the energy consumed as well as the memory size needed to meet a user-specified performance goal (i.e., execution time), relative to the same application executing without being remapped. These twin advantages of a remapped program (reduced cache size and energy needs) constitute a key step in a framework for design space exploration: for any given performance goal, remapping allows the user to reduce the primary and secondary cache sizes by 50%, yielding a concomitant energy savings of 57%. Additionally, viewed as a compiler optimization for a fixed processor, we show that remapping improves the energy consumed by the cache subsystem by 25%. All of the above savings are in the context of the cache subsystem in isolation. We also show that remapping yields an average 20% energy saving for an ARM-like processor and cache subsystem. All of our improvements are achieved in the context of the DIS, OLDEN and SPEC2000 pointer-centric benchmarks.
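One remapping in the spirit of this paper is field-wise splitting of records (array-of-structures to structure-of-arrays); the sketch below is illustrative only, with an assumed line size and an invented record layout, and shows why a traversal that reads a single field touches far fewer cache lines after remapping.

```python
# Count distinct cache lines touched when scanning one 4-byte field of
# 1024 packed 32-byte records, before and after splitting that field
# into its own dense array.

LINE = 32          # cache line size in bytes (assumed)

def lines_touched(addresses, line=LINE):
    return len({a // line for a in addresses})

def aos_field_addrs(n, field_off, rec_size):
    """Addresses read when scanning one field of n packed records."""
    return [i * rec_size + field_off for i in range(n)]

def soa_field_addrs(n, field_size):
    """Same scan after remapping the field into its own dense array."""
    return [i * field_size for i in range(n)]

aos = lines_touched(aos_field_addrs(1024, field_off=8, rec_size=32))
soa = lines_touched(soa_field_addrs(1024, field_size=4))
```

Before remapping, each record occupies its own line, so the scan loads 1024 lines to use 4 bytes of each; afterwards the field is dense and the same scan loads only 128 lines, which is the kind of cache-size and energy headroom the paper's exploration framework exploits.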