Systematic speed-power memory data-layout exploration for cache controlled embedded multimedia applications

Data and instruction memory exploration of embedded systems for multimedia applications

2001

A methodology for power optimization of the data memory hierarchy and the instruction memory is introduced. The effect of the methodology is demonstrated on a set of widely used multimedia application kernels, namely full search, hierarchical search, and parallel hierarchical one-dimensional search. Three different target architecture models are used. Data memory power reduction and instruction memory power reduction are tackled separately: the power-optimal data memory hierarchy is found by applying the appropriate data-reuse transformation, while instruction power is optimized through a suitable cache memory. Using data-reuse transformations, performance optimization techniques, and instruction-level transformations, we perform an exhaustive exploration of all possible alternatives to reach power-efficient solutions. The experimental results prove the efficiency of the methodology in terms of power for all the multimedia kernels.
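As an illustration of the data-reuse idea this abstract relies on, the following is a hedged Python sketch (not the paper's code; all names and sizes are invented): the search window of a full-search motion-estimation kernel is copied once into a small local buffer, standing in for an added on-chip memory layer, so the inner loops stop reading the large frame.

```python
# Hypothetical sketch of a data-reuse transformation for a full-search
# motion-estimation kernel. The counters model accesses to the large
# "main memory" frame; the transformed version reads the frame only once
# to fill a small reuse buffer (an assumed on-chip layer).

def full_search_direct(frame, block, bx, by, radius):
    """Baseline: every candidate comparison reads the frame directly."""
    reads = 0
    best = None
    n = len(block)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            sad = 0
            for y in range(n):
                for x in range(n):
                    sad += abs(frame[by + dy + y][bx + dx + x] - block[y][x])
                    reads += 1          # one main-memory read per pixel
            if best is None or sad < best[0]:
                best = (sad, dx, dy)
    return best, reads

def full_search_reuse(frame, block, bx, by, radius):
    """Transformed: copy the search window into a local buffer first."""
    n = len(block)
    side = 2 * radius + n
    # single pass over main memory fills the reuse buffer
    window = [[frame[by - radius + y][bx - radius + x] for x in range(side)]
              for y in range(side)]
    main_reads = side * side            # frame reads happen only here
    best = None
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            sad = 0
            for y in range(n):
                for x in range(n):
                    sad += abs(window[radius + dy + y][radius + dx + x]
                               - block[y][x])
            if best is None or sad < best[0]:
                best = (sad, dx, dy)
    return best, main_reads
```

Both variants find the same best match, but main-memory reads drop from (2r+1)² · n² to (2r+n)², which is the access-count reduction that a smaller, cheaper memory layer then turns into power savings.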

Memory hierarchy exploration for low power architectures in embedded multimedia applications

2001

Multimedia applications are characterized by an increased number of data transfer and storage operations due to real time requirements. Appropriate transformations can be applied at the algorithmic level to improve crucial implementation characteristics. In this paper, the effect of the data-reuse transformations on power consumption, area and performance of multimedia applications realized on embedded cores is examined. As demonstrators, widely applicable video processing algorithmic kernels, namely the row-column decomposition DCT and its fast implementation found in MPEG-X, are used. Experimental results prove that significant improvements in power consumption can be achieved without performance degradation by the application of data-reuse transformations in combination with the use of a custom memory hierarchy.

Cache conscious data layout organization for embedded multimedia applications

2001

Cache misses form a major bottleneck for real-time multimedia applications due to the off-chip accesses to the main memory. This results in a major access-bandwidth overhead (and related power consumption) as well as performance penalties. In this paper, we propose a new technique for organizing data in the main memory of data-dominated multimedia applications so as to remove the majority of the conflict cache misses. The focus of this paper is on the formal and heuristic algorithms we use to steer the data layout decisions and on the experimental results obtained using a prototype tool. Experiments on real-life demonstrators illustrate that we are able to remove a large fraction of the conflict misses for applications that have already been aggressively transformed at the source level. At the same time, we reduce the off-chip data accesses by up to 78%, and combined with address optimizations we are able to reduce the execution time. Thus our approach is complementary to the more conventional way of reducing misses by reorganizing the execution order.
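To make the conflict-miss mechanism behind this abstract concrete, here is an illustrative sketch (not the paper's algorithm; line size, set count, and array sizes are assumed) of why base-address placement matters in a direct-mapped cache: two arrays read in lockstep whose bases map to the same sets evict each other on every access, and one line of padding removes those conflicts.

```python
# Direct-mapped cache model: set index = (addr // LINE) % SETS.
# Two arrays accessed alternately thrash when their base addresses map
# to the same sets; shifting one base by a single line fixes the layout.

LINE = 16          # bytes per cache line (assumed)
SETS = 64          # sets in a direct-mapped cache (assumed)

def misses(base_a, base_b, n_elems, elem=4):
    """Count misses when a[i] and b[i] are read alternately."""
    cache = {}                       # set index -> resident tag
    miss = 0
    for i in range(n_elems):
        for base in (base_a, base_b):
            addr = base + i * elem
            s, tag = (addr // LINE) % SETS, addr // (LINE * SETS)
            if cache.get(s) != tag:
                miss += 1
                cache[s] = tag
    return miss

aligned = misses(0, SETS * LINE, 256)          # bases share every set
padded = misses(0, SETS * LINE + LINE, 256)    # one line of padding
```

With the conflicting layout every access misses (512 misses for 2 × 256 reads); after padding only the compulsory misses remain (128, one per cache line loaded).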

Memory Hierarchy Optimization of Multimedia Applications on Programmable Embedded Cores

2001

Data memory hierarchy optimization and partitioning for a widely used multimedia application kernel, the hierarchical motion estimation algorithm, is undertaken using global loop and data-reuse transformations for three different embedded processor architecture models. Exhaustive exploration of the obtained results clarifies the effect of the transformations on power, area, and performance, and also indicates a relation between the complexity of the application and the power savings obtained by this strategy. Furthermore, the significant contribution of the instruction memory to the total power budget, even after the application of performance optimizations, becomes evident, and a methodology is introduced to reduce this component.

Effective Cache Configuration for High Performance Embedded Systems

An embedded system contains both on-chip and off-chip memory modules with different access times. During system integration, the decision to map critical data onto faster memories is crucial. To obtain good performance with a limited amount of memory, the data buffers of the application need to be placed carefully in the different types of memory. There has been extensive research on improving the performance of the memory hierarchy, and recent advances in semiconductor technology have made power consumption a limiting factor for embedded system design as well. Since SRAM is faster than DRAM, a cache memory built from SRAM is placed between the CPU and the main memory, and the CPU accesses the main memory (DRAM) only via the cache. The cache size that can be included on a chip is limited by the large physical size and power consumption of SRAM cells, so configuring the cache effectively for small size and low power is crucial in embedded system design. We present a cache configuration technique for effective size reduction and high performance. The proposed methodology was tested in real-time hardware using an FPGA and validated with a matrix multiplication kernel over various workload sizes. Xilinx ISE 9.2i was used for simulation and synthesis, and the design was implemented in VHDL.
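The configuration search this abstract describes can be sketched as a simple simulation loop; the cache parameters, candidate sizes, and hit-rate target below are invented for illustration and are not the paper's method: replay the address trace of a small matrix multiplication through direct-mapped caches of candidate sizes and keep the smallest one meeting a target.

```python
# Hedged sketch of cache-configuration exploration: simulate candidate
# direct-mapped caches on a matrix-multiplication address trace and pick
# the smallest configuration that reaches a hit-rate target.

def matmul_trace(n, elem=4):
    """Yield read addresses of C = A*B for row-major n x n matrices."""
    base_a, base_b = 0, n * n * elem
    for i in range(n):
        for j in range(n):
            for k in range(n):
                yield base_a + (i * n + k) * elem   # A[i][k]
                yield base_b + (k * n + j) * elem   # B[k][j]

def hit_rate(trace, cache_bytes, line=16):
    sets = cache_bytes // line
    cache, hits, total = {}, 0, 0
    for addr in trace:
        s, tag = (addr // line) % sets, addr // (line * sets)
        total += 1
        if cache.get(s) == tag:
            hits += 1
        else:
            cache[s] = tag
    return hits / total

def smallest_config(n, candidates, target=0.80):
    """Return the smallest candidate size meeting the hit-rate target."""
    for size in sorted(candidates):
        if hit_rate(matmul_trace(n), size) >= target:
            return size
    return None
```

For a 16 × 16 multiplication, a 2 KB cache already holds both operand matrices, so the exploration settles on a small configuration rather than the largest one, which is the size/performance tradeoff the abstract targets.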

Power, Performance and Area Exploration for Data Memory Assignment of Multimedia Applications

2004

New embedded systems will feature more and more multimedia applications. In most multimedia applications, the dominant cost factor is related to organization of the memory architecture. One of the primary challenges in embedded system design is designing the memory hierarchy and restructuring the application to take advantage of it. Although in the past there has been extensive prior research on optimizing a system in terms of power or performance, this is, perhaps, the first technique that takes into consideration data reuse and limited lifetime of the arrays of a data dominated application, and performs a thorough exploration for different on-chip memory sizes, presenting not a single optimum, but a number of optimum implementations. We have developed a prototype tool that performs an automatic exploration and discovers all the performance, power consumption and on-chip memory size tradeoffs, which has been tested successfully on five applications.
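The "number of optimum implementations" this abstract reports is a Pareto front over power, performance, and on-chip memory size. A minimal sketch of that tradeoff extraction (the candidate numbers are invented; this is not the paper's tool):

```python
# Keep every non-dominated (power, execution time, on-chip memory) point
# instead of a single "best" implementation.

def dominates(p, q):
    """p dominates q if it is no worse in every metric and better in one."""
    return (all(a <= b for a, b in zip(p, q))
            and any(a < b for a, b in zip(p, q)))

def pareto_front(points):
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]

candidates = [
    (90, 12, 2048),   # (mW, ms, bytes of on-chip memory) -- all invented
    (70, 15, 4096),
    (95, 11, 2048),
    (70, 15, 8192),   # dominated: same power/time, more on-chip memory
    (60, 20, 1024),
]
front = pareto_front(candidates)
```

Every point on the returned front is a legitimate design choice; which one to build depends on whether the system is power-, speed-, or area-constrained.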

Address Bus Power Exploration in Programmable Processors for Realization of Multimedia Applications

Address-bus encoding schemes are used in this paper to reduce the address-bus power consumption in a general multimedia architecture executing four common motion estimation algorithms. The interaction of these encoding techniques with previously applied data-reuse transformations, which reduce the power consumed in both the data and the instruction memories of the programmable architecture, is thoroughly explored, and the results are extended to a multiprocessor environment.
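One well-known member of the encoding family this abstract explores is bus-invert coding (used here only to illustrate the class of techniques; the bus width and example words are assumptions): if more than half the bus lines would toggle, the complement of the word is sent and an extra invert line is raised, bounding the switching activity.

```python
# Bus-invert coding sketch: transmit w or ~w plus an invert bit,
# whichever causes fewer line transitions relative to the previous
# bus state. Transition count models dynamic switching power.

WIDTH = 8
MASK = (1 << WIDTH) - 1

def transitions(a, b):
    return bin(a ^ b).count("1")

def bus_invert(words):
    """Return encoded (value, invert_bit) pairs and total line transitions."""
    prev, prev_inv, total, out = 0, 0, 0, []
    for w in words:
        if transitions(prev, w) > WIDTH // 2:
            enc, inv = w ^ MASK, 1       # send the complement
        else:
            enc, inv = w, 0
        total += transitions(prev, enc) + (prev_inv ^ inv)
        out.append((enc, inv))
        prev, prev_inv = enc, inv
    return out, total

def raw_transitions(words):
    prev, total = 0, 0
    for w in words:
        total += transitions(prev, w)
        prev = w
    return total
```

The receiver recovers each word as `enc ^ MASK` when the invert bit is set, so the scheme is lossless while never toggling more than half the data lines plus the invert line per transfer.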

Power efficient instruction caches for embedded systems

2005

Instruction caches typically consume 27% of the total power in modern high-end embedded systems. We propose a compiler-managed instruction store architecture (K-store) that places the computation-intensive loops in a scratchpad-like SRAM memory and allocates the remaining instructions to a regular instruction cache. At runtime, execution switches dynamically between the instructions in the traditional instruction cache and the ones in the K-store via inserted jump instructions, which add 0.038% on average to the total dynamic instruction count. We compare the performance and energy consumption of the K-store with that of a conventional instruction cache of equal size: used in lieu of an 8KB, 4-way set-associative instruction cache, the K-store provides a 32% reduction in energy and a 7% reduction in execution time. Unlike loop caches, the K-store maps the frequent code in a reserved address space and hence can switch between the kernel memory and the instruction cache without any noticeable performance penalty.
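The core placement decision behind such a compiler-managed store can be sketched as a greedy knapsack over profile data; this is an illustrative stand-in, not the paper's actual allocator, and the loop names, sizes, and counts are made up:

```python
# Greedy selection of hot loops into a fixed-size instruction SRAM:
# rank loops by dynamic instructions per byte (fetch-energy saved per
# byte of scratchpad spent) and fill the SRAM in that order. Everything
# not selected stays in the regular instruction cache.

def select_loops(loops, sram_bytes):
    """loops: list of (name, size_bytes, dynamic_instr_count)."""
    ranked = sorted(loops, key=lambda l: l[2] / l[1], reverse=True)
    chosen, used = [], 0
    for name, size, count in ranked:
        if used + size <= sram_bytes:
            chosen.append(name)
            used += size
    return chosen, used

profile = [                         # invented profiling data
    ("fir_inner", 256, 900_000),
    ("idct_loop", 512, 400_000),
    ("init_tables", 1024, 2_000),
    ("huffman_decode", 768, 650_000),
]
chosen, used = select_loops(profile, 1024)
```

Rarely executed code such as `init_tables` never earns a scratchpad slot regardless of its size, which matches the intuition that only the computation-intensive loops belong in the K-store-style memory.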

Design space optimization of embedded memory systems via data remapping

Sigplan Notices, 2002

In this paper, we provide a novel compile-time data remapping algorithm that runs in linear time. This remapping algorithm is the first fully automatic approach applicable to pointer-intensive dynamic applications. We show that data remapping can be used to significantly reduce the energy consumed as well as the memory size needed to meet a user-specified performance goal (i.e., execution time), relative to the same application executing without being remapped. These twin advantages of a remapped program (reduced cache size and energy needs) constitute a key step in a framework for design space exploration: for any given performance goal, remapping allows the user to reduce the primary and secondary cache sizes by 50%, yielding a concomitant energy savings of 57%. Additionally, viewed as a compiler optimization for a fixed processor, we show that remapping improves the energy consumed by the cache subsystem by 25%. All of the above savings are in the context of the cache subsystem in isolation. We also show that remapping yields an average 20% energy saving for an ARM-like processor and cache subsystem. All of our improvements are achieved in the context of the DIS, OLDEN and SPEC2000 pointer-centric benchmarks.
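One remapping in the spirit of this paper is field-wise splitting of records (array-of-structures to structure-of-arrays); the sketch below is illustrative only, with an assumed line size and an invented record layout, and shows why a traversal that reads a single field touches far fewer cache lines after remapping.

```python
# Count distinct cache lines touched when scanning one 4-byte field of
# 1024 packed 32-byte records, before and after splitting that field
# into its own dense array.

LINE = 32          # cache line size in bytes (assumed)

def lines_touched(addresses, line=LINE):
    return len({a // line for a in addresses})

def aos_field_addrs(n, field_off, rec_size):
    """Addresses read when scanning one field of n packed records."""
    return [i * rec_size + field_off for i in range(n)]

def soa_field_addrs(n, field_size):
    """Same scan after remapping the field into its own dense array."""
    return [i * field_size for i in range(n)]

aos = lines_touched(aos_field_addrs(1024, field_off=8, rec_size=32))
soa = lines_touched(soa_field_addrs(1024, field_size=4))
```

Before remapping, each record occupies its own line, so the scan loads 1024 lines to use 4 bytes of each; afterwards the field is dense and the same scan loads only 128 lines, which is the kind of cache-size and energy headroom the paper's exploration framework exploits.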