Instruction combining for coalescing memory accesses using global code motion

Optimizing the memory bandwidth with loop morphing

Proceedings of the 15th IEEE International Conference on Application-Specific Systems, Architectures and Processors, 2004

The memory bandwidth largely determines the performance of embedded systems. However, compilers very often ignore the actual behavior of the memory architecture, causing a large performance loss. To better utilize the memory bandwidth, several researchers have introduced instruction scheduling/data assignment techniques. Because these techniques only optimize the bandwidth inside each basic block, they often fail to use all the available bandwidth. Loop fusion is an interesting alternative for optimizing the memory access schedule more globally. By fusing loops we increase the number of independent memory operations inside each basic block. The compiler can then better exploit the available bandwidth and increase the system's performance. However, existing fusion techniques can only combine loops with conformable headers. To overcome this limitation we present loop morphing: we combine fusion with strip mining and loop splitting. We also introduce a technique to steer loop morphing such that we find a compact memory access schedule. Experimental results show that with our approach we can decrease the execution time by up to 38%.
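As a concrete illustration of the transformation (a minimal sketch; the array names, sizes, and strip factor are ours, not the paper's benchmarks): two loops whose headers do not conform (N vs. 2*N iterations) cannot be fused directly, but strip mining the second loop by a factor of 2 makes the headers match, and the fused body then offers the scheduler more independent memory operations.

#include <stdio.h>

#define N 1000

static int a[N], b[N], c[2 * N], d[2 * N];

int main(void) {
    /* before morphing: two separate loops with non-conformable headers
       for (int i = 0; i < N; i++)     a[i] = b[i] + 1;
       for (int j = 0; j < 2 * N; j++) c[j] = d[j] * 2;   */

    /* after morphing: the second loop is strip-mined by 2 and fused with
       the first, so each iteration carries three independent accesses */
    for (int i = 0; i < N; i++) {
        a[i]         = b[i] + 1;          /* body of loop 1 */
        c[2 * i]     = d[2 * i] * 2;      /* strip of loop 2 */
        c[2 * i + 1] = d[2 * i + 1] * 2;
    }

    printf("%d %d\n", a[0], c[0]);
    return 0;
}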

A methodology correlating code optimizations with data memory accesses, execution time and energy consumption

The Journal of Supercomputing, 2019

The advent of data proliferation and electronic devices puts software with low execution time and energy consumption in the spotlight. The key to optimizing software is the correct choice, ordering, and parameterization of optimizing transformations, a problem that has remained open in compilation research for decades for various reasons. First, most of the transformations are interdependent, and thus addressing them separately is not effective. Second, it is very hard to couple the transformation parameters to the processor architecture (e.g., cache size) and algorithm characteristics (e.g., data reuse); therefore, compiler designers and researchers either do not take them into account at all or do so only partly. Third, the exploration space, i.e., the set of all optimization configurations that have to be explored, is huge, and thus searching is impractical. In this paper, the above problems are addressed for data-dominant affine loop kernels, delivering significant contributions. A novel methodology is presented that reduces the exploration space of six code optimizations by many orders of magnitude. The objective can be execution time (ET), energy consumption (E), or the number of L1, L2, and main memory accesses. The exploration space is reduced in two phases: first, by applying a novel register blocking algorithm and a novel loop tiling algorithm, and second, by computing the maximum and minimum ET/E values for each optimization set. The proposed methodology has been evaluated for both embedded and general purpose CPUs and for seven well-known algorithms, achieving high memory access, speedup, and energy consumption gains (from 1.17 up to 40) over the gcc compiler, hand-written optimized code, and Polly. The exploration space from which the near-optimum parameters are selected is reduced by 17 up to 30 orders of magnitude. Keywords: code optimizations, data cache, register blocking, loop tiling, high performance, energy consumption, data reuse.
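As a concrete illustration of the register blocking and loop tiling that the methodology parameterizes, here is a minimal self-contained sketch of a two-level tiled matrix-matrix multiply (MMM); the tile sizes KK, II, and JJ are illustrative placeholders, not values selected by the methodology, and the paper's block-layout counters are left out.

#include <stdio.h>

#define N  64   /* problem size (illustrative) */
#define KK 32   /* tile sizes: hypothetical, not the paper's tuned values */
#define II 16
#define JJ 16

static double A[N][N], B[N][N], C[N][N];

int main(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            A[i][j] = 1.0;
            B[i][j] = 2.0;
        }

    /* two-level tiled MMM: the outer kk/ii/jj loops walk tiles sized for
       the caches, and the inner loops stay inside one tile so its working
       set is reused from cache instead of being refetched from memory */
    for (int kk = 0; kk < N; kk += KK)
        for (int ii = 0; ii < N; ii += II)
            for (int jj = 0; jj < N; jj += JJ)
                for (int i = ii; i < ii + II; i++)
                    for (int k = kk; k < kk + KK; k++)
                        for (int j = jj; j < jj + JJ; j++)
                            C[i][j] += A[i][k] * B[k][j];

    printf("C[0][0] = %f\n", C[0][0]);
    return 0;
}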

Dynamic coalescing for 16-bit instructions

ACM Transactions on Embedded Computing Systems, 2005

In the embedded domain, memory usage and energy consumption are critical constraints. Embedded processors such as the ARM and MIPS provide a 16-bit instruction set (called Thumb in the case of the ARM family of processors) in addition to the 32-bit instruction set to address these concerns. Using 16-bit instructions one can achieve code size reduction and instruction cache energy savings at the cost of performance. This paper presents a novel approach that enhances the performance of 16-bit Thumb code. We have observed that throughout Thumb code there exist Thumb instruction pairs that are equivalent to a single ARM instruction. We have developed enhancements to the processor microarchitecture and the Thumb instruction set to exploit this property. We enhance the Thumb instruction set by incorporating Augmenting eXtensions (AX). A Thumb instruction pair that can be combined into a single ARM instruction is replaced by an AXThumb instruction pair by the compiler. The AX instruction is coalesced with the immediately following Thumb instruction to generate a single ARM instruction at decode time. The enhanced microarchitecture ensures that coalescing does not introduce pipeline delays or increase cycle time, thereby reducing both instruction counts and cycle counts. Using AX instructions and coalescing hardware we are also able to support efficient predicated execution in 16-bit mode.
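The following toy decoder illustrates the idea of decode-time coalescing; the encodings are invented for illustration and do not match the actual Thumb or AXThumb instruction formats.

#include <stdint.h>
#include <stdio.h>

/* Toy model: an AX-style prefix halfword carries the operand bits that do
   not fit in a 16-bit instruction, and the decoder fuses it with the
   immediately following halfword into one 32-bit operation, so the pair
   issues as a single instruction. */

#define OP_AX 0xF   /* hypothetical prefix opcode in the top 4 bits */

static uint32_t decode(const uint16_t *stream, int n) {
    uint32_t issued = 0;
    for (int i = 0; i < n; i++) {
        uint16_t insn = stream[i];
        if ((insn >> 12) == OP_AX && i + 1 < n) {
            /* coalesce: prefix payload + next halfword = one 32-bit insn */
            uint32_t fused = ((uint32_t)(insn & 0x0FFF) << 16) | stream[++i];
            printf("fused 32-bit insn: 0x%08x\n", fused);
        } else {
            printf("plain 16-bit insn: 0x%04x\n", insn);
        }
        issued++;
    }
    return issued; /* dynamic instruction count after coalescing */
}

int main(void) {
    uint16_t stream[] = { 0x1234, 0xF0AB, 0x5678, 0x9ABC };
    printf("issued %u operations for 4 halfwords\n", decode(stream, 4));
    return 0;
}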

Compiler Optimizations for High Performance Architectures

We describe two ongoing compiler projects for high performance architectures at the University of Maryland, being developed using the Stanford SUIF compiler infrastructure. First, we are investigating the impact of compilation techniques for eliminating synchronization overhead in compiler-parallelized programs running on software distributed-shared-memory (DSM) systems. Second, we are evaluating data layout transformations to improve cache performance on uniprocessors by eliminating conflict misses through inter- and intra-variable padding. Our optimizations have been implemented in SUIF and tested on a number of programs. Preliminary results are encouraging.
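A minimal sketch of the padding idea (array shapes and pad sizes are illustrative, not values chosen by the SUIF passes): inter- and intra-variable padding perturb array placement so that arrays accessed together no longer map to the same cache sets.

#include <stdio.h>

#define N   1024  /* power-of-two row length: worst case for direct-mapped caches */
#define PAD 8     /* padding in elements; illustrative, not a tuned value */

/* Without padding, a[i][j] and b[i][j] can map to the same cache set on a
   direct-mapped cache whose size divides N*sizeof(double), so the loop
   below ping-pongs between them (conflict misses). Intra-variable padding
   widens each row by PAD elements so successive rows land in different
   sets; inter-variable padding inserts a dummy array so a and b start at
   different set indices. */
static double a[N][N + PAD];
static double pad_between[PAD];          /* inter-variable padding */
static double b[N][N + PAD];

int main(void) {
    double sum = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j] * b[i][j];    /* a and b no longer conflict */
    printf("%f %p\n", sum, (void *)pad_between);
    return 0;
}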

Optimizing the instruction cache performance of the operating system

IEEE Transactions on Computers, 1998

High instruction cache hit rates are key to high performance. One known technique to improve the hit rate of caches is to minimize cache interference by improving the layout of the basic blocks of the code. However, the performance impact of this technique has been reported for application code only, even though there is evidence that the operating system often uses the cache heavily and with less uniform patterns than applications. It is unknown how well existing optimizations perform for systems code and whether better optimizations can be found. We address this problem in this paper. This paper characterizes, in detail, the locality patterns of the operating system code and shows that there is substantial locality. Unfortunately, caches are not able to extract much of it: rarely executed special-case code disrupts spatial locality, loops with few iterations that call routines make loop locality hard to exploit, and plenty of loop-less code hampers temporal locality. Based on our observations, we propose an algorithm to expose these localities and reduce interference in the cache. For a range of cache sizes, associativities, line sizes, and organizations, we show that we reduce total instruction miss rates by 31-86 percent, or up to 2.9 absolute points. Using a simple model, this corresponds to execution time reductions on the order of 10-25 percent. In addition, our optimized operating system combines well with optimized and unoptimized applications.
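The abstract does not reproduce the proposed layout algorithm; as background, here is a minimal sketch of greedy, profile-driven basic-block chaining in the spirit of Pettis-Hansen code positioning, the family of layout techniques such optimizations build on (the block count and edge weights are invented).

#include <stdio.h>

#define NBLOCKS 6

struct edge { int src, dst, weight; };

/* does following next-links from 'from' ever reach 'to'? (cycle guard) */
static int reaches(const int *next, int from, int to) {
    for (int cur = from; cur != -1; cur = next[cur])
        if (cur == to) return 1;
    return 0;
}

int main(void) {
    struct edge edges[] = {          /* profiled branch frequencies */
        {0, 1, 100}, {1, 2, 90}, {1, 3, 10}, {2, 1, 85}, {3, 4, 10}, {4, 5, 8},
    };
    int nedges = (int)(sizeof edges / sizeof edges[0]);
    int next[NBLOCKS], has_pred[NBLOCKS] = {0};
    for (int i = 0; i < NBLOCKS; i++) next[i] = -1;

    /* sort edges by descending weight (selection sort for brevity) */
    for (int i = 0; i < nedges; i++)
        for (int j = i + 1; j < nedges; j++)
            if (edges[j].weight > edges[i].weight) {
                struct edge t = edges[i]; edges[i] = edges[j]; edges[j] = t;
            }

    /* greedily place the hottest successor right after its predecessor so
       the frequent path falls through instead of jumping */
    for (int i = 0; i < nedges; i++) {
        int s = edges[i].src, d = edges[i].dst;
        if (next[s] == -1 && !has_pred[d] && !reaches(next, d, s)) {
            next[s] = d;
            has_pred[d] = 1;
        }
    }

    /* emit chains: each chain is a run of blocks laid out contiguously */
    for (int b = 0; b < NBLOCKS; b++) {
        if (has_pred[b]) continue;       /* b starts a chain */
        printf("chain:");
        for (int cur = b; cur != -1; cur = next[cur]) printf(" B%d", cur);
        printf("\n");
    }
    return 0;
}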

Improving Program Efficiency by Packing Instructions into Registers

ACM SIGARCH Computer Architecture News, 2005

New processors, both embedded and general purpose, often have conflicting design requirements involving space, power, and performance. Architectural features and compiler optimizations often target one or more design goals at the expense of the others. This paper presents a novel architectural and compiler approach to simultaneously reduce power requirements, decrease code size, and improve performance by integrating an instruction register file (IRF) into the architecture. Frequently occurring instructions are placed in the IRF. Multiple entries in the IRF can be referenced by a single packed instruction in ROM or the L1 instruction cache. Unlike conventional code compression, our approach allows the frequent instructions to be referenced in arbitrary combinations. The experimental results show significant improvements in space and power, as well as some improvement in execution time, when using only 32 entries. These advantages make packing instructions into registers an effective approach for improving overall efficiency.
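A toy model of the fetch/decode path may help: a packed word names several IRF entries by index, and the decoder expands it back into full instructions. The encoding below (5-bit indices, index 31 as a stop marker) is invented for illustration and does not match the paper's ISA extension.

#include <stdint.h>
#include <stdio.h>

#define IRF_ENTRIES 32
#define SLOTS 5                        /* 5 x 5-bit indices fit in 25 bits */

static uint32_t irf[IRF_ENTRIES];      /* preloaded frequent instructions */

/* expand one packed word into up to SLOTS full-width instructions */
static int unpack(uint32_t packed, uint32_t *out) {
    int n = 0;
    for (int s = 0; s < SLOTS; s++) {
        uint32_t idx = (packed >> (5 * s)) & 0x1F;
        if (idx == 0x1F) break;        /* 31 reserved as a stop marker */
        out[n++] = irf[idx];
    }
    return n;
}

int main(void) {
    for (int i = 0; i < IRF_ENTRIES; i++) irf[i] = 0xE0000000u | (uint32_t)i;

    /* one fetched word stands for three frequent instructions: 2, 7, 12 */
    uint32_t packed = 2u | (7u << 5) | (12u << 10) | (0x1Fu << 15) | (0x1Fu << 20);
    uint32_t insns[SLOTS];
    int n = unpack(packed, insns);
    printf("1 packed word expanded to %d instructions\n", n);
    for (int i = 0; i < n; i++) printf("  0x%08x\n", insns[i]);
    return 0;
}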

Compiler optimizations for I/O-intensive computations

Proceedings of the 1999 International Conference on Parallel Processing, 1999

This paper describes transformation techniques for out-of-core programs (i.e., those that deal with very large quantities of data) based on exploiting locality using a combination of loop and data transformations. Writing efficient out-of-core programs is an arduous task. As a result, compiler optimizations directed at improving I/O performance are becoming increasingly important. We describe how a compiler can improve the performance of the code by determining appropriate file layouts for out-of-core arrays and finding suitable loop transformations. In addition to optimizing a single loop nest, our solution can handle a sequence of loop nests. We also show how to generate code when the file layouts are optimized. Experimental results obtained on an Intel Paragon distributed-memory message-passing multiprocessor demonstrate marked improvements in performance due to the optimizations described in this paper.
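A minimal sketch of the out-of-core pattern such compilers generate (the file name, sizes, and the doubling computation are illustrative): the loop is tiled so each tile of the file-resident array is read once, processed in memory, and written back, with the file layout matching the loop order so reads stay contiguous.

#include <stdio.h>
#include <stdlib.h>

#define N    4096          /* logical rows of the out-of-core array */
#define TILE 256           /* rows per in-core tile */
#define COLS 512

int main(void) {
    FILE *f = fopen("big_array.bin", "r+b");   /* hypothetical data file */
    if (!f) { perror("fopen"); return 1; }

    double *tile = malloc(sizeof(double) * TILE * COLS);
    if (!tile) { fclose(f); return 1; }

    for (long r = 0; r < N; r += TILE) {
        long off = r * COLS * (long)sizeof(double);
        fseek(f, off, SEEK_SET);                 /* contiguous read: file  */
        if (fread(tile, sizeof(double), TILE * COLS, f)  /* layout matches */
                != (size_t)(TILE * COLS))                /* the loop order */
            break;

        for (long i = 0; i < TILE * COLS; i++)   /* in-core computation */
            tile[i] = tile[i] * 2.0 + 1.0;

        fseek(f, off, SEEK_SET);
        fwrite(tile, sizeof(double), TILE * COLS, f);    /* write tile back */
    }

    free(tile);
    fclose(f);
    return 0;
}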

Automatic Loop Tiling for Direct Memory Access

2011 IEEE International Parallel & Distributed Processing Symposium, 2011

In heterogeneous multi-core systems, such as the Cell BE processor, each accelerator core has its own fast local memory without hardware-supported coherence, and the software is responsible for dynamically transferring data between the fast local and the slow global memory. The data can be transferred through either a software-controlled cache or a direct buffer. The software-controlled cache maintains correctness for arbitrary access patterns, but introduces the extra overhead of cache lookup. A direct buffer is efficient for regular accesses, but requires precise analysis, detailed modeling of execution, and significant code generation. In this paper we present the design and implementation of DMATiler, which combines compiler analysis and runtime management to optimize local memory performance via automatic loop tiling and buffer optimization techniques.
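A minimal sketch of the direct-buffer pattern that DMATiler automates (the tile size and computation are illustrative, and memcpy stands in for a real runtime's DMA get/put primitives, e.g., the Cell SPE's mfc_get/mfc_put):

#include <string.h>
#include <stdio.h>

#define N    8192
#define TILE 512

static float global_a[N];          /* slow global memory */
static float local_buf[TILE];      /* fast local store on the accelerator */

int main(void) {
    for (int i = 0; i < N; i++) global_a[i] = (float)i;

    /* tiled loop: stage a tile into local memory, compute, write it back */
    for (int t = 0; t < N; t += TILE) {
        memcpy(local_buf, &global_a[t], sizeof local_buf);   /* "DMA get" */

        for (int i = 0; i < TILE; i++)                       /* compute on  */
            local_buf[i] = local_buf[i] * 0.5f;              /* local data  */

        memcpy(&global_a[t], local_buf, sizeof local_buf);   /* "DMA put" */
    }

    printf("a[100] = %f\n", global_a[100]);
    return 0;
}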

Compiler Analysis and Optimizations: What is New?

2003

Traditional compiler analyses and back-end optimizations, which play an important role in generating efficient code for modern high-performance processors, are quite mature, well understood, and have been widely used in production compilers. However, recent advances in high-performance (general purpose) processor architecture, the emergence of novel architectural paradigms, the emphasis on application-specific processors and embedded systems, and the increasing trend toward compiling applications directly onto silicon present several interesting challenges and opportunities in high-performance compilation techniques. In this paper we discuss the trends that are emerging to meet the above challenges. In particular, we discuss recent advances in data flow analyses, compiling techniques for embedded and DSP processors, and compiling techniques that reduce power consumption.

Compiler Optimizations for Improving Data Locality

SIGPLAN Notices, 1994

In the past decade, processor speed has become significantly faster than memory speed. Small, fast cache memories are designed to overcome this discrepancy, but they are only effective when programs exhibit data locality. In this paper, we present compiler optimizations to improve data locality based on a simple yet accurate cost model. The model computes both temporal and spatial reuse of cache lines to find desirable loop organizations. The cost model drives the application of compound transformations consisting of loop permutation, loop fusion, loop distribution, and loop reversal. We demonstrate that these program transformations are useful for optimizing many programs.
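As a concrete example of one such transformation, the loop permutation below reorders a matrix-multiply nest so the innermost loop accesses memory at unit stride, the kind of organization the cost model is designed to find (the sizes are illustrative):

#include <stdio.h>

#define N 256

static double A[N][N], B[N][N], C[N][N];

int main(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) { A[i][j] = 1.0; B[i][j] = 2.0; }

    /* poor order (k innermost): B is walked column-wise, stride N
       for (i...) for (j...) for (k...) C[i][j] += A[i][k] * B[k][j];  */

    /* permuted i-k-j order: in C's row-major layout, C[i][j] and B[k][j]
       are contiguous in j (spatial reuse), and A[i][k] is invariant in
       the inner loop (temporal reuse) */
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++)
            for (int j = 0; j < N; j++)
                C[i][j] += A[i][k] * B[k][j];

    printf("C[0][0] = %f\n", C[0][0]);
    return 0;
}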