Link-Time Optimization of Address Calculation on a 64-bit Architecture

Analysis of high-level address code transformations for programmable processors

2000

Memory-intensive applications require considerable arithmetic for the computation and selection of the different memory access pointers. These memory address calculations often involve complex (non)linear arithmetic expressions which have to be calculated during program execution under tight timing constraints, thus becoming a crucial bottleneck in the overall system performance. This paper explores the applicability and effectiveness of source-level optimisations (as opposed to instruction-level ones) for address computations in the context of multimedia. We propose and evaluate two processor-target independent source-level optimisation techniques, namely, global scope operation cost minimisation complemented with loop-invariant code hoisting, and non-linear operator strength reduction. The transformations attempt to achieve minimal code execution within loops and reduced operator strengths. The effectiveness of the transformations is demonstrated with two real-life multimedia application kernels by comparing the improvements in the number of execution cycles, before and after applying the systematic source-level optimisations, using state-of-the-art C compilers on several popular RISC platforms.
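To make the two transformations concrete, here is a minimal C sketch (my own illustration, not taken from the paper) of loop-invariant code hoisting combined with operator strength reduction on a 2-D array traversal:

```c
/* Illustrative sketch: a 2-D access whose index expression a[i*N + j]
 * hides a multiplication inside the inner loop. */
#define N 64

void scale_before(int *a, int c) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i * N + j] *= c;      /* i*N recomputed on every access */
}

/* After hoisting the row base out of the inner loop and reducing the
 * multiplication to a carried addition across iterations: */
void scale_after(int *a, int c) {
    int *row = a;                   /* hoisted: base address of row i */
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++)
            row[j] *= c;
        row += N;                   /* strength-reduced: +N instead of i*N */
    }
}
```

The inner loop now performs only additions on the address, which is exactly the reduced operator strength the paper targets.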

Efficient Address Translation for Architectures with Multiple Page Sizes

ASPLOS, 2017

Processors and operating systems (OSes) support multiple memory page sizes. Superpages increase Translation Lookaside Buffer (TLB) hits, while small pages provide fine-grained memory protection. Ideally, TLBs should perform well for any distribution of page sizes. In reality, set-associative TLBs, used frequently for their energy efficiency compared to fully-associative TLBs, cannot (easily) support multiple page sizes concurrently. Instead, commercial systems typically implement separate set-associative TLBs for different page sizes. This means that when superpages are allocated aggressively, TLB misses may, counterintuitively, increase even if entries for small pages remain unused (and vice-versa). We invent MIX TLBs, energy-frugal set-associative structures that concurrently support all page sizes by exploiting superpage allocation patterns. MIX TLBs boost the performance (often by 10-30%) of big-memory applications on native CPUs, virtualized CPUs, and GPUs. MIX TLBs are simple and require no OS or program changes.
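As a rough illustration of the indexing problem the paper attacks (a sketch under an assumed TLB geometry, not the MIX TLB design itself):

```c
/* Why one set-associative TLB cannot trivially index multiple page
 * sizes: the set-index bits sit just above the page offset, so their
 * position depends on the page size -- which is unknown until the
 * lookup completes. */
#include <stdint.h>

#define TLB_SETS 64                     /* assumed geometry */

static unsigned set_index(uint64_t vaddr, unsigned page_shift) {
    return (vaddr >> page_shift) % TLB_SETS;
}

unsigned index_4kb(uint64_t va) { return set_index(va, 12); } /* 4 KB page */
unsigned index_2mb(uint64_t va) { return set_index(va, 21); } /* 2 MB page */

/* MIX TLBs sidestep this by indexing every translation with the
 * small-page index bits, letting a superpage occupy a run of
 * consecutive sets and exploiting superpage allocation contiguity. */
```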

Address calculation for retargetable compilation and exploration of instruction-set architectures

1996

The advent of parallel executing Address Calculation Units (ACUs) in Digital Signal Processor (DSP) and Application Specific Instruction-Set Processor (ASIP) architectures has made a strong impact on an application's ability to efficiently access memories. Unfortunately, successful compiler techniques which map high-level language data constructs to the addressing units of the architecture have lagged far behind. Since access to data is often the most demanding task in DSP, this mapping can be the most crucial function of the compiler. This paper introduces a new retargetable approach and prototype tool for the analysis of array references and traversals for efficient use of ACUs. The ArrSyn utility is designed to be used either as an enhancement to an existing dedicated compiler or as an aid for architecture exploration.
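A hedged C-level analogue of the rewrite such analysis enables (illustrative only; ArrSyn operates inside the compiler rather than on source text):

```c
/* Turning an indexed array traversal into a pointer walk that maps
 * directly onto an ACU's post-increment addressing mode. */
void fir_indexed(const int *x, const int *h, int *y, int n, int taps) {
    for (int i = 0; i < n; i++) {
        int acc = 0;
        for (int t = 0; t < taps; t++)
            acc += x[i + t] * h[t];   /* two index computations per tap */
        y[i] = acc;
    }
}

void fir_pointers(const int *x, const int *h, int *y, int n, int taps) {
    for (int i = 0; i < n; i++) {
        const int *xp = x + i, *hp = h;  /* candidates for address registers */
        int acc = 0;
        for (int t = 0; t < taps; t++)
            acc += *xp++ * *hp++;        /* post-increment addressing */
        y[i] = acc;
    }
}
```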

Fast Address Translation Techniques for Distributed Shared Memory Compilers

19th IEEE International Parallel and Distributed Processing Symposium, 2005

The Distributed Shared Memory (DSM) model is designed to leverage the ease of programming of the shared memory paradigm, while enabling high performance by expressing locality as in the message-passing model. Experience, however, has shown that DSM programming languages, such as UPC, may be unable to deliver the expected high level of performance. Initial investigations have shown that among the major reasons is the overhead of translating from the UPC memory model to the target architecture's virtual address space, which can be very costly. Experimental measurements have shown this overhead increasing execution time by up to three orders of magnitude. Previous work has also shown that some of this overhead can be avoided by hand-tuning, which on the other hand can significantly decrease the UPC ease of use. In addition, such tuning can only improve the performance of local shared accesses but not remote shared accesses. Therefore, a new technique that resembles Translation Lookaside Buffers (TLBs) is proposed here. This technique, called the Memory Model Translation Buffer (MMTB), has been implemented in the GCC-UPC compiler using two alternative strategies, full-table (FT) and reduced-table (RT). It will be shown that the MMTB strategies can lead to a performance boost of up to 700%, enabling ease-of-programming while delivering performance similar to hand-tuned UPC and MPI codes.
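A rough sketch of the caching idea, with hypothetical names (the real MMTB lives inside GCC-UPC and differs in detail):

```c
/* Cache the expensive shared-to-virtual translation per block, with a
 * tag check on refill, so the slow path runs once per block rather
 * than on every shared access. */
#include <stddef.h>
#include <stdint.h>

typedef struct { int thread; size_t offset; } shared_ptr_t;

/* Stand-in for the costly UPC-memory-model-to-virtual translation. */
extern uintptr_t slow_translate(int thread, size_t block_no);

#define MMTB_ENTRIES 1024                 /* assumed table size */

typedef struct {
    int       valid, thread;
    size_t    block_no;                   /* tag, checked like a TLB */
    uintptr_t base;                       /* cached block base address */
} mmtb_entry;

static mmtb_entry mmtb[MMTB_ENTRIES];

uintptr_t translate(shared_ptr_t p, size_t block_bytes) {
    size_t block_no = p.offset / block_bytes;
    mmtb_entry *e = &mmtb[block_no % MMTB_ENTRIES];
    if (!e->valid || e->thread != p.thread || e->block_no != block_no) {
        e->base     = slow_translate(p.thread, block_no);   /* miss: refill */
        e->thread   = p.thread;
        e->block_no = block_no;
        e->valid    = 1;
    }
    return e->base + p.offset % block_bytes;
}
```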

Productivity and performance using partitioned global address space languages

… on Symbolic and …, 2007

Partitioned Global Address Space (PGAS) languages combine the programming convenience of shared memory with the locality and performance control of message passing. One such language, Unified Parallel C (UPC), is an extension of ISO C defined by a consortium that boasts multiple proprietary and open source compilers. Another PGAS language, Titanium, is a dialect of Java™ designed for high performance scientific computation. In this paper we describe some of the highlights of two related projects, the Titanium project centered at U.C. Berkeley and the UPC project centered at Lawrence Berkeley National Laboratory. Both compilers use a source-to-source strategy that translates the parallel languages to C with calls to a communication layer called GASNet. The result is portable high-performance compilers that run on a large variety of shared and distributed memory multiprocessors. Both projects combine compiler, runtime, and application efforts to demonstrate some of the performance and productivity advantages of these languages.
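A conceptual sketch of the source-to-source strategy: a UPC-style remote read lowered to C plus a call into a communication layer. Here my_node() and comm_get() are hypothetical stand-ins for GASNet-level operations, not the real API:

```c
#include <stddef.h>

extern int  my_node(void);
extern void comm_get(void *dst, int node, const void *src, size_t nbytes);

/* UPC source:   shared double a[N];   x = a[i];
 * The compiler resolves a[i] to an owning node plus a local address,
 * then either loads directly or issues a one-sided get. */
double read_element(int owner, const double *remote_addr,
                    const double *local_addr) {
    double x;
    if (owner == my_node())
        x = *local_addr;                             /* local: plain load */
    else
        comm_get(&x, owner, remote_addr, sizeof x);  /* remote: one-sided get */
    return x;
}
```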

Compiler Technology for Two Novel Computer Architectures

14th ITG/GI-Fachtagung Architektur …, 1997

Before it can achieve wide acceptance, parallel computation must be made significantly easier to program. One of the main obstacles to this goal is the current usage of memory, both abstractly, by programmers, and concretely, by computer architects.

Address Register-Oriented Optimizations

Embedded systems consisting of the application program ROM, RAM, the embedded processor core, and any custom hardware on a single wafer are becoming increasingly common in application domains such as signal processing. Given the rapid deployment of these systems, programming on such systems has shifted from assembly language to high-level languages such as C, C++, and Java. The processors used in such systems are usually targeted toward specific application domains, e.g., digital signal processing (DSP). As a result, these embedded processors include application-specific instruction sets, complex and irregular data paths, etc., thereby rendering code generation for these processors difficult. In this paper, we present new code optimization techniques for embedded fixed point DSP processors which have limited on-chip program ROM and include indirect addressing modes using post-increment and decrement operations. We present a heuristic to reduce code size by taking advantage of these ...
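As a C-level analogue of the addressing modes in question (a sketch, not the paper's heuristic): post-increment and post-decrement fold the address update into the memory access itself, so an access sequence ordered to match +1/-1 steps needs no separate address-arithmetic instructions.

```c
/* Both address registers advance as a side effect of each access,
 * leaving no explicit address ADD/SUB instructions in the loop body. */
int dot_rev(const int *a, const int *b, int n) {
    const int *ap = a;           /* address register, steps forward  */
    const int *bp = b + n - 1;   /* address register, steps backward */
    int acc = 0;
    for (int i = 0; i < n; i++)
        acc += *ap++ * *bp--;    /* updates subsumed by the addressing mode */
    return acc;
}
```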

Analysis and Evaluation of Address Arithmetic Capabilities in Custom DSP Architectures

Proceedings of the 34th Design Automation Conference, 1997

Many application-specific architectures provide indirect addressing modes with auto-increment/decrement arithmetic. Since these architectures generally do not feature an indexed addressing mode, stack-allocated variables must be accessed by allocating address registers and performing address arithmetic. Subsuming address arithmetic into auto-increment/decrement arithmetic improves both the performance and size of the generated code. Our objective in this paper is to provide a method for comprehensively analyzing the performance benefits and hardware cost of an auto-increment/decrement feature that varies from -l to +l, while allowing access to k address registers in an address generator. We provide this method via a parameterizable optimization algorithm that operates on a procedure-wise basis. Hence, the optimization techniques in a compiler can be used not only to generate efficient or compact code, but also to help the designer of a custom DSP architecture make decisions on address arithmetic features. We present two sets of experimental results based on selected benchmark programs: (1) the values of l and k beyond which there is little or no improvement in performance, and (2) the values of l and k which result in minimum code area.
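A toy cost model in the spirit of this exploration (my own simplification, fixing k = 1 address register):

```c
/* Count the explicit address-arithmetic instructions left over when
 * auto-increment/decrement can cover steps in [-l, +l] between
 * consecutive stack accesses. */
#include <stdlib.h>

int leftover_addr_ops(const int *offsets, int n, int l) {
    int cost = 0;
    for (int i = 1; i < n; i++) {
        int step = offsets[i] - offsets[i - 1];
        if (abs(step) > l)      /* step not covered: explicit ADD/SUB */
            cost++;
    }
    return cost;
}
```

Sweeping l over a benchmark's access sequences reveals the knee beyond which a wider range buys little, mirroring experimental result (1) in the abstract.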