Link-Time Optimization of Address Calculation on a 64-bit Architecture
1994, Sigplan Notices
https://doi.org/10.1145/178243.178248
Abstract
The Western Research Laboratory (WRL) is a computer systems research group that was founded by Digital Equipment Corporation in 1982. Our focus is computer science research relevant to the design and application of high performance scientific computers. We test our ideas by designing, building, and using real systems. The systems we build are research prototypes; they are not intended to become products.
Related papers
Performance characterization of global address space applications: a case study with NWChem
Concurrency and Computation: Practice and Experience, 2012
The use of global address space languages and one-sided communication for complex applications is gaining attention in the parallel computing community. However, the lack of good evaluative methods to observe multiple levels of performance makes it difficult to isolate the cause of performance deficiencies and to understand the fundamental limitations of system and application design for future improvement. NWChem is a popular computational chemistry package, which depends on the Global Arrays/Aggregate Remote Memory Copy Interface suite for partitioned global address space functionality to deliver high-end molecular modeling capabilities. A workload characterization methodology was developed to support NWChem performance engineering on large-scale parallel platforms. The research involved both the integration of performance instrumentation and measurement in the NWChem software and the analysis of one-sided communication performance in the context of NWChem workloads. Scaling studies were conducted for NWChem on Blue Gene/P and on two large-scale clusters using different-generation InfiniBand interconnects and x86 processors. The performance analysis and results show how subtle changes in runtime parameters related to the communication subsystem can have a significant impact on performance behavior. The tool has successfully identified several algorithmic bottlenecks, which are already being tackled by computational chemists to improve NWChem performance. Concern for both computational aspects (e.g., floating point rate and work throughput) and communication characteristics (e.g., bandwidth, latency, and collective operations) is highly relevant to high-end performance. However, the coevolution of HPC systems and application design forces workload benchmarks to reflect modern programming methods.
A case in point is the recent interest in addressing the productivity challenge in programming current and future supercomputers through the use of global address space languages and one-sided communication. Languages such as UPC [5] and Co-Array Fortran, along with the newer HPCS languages X10 [7], Chapel, and Fortress [9], are examples based on the concept of extending global-view programming techniques to operate efficiently on large-scale distributed-memory machines.
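As a rough illustration of the one-sided model these languages build on, the sketch below mimics a block-partitioned global array whose get/put operations do not involve the owning process. All names here (PGASArray, put, get) are invented for illustration; this is not the Global Arrays or UPC API.

```python
# Toy model of a partitioned global address space (PGAS) array.
# Hypothetical names; a sketch of one-sided semantics, not a real library.

class PGASArray:
    """A 1-D global array block-partitioned across simulated ranks."""

    def __init__(self, size, nranks):
        self.block = size // nranks
        # Each "rank" owns a contiguous block of the global index space.
        self.partitions = [[0] * self.block for _ in range(nranks)]

    def _locate(self, i):
        # Map a global index to (owning rank, offset within its block).
        return i // self.block, i % self.block

    def put(self, i, value):
        # One-sided write: the target rank does not participate.
        rank, off = self._locate(i)
        self.partitions[rank][off] = value

    def get(self, i):
        # One-sided read, again without involving the owner.
        rank, off = self._locate(i)
        return self.partitions[rank][off]

ga = PGASArray(size=16, nranks=4)
ga.put(9, 42)        # lands in rank 2's block (global indices 8..11)
print(ga.get(9))     # -> 42
```

The point of the model is that communication is decoupled from the remote process's control flow, which is exactly what makes its performance hard to observe with two-sided (send/receive) tooling.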
On Heuristic Solutions to the Simple Offset Assignment Problem in Address-Code Optimization
ACM Transactions on Embedded Computer Systems, 2012
The increasing demand for more functionality in embedded systems applications nowadays requires efficient generation of compact code for embedded DSP processors. Because such processors have highly irregular datapaths, compilers targeting them are challenged with the automatic generation of optimized code of quality comparable to hand-crafted code. A major issue in code generation is to optimize the placement of program variables in memory relative to each other so as to reduce the overhead instructions dedicated to address computations. Modern DSP processors are typically shipped with a feature called an Address Generation Unit (AGU) that provides efficient address-generation instructions for accessing program variables. Compilers targeting those processors are expected to exploit the AGU to optimize variable assignment. This paper focuses on one of the basic offset-assignment problems, the Simple Offset Assignment (SOA) problem, where the AGU has only one Address Register and no Modify Registers. The notion of a Tie-Break Function (TBF), introduced by Leupers and Marwedel [1], has been used to guide the placement of variables in memory. In this paper, we introduce a more effective form of the TBF, the Effective Tie-Breaking Function (ETBF), and show that the ETBF is better at guiding the variable-placement process. Underpinning the ETBF is the fact that program variables are placed in memory in sequence, with each variable having only two neighbors. We applied our technique to randomly generated graphs as well as to real-world code from the OffsetStone testbench [13]. In previous work, our technique showed up to 7% reduction in overhead when applied to randomly generated problem instances. We report in this paper on a further experiment with our technique on real code from the OffsetStone testbench.
Despite the substantial improvement our technique achieved when applied to random problem instances, we found that it yields only a slight overhead reduction when applied to real-world instances in OffsetStone, which agrees with similar existing experiments. We analyze these results and show that the ETBF defaults to the TBF.
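The flavor of SOA heuristics like the TBF and ETBF can be conveyed with a simpler greedy baseline (Liao-style edge selection on the access graph): frequently consecutive variables are made memory neighbors, so the step between them is subsumed by auto-increment/decrement. This sketch is an illustrative stand-in, not the ETBF itself.

```python
from collections import Counter, defaultdict

def soa_greedy(access_seq):
    """Greedy simple offset assignment: pick a max-weight path cover of the
    access graph so that frequently consecutive variables become neighbors."""
    # Edge weight = how often two distinct variables are accessed back to back.
    w = Counter()
    for a, b in zip(access_seq, access_seq[1:]):
        if a != b:
            w[frozenset((a, b))] += 1

    degree = defaultdict(int)
    parent = {}  # union-find forest, to reject edges that would close a cycle

    def find(x):
        while parent.get(x, x) != x:
            x = parent[x]
        return x

    chosen = []
    for e, _ in sorted(w.items(), key=lambda kv: -kv[1]):
        a, b = tuple(e)
        # A path cover: every vertex keeps degree <= 2 and no cycle forms.
        if degree[a] < 2 and degree[b] < 2 and find(a) != find(b):
            parent[find(a)] = find(b)
            degree[a] += 1
            degree[b] += 1
            chosen.append(e)

    covered = sum(w[e] for e in chosen)
    total = sum(w.values())
    # Each uncovered transition needs an explicit address-register load.
    return total - covered

print(soa_greedy("abcabcad"))  # -> 2 transitions left uncovered
```

Tie-break functions such as the TBF/ETBF refine exactly the `sorted(...)` step above: when several edges have equal weight, they decide which one to commit to first.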
Streamlining data cache access with fast address calculation
1995
Abstract For many programs, especially integer codes, untolerated load instruction latencies account for a significant portion of total execution time. In this paper, we present the design and evaluation of a fast address generation mechanism capable of eliminating the delays caused by effective address calculation for many loads and stores. Our approach works by predicting early in the pipeline (part of) the effective address of a memory access and using this predicted address to speculatively access the data cache.
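One way such a low-latency prediction can work is to reuse the base register's upper bits unchanged and add only the low-order bits, gambling that no carry crosses the boundary; the full add proceeds in parallel and is used to verify. The sketch below illustrates that idea; the bit-width parameter `low` is an assumption, not the paper's design.

```python
def fast_addr(base, offset, low=16):
    """Speculatively form base+offset using only a low-order add.

    The upper bits of `base` are reused as-is, which is correct exactly
    when the real addition produces no carry into those bits."""
    mask = (1 << low) - 1
    predicted = (base & ~mask) | ((base + offset) & mask)
    actual = base + offset
    # Return the speculative address and whether speculation was correct.
    return predicted, predicted == actual

hit = fast_addr(0x10010000, 8)    # no carry out of the low 16 bits
miss = fast_addr(0x1000FFFF, 8)   # the carry crosses the boundary
print(hit[1], miss[1])            # True False
```

A mispredicted address simply means the cache access is replayed with the fully computed address, so the scheme trades occasional replays for a shorter common-case load latency.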
Compiler-directed physical address generation for reducing dTLB power
2004
Address translation using the Translation Lookaside Buffer (TLB) consumes as much as 16% of the chip power on some processors because of its high associativity and access frequency. While prior work has looked into optimizing this structure at the circuit and architectural levels, this paper takes a different approach of optimizing its power by reducing the number of data TLB (dTLB) lookups for data references. The main idea is to keep translations in a set of translation registers, and intelligently use them in software to directly generate the physical addresses without going through the dTLB. The software has to work within the confines of the translation registers provided by the hardware, and has to maximize the reuse of such translations to be effective. We propose strategies and code transformations for achieving this in array-based and pointer-based codes, looking to optimize data accesses. Results with a suite of Spec95 array-based and pointer-based codes show dTLB energy savings of up to 73% and 88%, respectively, compared to directly using the dTLB for all references. Despite the small increase in instructions executed with our mechanisms, the approach can in fact provide performance benefits in certain cases.
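The register-reuse idea can be mimicked in a few lines: consult a small software-managed set of cached translations before falling back to the dTLB. Everything here (class name, FIFO eviction, 4 KiB pages) is an assumption for illustration, not the paper's hardware/software interface.

```python
PAGE_SHIFT = 12  # assume 4 KiB pages for this sketch

class TranslationRegisters:
    """A small software-managed cache of virtual-to-physical page
    translations, consulted before the dTLB (illustrative only)."""

    def __init__(self, nregs=4):
        self.nregs = nregs
        self.regs = {}          # virtual page number -> physical page number
        self.dtlb_lookups = 0   # lookups we failed to avoid

    def translate(self, vaddr, page_table):
        vpn = vaddr >> PAGE_SHIFT
        if vpn not in self.regs:
            # Miss in the translation registers: fall back to the dTLB
            # (modeled here by the page table) and cache the result.
            self.dtlb_lookups += 1
            if len(self.regs) >= self.nregs:
                self.regs.pop(next(iter(self.regs)))  # evict oldest (FIFO)
            self.regs[vpn] = page_table[vpn]
        ppn = self.regs[vpn]
        return (ppn << PAGE_SHIFT) | (vaddr & ((1 << PAGE_SHIFT) - 1))

page_table = {0: 7, 1: 9}        # made-up virtual-to-physical mapping
tr = TranslationRegisters(nregs=2)
tr.translate(0x010, page_table)  # miss: one dTLB lookup
tr.translate(0x020, page_table)  # same page: reuses the cached translation
print(tr.dtlb_lookups)           # -> 1
```

The compiler's job, in this model, is to transform loops so that consecutive references stay within the pages already held in the registers, maximizing the reuse that makes the dTLB bypass pay off.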
Analysis and Evaluation of Address Arithmetic Capabilities in Custom DSP Architectures
Proceedings of the 34th Design Automation Conference
Many application-specific architectures provide indirect addressing modes with auto-increment/decrement arithmetic. Since these architectures generally do not feature an indexed addressing mode, stack-allocated variables must be accessed by allocating address registers and performing address arithmetic. Subsuming address arithmetic into auto-increment/decrement arithmetic improves both the performance and size of the generated code. Our objective in this paper is to provide a method for comprehensively analyzing the performance benefits and hardware cost of an auto-increment/decrement feature that varies from -l to +l, and of allowing access to k address registers in an address generator. We provide this method via a parameterizable optimization algorithm that operates on a procedure-wise basis. Hence, the optimization techniques in a compiler can be used not only to generate efficient or compact code, but also to help the designer of a custom DSP architecture make decisions on address arithmetic features. We present two sets of experimental results based on selected benchmark programs: (1) the values of l and k beyond which there is little or no improvement in performance, and (2) the values of l and k which result in minimum code area.
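The cost model being explored can be sketched for the k = 1 case: an offset step within [-l, +l] is subsumed by auto-increment/decrement, while a larger step costs an explicit address-arithmetic instruction. A toy version, with an invented access trace:

```python
def explicit_updates(offsets, l):
    """For one address register walking a sequence of stack offsets, count
    the explicit address-arithmetic instructions needed when any step in
    [-l, +l] is free (subsumed by auto-increment/decrement)."""
    cost = 0
    for prev, cur in zip(offsets, offsets[1:]):
        if abs(cur - prev) > l:
            cost += 1  # needs a separate add/sub on the address register
    return cost

trace = [0, 1, 3, 2, 7, 6]      # hypothetical stack-offset access sequence
for l in range(6):
    print(l, explicit_updates(trace, l))
```

Sweeping l this way over real benchmark traces is, in spirit, how one finds the point beyond which a wider auto-increment range buys little or no further improvement.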
PLTO: A Link-Time Optimizer for the Intel IA32 Architecture
2001
This paper describes PLTO, a link-time instrumentation and optimization tool we have developed for the Intel IA-32 architecture. A number of characteristics of this architecture complicate the task of link-time optimization. These include a large number of op-codes and addressing modes, which increases the complexity of program analysis; variable-length instructions, which complicates disassembly of machine code; a paucity of available registers, which limits the extent of some optimizations; and a reliance on using memory locations for holding values and for parameter passing, which complicates program analysis and optimization. We describe how PLTO addresses these problems and the resulting performance improvements it is able to achieve.
Address Register-Oriented Optimizations
Embedded systems consisting of the application program ROM, RAM, the embedded processor core, and any custom hardware on a single wafer are becoming increasingly common in application domains such as signal processing. Given the rapid deployment of these systems, programming on such systems has shifted from assembly language to high-level languages such as C, C++, and Java. The processors used in such systems are usually targeted toward specific application domains, e.g., digital signal processing (DSP). As a result, these embedded processors include application-specific instruction sets, complex and irregular data paths, etc., thereby rendering code generation for these processors difficult. In this paper, we present new code optimization techniques for embedded fixed point DSP processors which have limited on-chip program ROM and include indirect addressing modes using post-increment and decrement operations. We present a heuristic to reduce code size by taking advantage of these ...
Address calculation for retargetable compilation and exploration of instruction-set architectures
1996
The advent of parallel executing Address Calculation Units (ACUs) in Digital Signal Processor (DSP) and Application Specific Instruction-Set Processor (ASIP) architectures has made a strong impact on an application's ability to efficiently access memories. Unfortunately, successful compiler techniques which map high-level language data constructs to the addressing units of the architecture have lagged far behind. Since access to data is often the most demanding task in DSP, this mapping can be the most crucial function of the compiler. This paper introduces a new retargetable approach and prototype tool for the analysis of array references and traversals for efficient use of ACUs. The ArrSyn utility is designed to be used either as an enhancement to an existing dedicated compiler or as an aid for architecture exploration.
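The core mapping an ACU-aware compiler performs, replacing per-access index arithmetic with a post-incremented pointer, can be sketched as follows (illustrative only; function names are invented):

```python
def addresses_indexed(base, stride, n):
    # Naive lowering: a multiply and an add per access (base + i*stride).
    return [base + i * stride for i in range(n)]

def addresses_acu(base, stride, n):
    # ACU-style lowering: one pointer, post-incremented by the stride on
    # each access, as an auto-increment address unit does in parallel.
    out, p = [], base
    for _ in range(n):
        out.append(p)
        p += stride  # subsumed into the memory access on a real ACU
    return out

# Both lowerings visit the same address sequence.
assert addresses_indexed(0x1000, 4, 5) == addresses_acu(0x1000, 4, 5)
```

Tools in the spirit of ArrSyn analyze array subscripts to prove such strength reductions safe and to match the resulting pointer updates to the increments the target ACU actually offers.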
References (8)
- Digital Equipment Corporation. DEC OSF/1 Programmer's Guide, section 3.2.3: "Name Resolution." Digital Equipment Corporation, 1993.
- Robert B. Garner, et al. The Scalable Processor Architecture (SPARC). Digest of Papers: Compcon 88, pp. 278-283, March 1988.
- Gerry Kane. MIPS R2000 RISC Architecture. Prentice Hall, 1987.
- Richard L. Sites, ed. Alpha Architecture Reference Manual. Digital Press, 1992.
- Amitabh Srivastava and Alan Eustace. ATOM: A System for Building Customized Program Analysis Tools. Proceedings of the SIGPLAN '94 Conference on Programming Language Design and Implementation, to appear. Also available as WRL Research Report 94/2, March 1994.
- Amitabh Srivastava and David W. Wall. A practical system for intermodule code optimization at link-time. Journal of Programming Languages 1(1), pp. 1-18, March 1993. Also available as WRL Research Report 92/6, December 1992.
Related papers
Long Address Traces from RISC Machines: Generation and Analysis
1999
ADOPT: efficient hardware address generation in distributed memory architectures
1996
An address generation and optimization environment (ADOPT) for distributed memory architectures is presented. ADOPT is oriented to minimizing the area overhead introduced by the use of large numbers of customized address calculation units, needed to cope with the increasing bandwidth requirements of memory-intensive real-time signal processing applications. Different high-level optimizing architectural alternatives are explored, such as algebraic optimizations and efficient data-path clustering and assignment, to minimize the space/time-multiplexed address unit cost. Furthermore, in order to significantly reduce the routing complexity typically present in partitioned architectures, a methodology for the synthesis of a distributed architecture of hierarchical local controllers for address generation is also proposed. The techniques presented are demonstrated on a realistic test vehicle, showing significant savings in the overall addressing cost.
Efficient Address Translation for Architectures with Multiple Page Sizes
ASPLOS, 2017
Processors and operating systems (OSes) support multiple memory page sizes. Superpages increase Translation Lookaside Buffer (TLB) hits, while small pages provide fine-grained memory protection. Ideally, TLBs should perform well for any distribution of page sizes. In reality, set-associative TLBs (used frequently for their energy efficiency compared to fully-associative TLBs) cannot easily support multiple page sizes concurrently. Instead, commercial systems typically implement separate set-associative TLBs for different page sizes. This means that when superpages are allocated aggressively, TLB misses may, counterintuitively, increase even if entries for small pages remain unused (and vice versa). We invent MIX TLBs, energy-frugal set-associative structures that concurrently support all page sizes by exploiting superpage allocation patterns. MIX TLBs boost the performance (often by 10-30%) of big-memory applications on native CPUs, virtualized CPUs, and GPUs. MIX TLBs are simple and require no OS or program changes.
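The underlying difficulty is that a set-associative TLB takes its set index from the virtual-address bits just above the page offset, so the index cannot be computed without already knowing the page size. A minimal illustration (64 sets and these page sizes are assumptions for the sketch):

```python
def tlb_set_index(vaddr, page_shift, sets=64):
    """Set index of a set-associative TLB: the bits just above the page
    offset, which therefore depend on the page size."""
    return (vaddr >> page_shift) & (sets - 1)

v = 0x12345678
print(tlb_set_index(v, 12))  # index if v lies in a 4 KiB page  -> 5
print(tlb_set_index(v, 21))  # index if v lies in a 2 MiB page  -> 17
```

Since the same address maps to different sets under different page-size assumptions, a conventional design must either probe multiple TLBs or guess; MIX TLBs instead arrange for both cases to land in predictable sets.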
Special Issue on Computer Architecture and High-Performance Computing
2012
SBAC-PAD has established a reputation for high quality and has become a main event for the computer architecture and high-performance computing communities. SBAC-PAD 2009 was no exception, with an excellent technical program. The topics covered a wide variety of areas, including parallel applications and algorithms, scheduling, graphical processing units, and multi-core architectures. SBAC-PAD 2009 attracted 60 complete submissions, each of which was submitted to one of four tracks: Computer Architecture, Applications and Algorithms, Network and Distributed Systems, and System Software. The submissions came from all continents, except for Africa. The papers received at least three and typically four or five reviews, for a total of 270 reviews. Finally, 21 papers were selected for presentation in the technical program and inclusion in the conference proceedings. We are grateful to the track Vice-Chairs David Brooks, David Bader, Y. Charlie Hu, and Dilma da Silva for their timely execution of a tight schedule and for their advice in numerous matters. We are indebted to the Program Committee (PC) members and to the external reviewers for their candid reviews. The technical program would not have been possible without their efforts.
On fast address-lookup algorithms
IEEE Journal on Selected Areas in Communications, 1999
The growth of the Internet and its acceptance has sparked keen interest in the research community with respect to many apparent scaling problems for a large infrastructure based on IP technology. A self-contained problem of considerable practical and theoretical interest is the longest-prefix lookup operation, perceived as one of the decisive bottlenecks. Several novel approaches have been proposed to speed up this operation and promise to scale forwarding technology to gigabit speeds. This paper surveys these new lookup algorithms and classifies them based on the techniques applied, accompanied by a set of practical requirements that are critical to the design of high-speed routing devices. We also propose several new algorithms to provide lookup capability at gigabit speeds. In particular, we show the theoretical limitations on routing table size and show that one of our new algorithms is almost optimal, while requiring only a small number of memory accesses to perform each address lookup.
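A binary trie is the textbook baseline such lookup algorithms improve on: walk the address bit by bit, remembering the deepest stored prefix seen so far. A minimal sketch (illustrative; not any specific algorithm from the paper):

```python
class PrefixTrie:
    """Binary trie for longest-prefix-match lookup over bit strings."""

    def __init__(self):
        self.root = {}

    def insert(self, prefix_bits, next_hop):
        node = self.root
        for b in prefix_bits:
            node = node.setdefault(b, {})
        node["hop"] = next_hop

    def lookup(self, addr_bits):
        node, best = self.root, None
        for b in addr_bits:
            if "hop" in node:
                best = node["hop"]   # remember the longest match so far
            node = node.get(b)
            if node is None:
                break
        else:
            if "hop" in node:
                best = node["hop"]
        return best

t = PrefixTrie()
t.insert("10", "A")        # a short prefix routed to next hop A
t.insert("1011", "B")      # a longer, more specific prefix routed to B
print(t.lookup("101100"))  # -> B (the longest matching prefix wins)
print(t.lookup("100000"))  # -> A
```

The worst case here is one memory access per address bit, which is precisely the cost the surveyed compressed and multibit schemes reduce to a handful of accesses.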
Cited by
Value locality and load value prediction
ACM SIGPLAN Notices, 1996
Since the introduction of virtual memory demand-paging and cache memories, computer systems have been exploiting spatial and temporal locality to reduce the average latency of a memory reference. In this paper, we introduce the notion of value locality, a third facet of locality that is frequently present in real-world programs, and describe how to effectively capture and exploit it in order to perform load value prediction. Temporal and spatial locality are attributes of storage locations, and describe the future likelihood of references to those locations or their close neighbors. In a similar vein, value locality describes the likelihood of the recurrence of a previously-seen value within a storage location. Modern processors already exploit value locality in a very restricted sense through the use of control speculation (i.e. branch prediction), which seeks to predict the future value of a single condition bit based on previously-seen values. Our work extends this to predict ent...
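The simplest exploitation of value locality is a last-value predictor indexed by the load's PC: predict that a static load returns whatever it returned last time. The sketch below is illustrative and not the paper's predictor design.

```python
class LastValuePredictor:
    """Per-load last-value predictor: predict that a load (identified by
    its PC) returns the same value it returned on its previous execution."""

    def __init__(self):
        self.table = {}           # load PC -> last value seen
        self.hits = self.total = 0

    def access(self, pc, actual_value):
        predicted = self.table.get(pc)
        self.total += 1
        if predicted == actual_value:
            self.hits += 1        # speculation would have succeeded
        self.table[pc] = actual_value
        return predicted

p = LastValuePredictor()
for v in [7, 7, 7, 3, 3]:         # values returned by one static load
    p.access(pc=0x400123, actual_value=v)
print(p.hits, p.total)            # 3 correct predictions out of 5
```

Even this trivial scheme captures the runs of repeated values that real programs exhibit, which is the observation that makes speculative load-value prediction worthwhile.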