Using a Local Prefetch Strategy to Obtain Temporal Time Predictability

A Time-Predictable Instruction-Cache Architecture that Uses Prefetching and Cache Locking

Trustable Worst-Case Execution-Time (WCET) bounds are a necessary component for the construction and verification of hard real-time computer systems. Deriving such bounds for contemporary hardware/software systems is a complex task. The single-path conversion overcomes this difficulty by transforming all unpredictable branch alternatives in the code to a sequential code structure with a single execution trace. However, the simpler code structure and analysis of single-path code comes at the cost of a longer execution time. In this paper we address the problem of the execution performance of single-path code. We present a new cache organization that utilizes the principle of locality of single-path code to reduce cache miss latency and cache miss rate. The proposed cache memory architecture combines cache prefetching and cache locking, so that the prefetcher capitalizes on spatial locality while the locker makes use of temporal locality. The demonstration section shows how these two techniques can complement each other.
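
The interplay of prefetching and locking described in this abstract can be illustrated with a toy model. The sketch below is not the architecture proposed in the paper: the class name `PrefetchLockCache`, the LRU replacement policy, the zero-latency next-line prefetch, and the example trace are all assumptions chosen for illustration.

```python
from collections import OrderedDict


class PrefetchLockCache:
    """Toy instruction cache: LRU lines, a locked region, next-line prefetch."""

    def __init__(self, capacity, locked_lines=()):
        self.capacity = capacity         # number of unlocked (replaceable) lines
        self.locked = set(locked_lines)  # pinned lines, never evicted
        self.lines = OrderedDict()       # unlocked lines, oldest first
        self.hits = 0
        self.misses = 0

    def _fill(self, line):
        """Bring a line into the unlocked part, evicting the LRU line if full."""
        if line in self.locked:
            return
        if line in self.lines:
            self.lines.move_to_end(line)
            return
        if len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)
        self.lines[line] = True

    def fetch(self, line):
        """Fetch one cache line; returns True on a hit."""
        if line in self.locked:
            # Assumption: locked lines always hit and trigger no prefetch.
            self.hits += 1
            return True
        hit = line in self.lines
        if hit:
            self.hits += 1
        else:
            self.misses += 1
        self._fill(line)
        self._fill(line + 1)  # next-line prefetch, modeled as instantaneous
        return hit


def count_misses(cache, trace):
    """Run a trace of line numbers through the cache and return the miss count."""
    for line in trace:
        cache.fetch(line)
    return cache.misses


if __name__ == "__main__":
    # Sequential code (lines 0..7) that twice calls a routine in line 100.
    trace = [0, 1, 2, 3, 100, 4, 5, 100, 6, 7]
    print(count_misses(PrefetchLockCache(2), trace))                      # 5
    print(count_misses(PrefetchLockCache(1, locked_lines={100}), trace))  # 1
```

Both configurations hold two lines in total. In this toy trace, the prefetcher covers the sequential run, while locking the repeatedly called line keeps it from evicting, and being evicted by, the prefetched lines.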

Towards a time-predictable hierarchical memory architecture: prefetching options to be explored

Object/Component/Service-Oriented …, 2010

In this paper we explore a hierarchical memory architecture that simplifies the WCET prediction of tasks. Instead of using cache memories for speeding up code execution, we propose to use hierarchical memories that are similar to scratchpad memories. These memories are filled by explicit prefetch operations that are executed in synchrony with program execution. The instructions or data that control the timing and the content of these memory-fill operations are computed at code-generation time. The paper describes the overall system and memory architecture, and design choices for explicitly controlled time-predictable hierarchical memory architectures.

CPU cache prefetching: Timing evaluation of hardware implementations

IEEE Transactions on Computers, 1998

Prefetching into CPU caches has long been known to be effective in reducing the cache miss ratio, but known implementations of prefetching have been unsuccessful in improving CPU performance. The reasons for this are that prefetches interfere with normal cache operations by making cache address and data ports busy, the memory bus busy, and the memory banks busy, and that prefetches are not necessarily complete by the time that the prefetched data is actually referenced. In this paper, we present extensive quantitative results of a detailed cycle-by-cycle trace-driven simulation of a uniprocessor memory system in which we vary most of the relevant parameters in order to determine when and if hardware prefetching is useful. We find that, in order for prefetching to actually improve performance, the address array needs to be double ported and the data array needs to either be double ported or fully buffered. It is also very helpful for the bus to be very wide (e.g., 16 bytes), for bus transactions to be split, and for main memory to be interleaved. Under the best circumstances, i.e., with a significant investment in extra hardware, prefetching can significantly improve performance. For implementations without adequate hardware, prefetching often decreases performance.

Pointer cache assisted prefetching

Proceedings of the 35th …, 2002

Data prefetching effectively reduces the negative effects of long load latencies on the performance of modern processors. Hardware prefetchers employ hardware structures to predict future memory addresses based on previous patterns. Thread-based prefetchers use portions of the actual program code to determine future load addresses for prefetching. This paper proposes the use of a pointer cache, which tracks pointer transitions, to aid prefetching. The pointer cache provides, for a given pointer's effective address, the base address of the object pointed to by the pointer. We examine using the pointer cache in a wide-issue superscalar processor as a value predictor and to aid prefetching when a chain of pointers is being traversed. When a load misses in the L1 cache but hits in the pointer cache, the first two cache blocks of the pointed-to object are prefetched. In addition, the load's dependencies are broken by using the pointer cache hit as a value prediction.
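
The lookup-and-prefetch step described in this abstract can be sketched in a few lines. This is a minimal illustration, not the paper's hardware design: the names `PointerCache` and `on_l1_miss`, the unbounded dictionary used as the table, and the 64-byte block size are assumptions.

```python
BLOCK_SIZE = 64  # bytes per cache block (assumed value)


class PointerCache:
    """Maps a pointer's effective address to the base address of its target."""

    def __init__(self):
        # Assumption: an unbounded dict stands in for a finite hardware table.
        self.table = {}

    def record(self, pointer_addr, object_base):
        """Remember that the pointer stored at pointer_addr targets object_base."""
        self.table[pointer_addr] = object_base

    def lookup(self, pointer_addr):
        """Return the recorded object base, or None on a pointer-cache miss."""
        return self.table.get(pointer_addr)


def on_l1_miss(pointer_cache, load_addr):
    """Handle a load that missed in L1.

    Returns (predicted_value, blocks_to_prefetch). On a pointer-cache hit the
    recorded base address serves as the value prediction, and the first two
    cache blocks of the pointed-to object are selected for prefetching.
    """
    base = pointer_cache.lookup(load_addr)
    if base is None:
        return None, []
    first_block = base - (base % BLOCK_SIZE)  # align down to a block boundary
    return base, [first_block, first_block + BLOCK_SIZE]


if __name__ == "__main__":
    pc = PointerCache()
    pc.record(0x1000, 0x2010)
    value, blocks = on_l1_miss(pc, 0x1000)
    print(hex(value), [hex(b) for b in blocks])  # 0x2010 ['0x2000', '0x2040']
```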

Data Prefetching Mechanism by Exploiting Global and Local Access Patterns

This paper presents a new hardware prefetcher based on the previously proposed idea of the Global History Buffer. We extend this idea to Local History Buffers, which keep the memory access information for selected program counters. These buffers can then be queried on cache accesses to predict future memory accesses and enable data prefetching. Our trace-driven simulations show that by using approximately a 4KByte (32 Kbits) storage budget, an average performance improvement of 20% (geomean) can be obtained for the SPEC benchmark suite on an ideal out-of-order processor.
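
To make the idea of a per-program-counter history concrete, the sketch below keeps a short address history for each PC and issues prefetches when the last two address deltas agree. It is a simplified stride detector standing in for the paper's prediction logic, which the abstract does not spell out; the class name, the history depth, and the prefetch degree are assumptions.

```python
from collections import defaultdict, deque


class LocalHistoryPrefetcher:
    """Per-PC address history with a simple constant-stride predictor."""

    def __init__(self, depth=4, degree=2):
        self.degree = degree  # number of prefetch addresses issued per trigger
        # One bounded history buffer per program counter.
        self.history = defaultdict(lambda: deque(maxlen=depth))

    def access(self, pc, addr):
        """Record an access by instruction pc and return addresses to prefetch."""
        hist = self.history[pc]
        hist.append(addr)
        if len(hist) < 3:
            return []  # not enough history to compare two deltas
        d1 = hist[-1] - hist[-2]
        d2 = hist[-2] - hist[-3]
        if d1 != 0 and d1 == d2:
            # Stable stride: predict the next `degree` addresses.
            return [addr + d1 * k for k in range(1, self.degree + 1)]
        return []


if __name__ == "__main__":
    p = LocalHistoryPrefetcher()
    for a in (100, 108, 116):
        prediction = p.access(0x400, a)
    print(prediction)  # [124, 132]
```

Keeping histories separate per PC means that an irregular access stream from one load instruction does not disturb the stride detected for another.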

Execution History Guided Instruction Prefetching.

The increasing gap in performance between processors and main memory has made effective instruction prefetching techniques more important than ever. A major deficiency of existing prefetching methods is that most of them require an extra port to the I-cache. A recent study shows that this factor alone explains why most modern microprocessors do not use such hardware-based I-cache prefetch schemes. The contribution of this paper is two-fold. First we present a method that does not require an extra port to the I-cache.

A framework for modeling and optimization of prescient instruction prefetch

ACM SIGMETRICS Performance Evaluation Review, 2003

This paper describes a framework for modeling macroscopic program behavior and applies it to optimizing prescient instruction prefetch, a novel technique that uses helper threads to improve single-threaded application performance by performing judicious and timely instruction prefetch. A helper thread is initiated when the main thread encounters a spawn point, and prefetches instructions starting at a distant target point. The target identifies a code region tending to incur I-cache misses that the main thread is likely to execute soon, even though intervening control flow may be unpredictable. The optimization of spawn-target pair selections is formulated by modeling program behavior as a Markov chain based on profile statistics. Execution paths are considered stochastic outcomes, and aspects of program behavior are summarized via path expression mappings. Mappings for computing reaching and posteriori probability, path length mean and variance, and expected path footprint are presented. These are used with Tarjan's fast path algorithm to efficiently estimate the benefit of spawn-target pair selections. Using this framework we propose a spawn-target pair selection algorithm for prescient instruction prefetch. This algorithm has been implemented and evaluated for the Itanium® Processor Family architecture. A limit study finds 4.8% to 17% speedups on an in-order simultaneous multithreading processor with eight contexts, over next-line and streaming I-prefetch for a set of benchmarks with high I-cache miss rates.
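
The notion of a reaching probability on a Markov chain of program behavior can be illustrated as follows. The sketch computes, for every state, the probability of eventually reaching a target state by plain fixed-point iteration. This is a swapped-in numerical method, not the path-expression approach with Tarjan's algorithm that the abstract describes; the function name, the state names, and the transition probabilities are hypothetical.

```python
def reaching_probability(transitions, target, iterations=1000):
    """Probability of eventually reaching `target` from each state.

    `transitions` maps a state to {successor: probability}. States without an
    entry are treated as absorbing exits with reaching probability 0.
    """
    states = set(transitions)
    for successors in transitions.values():
        states.update(successors)

    reach = {s: 0.0 for s in states}
    reach[target] = 1.0
    # Repeatedly apply r(s) = sum_t P(s, t) * r(t) until the values settle.
    for _ in range(iterations):
        for s in states:
            if s != target:
                reach[s] = sum(
                    p * reach[t] for t, p in transitions.get(s, {}).items()
                )
    return reach


if __name__ == "__main__":
    # Hypothetical profile: A is a candidate spawn point, T the target region.
    profile = {
        "A": {"B": 0.5, "C": 0.5},
        "B": {"T": 1.0},
        "C": {"A": 0.5, "exit": 0.5},
    }
    reach = reaching_probability(profile, "T")
    print(round(reach["A"], 4))  # 0.6667
```

In this example the loop through C gives A a second chance to reach T, so the reaching probability from A is 2/3 rather than the 1/2 of the direct path alone.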

A Table-Based Application-Specific Prefetch Engine for Object-Oriented Embedded Systems

2006 International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation, 2006

A table-based application-specific data prefetching mechanism is presented in this paper. This mechanism is proposed to improve the performance of the application-specific instruction-set processors (ASIPs) that we customize for an object-oriented application. In this approach, we divide the data accesses of a class method into two parts, conditional and unconditional. We supply the prefetch engine with the static information about each part to prefetch all data fields of an object required by a class method when the class method is invoked. The merits of the proposed mechanism are effective management of memory access patterns, achieved by dividing them according to the method to which they belong, and storage of the access information of nested loops in a simple structure. In addition, by adding a prefetch flag to cache blocks, we eliminate a large number of prefetch-related tag comparisons. The results show that the proposed mechanism reduces the cache miss ratio and prefetch-related tag comparisons on average by 66% and 21%, respectively.

The efficacy of software prefetching and locality optimizations on future memory systems

2004

Software prefetching and locality optimizations are techniques for overcoming the speed gap between processor and memory. In this paper, we provide a comprehensive summary of current software prefetching and locality optimization techniques, and evaluate the impact of memory trends on the effectiveness of these techniques for three types of applications: regular scientific codes, irregular scientific codes, and pointer-chasing codes. We