A Scalable Analytical Memory Model for CPU Performance Prediction (original) (raw)

Scalable Cross-Architecture Predictions of Memory Hierarchy Response for Scientific Applications

The gap between processor and memory speeds has been growing with each new generation of microprocessors. As a result, memory hierarchy response has become a critical factor limiting application performance. For this reason, we have been working to model the impact of memory access latency on program performance. We build upon our prior work on constructing machine-independent characterizations of application behavior [9] by improving instructionlevel modeling of the structure and scaling of data access patterns. This paper makes two contributions. First, it describes static analysis techniques that help us build accurate reference-level characterizations of memory reuse patterns in the presence of complex interactions between loop unrolling and multi-word memory blocks. Second, it describes a strategy for combining memory hierarchy response characterizations suitable for predicting behavior for fully associative caches with a probabilistic technique that enables us to predict misses for set-associative caches. We validate our approach by comparing our predictions (at loop, routine and program level) against measurements using hardware performance counters for several benchmarks on two platforms with different memory hierarchy characteristics over a large range of problem sizes.

Cross-architecture performance predictions for scientific applications using parameterized models

Sigmetrics Performance Evaluation Review, 2004

This paper describes a toolkit for semi-automatically measuring and modeling static and dynamic characteristics of applications in an architecture-neutral fashion. For predictable applications, models of dynamic characteristics have a convex and differentiable profile. Our toolkit operates on application binaries and succeeds in modeling key application characteristics that determine program performance. We use these characterizations to explore the interactions between an application and a target architecture. We apply our toolkit to SPARC binaries to develop architectureneutral models of computation and memory access patterns of the ASCI Sweep3D and the NAS SP, BT and LU benchmarks. From our models, we predict the L1, L2 and TLB cache miss counts as well as the overall execution time of these applications on an Origin 2000 system. We evaluate our predictions by comparing them against measurements collected using hardware performance counters.

A compiler tool to predict memory hierarchy performance of scientific codes

Parallel Computing, 2004

The study and understanding of memory hierarchy behavior is essential, as it is critical to current systems performance. The design of optimising environments and compilers, which allow the guidance of program transformation applications in order to improve cache performance with as little user intervention as possible, is particularly interesting. In this paper we introduce a fast analytical modelling technique that is suitable for arbitrary set-associative caches with LRU replacement policy, which overcomes weak points of other approaches found in the literature. The model was integrated in the Polaris parallelizing compiler, to allow automated analysis of loop-oriented scientific codes and to drive code optimizations. Results from detailed validations using well-known benchmarks show that the model predictions are usually very accurate and that the code optimizations proposed by the model are always, or nearly always, optimal.

L2 Cache Modeling for Scientific Applications on Chip Multi-Processors

2007 International Conference on Parallel Processing (ICPP 2007), 2007

It is critical to provide high performance for scientific applications running on Chip Multi-Processors (CMP). A CMP architecture often comprises a shared L2 cache and lower-level storages. The shared L2 cache can reduce the number of cache misses if the data are accessed in common by several threads, but it can also lead to performance degradation due to resource contention. Sometimes running threads on all cores can cause severe contention and increase the number of cache misses greatly. To investigate how the performance of a thread varies when running it concurrently with other threads on the remaining cores, we develop an analytical model to predict the number of misses on the shared L2 cache. In particular, we apply the model to thread-parallel numerical programs. We assume that all the threads compute homogeneous tasks and share a fully associative L2 cache. We use circular sequence profiling and stack processing techniques to analyze the L2 cache trace to predict the number of compulsory cache misses, capacity cache misses on shared data, and capacity cache misses on private data, respectively. Our method is able to predict the L2 cache performance for threads that have a global shared address space. For scientific applications, threads often have overlapping memory footprints. We use a cycle accurate simulator to validate the model with three scientific programs: dense matrix multiplication, blocked dense matrix multiplication, and sparse matrix-vector product. The average relative errors for the three experiments are 8.01%, 1.85%, and 2.41%, respectively.

A performance prediction framework for scientific applications

Future Generation Computer Systems, 2006

This work presents a performance modeling framework, developed by the Performance Modeling and Characterization (PMaC) Lab at the San Diego Supercomputer Center, that is faster than traditional cycle-accurate simulation, more sophisticated than performance estimation based on system peak-performance metrics, and is shown to be effective on the LINPACK benchmark and a synthetic version of an ocean modeling application (NLOM). The LINPACK benchmark is further used to investigate methods to reduce the time required to make accurate performance predictions with the framework. These methods are applied to the predictions of the synthetic NLOM application.

A Cache Model for Modern Processors

Modern processors use high-performance cache replacement policies that outperform traditional alternatives like leastrecently used (LRU). Unfortunately, current cache models use stack distances to predict LRU or its variants, and do not capture these high-performance policies. Accurate predictions of cache performance enable many optimizations in multicore systems. For example, cache partitioning uses these predictions to divide capacity among applications in order to maximize performance, guarantee quality of service, or achieve other system objectives. Without an accurate model for high-performance replacement policies, these optimizations are unavailable to modern processors. We present a new probabilistic cache model designed for high-performance replacement policies. This model uses absolute reuse distances instead of stack distances, which makes it applicable to arbitrary age-based replacement policies. We thoroughly validate our model on several high-performance policies on synthetic and real benchmarks, where its median error is less than 1%. Finally, we present two case studies showing how to use the model to improve shared and single-stream cache performance.

Modeling and predicting application performance on hardware accelerators

2011

A method is presented for modeling application performance on parallel computers in terms of the performance of microkernels from the HPC Challenge benchmarks. Specifically, the application run time is expressed as a linear combination of inverse speeds and latencies from microkernels or system characteristics. The model parameters are obtained by an automated series of least squares fits using backward elimination to ensure statistical significance. If necessary, outliers are deleted to ensure that the final fit is robust. Typically three or four terms appear in each model: at most one each for floating-point speed, memory bandwidth, interconnect bandwidth, and interconnect latency. Such models allow prediction of application performance on future computers from easier-to-make predictions of microkernel performance.

Efficient and Accurate Analytical Modeling of the Cache Behavior of Complete Scientific Codes

2003

Data caches are a key hardware means to bridge the gap between processor and memory speeds, but only for programs that exhibit sufficient data locality in their memory accesses. Thus, a method for evaluating cache performance is required to both determine quantitatively cache misses and to guide data cache optimizations. Existing analytical models for data cache optimizations target mainly isolated perfect loop nests. We present an analytical model that is capable of statically analyzing not only loop nest fragments, but also complete numerical programs with regular and compile-time predictable memory accesses. Central to the wholeprogram approach are abstract call inlining, memory access vectors, and parametric reuse analysis, which allow the reuse and interference both within and across loop nests to be quantified precisely in a unified framework. Based on the framework, the cache misses of a program are specified using mathematical formulas and the miss ratio is predicted from these formulas based on statistical sampling techniques. Our experimental results using kernels and whole programs indicate accurate cache miss estimates in a substantially shorter amount of time (typically, several orders of magnitude faster) than simulation.

WARPP: A Toolkit for Simulating High-Performance Parallel Scientific Codes

2009

There are a number of challenges facing the High Performance Computing (HPC) community, including increasing levels of concurrency (threads, cores, nodes), deeper and more complex memory hierarchies (register, cache, disk, network), mixed hardware sets (CPUs and GPUs) and increasing scale (tens or hundreds of thousands of processing elements). Assessing the performance of complex scientific applications on specialised high-performance computing architectures is difficult. In many cases, traditional computer benchmarking is insufficient as it typically requires access to physical machines of equivalent (or similar) specification and rarely relates to the potential capability of an application. A technique known as application performance modelling addresses many of these additional requirements. Modelling allows future architectures and/or applications to be explored in a mathematical or simulated setting, thus enabling hypothetical questions relating to the configuration of a potential future architecture to be assessed in terms of its impact on key scientific codes.

Modeling of L2 cache behavior for thread-parallel scientific programs on Chip Multi-Processors

2006

* This material is based upon work supported by the National Science Foundation under grant No. 0444363. have the nature of memory sharing. The model has been validated by three typical scientific programs: matrix multiplication, blocked matrix multiplication, and sparse matrix-vector product on a variety of matrix sizes. The average relative error lies between 2% and 12%.

A Scalable Analytical Memory Model for CPU Performance Prediction (original) (raw)

Related papers