Per Stenström - Academia.edu (original) (raw)

Papers by Per Stenström

Springer eBooks, Jan 27, 2008

A critical component in the design of secure processors is memory encryption which provides prote... more A critical component in the design of secure processors is memory encryption which provides protection for the privacy of code and data stored in off-chip memory. The overhead of the decryption operation that must precede a load requiring an off-chip memory access, decryption being on the critical path, can significantly degrade performance. Recently hardware counter-based one-time pad encryption techniques [11, 13, 9] have been proposed to reduce this overhead. For highend processors the performance impact of decryption has been successfully limited due to: presence of fairly large on-chip L1 and L2 caches that reduce off-chip accesses; and additional hardware support proposed in [13, 9] to reduce decryption latency. However, for low-to medium-end embedded processors the performance degradation is high because first they only support small (if any) on-chip L1 caches thus leading to significant off-chip accesses and second the hardware cost of decryption latency reduction solutions in [13, 9] is too high making them unattractive for embedded processors. In this paper we present a compiler-assisted strategy that uses minimal hardware support to reduce the overhead of memory encryption in low-to medium-end embedded processors. Our experiments show that the proposed technique reduces average execution time overhead of memory encryption for low-end (medium-end) embedded processor with 0 KB (32 KB) L1 cache from 60% (13.1%), with single counter, to 12.5% (2.1%) by additionally using only 8 hardware counter-registers.

Prefetching offers the potential to improve the performance of linked data structure (LDS) traver... more Prefetching offers the potential to improve the performance of linked data structure (LDS) traversals. However, previously proposed prefetching methods only work well when there is enough work processing a node that the prefetch latency can be hidden, or when the LDS is long enough and the traversal path is known a priori. This paper presents a prefetching technique called prefetch arrays which can prefetch both short LDS, as the lists found in hash tables, and trees when the traversal path is not known a priori. We offer two implementations, one software-only and one which combines software annotations with a prefetch engine in hardware. On a pointer-intensive benchmark suite, we show that our implementations reduce the memory stall time by 23% to 51% for the kernels with linked lists, while the other prefetching methods cause reductions that are substantially less. For binary-trees, our hardware method manages to cut nearly 60% of the memory stall time even when the traversal path is not known a priori. However, when the branching factor of the tree is too high, our technique does not improve performance. Another contribution of the paper is that we quantify pointer-chasing found in interesting applications such as OLTP, Expert Systems, DSS, and JAVA codes and discuss which prefetching techniques are relevant to use in each case.

Journal of Parallel and Distributed Computing, Jun 1, 1997

We consider a network of workstations (NOW) organization consisting of busbased multiprocessors i... more We consider a network of workstations (NOW) organization consisting of busbased multiprocessors interconnected by an ATM interconnect on which a sharedmemory programming model is imposed by using a multiple-writer distributed virtual shared memory system. The latencies associated with bringing data into the local memory are a severe performance limitation of such systems. To tolerate the access latencies, we propose a novel prefetch approach and show how it can be integrated into the software-based coherence layer of a multiple-writer protocol. This approach uses the access history of each page to guide which pages to prefetch. Based on detailed architectural simulations and seven scientific applications we find that our prefetch algorithm can remove a vast majority of the remote operations which improves the performance of all applications. We also find that the bandwidth provided by ATM switches available today is sufficient to accommodate prefetching. However, the protocol processing overhead of available ATM interfaces limits the gain of the prefetching algorithms.

ACM Transactions in Embedded Computing Systems, Apr 1, 2008

The determination of upper bounds on execution times, commonly called Worst-Case Execution Times ... more The determination of upper bounds on execution times, commonly called Worst-Case Execution Times (WCETs), is a necessary step in the development and validation process for hard real-time systems. This problem is hard if the underlying processor architecture has components such as caches, pipelines, branch prediction, and other speculative components. This article describes different approaches to this problem and surveys several commercially available tools and research prototypes.

Abstract We consider a network of workstations (NOW) organiza-tion consisting of a number of bus-... more Abstract We consider a network of workstations (NOW) organiza-tion consisting of a number of bus-based multiprocessor servers interconnected by an ATM switch. A shared-mem-ory model is supported by distributed virtual shared mem-ory (DVSM) and this paper focuses on the ...

Springer eBooks, Feb 1, 2007

... of Augsburg, Germany Michiel Ronsse Ghent University, Belgium Sylvie Detournay Ghent Universi... more ... of Augsburg, Germany Michiel Ronsse Ghent University, Belgium Sylvie Detournay Ghent University, Belgium Program Committee Angelos Bilas Mats Brorsson Koen De Bosschere Jack Davidson Marc Duranton Babak Falsaﬁ Paolo Faraboschi Kristian Flautner Chris Gniady ...

Performance evaluation review, Jun 1, 2000

This paper presents an analytical model to study how working sets scale with database size and ot... more This paper presents an analytical model to study how working sets scale with database size and other applications parameters in decision-support systems (DSS). The model uses application parameters, that are measured on down-scaled database executions, to predict cache miss ratios for executions of large databases. By applying the model to two database engines and typical DSS queries we find that, even for large databases, the most performance-critical working set is small and is caused by the instructions and private data that are required to access a single tuple. Consequently, its size is not affected by the database size. Surprisingly, database data may also exhibit temporal locality but the size of its working set critically depends on the structure of the query, the method of scanning, and the size and the content of the database.

PARLE'94 Parallel Architectures and Languages Europe

Proceedings Sixth International Symposium on High-Performance Computer Architecture. HPCA-6 (Cat. No.PR00550)

Journal of Instruction-Level Parallelism

Classic cache replacement policies assume that miss costs are uniform. However, the correlation b... more Classic cache replacement policies assume that miss costs are uniform. However, the correlation between miss rate and cache performance is not as straightforward as it used to be. Ultimately, the true performance cost of a miss should be its access penalty, i.e. the actual processing bandwidth lost because of the miss. Contrary to loads, the penalty of stores is mostly hidden in modern processors. To take advantage of this observation, we propose a simple scheme to replace load misses by store misses. We extend LRU (Least Recently Used) to reduce the aggregate miss penalty instead of the miss count. The new policy is called PS-LRU (Penalty-Sensitive LRU) and is deployed throughout most of this paper. PS-LRU systematically replaces first a block predicted to be accessed with a store next. This policy minimizes the number of load misses to the detriment of store misses. One key issue in this policy is to predict the next access type to a block, so that higher replacement priority is given to blocks that will be accessed next with a store. We introduce and evaluate various prediction schemes based on instructions and broadly inspired from branch predictors. To guide the design we run extensive trace-driven simulations on eight Spec95 benchmarks with a wide range of cache configurations and observe that PS-LRU yield positive load miss improvements over classic LRU across most the benchmarks and cache configurations. In some cases the improvements are very large. Although the total number of load misses is minimized under our simple policy, the number of store misses and the amount of memory traffic both increase. Moreover store misses are not totally "free". To evaluate this trade-off, we apply DCL and ACL (two previously proposed cost-sensitive LRU policies) to the problem of load/store miss penalty. These algorithms are more competitive than PS-LRU. Both DCL and ACL provide attractive trade-offs in which less load misses are saved than in PS-LRU, but the store miss traffic is reduced.

Lecture Notes in Computer Science, 1997

Performance tuning of applications for shared-memory multiprocessors is to a great extent concern... more Performance tuning of applications for shared-memory multiprocessors is to a great extent concerned with removal of performance bottlenecks caused by communication among the processors. To simplify performance tuning, our approach has been to extend the hardware/software interface with powerful memory-control primitives in combination with compiler optimizations to remove communication bottlenecks in distributed shared-memory multiprocessors. Evaluations have shown that this combination can yield quite dramatic application performance improvements. This raises the fundamental question of how the hardware/software interface in future distributed shared-memory machines should be defined to serve as a good target for performance tuning of shared-memory programs, either automatically or by hand. An approach along those lines is discussed.

ACM SIGARCH Computer Architecture News, 1995

The cost, complexity, and inflexibility of hardware-based directory protocols motivate us to stud... more The cost, complexity, and inflexibility of hardware-based directory protocols motivate us to study the performance implications of protocols that emulate directory management using software handlers executed on the compute processors. An important performance limitation of such software-only protocols is that software latency associated with directory management ends up on the critical memory access path for read miss transactions. We propose five strategies that support efficient data transfers in hardware whereas directory management is handled at a slower pace in the background by software handlers. Simulations show that this approach can remove the directory-management latency from the memory access path. Whereas the directory is managed in software, the hardware mechanisms must access the memory state in order to enable data transfers at a high speed. Overall, our strategies reach between 60% and 86% of the hardware-based protocol performance.

Proceedings of the 21st international conference on Parallel architectures and compilation techniques, 2012

Memory access latency is the primary performance bottleneck in modern computer systems. Prefetchi... more Memory access latency is the primary performance bottleneck in modern computer systems. Prefetching data before it is needed by a processing core allows substantial performance gains by overlapping significant portions of memory latency with useful work. Prior work has investigated this technique and measured potential performance gains in a variety of scenarios. However, its use in speeding up Hardware Transactional Memory (HTM) has remained hitherto unexplored. In several HTM designs transactions invalidate speculatively updated cache lines when they abort. Such cache lines tend to have high locality and are likely to be accessed again when the transaction re-executes. Coarse grained transactions that update relatively large amounts of data are particularly susceptible to performance degradation even under moderate contention. However, such transactions show strong locality of reference, especially when contention is high. Prefetching cache lines with high locality can, therefore, improve overall concurrency by speeding up transactions and, thereby, narrowing the window of time in which such transactions persist and can cause contention. Such transactions are important since they are likely to form a common TM use-case. We note that traditional prefetch techniques may not be able to track such lines adequately or issue prefetches quickly enough. This paper investigates the use of prefetching in HTMs, proposing a simple design to identify and request prefetch candidates, and measures potential performance gains to be had for several representative TM workloads.

ACM SIGARCH Computer Architecture News, 1992

Two interesting variations of large-scale shared-memory machines that have recently emerged are c... more Two interesting variations of large-scale shared-memory machines that have recently emerged are cache-coherent non-uniform-memory-access machines (CC-NUMA) and cache-only memory architectures (COMA). They both have distributed main memory and use directory-based cache coherence. Unlike CC-NUMA, however, COMA machines automatically migrate and replicate data at the main-memory level in cache-line sized chunks. This paper compares the performance of these two classes of machines. We first present a qualitative model that shows that the relative performance is primarily determined by two factors: the relative magnitude of capacity misses versus coherence misses, and the granularity of data partitions in the application. We then present quantitative results using simulation studies for eight parallel applications (including all six applications from the SPLASH benchmark suite). We show that COMA's potential for performance improvement is limited to applications where data accesses b...

2011 International Conference on Parallel Processing, 2011

Hardware transactional memory (HTM) systems have been studied extensively along the dimensions of... more Hardware transactional memory (HTM) systems have been studied extensively along the dimensions of speculative versioning and contention management policies. The relative performance of several designs policies has been discussed at length in prior work within the framework of scalable chipmultiprocessing systems. Yet, the impact of simple structural optimizations like write-buffering has not been investigated and performance deviations due to the presence or absence of these optimizations remains unclear. This lack of insight into the effective use and impact of these interfacial structures between the processor core and the coherent memory hierarchy forms the crux of the problem we study in this paper. Through detailed modeling of various write-buffering configurations we show that they play a major role in determining the overall performance of a practical HTM system. Our study of both eager and lazy conflict resolution mechanisms in a scalable parallel architecture notes a remarkable convergence of the performance of these two diametrically opposite design points when write buffers are introduced and used well to support the common case. Mitigation of redundant actions, fewer invalidations on abort, latency-hiding and prefetch effects contribute towards reducing execution times for transactions. Shorter transaction durations also imply a lower contention probability, thereby amplifying gains even further. The insights, related to the interplay between buffering mechanisms, system policies and workload characteristics, contained in this paper clearly distinguish gains in performance to be had from write-buffering from those that can be ascribed to HTM policy. We believe that this information would facilitate sound design decisions when incorporating HTMs into parallel architectures.

Journal of Parallel and Distributed Computing, 2000

We study in this paper how effective latency-tolerating and-reducing techniques are at cutting th... more We study in this paper how effective latency-tolerating and-reducing techniques are at cutting the memory access times for shared-memory multiprocessors with directory cache protocols managed by hardware and software. A critical issue for the relative efficiency is how many protocol operations such techniques trigger. This paper presents a framework that makes it possible to reason about the expected relative efficiency of a latencytolerating or-reducing technique by focusing on whether the technique increases, decreases, or does not change the number of protocol operations at the memory module. Since software-only directory protocols handle these operations in software they will perform relatively worse unless the technique reduces the number of protocol operations. Our experimental results from detailed architectural simulations driven by six applications from the SPLASH-2 parallel program suite confirm this expectation. We find that while prefetching performs relatively worse on software-only directory protocols due to useless prefetches, there are examples of protocol optimizations, e.g., optimizations for migratory data, that do relatively better on

Journal of Parallel and Distributed Computing, 1996

Multiprocessors have hit the mainstream and cover the whole spectrum of computational needs from ... more Multiprocessors have hit the mainstream and cover the whole spectrum of computational needs from small-scale symmetric multiprocessors to scalable distributed shared-memory systems with a few hundred processors. This has made it possible to boost the performance of a number of important applications from the numeric and database domain. Extending the scope of applications that can take advantage of the performance of multiprocessors is however hindered by the fundamental limitation of static (off-line) parallelization methods to only uncovering data dependencies that do not depend on input data. We consider in this research the prospects of exploiting module-(function-, procedure-, or method-) level data dependence speculation to simplify the process of extracting inherent coarsegrain parallelism out of sequential codes. Given that codes have been developed using good programming practice, our hypothesis is that there is plenty of parallelism to uncover in such codes and a majority of the data dependences will be resolved by the speculation system.

Instruction-level simulation techniques are the predominant approach to evaluate the impact of ar... more Instruction-level simulation techniques are the predominant approach to evaluate the impact of architectural design alternatives on the performance of computer systems. Previous simulation approaches have not been capable of executing unmodified system as well as application software at an acceptable performance level. Commercial applications, such as databases, constitute a particular challenge. The SIMICS/sun4m platform has been designed to efficiently execute completely unmodified software binaries, such as databases and operating systems. Moreover, it is possible to flexibly model a variety of computer system architectures. We describe in this paper how the platform is used (i) for software development and application performance tuning for a given hardware architecture, (ii) for evaluation of performance increasing architectural modifications with focus on a particular application, and (iii) for combined performance increasing measures of both the hardware architecture and the software system.