Stefanos Kaxiras | Uppsala University
Papers by Stefanos Kaxiras
Proceedings Fifth International Symposium on High-Performance Computer Architecture, 1999
We propose Instruction-based Prediction as a means to optimize directory-based cache-coherent NUMA shared memory. Instruction-based prediction is based on observing the behavior of load and store instructions in relation to coherence events and predicting their future behavior. Although this technique is well established in the uniprocessor world, it has not been widely applied to optimizing transparent shared memory. Typically, in this environment, prediction is based on data-block access history (address-based prediction) in the form of adaptive cache coherence protocols. The advantage of instruction-based prediction is that it requires few hardware resources, in the form of small prediction structures per node, to match (or exceed) the performance of address-based prediction. To show the potential of instruction-based prediction we propose and evaluate three different optimizations: (i) a migratory-sharing optimization, (ii) a wide-sharing optimization, and (iii) a producer-consumer optimization based on speculative execution. With execution-driven simulation and a set of nine benchmarks we show that (i) for the first two optimizations, instruction-based prediction, using few predictor entries per node, outpaces address-based schemes, and (ii) for the producer-consumer optimization, which uses speculative execution, low mis-speculation rates show promise for performance improvements.
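The abstract above describes predicting coherence behavior from the load or store instruction itself rather than from the data block. A minimal sketch of such a per-node predictor, assuming a small direct-mapped table of 2-bit saturating counters indexed by the instruction's PC (the table size, indexing scheme, and training signal are illustrative, not the paper's exact design):

```python
class InstructionPredictor:
    """Per-node predictor indexed by the PC of a load/store instruction."""

    def __init__(self, entries=64):
        self.entries = entries
        self.table = [0] * entries  # 2-bit saturating counters, start at 0

    def _index(self, pc):
        return (pc >> 2) % self.entries  # drop instruction-alignment bits

    def train(self, pc, was_migratory):
        """Update the counter with the observed coherence outcome."""
        i = self._index(pc)
        if was_migratory:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)

    def predict_migratory(self, pc):
        """Predict migratory sharing when the counter is weakly/strongly set."""
        return self.table[self._index(pc)] >= 2
```

Because the table is indexed by instruction rather than by data-block address, a few dozen entries can cover the handful of static loads and stores that dominate coherence traffic, which is what makes the hardware cost so low.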
The cache coherence scheme of the Scalable Coherent Interface (SCI) offers performance that, for some operations, degrades linearly with the degree of sharing. For large-scale sharing, we need schemes that offer logarithmic-time reading and writing of shared data for acceptable performance. Therefore, we need to move from SCI's sharing lists to sharing trees. The currently proposed Kiloprocessor Extensions to SCI define tree protocols that create, maintain, and invalidate binary trees without taking into account the underlying topology. However, these protocols are quite complex. We propose a different approach to Kiloprocessor Extensions to SCI. We define k-ary trees that map well onto the topology of a system. In this way the k-ary sharing trees offer great geographical locality (neighbors in the tree are also physical neighbors). The resulting protocols are simple, and in some cases their performance for reading and writing shared data is superior to the previous protocols. We present our protocols on two example topologies. The first is the well-known k-ary n-cube. We introduce the second topology, a type of Omega topology constructed of rings. In order to implement our new schemes we decouple data from directory information. In this way we can cache directory information on various nodes without the corresponding data. We define the distributed directory information in terms of a cache tag with multiple pointers. The actual implementation of such a tag can be in the form of a linked list (or other structure), so we can have dynamic pointer allocation. Set-associative caches can be used for fast dynamic allocation and deallocation of pointers. Investigation of the new approach resulted in several variants of the protocols. We present most of the variants here, without comparing them in depth in terms of performance or cost.
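The k-ary sharing trees above replace SCI's linear lists so that invalidations fan out in logarithmic depth. As a hedged illustration of the shape (not the paper's protocol), here is the implicit k-ary tree over node IDs 0..n-1 that such a scheme might traverse; mapping tree neighbors onto physically adjacent nodes of a k-ary n-cube is a separate, topology-specific step:

```python
def tree_children(node, k, n):
    """Children of `node` in an implicit k-ary tree over nodes 0..n-1."""
    first = node * k + 1
    return [c for c in range(first, first + k) if c < n]

def tree_parent(node, k):
    """Parent of `node` in the same tree; the root (node 0) has no parent."""
    return None if node == 0 else (node - 1) // k
```

With fan-out k, an invalidation that walks this tree reaches all n sharers in O(log_k n) steps instead of the O(n) steps of a sharing list.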
Proceedings of the 10th international conference on Supercomputing - ICS '96, 1996
Programs that make extensive use of widely shared variables are expected to achieve modest speedups for non-bus-based cache coherence protocols, particularly as the number of processors sharing the data grows large. Protocols such as the IEEE Scalable Coherent Interface (SCI) are optimized for data that is not widely shared; the GLOW protocol extensions are specifically designed to address this limitation. The GLOW extensions take advantage of physical locality by mapping k-ary logical sharing trees to the network topology. This results in protocol messages that travel shorter distances, experiencing lower latency and consuming less bandwidth. To build the sharing trees, GLOW caches directory information at strategic points in the network, allowing concurrency, and therefore scalability, of read requests. Scalability in writes is achieved by exploiting the sharing tree to invalidate or update nodes in the sharing tree concurrently. We have defined the GLOW extensions with respect to SCI and we have implemented them in the Wisconsin Wind Tunnel (WWT) parallel discrete-event simulator. We studied them on an example topology, the k-ary n-cube, and explored their scalability with four programs for large systems (up to 256 processors).
Proceedings of the 12th international conference on Supercomputing, 1998
In this paper we argue that widely shared data are a more serious problem than previously recognized, and that furthermore it is possible to provide transparent support that actually gives an advantage to accesses to widely shared data, by exploiting their redundancy to improve accessibility. The previously proposed GLOW extensions to cache coherence protocols provide such support for widely shared data by defining functionality in the network domain. However, in their static form the GLOW extensions relied on the user to identify and expose widely shared data to the hardware. This approach suffers because: (i) it requires modification of the programs, (ii) it is not always possible to statically identify the widely shared data, and (iii) it is incompatible with commodity hardware. To address these issues, we study three dynamic schemes to discover widely shared data at runtime. The first scheme is inspired by read-combining and is based on observing requests in the network switches, the GLOW agents. The agents intercept requests whose addresses have been observed recently. This scheme tracks closely the performance of the static GLOW while it always outperforms ordinary congestion-based read-combining. In the second scheme, the memory directory discovers widely shared data by counting the number of reads between writes. Information about the widely shared nature of data is distributed to the nodes, which subsequently use special wide-sharing requests to access them. Simulations confirm that this scheme works well when the widely shared nature of the data is persistent over time. The third and most significant scheme is based on predicting which load instructions are going to access widely shared data. Although the implementation of this scheme is not as straightforward in a commodity-parts environment, it outperforms all others.
The probability of occurrence of simultaneous requests only becomes a factor when serious network contention extends the latency of individual requests; in general, the best that combining can hope to achieve is to reduce the latency of access to widely shared data to the latency that would be experienced in an unloaded network.
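The second dynamic scheme above has the directory classify a block as widely shared by counting reads between consecutive writes. A minimal sketch of that per-block bookkeeping, assuming an illustrative fixed threshold and a sticky classification (the paper leaves both as design parameters):

```python
class DirectoryEntry:
    """Directory-side classifier: count reads between writes (illustrative)."""

    THRESHOLD = 8  # assumed cutoff; not a value fixed by the paper

    def __init__(self):
        self.reads_since_write = 0
        self.widely_shared = False

    def on_read(self):
        self.reads_since_write += 1
        if self.reads_since_write >= self.THRESHOLD:
            # Nodes would then be told to use special wide-sharing requests.
            self.widely_shared = True

    def on_write(self):
        # A write ends the read run; the classification is kept sticky here,
        # matching the observation that the scheme works best when wide
        # sharing is persistent over time.
        self.reads_since_write = 0
```

A decaying or periodically reset classification would adapt faster when sharing patterns change, at the cost of reclassification traffic.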
ACM SIGARCH Computer Architecture News, 1997
DataScalar architectures improve memory system performance by running computation redundantly across multiple processors, each tightly coupled with an associated memory. The program data set (and/or text) is distributed across these memories. In this execution model, each processor broadcasts operands it loads from its local memory to all other units. In this paper, we describe the benefits, costs, and problems associated with the DataScalar model. We also present simulation results of one possible implementation of a DataScalar system. In our simulated implementation, six unmodified SPEC95 binaries ran from 7% slower to 50% faster on two nodes, and from 9% to 100% faster on four nodes, than on a system with a comparable, more traditional memory system. Our intuition and results show that DataScalar architectures work best with codes for which traditional parallelization techniques fail. We conclude with a discussion of how DataScalar systems may accommodate traditional pa...
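The core of the DataScalar model described above is that every load is serviced locally by exactly one owner, which then broadcasts the operand so all redundant copies of the computation can proceed. A hedged sketch of that rule, assuming a simple static interleaving of addresses across nodes (the real distribution policy is an implementation choice):

```python
def datascalar_load(addr, nodes):
    """Owner of `addr` reads its local memory and broadcasts the operand.

    `nodes` is a list of dicts with a local "mem" and a received "operands"
    buffer; both names are illustrative, not from the paper.
    """
    owner = nodes[addr % len(nodes)]  # assumed static address interleaving
    value = owner["mem"][addr]        # only the owner touches its memory
    for node in nodes:                # every node receives the broadcast
        node["operands"][addr] = value
    return value
```

Since each operand crosses the interconnect exactly once (as a broadcast) and no node ever issues a remote load request, request traffic is eliminated at the price of redundant execution.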
The integration of memory on the same die as the processor (IRAM) has the potential to offer unprecedented bandwidth that can be exploited efficiently by vector processors. However, real-world scientific vector applications, with their very large memory requirements and their poor locality, would easily overflow any single IRAM device. In this environment, traditional approaches such as caching or paging generate considerable traffic, diminishing the performance advantage of processor-memory integration. To exploit the full potential of IRAM in the realm of large-scale scientific computing, we propose a DIstributed Vector Architecture (DIVA) that uses multiple vector-capable IRAM nodes in a distributed shared-memory configuration. The advantages of our approach are twofold: (i) we speed up the execution of the vector instructions by parallelizing them across the nodes, and (ii) we reduce external traffic by bringing computation to data rather than data to computation. We dyna...
ACM SIGARCH Computer Architecture News, 2017
In Total Store Order memory consistency (TSO), loads can be speculatively reordered to improve performance. If a load-load reordering is seen by other cores, speculative loads must be squashed and re-executed. In architectures with an unordered interconnection network and directory coherence, this has been the established view for decades. We show, for the first time, that it is not necessary to squash and re-execute speculatively reordered loads in TSO when their reordering is seen. Instead, the reordering can be hidden from other cores by the coherence protocol. The implication is that we can irrevocably bind speculative loads. This allows us to commit reordered loads out of order without having to wait (for the loads to become non-speculative) or without having to checkpoint committed state (and roll back if needed), just to ensure correctness in the rare case of some core seeing the reordering. We show that by exposing a reordering to the coherence layer and by appropriately modi...
Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing, 2015
A coherent global address space in a distributed system enables shared memory programming at a much larger scale than a single multicore or a single SMP. Without dedicated hardware support at this scale, the solution is a software distributed shared memory (DSM) system. However, traditional approaches to coherence (centralized via "active" home-node directories) and critical-section execution (distributed across nodes and cores) are inherently unfit for such a scenario. Instead, it is crucial to make decisions locally and avoid the long latencies imposed by both the network and software message handlers. Likewise, synchronization is fast if it rarely involves communication with distant nodes (or even other sockets). To minimize the amount of long-latency communication required in both coherence and critical-section execution, we propose a DSM system with a novel coherence protocol and a novel hierarchical queue delegation locking approach. More specifically, we propose an approach, suitable for data-race-free (DRF) programs, based on self-invalidation, self-downgrade, and passive data classification directories that require no message handlers, thereby incurring no extra latency. For fast synchronization we extend Queue Delegation Locking to execute critical sections in large batches on a single core before passing execution along to other cores, sockets, or nodes, in that hierarchical order. The result is a software DSM system called Argo, which localizes as many decisions as possible and allows high parallel performance with little overhead on synchronization when compared to prior DSM implementations.
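The self-invalidation and self-downgrade mechanism above keeps data-race-free programs coherent without any directory message handlers: each node cleans up its own cache at synchronization points. A minimal sketch of that protocol core, with invented names and a flat `memory` dict standing in for the backing store (Argo's actual classification and write-back machinery is far richer):

```python
class DRFCache:
    """Per-node cache for data-race-free programs (illustrative sketch)."""

    def __init__(self):
        self.lines = {}  # addr -> [value, dirty]

    def read(self, addr, memory):
        if addr not in self.lines:
            self.lines[addr] = [memory[addr], False]  # fetch on miss
        return self.lines[addr][0]

    def write(self, addr, value):
        self.lines[addr] = [value, True]

    def acquire(self):
        # Self-invalidate: drop clean copies so later reads refetch fresh data.
        self.lines = {a: line for a, line in self.lines.items() if line[1]}

    def release(self, memory):
        # Self-downgrade: write dirty data back before the lock is handed over.
        for addr, line in self.lines.items():
            if line[1]:
                memory[addr] = line[0]
                line[1] = False
```

Because a DRF program only reads shared data after an acquire that follows the writer's release, stale clean copies can never be observed, so no invalidation messages need to be sent or handled remotely.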
Lecture Notes in Computer Science, 2006
Denial-of-Service (DoS) attacks try to exhaust some shared resources (e.g., process tables, functional units) of a service-centric provider. As Chip Multi-Processors (CMPs) are becoming the mainstream architecture for server-class processors, the need to manage on-chip ...
ACM Transactions on Architecture and Code Optimization, 2015
Information and Software Technology, 1992
High Performance Memory Systems, 2004
... 169-178. 22. Seznec A (1995) DASC cache. In: Proceedings of the 1st Annual International Symposium on High Performance Computer Architecture. 23. ... 27. Zhang C, Zhang X, Yan Y (1997) Two fast and high-associativity cache schemes. IEEE Micro 17(5): 40-49.
Proceedings of the 27th international ACM conference on International conference on supercomputing - ICS '13, 2013
The end of Dennard scaling is expected to shrink the range of DVFS in future nodes, limiting the energy savings of this technique. This paper evaluates how much we can increase the effectiveness of DVFS by using a software decoupled access-execute approach. Decoupling the data access from execution allows us to apply optimal voltage-frequency selection for each phase and therefore improve energy efficiency over standard coupled execution.
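The decoupled access-execute idea above pays off because a memory-bound access phase loses little time at low frequency, while a compute phase wants the highest frequency. A toy model of per-phase operating-point selection, using an assumed set of frequencies and an invented energy-delay-product cost (illustrative only, not the paper's model):

```python
FREQS_GHZ = [1.0, 2.0, 3.0]  # assumed available DVFS operating points

def edp(phase, f):
    """Toy energy-delay product: memory stalls do not scale with frequency."""
    cycles = phase["work"] / f + phase["mem_stall"]  # delay
    energy = cycles * (0.2 + 0.8 * f ** 2)           # static + dynamic power
    return energy * cycles

def best_freq(phase):
    """Pick the operating point minimizing EDP for this phase."""
    return min(FREQS_GHZ, key=lambda f: edp(phase, f))
```

Under this model a stall-dominated access phase selects the lowest frequency and a compute-dominated phase the highest, which is exactly the per-phase split that coupled execution cannot exploit.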