Token tenure: PATCHing token counting using directory-based cache coherence
Token tenure and PATCH: a predictive/adaptive token-counting hybrid
2010
Traditional coherence protocols present a set of difficult trade-offs: the reliance of snoopy protocols on broadcast and ordered interconnects limits their scalability, while directory protocols incur a performance penalty on sharing misses due to indirection. This work introduces Patch (Predictive/Adaptive Token-Counting Hybrid), a coherence protocol that provides the scalability of directory protocols while opportunistically sending direct requests to reduce sharing latency.
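The token-counting substrate that Patch builds on can be sketched as a simple invariant: each block has a fixed number of tokens, a writer must hold all of them, and a reader needs at least one. The model below is a minimal illustration of that invariant only (class and function names are hypothetical, not from the paper):

```python
# Minimal sketch of the token-counting invariant underlying
# Token Coherence and Patch (names hypothetical).

TOTAL_TOKENS = 4  # fixed per block, e.g. one per potential sharer


class CacheCopy:
    def __init__(self):
        self.tokens = 0

    def can_read(self):
        # A cache may read a block only while holding >= 1 token.
        return self.tokens >= 1

    def can_write(self):
        # A cache may write only while holding ALL tokens, which
        # guarantees no other readable copy exists anywhere.
        return self.tokens == TOTAL_TOKENS


def transfer(src, dst, n):
    # Tokens are conserved: they move between caches, never created.
    assert 0 < n <= src.tokens
    src.tokens -= n
    dst.tokens += n


a, b = CacheCopy(), CacheCopy()
a.tokens = TOTAL_TOKENS            # 'a' starts as the sole owner
assert a.can_write() and not b.can_read()

transfer(a, b, 1)                  # 'b' obtains a token: read-only copy
assert b.can_read() and not a.can_write()
```

Because safety rests on counting alone, requests need no ordered interconnect; protocols in this family differ mainly in how requests are routed and how races are resolved.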
Efficient and Scalable Starvation Prevention Mechanism for Token Coherence
IEEE Transactions on Parallel and Distributed Systems, 2011
Token Coherence is a cache coherence protocol that simultaneously captures the best attributes of the traditional approximations to coherence: direct communication between processors (like snooping-based protocols) and no reliance on bus-like interconnects (like directory-based protocols). This is possible thanks to a class of unordered requests that usually succeed in resolving the cache misses. The problem with unordered requests is that they can cause protocol races, which prevent some misses from being resolved. To eliminate races and ensure the completion of the unresolved misses, Token Coherence uses a starvation prevention mechanism named persistent requests. This mechanism is extremely inefficient and, moreover, it compromises the scalability of Token Coherence, since it requires storage structures (at each node) whose size grows proportionally to the system size. As multiprocessors continue to include an increasing number of nodes, both the performance and scalability of cache coherence protocols will remain key concerns. In this work we propose an alternative starvation prevention mechanism, named priority requests, that outperforms the persistent request mechanism. It is able to reduce application runtime by more than 20% (on average) in a 64-processor system. Furthermore, thanks to the flexibility of priority requests, it is possible to drastically reduce their storage requirements, thereby improving the overall scalability of Token Coherence. Although this comes at the expense of a slight performance degradation, priority requests still outperform persistent requests significantly.
Adding Token Counting to Directory-Based Cache Coherence
Over the past decade there has been a surge of academic and industrial interest in optimistic concurrency, i.e. the speculative parallel execution of code regions that have the semantics of isolation. This work analyzes bottlenecks to the scalability of workloads that use optimistic concurrency. We find that one common bottleneck is updates to auxiliary program data in otherwise non-conflicting operations, e.g. reference count updates on shared object reads and hashtable size field increments on inserts of different elements.
Reducing Network Traffic of Token Protocol Using Sharing Relation Cache
Tsinghua Science & Technology, 2007
The token protocol provides a new coherence framework for shared-memory multiprocessor systems. It avoids the indirections of directory protocols for common cache-to-cache transfer misses, and achieves higher interconnect bandwidth and lower interconnect latency compared with snooping protocols. However, broadcasting increases network traffic, limiting the scalability of the token protocol. This paper describes an efficient technique to reduce token protocol network traffic, called the sharing relation cache. This cache provides destination set information for cache-to-cache miss requests by caching directory information for recently shared data. The paper describes how to implement the technique in a token protocol. Simulations using SPLASH-2 benchmarks show that in a 16-core chip multiprocessor system, the cache reduced network traffic by 15% on average.
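The idea of the sharing relation cache is to remember the sharer set for recently shared blocks so that a cache-to-cache miss can be multicast to the likely holders instead of broadcast to every core. A minimal sketch of such a table (structure and names hypothetical, not from the paper):

```python
# Sketch of a sharing relation cache: a small LRU table mapping block
# addresses to the last observed sharer set, used to narrow the
# destination set of a miss request (all names hypothetical).

from collections import OrderedDict


class SharingRelationCache:
    def __init__(self, capacity=256):
        self.capacity = capacity
        self.table = OrderedDict()  # block address -> set of core ids

    def record(self, addr, sharers):
        # Remember the sharer set observed for a recently shared block.
        self.table[addr] = set(sharers)
        self.table.move_to_end(addr)
        if len(self.table) > self.capacity:
            self.table.popitem(last=False)  # evict the LRU entry

    def destinations(self, addr, all_cores):
        # Hit: multicast to the cached sharer set. Miss: fall back to
        # the full broadcast the base token protocol would have used.
        return self.table.get(addr, set(all_cores))


src = SharingRelationCache()
src.record(0x40, {1, 3})
assert src.destinations(0x40, range(16)) == {1, 3}       # multicast
assert src.destinations(0x80, range(4)) == {0, 1, 2, 3}  # broadcast fallback
```

A stale entry is harmless for correctness in a token protocol: a request that misses a current token holder is simply retried or escalated, so the cache only needs to be right often enough to pay off in traffic.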
VIPS: Simple Directory-Less Broadcast-Less Cache Coherence Protocol
Coherence in multicores introduces complexity and overhead (directory, state bits) in exchange for local caching, while being "invisible" to the memory consistency model. In this paper we show that a much simpler (directory-less/broadcast-less) multicore coherence provides almost the same performance without the complexity and overhead of a directory protocol. Motivated by recent efforts to simplify coherence for disciplined parallelism, we propose a hardware approach that does not require any application guidance. The cornerstone of our approach is a run-time, application-transparent division of data into private and shared at page-level granularity. This allows us to implement a dynamic write policy (write-back for private data, write-through for shared data), simplifying the protocol to just two stable states. Self-invalidation of the shared data at synchronization points allows us to remove the directory (and invalidations) completely, with just a data-race-free guarantee (at the write-through granularity) from software. Allowing multiple simultaneous writers and merging their writes relaxes the DRF guarantee to word granularity and optimizes traffic. This leads to our main result: a virtually costless coherence that uses the same simple protocol for both shared, DRF data and private data (differing only in the timing of when to put data back in the last-level cache) while at the same time approaching the performance (within 3%) of a complex directory protocol.
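The run-time private/shared division that drives the VIPS write policy can be illustrated with a first-touch scheme: a page touched by only one core stays private (write-back); the moment a second core accesses it, it is promoted to shared (write-through). This is a simplified sketch of that classification step only (names and details hypothetical):

```python
# Sketch of run-time, page-granularity private/shared classification
# (hypothetical simplification of the VIPS approach): first touch
# marks a page private to a core; a second core promotes it to shared.


class PageClassifier:
    def __init__(self):
        self.owner = {}     # page -> first core that touched it
        self.shared = set() # pages promoted to shared

    def classify(self, page, core):
        if page in self.shared:
            return "shared"             # write-through policy
        if page not in self.owner:
            self.owner[page] = core     # first touch: private
            return "private"            # write-back policy
        if self.owner[page] == core:
            return "private"
        self.shared.add(page)           # second core seen: promote
        return "shared"


pc = PageClassifier()
assert pc.classify(0x1000, core=0) == "private"
assert pc.classify(0x1000, core=0) == "private"
assert pc.classify(0x1000, core=1) == "shared"   # promoted on 2nd sharer
assert pc.classify(0x1000, core=0) == "shared"   # promotion is sticky
```

In hardware this state would live in the TLB/page table rather than a software map; the point is only that the division is dynamic and transparent to the application.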
2014
As the number of cores on chip multiprocessors increases, cache coherence is fast becoming a major impediment to improving the performance of multi-cores. This is exacerbated by the fact that interconnect speeds do not scale well with processor speeds. To ameliorate these limitations, several mechanisms have been added to cache coherence protocols to enhance multiprocessor performance, including policies such as write-update and write-invalidate. However, it has previously been shown that a pure write-update protocol is highly undesirable because of the heavy traffic caused by the updates. On the other hand, a write-invalidate protocol is not optimal either, as many of the sharers of a cache block may reuse it in the near future. To increase efficiency, we introduce a novel update mechanism which uses the reuse frequency and last touch time of each cache block as metrics for the decision ...
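The hybrid decision described above can be sketched as a simple predicate: on a write, a sharer that has reused the block often and recently is updated, while a stale sharer is invalidated to avoid wasted update traffic. The thresholds and function below are hypothetical, chosen only to make the trade-off concrete:

```python
# Sketch of a hybrid update/invalidate decision per sharer
# (thresholds and names hypothetical, not from the paper).

REUSE_THRESHOLD = 4    # minimum reuses to justify sending an update
RECENCY_WINDOW = 1000  # max cycles since the sharer last touched the block


def decide(reuse_count, cycles_since_last_touch):
    # Update only sharers that are both frequent and recent users;
    # everyone else gets a classic invalidation.
    if reuse_count >= REUSE_THRESHOLD and cycles_since_last_touch <= RECENCY_WINDOW:
        return "update"      # sharer is likely to reuse the block soon
    return "invalidate"      # stale sharer: skip the update traffic


assert decide(reuse_count=10, cycles_since_last_touch=50) == "update"
assert decide(reuse_count=1, cycles_since_last_touch=50) == "invalidate"
assert decide(reuse_count=10, cycles_since_last_touch=5000) == "invalidate"
```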
Snoopy and Directory Based Cache Coherence Protocols: A Critical Analysis
Computational systems (both multi- and uni-processor) need to avoid the cache coherence problem. Today's multiprocessors solve it by implementing a cache coherence protocol, and the choice of protocol affects the performance of a distributed shared-memory multiprocessor system. This paper discusses several different varieties of cache coherence protocols along with their pros and cons, the way they are organized, common protocol transitions, and some examples of systems that implement those protocols.
2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA), 2015
This work proposes a mechanism to hybridize the benefits of snoop-based and directory-based coherence protocols in a single construct. A non-inclusive sparse directory is used to minimize energy requirements and guarantee scalability. Directory entries are used only by the most actively shared blocks. To preserve system correctness, token counting is used. Additionally, each directory entry is augmented with a counting Bloom filter that suppresses most unnecessary on-chip and off-chip requests. Combining all these elements, the proposal, with a low storage overhead, is able to suppress most of the traffic inherent to snoop-based protocols. With a directory capable of tracking just 40% of the blocks kept in private caches, this coherence protocol is able to match the performance and energy of a sparse directory capable of tracking 160% of the blocks. Using the same configuration, it can outperform a broadcast-based coherence protocol such as Token by 10% in performance and 20% in on-chip memory hierarchy energy. To achieve these results, the proposal uses an improved counting Bloom filter, which provides twice the space efficiency of a conventional one at similar implementation cost. This filter also enables the coherence controller storage used to track shared blocks and filter private block misses to change dynamically according to the data-sharing properties of the application. With only 5% of tracked private cache entries, the average performance degradation of this construct is less than 8% compared to a 160% over-provisioned sparse directory.
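A counting Bloom filter differs from the classic bit-vector variant in that each slot holds a counter, so entries can be removed when blocks leave the private caches. A minimal reference sketch (slot count, hash choice, and names are illustrative, not the paper's improved design):

```python
# Sketch of a basic counting Bloom filter for request filtering:
# counters instead of bits allow deletion; "definitely absent"
# answers let the controller suppress unnecessary requests.
# (Parameters hypothetical; not the paper's improved filter.)

import hashlib


class CountingBloomFilter:
    def __init__(self, slots=64, hashes=3):
        self.counts = [0] * slots
        self.hashes = hashes

    def _indices(self, key):
        # Derive k slot indices from independent hashes of the key.
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:4], "big") % len(self.counts)

    def add(self, key):
        for idx in self._indices(key):
            self.counts[idx] += 1

    def remove(self, key):
        # Only valid for keys previously added (counters stay >= 0).
        for idx in self._indices(key):
            self.counts[idx] -= 1

    def may_contain(self, key):
        # False means definitely absent -> the request can be skipped.
        # True may be a false positive, which costs traffic, not safety.
        return all(self.counts[idx] > 0 for idx in self._indices(key))


f = CountingBloomFilter()
f.add(0xDEADBEEF)
assert f.may_contain(0xDEADBEEF)
f.remove(0xDEADBEEF)
assert not f.may_contain(0xDEADBEEF)
```

False positives only cause extra (harmless) requests, which is why a Bloom filter is safe to combine with a correctness substrate like token counting.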
Switch-based packing technique to reduce traffic and latency in token coherence
Journal of Parallel and Distributed Computing, 2012
Token Coherence is a cache coherence protocol able to simultaneously capture the best attributes of traditional protocols: low latency and scalability. However, it may lose these desired features when (1) several nodes contend for the same memory block and (2) nodes write highly-shared blocks. The first situation leads to the issue of simultaneous broadcast requests which threaten the protocol scalability. The second situation results in a burst of token responses directed to the writer, which turn it into a bottleneck and increase the latency. To address these problems, we propose a switch-based packing technique able to encapsulate several messages (while in transit) into just one. Its application to the simultaneous broadcasts significantly reduces their bandwidth requirements (up to 45%). Its application to token responses lowers their transmission latency (by 70%). Thus, the packing technique decreases both the latency and coherence traffic, thereby improving system performance (about 15% of reduction in runtime).
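The core of the packing technique is that a switch holding several in-transit messages for the same destination and block can merge them into one, summing their token counts (tokens are conserved). A toy sketch of that merge step, with a hypothetical `(dest, block, tokens)` message format:

```python
# Sketch of switch-based packing (message format hypothetical): token
# responses queued at a switch for the same (destination, block) pair
# are merged into a single message carrying the combined token count.


def pack(messages):
    # messages: list of (dest, block, tokens) tuples queued at a switch
    merged = {}
    for dest, block, tokens in messages:
        key = (dest, block)
        merged[key] = merged.get(key, 0) + tokens
    return [(d, b, t) for (d, b), t in merged.items()]


in_flight = [(0, 0x40, 1), (0, 0x40, 2), (1, 0x40, 1)]
out = pack(in_flight)
assert (0, 0x40, 3) in out   # two responses packed into one
assert (1, 0x40, 1) in out
assert len(out) == 2         # three messages reduced to two
```

The same merge applied to simultaneous broadcast requests collapses duplicate requests for a contended block, which is where the paper's bandwidth savings come from.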
The GLOW cache coherence protocol extensions for widely shared data
Proceedings of the 10th international conference on Supercomputing - ICS '96, 1996
Programs that make extensive use of widely shared variables are expected to achieve modest speedups for non-bus-based cache coherence protocols, particularly as the number of processors sharing the data grows large. Protocols such as the IEEE Scalable Coherent Interface (SCI) are optimized for data that is not widely shared; the GLOW protocol extensions are specifically designed to address this limitation. The GLOW extensions take advantage of physical locality by mapping k-ary logical sharing trees to the network topology. This results in protocol messages that travel shorter distances, experiencing lower latency and consuming less bandwidth. To build the sharing trees, GLOW caches directory information at strategic points in the network, allowing concurrency, and therefore scalability, of read requests. Scalability in writes is achieved by exploiting the sharing tree to invalidate or update nodes in the sharing tree concurrently. We have defined the GLOW extensions with respect to SCI and we have implemented them in the Wisconsin Wind Tunnel (WWT) parallel discrete event simulator. We studied them on an example topology, the k-ary n-cube, and explored their scalability with four programs for large systems (up to 256 processors).
The directory-based cache coherence protocol for the DASH multiprocessor
ACM Sigarch Computer Architecture News, 1990
DASH is a scalable shared-memory multiprocessor currently being developed at Stanford's Computer Systems Laboratory. The architecture consists of powerful processing nodes, each with a portion of the shared memory, connected to a scalable interconnection network. A key feature of DASH is its distributed directory-based cache coherence protocol. Unlike traditional snoopy coherence protocols, the DASH protocol does not rely on broadcast; instead it uses point-to-point messages sent between the processors and memories to keep caches consistent. Furthermore, the DASH system does not contain any single serialization or control point. While these features provide the basis for scalability, they also force a reevaluation of many fundamental issues involved in the design of a protocol, including correctness, performance, and protocol complexity. In this paper, we present the design of the DASH coherence protocol and discuss how it addresses the above issues. We also discuss our strategy for verifying the correctness of the protocol and briefly compare our protocol to the IEEE Scalable Coherent Interface protocol.
Improving coherence protocol reactiveness by trading bandwidth for latency
2012
This paper describes how on-chip network particularities can be used to improve coherence protocol responsiveness. To achieve this, a new coherence protocol, named LOCKE, is proposed. LOCKE successfully exploits the large available on-chip bandwidth to improve cache-coherent chip multiprocessor performance and energy efficiency. Provided that the interconnection network is designed to support multicast traffic and the protocol maximizes the potential advantages that direct coherence brings, we demonstrate that a multicast-based coherence protocol can reduce energy requirements in the CMP memory hierarchy. The key idea is to establish a suitable level of on-chip network throughput to accelerate synchronization in two ways: avoiding the protocol serialization inherent to directory-based coherence protocols, and reducing average access time below that of other snoop-based coherence protocols when shared data is truly contended. LOCKE is developed on top of a Token Coherence performance substrate, with a new set of simple proactive policies that speed up data synchronization and eliminate the passive token starvation avoidance mechanism. Using a full-system simulator that faithfully models the on-chip interconnect, an aggressive core architecture, and precise memory hierarchy details, while running a broad spectrum of workloads, our proposal improves on both directory-based and token-based coherence protocols in terms of energy and performance, at least in systems with up to 16 aggressive out-of-order processors on the chip.
Basis token consistency: Supporting strong web cache consistency
2002
With Web caching and cache-related services like CDNs and edge services playing an increasingly significant role in the modern Internet, the problem of the weak consistency and coherence provisions in current Web protocols is drawing increasing attention. Toward this end, we differentiate definitions of consistency and coherence for Web-like caching environments, and then present a novel Web protocol we call "basis token consistency" (BTC).
Exploiting Parallelism in Cache Coherency Protocol Engines
1995
Shared memory multiprocessors are based on memory models, which are precise contracts between hardware and software that spell out the semantics of memory operations. Scalable systems implementing such memory models rely on cache coherency protocols that use dedicated hardware. This paper discusses the design space for high performance cache coherency controllers and describes the architecture of the programmable protocol engines that were developed for the S3.mp shared memory multiprocessor. S3.mp uses two independent protocol engines, each of which can maintain multiple, concurrent contexts so that maintaining memory consistency does not limit system performance. The programmability of these engines allows support of multiple memory organizations, including CC-NUMA and S-COMA.
Cache Coherence Protocols in Distributed Systems
Journal of Applied Science and Technology Trends
The performance of distributed systems is significantly affected by cache coherence protocols because of their role in maintaining data consistency. Cache coherence protocols are also responsible for keeping the interconnected caches of a multiprocessor environment consistent, and the overall performance of a distributed shared-memory multiprocessor system is influenced by the type of cache coherence protocol used. The major challenge for shared-memory machines is keeping caches coherent. Accordingly, many contributions have been presented in past years to address cache issues and improve the performance of distributed systems. This paper systematically reviews a number of methods used for cache coherence protocols in distributed systems.
Verifying distributed directory-based cache coherence protocols: S3.mp, a case study
Lecture Notes in Computer Science, 1995
This paper presents the results for the verification of the S3.mp cache coherence protocol. The S3.mp protocol uses a distributed directory with limited number of pointers and hardware supported overflow handling that keeps processing nodes sharing a data block in a singly linked list. The complexity of the protocol is high and its validation is challenging because of the distributed algorithm used to maintain the linked lists and the non-FIFO network. We found several design errors, including an error which only appears in verification models of more than three processing nodes, which is very unlikely to be detected by intensive simulations. We believe that methods described in this paper are applicable to the verification of other linked list based protocols such as the IEEE Scalable Coherent Interface.
Token Coherence for Transactional Memory
2008
Concurrent programming holds the key to fully utilizing the multi-core chips provided by CMPs. However, traditional concurrent programming techniques based on locking are hard to code and error-prone. Transactional programming is an attempt to simplify concurrent programming, with transactions as the main concurrency construct. Hardware transactional memory systems provide hardware support for executing transactions and rely on modified versions of standard cache coherence protocols to do so. This report presents a token coherence protocol for the transactional memory system TTM [6]. It also presents some comparisons between standard lock-based systems and transactional memory systems. It has been observed that the performance of transactional memory systems is workload-dependent. However, transactional programming appears to be much easier than programming with locks.