Token tenure: PATCHing token counting using directory-based cache coherence (original) (raw)

Token tenure and PATCH: a predictive/adaptive Token-counting hybrid

2010

Abstract Traditional coherence protocols present a set of difficult trade-offs: the reliance of snoopy protocols on broadcast and ordered interconnects limits their scalability, while directory protocols incur a performance penalty on sharing misses due to indirection. This work introduces Patch (Predictive/Adaptive Token-Counting Hybrid), a coherence protocol that provides the scalability of directory protocols while opportunistically sending direct requests to reduce sharing latency.

Efficient techniques to provide scalability for token-based cache coherence protocols

Likewise, I would like to thank María Engracia Gómez, Vicente Santonja, and Elvira Baydal because they all contributed in making me feel comfortable and integrated in the group. I have met many interesting students while in university. Although I can not possibly mention everyone who has enriched my experience or provided moral support, I wish to specifically thank a few individuals.

Efficient and Scalable Starvation Prevention Mechanism for Token Coherence

IEEE Transactions on Parallel and Distributed Systems, 2011

Token Coherence is a cache coherence protocol that simultaneously captures the best attributes of the traditional approximations to coherence: direct communication between processors (like snooping-based protocols) and no reliance on buslike interconnects (like directory-based protocols). This is possible thanks to a class of unordered requests that usually succeed in resolving the cache misses. The problem of the unordered requests is that they can cause protocol races, which prevent some misses from being resolved. To eliminate races and ensure the completion of the unresolved misses, Token Coherence uses a starvation prevention mechanism named persistent requests. This mechanism is extremely inefficient and, besides, it compromises the scalability of Token Coherence since it requires storage structures (at each node) whose size grows proportionally to the system size. While multiprocessors continue including an increasingly number of nodes, both the performance and scalability of cache coherence protocols will continue to be key aspects. In this work we propose an alternative starvation prevention mechanism, named priority requests, that outperforms the persistent request one. It is able to reduce the application runtime more than 20% (on average) in a 64-processor system. Furthermore, thanks to the flexibility shown by priority requests, it is possible to drastically minimize its storage requirements, thereby improving the whole scalability of Token Coherence. Although this is achieved at the expense of a slight performance degradation, priority requests still outperform persistent requests significantly.

Adding Token Counting to Directory-Based Cache Coherence

Over the past decade there has been a surge of academic and industrial interest in optimistic concurrency, i.e. the speculative parallel execution of code regions that have the semantics of isolation. This work analyzes bottlenecks to the scalability of workloads that use optimistic concurrency. We find that one common bottleneck is updates to auxiliary program data in otherwise non-conflicting operations, e.g. reference count updates on shared object reads and hashtable size field increments on inserts of different elements.

VIPS: Simple Directory-Less Broadcast-Less Cache Coherence Protocol

Coherence in multicores introduces complexity and overhead (directory, state bits) in exchange for local caching, while being "invisible" to the memory consistency model. In this paper we show that a much simpler (directory-less/broadcast-less) multicore coherence provides almost the same performance without the complexity and overhead of a directory protocol. Motivated by recent efforts to simplify coherence for disciplined parallelism, we propose a hardware approach that does not require any application guidance. The cornerstone of our approach is a run-time, application-transparent, division of data into private and shared at a page-level granularity. This allows us to implement a dynamic write-policy (write-back for private, write-through for shared), simplifying the protocol to just two stable states. Self-invalidation of the shared data at synchronization points allows us to remove the directory (and invalidations) completely, with just a data-race-free guarantee (at the write-through granularity) from software. Allowing multiple simultaneous writers and merging their writes, relaxes the DRF guarantee to a word granularity and optimizes traffic. This leads to our main result: a virtually costless coherence that uses the same simple protocol for both shared, DRF data and private data (differentiating only in the timing of when to put data back in the last-level cache) while at the same time approaching the performance (within 3%) of a complex directory protocol.

Exploiting Reuse-Frequency with Speculative and Dynamic Updates in an Enhanced Directory Based Coherence Protocol

2014

As the number of cores increases on chip multiprocessors, cache coherence is fast becoming a major impediment in improving the performance of the multi-cores. This is exacerbated by the fact that the interconnection speeds does not scale well enough with the speed of the processors. To ameliorate these limitations, several mechanisms were augmented to the cache coherence protocols to enhance the performance of the multiprocessors. These mechanisms include policies such as write-update policy, write-invalidate policy etc. However, it has been previously shown that pure write-update protocol is highly undesirable because of the heavy traffic caused by the updates. On the other hand, write-invalidation protocol is not the optimal solution as many of the sharers of the cache blocks may be reused in the near future. In order to increase the efficiency, we introduce a novel update mechanism which uses reuse frequency and last touch time of each cache block as metrics to take the decision ...

Snoopy and Directory Based Cache Coherence Protocols: A Critical Analysis

The computational systems (multi and uni-processors) need to avoid the cache coherence problem. The problem of cache coherence is solved by today's multiprocessors by implementing a cache coherence protocol. The cache coherence protocol affects the performance of a distributed shared memory multiprocessor system. This paper discusses several different varieties of cache coherence protocols including with their pros and cons, the way they are organized, common protocol transitions, and some examples of systems that implement those protocols

Flask coherence: A morphable hybrid coherence protocol to balance energy, performance and scalability

2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA), 2015

This work proposes a mechanism to hybridize the benefits of snoop-based and directory-based coherence protocols in a single construct. A non-inclusive sparse-directory is used to minimize energy requirements and guarantee scalability. Directory entries will be used only by the most actively shared blocks. To preserve system correctness token counting is used. Additionally, each directory entry is augmented with a counting bloom filter that suppresses most unnecessary on-chip and off-chip requests. Combining all these elements, the proposal, with a low storage overhead, is able to suppress most traffic inherent to snoop-based protocols. With a directory capable of tracking just 40% of the blocks kept in private caches, this coherence protocol is able to match the performance and energy of a sparse-directory capable of tracking 160% of the blocks. Using the same configuration, it can outperform the performance and on-chip memory hierarchy energy of a broadcast-based coherence protocol such as Token by 10% and 20% respectively. To achieve these results, the proposal uses an improved counting bloom filter, which provides twice the space efficiency of a conventional one with similar implementation cost. This filter also enables the coherence controller storage used to track shared blocks and filter private block misses to change dynamically according to the data-sharing properties of the application. With only 5% of tracked private cache entries, the average performance degradation of this construct is less than 8% compared to a 160% over-provisioned sparse-directory.

Switch-based packing technique to reduce traffic and latency in token coherence

Journal of Parallel and Distributed Computing, 2012

Token Coherence is a cache coherence protocol able to simultaneously capture the best attributes of traditional protocols: low latency and scalability. However, it may lose these desired features when (1) several nodes contend for the same memory block and (2) nodes write highly-shared blocks. The first situation leads to the issue of simultaneous broadcast requests which threaten the protocol scalability. The second situation results in a burst of token responses directed to the writer, which turn it into a bottleneck and increase the latency. To address these problems, we propose a switch-based packing technique able to encapsulate several messages (while in transit) into just one. Its application to the simultaneous broadcasts significantly reduces their bandwidth requirements (up to 45%). Its application to token responses lowers their transmission latency (by 70%). Thus, the packing technique decreases both the latency and coherence traffic, thereby improving system performance (about 15% of reduction in runtime).