Group Mutual Exclusion by Fetch-and-increment

Fast and fair mutual exclusion for shared memory systems

Proceedings. 19th IEEE International Conference on Distributed Computing Systems (Cat. No.99CB37003)

Two fast mutual exclusion algorithms using read-modify-write and atomic read/write registers are presented. The first one uses both compare&swap and fetch&store; the second uses only fetch&store. Fetch&store is more commonly available than compare&swap. It is impossible to obtain better algorithms if "time" is measured by counting remote memory references. We were able to maintain the same level of performance with or without the support of compare&swap. However, fairness is degraded from 1-bounded bypass to lockout freedom without that support.

Fast mutual exclusion algorithms using read-modify-write and atomic read/write registers

Proceedings 1998 International Conference on Parallel and Distributed Systems (Cat. No.98TB100250)

Three fast mutual exclusion algorithms using read-modify-write and atomic read/write registers are presented in a sequence, with an improvement from one to the next. The last algorithm is shown to be optimal in minimizing the number of remote memory accesses required in a resource busy period. Remote memory access is the key factor of memory access bottleneck in large shared-memory multiprocessors. The algorithm is particularly suitable in such systems for applications with small critical sections and frequent resource requests.

Concurrent Update on Multiprogrammed Shared Memory Multiprocessors

IEEE Transactions on Reliability

Most multiprocessors are multiprogrammed in order to achieve acceptable response time and to increase utilization. Unfortunately, inopportune preemption may significantly degrade the performance of synchronized parallel applications. To address this problem, researchers have developed two principal strategies for concurrent, atomic update of shared data structures: preemption-safe locking and non-blocking (lock-free) algorithms. Preemption-safe locking requires kernel support. Non-blocking algorithms generally require a universal atomic primitive such as compare and swap or load linked/store conditional.

On the Performance of Delegation over Cache-Coherent Shared Memory

Proceedings of the 2015 International Conference on Distributed Computing and Networking, 2015

Delegation is a thread synchronization technique where access to shared data is performed through a dedicated server thread. When a client thread requires shared data access, it makes a request to a server and waits for a response. This paper studies delegation implementation over cache-coherent shared memory, with the goal of optimizing it for high throughput. Whereas client-server communication naturally fits message-passing systems, efficient implementation over cache-coherent shared memory requires careful optimization. We demonstrate optimizations that significantly improve delegation performance on two modern x86 processors (the Intel Xeon Westmere and the AMD Opteron Magny-Cours), enabling us to come up with counter, stack and queue implementations that outperform the best known alternatives in a large number of cases. Our optimized delegation solution achieves 1.4x (resp. 2x) higher throughput compared to the most efficient state-of-the-art delegation solution on the Intel Xeon (resp. AMD Opteron).
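The request/response pattern described in this abstract can be sketched as follows. This is a minimal illustration of delegation only, not the paper's optimized implementation: it uses Python's `queue.SimpleQueue` for client-server communication rather than hand-tuned cache-line messaging, and the class name `DelegatedCounter` and its methods are hypothetical.

```python
import queue
import threading


class DelegatedCounter:
    """A counter that only one server thread ever touches (hypothetical name)."""

    _STOP = object()  # sentinel request that tells the server to exit

    def __init__(self):
        self._requests = queue.SimpleQueue()
        self._value = 0  # read and written by the server thread only
        self._server = threading.Thread(target=self._serve, daemon=True)
        self._server.start()

    def _serve(self):
        # Server loop: take one request, apply it, send back the response.
        while True:
            request = self._requests.get()
            if request is self._STOP:
                return
            amount, reply = request
            self._value += amount
            reply.put(self._value)

    def add(self, amount):
        """Client side: post a request, then wait for the server's response."""
        reply = queue.SimpleQueue()
        self._requests.put((amount, reply))
        return reply.get()

    def close(self):
        """Stop the server thread after it drains earlier requests."""
        self._requests.put(self._STOP)
        self._server.join()
```

Because the shared value is accessed by a single thread, clients need no lock around it; the cost moves into the communication channel, which is the part the paper sets out to optimize.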

A tight bound on remote reference time complexity of mutual exclusion in the read-modify-write model

Journal of Parallel and Distributed Computing, 2006

In distributed shared memory multiprocessors, remote memory references generate processor-to-memory traffic, which may result in a bottleneck. It is therefore important to design algorithms that minimize the number of remote memory references. We establish a lower bound of three on remote reference time complexity for mutual exclusion algorithms in a model where processes communicate by means of a general read-modify-write primitive that accesses at most one shared variable in one instruction. Since the general read-modify-write primitive is a generalization of a variety of atomic primitives that have been implemented in multiprocessor systems, our lower bound holds for all mutual exclusion algorithms that use such primitives. Furthermore, this lower bound is shown to be tight by presenting an algorithm with the matching upper bound.

Lock Oscillation: Boosting the Performance of Concurrent Data Structures

2017

In combining-based synchronization, two main parameters that affect performance are the combining degree of the synchronization algorithm, i.e., the average number of requests that each combiner serves, and the number of expensive synchronization primitives (like CAS, Swap, etc.) that it performs. The value of the first parameter must be high, whereas the second must be kept low. In this paper, we present Osci, a new combining technique that shows remarkable performance when paired with cheap context switching. We experimentally show that Osci significantly outperforms all previous combining algorithms. Specifically, the throughput of Osci is higher than that of previously presented combining techniques by more than an order of magnitude. Notably, Osci's throughput is much closer to the ideal than all previous algorithms, while keeping the average latency in serving each request low. We evaluated the performance of Osci in two different multiprocessor architectures, namely AMD and Intel. Based on Osci, we implement and experimentally evaluate implementations of concurrent queues and stacks. These implementations outperform by far all current state-of-the-art concurrent queue and stack implementations. Although the current version of Osci has been evaluated in an environment supporting user-level threads, it would run correctly on any threading library, preemptive or not (including kernel threads).

Group mutual exclusion in linear time and space

Proceedings of the 17th International Conference on Distributed Computing and Networking, 2016

We present two algorithms for the Group Mutual Exclusion (GME) Problem that satisfy the properties of Mutual Exclusion, Starvation Freedom, Bounded Exit, Concurrent Entry and First Come First Served. Both our algorithms use only simple read and write instructions, have O(N) Shared Space complexity and O(N) Remote Memory Reference (RMR) complexity in the Cache Coherency (CC) model. Our first algorithm is developed by generalizing the well-known Lamport's Bakery Algorithm for the classical mutual exclusion problem, while preserving its simplicity and elegance. However, it uses unbounded shared registers. Our second algorithm uses only bounded registers and is developed by generalizing Taubenfeld's Black and White Bakery Algorithm, which solves the classical mutual exclusion problem using only bounded shared registers. We show that, contrary to common perception, our algorithms are the first to achieve these properties with this combination of complexities.
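For orientation, the classical Lamport Bakery Algorithm that the first GME algorithm generalizes can be sketched as below. This is the textbook mutual exclusion version, not the paper's group variant. It assumes that reads and writes of Python list elements take effect in program order (as under CPython's global interpreter lock); the names `Bakery` and `run_demo` are hypothetical.

```python
import threading
import time


class Bakery:
    """Classical Bakery lock for n threads, using only reads and writes."""

    def __init__(self, n):
        self.n = n
        self.choosing = [False] * n  # True while thread i picks a ticket
        self.number = [0] * n        # ticket numbers; 0 means "not competing"

    def lock(self, i):
        # Doorway: take a ticket larger than every ticket currently visible.
        self.choosing[i] = True
        self.number[i] = 1 + max(self.number)
        self.choosing[i] = False
        for j in range(self.n):
            if j == i:
                continue
            while self.choosing[j]:
                time.sleep(0)  # yield while j is still picking its ticket
            while True:
                nj = self.number[j]
                # Proceed once j is not competing or is ordered after us.
                if nj == 0 or (nj, j) > (self.number[i], i):
                    break
                time.sleep(0)

    def unlock(self, i):
        self.number[i] = 0


def run_demo(n_threads, rounds):
    """Run a non-atomic increment under the lock; return (total, max occupancy)."""
    bakery = Bakery(n_threads)
    state = {"total": 0, "inside": 0, "max_inside": 0}

    def worker(i):
        for _ in range(rounds):
            bakery.lock(i)
            state["inside"] += 1
            state["max_inside"] = max(state["max_inside"], state["inside"])
            current = state["total"]
            time.sleep(0)  # invite a thread switch inside the critical section
            state["total"] = current + 1
            state["inside"] -= 1
            bakery.unlock(i)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["total"], state["max_inside"]
```

The ticket values in `number` grow without bound under sustained contention, which is the unbounded-register limitation the abstract mentions for the first algorithm.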

Queue locks on cache coherent multiprocessors

Proceedings of 8th International Parallel Processing Symposium

Large-scale shared-memory multiprocessors typically have long latencies for remote data accesses. A key issue for execution performance of many common applications is the synchronization cost. The communication scalability of synchronization has been improved by the introduction of queue-based spin-locks instead of Test&Test&Set. For architectures with long access latencies for global data, attention should also be paid to the number of global accesses that are involved in synchronization. We present a method to characterize the performance of proposed queue lock algorithms, and apply it to previously published algorithms. We also present two new queue locks, the LH lock and the M lock. We compare the locks in terms of performance, memory requirements, code size, and required hardware support. The LH lock is the simplest of all the locks, yet requires only an atomic swap operation. The M lock is superior in terms of global accesses needed to perform synchronization and still competitive in all other criteria. We conclude that the M lock is the best overall queue lock for the class of architectures studied.
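To make the idea of a swap-based queue lock concrete, here is a rough sketch of a queue lock in the general CLH style, where each waiter spins on its predecessor's node. It is not taken from the paper and may differ from the LH lock in its details. Python has no atomic swap instruction, so the swap on the tail is simulated with an internal `threading.Lock`; the names `QueueLock` and `_Node` are hypothetical.

```python
import threading
import time


class _Node:
    """Queue node: `locked` is True while its owner holds or awaits the lock."""

    def __init__(self, locked=False):
        self.locked = locked


class QueueLock:
    def __init__(self):
        self._tail = _Node(locked=False)     # dummy node: lock starts free
        # Assumption: this guard simulates a hardware atomic swap on _tail.
        self._swap_guard = threading.Lock()
        self._local = threading.local()      # per-thread node bookkeeping

    def _swap_tail(self, new_node):
        """Atomically store new_node in the tail and return the old tail."""
        with self._swap_guard:
            old = self._tail
            self._tail = new_node
            return old

    def acquire(self):
        node = getattr(self._local, "node", None) or _Node()
        node.locked = True
        pred = self._swap_tail(node)         # enqueue behind the previous tail
        while pred.locked:                   # spin only on the predecessor
            time.sleep(0)
        self._local.node = node
        self._local.pred = pred

    def release(self):
        node = self._local.node
        node.locked = False                  # hand the lock to the successor
        # Recycle the predecessor's node, which no other thread references now.
        self._local.node = self._local.pred
```

Each thread performs one swap per acquisition and then waits on a single flag, so waiters are served in the order in which their swaps took effect.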

Efficient Fetch-and-Increment

2012

A Fetch&Inc object stores a non-negative integer and supports a single operation, fi, that returns the value of the object and increments it. Such objects are used in many asynchronous shared memory algorithms, such as renaming, mutual exclusion, and barrier synchronization. We present an efficient implementation of a wait-free Fetch&Inc object from registers and load-linked/store-conditional (ll/sc) objects.
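The semantics of the fi operation can be illustrated with a small model. This sketch is not the paper's wait-free construction from registers and ll/sc objects: it is blocking, and it simulates atomicity with a `threading.Lock` because Python exposes no hardware fetch-and-increment. The names `FetchAndInc` and `draw_tickets` are hypothetical.

```python
import threading


class FetchAndInc:
    """Toy model of a Fetch&Inc object holding a non-negative integer."""

    def __init__(self, initial=0):
        if initial < 0:
            raise ValueError("Fetch&Inc stores a non-negative integer")
        self._value = initial
        # Assumption: a lock stands in for the atomicity that a real
        # fetch-and-increment primitive would provide.
        self._guard = threading.Lock()

    def fi(self):
        """Return the current value and increment it, as one atomic step."""
        with self._guard:
            old = self._value
            self._value = old + 1
            return old


def draw_tickets(num_threads, per_thread):
    """Have several threads call fi() concurrently and collect all results."""
    counter = FetchAndInc()
    results = [[] for _ in range(num_threads)]

    def worker(slot):
        for _ in range(per_thread):
            results[slot].append(counter.fi())

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [ticket for chunk in results for ticket in chunk]
```

Since every call returns a distinct value, the results can serve as unique tickets, which is how such objects are typically used in mutual exclusion and renaming.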