Danny Hendler - Academia.edu
Papers by Danny Hendler
Proceedings of the 2018 ACM Symposium on Principles of Distributed Computing, 2018
A long-standing open question has been whether lock-freedom and wait-freedom are fundamentally different progress conditions, namely, can the former be provided in situations where the latter cannot? This paper answers the question in the affirmative, by proving that there are objects with lock-free implementations, but without wait-free implementations, using objects of any finite power. We precisely define an object called n-process long-lived approximate agreement (n-LLAA), in which two sets of processes associated with two sides, 0 or 1, need to decide on a sequence of increasingly closer outputs. We prove that 2-LLAA has a lock-free implementation using reads and writes only, while n-LLAA has a lock-free implementation using reads, writes and (n − 1)-process consensus objects. In contrast, we prove that there is no wait-free implementation of the n-LLAA object using reads, writes and specific (n − 1)-process consensus objects, called (n − 1)-window registers.
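To make the convergence requirement concrete, here is a minimal two-process averaging sketch in the spirit of approximate agreement, the primitive that n-LLAA makes long-lived and two-sided. This is a generic illustration, not the paper's construction; the class and method names are ours:

```java
import java.util.concurrent.atomic.AtomicReferenceArray;

// Illustrative sketch only: two processes converge to nearby outputs by
// repeated read/write averaging -- the flavor of approximate agreement that
// n-LLAA generalizes to a long-lived, two-sided object.
public class ApproxAgreementSketch {
    private final AtomicReferenceArray<Double> estimate = new AtomicReferenceArray<>(2);

    public ApproxAgreementSketch(double v0, double v1) {
        estimate.set(0, v0);
        estimate.set(1, v1);
    }

    // Process `id` (0 or 1) repeatedly moves to the midpoint of its own and
    // the other side's last written estimate, until the two are within epsilon.
    public double decide(int id, double epsilon) {
        while (true) {
            double mine = estimate.get(id);
            double other = estimate.get(1 - id);
            if (Math.abs(mine - other) <= epsilon) return mine;
            estimate.set(id, (mine + other) / 2.0); // move toward the other side
        }
    }
}
```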
Proceedings of the 32nd ACM Symposium on Parallelism in Algorithms and Architectures, 2020
A shared-memory counter is a well-studied and widely-used concurrent object. It supports two operations: an Inc operation that increases its value by 1 and a Read operation that returns its current value. Jayanti, Tan and Toueg [Jayanti et al., 2000] proved a linear lower bound on the worst-case step complexity of obstruction-free implementations, from read and write operations, of a large class of shared objects that includes counters. The lower bound leaves open the question of finding counter implementations with sub-linear amortized step complexity. In this paper, we address this gap. We present the first wait-free n-process counter, implemented using only read and write operations, whose amortized operation step complexity is O(log^2 n) in all executions. This is the first non-blocking read/write counter algorithm that provides sub-linear amortized step complexity in executions of arbitrary length. Since a logarithmic lower bound on the amortized step complexity of obstruction-free counter implementations is known, our upper bound is tight up to a logarithmic factor.
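For contrast with the O(log^2 n) result, the textbook wait-free read/write counter below costs Θ(n) steps per Read: each process increments a private register and Read sums all of them. A baseline sketch (names are ours), useful for seeing exactly what the paper improves:

```java
import java.util.concurrent.atomic.AtomicLongArray;

// Baseline wait-free counter from reads and writes only: one increment
// register per process, with Read summing all n of them in Θ(n) steps.
public class SimpleReadWriteCounter {
    private final AtomicLongArray slots;

    public SimpleReadWriteCounter(int nProcesses) {
        slots = new AtomicLongArray(nProcesses);
    }

    // Only process `id` ever writes slots[id], so a plain read-then-write
    // suffices; no read-modify-write primitive is needed.
    public void inc(int id) {
        slots.set(id, slots.get(id) + 1);
    }

    // Because every slot is monotonically non-decreasing, the returned sum
    // equals the true count at some instant during the Read, so the object
    // is linearizable even though the slots are read at different times.
    public long read() {
        long sum = 0;
        for (int i = 0; i < slots.length(); i++) sum += slots.get(i);
        return sum;
    }
}
```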
2021 IEEE 41st International Conference on Distributed Computing Systems (ICDCS), 2021
Relaxing the sequential specification of shared objects has been proposed as a promising approach to obtain implementations with better complexity. In this paper, we study the step complexity of relaxed variants of two common shared objects: max registers and counters. In particular, we consider the k-multiplicative-accurate max register and the k-multiplicative-accurate counter, where read operations are allowed to err by a multiplicative factor of k (for some k ∈ N). More precisely, reads are allowed to return an approximate value x of the maximum value v previously written to the max register, or of the number v of increments previously applied to the counter, respectively, such that v/k ≤ x ≤ v · k. We provide upper and lower bounds on the complexity of implementing these objects in a wait-free manner in the shared memory model.
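Concretely, with k = 2 a counter whose exact value is v = 10 may return any x with 5 ≤ x ≤ 20. The sequential model below shows how this slack can be exploited: it changes the value visible to readers only O(log_k v) times. A sketch under our own naming, not the paper's wait-free algorithm:

```java
// Sequential model of a k-multiplicative-accurate counter: reads return a
// stale "published" value that always satisfies exact/k <= published <= exact,
// and hence v/k <= x <= v * k. The point: the published value needs to change
// only O(log_k v) times over v increments. Illustrative sketch only.
public class KAccurateCounterModel {
    private final int k;
    private long exact = 0;     // hidden exact count
    private long published = 0; // last value exposed to readers

    public KAccurateCounterModel(int k) { this.k = k; }

    public synchronized void inc() {
        exact++;
        // Re-publish only when the old value drifts below exact/k; since
        // `published` grows by a factor > k each time, there are only
        // O(log_k v) publications over v increments.
        if (published == 0 || exact > published * k) {
            published = exact;
        }
    }

    public synchronized long read() {
        return published; // within a multiplicative factor of k of `exact`
    }
}
```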
Proceedings of the 39th Symposium on Principles of Distributed Computing, 2020
We present the first deterministic wait-free long-lived snapshot algorithm, using only read and write operations, that guarantees polylogarithmic amortized step complexity in all executions. This is the first non-blocking snapshot algorithm, using reads and writes only, that has sub-linear amortized step complexity in executions of arbitrary length. The key to our construction is a novel implementation of a 2-component max array object, which may be of independent interest.
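For reference, the sequential specification of the 2-component max array that the construction implements wait-free from reads and writes: write(i, v) raises component i to at least v, and read() returns both components atomically. A lock-based sketch of that specification (illustrative only; the paper's contribution is achieving it without locks at polylogarithmic amortized cost):

```java
// Lock-based reference implementation of a 2-component max array: write(i, v)
// raises component i to max(current, v); read() returns both components as
// one atomic pair. This only pins down the object's behavior.
public class MaxArray2 {
    private long c0 = 0, c1 = 0;

    public synchronized void write(int i, long v) {
        if (i == 0) c0 = Math.max(c0, v);
        else        c1 = Math.max(c1, v);
    }

    public synchronized long[] read() {
        return new long[] { c0, c1 }; // an atomic view of both components
    }
}
```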
Proceedings of the 39th Symposium on Principles of Distributed Computing, 2020
The emergence of systems with non-volatile main memory (NVM) increases the interest in the design of recoverable concurrent objects that are robust to crash-failures, since their operations are able to recover from such failures by using state retained in NVM. Of particular interest are recoverable algorithms that, in addition to ensuring object consistency, also provide detectability, a correctness condition requiring that the recovery code can infer if the failed operation was linearized or not and, in the former case, obtain its response. In this work, we investigate the space complexity of detectable algorithms and the external support they require. We make the following three contributions. First, we present the first wait-free bounded-space detectable read/write and CAS object implementations. Second, we prove that the bit complexity of every N-process obstruction-free detectable CAS implementation, assuming values from a domain of size at least N, is Ω(N). Finally, we prove that the following holds for obstruction-free detectable implementations of a large class of objects: their recoverable operations must be provided with auxiliary state (state that is not required by the non-recoverable counterpart implementation) whose value must be provided from outside the operation, either by the system or by the caller of the operation. In contrast, this external support is, in general, not required if the recoverable algorithm is not detectable.
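Stated operationally, detectability means that after a crash the owner of an interrupted operation can ask whether it took effect and, if so, what it returned. A hypothetical interface capturing this contract (the names are ours; the paper works in a shared-memory NVM model rather than Java):

```java
// Hypothetical interface for a detectable recoverable CAS object. If a crash
// interrupts compareAndSwap, the owning process calls recover() on restart:
// it must learn whether the interrupted CAS linearized and, if it did, obtain
// the response it would have returned. Names are illustrative only.
public interface DetectableCas<V> {
    // May be interrupted by a crash at any point.
    boolean compareAndSwap(int processId, V expected, V newValue);

    // Called by processId after restart, before invoking new operations.
    // Returns the outcome of the interrupted CAS, or empty if none was pending.
    java.util.Optional<Boolean> recover(int processId);
}
```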
Proceedings of the 2018 ACM Symposium on Principles of Distributed Computing, 2018
We present a novel abstract individual-process crash-recovery model for non-volatile memory, which enables modularity, so that complex recoverable objects can be constructed in a modular manner from simpler recoverable base objects. Within the framework of this model, we define nesting-safe recoverable linearizability (NRL), a novel correctness condition that captures the requirements for nesting recoverable objects. Informally, NRL allows the recovery code to extend the interval of the failed operation until the recovery code successfully completes (possibly after multiple failures and recovery attempts). Unlike previous correctness definitions, the NRL condition implies that, following recovery, an implemented (higher-level) recoverable operation is able to complete its invocation of a base-object operation and obtain its response. We present algorithms for nesting-safe recoverable primitives, namely, recoverable versions of widely-used primitive shared-memory operations such as read, write, test-and-set and compare-and-swap, which can be used to implement higher-level recoverable objects. We then exemplify how these recoverable base objects can be used for constructing a recoverable counter object. Finally, we prove an impossibility result on wait-free implementations of recoverable test-and-set (TAS) objects from read, write and TAS operations, thus demonstrating that our model also facilitates rigorous analysis of the limitations of recoverable concurrent objects.
Lecture Notes in Computer Science, 2016
Obstruction-free consensus, ensuring that a process running solo will eventually terminate, is at the core of practical ways to solve consensus, e.g., by using randomization or failure detectors. An obstruction-free consensus algorithm may not terminate in many executions, but it must terminate whenever a process runs solo. Such an algorithm can be evaluated by its solo step complexity, which bounds the worst-case number of steps taken by a process running alone, from any configuration, until it decides. This paper presents a lower bound of Ω(log n) on the solo step complexity of obstruction-free binary anonymous consensus. The proof constructs a sequence of executions in which more and more distinct variables are about to be written to, and then uses the backtracking covering technique to obtain a single execution in which many variables are accessed.
Proceedings of the twenty-fifth annual ACM symposium on Principles of distributed computing - PODC '06, 2006
It has been considered bon ton to blame locks for their fragility, especially since researchers identified obstruction-freedom: a progress condition that precludes locking while being weak enough to raise the hope for good performance. This paper attenuates this hope by establishing lower bounds on the complexity of obstruction-free implementations in contention-free executions: those where obstruction-freedom was precisely claimed to be effective. Through our lower bounds, we argue for an inherent cost of concurrent computing without locks. We first prove that obstruction-free implementations of a large class of objects, using only overwriting or trivial primitives in contention-free executions, have Ω(n) space complexity and Ω(log₂ n) (obstruction-free) step complexity. These bounds apply to implementations of many popular objects, including variants of fetch&add, counter, compare&swap, and LL/SC. When arbitrary primitives can be applied in contention-free executions, we show that, in any implementation of binary consensus, or any perturbable object, the number of distinct base objects accessed and memory stalls incurred by some process in a contention-free execution is Ω(√n). All these results hold regardless of the behavior of processes after they become aware of contention. We also prove that, in any obstruction-free implementation of a perturbable object in which processes are not allowed to fail their operations, the number of memory stalls incurred by some process that is unaware of contention is Ω(n).
Proceedings of the 29th ACM SIGACT-SIGOPS symposium on Principles of distributed computing - PODC '10, 2010
Mutual exclusion is a fundamental distributed coordination problem. Shared-memory mutual exclusion research focuses on local-spin algorithms and uses the remote memory references (RMRs) metric. A mutual exclusion algorithm is adaptive to point contention, if its RMR complexity is a function of the maximum number of processes concurrently executing their entry, critical, or exit section. In the best prior-art deterministic adaptive mutual exclusion algorithm, presented by Kim and Anderson [22], a process performs O(min(k, log N)) RMRs as it enters and exits its critical section, where k is point contention and N is the number of processes in the system. Kim and Anderson also proved that a deterministic algorithm with o(k) RMR complexity does not exist [21]. However, they describe a randomized mutual exclusion algorithm that has O(log k) expected RMR complexity against an oblivious adversary. All these results apply for algorithms that use only atomic read and write operations. We present a randomized adaptive mutual exclusion algorithm with O(log k / log log k) expected amortized RMR complexity, even against a strong adversary, for the cache-coherent shared-memory read/write model. Using techniques similar to those used in [17], our algorithm can be adapted for the distributed shared-memory read/write model. This establishes that sub-logarithmic adaptive mutual exclusion, using reads and writes only, is possible.
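The RMR metric counts only memory accesses that traverse the processor-to-memory interconnect, so local-spin algorithms keep every waiting process spinning on a variable it can access locally. The MCS-style queue lock below is the canonical local-spin pattern; note that it uses swap and CAS, which are stronger than the read/write primitives assumed by the results above:

```java
import java.util.concurrent.atomic.AtomicReference;

// MCS-style local-spin queue lock: each waiter spins only on the flag in its
// own node, so waiting costs O(1) RMRs in the cache-coherent model. Shown as
// the canonical local-spin pattern, not as one of the paper's algorithms.
public class McsLock {
    static class Node {
        volatile boolean locked = false;
        volatile Node next = null;
    }
    private final AtomicReference<Node> tail = new AtomicReference<>(null);
    private final ThreadLocal<Node> myNode = ThreadLocal.withInitial(Node::new);

    public void lock() {
        Node node = myNode.get();
        node.locked = true;
        node.next = null;
        Node pred = tail.getAndSet(node);   // enqueue myself atomically
        if (pred != null) {
            pred.next = node;
            while (node.locked) { }         // spin on my own node only
        }
    }

    public void unlock() {
        Node node = myNode.get();
        if (node.next == null) {
            // No visible successor: try to reset the queue to empty.
            if (tail.compareAndSet(node, null)) return;
            while (node.next == null) { }   // successor is mid-enqueue; wait
        }
        node.next.locked = false;           // hand the lock to the successor
    }
}
```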
Proceedings of the twenty-second annual symposium on Principles of distributed computing - PODC '03, 2003
This paper introduces operation-valency, a generalization of the valency proof technique originated by Fischer, Lynch, and Paterson. By focusing on critical events that influence the return values of individual operations rather than on critical events that influence a protocol's single return value, the new technique allows us to derive a collection of realistic lower bounds for lock-free implementations of concurrent objects such as linearizable queues, stacks, sets, hash tables, shared counters, approximate agreement, and more. By realistic we mean that they follow the real-world model introduced by Dwork, Herlihy, and Waarts, counting both memory references and memory stalls due to contention, and that they allow the combined use of read, write, and read-modify-write operations available on current machines. By using the operation-valency technique, we derive an Ω(√n) lower bound on the worst-case number of non-cached shared memory accesses incurred by lock-free implementations of objects in Influence(n), a wide class of concurrent objects including all of those mentioned above, in which an individual operation can be influenced by all others. We also prove the existence of a fundamental relationship between the space complexity, latency, contention, and "influence level" of any lock-free object implementation. Our results are broad in that they hold for implementations combining read/write memory and any collection of read-modify-write operations, and in that they apply even if shared memory words have unbounded size.
Proceedings of the 2015 ACM Symposium on Principles of Distributed Computing, 2015
Mutual exclusion is a fundamental distributed coordination problem. Shared-memory mutual exclusion research focuses on local-spin algorithms and uses the remote memory references (RMRs) metric. To ensure the correctness of concurrent algorithms in general, and mutual exclusion algorithms in particular, it is often required to prohibit certain re-orderings of memory instructions that may compromise correctness, by inserting memory fence (a.k.a. memory barrier) instructions. Memory fences incur non-negligible overhead and may significantly increase time complexity. A mutual exclusion algorithm is adaptive to total contention (or simply adaptive), if the time complexity of every passage (an entry to the critical section and the corresponding exit) is a function of total contention, that is, the number of processes, k, that participate in the execution in which that passage is performed. We say that an algorithm A is f-adaptive (and that f is an adaptivity function of A), if the time complexity of every passage in A is O(f(k)). Adaptive implementations are desirable when contention is much smaller than the total number of processes, n, sharing the implementation. Recent work [5] presented the first read/write mutual exclusion algorithm with asymptotically optimal complexity under both the RMRs and fences metrics: each passage through the critical section incurs O(log n) RMRs and a constant number of fences. The algorithm works in the popular Total Store Ordering (TSO) model. The algorithm of [5] is non-adaptive, however, and its authors posed the question of whether there exists an adaptive mutual exclusion algorithm with the same complexities. We provide a negative answer to this question, thus capturing an inherent cost of adaptivity. In fact, we prove a stronger result for adaptive read/write mutual exclusion algorithms…
46th Annual IEEE Symposium on Foundations of Computer Science (FOCS'05)
This paper proves Ω(n) lower bounds on the time to perform a single instance of an operation in any implementation of a large class of data structures shared by n processes. For standard data structures such as counters, stacks, and queues, the bound is tight. The implementations considered may apply any deterministic primitives to a base object. No bounds are assumed on either the number of base objects or their size. Time is measured as the number of steps a process performs on base objects and the number of stalls it incurs as a result of contention with other processes.
Proceedings of the 2015 ACM Symposium on Principles of Distributed Computing - PODC '15, 2015
Out-of-order execution of instructions is a common optimization technique for multicores and multiprocessors, which is governed by the memory model of the architecture. Relatively strong memory models, like TSO (supported by x86 and AMD), only allow reads to bypass earlier writes, while other models, like RMO (supported by ARM, POWER and Alpha) and PSO (supported by older SPARC), also allow the reordering of writes to different locations. These reorderings can be prevented by the use of costly fence instructions. In this paper we prove that when writes can be reordered (e.g., in RMO or even PSO), there is a tradeoff between the number of fences, f, and the number of remote memory references (RMRs), r, for a large class of objects, including locks, counters and queues…
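A concrete instance of the reordering at issue: publishing data behind a flag is broken under PSO/RMO unless a store-store fence separates the two writes. In the Java sketch below, VarHandle fences stand in for the architecture-level fence instructions whose count f the tradeoff bounds (the class and field names are ours):

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

// Message-passing pattern that breaks under write-reordering models
// (PSO/RMO) without a fence: the reader could see flag == true while data is
// still stale. releaseFence() plays the role of the store-store fence whose
// count the tradeoff lower-bounds.
public class FencedPublication {
    private int data = 0;
    private boolean flag = false;
    private static final VarHandle FLAG;
    static {
        try {
            FLAG = MethodHandles.lookup()
                    .findVarHandle(FencedPublication.class, "flag", boolean.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    public void publish(int value) {
        data = value;                   // plain write: no ordering by itself
        VarHandle.releaseFence();       // keep the data write before the flag write
        FLAG.setOpaque(this, true);     // flag write cannot pass the fence
    }

    public Integer tryConsume() {
        if (!(boolean) FLAG.getOpaque(this)) return null;
        VarHandle.acquireFence();       // keep the flag read before the data read
        return data;
    }
}
```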
Proceedings of the 28th ACM symposium on Principles of distributed computing - PODC '09, 2009
Mutual exclusion is a fundamental distributed coordination problem. Shared-memory mutual exclusion research focuses on local-spin algorithms and uses the remote memory references (RMRs) metric. A recent proof [9] established an Ω(log N) lower bound on the number of RMRs incurred by processes as they enter and exit the critical section, matching an upper bound by Yang and Anderson [18]. Both these bounds apply for algorithms that only use read and write operations. The lower bound of [9] only holds for deterministic algorithms, however; the question of whether randomized mutual exclusion algorithms, using reads and writes only, can achieve sub-logarithmic expected RMR complexity remained open. This paper answers this question in the affirmative. We present two strong-adversary [8] randomized local-spin mutual exclusion algorithms. In both algorithms, processes incur O(log N / log log N) expected RMRs per passage in every execution. Our first algorithm has sub-optimal worst-case RMR complexity of O((log N / log log N)^2). Our second algorithm is a variant of the first that can be combined with a deterministic algorithm, such as [18], to obtain O(log N) worst-case RMR complexity. The combined algorithm thus achieves sub-logarithmic expected RMR complexity while maintaining optimal worst-case RMR complexity. Our upper bounds apply for both the cache-coherent (CC) and the distributed shared memory (DSM) models.
Proceedings of the 22nd ACM symposium on Parallelism in algorithms and architectures - SPAA '10, 2010
Traditional data structure designs, whether lock-based or lock-free, provide parallelism via fine-grained synchronization among threads. We introduce a new synchronization paradigm based on coarse locking, which we call flat combining. The cost of synchronization in flat combining is so low that having a single thread holding a lock perform the combined access requests of all others delivers, up to a certain non-negligible concurrency level, better performance than the most effective parallel finely synchronized implementations. We use flat combining to devise, among other structures, new linearizable stack, queue, and priority queue algorithms that greatly outperform all prior algorithms.
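A minimal sketch of the paradigm: each thread publishes its request in a per-thread slot, and whichever thread wins the coarse lock becomes the combiner, applying all pending requests to an ordinary sequential queue. This illustrates the mechanism only; the published algorithms add dynamic publication lists and combining-aware structures:

```java
import java.util.ArrayDeque;
import java.util.concurrent.atomic.AtomicReferenceArray;
import java.util.concurrent.locks.ReentrantLock;

// Minimal flat-combining queue sketch: threads post requests in per-thread
// slots; the thread that acquires the coarse lock (the combiner) serves
// everyone by applying their requests to a plain sequential queue.
public class FlatCombiningQueue<T> {
    private static final class Enq { final Object value; Enq(Object v) { value = v; } }
    private static final Object DEQ = new Object();        // dequeue request marker

    private final AtomicReferenceArray<Object> slots;      // null means "no request"
    private final Object[] results;                        // combiner-written responses
    private final ArrayDeque<Object> queue = new ArrayDeque<>();
    private final ReentrantLock lock = new ReentrantLock();

    public FlatCombiningQueue(int nThreads) {
        slots = new AtomicReferenceArray<>(nThreads);
        results = new Object[nThreads];
    }

    public void enqueue(int tid, T value) { run(tid, new Enq(value)); }

    @SuppressWarnings("unchecked")
    public T dequeue(int tid) { return (T) run(tid, DEQ); }

    private Object run(int tid, Object request) {
        slots.set(tid, request);                            // publish my request
        while (slots.get(tid) != null) {                    // wait to be served...
            if (lock.tryLock()) {                           // ...or become the combiner
                try { combine(); } finally { lock.unlock(); }
            }
        }
        return results[tid];
    }

    private void combine() {
        for (int i = 0; i < slots.length(); i++) {
            Object req = slots.get(i);
            if (req == null) continue;
            if (req == DEQ) {
                results[i] = queue.pollFirst();             // null if queue is empty
            } else {
                queue.addLast(((Enq) req).value);
                results[i] = null;
            }
            slots.set(i, null);                             // signal completion
        }
    }
}
```

The result handoff is safe because the combiner's volatile clear of the slot orders its plain write to results[i] before the waiter's read.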
ACM SIGPLAN Notices, 2011
Building correct and efficient concurrent algorithms is known to be a difficult problem of fundamental importance. To achieve efficiency, designers try to remove unnecessary and costly synchronization. However, not only is this manual trial-and-error process ad hoc, time consuming and error-prone, but it often leaves designers pondering the question: is it inherently impossible to eliminate certain synchronization, or is it that I was unable to eliminate it on this attempt and I should keep trying? In this paper we respond to this question. We prove that it is impossible to build concurrent implementations of classic and ubiquitous specifications such as sets, queues, stacks, mutual exclusion and read-modify-write operations, that completely eliminate the use of expensive synchronization. We prove that one cannot avoid the use of either: i) read-after-write (RAW), where a write to shared variable A is followed by a read to a different shared variable B without a write to B in between, or ii) atomic write-after-read (AWAR), in which an atomic operation (such as compare-and-swap) reads and then writes to shared memory.
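The RAW pattern, concretely: a process writes its announcement to one shared variable and must then read a different shared variable to check for rivals, with a full fence in between so the read cannot bypass the write. A two-process sketch of that pattern (our naming; in Java the atomic accesses already imply the barrier, and the explicit fence only marks where the cost sits):

```java
import java.lang.invoke.VarHandle;
import java.util.concurrent.atomic.AtomicIntegerArray;

// The read-after-write (RAW) pattern: write my announcement to shared
// variable A, fence, then read a different shared variable B. The fence is
// what makes RAW expensive; the alternative the paper identifies is AWAR,
// an atomic read-then-write such as CAS.
public class RawPattern {
    private final AtomicIntegerArray announce = new AtomicIntegerArray(2);

    // Two-process sketch: returns true if process `id` saw no rival after
    // announcing itself -- the core step of Dekker/Peterson-style entry code.
    public boolean announceAndCheck(int id) {
        announce.set(id, 1);               // write to shared variable A
        VarHandle.fullFence();             // the read below must not bypass the write
        return announce.get(1 - id) == 0;  // read a different shared variable B
    }
}
```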
SIAM Journal on Computing, 2012
We present Ω(n) lower bounds on the worst-case time to perform a single instance of an operation in any nonblocking implementation of a large class of concurrent data structures shared by n processes. Time is measured by the number of stalls a process incurs as a result of contention with other processes. For standard data structures such as counters, stacks, and queues, our bounds are tight. The implementations considered may apply any primitives to a base object. No upper bounds are assumed on either the number of base objects or their size.
Journal of the ACM, 2009
Obstruction-free implementations of concurrent objects are optimized for the common case where there is no step contention, and were recently advocated as a solution to the costs associated with synchronization without locks. In this article, we study this claim, which requires precisely defining the notions of obstruction-freedom and step contention. We consider several classes of obstruction-free implementations, present corresponding generic object implementations, and prove lower bounds on their complexity. Viewed collectively, our results establish that the worst-case operation time complexity of obstruction-free implementations is high, even in the absence of step contention. We also show that lock-based implementations are not subject to some of the time-complexity lower bounds we present.