On the Design and Implementation of an Efficient Lock-Free Scheduler

Jumbler: A lock-contention aware thread scheduler for multi-core parallel machines

2017 International Conference on Recent Advances in Signal Processing, Telecommunications & Computing (SigTelCom), 2017

On a cache-coherent multi-core, multi-processor parallel machine, the execution time of a multi-threaded application with high lock contention is extremely sensitive to the distribution of application threads across processors. Improper mapping of threads loses performance because of frequent lock transfers between sockets: as the lock object migrates between processors, last-level cache misses multiply, and the increase in last-level cache misses degrades program execution. Operating-system thread schedulers are unaware of lock contention, so default scheduling loses performance, especially for applications with high lock contention. To mitigate the problem, we propose a novel scheduling technique that extends an existing approach called shuffling. Our proposed scheduler migrates and maps the threads of a multi-threaded application across sockets so that threads contending for the same lock are placed on the same socket. Threads mapped together (using the same lock) incur a low number of last-level cache misses. We evaluate the proposed scheduler on a system with 2 sockets of 4 cores each, using multi-threaded parallel benchmarks. The experiments show that our algorithm speeds up execution by up to 986.7%. Moreover, it requires no changes to the application source code or the operating system kernel.
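
To make the placement idea concrete, below is a minimal sketch (not Jumbler itself) of contention-aware mapping via CPU affinity. It assumes a Linux system where cores 0-3 belong to socket 0 and cores 4-7 to socket 1; the layout, the CORES_PER_SOCKET constant, and the pin_to_socket helper are all hypothetical. Threads known to contend on the same lock are pinned to one socket so the lock's cache line never crosses the inter-socket link.

```c
/* Sketch: co-locate lock-contending threads on one socket (Linux). */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

#define CORES_PER_SOCKET 4             /* hypothetical topology */

/* Pin the calling thread to all cores of one socket. */
static void pin_to_socket(int socket)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int c = 0; c < CORES_PER_SOCKET; c++)
        CPU_SET(socket * CORES_PER_SOCKET + c, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static pthread_mutex_t hot_lock = PTHREAD_MUTEX_INITIALIZER;
static long counter;

static void *worker(void *arg)
{
    pin_to_socket((int)(long)arg);     /* all contenders -> socket 0 */
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&hot_lock);
        counter++;                     /* the contended critical section */
        pthread_mutex_unlock(&hot_lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t[4];
    for (int i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, worker, (void *)0L);
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter);
    return 0;
}
```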

High-level lock-less programming for multi-core

Advanced Computer Architecture and Compilation for High-Performance and Embedded Systems (ACACES), Poster Abstracts, Fiuggi, Italy, 2012

Modern computers are built upon multi-core architectures. Achieving peak performance on these architectures is hard and may require substantial programming effort. Synchronising many processes racing to access a common resource (the shared memory) has been a fundamental problem in parallel computing for years, and many solutions have been proposed to address it. Non-blocking synchronisation and transactional primitives have been envisioned as a way to mitigate the memory wall problem. Although sometimes effective (and enjoying great momentum in the research community), they address only one facet of the problem, as their exploitation still requires non-trivial programming skills. With this non-blocking philosophy in mind, we propose high-level programming patterns that relieve the programmer from low-level details such as the synchronisation of racing processes, as well as the fine-tuning needed to improve overall performance, such as proper (distributed) dynamic memory allocation and effective exploitation of the memory hierarchy.
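
As an illustration of what such a pattern might look like (the chan_* names and CAP constant are hypothetical, not the authors' API), the sketch below wraps a wait-free single-producer/single-consumer ring buffer behind a plain send/receive interface: the caller never sees the atomics or the memory ordering.

```c
/* Sketch: a bounded SPSC channel hiding C11 atomics behind a high-level API. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

#define CAP 1024                       /* capacity, a power of two */

typedef struct {
    _Atomic size_t head, tail;         /* consumer / producer cursors */
    int buf[CAP];
} chan_t;

static bool chan_send(chan_t *c, int v)          /* single producer */
{
    size_t t = atomic_load_explicit(&c->tail, memory_order_relaxed);
    size_t h = atomic_load_explicit(&c->head, memory_order_acquire);
    if (t - h == CAP)
        return false;                  /* full */
    c->buf[t % CAP] = v;
    atomic_store_explicit(&c->tail, t + 1, memory_order_release);
    return true;
}

static bool chan_recv(chan_t *c, int *v)         /* single consumer */
{
    size_t h = atomic_load_explicit(&c->head, memory_order_relaxed);
    size_t t = atomic_load_explicit(&c->tail, memory_order_acquire);
    if (h == t)
        return false;                  /* empty */
    *v = c->buf[h % CAP];
    atomic_store_explicit(&c->head, h + 1, memory_order_release);
    return true;
}

int main(void)
{
    static chan_t c;                   /* zero-initialised */
    chan_send(&c, 42);
    int v;
    if (chan_recv(&c, &v))
        printf("received %d\n", v);
    return 0;
}
```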

Decoupling contention management from scheduling

ACM SIGARCH Computer Architecture News, 2010

Many parallel applications exhibit unpredictable communication between threads, leading to contention for shared objects. The choice of contention management strategy strongly impacts the performance and scalability of these applications: spinning provides maximum performance but wastes significant processor resources, while blocking-based approaches conserve processor resources but introduce high overheads on the critical path of computation. Under high or changing load, the operating system complicates matters further with arbitrary scheduling decisions that often preempt lock holders, leading to long serialization delays until the preempted thread resumes execution. We observe that contention management is orthogonal to the problems of scheduling and load management and propose to decouple them so each may be solved independently and effectively. To this end, we propose a load control mechanism which manages the number of active threads in the system separately from contention management.
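
A minimal sketch of the decoupling idea follows (load_gate and ACTIVE_LIMIT are hypothetical, not the paper's mechanism): a counting semaphore caps how many threads are active at once, while each lock keeps its own contention-management policy, here plain spinning. Either half can be tuned or replaced without touching the other.

```c
/* Sketch: load control (semaphore) decoupled from contention management (spin). */
#include <pthread.h>
#include <semaphore.h>
#include <stdatomic.h>
#include <stdio.h>

#define ACTIVE_LIMIT 4                 /* threads admitted at a time */

static sem_t load_gate;                        /* load management */
static atomic_flag lk = ATOMIC_FLAG_INIT;      /* contention management */
static long shared;

static void *worker(void *arg)
{
    (void)arg;
    sem_wait(&load_gate);              /* admission: load control */
    for (int i = 0; i < 100000; i++) {
        while (atomic_flag_test_and_set_explicit(&lk, memory_order_acquire))
            ;                          /* spin: the contention policy */
        shared++;
        atomic_flag_clear_explicit(&lk, memory_order_release);
    }
    sem_post(&load_gate);
    return NULL;
}

int main(void)
{
    pthread_t t[8];
    sem_init(&load_gate, 0, ACTIVE_LIMIT);
    for (int i = 0; i < 8; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 8; i++)
        pthread_join(t[i], NULL);
    printf("shared = %ld\n", shared);
    return 0;
}
```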

Performance Impact of Lock-Free Algorithms on Multicore Communication APIs

Data race conditions in multi-tasking software applications are prevented by serializing access to shared memory resources, ensuring data consistency and deterministic behavior. Traditionally, tasks acquire and release locks to synchronize operations on shared memory. Unfortunately, lock management can add significant processing overhead, especially in multicore deployments where tasks on different cores convoy in queues waiting to acquire a lock. Using more than one lock introduces the risk of deadlock, and using spinlocks constrains which cores a task can run on. The better alternative is to eliminate locks and validate that real-time properties are still met, an approach not directly considered in many embedded applications. Removing the locks is non-trivial, and packaging lock-free algorithms for developers reduces the possibility of concurrency defects. This paper details how a multicore communication API implementation is enhanced to support lock-free messaging and the impact this has on data exchange latency between tasks. Throughput and latency are compared on Windows and Linux between lock-based and lock-free implementations for the exchange of messages, packets, and scalars. A model of the lock-free exchange predicts performance at the system architecture level and provides a stop criterion for the refactoring. The results show that migrating from single-core to multicore hardware degrades lock-based performance while improving lock-free performance.
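
For the simplest of the three exchange kinds measured, a shared scalar, the contrast looks roughly like the sketch below (not the paper's API): the lock-based variant serialises every access through a mutex, while the lock-free variant is a single atomic store or load, so no task can convoy behind a preempted lock holder.

```c
/* Sketch: lock-based vs. lock-free exchange of a shared scalar. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

/* Lock-based scalar exchange. */
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static double locked_val;

void locked_write(double v) { pthread_mutex_lock(&m); locked_val = v; pthread_mutex_unlock(&m); }
double locked_read(void)    { pthread_mutex_lock(&m); double v = locked_val; pthread_mutex_unlock(&m); return v; }

/* Lock-free scalar exchange: one atomic cell; the release/acquire pair
 * guarantees the reader sees a complete, most-recent value. */
static _Atomic double free_val;

void lockfree_write(double v) { atomic_store_explicit(&free_val, v, memory_order_release); }
double lockfree_read(void)    { return atomic_load_explicit(&free_val, memory_order_acquire); }

int main(void)
{
    /* On mainstream 64-bit targets the atomic double is lock-free;
     * atomic_is_lock_free() lets the program verify that at runtime. */
    printf("scalar cell lock-free? %d\n", atomic_is_lock_free(&free_val));
    lockfree_write(3.14);
    printf("read back %.2f\n", lockfree_read());
    return 0;
}
```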

Scheduling directives for shared-memory many-core processor systems

Proceedings of the 2013 International Workshop on Programming Models and Applications for Multicores and Manycores - PMAM '13, 2013

We consider many-core processors with task-oriented programming, whereby scheduling constraints among tasks are decided offline and then enforced by the runtime system. Here, exposing and beneficially exploiting fine-grain data and control parallelism is increasingly important, so high expressive power for stating such constraints/directives, along with the ability to implement them in fast, simple hardware, is critical for success. In this paper, we focus on the relationship between duplicable tasks, which are used to express and exploit data parallelism. We extend the conventional Start-After-Complete (precedence) constraint so that it can also be applied between replicas of different such tasks rather than only between entire tasks, thereby increasing the exposable parallelism. Additionally, we propose the parameterized Start-After-Start constraint, which can be used to control the degree of "lockstep" among multiple such tasks, e.g., in order to improve cache performance when the tasks work on the same data, and we briefly describe several additional directives. Finally, we show that the directives can be supported efficiently in hardware. The discussion uses Hypercore, a very efficient CREW PRAM-like shared-cache architecture; it is a challenging target because its dispatching of basic constraints is extremely fast. The new directives, however, have broader applicability.
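
As a software analogy of the parameterized Start-After-Start directive (the paper enforces it in the Hypercore hardware dispatcher; the task_t type, sas_ok helper, and slack parameter here are hypothetical), the admission rule might read: replica i of the successor task may begin once replica i - slack of the predecessor has begun, with slack = 0 meaning strict lockstep.

```c
/* Sketch: the admission test of a parameterized Start-After-Start constraint. */
#include <stdbool.h>
#include <stdio.h>

typedef struct { int started; } task_t; /* replicas 0..started-1 have begun */

/* May replica `i` of the successor be dispatched, given predecessor `a`?
 * slack = 0: strict lockstep (wait for A's replica i to start);
 * larger slack lets the successor run up to `slack` replicas ahead. */
static bool sas_ok(const task_t *a, int i, int slack)
{
    return a->started > i - slack;
}

int main(void)
{
    task_t a = { .started = 3 };       /* A has begun replicas 0, 1, 2 */
    for (int i = 0; i < 6; i++)
        printf("successor replica %d dispatchable (slack 1): %s\n",
               i, sas_ok(&a, i, 1) ? "yes" : "no");
    return 0;
}
```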

Real-time computing with lock-free shared objects

ACM Transactions on Computer Systems, 1997

This paper considers the use of lock-free shared objects within hard real-time systems. As the name suggests, lock-free shared objects are distinguished by the fact that they are not locked. As such, they do not give rise to priority inversions, a key advantage over conventional, lock-based object-sharing approaches. Despite this advantage, it is not immediately apparent that lock-free shared objects can be employed if tasks must adhere to strict timing constraints. In particular, lock-free object implementations permit concurrent operations to interfere with each other, and repeated interferences can cause a given operation to take an arbitrarily long time to complete. The main contribution of this paper is to show that such interferences can be bounded by judicious scheduling. This work pertains to periodic, hard real-time tasks that share lock-free objects on a uniprocessor. In the first part of the paper, scheduling conditions are derived for such tasks, for both static and dynamic priority schemes. Based on these conditions, it is formally shown that lock-free object-sharing approaches can be expected to incur much less overhead than approaches based on wait-free objects or lock-based schemes. In the last part of the paper, this conclusion is validated experimentally through work involving a real-time desktop videoconferencing system.
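
The retry-loop structure whose interference the paper's analysis bounds looks roughly like the sketch below (the shared counter and retry bookkeeping are illustrative, not the paper's objects): each interference by a concurrent task forces exactly one more loop iteration, and the derived scheduling conditions bound the total retries per job a priori.

```c
/* Sketch: the canonical lock-free retry loop on a shared object. */
#include <stdatomic.h>
#include <stdio.h>

static _Atomic long obj;               /* the lock-free shared object */

/* One lock-free operation: read-modify-write via compare-and-swap.
 * Returns how many retries interfering operations caused. */
static int lockfree_add(long delta)
{
    int retries = 0;
    long old = atomic_load(&obj);
    /* The strong CAS fails only if another task updated obj in between,
     * i.e. on interference; no task ever blocks, so there is no
     * priority inversion, only bounded retrying. */
    while (!atomic_compare_exchange_strong(&obj, &old, old + delta))
        retries++;
    return retries;
}

int main(void)
{
    int r = lockfree_add(5);
    printf("obj = %ld after %d retries\n", atomic_load(&obj), r);
    return 0;
}
```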

Scheduler-Conscious Synchronization

ACM Transactions on Computer Systems, 1997

Efficient synchronization is important for achieving good performance in parallel programs, especially on large-scale multiprocessors. Most synchronization algorithms have been designed to run on a dedicated machine, with one application process per processor, and can suffer serious performance degradation in the presence of multiprogramming. Problems arise when running processes block or, worse, busy-wait for action on the part of a process that the scheduler has chosen not to run.
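
One common scheduler-conscious remedy for the busy-wait case can be sketched as follows (SPIN_BUDGET is a hypothetical tuning knob, not an algorithm from the paper): spin briefly in case the holder is running on another processor, then yield the CPU so a preempted holder can be rescheduled instead of being busy-waited on for a whole quantum.

```c
/* Sketch: a spin-then-yield lock for multiprogrammed machines. */
#include <sched.h>
#include <stdatomic.h>

#define SPIN_BUDGET 1000               /* spins before yielding */

static atomic_flag lk = ATOMIC_FLAG_INIT;

void sc_lock(void)
{
    for (;;) {
        for (int i = 0; i < SPIN_BUDGET; i++)
            if (!atomic_flag_test_and_set_explicit(&lk, memory_order_acquire))
                return;                /* acquired while spinning */
        sched_yield();                 /* let a preempted holder run */
    }
}

void sc_unlock(void)
{
    atomic_flag_clear_explicit(&lk, memory_order_release);
}

int main(void)
{
    sc_lock();
    sc_unlock();
    return 0;
}
```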

OS and Runtime Support for Efficiently Managing Cores in Parallel Applications

Parallel applications can benefit from the ability to explicitly control their thread scheduling policies in user-space. However, modern operating systems lack the interfaces necessary to make this type of "user-level" scheduling efficient. The key component missing is the ability for applications to gain direct access to cores and keep control of those cores even when performing I/O operations that traditionally block in the kernel. A number of earlier systems provided limited support for these capabilities, but they have been abandoned in modern systems, where all efforts are now focused exclusively on kernel-based scheduling approaches similar to Linux’s. In many circles, the term "kernel" has actually become synonymous with Linux, and its Unix-based model of process and thread management is often assumed as the only possible choice. In this work, we explore the OS and runtime support required to resurrect user-level threading as a viable mechanism for exploiting parallelism on modern systems. The idea is that applications request cores, not threads, from the underlying system and employ user-level scheduling techniques to multiplex their own set of threads on top of those cores. So long as an application has control of a core, it will have uninterrupted, dedicated access to it. This gives developers more control over what runs where (and when), so they can design their algorithms to take full advantage of the parallelism available to them. They express their parallelism via core requests and their concurrency via user-level threads. We frame our work with a discussion of Akaros, a new, experimental operating system we developed, whose “Many-Core Process” (MCP) abstraction is built specifically with user-level scheduling in mind. The Akaros kernel provides low-level primitives for gaining direct access to cores, and a user-level library, called parlib, provides a framework for building custom threading packages that make use of those cores. From a parlib-based version of pthreads to a port of Lithe and the popular Go programming language, the combination of Akaros and parlib proves itself as a powerful medium to help bring user-level scheduling back from the dark ages.
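
The core mechanism, multiplexing user-level threads on a core the application controls, can be sketched with portable ucontext(3) rather than parlib's actual API: two user threads take turns under an application-chosen round-robin policy, with no kernel involvement on the switch path. A real parlib-based package builds full schedulers this way on cores granted by the Akaros MCP abstraction.

```c
/* Sketch: user-level threads multiplexed by a user-space scheduler. */
#include <stdio.h>
#include <ucontext.h>

static ucontext_t sched_ctx, thr_ctx[2];
static char stacks[2][64 * 1024];

static void body(int id)
{
    for (int step = 0; step < 3; step++) {
        printf("uthread %d, step %d\n", id, step);
        swapcontext(&thr_ctx[id], &sched_ctx);  /* yield to scheduler */
    }
}

int main(void)
{
    for (int i = 0; i < 2; i++) {
        getcontext(&thr_ctx[i]);
        thr_ctx[i].uc_stack.ss_sp = stacks[i];
        thr_ctx[i].uc_stack.ss_size = sizeof(stacks[i]);
        thr_ctx[i].uc_link = &sched_ctx;        /* return here on exit */
        makecontext(&thr_ctx[i], (void (*)(void))body, 1, i);
    }
    /* The "scheduler": round-robin, entirely in user space. */
    for (int round = 0; round < 3; round++)
        for (int i = 0; i < 2; i++)
            swapcontext(&sched_ctx, &thr_ctx[i]);
    return 0;
}
```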

A scalable lock manager for multicores

Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, 2013

Modern implementations of DBMS software are intended to take advantage of the high core counts that are becoming common in high-end servers. However, we have observed that several database platforms, including MySQL, Shore-MT, and a commercial system, exhibit throughput collapse as load increases, even for workloads with little or no logical contention for locks. Our analysis of MySQL identifies latch contention within the lock manager as the bottleneck responsible for this collapse. We design a lock manager with reduced latching, implement it in MySQL, and show that it avoids the collapse and generally improves performance. Our efficient lock manager implementation is enabled by staged allocation and de-allocation of locks. Locks are pre-allocated in bulk, so the lock manager only has to perform simple list-manipulation operations during the acquire and release phases of a transaction. De-allocation of the lock data structures is also performed in bulk, which enables fast implementations of lock acquisition and release, as well as concurrent deadlock checking.
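
The staged-allocation idea can be sketched as follows (the struct and function names are illustrative, not MySQL's): lock records for a transaction come from a bulk-allocated pool, so acquiring a lock is a pop plus a list link, and releasing everything at commit or abort returns the whole batch at once, with no per-lock malloc/free on the hot path.

```c
/* Sketch: bulk pre-allocation and bulk release of per-transaction lock records. */
#include <stdlib.h>

typedef struct lock_rec {
    struct lock_rec *next;             /* free list / txn lock list */
    int resource_id;                   /* what this record locks */
} lock_rec_t;

typedef struct {
    lock_rec_t *pool;                  /* pre-allocated free list */
    lock_rec_t *held;                  /* locks owned by this txn */
    lock_rec_t *chunk;                 /* the bulk allocation itself */
} txn_locks_t;

/* Stage 1: bulk pre-allocation, done once per transaction. */
static void txn_locks_init(txn_locks_t *t, int n)
{
    t->chunk = calloc(n, sizeof(lock_rec_t));
    t->pool = t->held = NULL;
    for (int i = 0; i < n; i++) {
        t->chunk[i].next = t->pool;
        t->pool = &t->chunk[i];
    }
}

/* Acquire: simple list manipulation, no allocator call. */
static lock_rec_t *txn_lock(txn_locks_t *t, int resource_id)
{
    lock_rec_t *r = t->pool;
    if (!r)
        return NULL;                   /* pool exhausted: grow in bulk */
    t->pool = r->next;
    r->resource_id = resource_id;
    r->next = t->held;
    t->held = r;
    return r;
}

/* Stage 2: bulk de-allocation at commit/abort. */
static void txn_release_all(txn_locks_t *t)
{
    free(t->chunk);                    /* one free for every lock */
    t->pool = t->held = t->chunk = NULL;
}

int main(void)
{
    txn_locks_t t;
    txn_locks_init(&t, 64);
    txn_lock(&t, 42);
    txn_lock(&t, 7);
    txn_release_all(&t);
    return 0;
}
```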