Performance Impact of Lock-Free Algorithms on Multicore Communication APIs (original) (raw)
Data race conditions in multi-tasking software applications are prevented by serializing access to shared memory resources, ensuring data consistency and deterministic behavior. Traditionally tasks acquire and release locks to synchronize operations on shared memory. Unfortunately, lock management can add significant processing overhead especially for multicore deployments where tasks on different cores convoy in queues waiting to acquire a lock. Implementing more than one lock introduces the risk of deadlock and using spinlocks constrains which cores a task can run on. The better alternative is to eliminate locks and validate that real-time properties are met, which is not directly considered in many embedded applications. Removing the locks is non-trivial and packaging lock-free algorithms for developers reduces the possibility of concurrency defects. This paper details how a multicore communication API implementation is enhanced to support lock-free messaging and the impact this has on data exchange latency between tasks. Throughput and latency are compared on Windows and Linux between lock-based and lock-free implementations for data exchange of messages, packets, and scalars. A model of the lock-free exchange predicts performance at the system architecture level and provides a stop criterion for the refactoring. The results show that migration from single to multicore hardware architectures degrades lock-based performance, and increases lock-free performance.
Related papers
High-level lock-less programming for multi-core
Advanced Computer Architecture and Compilation for High-Performance and Embedded Systems (ACACES) — Poster Abstracts, Fiuggi, Italy, 2012. , 2012
Modern computers are built upon multi-core architectures. Achieving peak performance on these architectures is hard and may require a substantial programming effort. The synchronisation of many processes racing to access a common resource (the shared memory) has been a fundamental problem on parallel computing for years, and many solutions have been proposed to address this issue. Non-blocking synchronisation and transactional primitives have been envisioned as a way to reduce memory wall problem. Despite sometimes effective (and exhibiting a great momentum in the research community), they are only one facet of the problem, as their exploitation still requires non-trivial programming skills. With non-blocking philosophy in mind, we propose high-level programming patterns that will relieve the programmer from worrying about low-level details such as synchronisation of racing processes as well as those fine tunings needed to improve the overall performance, like proper (distributed) dynamic memory allocation and effective exploitation of the memory hierarchy.
Contention in Multicore Hardware Shared Resources: Understanding of the State of the Art
2014
The real-time systems community has over the years devoted considerable attention to the impact on execution timing that arises from contention on access to hardware shared resources. The relevance of this problem has been accentuated with the arrival of multicore processors. From the state of the art on the subject, there appears to be considerable diversity in the understanding of the problem and in the "approach" to solve it. This sparseness makes it difficult for any reader to form a coherent picture of the problem and solution space. This paper draws a tentative taxonomy in which each known approach to the problem can be categorised based on its specific goals and assumptions.
Loading Preview
Sorry, preview is currently unavailable. You can download the paper by clicking the button above.
Related papers
Proceedings of the 8th annual IEEE/ ACM international symposium on Code generation and optimization - CGO '10, 2010