A dynamic-sized nonblocking work stealing deque
Related papers
Dynamic circular work-stealing deque
Proceedings of the 17th annual ACM symposium on Parallelism in algorithms and architectures - SPAA'05, 2005
The non-blocking work-stealing algorithm of Arora, Blumofe, and Plaxton (henceforth ABP work-stealing) is on its way to becoming the multiprocessor load balancing technology of choice in both industry and academia. This highly efficient scheme is based on a collection of array-based double-ended queues (deques) with low cost synchronization among local and stealing processes. Unfortunately, the algorithm's synchronization protocol is strongly based on the use of fixed size arrays, which are prone to overflows, especially in the multiprogrammed environments for which they are designed. We present a work-stealing deque that does not have the overflow problem. The only ABP-style work-stealing algorithm that eliminates the overflow problem is the list-based one presented by Hendler, Lev and Shavit. Their algorithm indeed deals with the overflow problem, but it is complicated, and introduces a trade-off between the space and time complexity, due to the extra work required to maintain the list. Our new algorithm presents a simple lock-free work-stealing deque, which stores the elements in a cyclic array that can grow when it overflows. The algorithm has no limit other than integer overflow (and the system's memory size) on the number of elements that may be on the deque, and the total memory required is linear in the number of elements in the deque.
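The growable cyclic array at the heart of this deque can be sketched as below. This is a single-threaded Python illustration only: the lock-free synchronization (a CAS on `top` that arbitrates between thieves and the owner) is reduced to comments, and class and method names are chosen for illustration rather than taken from the paper's pseudocode.

```python
class CircularArray:
    """Fixed-capacity buffer indexed modulo its (power-of-two) size."""

    def __init__(self, log_size=3):
        self.log_size = log_size
        self.items = [None] * (1 << log_size)

    def size(self):
        return 1 << self.log_size

    def get(self, i):
        return self.items[i % self.size()]   # indices wrap around the array

    def put(self, i, item):
        self.items[i % self.size()] = item

    def grow(self, bottom, top):
        # Copy the live elements [top, bottom) into an array twice as large;
        # this is what removes the fixed-size overflow limit.
        bigger = CircularArray(self.log_size + 1)
        for i in range(top, bottom):
            bigger.put(i, self.get(i))
        return bigger


class Deque:
    def __init__(self):
        self.top = 0        # index thieves steal from
        self.bottom = 0     # index the owner pushes/pops at
        self.array = CircularArray()

    def push_bottom(self, item):
        if self.bottom - self.top >= self.array.size():
            self.array = self.array.grow(self.bottom, self.top)
        self.array.put(self.bottom, item)
        self.bottom += 1

    def pop_bottom(self):
        if self.bottom == self.top:
            return None
        self.bottom -= 1
        return self.array.get(self.bottom)

    def steal(self):
        if self.top >= self.bottom:
            return None
        item = self.array.get(self.top)
        self.top += 1   # the real algorithm does CAS(top, top+1) here
        return item
```

Because indices only grow and wrapping is done at access time, growing the array never invalidates a concurrent thief's index, which is the property the lock-free version relies on.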
Dynamic Multiple Work Stealing Strategy for Flexible Load Balancing
IEICE Transactions on Information and Systems, 2012
Lazy-task creation is an efficient method of overcoming the overhead of the grain-size problem in parallel computing. Work stealing is an effective load balancing strategy for parallel computing. In this paper, we present dynamic work stealing strategies in a lazy-task creation technique for efficient fine-grain task scheduling. The basic idea is to control load balancing granularity depending on the number of task parents in a stack. The dynamic-length strategy of work stealing uses run-time information, which is information on the load of the victim, to determine the number of tasks that a thief is allowed to steal. We compare it with the bottommost first work stealing strategy used in StackThread/MP, and the fixed-length strategy of work stealing, where a thief requests to steal a fixed number of tasks, as well as other multithreaded frameworks such as Cilk and OpenMP task implementations. The experiments show that the dynamic-length strategy of work stealing performs well in irregular workloads such as in UTS benchmarks, as well as in regular workloads such as Fibonacci, Strassen's matrix multiplication, FFT, and Sparse-LU factorization. The dynamic-length strategy works better than the fixed-length strategy because it is more flexible than the latter; this strategy can avoid load imbalance due to overstealing.
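As a hedged illustration of the contrast between fixed-length and dynamic-length stealing, the sketch below sizes a steal from the victim's current load using a steal-half rule. The rule and all names are assumptions for illustration; the paper's actual policy is driven by the number of task parents in the stack.

```python
from collections import deque

def dynamic_steal(victim: deque, thief: deque) -> int:
    """Move a load-dependent number of tasks from victim to thief.

    Steal-half is an illustrative dynamic-length rule: the amount adapts
    to the victim's run-time load, whereas a fixed-length strategy always
    requests the same count and can oversteal from a lightly loaded victim.
    """
    n = len(victim) // 2
    for _ in range(n):
        thief.append(victim.popleft())  # take from the end opposite the owner
    return n
```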
International Journal of Computer Applications, 2013
This paper reports a preliminary experimental evaluation of a Processing Element Stealing (PE-S) technique, targeted as an efficient and scalable load balancing technique for dynamically structured multiprocessor systems. The multiprocessor system is modeled as a dynamic cluster-based multiprocessor: each cluster is a node with a symmetric multiprocessor architecture, and the number of Processing Elements (PEs) in each cluster is determined dynamically at runtime. The PE-S technique dynamically computes a configuration ratio from the number of threads in the dynamically assigned tasks to generate the new number of PEs for each cluster. This new configuration ratio is then used to balance the additional computational work generated by runtime instantiation of the current workloads for each cluster. In this work, the efficiency of PE-S was evaluated using memory traces of some tightly parallel applications in which the amount of parallelism is parameterized. These traces were used as workloads on two simulation setups, a dynamic multiprocessor with PE-S and a dynamic multiprocessor without PE-S, in order to evaluate the performance of the PE-S load balancing technique on the targeted multiprocessor. The efficiency of PE-S reconfigurations was also compared with other possible reconfiguration ratios. The experimental results showed that the load balancing algorithm is efficient and scalable for balancing tasks of at least 100,000 instructions, and that the ratios generated by PE-S are on average better than the other reconfiguration ratios.
Load balancing using work-stealing for pipeline parallelism in emerging applications
2009
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis - SC '09, 2009
Irregular and dynamic parallel applications pose significant challenges to achieving scalable performance on large-scale multicore clusters. These applications often require ongoing, dynamic load balancing in order to maintain efficiency. Scalable dynamic load balancing on large clusters is a challenging problem which can be addressed with distributed dynamic load balancing systems. Work stealing is a popular approach to distributed dynamic load balancing; however its performance on large-scale clusters is not well understood. Prior work on work stealing has largely focused on shared memory machines. In this work we investigate the design and scalability of work stealing on modern distributed memory systems. We demonstrate high efficiency and low overhead when scaling to 8,192 processors for three benchmark codes: a producer-consumer benchmark, the unbalanced tree search benchmark, and a multiresolution analysis kernel.
An algorithm for load balancing in multiprocessor systems
Information Processing Letters, 1990
We present an algorithm for dynamic load balancing in a multiprocessor system that minimizes the number of accesses to the shared memory. The algorithm makes no assumptions, probabilistic or otherwise, regarding task arrivals or processing requirements. For k processors to process n tasks, the algorithm incurs O(k log k log n) potential memory collisions in the worst case. The algorithm itself is a simple variation of the strategy of visiting the longest queue. The key idea is to delay reporting task arrivals and completions, where the delay is a function of dynamic loading conditions.
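The delayed-reporting idea can be sketched as follows: a processor keeps an accurate local queue length and writes it back to the shared table only when it has drifted sufficiently, trading a bounded inaccuracy for fewer shared-memory accesses. The fixed drift threshold and all identifiers here are illustrative assumptions; the paper derives the delay from dynamic loading conditions.

```python
class DelayedReporter:
    """Report a processor's queue length to shared memory only on large drift."""

    def __init__(self, shared_lengths, pid, threshold=4):
        self.shared = shared_lengths   # shared-memory table of queue lengths
        self.pid = pid
        self.local = shared_lengths[pid]
        self.threshold = threshold
        self.accesses = 0              # number of shared-memory writes

    def on_event(self, delta):
        # Record a task arrival (+1) or completion (-1) locally, and only
        # publish to shared memory once the published value is stale by
        # more than the threshold.
        self.local += delta
        if abs(self.local - self.shared[self.pid]) > self.threshold:
            self.shared[self.pid] = self.local
            self.accesses += 1
```

With this scheme, a burst of arrivals and completions costs O(1) shared accesses per `threshold` events instead of one per event, which is the source of the reduced collision count.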
Dynamic processor allocation for adaptively parallel work-stealing jobs
2004
TCP's burstiness is usually regarded as harmful, or at best, inconvenient. Instead, this thesis suggests a new perspective and examines whether TCP's burstiness is useful for certain applications. It claims that burstiness can be harnessed to insulate traffic from packet reordering caused by route change. We introduce the use of flowlets, a new abstraction for a burst of packets from a particular flow followed by an idle interval. We apply flowlets to the routing of traffic along multiple paths and develop a scheme using flowlet-switching to split ...
Effective Task Binding in Work Stealing Runtimes for NUMA Multi-Core Processors
2017
Modern server processors in high performance computing have multiple integrated memory controllers on-chip and are NUMA in nature. Many user-level runtime systems, such as OpenMP, Cilk and TBB, provide a task construct for programming multi-core processors. A task body may define code that accesses task-local data as well as shared data. Most of the shared data is scattered across virtual memory pages, and these virtual pages may be mapped to the memory banks of various sockets due to the first-touch policy of Linux. The user-level runtime environment must ensure that tasks are mapped as close as possible to the memory bank where their shared data is located. At the same time, the runtime system must ensure load balancing among the multiple cores. Many user-level runtime systems apply a work stealing technique for balancing the load among the cores. Often, there is a tradeoff between these two requirements: object locality and load balancing. In this paper, we address th...
The shared-thread multiprocessor
… of the 22nd annual international conference …, 2008
This paper describes initial results for an architecture called the Shared-Thread Multiprocessor (STMP). The STMP combines features of a multithreaded processor and a chip multiprocessor; specifically, it enables distinct cores on a chip multiprocessor to share thread state. This shared thread state allows the system to schedule threads from a shared pool onto individual cores, allowing for rapid movement of threads between cores.
Supporting intra-task parallelism in real-time multiprocessor systems
2012
Multiple programming models are emerging to address the increased need for dynamic task-level parallelism in applications for multi-core processors and shared-memory parallel computing, presenting promising solutions from a user-level perspective. Nonetheless, while high-level parallel languages offer a simple way for application programmers to specify parallelism in a form that easily scales with problem size, they still leave the actual scheduling of tasks to be performed at run time. Therefore, if the underlying system cannot efficiently map those tasks on the available cores, the benefits will be lost. This is particularly important in modern real-time systems as their average workload is rapidly growing more parallel, complex and computing-intensive, whilst preserving stringent timing constraints. However, as real-time scheduling theory has mostly been focused on sequential task models, a shift to parallel task models introduces a completely new dimension to the scheduling problem. Within this context, the work presented in this thesis considers how to dynamically schedule highly heterogeneous parallel applications that require real-time performance guarantees on multi-core processors. A novel scheduling approach called RTWS is proposed. RTWS combines the G-EDF scheduler with a priority-aware work-stealing load balancing scheme, enabling parallel real-time tasks to be executed on more than one processor at a given time instant. Two stealing sub-policies have arisen from this proposal and their suitability is discussed in detail. Furthermore, this thesis describes the implementation of a new scheduling class in the Linux kernel concerning RTWS, and extensively evaluates its feasibility.
Experimental results demonstrate the greater scalability and lower scheduling overhead of the proposed approach, compared to an existing real-time deadline-driven scheduling policy for the Linux kernel, and reveal its better performance when considering tasks with intra-task parallelism than without, even for short-living applications. We show that busy-aware stealing is robust to small deviations from a strict priority schedule and conclude that some priority inversion may actually be acceptable, provided it helps reduce contention, communication, synchronisation and coordination between parallel threads.
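As a hedged illustration of priority-aware stealing under G-EDF (not the thesis's actual RTWS sub-policies), a thief would take the victim's earliest-deadline ready task rather than an arbitrary one. The task representation below is an assumption for illustration.

```python
import heapq

def edf_steal(victim_ready):
    """Steal the earliest-deadline task from a victim's ready heap.

    Tasks are (absolute_deadline, name) pairs, so the heap root is the
    highest-priority task under EDF; stealing it preserves the global
    deadline order as far as this victim is concerned.
    """
    if not victim_ready:
        return None
    return heapq.heappop(victim_ready)
```

A busy-aware variant, as the abstract suggests, could tolerate taking a slightly lower-priority task when the root is contended, accepting bounded priority inversion to reduce synchronisation.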