Dynamic Multiple Work Stealing Strategy for Flexible Load Balancing
Related papers
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis - SC '09, 2009
Irregular and dynamic parallel applications pose significant challenges to achieving scalable performance on large-scale multicore clusters. These applications often require ongoing, dynamic load balancing to maintain efficiency. Scalable dynamic load balancing on large clusters is a challenging problem that can be addressed with distributed dynamic load balancing systems. Work stealing is a popular approach to distributed dynamic load balancing; however, its performance on large-scale clusters is not well understood. Prior work on work stealing has largely focused on shared memory machines. In this work we investigate the design and scalability of work stealing on modern distributed memory systems. We demonstrate high efficiency and low overhead when scaling to 8,192 processors for three benchmark codes: a producer-consumer benchmark, the unbalanced tree search benchmark, and a multiresolution analysis kernel.
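The baseline pattern evaluated here can be summarized with a minimal sketch (not the paper's distributed-memory implementation; the `Worker` class and its methods are made up for illustration): each worker owns a deque, pops work LIFO from its own end, and steals FIFO from a randomly chosen victim when it runs out of local tasks.

```python
# Minimal work-stealing sketch: per-worker deques, local LIFO pops, random-victim
# FIFO steals. A coarse lock stands in for the lock-free deques real runtimes use.
import collections
import random
import threading

class Worker:
    def __init__(self, wid, workers):
        self.wid = wid
        self.workers = workers            # shared list of all Worker objects
        self.deque = collections.deque()
        self.lock = threading.Lock()

    def push(self, task):
        with self.lock:
            self.deque.append(task)       # owner pushes to the tail

    def pop_local(self):
        with self.lock:
            return self.deque.pop() if self.deque else None      # owner pops LIFO

    def steal(self):
        with self.lock:
            return self.deque.popleft() if self.deque else None  # thieves take FIFO

    def run(self):
        while True:
            task = self.pop_local()
            if task is None:
                victims = [w for w in self.workers if w is not self]
                task = random.choice(victims).steal() if victims else None
                if task is None:
                    break                 # naive termination; real runtimes detect quiescence
            task(self)                    # execute; the task may push more work via self.push
```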
Effective Task Binding in Work Stealing Runtimes for NUMA Multi-Core Processors
2017
Modern server processors in high performance computing integrate multiple memory controllers on-chip and are NUMA in nature. Many user-level runtime systems, such as OpenMP, Cilk, and TBB, provide a task construct for programming multicore processors. A task body may define code that accesses both task-local data and shared data. Most of the shared data is scattered across virtual memory pages, and these pages may be mapped to the memory banks of different sockets due to the first-touch policy of Linux. The user-level runtime must therefore ensure that tasks are mapped as close as possible to the memory bank holding their shared data. At the same time, it must ensure load balancing among the cores; many user-level runtimes apply work stealing for this purpose. Often there is a tradeoff between these two requirements, object locality and load balancing. In this paper, we address th...
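As a rough illustration of the locality/load-balance tradeoff described above (the function, parameter names, and the slack threshold are hypothetical, not the paper's policy), a placement heuristic might prefer a worker on the NUMA node holding the task's shared data and fall back to the globally least-loaded worker when that node is overloaded.

```python
# Hypothetical NUMA-aware placement: locality wins unless the home node is
# noticeably busier than the average worker.
def choose_worker(task_home_node, queue_len_by_worker, node_of_worker, slack=1.5):
    """queue_len_by_worker: dict worker_id -> pending tasks
       node_of_worker:      dict worker_id -> NUMA node id"""
    avg_load = sum(queue_len_by_worker.values()) / len(queue_len_by_worker)
    # Workers on the task's home node, least loaded first.
    local = sorted((w for w, n in node_of_worker.items() if n == task_home_node),
                   key=lambda w: queue_len_by_worker[w])
    if local and queue_len_by_worker[local[0]] <= slack * avg_load:
        return local[0]          # keep the task near its data
    # Otherwise sacrifice locality for load balance.
    return min(queue_len_by_worker, key=queue_len_by_worker.get)
```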
Task Packing: Efficient task scheduling in unbalanced parallel programs to maximize CPU utilization
Journal of Parallel and Distributed Computing, 2019
Load imbalance in parallel systems can be caused by factors external to the running application, such as operating system noise, or by the underlying hardware, as in a heterogeneous cluster. HPC applications working on irregular data structures can also struggle to balance their computation across parallel tasks. In this article we extend, improve, and evaluate in more depth the Task Packing mechanism proposed in previous work. The main idea of the mechanism is to concentrate the idle cycles of unbalanced applications in such a way that one or more CPUs are freed from execution. To achieve this, the remaining CPUs are loaded with only the useful work of the parallel application's tasks, provided performance is not degraded. The packing is computed by an algorithm based on the Knapsack problem, onto a minimum number of CPUs and using oversubscription. We design and implement a more efficient version of this mechanism that performs the Task Packing "in place", taking advantage of idle cycles generated at synchronization points of unbalanced applications. Evaluations are carried out on a heterogeneous platform using the FT and miniFE benchmarks. Results show that our proposal introduces low overhead, and the number of freed CPUs is related to a load imbalance metric that can be used as a predictor for it.
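A first-fit-decreasing bin-packing sketch (a stand-in for the article's Knapsack-based formulation; the function name and the busy-fraction model are assumptions) conveys the core idea: characterize each task by its busy fraction and co-locate tasks so the remaining CPUs can be freed.

```python
# Greedy packing of tasks onto as few CPUs as possible, leaving the rest idle/freed.
def pack_tasks(busy_fraction_by_task, cpu_capacity=1.0):
    """busy_fraction_by_task: dict task_id -> fraction of time the task computes."""
    cpus = []   # each CPU is (remaining_capacity, [tasks])
    for task, load in sorted(busy_fraction_by_task.items(),
                             key=lambda kv: kv[1], reverse=True):
        for i, (free, tasks) in enumerate(cpus):
            if load <= free:                              # fits on an already-used CPU
                cpus[i] = (free - load, tasks + [task])
                break
        else:
            cpus.append((cpu_capacity - load, [task]))    # open a new CPU
    return [tasks for _, tasks in cpus]                   # CPUs not listed here are freed

# Example: four tasks that each idle half the time pack onto two CPUs.
print(pack_tasks({"t0": 0.5, "t1": 0.5, "t2": 0.5, "t3": 0.5}))
```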
Lazy Scheduling: A Runtime Adaptive Scheduler for Declarative Parallelism
Lazy scheduling is a runtime scheduler for task-parallel codes that effectively coarsens parallelism based on load conditions in order to significantly reduce scheduling overheads compared to existing approaches, thus enabling the efficient execution of more fine-grained tasks. Unlike other adaptive dynamic schedulers, lazy scheduling does not maintain any additional state to infer system load and does not make irrevocable serialization decisions. These two features allow it to scale well and to provide excellent load balancing in practice, but at a much lower overhead cost than work stealing, the gold standard of dynamic schedulers. We evaluate three variants of lazy scheduling on a set of benchmarks on three different platforms and find it to substantially outperform popular work stealing implementations on fine-grained codes. Furthermore, we show that the vast performance gap between manually coarsened and fully parallel code is greatly reduced by lazy scheduling, and that, with minimal static coarsening, lazy scheduling delivers performance very close to that of fully tuned code. The tedious manual coarsening required by the best existing work stealing schedulers, and its damaging effect on performance portability, have kept novice and general-purpose programmers from parallelizing their codes. Lazy scheduling offers the foundation for a declarative parallel programming methodology that should attract those programmers by minimizing the need for manual coarsening and by greatly enhancing the performance portability of parallel code.
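One simplified reading of the policy (a sketch under that assumption, not the authors' implementation; the `LazyWorker` class is hypothetical): at every potential spawn the worker inspects only its own deque. If previously exposed work is still sitting there un-stolen, the system looks loaded and the child runs inline; otherwise the child is exposed as a stealable task. Because the check is repeated at every spawn, no serialization decision is permanent.

```python
# Lazy coarsening sketch: inline the child when earlier exposed work was not taken,
# expose it when the deque has been drained (a proxy for hungry processors).
import collections

class LazyWorker:
    def __init__(self):
        self.deque = collections.deque()       # stealable work exposed by this worker

    def spawn(self, child, *args):
        if self.deque:
            child(*args)                       # deque not drained: coarsen by inlining
        else:
            self.deque.append((child, args))   # expose fine-grained parallelism

    def help(self):
        # Drain locally exposed work (thieves would take from the other end).
        while self.deque:
            fn, args = self.deque.pop()
            fn(*args)
```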
A dynamic-sized nonblocking work stealing deque
Distributed Computing, 2006
The non-blocking work-stealing algorithm of Arora, Blumofe, and Plaxton [2] (henceforth ABP work stealing) is on its way to becoming the multiprocessor load balancing technology of choice in both industry and academia. This highly efficient scheme is based on a collection of array-based double-ended queues (deques) with low-cost synchronization among local and stealing processes. Unfortunately, the algorithm's synchronization protocol is strongly based on the use of fixed-size arrays, which are prone to overflows, especially in the multiprogrammed environments for which they are designed. This is a significant drawback since, apart from memory inefficiency, it means that the size of the deque must be tailored to accommodate the effects of the hard-to-predict level of multiprogramming, and the implementation must include an expensive and application-specific overflow mechanism. This paper presents the first dynamic-memory work-stealing algorithm. It is based on a novel way of building non-blocking, dynamic-sized work-stealing deques that detect synchronization conflicts by "pointer-crossing" rather than by "gaps between indexes" as in the original ABP algorithm. As we show, the new algorithm dramatically increases robustness and memory efficiency, while causing applications no observable performance penalty. We therefore believe it can replace array-based ABP work-stealing deques, eliminating the need for application-specific overflow mechanisms.
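The sketch below illustrates only the dynamic-sizing aspect: the deque is a chain of fixed-size chunks that grows and shrinks on demand, so no overflow mechanism or worst-case pre-sizing is needed. A lock stands in for the paper's non-blocking pointer-crossing protocol, which this toy version (class and method names are assumptions) does not attempt to reproduce.

```python
# Chunked deque: owner pushes/pops at the bottom, thieves take from the top,
# and fully consumed chunks are retired while new chunks are allocated on demand.
import threading

CHUNK = 4   # small chunk size so growth and shrinkage are easy to observe

class ChunkedDeque:
    def __init__(self):
        self.chunks = [[None] * CHUNK]   # linked sequence of fixed-size arrays
        self.top = 0                     # absolute index of oldest task (steal end)
        self.bottom = 0                  # absolute index one past newest task (owner end)
        self.lock = threading.Lock()

    def _slot(self, i):
        return self.chunks[i // CHUNK], i % CHUNK

    def push_bottom(self, task):         # owner only
        with self.lock:
            if self.bottom // CHUNK == len(self.chunks):
                self.chunks.append([None] * CHUNK)    # grow: allocate a new chunk
            arr, off = self._slot(self.bottom)
            arr[off] = task
            self.bottom += 1

    def pop_bottom(self):                # owner only
        with self.lock:
            if self.bottom == self.top:
                return None
            self.bottom -= 1
            arr, off = self._slot(self.bottom)
            return arr[off]

    def steal_top(self):                 # thieves
        with self.lock:
            if self.top == self.bottom:
                return None
            arr, off = self._slot(self.top)
            task, self.top = arr[off], self.top + 1
            if self.top >= CHUNK:        # shrink: retire the fully consumed chunk
                self.chunks.pop(0)
                self.top -= CHUNK
                self.bottom -= CHUNK
            return task
```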
Distributed Work Stealing in a Task-Based Dataflow Runtime
arXiv (Cornell University), 2022
The task-based dataflow programming model has emerged as an alternative to the process-centric programming model for extreme-scale applications. However, load balancing is still a challenge in task-based dataflow runtimes. In this paper, we present extensions to the PaRSEC runtime to demonstrate that distributed work stealing is an effective load-balancing method for task-based dataflow runtimes. In contrast to shared-memory work stealing, we find that each process should consider future tasks and the expected waiting time for execution when determining whether to steal. We demonstrate the effectiveness of the proposed work-stealing policies for a sparse Cholesky factorization, which shows a speedup of up to 35% compared to a static division of work.
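A hypothetical heuristic (not PaRSEC's actual policy; the function and all parameters are assumptions) shows how "future tasks and the expected waiting time" can enter the steal decision: steal only when the expected local backlog is shorter than the time a remote steal takes to complete.

```python
# Weigh work expected to arrive from the dataflow against the cost of a remote steal.
def should_steal(local_ready, expected_future, avg_task_time, steal_latency,
                 num_local_workers):
    """local_ready:      tasks already runnable on this process
       expected_future:  tasks whose dependencies are expected to resolve soon
       avg_task_time:    mean task execution time (seconds)
       steal_latency:    expected time to complete a remote steal (seconds)"""
    # Time until the local workers would actually sit idle.
    backlog_time = (local_ready + expected_future) * avg_task_time / num_local_workers
    # Only steal if the idle gap is larger than the cost of fetching remote work.
    return backlog_time < steal_latency

# Example: 2 ready + 6 expected tasks of 1 ms each on 8 workers -> 1 ms of backlog,
# so with a 5 ms steal latency the process starts stealing ahead of time.
print(should_steal(2, 6, 1e-3, 5e-3, 8))
```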
Hierarchical work stealing on manycore clusters
2011
Partitioned Global Address Space languages like UPC offer a convenient way of expressing large shared data structures, especially irregular structures that require asynchronous random access. But the static SPMD parallelism model of UPC does not support divide-and-conquer parallelism or other forms of dynamic parallelism. We introduce a dynamic tasking library for UPC that provides a simple and effective way of adding task parallelism to SPMD programs. The library, called HotSLAW, provides a high-level API that abstracts concurrent task management details and performs dynamic load balancing. To achieve scalability, we propose a topology-aware hierarchical work-stealing strategy that exploits locality in distributed-memory clusters. Our approach extends state-of-the-art techniques in shared- and distributed-memory implementations with two mechanisms: Hierarchical Victim Selection (HVS), which finds the nearest victim thread to preserve locality, and Hierarchical Chunk Selection (HCS), which dynamically determines the amount of work to steal based on the locality of the victim thread. We evaluate the performance of our runtime on shared- and distributed-memory systems using irregular applications. On shared memory, HotSLAW provides performance comparable to or better than hand-tuned OpenMP implementations. On distributed-memory systems, the combination of Hierarchical Victim Selection and Hierarchical Chunk Selection provides better performance than state-of-the-art approaches using random victim selection with a StealHalf strategy for the workloads considered.
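A schematic rendering of the two mechanisms (not HotSLAW's API; the domain lists and the steal fractions are made up for illustration): HVS walks the locality hierarchy from the closest domain outward, and HCS sizes the steal according to how far away the chosen victim is, e.g., taking a larger share from more distant victims to amortize the higher cost of the remote operation.

```python
# Hierarchical victim selection (nearest domain first) and distance-aware chunk sizing.
import random

def hierarchical_victim(thief, hierarchy, has_work):
    """hierarchy: locality domains from innermost to outermost, e.g.
       [cores_on_my_socket, cores_on_my_node, all_cores]; each is a list of ids."""
    for level, domain in enumerate(hierarchy):
        candidates = [v for v in domain if v != thief and has_work(v)]
        if candidates:
            return random.choice(candidates), level
    return None, None

def chunk_to_steal(victim_queue_len, distance_level, fraction_by_level=(0.25, 0.5, 0.5)):
    # Steal a distance-dependent fraction of the victim's queue, at least one task.
    return max(1, int(victim_queue_len * fraction_by_level[distance_level]))
```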
A simple load balancing scheme for task allocation in parallel machines
Proceedings of the third annual ACM symposium on Parallel algorithms and architectures - SPAA '91, 1991
A collection of local workpiles (task queues) combined with a simple load balancing scheme is well suited for scheduling tasks on shared memory parallel machines. Task scheduling on such machines has usually been done through a single, globally accessible workpile. The scheme introduced in this paper achieves balancing comparable to that of a global workpile while minimizing the overheads. In many parallel computer architectures each processor has some memory that it can access more efficiently, so it is desirable that tasks do not migrate frequently. The load balancing is simple and distributed: whenever a processor accesses its local workpile, it performs a balancing operation with probability inversely proportional to the size of its workpile. The balancing operation consists of examining the workpile of a random processor and exchanging tasks so as to equalize the sizes of the two workpiles. A probabilistic analysis of the performance of the load balancing scheme proves that each task in the system receives its fair share of computation time. Specifically, the expected size of each local task queue is within a small constant factor of the average, i.e., the total number of tasks in the system divided by the number of processors.
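The scheme is concrete enough to transcribe almost directly (a sketch only; synchronization of the shared workpiles is omitted and the function name is hypothetical): on each access to its workpile, a processor balances with probability roughly inversely proportional to the workpile's size, and balancing splits the union of the two queues evenly.

```python
# Probabilistic balancing against a random peer, as described in the abstract.
import random

def maybe_balance(my_id, workpiles):
    """workpiles: dict processor_id -> list of tasks (shared state, shown unsynchronized)."""
    size = len(workpiles[my_id])
    if random.random() < 1.0 / max(1, size):            # balance with probability ~ 1/size
        other_id = random.choice([p for p in workpiles if p != my_id])
        merged = workpiles[my_id] + workpiles[other_id]  # pool the two workpiles
        half = (len(merged) + 1) // 2
        workpiles[my_id], workpiles[other_id] = merged[:half], merged[half:]
```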
Scheduling task parallelism on multi-socket multicore systems
The recent addition of task parallelism to the OpenMP shared memory API allows programmers to express concurrency at a high level of abstraction and places the burden of scheduling parallel execution on the OpenMP runtime system. This is a welcome development for scientific computing as supercomputer nodes grow "fatter" with multicore and manycore processors. But efficient scheduling of tasks on modern multi-socket multicore shared memory systems requires careful consideration of an increasingly complex memory hierarchy, including shared caches and NUMA characteristics. In this paper, we propose a hierarchical scheduling strategy that leverages different methods at different levels of the hierarchy. By allowing one thread to steal work on behalf of all of the threads within a single chip that share a cache, our scheduler limits the number of costly remote steals. For cores on the same chip, a shared LIFO queue allows exploitation of cache locality between sibling tas...
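A simplified reading of this strategy (a sketch, not the authors' OpenMP runtime; the `ChipQueue` structure and steal-half split are assumptions): threads on a chip share one LIFO queue, and when it runs dry only the thread that wins the chip's steal lock performs a remote steal on behalf of its siblings, pushing most of the stolen work back into the shared queue.

```python
# One shared LIFO per chip; at most one thread per chip performs costly remote steals.
import random
import threading

class ChipQueue:
    def __init__(self):
        self.tasks = []                       # shared LIFO for all cores on the chip
        self.lock = threading.Lock()
        self.steal_lock = threading.Lock()    # at most one outstanding remote steal

    def push(self, task):
        with self.lock:
            self.tasks.append(task)

    def pop(self):
        with self.lock:
            return self.tasks.pop() if self.tasks else None

def get_task(my_chip, all_chips):
    task = my_chip.pop()
    if task is not None:
        return task
    # Local queue empty: only one thread per chip goes stealing at a time.
    if my_chip.steal_lock.acquire(blocking=False):
        try:
            victims = [c for c in all_chips if c is not my_chip]
            random.shuffle(victims)
            for victim in victims:
                with victim.lock:
                    half = (len(victim.tasks) + 1) // 2
                    stolen, victim.tasks = victim.tasks[:half], victim.tasks[half:]
                if stolen:
                    for t in stolen[:-1]:
                        my_chip.push(t)       # share the spoils with sibling threads
                    return stolen[-1]
        finally:
            my_chip.steal_lock.release()
    return my_chip.pop()                      # a sibling may have refilled the queue
```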