Optimizing load balancing and data-locality with data-aware scheduling
Related papers
On the Merits of Distributed Work-Stealing on Selective Locality-Aware Tasks
2013 42nd International Conference on Parallel Processing, 2013
Improving the performance of work-stealing load-balancing algorithms in distributed shared-memory systems is challenging. These algorithms need to overcome the high costs of contention among workers, of communication and remote data references between nodes, and their impact on the locality preferences of tasks. Prior research focuses on stealing from the victim that best exploits data locality, and on using special deques that minimize the contention between local and remote workers.
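As a rough illustration of the victim-selection idea (the scoring rule and every name below are invented for this sketch, not taken from the paper), a locality-aware thief can prefer a victim whose stealable task reads data already resident on the thief's node:

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Task:
    input_block: str          # id of the data block this task reads
    cost: float = 1.0

@dataclass
class Worker:
    node: int
    tasks: deque = field(default_factory=deque)   # thieves steal from the "public" end

def pick_victim(thief_node, workers, resident_blocks):
    """Choose a victim whose stealable task uses data already on the thief's node.

    resident_blocks: map node -> set of block ids cached on that node.
    Falls back to the most loaded worker when no locality match exists.
    """
    local = resident_blocks.get(thief_node, set())
    best, best_score = None, -1.0
    for w in workers:
        if not w.tasks:
            continue
        candidate = w.tasks[0]                      # task at the steal end
        score = 1.0 if candidate.input_block in local else 0.0
        score += 0.001 * len(w.tasks)               # tie-break on queue length
        if score > best_score:
            best, best_score = w, score
    return best

# Tiny usage example
workers = [Worker(0, deque([Task("A")])), Worker(1, deque([Task("B"), Task("C")]))]
resident = {2: {"B"}}                               # thief (node 2) already caches block B
victim = pick_victim(2, workers, resident)
print(victim.node if victim else None)              # -> 1, because its task reads block B
```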
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis - SC '09, 2009
Irregular and dynamic parallel applications pose significant challenges to achieving scalable performance on large-scale multicore clusters. These applications often require ongoing, dynamic load balancing in order to maintain efficiency. Scalable dynamic load balancing on large clusters is a challenging problem which can be addressed with distributed dynamic load balancing systems. Work stealing is a popular approach to distributed dynamic load balancing; however, its performance on large-scale clusters is not well understood. Prior work on work stealing has largely focused on shared-memory machines. In this work we investigate the design and scalability of work stealing on modern distributed-memory systems. We demonstrate high efficiency and low overhead when scaling to 8,192 processors for three benchmark codes: a producer-consumer benchmark, the unbalanced tree search benchmark, and a multiresolution analysis kernel.
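For context, the core worker loop in work stealing is simple; the following is a minimal, hedged sketch with random victim selection and no real termination detection, all names being illustrative:

```python
import random
from collections import deque

def worker_loop(my_id, deques, execute, steal_attempts=4):
    """Run local tasks; when the local deque is empty, try to steal from random victims.

    deques: list of per-worker deques (the owner pops from the right, thieves from the left).
    Returns the number of tasks this worker executed.
    """
    done = 0
    while True:
        if deques[my_id]:
            task = deques[my_id].pop()              # local end
            execute(task)
            done += 1
            continue
        stolen = False
        for _ in range(steal_attempts):
            victim = random.randrange(len(deques))
            if victim != my_id and deques[victim]:
                deques[my_id].append(deques[victim].popleft())   # remote (steal) end
                stolen = True
                break
        if not stolen:
            return done      # a real runtime would back off and run termination detection

# Usage: worker 1 starts with nothing and lives off steals from worker 0
deques = [deque(range(6)), deque()]
print(worker_loop(1, deques, execute=lambda t: None))
```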
2014 14th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, 2014
Task scheduling and execution over large-scale, distributed systems plays an important role in achieving good performance and high system utilization. Due to the explosion of parallelism found in today's hardware, applications need to perform over-decomposition to deliver good performance; this over-decomposition is driving job management systems' requirements to support applications with a growing number of tasks with finer granularity. Our goal in this work is to provide a compact, light-weight, scalable, and distributed task execution framework (CloudKon) that builds upon cloud computing building blocks (Amazon EC2, SQS, and DynamoDB). Most of today's state-of-the-art job execution systems have predominantly master/slave architectures, which have inherent limitations such as scalability issues at extreme scales and single points of failure. On the other hand, distributed job management systems are complex and employ non-trivial load balancing algorithms to maintain good utilization. CloudKon is a distributed job management system that can support both HPC and MTC workloads with millions of tasks/jobs. We compare our work with other state-of-the-art job management systems, including Sparrow and MATRIX. The results show that CloudKon delivers better scalability than other state-of-the-art systems for some metrics, all with a significantly smaller code base (5%).
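The pull model over cloud queuing services implies at-least-once delivery, so duplicate executions must be filtered out. A minimal sketch of that pattern, using in-process stand-ins for the queue and the consistent store (CloudKon itself uses SQS and DynamoDB; the classes below are invented for illustration):

```python
import threading
import queue

class TaskStore:
    """Stand-in for a consistent key-value store; claim() must be atomic so a task
    delivered twice is executed only once."""
    def __init__(self):
        self._claimed = set()
        self._lock = threading.Lock()

    def claim(self, task_id):
        with self._lock:
            if task_id in self._claimed:
                return False
            self._claimed.add(task_id)
            return True

def worker(task_queue, store, execute):
    """Pull tasks from a shared queue (stand-in for SQS) until it is drained."""
    while True:
        try:
            task_id, payload = task_queue.get_nowait()
        except queue.Empty:
            return
        if store.claim(task_id):          # skip duplicate deliveries
            execute(payload)

# Usage: the queue deliberately contains a duplicate delivery of task 1
q = queue.Queue()
for item in [(1, "a"), (2, "b"), (1, "a")]:
    q.put(item)
executed = []
worker(q, TaskStore(), executed.append)
print(executed)   # -> ['a', 'b']
```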
Towards Data Intensive Many-Task Computing
Many-task computing aims to bridge the gap between two computing paradigms, high throughput computing and high performance computing. Many-task computing denotes high-performance computations comprising multiple distinct activities, coupled via file system operations. The aggregate number of tasks, quantity of computing, and volumes of data may be extremely large. Traditional techniques to support many-task computing commonly found in scientific computing (i.e., the reliance on parallel file systems with static configurations) do not scale to today's largest systems for data-intensive applications, as the rate of increase in the number of processors per system is outgrowing the rate of performance increase of parallel file systems. We argue that in such circumstances, data locality is critical to the successful and efficient use of large distributed systems for data-intensive applications. We propose a "data diffusion" approach to enable data-intensive many-task computing. Data diffusion acquires compute and storage resources dynamically, replicates data in response to demand, and schedules computations close to data, effectively harnessing data locality in application data access patterns. As demand increases, more resources are acquired, allowing faster responses to subsequent requests that refer to the same data; when demand drops, resources are released. To explore the feasibility of data diffusion, we offer both a theoretical and an empirical analysis. We define an abstract model for data diffusion, define and implement scheduling policies with heuristics that optimize real-world performance, and develop a competitive online caching eviction policy. We also offer many empirical experiments to explore the benefits of data diffusion, under both static and dynamic resource provisioning. We show performance improvements of one to two orders of magnitude across three diverse workloads when compared to the performance of parallel file systems, with throughputs approaching 80 Gb/s on a modest cluster of 200 processors. We also compare data diffusion with a best model for active storage, contrasting the difference between the pull model found in data diffusion and the push model found in active storage, on up to 5,832 processors. We conclude the chapter with performance results from a large-scale astronomy application demonstrating that our approach improves both its performance and scalability.
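A compact sketch of the data-diffusion placement idea, assuming a simple cache-then-least-loaded policy (class and method names are invented here; the real system also handles dynamic provisioning and cache eviction):

```python
from collections import defaultdict

class DataDiffusionScheduler:
    """Send a task to a node that already caches its input file; otherwise pick the
    least-loaded node and record that the file will now be replicated there."""

    def __init__(self, nodes):
        self.load = {n: 0 for n in nodes}            # outstanding tasks per node
        self.cache = defaultdict(set)                # node -> cached file ids

    def submit(self, task_id, input_file):
        holders = [n for n in self.load if input_file in self.cache[n]]
        if holders:
            node = min(holders, key=lambda n: self.load[n])   # cache hit: cheapest holder
        else:
            node = min(self.load, key=lambda n: self.load[n]) # cache miss: least loaded
            self.cache[node].add(input_file)                  # file diffuses to this node
        self.load[node] += 1
        return node

    def complete(self, node):
        self.load[node] -= 1

# Usage: repeated requests for the same file gravitate to the node that cached it
sched = DataDiffusionScheduler(["n0", "n1", "n2"])
print([sched.submit(i, "blockX") for i in range(3)])
# -> ['n0', 'n0', 'n0']: after the first miss, blockX is cached on n0
```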
2016 IEEE 12th International Conference on e-Science (e-Science), 2016
Data analytics has become very popular on large datasets in different organizations. It is inevitable to use distributed resources such as clouds for data analytics and other types of data processing at larger scales. To effectively utilize all system resources, an efficient scheduler is needed, but traditional resource managers and job schedulers are centralized and designed for larger batch jobs, which are fewer in number. Frameworks such as Hadoop and Spark, which are mainly designed for Big Data analytics, have been able to allow for more diversity in job types to some extent. However, even these systems have centralized architectures and will not be able to perform well at large scales and under heavy task loads. Modern applications generate tasks at very high rates that can cause significant slowdowns on these frameworks. Additionally, over-decomposition has been shown to be very useful in increasing system utilization. In order to achieve high efficiency, scalability, and better system utilization, it is critical for a modern scheduler to be able to handle over-decomposition and run highly granular tasks. We propose Albatross, a task-level scheduling and execution framework that uses a Distributed Message Queue (DMQ) for task distribution among its workers. To achieve high performance, Albatross is written in C/C++, which imposes minimal overhead on the workload process compared to languages like Java or Python. Unlike most scheduling systems, Albatross uses a pulling approach as opposed to the common push approach, which lets it achieve good load balancing and scalability. Furthermore, the framework has built-in support for task execution dependencies within workflows; Albatross is therefore able to run various types of workloads, including data analytics and HPC applications. Finally, Albatross provides data locality support, allowing the framework to achieve higher performance by minimizing unnecessary data movement over the network. Our evaluations show that Albatross outperforms Spark and Hadoop at larger scales and when running higher-granularity workloads.
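A hedged sketch of a pulling worker with locality preference, in the spirit of a DMQ-based design (the re-queue policy and all names are invented for illustration, not Albatross's actual API):

```python
from collections import deque

def pull_next(task_queue, local_files, max_requeue=3):
    """Take the next task whose input is already local; re-queue non-local tasks a few
    times before accepting one anyway, so no task starves."""
    for _ in range(len(task_queue)):
        task = task_queue.popleft()
        if task["input"] in local_files or task.setdefault("skips", 0) >= max_requeue:
            return task
        task["skips"] += 1
        task_queue.append(task)          # let another (better-placed) worker take it
    return None

# Usage
q = deque([{"input": "f1"}, {"input": "f2"}])
print(pull_next(q, local_files={"f2"}))   # -> the task reading f2
```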
Distributed Work Stealing in a Task-Based Dataflow Runtime
arXiv (Cornell University), 2022
The task-based dataflow programming model has emerged as an alternative to the process-centric programming model for extreme-scale applications. However, load balancing is still a challenge in task-based dataflow runtimes. In this paper, we present extensions to the PaRSEC runtime to demonstrate that distributed work stealing is an effective load-balancing method for task-based dataflow runtimes. In contrast to shared-memory work stealing, we find that each process should consider future tasks and the expected waiting time for execution when determining whether to steal. We demonstrate the effectiveness of the proposed work-stealing policies on a sparse Cholesky factorization, which achieves a speedup of up to 35% compared to a static division of work.
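The key departure from shared-memory work stealing is that a process weighs its expected local work against the cost of a remote steal; a minimal, illustrative predicate (the formula below is invented, not the paper's model):

```python
def should_steal(ready_local, expected_future_local, avg_task_time, steal_latency):
    """Steal only if the work already queued or expected to become ready locally would
    keep this process busy for less time than it takes to fetch a remote task."""
    expected_busy_time = (ready_local + expected_future_local) * avg_task_time
    return expected_busy_time < steal_latency

# Usage: plenty of future work is predicted locally, so skip the remote steal
print(should_steal(ready_local=0, expected_future_local=8,
                   avg_task_time=2.0, steal_latency=5.0))   # -> False
```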
Exploiting replication and data reuse to efficiently schedule data-intensive applications on grids
Job Scheduling Strategies …, 2005
Data-intensive applications executing over a computational grid demand large data transfers, which are costly operations. Taking them into account is therefore mandatory to achieve efficient scheduling of data-intensive applications on grids. Further, within a heterogeneous and ever-changing environment such as a grid, better schedules are typically attained by heuristics that use dynamic information about the grid and the applications. However, this information is often difficult to obtain accurately. On the other hand, although there are schedulers that attain good performance without requiring dynamic information, they were not designed to take data transfer into account. This paper presents Storage Affinity, a novel scheduling heuristic for bag-of-tasks data-intensive applications running in grid environments. Storage Affinity exploits a data reuse pattern, common to many data-intensive applications, that allows it to take data transfer delays into account and reduce the makespan of the application. Further, it uses a replication strategy that yields efficient schedules without relying upon dynamic information that is difficult to obtain. Our results show that Storage Affinity may attain better performance than state-of-the-art knowledge-dependent schedulers. This is achieved at the expense of consuming more CPU cycles and network bandwidth.
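A minimal sketch of the storage-affinity idea, assuming affinity is measured as the bytes of a task's input already present at a site (data layout and names invented here):

```python
def storage_affinity(task_inputs, site_storage):
    """Affinity of a site = bytes of the task's input already stored there;
    the task goes to the site with the highest affinity."""
    def affinity(site_files):
        return sum(size for f, size in task_inputs.items() if f in site_files)
    return max(site_storage, key=lambda site: affinity(site_storage[site]))

# Usage: site "s2" already holds the large input file, so it wins
task_inputs = {"in.dat": 900, "cfg": 1}       # file -> size in bytes
sites = {"s1": {"cfg"}, "s2": {"in.dat"}}     # site -> files stored there
print(storage_affinity(task_inputs, sites))   # -> 's2'
```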
Decoupling Computation and Data Scheduling in Distributed Data-Intensive Applications
2002
In high energy physics, bioinformatics, and other disciplines, we encounter applications involving numerous, loosely coupled jobs that both access and generate large data sets. So-called Data Grids seek to harness geographically distributed resources for such large-scale data-intensive problems. Yet effective scheduling in such environments is challenging, due to a need to address a variety of metrics and constraints (e.g., resource utilization, response time, global and local allocation policies) while dealing with multiple, potentially independent sources of jobs and a large number of storage, compute, and network resources.
Dynamic Multiple Work Stealing Strategy for Flexible Load Balancing
IEICE Transactions on Information and Systems, 2012
Lazy task creation is an efficient method of overcoming the overhead of the grain-size problem in parallel computing, and work stealing is an effective load balancing strategy. In this paper, we present dynamic work-stealing strategies within a lazy-task-creation technique for efficient fine-grain task scheduling. The basic idea is to control load balancing granularity depending on the number of task parents in a stack. The dynamic-length strategy of work stealing uses run-time information, namely the load of the victim, to determine the number of tasks that a thief is allowed to steal. We compare it with the bottommost-first work-stealing strategy used in StackThreads/MP and with the fixed-length strategy of work stealing, where a thief requests a fixed number of tasks, as well as with other multithreaded frameworks such as Cilk and OpenMP task implementations. The experiments show that the dynamic-length strategy performs well on irregular workloads such as the UTS benchmark, as well as on regular workloads such as Fibonacci, Strassen's matrix multiplication, FFT, and sparse LU factorization. The dynamic-length strategy works better than the fixed-length strategy because it is more flexible: it can avoid load imbalance due to over-stealing.
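A toy comparison of the two steal-size rules, assuming the dynamic rule takes roughly half of the victim's queued tasks (the paper's actual rule is based on the victim's stack of task parents, so this is only illustrative):

```python
def steal_count_dynamic(victim_queue_len, max_steal=None):
    """Dynamic-length sketch: take about half of the victim's queued tasks,
    so neither side is left starved or overloaded."""
    n = victim_queue_len // 2
    if max_steal is not None:
        n = min(n, max_steal)
    return max(n, 1) if victim_queue_len > 0 else 0

def steal_count_fixed(victim_queue_len, k=4):
    """Fixed-length strategy: always request k tasks (may over-steal a light victim)."""
    return min(k, victim_queue_len)

# Usage: a lightly loaded victim keeps most of its work under the dynamic rule
print(steal_count_dynamic(3), steal_count_fixed(3))   # -> 1 3
```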