Generic algorithms for scheduling applications on heterogeneous platforms (original) (raw)

Generic algorithms for scheduling applications on hybrid multi-core machines

2017

We study the problem of executing an application represented by a precedence task graph on a multi-core machine composed of standard computing cores and accelerators. Contrary to most existing approaches, we distinguish the allocation and the scheduling phases and we mainly focus on the allocation part of the problem: choose the most appropriate type of computing unit for each task. We address both off-line and on-line settings. In the first case, we establish strong lower bounds on the worst-case performance of a known approach based on Linear Programming for solving the allocation problem. Then, we refine the scheduling phase and we replace the greedy list scheduling policy used in this approach by a better ordering of the tasks. Although this modification leads to the same approximability guarantees, it performs much better in practice. In the on-line case, we assume that the tasks arrive in any, not known in advance, order which respects the precedence relations and the scheduler has to take irrevocable decisions about their allocation and execution. In this setting, we propose the first online scheduling algorithm which takes into account precedences. Our algorithm is based on adequate rules for selecting the type of processor where to allocate the tasks and it achieves a constant factor approximation guarantee if the ratio of the number of CPUs over the number of GPUs is bounded. Finally, all the previous algorithms have been experimented on a large number of simulations built on actual libraries. These simulations assess the good practical behavior of the algorithms with respect to the state-of-the-art solutions whenever these exist or baseline algorithms.

Generic algorithms for scheduling applications on heterogeneous multi-core platforms

2018

We study the problem of executing an application represented by a precedence task graph on a parallel machine composed of standard computing cores and accelerators. Contrary to most existing approaches, we distinguish the allocation and the scheduling phases and we mainly focus on the allocation part of the problem: choose the most appropriate type of computing unit for each task. We address both off-line and on-line settings and design generic scheduling approaches. In the first case, we establish strong lower bounds on the worst-case performance of a known approach based on Linear Programming for solving the allocation problem. Then, we refine the scheduling phase and we replace the greedy List Scheduling policy used in this approach by a better ordering of the tasks. Although this modification leads to the same approximability guarantees, it performs much better in practice. We also extend this algorithm to more types of computing units, achieving an approximation ratio which depends on the number of different types. In the on-line case, we assume that the tasks arrive in any, not known in advance, order which respects the precedence relations and the scheduler has to take irrevocable decisions about their allocation and execution. In this setting, we propose the first on-line scheduling algorithm which takes into account precedences. Our algorithm is based on adequate rules for selecting the type of processor where to allocate the tasks and it achieves a constant factor approximation guarantee if the ratio of the number of CPUs over the number of GPUs is bounded. Finally, all the previous algorithms for hybrid architectures have been experimented on a large number of simulations built on actual libraries. These simulations assess the good practical behavior of the algorithms with respect to the state-of-the-art solutions, whenever these exist, or baseline algorithms.

Performance-Effective and Low-Complexity Task Scheduling for Heterogeneous Computing

Parallel and Distributed Systems, …, 2002

AbstractÐEfficient application scheduling is critical for achieving high performance in heterogeneous computing environments. The application scheduling problem has been shown to be NP-complete in general cases as well as in several restricted cases. Because of its key importance, this problem has been extensively studied and various algorithms have been proposed in the literature which are mainly for systems with homogeneous processors. Although there are a few algorithms in the literature for heterogeneous processors, they usually require significantly high scheduling costs and they may not deliver good quality schedules with lower costs. In this paper, we present two novel scheduling algorithms for a bounded number of heterogeneous processors with an objective to simultaneously meet high performance and fast scheduling time, which are called the Heterogeneous Earliest-Finish-Time (HEFT) algorithm and the Critical-Path-on-a-Processor (CPOP) algorithm. The HEFT algorithm selects the task with the highest upward rank value at each step and assigns the selected task to the processor, which minimizes its earliest finish time with an insertion-based approach. On the other hand, the CPOP algorithm uses the summation of upward and downward rank values for prioritizing tasks. Another difference is in the processor selection phase, which schedules the critical tasks onto the processor that minimizes the total execution time of the critical tasks. In order to provide a robust and unbiased comparison with the related work, a parametric graph generator was designed to generate weighted directed acyclic graphs with various characteristics. The comparison study, based on both randomly generated graphs and the graphs of some real applications, shows that our scheduling algorithms significantly surpass previous approaches in terms of both quality and cost of schedules, which are mainly presented with schedule length ratio, speedup, frequency of best results, and average scheduling time metrics.

Analysis of a List Scheduling Algorithm for Task Graphs on Two Types of Resources

2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2020

We consider the problem of scheduling task graphs on two types of unrelated resources, which arises in the context of task-based runtime systems on modern platforms containing CPUs and GPUs. In this paper, we focus on an algorithm named HeteroPrio, which was originally introduced as an efficient heuristic for a particular application. HeteroPrio is an adaptation of the well known list scheduling algorithm, in which the tasks are picked by the resources in the order of their acceleration factor. This algorithm is augmented with a spoliation mechanism: a task assigned by the list algorithm can later on be reassigned to a different resource if it allows to finish this task earlier.We propose here the first theoretical analysis of the HeteroPrio algorithm in the presence of dependencies. More specifically, if the platform contains m and n processors of each type, we show that the worst-case approximation ratio of HeteroPrio is between 1+maxleft(fracmn,fracnmright)1 + \max \left( {\frac{m}{n},\frac{n}{m}} \right)1+maxleft(fracmn,fracnmright) ...

Scheduling of Parallel Tasks with Proportionate Priorities

Arabian Journal for Science and Engineering, 2016

Parallel computing systems promise higher performance for computationally intensive applications. Since programmes for parallel systems consist of tasks that can be executed simultaneously, task scheduling becomes crucial for the performance of these applications. Given dependence constraints between tasks, their arbitrary sizes, and bounded resources available for execution, optimal task scheduling is considered as an NP-hard problem. Therefore, proposed scheduling algorithms are based on heuristics. This paper presents a novel list scheduling heuristic, called the Noodle heuristic. Noodle is a simple yet effective scheduling heuristic that differs from the existing list scheduling techniques in the way it assigns task priorities. The priority mechanism of Noodle maintains a proportionate fairness among all ready tasks belonging to all paths within a task graph. We conduct an extensive experimental evaluation of Noodle heuristic with task graphs taken from Standard Task Graph. Our experimental study includes results for task graphs comprising of 50, 100, and 300 tasks per graph and execution scenarios with 2-, 4-, 8-, and 16-core systems. We report results for average Schedule Length Ratio (SLR) obtained by producing variations in Communication to Computation cost Ratio. We also analyse results for different degree of parallelism and

A SURVEY OF MULTIPROCESSORS TASKS SCHEDULING ALGORITHMS

In this paper, we survey some representative algorithms for offline scheduling DAG type applications in a deterministic environment, where it is assumed that the prediction of task execution time and network bandwidth/latency is accurate. However, in a real world computing environment, the probability of a precomputed schedule being executed exactly as expected is low. Due to resource sharing among multiple users, the performance of resources is inherently variant. In a static scheduling of a program represented by a directed task graph on a multiprocessor system to minimize the program completion time is a well-known problem in parallel processing. Since finding an optimal schedule is an NP complete problem in general, researchers have resorted to devising efficient heuristics. The objective of this survey is to describe various scheduling algorithms and their functionalities in a contrasting fashion as well as examine their relative merits in terms of performance and time-complexity. Since these algorithms are based on diverse assumptions, they differ in their functionalities, and hence are difficult to describe in a unified context.

Optimal Scheduling for Precedence-Constrained Applications on Heterogeneous Machines

Proceedings of MOL2NET 2018, International Conference on Multidisciplinary Sciences, 4th edition, 2018

High-Performance Computing (HPC) is a growing necessity of our technological society, HPC demands high loads of parallel computing jobs, an optimal scheduling of the parallel applications tasks is a priority to meet the demands of its users on time. Branch-and-bound (BB) Algorithms and Mathematical Programming (MP) solve complex optimization problems in an optimal manner, some MP or BB even have parallel computing capabilities, making them suitable solutions to solve realworld problems. In this paper, we propose two exact algorithms, a BB and an MP Model for scheduling precedence-constrained applications, on heterogeneous computing systems, as far as we known the first ones on his kind presented in the state of the art. One major contribution of the work is the proposed formulations of the objective function in both methods. Experimental results obtained more than twenty optimal values for synthetic applications from the literature.

Clustering-Based Task Scheduling in a Large Number of Heterogeneous Processors

IEEE Transactions on Parallel and Distributed Systems, 2016

Parallelization paradigms for effective execution in a Directed Acyclic Graph (DAG) application have been widely studied in the area of task scheduling. Schedule length can be varied depending on task assignment policies, scheduling policies, and heterogeneity in terms of each processor and each communication bandwidth in a heterogeneous system. One disadvantage of existing task scheduling algorithms is that the schedule length cannot be reduced for a data intensive application. In this paper, we propose a clustering-based task scheduling algorithm called Clustering for Minimizing the Worst Schedule Length (CMWSL) to minimize the schedule length in a large number of heterogeneous processors. First, the proposed method derives the lower bound of the total execution time for each processor by taking both the system and application characteristics into account. As a result, the number of processors used for actual execution is regulated to minimize the Worst Schedule Length (WSL). Then, the actual task assignment and task clustering are performed to minimize the schedule length until the total execution time in a task cluster exceeds the lower bound. Experimental results indicate that CMWSL outperforms both existing list-based and clustering-based task scheduling algorithms in terms of the schedule length and efficiency, especially in data-intensive applications.

Scheduling many-task workloads on supercomputers: Dealing with trailing tasks

2010 3rd Workshop on Many-Task Computing on Grids and Supercomputers, 2010

In order for many-task applications to be attractive candidates for running on high-end supercomputers, they must be able to benefit from the additional compute, I/O, and communication performance provided by high-end HPC hardware relative to clusters, grids, or clouds. Typically this means that the application should use the HPC resource in such a way that it can reduce time to solution beyond what is possible otherwise. Furthermore, it is necessary to make efficient use of the computational resources, achieving high levels of utilization.

Analysis, evaluation, and comparison of algorithms for scheduling task graphs on parallel processors

1996

In this paper, we survey algorithms that allocate a parallel program represented by an edge-weighted directed acyclic graph (DAG), also called a task graph or macrodataflow graph, to a set of homogeneous processors, with the objective of minimizing the completion time. We analyze 21 such algorithms and classify them into four groups. The first group includes algorithms that schedule the DAG to a bounded number of processors directly. These algorithms ;we called the bounded number of processors (€3") scheduling algorithms. The algorithms in the second group schedule the DAG to an unbounded number of clusters and are called the unbounded number of clusters (UNC) scheduling algorithms. The algorithms in the third group schedule the DAG usingtask duplication and are called the task duplication based (TDB) scheduling algorithms. The algorithms in the fourth group perform allocation and mapjping on arbitrary processor network topologies. These algorithms are called the arbitrary processor network (APN) scheduling algorithms. The design philosophies and principles behind these algorithms are discussed, and the performance of all of the algorithms is evaluated and compared against each other on a unified basis by using various scheduling parameters. start by classifying these algorithms into the following four groups: