Job Coscheduling on Coupled High-End Computing Systems

Coscheduling techniques and monitoring tools for non-dedicated cluster computing

Our efforts are directed towards understanding the coscheduling mechanism in a network of workstations (NOW) when a parallel job is executed jointly with local workloads, balancing parallel performance against local interactive response. Explicit and implicit coscheduling techniques have been implemented in a PVM-Linux NOW (or cluster).

Co-scheduling of computation and data on computer clusters

Proceedings of the 17th …, 2005

Scientific investigations have to deal with rapidly growing amounts of data from simulations and experiments. During data analysis, scientists typically want to extract subsets of the data and perform computations on them. In order to speed up the analysis, computations are performed on distributed systems such as computer clusters or Grid systems. A well-known difficult problem is to build systems that execute the computations and data movement in a coordinated fashion. In this paper, we describe an architecture for executing co-scheduled tasks of computation and data movement on a computer cluster that takes advantage of two technologies currently being used in distributed Grid systems. The first is Condor, which manages the scheduling and execution of distributed computation, and the second is Storage Resource Managers (SRMs), which manage the space usage and content of storage systems. This is achieved by including the information about the availability of files on the nodes, provided by SRMs, in the advertised information that Condor uses for the purpose of matchmaking. The system is capable of dynamic load balancing by replicating popular files on idle nodes. To confirm the feasibility of our approach, a prototype system was built on a computer cluster. Several experiments based on real work logs were performed. We observed that without replication compute nodes are underutilized and job wait times in the scheduler's queue are longer. This architecture can be used in wide-area Grid systems since the basic components are already used for the Grid.
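
The matchmaking idea in the abstract above can be illustrated with a minimal sketch. It does not use Condor or any SRM interface: `Node`, `Job`, `match`, `replicate_to_idle`, and `schedule` are hypothetical names, and a plain Python set stands in for the file-availability information an SRM would advertise. The only assumption taken from the abstract is that a job should go to an idle node that already holds its input file, and that a popular file may be replicated to an idle node when no such node is free.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    busy: bool = False
    # Stand-in for the file availability an SRM would report for this node.
    cached_files: set = field(default_factory=set)


@dataclass
class Job:
    name: str
    input_file: str


def match(job, nodes):
    """Return an idle node that already holds the job's input file, else None."""
    for node in nodes:
        if not node.busy and job.input_file in node.cached_files:
            return node
    return None


def replicate_to_idle(job, nodes):
    """Copy the job's input file to the first idle node; return it, or None."""
    for node in nodes:
        if not node.busy:
            node.cached_files.add(job.input_file)
            return node
    return None


def schedule(job, nodes):
    """Prefer a node that has the data; otherwise replicate to an idle node."""
    node = match(job, nodes) or replicate_to_idle(job, nodes)
    if node is not None:
        node.busy = True
    return node  # None means the job stays in the queue


nodes = [Node("n1", cached_files={"a.dat"}), Node("n2"), Node("n3")]
first = schedule(Job("j1", "a.dat"), nodes)   # runs where a.dat already is
second = schedule(Job("j2", "a.dat"), nodes)  # n1 is busy, so a.dat is replicated
print(first.name, second.name, sorted(nodes[1].cached_files))
```

In this toy run the second job lands on `n2` only because the file is replicated there; without `replicate_to_idle` it would wait in the queue while `n2` and `n3` sat idle, which is the underutilization the abstract reports.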

Buffered coscheduling: a new methodology for multitasking parallel jobs on distributed systems

Proceedings 14th International Parallel and Distributed Processing Symposium. IPDPS 2000, 2000

Buffered coscheduling is a scheduling methodology for time-sharing communicating processes in parallel and distributed systems. The methodology has two primary features: communication buffering and strobing. With communication buffering, communication generated by each processor is buffered and performed at the end of regular intervals to amortize communication and scheduling overhead. This infrastructure is then leveraged by a strobing mechanism to perform a total exchange of information at the end of each interval, thus providing global information to more efficiently schedule communicating processes. This paper describes how buffered coscheduling can optimize resource utilization by analyzing workloads with varying computational granularities, load imbalances, and communication patterns. The experimental results, performed using a detailed simulation model, show that buffered coscheduling is very effective on fast SANs such as Myrinet as well as slower switch-based LANs.
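
The two features named in the abstract above, communication buffering and strobing, can be sketched in a few lines. This is a toy model under stated assumptions, not the authors' simulator: `Processor`, `send`, and `end_of_interval` are hypothetical names, and the "global information" exchanged at the strobe is assumed here to be a per-pair count of pending messages.

```python
from collections import defaultdict


class Processor:
    def __init__(self, pid):
        self.pid = pid
        self.outbox = []  # messages buffered during the current interval
        self.inbox = []   # messages delivered at interval boundaries

    def send(self, dest, payload):
        # Communication buffering: nothing leaves the processor yet.
        self.outbox.append((dest, payload))


def end_of_interval(processors):
    """Strobe: gather pending-traffic counts globally, then deliver in a batch."""
    # Total exchange: build the view every processor would receive.
    traffic = defaultdict(int)
    for p in processors:
        for dest, _ in p.outbox:
            traffic[(p.pid, dest)] += 1

    # Deliver all buffered messages at once and empty the buffers.
    by_pid = {p.pid: p for p in processors}
    for p in processors:
        for dest, payload in p.outbox:
            by_pid[dest].inbox.append((p.pid, payload))
        p.outbox.clear()
    return dict(traffic)


procs = [Processor(i) for i in range(3)]
procs[0].send(1, "x")
procs[0].send(1, "y")
procs[2].send(0, "z")
inbox_before = list(procs[1].inbox)  # still empty: sends were only buffered
traffic = end_of_interval(procs)
print(traffic, procs[1].inbox)
```

A scheduler could use the returned traffic table to decide which communicating processes to run together in the next interval; how that decision is made is outside this sketch.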

A loosely coupled metacomputer: co-operating job submissions across multiple supercomputing sites

Concurrency: Practice and Experience, 1999

This paper introduces a general metacomputing framework for submitting jobs across a variety of distributed computational resources. A first-come, first-served scheduling algorithm distributes the jobs across the computer resources. Because dependencies between jobs are expressed via a dataflow graph, the framework is more than just a uniform interface to the independently running queuing systems and interactive shells on each computer system. Using the dataflow approach extends the concept of sequential batch and interactive processing to running programs across multiple computers and computing sites in cooperation. We present results from a Grand Challenge case study showing that the turnaround time was dramatically reduced by having access to several supercomputers at runtime. The framework is applicable to other complex scientific problems that are coarse grained.
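
As a rough illustration of combining a dataflow graph with first-come, first-served distribution, the sketch below dispatches jobs in rounds. The function `dispatch`, the round-based timing, and the job and site names are assumptions made for the example; the abstract does not describe the framework's actual interface.

```python
from collections import deque


def dispatch(jobs, deps, resources):
    """Assign jobs to resources round by round.

    jobs: job names in submission order.
    deps: maps a job to the set of jobs whose output it needs.
    resources: machine names; each runs at most one job per round.
    Returns a list of rounds, each a list of (job, resource) pairs.
    """
    done = set()
    waiting = deque(jobs)  # first-come, first-served among ready jobs
    rounds = []
    while waiting:
        free = list(resources)
        started = []
        still_waiting = deque()
        for job in waiting:
            # A job is ready once every job it depends on finished earlier.
            if free and deps.get(job, set()) <= done:
                started.append((job, free.pop(0)))
            else:
                still_waiting.append(job)
        if not started:
            raise ValueError("unsatisfiable dependencies")
        done.update(job for job, _ in started)
        rounds.append(started)
        waiting = still_waiting
    return rounds


jobs = ["prep", "simA", "simB", "merge"]
deps = {"simA": {"prep"}, "simB": {"prep"}, "merge": {"simA", "simB"}}
rounds = dispatch(jobs, deps, ["site1", "site2"])
for number, started in enumerate(rounds, start=1):
    print(number, started)
```

With two sites, `simA` and `simB` run side by side in the second round, which is the kind of turnaround gain the abstract attributes to having several machines available at runtime.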

Scalable co-scheduling strategies in distributed computing

… on Computer Systems …, 2010

In this paper, we present an approach to scalable coscheduling in distributed computing for complex sets of interrelated tasks (jobs). Scalability here means that schedules are formed for job models with various levels of task granularity and data replication policies, and that processor and memory resources can be upgraded. Guaranteeing job execution at the required quality of service requires taking into account the dynamics of the distributed environment, namely changes in the number of jobs to be serviced, volumes of computation, possible failures of processor nodes, etc. As a consequence, in the general case, a set of versions of scheduling, or a strategy, is required instead of a single version. We propose a scalable model of scheduling based on multicriteria strategies. The choice of the specific schedule depends on the load level of the resource dynamics and is formed as a resource query which is sent to a local batch-job management system.

Efficient Scheduling of Parallel Jobs on Massively Parallel Systems

2007

We present buffered coscheduling, a new methodology to multitask parallel jobs in a message-passing environment and to develop parallel programs that can pave the way to the efficient implementation of a distributed operating system. Buffered coscheduling is based on three innovative techniques: communication buffering, strobing, and non-blocking communication. By leveraging these techniques, we can perform effective optimizations based on the global status of the parallel machine rather than on the limited knowledge available locally to each processor. The advantages of buffered coscheduling include higher resource utilization, reduced communication overhead, efficient implementation of flow-control strategies and fault-tolerant protocols, accurate performance modeling, and a simplified yet still expressive parallel programming model. Preliminary experimental results show that buffered coscheduling is very effective in increasing the overall performance in the presence of load imbalance and communication-intensive workloads.

Improved Resource Utilization with Buffered Coscheduling

Parallel Algorithms and Applications, 2001

We present buffered coscheduling, a new methodology to multitask parallel jobs in a message-passing environment and to develop parallel programs that can pave the way to the efficient implementation of a distributed operating system. Buffered coscheduling is based on three innovative techniques: communication buffering, strobing, and non-blocking communication. By leveraging these techniques, we can perform effective optimizations based on the global status of the parallel machine rather than on the limited knowledge available locally to each processor. The advantages of buffered coscheduling include higher resource utilization, reduced communication overhead, efficient implementation of flow-control strategies and fault-tolerant protocols, accurate performance modeling, and a simplified yet still expressive parallel programming model which offloads many resource-management tasks to the operating system. Preliminary experimental results show that buffered coscheduling is very effective in increasing the overall performance in the presence of load imbalance and communication-intensive workloads and is relatively insensitive to the local process scheduling strategy.

Towards Accommodating Real-time Jobs on HPC Platforms

ArXiv, 2021

Increasing data volumes in scientific experiments necessitate the use of high performance computing (HPC) resources for data analysis. In many scientific fields, the data generated from scientific instruments and supercomputer simulations must be analyzed rapidly. In fact, the requirement for quasi-instant feedback is growing. Scientists want to use results from one experiment to guide the selection of the next or even to improve the course of a single experiment. Current HPC systems are typically batch-scheduled under policies in which an arriving job is run immediately only if enough resources are available; otherwise it is queued. It is hard for these systems to support real-time jobs. To meet their requirements, real-time jobs may sometimes have to preempt batch jobs and/or be scheduled ahead of batch jobs that were submitted earlier. Accommodating real-time jobs may also negatively impact system utilization, especially when preemption/restart of batch jobs is involv...

A closer look at coscheduling approaches for a network of workstations

Proceedings of the eleventh annual ACM symposium on Parallel algorithms and architectures - SPAA '99, 1999

Efficient scheduling of processes on processors of a Network of Workstations (NOW) is essential for good system performance. However, the design of such schedulers is challenging because of the complex interaction between several system and workload parameters. Coscheduling, though desirable, is impractical for such a loosely coupled environment. Two operations, waiting for a message and arrival of a message, can be used to take remedial actions that can guide the behavior of the system towards coscheduling using local information. We present a taxonomy of three possibilities for each of these two operations, leading to a design space of 3 × 3 scheduling mechanisms. This paper presents an extensive implementation and evaluation exercise in studying these mechanisms.