Optimizing Data Shuffling in Data-Parallel Computation by Understanding User-Defined Functions

Easing the management of data-parallel systems via adaptation

2000

> Figure: Users running data-parallel applications across the Internet. The data stores are on server clusters, which, compared to monolithic machines, flexibly support different kinds of concurrent workloads, are easier to upgrade, and have the potential to support independent node faults. Our client machines, in contrast to the traditional view, are active collaborators with the clusters in providing the end result to the user.

Scheduling Computationally Intensive Data Parallel Programs

We consider the problem of how to run a workload of multiple parallel jobs on a single parallel machine. Jobs are assumed to be data-parallel with large degrees of parallelism, and the machine is assumed to have an MIMD architecture. We identify a spectrum of scheduling policies between the two extremes of time-slicing, in which jobs take turns to use the whole machine, and space-slicing, in which jobs get disjoint subsets of processors for their own dedicated use. Each of these scheduling policies is evaluated using a metric suited for interactive execution: the minimum machine power being devoted to any job, averaged over time. The following result is demonstrated. If there is no advance knowledge of job characteristics (such as running time, I/O frequency, and communication locality), the best scheduling policy is gang-scheduling with instruction-balance. This conclusion validates some of the current practices in commercial systems. This work is then extended to irregular jobs, i.e...
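The metric in this abstract (minimum machine power devoted to any job, averaged over time) can be sketched under one plausible reading: average each job's power share over time, then take the minimum across jobs. The traces below are illustrative examples, not data from the paper.

```python
def min_power_metric(alloc):
    """alloc[t][j] = fraction of machine power that job j receives at timestep t.

    One plausible reading of the metric: average each job's share over
    time, then take the minimum across jobs.
    """
    n_steps, n_jobs = len(alloc), len(alloc[0])
    per_job_avg = [sum(step[j] for step in alloc) / n_steps for j in range(n_jobs)]
    return min(per_job_avg)

# Two jobs, two timesteps:
time_slicing = [[1.0, 0.0], [0.0, 1.0]]   # jobs alternate using the whole machine
space_slicing = [[0.5, 0.5], [0.5, 0.5]]  # jobs split the processors evenly
skewed = [[0.75, 0.25], [0.75, 0.25]]     # an unbalanced space-slicing split
```

Under this reading, balanced time-slicing and balanced space-slicing both score 0.5, while the skewed split scores only 0.25, which suggests why balance (e.g. instruction-balance in gang-scheduling) matters for the metric.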

A framework for partitioning parallel computations in heterogeneous environments

Concurrency: Practice and Experience, 1995

In this paper we present a framework for partitioning data parallel computations across a heterogeneous metasystem at runtime. The framework is guided by program and resource information which is made available to the system. Three difficult problems are handled by the framework: processor selection, task placement, and heterogeneous data domain decomposition. Solving each of these problems contributes to reduced elapsed time. In particular, processor selection determines the best grain size at which to run the computation, task placement reduces communication cost, and data domain decomposition achieves processor load balance. We present results which indicate that excellent performance is achievable using the framework. The paper extends our earlier work on partitioning data parallel computations across a single-level network of heterogeneous workstations.

FuPerMod: a software tool for the optimization of data-parallel applications on heterogeneous platforms

The Journal of Supercomputing, 2014

Optimization of data-parallel applications for modern HPC platforms requires partitioning the computations between the heterogeneous computing devices in proportion to their speed. Heterogeneous data partitioning algorithms are based on computation performance models of the executing platforms. Their implementation is not trivial as it requires: accurate and efficient benchmarking of computing devices, which may share resources and/or execute different codes; appropriate interpolation methods to predict performance; and advanced mathematical methods to solve the data partitioning problem. In this paper, we present FuPerMod, a software tool that addresses these implementation issues and automates the development of data partitioning code in data-parallel applications for heterogeneous HPC platforms.
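The core idea, partitioning data in proportion to device speed, reduces in the simplest constant-speed case to the sketch below. This is a hypothetical helper for illustration, not FuPerMod's API; FuPerMod itself builds full functional performance models rather than using a single speed number per device.

```python
def proportional_partition(n, speeds):
    """Split n data elements across devices in proportion to their speeds.

    Simplest constant-speed case: floor each device's share, then hand
    the remaining elements to the fastest devices.
    """
    total = sum(speeds)
    shares = [n * s // total for s in speeds]
    remainder = n - sum(shares)
    for i in sorted(range(len(speeds)), key=lambda i: -speeds[i])[:remainder]:
        shares[i] += 1
    return shares

# e.g. proportional_partition(100, [3, 2, 1]) -> [51, 33, 16]
```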

Design and analysis of data management in scalable parallel scripting

2012 International Conference for High Performance Computing, Networking, Storage and Analysis, 2012

We seek to enable efficient large-scale parallel execution of applications in which a shared filesystem abstraction is used to couple many tasks. Such parallel scripting (many-task computing, MTC) applications suffer poor performance and utilization on large parallel computers because of the volume of filesystem I/O and a lack of appropriate optimizations in the shared filesystem. Thus, we design and implement a scalable MTC data management system that uses aggregated compute node local storage for more efficient data movement strategies. We co-design the data management system with the data-aware scheduler to enable dataflow pattern identification and automatic optimization. The framework reduces the time to solution of parallel stages of an astronomy data analysis application, Montage, by 83.2% on 512 cores; decreases the time to solution of a seismology application, CyberShake, by 7.9% on 2,048 cores; and delivers BLAST performance better than mpiBLAST at various scales up to 32,768 cores, while preserving the flexibility of the original BLAST application.

Efficient Data-parallel Computing on Small Heterogeneous Clusters

2012

Cluster-based data-parallel frameworks such as MapReduce, Hadoop, and Dryad are increasingly popular for a large class of compute-intensive tasks. Such systems are designed for large-scale clusters, and employ several techniques to decrease the run time of jobs in the presence of failures, slow machines, and other effects. In this paper, we apply Dryad to smaller-scale, “ad-hoc” clusters such as those formed by aggregating the servers and workstations in a small office. We first show that, while Dryad’s greedy scheduling algorithm performs well at scale, it performs significantly worse in a small (5-10 machine) cluster environment where nodes have widely differing performance characteristics. We further show that in such cases, performance models of dataflow operators can be constructed which predict runtimes of vertex processes with sufficient accuracy to allow a more intelligent planner to achieve significant performance gains for a variety of jobs, and we show how to efficiently...
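The planning step this abstract describes, using predicted per-vertex runtimes to place work on machines with widely differing speeds, can be sketched as a greedy earliest-finish assignment. This is an illustrative stand-in, not the paper's actual planner.

```python
import heapq

def plan(vertex_work, machine_speeds):
    """Assign vertices (largest predicted work first) to whichever
    machine would finish them earliest, given relative machine speeds.
    Returns the assignment and the resulting makespan."""
    finish = [(0.0, m) for m in range(len(machine_speeds))]
    heapq.heapify(finish)
    assignment = {}
    for v, work in sorted(enumerate(vertex_work), key=lambda x: -x[1]):
        t, m = heapq.heappop(finish)
        assignment[v] = m
        heapq.heappush(finish, (t + work / machine_speeds[m], m))
    return assignment, max(t for t, _ in finish)
```

With `vertex_work = [4, 4, 2, 2]` and `machine_speeds = [2, 1]`, the fast machine absorbs three vertices and both machines finish at time 4.0, whereas a speed-oblivious round-robin would leave the slow machine as a straggler until time 6.0.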

Lachesis: Automated Generation of Persistent Partitionings for Big Data Applications

ArXiv, 2020

Persistent partitioning is effective in improving performance by avoiding the expensive shuffling operation, while incurring relatively small overhead. However, it remains a significant challenge to automate this process for Big Data analytics workloads that extensively use user-defined functions. That is because user-defined functions written in an object-oriented language such as Python, Scala, or Java can contain arbitrary code that is opaque to the system and makes it hard to extract and reuse sub-computations for optimizing data placement. In addition, it is also challenging to predict the future workloads that may utilize the partitionings. We propose the Lachesis system, which allows UDFs to be decomposed into analyzable and reusable sub-computations and relies on a deep reinforcement learning model that infers which sub-computations should be used to partition the underlying data. This analysis is then used to automatically optimize the storage of the data across applications.

A Novel Data-Partitioning Algorithm for Performance Optimization of Data-Parallel Applications on Heterogeneous HPC Platforms

IEEE Transactions on Parallel and Distributed Systems

Modern HPC platforms have become highly heterogeneous owing to tight integration of multicore CPUs and accelerators (such as Graphics Processing Units, Intel Xeon Phis, or Field-Programmable Gate Arrays) empowering them to maximize the dominant objectives of performance and energy efficiency. Due to this inherent characteristic, processing elements contend for shared on-chip resources such as Last Level Cache (LLC), interconnect, etc. and shared nodal resources such as DRAM, PCI-E links, etc. This has resulted in severe resource contention and Non-Uniform Memory Access (NUMA) that have posed serious challenges to model and algorithm developers. Moreover, the accelerators feature limited main memory compared to the multicore CPU host and are connected to it via limited bandwidth PCI-E links, thereby requiring support for efficient out-of-card execution. To summarize, the complexities (resource contention, NUMA, accelerator-specific limitations, etc.) have introduced new challenges to optimization of data-parallel applications on these platforms for performance. Due to these complexities, the performance profiles of data-parallel applications executing on these platforms are not smooth and deviate significantly from the shapes that allowed state-of-the-art load-balancing algorithms to find optimal solutions. In this paper, we formulate the problem of optimization of data-parallel applications on modern heterogeneous HPC platforms for performance. We then propose a new model-based data partitioning algorithm, which minimizes the execution time of computations in the parallel execution of the application. This algorithm takes as input a set of p discrete speed functions corresponding to p available heterogeneous processors. It does not make any assumptions about the shapes of these functions. We prove the correctness of the algorithm and its complexity of O(m³ × p³), where m is the cardinality of the input discrete speed functions.
We experimentally demonstrate the optimality and efficiency of our algorithm using two data-parallel applications, matrix multiplication and fast Fourier transform, on a heterogeneous cluster of nodes where each node contains an Intel multicore Haswell CPU, an Nvidia K40c GPU, and an Intel Xeon Phi co-processor.
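For two devices, the problem this algorithm solves can be illustrated by exhaustive search over all splits, using arbitrary (possibly non-smooth) discrete speed functions. This brute-force sketch is a stand-in for the paper's polynomial algorithm, not the algorithm itself.

```python
def best_split_two(n, speed0, speed1):
    """Exhaustively find the split (d0, n - d0) minimizing the parallel
    execution time max(d0/speed0(d0), d1/speed1(d1)).

    speed0 and speed1 are arbitrary functions of problem size, so no
    smoothness assumption is needed -- mirroring the paper's setting.
    """
    best_d0, best_time = 0, float("inf")
    for d0 in range(n + 1):
        d1 = n - d0
        t0 = d0 / speed0(d0) if d0 else 0.0
        t1 = d1 / speed1(d1) if d1 else 0.0
        if max(t0, t1) < best_time:
            best_d0, best_time = d0, max(t0, t1)
    return best_d0, n - best_d0, best_time

# e.g. with constant speeds 10 and 5, n = 30 splits 20/10 with time 2.0:
# best_split_two(30, lambda d: 10, lambda d: 5) -> (20, 10, 2.0)
```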

Predicting the Cost and Benefit of Adapting Data Parallel Applications in Clusters

Journal of Parallel and Distributed Computing, 2002

This paper examines the problem of adapting data parallel applications in a shared dynamic environment of PC or workstation clusters. We developed an analytic framework to compare and contrast a wide range of adaptation strategies: dynamic load balancing, migration, processor addition and removal. These strategies have been evaluated with respect to the cost and benefit they provide for three representative parallel applications: an iterative Jacobi solver for Laplace's equation, Gaussian elimination with partial pivoting, and a gene sequence comparison application. We found that the cost and benefit of each method can be predicted with high accuracy (within 10%) for all applications and show that the framework is applicable to a wide variety of parallel applications. We then show that accurate prediction allows the most appropriate method to be selected dynamically. Performance improvement for the three applications ranged from 25% to 45% using our adaptation library. In addition, we dispel the conventional wisdom that migration is too expensive, and show that it can be beneficial even for running parallel applications with non-trivial communication.
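The selection step described here, predicting each strategy's one-time cost and ongoing benefit and then picking the one that pays off over the job's remaining lifetime, can be sketched as follows. The strategy names and numbers are illustrative, not taken from the paper.

```python
def choose_adaptation(strategies, remaining_time):
    """strategies: {name: (one_time_cost, benefit_rate)}, where cost is in
    seconds and benefit_rate is seconds saved per second of remaining run.
    Returns the strategy with the largest positive net benefit, or None."""
    best, best_net = None, 0.0
    for name, (cost, rate) in strategies.items():
        net = rate * remaining_time - cost
        if net > best_net:
            best, best_net = name, net
    return best

opts = {"migrate": (30.0, 0.5), "rebalance": (5.0, 0.1)}
# Long remaining run: migration's high one-time cost is amortized.
# choose_adaptation(opts, 100.0) -> "migrate"
# Near the end of the job: no strategy pays off.
# choose_adaptation(opts, 20.0) -> None
```

This mirrors the paper's observation that migration, despite its cost, can be the best choice when enough of the run remains to amortize it.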