A framework for distributed data-parallel execution in the Kepler scientific workflow system

Approaches to distributed execution of scientific workflows in Kepler

The Kepler scientific workflow system enables creation, execution and sharing of workflows across a broad range of scientific and engineering disciplines while also facilitating remote and distributed execution of workflows. In this paper, we present and compare different approaches to distributed execution of workflows using the Kepler environment, including a distributed data-parallel framework using Hadoop and Stratosphere, and Cloud and Grid execution using Serpens, Nimrod/K and Globus actors. We also present real-life applications in computational chemistry, bioinformatics and computational physics to demonstrate the usage of different distributed computing capabilities of Kepler in executable workflows. We further analyze the differences between the approaches and provide guidance on their application.

Scientific Workflow Management and the Kepler System

J Clin Microbiol, 2005

Many scientific disciplines are now data and information driven, and new scientific knowledge is often gained by scientists putting together data analysis and knowledge discovery 'pipelines'. A related trend is that more and more scientific communities realize the benefits of sharing their data and computational services, and are thus contributing to a distributed data and computational community infrastructure (a.k.a. 'the Grid'). However, this infrastructure is only a means to an end and ideally scientists should not be too concerned with its existence. The goal is for scientists to focus on development and use of what we call scientific workflows. These are networks of analytical steps that may involve, e.g., database access and querying steps, data analysis and mining steps, and many other steps including computationally intensive jobs on high-performance cluster computers. In this paper we describe characteristics of and requirements for scientific workflows as identified in a number of our application projects. We then elaborate on Kepler, a particular scientific workflow system, currently under development across a number of scientific data management projects. We describe some key features of Kepler and its underlying Ptolemy II system, planned extensions, and areas of future research. Kepler is a community-driven, open source project, and we always welcome related projects and new contributors to join.
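The notion of a workflow as a network of analytical steps can be illustrated with a minimal Python sketch. It is not Kepler or Ptolemy II code: the function `run_workflow` and the three-step pipeline are hypothetical, and the only assumption about execution is that each step runs after the steps it depends on.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

def run_workflow(steps, deps):
    """Run every step after the steps it depends on.

    steps: step name -> function taking a dict of upstream results
    deps:  step name -> set of upstream step names
    """
    results = {}
    # static_order() yields each step only after all of its predecessors
    for name in TopologicalSorter(deps).static_order():
        upstream = {d: results[d] for d in deps.get(name, ())}
        results[name] = steps[name](upstream)
    return results

# Hypothetical pipeline: a query step, an analysis step, a reporting step
steps = {
    "query": lambda up: [3, 1, 2],
    "analyze": lambda up: sorted(up["query"]),
    "report": lambda up: f"max={up['analyze'][-1]}",
}
deps = {"query": set(), "analyze": {"query"}, "report": {"analyze"}}
results = run_workflow(steps, deps)
```

Here `results["report"]` holds the output of the last step. A real workflow system adds much more on top of this ordering, such as remote execution and data movement.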

A Run-time System for Efficient Execution of Scientific Workflows on Distributed Environments

2006 18th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD'06), 2006

...response to the demand of researchers from several domains of science who need to process and analyse increasingly larger experimental datasets. The idea is based on the observation that these operations can be composed as long pipelines of fairly standard computations that need to be executed on very large data collections. Researchers should then be able to share pieces of code among peers, and these components can be integrated into different workflows for data processing. In this work we present a runtime support system that was customized for facilitating this type of computation on distributed computing environments. Our system is optimized for data-intensive workflows, meaning that we are very concerned with data management issues. Experiments with our system have shown that we can achieve linear speedups for fairly sophisticated applications, created from multiple components.

TARDIS: Optimal Execution of Scientific Workflows in Apache Spark

Springer eBooks, 2017

The success of using workflows for modeling large-scale scientific applications has fostered the research on parallel execution of scientific workflows in shared-nothing clusters, in which large volumes of scientific data may be stored and processed in parallel using ordinary machines. However, most of the current scientific workflow management systems do not handle the memory and data locality appropriately. Apache Spark deals with these issues by chaining activities that should be executed in a specific node, among other optimizations such as the in-memory storage of intermediate data in RDDs (Resilient Distributed Datasets). However, to take advantage of the RDDs, Spark requires existing workflows to be described using its own API, which forces the activities to be reimplemented in Python, Java, Scala or R, and this demands a big effort from the workflow programmers. In this paper, we propose a parallel scientific workflow engine called TARDIS, whose objective is to run existing workflows inside a Spark cluster, using RDDs and smart caching, in a completely transparent way for the user, i.e., without needing to reimplement the workflows in the Spark API. We evaluated our system through experiments and compared its performance with Swift/K. The results show that TARDIS performs better (up to 138% improvement) than Swift/K for parallel scientific workflow execution.

Pegasus: A Framework for Mapping Complex Scientific Workflows onto Distributed Systems

Scientific Programming, 2005

This paper describes the Pegasus framework that can be used to map complex scientific workflows onto distributed resources. Pegasus enables users to represent the workflows at an abstract level without needing to worry about the particulars of the target execution systems. The paper describes general issues in mapping applications and the functionality of Pegasus. We present the results of improving application performance through workflow restructuring which clusters multiple tasks in a workflow into single entities. A real-life astronomy application is used as the basis for the study.
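Clustering multiple tasks into single entities can be sketched in a few lines of Python. This is an illustration only, not Pegasus code: it assumes one simple strategy, grouping tasks that sit at the same depth of the dependency graph, and the names `levels` and `cluster_by_level` and the example workflow are hypothetical.

```python
from collections import defaultdict

def levels(deps):
    """Depth of each task: 0 with no parents, else 1 + the deepest parent."""
    depth = {}

    def visit(task):
        if task not in depth:
            depth[task] = 1 + max((visit(p) for p in deps[task]), default=-1)
        return depth[task]

    for task in deps:
        visit(task)
    return depth

def cluster_by_level(deps, size):
    """Group tasks of equal depth into clusters of at most `size` tasks.

    Tasks at the same depth never depend on each other, so each cluster
    can be submitted as a single job without breaking the ordering.
    """
    by_level = defaultdict(list)
    for task, d in sorted(levels(deps).items()):
        by_level[d].append(task)
    clusters = []
    for d in sorted(by_level):
        tasks = by_level[d]
        clusters += [tasks[i:i + size] for i in range(0, len(tasks), size)]
    return clusters

# Hypothetical workflow: one split, four analyses, one merge
deps = {
    "split": [],
    "a1": ["split"], "a2": ["split"], "a3": ["split"], "a4": ["split"],
    "merge": ["a1", "a2", "a3", "a4"],
}
clusters = cluster_by_level(deps, 2)
```

With a cluster size of 2, the six tasks become four jobs, which reduces per-job scheduling overhead at the cost of some parallelism inside each cluster.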

New Execution Paradigm for Data-Intensive Scientific Workflows

2009 Congress on Services - I, 2009

With the advent of Grid and service-oriented technologies, scientific workflows have been introduced in response to the increasing demand of researchers for assembling diverse, highly-specialized applications, allowing them to exchange large heterogeneous datasets in order to accomplish a complex scientific task. Much research has already been done to provide efficient scientific workflow management systems (WfMS).

Early Cloud Experiences with the Kepler Scientific Workflow System

2012

With the increasing popularity of Cloud computing, there is a growing need for scientific workflows to utilize Cloud resources. In this paper, we present our preliminary work and experiences on enabling the interaction between the Kepler scientific workflow system and the Amazon Elastic Compute Cloud (EC2). A set of EC2 actors and Kepler Amazon Machine Images are introduced, along with a discussion of their different usage modes.