A history-tracing XML-based provenance framework for workflows
Related papers
A workflow modeling system for capturing data provenance
Computers & Chemical Engineering, 2014
A workflow is an abstraction of the steps associated with the underlying work process and is typically modeled as a directed graph. The workflow concept, under its various manifestations, has been used to model applications in diverse areas, including project planning, manufacturing, scientific experiments, execution of computer software, and publishing. While the Open Provenance Model Core Specification has laid the foundation for defining the key concepts in a workflow, a simplified, high-level graphical representation of a workflow that is widely applicable is not available. In this paper we describe a novel general framework for building workflows and implementing the associated actions, which will facilitate understanding of work processes across multiple disciplines. As such, most work processes are organized hierarchically with well-defined control and management responsibilities. This framework will facilitate integration and coordination of activities across associated domains. Additionally, it will act as a template for referring to the associated metadata, as well as a reference for accessing the instance data from archives of completed workflow cases. When a specific case is in progress, a finite state machine will guide the user through the steps and provide up-to-date information about the current state. We describe the main building blocks in the framework and their functionalities, and illustrate the integration of workflows between an experimental and a scientific process. (G.S. Joglekar). [...] force searches highly inefficient and time consuming. The current solution for such situations is to write custom computer software for every special requirement, linking multiple information repositories with disparate data identifiers and creating specific search protocols to mine for data related to the issue at hand.
Similarly, there are software companies that specialize in using natural language processing techniques to annotate information and reports that were created with word processors or spreadsheets. Therefore, moving forward, in order to avoid such case-by-case solutions, it is important to ensure that all information is captured in a semantically rich format.
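The abstract above describes a workflow as a directed graph of steps, with a finite state machine guiding the user through a case in progress. A minimal sketch of that idea follows. It is not the framework's actual implementation: the class name `WorkflowCase`, its methods, and the step names are all hypothetical and chosen only for illustration.

```python
class WorkflowCase:
    """One in-progress case of a workflow modeled as a directed graph of steps.

    Hypothetical illustration only; not the API of the framework described above.
    """

    def __init__(self, edges, start):
        # Adjacency list: step -> steps that may follow it.
        self.successors = {}
        for src, dst in edges:
            self.successors.setdefault(src, []).append(dst)
        self.state = start          # current state of the finite state machine
        self.history = [start]      # steps visited so far, in order

    def allowed_next(self):
        """Steps the user may take from the current state."""
        return list(self.successors.get(self.state, []))

    def advance(self, step):
        """Move to `step` if the graph has an edge from the current state."""
        if step not in self.successors.get(self.state, []):
            raise ValueError(f"{step!r} is not reachable from {self.state!r}")
        self.state = step
        self.history.append(step)


# Assumed example workflow: design -> run -> analyze, with a loop back to design.
case = WorkflowCase(
    edges=[("design", "run"), ("run", "analyze"), ("run", "design")],
    start="design",
)
case.advance("run")
print(case.state, case.allowed_next(), case.history)
```

Keeping the visited steps in `history` is one simple way to retain a record of a completed case. An actual provenance record would also need the data and metadata attached to each step.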
Towards a Taxonomy of Provenance in Scientific Workflow Management Systems
2009
Scientific Workflow Management Systems (SWfMS) have been helping scientists to prototype and execute in silico experiments. They can systematically collect provenance information for the derived data products, to be queried later. Despite the efforts to build a standard Open Provenance Model (OPM), provenance is tightly coupled to SWfMS. Thus, scientific workflow provenance concepts, representations and mechanisms are very heterogeneous, difficult to integrate and dependent on the SWfMS. To help compare, integrate and analyze scientific workflow provenance, this paper presents a taxonomy of provenance characteristics. Its classification enables computer scientists to distinguish between different perspectives of provenance and guides them toward a better understanding of provenance data in general. The analysis of existing approaches will assist us in managing provenance data from distributed heterogeneous workflow executions.
A graph model of data and workflow provenance
2010
Provenance has been studied extensively in both database and workflow management systems, so far with little convergence of definitions or models. Provenance in databases has generally been defined for relational or complex object data, by propagating fine-grained annotations or algebraic expressions from the input to the output. This kind of provenance has been found useful in other areas of computer science: annotation databases, probabilistic databases, schema and data integration, etc. In contrast, workflow provenance aims to capture a complete description of the evaluation (or enactment) of a workflow, and this is crucial to verification in scientific computation. Workflows and their provenance are often presented using graphical notation, making them easy to visualize but complicating the formal semantics that relates their run-time behavior with their provenance records. We bridge this gap by extending a previously developed dataflow language, which supports both database-style querying and workflow-style batch processing steps, to produce a workflow-style provenance graph that can be explicitly queried. We define and describe the model through examples, present queries that extract other forms of provenance, and give an executable definition of the graph semantics of dataflow expressions.
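To make the notion of an explicitly queryable provenance graph concrete, here is a minimal sketch of a lineage query over a small graph. It does not reproduce the paper's dataflow language or its graph semantics. The `lineage` function, the `derived_from` mapping, and the node names are all assumptions made for illustration.

```python
def lineage(derived_from, item):
    """Return every node that `item` transitively depends on.

    `derived_from` maps a node (data item or processing step) to the
    nodes it was directly derived from. Hypothetical representation.
    """
    seen = set()
    stack = [item]
    while stack:
        node = stack.pop()
        for parent in derived_from.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Assumed example: raw.csv -> clean_step -> clean.csv -> plot_step -> report.pdf
derived_from = {
    "report.pdf": ["plot_step"],
    "plot_step": ["clean.csv"],
    "clean.csv": ["clean_step"],
    "clean_step": ["raw.csv"],
}

print(sorted(lineage(derived_from, "report.pdf")))
```

The traversal mixes data nodes and step nodes in one graph, which is a common simplification. Separating the two kinds of node would allow queries such as "which steps touched this file" without filtering by name.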
Active Provenance for Data-Intensive Workflows: Engaging Users and Developers
2019 15th International Conference on eScience (eScience), 2019
We present a practical approach for provenance capturing in Data-Intensive workflow systems. It provides contextualisation by recording injected domain metadata with the provenance stream. It offers control over lineage precision, combining automation with specified adaptations. We address provenance tasks such as extraction of domain metadata, injection of custom annotations, accuracy and integration of records from multiple independent workflows running in distributed contexts. To allow such flexibility, we introduce the concepts of programmable Provenance Types and Provenance Configuration. Provenance Types handle domain contextualisation and allow developers to model lineage patterns by redefining API methods, composing easy-to-use extensions. Provenance Configuration, instead, enables users of a Data-Intensive workflow execution to prepare it for provenance capture, by configuring the attribution of Provenance Types to components and by specifying grouping into semantic clusters. This enables better searches over the lineage records. Provenance Types and Provenance Configuration are demonstrated in a system being used by computational seismologists. It is based on an extended provenance model, S-PROV.
Provenance in scientific workflow systems
2007
The automated tracking and storage of provenance information promises to be a major advantage of scientific workflow systems. We discuss issues related to data and workflow provenance, and present techniques for focusing user attention on meaningful provenance through "user views," for managing the provenance of nested scientific data, and for using information about the evolution of a workflow specification to understand the difference in the provenance of similar data products.
Establishing workflow trust using provenance information
2006
Workflow forms a key part of many existing Service Oriented applications, involving the integration of services that may be made available at distributed sites. It is possible to distinguish between an “abstract” workflow description outlining which services must be involved in a workflow execution and a “physical” workflow description outlining the instances of services that were used in a particular enactment.
Using Explicit Control Processes in Distributed Workflows to Gather Provenance
2008
Distributing workflow tasks among high performance environments involves local processing and remote execution on clusters and grids. This distribution often needs interoperation between heterogeneous workflow definition languages and their corresponding execution machines. A centralized Workflow Management System (WfMS) may locally control the execution of a workflow that needs a grid WfMS to execute a sub-workflow requiring high performance. Workflow specification languages often provide different control-flow execution structures. Moving from one environment to another requires mappings between these languages. Due to heterogeneity, control-flow structures available in one system may not be supported in another. In these heterogeneous distributed environments, provenance gathering also becomes heterogeneous. This work presents control-flow modules that aim to be independent of the WfMS. By inserting these control-flow modules into the workflow specification, the workflow execution control becomes less dependent on heterogeneous workflow execution engines. In addition, they can be used to gather provenance data from both local and remote execution, thus allowing the same provenance registration in both environments, independent of the heterogeneous WfMS. The proposed modules extend the ordinary workflow tasks by providing dynamic behavioral execution control. They were implemented in the VisTrails graphical workflow enactment engine, which offers a flexible infrastructure for provenance gathering.
Enhancing and abstracting scientific workflow provenance for data publishing
Proceedings of the Joint EDBT/ICDT 2013 Workshops (EDBT '13), 2013
Many scientists are using workflows to systematically design and run computational experiments. Once the workflow is executed, the scientist may want to publish the dataset generated as a result, to be, e.g., reused by other scientists as input to their experiments. In doing so, the scientist needs to curate such a dataset by specifying metadata that describes it, e.g., its derivation history, origins and ownership. To assist the scientist in this task, we explore in this paper the use of provenance traces collected by workflow management systems when enacting workflows. Specifically, we identify the shortcomings of such raw provenance traces in supporting the data publishing task, and propose an approach whereby distilled, yet more informative, provenance traces that are fit for the data publishing task can be derived.