Efficient Stream Provenance via Operator Instrumentation
Related papers
Perm: Efficient Provenance Support for Relational Databases
In many application areas, such as scientific computing, data warehousing, and data integration, detailed information about the origin of data is required. This kind of information is often referred to as data provenance. The provenance of a piece of data, a so-called data item, includes information about the source data from which it is derived and the transformations that led to its creation and current representation. In the context of relational databases, provenance has been studied from both a theoretical and an algorithmic perspective. Yet, in spite of the advances made, there are very few practical systems available that support generating, querying, and storing provenance information (we refer to such systems as provenance management systems, or PMS). These systems support only a subset of SQL, a severe limitation in practice since most of the application domains that benefit from provenance information use complex queries. Such queries typically involve nested sub-queries, aggregat...
A Generic Provenance Middleware for Database Queries, Updates, and Transactions
We present an architecture and prototype implementation for a generic provenance database middleware (GProM) that is based on the concept of query rewrites, which are applied to an algebraic graph representation of database operations. The system supports a wide range of provenance types and representations for queries, updates, transactions, and operations spanning multiple transactions. GProM supports several strategies for provenance generation, e.g., on-demand, rule-based, and "always on". To the best of our knowledge, we are the first to present a solution for computing the provenance of concurrent database transactions. Our solution can retroactively trace transaction provenance as long as an audit log and time travel functionality are available (both are supported by most DBMS). Other noteworthy features of GProM include: extensibility through a declarative rewrite rule specification language, support for multiple database backends, and an optimizer for rewrit...
Proceedings of the 7th ACM international conference on Distributed event-based systems - DEBS '13, 2013
A Hybrid Approach for Efficient Provenance Storage
Efficient provenance storage is an essential step towards the adoption of provenance. In this paper, we analyze the provenance collected from multiple workloads with a view towards efficient storage. Based on our analysis, we characterize the properties of provenance with respect to long-term storage. We then propose a hybrid scheme that takes advantage of the graph structure of provenance data and the inherent duplication in provenance data. Our evaluation indicates that our hybrid scheme, a combination of web graph compression (adapted for provenance) and dictionary encoding, provides the best tradeoff in terms of compression ratio, compression time, and query performance when compared to other compression schemes.
Evaluation of a Hybrid Approach for Efficient Provenance Storage
Provenance is the metadata that describes the history of objects. Provenance provides new functionality in a variety of areas, including experimental documentation, debugging, search, and security. As a result, a number of groups have built systems to capture provenance. Most of these systems focus on provenance collection, a few systems focus on building applications that use the provenance, but all of these systems ignore an important aspect: efficient long-term storage of provenance. In this article, we first analyze the provenance collected from multiple workloads and characterize the properties of provenance with respect to long-term storage. We then propose a hybrid scheme that takes advantage of the graph structure of provenance data and the inherent duplication in provenance data. Our evaluation indicates that our hybrid scheme, a combination of Web graph compression (adapted for provenance) and dictionary encoding, provides the best trade-off in terms of compression ratio, compression time, and query performance when compared to other compression schemes.