Grigoris Karvounarakis | University of Pennsylvania
Papers by Grigoris Karvounarakis
We study conjunctive queries with unequalities (x ≠ y) and we identify cases when query containment can still be characterized by the existence of homomorphisms. We also identify a class of GLAV-like database schema mappings with unequalities, for which the chase theorem holds, and thus data exchange has the same complexity as for GLAV mappings. Finally, we define a notion of consistency and provide an algorithm to check whether a set of mappings is consistent.
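As a point of reference for the homomorphism characterization mentioned above, the following is a minimal sketch of the classical containment test for plain conjunctive queries. It deliberately does not handle unequalities or constants, which are the subject of the paper; the query encoding (a head tuple plus a list of `(predicate, arguments)` atoms) and the name `contained_in` are assumptions made only for this illustration.

```python
def contained_in(q1, q2):
    """Return True if q1 is contained in q2.

    Each query is (head_vars, body), where body is a list of
    (predicate, args) atoms over variables only. This is the classical
    test for plain conjunctive queries: q1 is contained in q2 iff there
    is a homomorphism from q2 to q1 that maps head to head.
    Unequalities and constants are NOT handled in this sketch.
    """
    head1, body1 = q1
    head2, body2 = q2
    if len(head1) != len(head2):
        return False

    # The homomorphism must send q2's head variables to q1's head variables.
    h = {}
    for v2, v1 in zip(head2, head1):
        if h.setdefault(v2, v1) != v1:
            return False

    def extend(i, h):
        # Try to map the i-th atom of q2 onto some atom of q1.
        if i == len(body2):
            return True
        pred, args = body2[i]
        for pred1, args1 in body1:
            if pred1 != pred or len(args1) != len(args):
                continue
            h2 = dict(h)  # copy so a failed attempt leaves h untouched
            if all(h2.setdefault(a, b) == b for a, b in zip(args, args1)):
                if extend(i + 1, h2):
                    return True
        return False

    return extend(0, h)


# Q1(x) :- E(x, y), E(y, x)      Q2(u) :- E(u, v)
q1 = (("x",), [("E", ("x", "y")), ("E", ("y", "x"))])
q2 = (("u",), [("E", ("u", "v"))])
print(contained_in(q1, q2), contained_in(q2, q1))
```

The search is exponential in the worst case, which is expected since containment of conjunctive queries is NP-complete.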
ISWC, 2003
Semantic Web (SW) technology aims to facilitate the integration of legacy data sources spread worldwide. Despite the plethora of SW languages (e.g., RDF/S, DAML+OIL, OWL) recently proposed for capturing data semantics, the vast majority of legacy sources still rely on relational databases (RDB) published on the Web or corporate intranets as virtual XML. In this paper, we advocate a Datalog framework for mediating high-level queries to relational and/or XML sources using community ontologies expressed in a SW language such as RDF/S. We describe the architecture and the reasoning services of our SW integration middleware, called SWIM, and we present the main design choices and techniques for supporting powerful mappings between different data models, as well as reformulation and optimization of queries expressed against mediation schemas and views.
WebDB, 2008
A key challenge in supporting information interchange is not only supporting queries over integrated data, but also updates. Previous work on update exchange has enabled update propagation over schema mappings in a unidirectional way, conceptually similar to view maintenance in that a derived instance gets updated based on changes to a source instance. In this paper, we consider how to support data and update propagation across bidirectional mappings that enable different sites to mirror each other's data. We describe how data and update exchange can be extended to support bidirectional updates, implement an algorithm to perform side effect-free update propagation in this model, and show preliminary results suggesting our approach is feasible.
Proceedings of the 33rd International Conference on Very Large Data Bases, 2007
We consider systems for data sharing among heterogeneous peers related by a network of schema mappings. Each peer has a locally controlled and edited database instance, but wants to ask queries over related data from other peers as well. To achieve this, every peer's updates propagate along the mappings to the other peers. However, this update exchange is filtered by trust conditions (expressing what data and sources a peer judges to be authoritative), which may cause a peer to reject another's updates. In order to support such filtering, updates carry provenance information. These systems target scientific data sharing applications, and their general principles and architecture have been described elsewhere.
IEEE Data(base) Engineering Bulletin, 2007
Proceedings of the 2010 international conference on Management of data - SIGMOD '10, 2010
Many advanced data management operations (e.g., incremental maintenance, trust assessment, debugging schema mappings, keyword search over databases, or query answering in probabilistic databases) involve computations that look at how a tuple was produced, e.g., to determine its score or existence. This requires answers to queries such as, "Is this data derivable from trusted tuples?"; "What tuples are derived from this relation?"; or "What score should this answer receive, given initial scores of the base tuples?". Such questions can be answered by consulting the provenance of query results.
Information systems such as organizational memories, vertical aggregators, infomediaries, etc. are expected to play a central role in the 21st-century economy by enabling the development and maintenance of specific communities of interest (e.g., enterprise, professional, trading) on corporate intranets or the Web. Such Community Web Portals essentially provide the means to select, classify and access, in a semantically meaningful and ubiquitous way, various information resources (e.g., sites, documents, data) for diverse target audiences (corporate, inter-enterprise, e-marketplace, etc.). Yet, in commercial software for deploying Community Portals, querying is still limited to full-text (or attribute-value) retrieval and more advanced information-seeking needs require navigational access. Furthermore, recent Web standards for describing resources (see the W3C Metadata Activity: RDF/RDF Schema) are completely ignored. Moreover, standard (relational or object) databases are too rigid ...
resources available on corporate intranets or the Internet. The Resource Description Framework (RDF) aims at facilitating the creation and exchange of metadata as any other Web data. The growing number of available information resources and the proliferation of description services in various user communities now lead to large volumes of RDF metadata. Managing such RDF resource descriptions and schemas with existing low-level APIs and file-based implementations does not ensure fast deployment and easy maintenance of real-scale RDF applications. In this paper, we advocate the use of database technology to support declarative access, as well as logical and physical independence, for voluminous RDF description bases.
Proceedings of the 16th International Conference on Database Theory - ICDT '13, 2013
We show that the evaluation of SPARQL algebra queries on various notions of annotated RDF graphs can be seen as particular cases of the evaluation of these queries on RDF graphs annotated with elements of so-called spm-semirings. Spm-semirings extend semirings, used for positive relational algebra queries on annotated relational data, with a new operator to capture the semantics of the non-monotone SPARQL operator OPTIONAL. Furthermore, spm-semiring-based annotations ensure that desired SPARQL query equivalences hold when querying annotated RDF. In addition to introducing spm-semirings, we study their properties and provide an alternative characterization of these structures in terms of semirings with an embedded Boolean algebra (or seba-structure for short). This characterization allows us to construct spm-semirings and to identify a universal object in the class of spm-semirings. Finally, we show that this universal object provides a concise provenance representation and can be used to evaluate SPARQL queries on arbitrary spm-semiring-annotated RDF graphs.
Lecture Notes in Computer Science, 2013
Assessing the quality of linked data currently published on the Web is a crucial need of various data-intensive applications. Extensive work on similar applications for relational data and queries has shown that data provenance can be used in order to compute trustworthiness, reputation and reliability of query results, based on the source data and query operators involved in their derivation. In particular, abstract provenance models can be employed to record information about source data and query operators during query evaluation, and later be used, e.g., to assess trust for individual query results. In this paper, we investigate the extent to which relational provenance models can be leveraged for capturing the provenance of SPARQL queries over linked data, and identify their limitations. To overcome these limitations, we advocate the need for new provenance models that capture the full expressive power of SPARQL, and can be used to support assessment of various forms of data quality for linked data manipulated declaratively by such queries. An earlier version of this paper appeared in IEEE Internet Computing 15(1): 31-39, 2011.
Lecture Notes in Computer Science, 2005
In this paper we benchmark three popular database representations of RDF/S schemata and data: (a) a schema-aware representation (i.e., one table per RDF/S class or property) with explicit (ISA) or implicit (NOISA) storage of subsumption relationships, (b) a schema-oblivious representation (i.e., a single table with triples of the form subject-predicate-object), using (ID) or not (URI) identifiers to represent resources, and (c) a hybrid of the schema-aware and schema-oblivious representations (i.e., one table per RDF/S meta-class by distinguishing also the range type of properties). Furthermore, we benchmark two common approaches for evaluating taxonomic queries: either on-the-fly (ISA, NOISA, Hybrid), or by precomputing the transitive closure of subsumption relationships (MatView, URI, ID). The main conclusion drawn from our experiments is that the evaluation of taxonomic queries is most efficient over RDF/S stores utilizing the Hybrid and MatView representations. Of the rest, schema-aware representations (ISA, NOISA) exhibit overall better performance than URI, which is superior to that of ID, which exhibits the overall worst performance.
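To make the "precompute the transitive closure of subsumption relationships" strategy concrete, here is a small in-memory sketch. It is not the benchmarked storage scheme: the pair-based data layout and the function names `transitive_closure` and `instances_of` are assumptions chosen for illustration only.

```python
from collections import defaultdict


def transitive_closure(subclass_of):
    """Precompute all (descendant, ancestor) pairs from direct subclass edges."""
    parents = defaultdict(set)
    for child, parent in subclass_of:
        parents[child].add(parent)

    closure = set()
    for start in list(parents):
        stack = list(parents[start])
        seen = set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            closure.add((start, node))
            # .get avoids inserting new keys into the defaultdict
            stack.extend(parents.get(node, ()))
    return closure


def instances_of(cls, type_pairs, closure):
    """Taxonomic query: resources typed with cls or with any of its subclasses."""
    wanted = {cls} | {desc for desc, anc in closure if anc == cls}
    return {resource for resource, c in type_pairs if c in wanted}


# Hypothetical toy schema and data.
schema = [("Painter", "Artist"), ("Sculptor", "Artist"), ("Artist", "Person")]
data = [("picasso", "Painter"), ("rodin", "Sculptor"), ("alice", "Person")]

closure = transitive_closure(schema)
print(sorted(instances_of("Artist", data, closure)))
```

The trade-off this illustrates is the usual one: the closure costs extra space and must be maintained when the schema changes, but a taxonomic query then becomes a lookup rather than a traversal at query time.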
Lecture Notes in Computer Science, 2012
The modern enterprise software stack (a collection of applications supporting bookkeeping, analytics, planning, and forecasting for enterprise data) is in danger of collapsing under its own weight. The task of building and maintaining enterprise software is tedious and laborious; applications are cumbersome for end-users; and adapting to new computing hardware and infrastructures is difficult. We believe that much of the complexity in today's architecture is accidental, rather than inherent. This tutorial provides an overview of the LogicBlox platform, an ambitious redesign of the enterprise software stack centered around a unified declarative programming model, based on an extended version of Datalog.
Lecture Notes in Computer Science, 2001
We distinguish between two broad categories of e-services: discrete services (e.g., add item to shopping cart, charge a credit card), and session-oriented ones (teleconference, collaborative text chat, streaming video, c-commerce interactions). Discrete services typically have short duration, and cannot respond to external asynchronous events. Session-oriented services have longer duration (perhaps hours), and typically can respond to asynchronous events (e.g., the ability to add a new participant to a teleconference). When composing discrete e-services it usually suffices to use a process model and engine that composes the e-services as relatively independent tasks. But when composing session-oriented e-services, the engine must be able to receive asynchronous events and determine how and whether to impact the active sessions. For example, if a teleconference participant loses his wireless connection then it might be appropriate to trigger an announcement to some or all of the other participants. In this paper we propose a process model and architecture for flexible composition and execution of discrete and session-oriented services. Unlike previous approaches, our model permits the specification of scripted "active flowcharts" that can be triggered by asynchronous events, and can appropriately impact active sessions. We introduce here a model and language for specifying process schemas (essentially a collection of active flowcharts) that combine multiple e-services, and describe a prototype engine for executing these process schemas.
Proceedings of the twenty-sixth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems - PODS '07, 2007
We show that relational algebra calculations for incomplete databases, probabilistic databases, bag semantics and why-provenance are particular cases of the same general algorithms involving semirings. This further suggests a comprehensive provenance representation that uses semirings of polynomials. We extend these considerations to datalog and semirings of formal power series. We give algorithms for datalog provenance calculation as well as datalog evaluation for incomplete and probabilistic databases. Finally, we show that for some semirings containment of conjunctive queries is the same as for standard set semantics.
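The idea that one algorithm covers several semantics can be sketched in a few lines. The example below evaluates a single join-and-project query over relations whose tuples carry annotations from a semiring, then instantiates it with Booleans, multiplicities (bag semantics), and provenance polynomials. The `Semiring` class, the dictionary encoding of relations and polynomials, and all names here are assumptions made for this illustration, not code from the paper.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Semiring:
    zero: Any
    one: Any
    plus: Callable[[Any, Any], Any]
    times: Callable[[Any, Any], Any]


BOOL = Semiring(False, True, lambda a, b: a or b, lambda a, b: a and b)
BAG = Semiring(0, 1, lambda a, b: a + b, lambda a, b: a * b)


# Polynomials with natural-number coefficients, encoded as
# {monomial: coefficient}, where a monomial is a sorted tuple of variables.
def poly_plus(p, q):
    out = dict(p)
    for m, c in q.items():
        out[m] = out.get(m, 0) + c
    return out


def poly_times(p, q):
    out = {}
    for m1, c1 in p.items():
        for m2, c2 in q.items():
            m = tuple(sorted(m1 + m2))
            out[m] = out.get(m, 0) + c1 * c2
    return out


POLY = Semiring({}, {(): 1}, poly_plus, poly_times)


def join_project(r, s, sr):
    """Evaluate Q(x, z) :- R(x, y), S(y, z) over annotated relations.

    Joint use of tuples multiplies annotations; alternative derivations
    of the same output tuple are added.
    """
    out = {}
    for (x, y1), a in r.items():
        for (y2, z), b in s.items():
            if y1 == y2:
                key = (x, z)
                out[key] = sr.plus(out.get(key, sr.zero), sr.times(a, b))
    return out


# Bag semantics: annotations are multiplicities.
R = {("a", "b"): 2, ("a", "c"): 1}
S = {("b", "d"): 3, ("c", "d"): 1}
print(join_project(R, S, BAG))

# Provenance polynomials: each base tuple is annotated with its own variable.
Rp = {("a", "b"): {("r1",): 1}, ("a", "c"): {("r2",): 1}}
Sp = {("b", "d"): {("s1",): 1}, ("c", "d"): {("s2",): 1}}
print(join_project(Rp, Sp, POLY))
```

In the bag case the output tuple `("a", "d")` receives multiplicity 2·3 + 1·1 = 7; in the polynomial case it receives r1·s1 + r2·s2, which records both derivations and can later be specialized to other semirings by substituting values for the variables.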
ACM SIGMOD Record, 2012
We present an overview of the literature on querying semiring-annotated data, a notion we introduced five years ago in a paper with Val Tannen. First, we show that positive relational algebra calculations for various forms of annotated relations, as well as provenance models for such queries, are particular cases of the same general algorithm involving commutative semirings. For this reason, we present a formal framework for answering queries on data with annotations from commutative semirings, and propose a comprehensive provenance representation based on semirings of polynomials. We extend these considerations to XQuery views over annotated, unordered XML data, and show that the semiring framework suffices for a large positive fragment of XQuery applied to such data. Finally, we conclude with a brief overview of the large body of work that builds upon these results, including both extensions to the theoretical foundations and uses in practical applications.
ACM SIGMOD Record, 2008
Sharing structured data today requires standardizing upon a single schema, then mapping and cleaning all of the data. This results in a single queriable mediated data instance. However, for settings in which structured data is being collaboratively authored by a large community, e.g., in the sciences, there is often a lack of consensus about how it should be represented, what is correct, and which sources are authoritative. Moreover, such data is seldom static: it is frequently updated, cleaned, and annotated. The ORCHESTRA collaborative data sharing system develops a new architecture and consistency model for such settings, based on the needs of data sharing in the life sciences. In this paper we describe the basic architecture and implementation of the ORCHESTRA system, and summarize some of the open challenges that arise in this setting.
Proceedings of the 2007 ACM SIGMOD international conference on Management of data - SIGMOD '07, 2007