C-set: a commutative replicated data type for semantic stores

2011

Abstract

Web 2.0 tools are currently evolving to embrace Semantic Web technologies. Blogs, CMSs, wikis, social networks and real-time notifications integrate ways to provide semantic annotations and therefore contribute to Linked Data and, more generally, to the Semantic Web vision. This evolution generates many semantic datasets of varying quality and trust levels that are partially replicated, which raises the issue of managing consistency among these replicas.

Col-Graph: Towards Writable and Scalable Linked Open Data

Lecture Notes in Computer Science, 2014

Linked Open Data faces severe issues of scalability, availability and data quality. These issues are observed by data consumers performing federated queries: SPARQL endpoints do not respond, and results can be wrong or out-of-date. If a data consumer finds an error, how can she fix it? This raises the issue of the writability of Linked Data. In this paper, we devise an extension of the federation of Linked Data to data consumers. A data consumer can make partial copies of different datasets and make them available through a SPARQL endpoint. A data consumer can update her local copy and share updates with data providers and consumers. Update sharing improves general data quality, and replicated data creates opportunities for federated query engines to improve availability. However, when updates occur in an uncontrolled way, consistency issues arise. We define fragments as SPARQL CONSTRUCT federated queries and propose a correction criterion to maintain these fragments incrementally without re-evaluating the query. We define a coordination-free protocol based on counting triple derivations and on provenance. We analyze the theoretical complexity of the protocol in time, space and traffic. Experimental results suggest that our approach scales to Linked Data.
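The counting idea can be sketched in a few lines. In this illustrative Python fragment (the `Fragment` class and its method names are ours, not the paper's API), a consumer-side fragment annotates each triple with a derivation count; an incoming insert increments the count, a delete decrements it, and a triple disappears only when no derivation remains, so updates that produce the same triple through different paths do not cancel each other.

```python
from collections import defaultdict

class Fragment:
    """Consumer-side copy of a fragment, with each triple annotated by a
    derivation count (how many times it has been produced by the source).
    Illustrative sketch only, not the Col-Graph implementation."""

    def __init__(self):
        self.counts = defaultdict(int)  # triple -> derivation count

    def apply_insert(self, triple):
        # An incoming insert adds one derivation for the triple.
        self.counts[triple] += 1

    def apply_delete(self, triple):
        # A delete removes one derivation; the triple disappears only when
        # no derivation remains, so other surviving derivations keep it.
        if self.counts.get(triple, 0) > 0:
            self.counts[triple] -= 1
            if self.counts[triple] == 0:
                del self.counts[triple]

    def triples(self):
        return set(self.counts)

# Example: two updates derive the same triple; one delete leaves it visible.
f = Fragment()
t = ("ex:alice", "foaf:knows", "ex:bob")
f.apply_insert(t)
f.apply_insert(t)
f.apply_delete(t)
assert t in f.triples()
```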

On Versioning and Archiving Semantic Web Data

This paper concerns versioning services over Semantic Web (SW) repositories. We propose a novel storage index based on partial orders, called POI, that exploits the fact that RDF Knowledge Bases (KBs) (a) do not have a unique serialization (as texts do) and (b) their versions are usually related by containment (⊆). We discuss the benefits and drawbacks of this approach in terms of storage space and efficiency, both analytically and experimentally, in comparison with existing approaches (including the change-based approach). We report experimental results over synthetic data sets showing that POI offers notable space savings, e.g. the compression ratio (uncompressed/compressed size) ranges between 1,800% and 18,163%, as well as efficiency in various cross-version operations. POI is equipped with three version insertion algorithms and can also be exploited in cases where the set of KBs does not fit in main memory. Although the focus of this work is SW data versioning, POI can be considered a generic indexing scheme for storing set-valued data.
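As a rough illustration of why containment helps (this is only the intuition with invented function names, not the POI data structure itself), a version that contains an earlier one can be stored as that earlier version plus the triples it adds:

```python
def delta_store(versions):
    """versions: list of (version_id, set_of_triples) in insertion order.
    For each version, pick the largest already-stored version it contains
    and keep only the difference; a sketch of exploiting containment."""
    stored = {}   # version_id -> (base_id, added_triples)
    full = {}     # version_id -> full triple set, kept only to compute deltas
    for vid, triples in versions:
        base = max(
            (b for b in full if full[b] <= triples),   # subset candidates
            key=lambda b: len(full[b]),
            default=None,
        )
        added = triples - (full[base] if base else set())
        stored[vid] = (base, added)
        full[vid] = triples
    return stored

v1 = {("s", "p", "o1")}
v2 = v1 | {("s", "p", "o2")}
print(delta_store([("v1", v1), ("v2", v2)]))
# v2 is stored as v1 plus the single triple it adds
```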

Ontology Consistency and Instance Checking For Real World Linked Data

Ontology Consistency and Instance Checking for Real World Linked Data, 2015

Many large ontologies have been created that make use of OWL's expressiveness for specification. However, tools to ensure that instance data complies with the schema are often not well integrated with triple stores and cannot detect certain classes of schema-instance inconsistency due to the assumptions of the OWL axioms. This can lead to lower-quality, inconsistent data. We have developed a simple ontology consistency and instance checking service, SimpleConsist [8]. We also define a number of ontology design best-practice constraints on OWL or RDFS schemas. Our implementation allows the user to specify which constraints should be applied to schema and instance data.
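The kind of check such a service performs can be sketched as a closed-world constraint over instance data. The snippet below is illustrative only (it is not SimpleConsist, and the function and constraint names are ours): it flags triples whose subject is not declared as an instance of the property's rdfs:domain.

```python
def check_domain_constraint(triples, schema_domains, types):
    """Flag triples whose subject is not declared to be an instance of the
    property's rdfs:domain, reading the domain axiom as a closed-world
    constraint rather than using it for inference. Illustrative sketch."""
    violations = []
    for s, p, o in triples:
        domain = schema_domains.get(p)
        if domain is not None and domain not in types.get(s, set()):
            violations.append((s, p, o, domain))
    return violations

schema = {"ex:worksFor": "ex:Person"}          # rdfs:domain of ex:worksFor
types = {"ex:alice": {"ex:Person"}}            # declared rdf:type assertions
data = [("ex:alice", "ex:worksFor", "ex:acme"),
        ("ex:acme", "ex:worksFor", "ex:megacorp")]
print(check_domain_constraint(data, schema, types))
# [('ex:acme', 'ex:worksFor', 'ex:megacorp', 'ex:Person')]
```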

Conflict-Free Partially Replicated Data Types

2015

Designers of large user-oriented distributed applications, such as social networks and mobile applications, have adopted measures to improve the responsiveness of their applications. Latency is a major concern because users are very sensitive to it. Geo-replication is a commonly used mechanism to bring data closer to clients. Nevertheless, reaching the closest datacenter can still be considerably slow. Thus, to further reduce access latency, mobile and web applications may be forced to replicate data at the client side. Unfortunately, fully replicating large data structures may still be a waste of resources, especially for thin clients. We propose a replication mechanism built upon conflict-free replicated data types (CRDTs) to seamlessly replicate parts of large data structures. We define partial replication and give an approach that keeps the strong eventual consistency properties of CRDTs with partial replicas. We integrate our mechanism into SwiftCloud, a transactional system that brings geo-replication to clients. We evaluate the solution with a content-sharing application. Our results show improvements in bandwidth, memory, and latency over both classical geo-replication and the existing SwiftCloud solution.

Writes that fall in the forest and make no sound: Semantics-based adaptive data consistency

Datastores today rely on distribution and replication to achieve improved performance and fault-tolerance. But the correctness of many applications depends on strong consistency properties, something that can impose substantial overheads, since it requires coordinating the behavior of multiple nodes. This paper describes a new approach to achieving strong consistency in distributed systems while minimizing communication between nodes. The key insight is to allow the state of the system to be inconsistent during execution, as long as this inconsistency is bounded and does not affect transaction correctness. In contrast to previous work, our approach uses program analysis to extract semantic information about permissible levels of inconsistency and is fully automated. We then employ a novel homeostasis protocol to allow sites to operate independently, without communicating, as long as any inconsistency is governed by appropriate treaties between the nodes. We discuss mechanisms for optimizing treaties based on workload characteristics to minimize communication, as well as a prototype implementation and experiments that demonstrate the benefits of our approach on common transactional benchmarks.

Rather than accepting the trade-off between responsiveness and consistency, we demonstrate that by carefully analyzing applications it is possible to achieve the best of both worlds: strong consistency and low latency in the common case. The key idea is to exploit the semantics of the transactions involved in the execution of an application in a way that is safe and completely transparent to programmers. It is well known that strong consistency is not always required to execute transactions correctly [20, 41], and this insight has been exploited in protocols that allow transactions to operate on slightly stale replicas as long as the staleness is "not enough to affect correctness" [5, 41]. This paper takes this basic idea much further and develops mechanisms for automatically extracting safety predicates from application source code. Our homeostasis protocol uses these predicates to allow sites to operate without communicating, as long as any inconsistency is appropriately governed. Unlike prior work, our approach is fully automated and does not require programmers to provide any information about the semantics of transactions.

Example: top-k query. To illustrate the key ideas behind our approach in further detail, consider a top-k query over a distributed datastore, as illustrated in Figure 1. For simplicity we consider the case where k = 2. The system consists of a number of item sites that each maintain a collection of (key, value) pairs, which could represent data such as airline reservations or customer purchases. An aggregator site maintains a list of the top-k items sorted in descending order by value. Each item site periodically receives new insertions, and the aggregator site updates the top-k list as needed. A simple algorithm that implements the top-k query is to have each item site communicate new insertions to the aggregator site, which inserts them into the current top-k list in order and removes the smallest element of the list. However, every insertion requires a communication round with the aggregator site, even if most of the inserts are for objects not in the top-k. A better idea is to only communicate with the aggregator node if the new value is greater than the minimal value of the current top-k list. Each site can maintain a cached copy of the smallest value in the top-k and only notify the aggregator site if an item with a larger value is inserted into its local state. This algorithm is illustrated in Figure 2, where each item site has a variable min holding the current lowest top-k value. In expectation, most item inserts do not affect the aggregator's behavior, and consequently it is safe for them to remain unobserved by the aggregator site. This improved top-k algorithm is essentially a simplified distributed version of the well-known threshold algorithm for top-k computation [14]. However, note that this algorithm can be ex…
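A minimal, single-process sketch of the cached-threshold idea follows. The class names, the synchronous `offer` call standing in for a message round, and the single aggregator are our simplifications; the actual homeostasis protocol derives such treaties automatically from program analysis rather than hard-coding them.

```python
import heapq

class Aggregator:
    """Maintains the global top-k list of (value, key) pairs."""
    def __init__(self, k):
        self.k = k
        self.topk = []  # min-heap of (value, key)

    def offer(self, key, value):
        # Insert the candidate, keep only the k largest, and return the new
        # threshold (the smallest value currently in the top-k).
        if len(self.topk) < self.k:
            heapq.heappush(self.topk, (value, key))
        elif value > self.topk[0][0]:
            heapq.heapreplace(self.topk, (value, key))
        return self.topk[0][0] if len(self.topk) == self.k else float("-inf")

class ItemSite:
    """Holds local (key, value) pairs and a cached copy of the aggregator's
    current minimum top-k value; contacts the aggregator only when a new
    insertion could change the top-k."""
    def __init__(self, aggregator):
        self.aggregator = aggregator
        self.items = {}
        self.cached_min = float("-inf")
        self.messages_sent = 0

    def insert(self, key, value):
        self.items[key] = value
        if value > self.cached_min:      # "treaty" violated: must coordinate
            self.cached_min = self.aggregator.offer(key, value)
            self.messages_sent += 1
        # otherwise the insert stays local and is never seen by the aggregator

agg = Aggregator(k=2)
site = ItemSite(agg)
for key, value in [("a", 10), ("b", 7), ("c", 3), ("d", 2), ("e", 1)]:
    site.insert(key, value)
print(sorted(agg.topk, reverse=True), site.messages_sent)
# [(10, 'a'), (7, 'b')] 2  -- only two of five inserts required communication
```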

Creating a Relational Distributed Object Store

ArXiv, 2013

In and of itself, data storage has apparent business utility. But when we can convert data to information, the utility of stored data increases dramatically. It is the layering of relation atop the data mass that is the engine for such conversion. Frank relation amongst discrete objects sporadically ingested is rare, making the process of synthesizing such relation all the more challenging, but the challenge must be met if we are ever to see an equivalent business value for unstructured data as we already have with structured data. This paper describes a novel construct, referred to as a relational distributed object store (RDOS), that seeks to solve the twin problems of how to persistently and reliably store petabytes of unstructured data while simultaneously creating and persisting relations amongst billions of objects.

owl:sameAs and Linked Data: An Empirical Study

2015

Linked Data is a steadily growing presence on the Web. In Linked Data, the description of resources can be obtained incrementally by dereferencing the URIs of resources via the HTTP protocol. The use of owl:sameAs further enriches the Linked Data space by declaratively supporting distributed semantic data integration at the instance level. When consuming Linked Data, users should handle owl:sameAs with care: URIs linked by owl:sameAs may not be appropriate for simple aggregation, and recursively exploring owl:sameAs may lead to considerable network overhead. In this work, we discuss and conduct an empirical pilot study on the usage of owl:sameAs in the Linked Data community. The results include initial quantitative measures of the usage of owl:sameAs. Based on observations of these results, we further discuss several strategies for dealing with owl:sameAs in Linked Data applications.
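The aggregation behaviour the study cautions about can be illustrated with a small union-find over owl:sameAs pairs (an illustrative sketch with made-up URIs, not the paper's measurement code): a single dubious link merges otherwise distinct resources into one cluster.

```python
def sameas_clusters(pairs):
    """Group URIs connected by owl:sameAs links using union-find.
    Everything in a cluster would be treated as one resource under simple
    aggregation, which may be too coarse in practice."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for a, b in pairs:
        union(a, b)

    clusters = {}
    for uri in parent:
        clusters.setdefault(find(uri), set()).add(uri)
    return list(clusters.values())

links = [("dbpedia:Paris", "geonames:2988507"),
         ("geonames:2988507", "ex:Paris_Texas")]   # a dubious sameAs link
print(sameas_clusters(links))
# one cluster containing all three URIs: the error propagates to the whole group
```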

A formalism for consistency and partial replication

2004

Replicating data in a distributed system improves availability at the cost of maintaining consistency, since each site's view may be partial or stale. Although a number of protocols have been proposed to achieve various degrees of consistency [1-4], we lack a common framework for understanding and comparing them. This paper presents such a framework.

References (14)

  1. Christian Bizer, Tom Heath, and Tim Berners-Lee. Linked Data - The Story So Far. International Journal on Semantic Web and Information Systems, 4(2):1-22, January 2009.
  2. Christian Bizer, Jens Lehmann, Georgi Kobilarov, Sören Auer, Christian Becker, Richard Cyganiak, and Sebastian Hellmann. DBpedia - A crystallization point for the Web of Data. Web Semantics: Science, Services and Agents on the World Wide Web, 7(3):154-165, September 2009.
  3. Min Cai and Martin Frank. RDFPeers: a scalable distributed RDF repository based on a structured peer-to-peer network. In Proceedings of the 13th International Conference on World Wide Web, pages 650-657. ACM, 2004.
  4. P.A. Chirita, Stratos Idreos, Manolis Koubarakis, and Wolfgang Nejdl. Publish/subscribe for RDF-based P2P networks. The Semantic Web: Research and Applications, pages 182-197, 2004.
  5. P. Johnson and R. Thomas. RFC 677: The maintenance of duplicate databases. 1976.
  6. Wolfgang Nejdl, Boris Wolf, Changtao Qu, and Stefan Decker. EDUTELLA: a P2P networking infrastructure based on RDF. Proceedings of the, 2002.
  7. Gérald Oster, Pascal Urso, Pascal Molli, and Abdessamad Imine. Data Consistency for P2P Collaborative Editing. In Conference on Computer-Supported Cooperative Work, 2006.
  8. Nuno Preguica, Joan Manuel Marques, Marc Shapiro, and Mihai Letia. A Commutative Replicated Data Type for Cooperative Editing. 2009 29th IEEE International Conference on Distributed Computing Systems, pages 395-403, June 2009.
  9. Yasushi Saito and Marc Shapiro. Optimistic replication. ACM Computing Surveys, 37(1):42-81, March 2005.
  10. Tim Berners-Lee and Dan Connolly. Delta: an ontology for the distribution of differences between RDF graphs. http://www.w3.org/DesignIssues/Diff, 2004.
  11. Giovanni Tummarello, Christian Morbidoni, R. Bachmann-Gmur, and Orri Erling. RDFSync: efficient remote synchronization of RDF models. Proceedings of ISWC/ASWC 2007, pages 537-551, 2007.
  12. Giovanni Tummarello, Christian Morbidoni, Joakim Petersson, Paolo Puliti, and F. Piazza. RDFGrowth, a P2P annotation exchange algorithm for scalable Semantic Web applications. The First International Workshop on Peer-to-Peer Knowledge Management, 2004.
  13. Stéphane Weiss, Pascal Urso, and Pascal Molli. Logoot: a scalable optimistic replication algorithm for collaborative editing on P2P networks. In International Conference on Distributed Computing Systems (ICDCS). IEEE, 2009.
  14. Stéphane Weiss, Pascal Urso, and Pascal Molli. Logoot-Undo: distributed collaborative editing system on P2P networks. IEEE Transactions on Parallel and Distributed Systems, 21(8), 2010.

Synchronizing semantic stores with commutative replicated data types

2012

Social Semantic Web technologies have led to huge amounts of data and information being available. Producing knowledge from this information is challenging, and major efforts, like DBpedia, have been made to make it a reality. Linked Data interconnects this information, extending the scope of knowledge production.

B-Set: a synchronization method for distributed semantic stores

Nowadays, there is increasing interest in developing methods for synchronizing distributed triple stores by ensuring eventual data consistency in a distributed architecture. The best-known of these have been designed around commutative replicated data types (CRDTs), in which all concurrent operations commute without centralized control. In this context, CRDTs have been proposed for semantic stores, such as SWOOKI, C-Set and SU-Set. However, none of the existing synchronization solutions explains how to ensure the Causality, Consistency and Intention preservation criteria of the CCI model. This paper proposes B-Set, a new CRDT for the synchronization of semantic stores. B-Set is designed not only to ensure convergence of triple replicas but also to preserve users' intentions in a distributed architecture. Sets of operations are also defined to allow concurrent editing of the same shared triple stores.
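The general mechanism shared by counter-based set CRDTs such as C-Set can be sketched as follows (an illustration of the technique under simplified assumptions, not the exact C-Set or B-Set design): each triple carries a counter, add and delete respectively increment and decrement it, a triple is visible while its counter is positive, and the operations therefore commute.

```python
from collections import defaultdict

class CounterSet:
    """One replica of a counter-based replicated set of RDF triples.
    add/delete produce operations that are broadcast to other replicas;
    since applying an operation only increments or decrements a per-triple
    counter, concurrent operations commute and replicas converge."""

    def __init__(self):
        self.counters = defaultdict(int)

    def add(self, triple):
        op = ("add", triple)
        self.apply(op)
        return op          # operation to broadcast

    def delete(self, triple):
        op = ("del", triple)
        self.apply(op)
        return op

    def apply(self, op):
        kind, triple = op
        # Counters may go negative transiently if a delete arrives before
        # the matching add; the final value is order-independent.
        self.counters[triple] += 1 if kind == "add" else -1

    def lookup(self):
        return {t for t, c in self.counters.items() if c > 0}

# Two replicas receive the same operations in different orders and converge.
t1 = ("ex:alice", "foaf:knows", "ex:bob")
t2 = ("ex:alice", "foaf:knows", "ex:carol")
r1, r2 = CounterSet(), CounterSet()
ops = [r1.add(t1), r1.add(t2), r1.delete(t1)]
for op in reversed(ops):
    r2.apply(op)
assert r1.lookup() == r2.lookup() == {t2}
```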

Consistency awareness in a distributed collaborative system for semantic stores

In distributed collaborative systems for editing semantic stores, multiple users can add, delete and change RDF statements, starting from the same replicas and arriving at the same results at the end of the collaborative session. To improve the performance of such systems, an efficient awareness mechanism is very important in order to help users better understand the evolution of the semantic stores. Moreover, maintaining consistency in a replicated architecture is one of the most significant problems. However, none of the existing approaches describes how to define an awareness mechanism for distributed semantic stores undergoing concurrent changes. In this paper, we propose a new optimistic replication solution called AB-Set, which not only ensures a consistency criterion when editing data but also uses Semantic Web technologies to define an awareness mechanism that makes users aware of the different states of the store they share and update, regardless of the concurrency level.

An optimized conflict-free replicated set

2012

Eventual consistency of replicated data supports concurrent updates, reduces latency and improves fault tolerance, but forgoes strong consistency. Accordingly, several cloud computing platforms implement eventually-consistent data types. The set is a widespread and useful abstraction, and many replicated set designs have been proposed. We present a reasoning abstraction, permutation equivalence, that systematizes the characterization of the expected concurrency semantics of concurrent types.
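For context, the baseline such optimizations start from is the observed-remove set, where each add attaches a unique tag and a remove deletes only the tags it has observed, giving add-wins semantics under concurrency. The sketch below shows that baseline with our own names and a single-process simulation; it is not the paper's optimized construction.

```python
import itertools

class ORSet:
    """A (non-optimized) observed-remove set replica: each add attaches a
    unique tag; remove deletes only locally observed tags, so a concurrent
    add wins. Baseline sketch of the concurrency semantics discussed."""

    _tag_counter = itertools.count()   # unique tags for this demo process

    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.entries = set()           # (element, tag) pairs

    def add(self, element):
        tag = (self.replica_id, next(self._tag_counter))
        op = ("add", element, {tag})
        self.apply(op)
        return op                      # operation to broadcast

    def remove(self, element):
        observed = {t for e, t in self.entries if e == element}
        op = ("rem", element, observed)
        self.apply(op)
        return op

    def apply(self, op):
        kind, element, tags = op
        if kind == "add":
            self.entries |= {(element, t) for t in tags}
        else:
            self.entries -= {(element, t) for t in tags}

    def lookup(self):
        return {e for e, _ in self.entries}

# Concurrent add and remove of the same element: add wins on both replicas.
a, b = ORSet("A"), ORSet("B")
op0 = a.add("x"); b.apply(op0)     # both replicas initially see x
op_rem = a.remove("x")             # A removes the tags it has observed
op_add = b.add("x")                # concurrently, B adds x again
a.apply(op_add); b.apply(op_rem)
assert a.lookup() == b.lookup() == {"x"}
```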

srCE: a collaborative editing of scalable semantic stores on P2P networks

Commutative Replicated Data Type (CRDT) is an approach that ensures consistency maintenance of replicas in collaborative editors over Peer-to-Peer (P2P) networks. This technique has been successfully applied to several data representation types in scalable collaborative editing, namely linear and tree document structures and semi-structured data, but not yet to the set data type while ensuring the Causality, Consistency and Intention (CCI) preservation criteria. In this paper, we propose srCE, a novel CRDT for a set structure that facilitates the collaborative and concurrent editing of Resource Description Framework (RDF) stores at large scale by different members of a virtual community. Our approach ensures the CCI model and is not tied to a specific case, so it can be applied to any document that conforms to a set structure. A prototype implementation using Friend of a Friend (FOAF) data sets, with and without the srCE model, illustrates significant improvements in scalability and performance.

The three dimensions of data consistency

2005

Replication and consistency are essential features of any distributed system and have been studied extensively; however, a systematic comparison is lacking. We therefore developed the Action-Constraint Framework, which captures both the semantics of replicated data and the behaviour of a replication algorithm. It enables us to decompose the problem of ensuring consistency into three simpler, easily understandable subproblems. As the subproblems are largely orthogonal, sub-solutions can be mixed and matched.

Supporting Scalable, Persistent Semantic Web Applications

IEEE Data(base) Engineering Bulletin - DEBU, 2003

To realize the vision of the Semantic Web, efficient storage and retrieval of large RDF data sets is required. A common technique for persisting RDF data (graphs) is to use a single relational database table, a triple store. But we believe a single triple store cannot scale to large-scale applications. This paper describes storing and querying persistent RDF graphs in Jena, a Semantic Web programmers' toolkit. Jena augments the triple store with property tables that cluster multiple property values in a single table row. We also describe two tools to assist in designing application-specific RDF storage schemas. The first is a synthetic data generator that generates RDF graphs consistent with an underlying ontology. The second mines an RDF graph or an RDF query log for frequently occurring patterns. These patterns can be applied to schema design or caching strategies to improve performance. We also briefly describe Jena inferencing and a new approach to context in RDF which w...
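The property-table idea can be sketched independently of Jena's actual relational schema (the function below and its arguments are illustrative): values of selected properties for each subject are clustered into one row, while the remaining triples stay in the generic triple table.

```python
def to_property_table(triples, clustered_properties):
    """Re-shape a triple store into a property-table layout: one row per
    subject with a column per clustered property; everything else stays in
    the generic triple table. Assumes single-valued clustered properties."""
    rows, leftover = {}, []
    for s, p, o in triples:
        if p in clustered_properties:
            rows.setdefault(s, {})[p] = o
        else:
            leftover.append((s, p, o))
    return rows, leftover

triples = [("ex:alice", "foaf:name", "Alice"),
           ("ex:alice", "foaf:mbox", "mailto:alice@example.org"),
           ("ex:alice", "ex:hobby", "chess")]
rows, leftover = to_property_table(triples, {"foaf:name", "foaf:mbox"})
print(rows)      # {'ex:alice': {'foaf:name': 'Alice', 'foaf:mbox': '...'}}
print(leftover)  # [('ex:alice', 'ex:hobby', 'chess')]
```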

Semantics based transaction management techniques for replicated data

Proceedings of the 1988 ACM SIGMOD international conference on Management of data - SIGMOD '88, 1988

Data is often replicated in distributed database applications to improve availability and response time. Conventional multi-copy algorithms deliver fast response times and high availability for read-only transactions while sacrificing these goals for updates. In this paper, we propose a multi-copy algorithm that works well in both retrieval and update environments by exploiting special application semantics. By subdividing transactions into various categories and utilizing a commutativity property, we demonstrate cheaper techniques and show that they guarantee correctness. A performance comparison between our techniques and conventional ones quantifies the extent of the savings.

Semantic Data Management

This report documents the program and the outcomes of Dagstuhl Seminar 12171 "Semantic Data Management". The purpose of the seminar was to foster a fruitful exchange of ideas between the semantic web, database systems and information retrieval communities, organised across four main themes: scalability, provenance, dynamicity and search.

Conflict-free replicated data types

2011

Replicating data under Eventual Consistency (EC) allows any replica to accept updates without remote synchronisation. This ensures performance and scalability in large-scale distributed systems (e.g., clouds). However, published EC approaches are ad hoc and error-prone. Under a formal Strong Eventual Consistency (SEC) model, we study sufficient conditions for convergence. A data type that satisfies these conditions is called a Conflict-free Replicated Data Type (CRDT).
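A standard example of a data type meeting such convergence conditions is the state-based grow-only counter, whose merge is an element-wise maximum and hence a join in a semilattice. The sketch below is a generic illustration of that idea, not code from the paper.

```python
class GCounter:
    """State-based grow-only counter: the state maps replica id -> count,
    merge takes the element-wise maximum (a semilattice join), so any two
    replicas that exchange states converge regardless of message order."""

    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}

    def increment(self):
        # Only this replica's own entry ever grows locally.
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + 1

    def merge(self, other):
        # Element-wise max: commutative, associative, idempotent.
        for rid, c in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), c)

    def value(self):
        return sum(self.counts.values())

# Two replicas increment independently and converge after exchanging state.
a, b = GCounter("A"), GCounter("B")
a.increment(); a.increment(); b.increment()
a.merge(b); b.merge(a)
assert a.value() == b.value() == 3
```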