Exploiting schemas in data synchronization

Schema-Directed Data Synchronization

Increased reliance on optimistic data replication has led to burgeoning interest in tools and frameworks for synchronizing disconnected updates to replicated data. We have implemented a generic synchronization framework, called Harmony, that can be instantiated to yield state-based synchronizers for a wide variety of tree-structured data formats. A novel feature of this framework is that the synchronization process—in particular, the recognition of situations where changes are in conflict—is driven by the schema of the structures being synchronized. We formalize Harmony's synchronization algorithm, prove that it obeys a simple and intuitive specification, and illustrate how it can be used to synchronize a variety of specific forms of application data: sets, records, tuples, and relations.
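
To make the shape of such a state-based synchronizer concrete, the following is a minimal sketch of three-way tree reconciliation under some stated assumptions: trees are modeled as nested Python dicts, `archive` stands for the last synchronized state, and the `CONFLICT` sentinel and function names are illustrative rather than Harmony's actual API. The schema-directed part of conflict recognition is reduced here to a simple check on atomic values.

```python
CONFLICT = object()  # sentinel marking an unresolved conflict in the merged tree

def sync(archive, a, b):
    """Recursively reconcile replicas a and b against their last
    common version; return (merged tree, list of conflicts)."""
    if a == b:                    # replicas agree: nothing to do
        return a, []
    if a == archive:              # only b changed: take b's change
        return b, []
    if b == archive:              # only a changed: take a's change
        return a, []
    if isinstance(a, dict) and isinstance(b, dict):
        arch = archive if isinstance(archive, dict) else {}
        merged, conflicts = {}, []
        for k in set(a) | set(b):
            m, cs = sync(arch.get(k), a.get(k), b.get(k))
            if m is not None:     # None means the child was deleted
                merged[k] = m
            conflicts += cs
        return merged, conflicts
    # Atomic values edited differently on both sides: a genuine conflict.
    return CONFLICT, [(a, b)]
```

In the full framework, the schema refines the last case: under a record schema, two different edits to the same atomic field are flagged as a conflict, while a schema describing a keyed set lets independent insertions merge cleanly.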

Bringing harmony to optimism: an experiment in synchronizing heterogeneous tree-structured data

2004

Increased reliance on optimistic data replication has led to burgeoning interest in tools and frameworks for synchronizing disconnected updates to replicated data. To better understand the issues underlying the design of generic and heterogeneous synchronizers, we have implemented an experimental framework, called Harmony, that can be used to build synchronizers for tree-structured data stored in a variety of concrete formats. We present Harmony's architecture, formalize its key components (a simple core synchronization algorithm together with a set of user-defined mappings between diverse concrete data formats and common abstract schemas suitable for synchronization), and discuss how the framework can be used to synchronize a variety of specific types of application data by suitable encodings into trees, including sets, records, tuples, relations, and, with some limitations, lists and ordered XML data.
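
As an illustration of the "encoding into trees" step, here is a hedged sketch of how records, relations, and lists might be mapped to edge-labeled trees (again modeled as nested dicts). The edge names, such as `@head` and `@tail`, are assumptions made for this sketch, not Harmony's actual conventions.

```python
def record_to_tree(rec):
    """A record becomes a tree with one edge per field."""
    return {field: {str(value): {}} for field, value in rec.items()}

def relation_to_tree(rows, key):
    """A relation becomes a tree indexed by primary key, so that
    independent edits to different rows never collide."""
    return {str(row[key]): record_to_tree(row) for row in rows}

def list_to_tree(xs):
    """Lists are encoded as cons cells; insertions near the same
    position now look like conflicting edits, which is one reason
    ordered data is only supported with limitations."""
    tree = {}
    for x in reversed(xs):
        tree = {"@head": {str(x): {}}, "@tail": tree}
    return tree

# Example: a two-row relation keyed by "id".
relation_to_tree([{"id": 1, "name": "ann"},
                  {"id": 2, "name": "bob"}], key="id")
```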

GlobData: A Platform for Supporting Multiple Consistency Modes

ISDB, 2002

GlobData is a platform that provides an object-oriented view of wide-area-networked relational databases, using data replication to ensure high availability. We discuss the protocols embedded in GlobData for maintaining the consistency of replicas. These protocols can alternate between three different modes of consistency, and modes can be changed on-line and per session; that is, GlobData supports different, changeable consistency modes in simultaneous sessions.

Agreeing to agree: Conflict resolution for optimistically replicated data

2006

Current techniques for reconciling disconnected changes to optimistically replicated data often use version vectors or related mechanisms to track causal histories. This allows the system to tell whether the value at one replica dominates another or whether the two replicas are in conflict. However, current algorithms do not provide entirely satisfactory ways of repairing conflicts. The usual approach is to introduce fresh events into the causal history, even in situations where the causally independent values at the two replicas are actually equal. In some scenarios these events may later conflict with each other or with further updates, slowing or even preventing convergence of the whole system. To address this issue, we enrich the set of possible actions at a replica to include a notion of explicit conflict resolution between existing events, where the user at a replica declares that one set of events dominates another, or that a set of events are equivalent. We precisely specify the behavior of this refined replication framework from a user's point of view and show that, if communication is assumed to be "reciprocal" (with pairs of replicas exchanging information about their current states), then this specification can be implemented by an algorithm with the property that the information stored at any replica and the sizes of the messages sent between replicas are bounded by a polynomial function of the number of replicas in the system.
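
For context, the following is a small sketch of the version-vector comparison that such systems use to decide dominance; the function name and return labels are illustrative.

```python
def compare(v1, v2):
    """Compare two version vectors (dicts mapping replica id to counter).
    Return 'equal', 'dominates', 'dominated', or 'concurrent'."""
    keys = set(v1) | set(v2)
    ge = all(v1.get(k, 0) >= v2.get(k, 0) for k in keys)
    le = all(v1.get(k, 0) <= v2.get(k, 0) for k in keys)
    if ge and le:
        return "equal"
    if ge:
        return "dominates"
    if le:
        return "dominated"
    return "concurrent"   # causally independent: a potential conflict

print(compare({"r1": 2, "r2": 0}, {"r1": 1, "r2": 3}))  # concurrent
```

The paper's refinement targets exactly the "concurrent" case: when the causally independent values are in fact equal, the user can declare the events equivalent instead of minting a fresh resolution event, so later updates cannot conflict with a spurious merge.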

Using the transformational approach to build a safe and generic data synchronizer

2003

Reconciling divergent data is an important issue in concurrent engineering, mobile computing, and software configuration management. Many synchronizers and merge tools perform such reconciliation today, but they do not define what correctness means for their synchronization. In this paper, we propose to use a transformational approach as the basic model for reasoning about synchronization. We propose an algorithm and specific transformation functions that realize file-system synchronization.
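
As a hedged sketch of the transformational approach applied to file systems, the snippet below transforms a remote operation against a concurrent local one before replaying it, so that both sites converge. The operation vocabulary and the resolution policies (renaming on duplicate creates, local recreation winning over a remote delete) are assumptions chosen for illustration, not the paper's actual functions.

```python
def transform(remote, local):
    """Transform `remote` so it can be applied after `local`.
    Operations are (kind, path) pairs."""
    r_kind, r_path = remote
    l_kind, l_path = local
    if r_path != l_path:
        return remote                        # independent paths commute
    if r_kind == "create" and l_kind == "create":
        # Both sites created the same name: keep both copies
        # deterministically by renaming the remote one.
        return ("create", r_path + ".conflict")
    if r_kind == "delete" and l_kind == "delete":
        return ("noop", r_path)              # already deleted locally
    if r_kind == "delete" and l_kind == "create":
        return ("noop", r_path)              # local recreation wins here
    return remote

# Each site applies its own operations, then the transformed remote
# ones; both file systems end in the same state.
```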

Abstract unordered and ordered trees CRDT

Trees are a fundamental data structure in many areas of computer science and systems engineering. In this report, we show how to ensure eventual consistency of optimistically replicated trees. In optimistic replication, the replicas of a distributed system are allowed to diverge but should eventually reach the same value if no more mutations occur. A recent method for ensuring eventual consistency is to design Conflict-free Replicated Data Types (CRDTs). In this report, we design a collection of tree CRDTs using existing set CRDTs. The remaining concurrency problems particular to the tree data structure are resolved using one or two layers of correction algorithms. For each of these layers, we propose different, independent policies. Any combination of set CRDT and policies can be constructed, giving the distributed application programmer full control over the behavior of the shared data in the face of concurrent mutations. We also propose to order these trees by adding a positionin...
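
The recipe described here can be sketched in a few lines: replicate the edge set with a set CRDT, then run a correction layer with a pluggable policy for nodes orphaned by concurrent mutations. The policy names and the choice of a grow-only set below are assumptions made for illustration.

```python
ROOT = "root"

def merge(edges_a, edges_b):
    """Layer 1: the underlying set CRDT merge. A grow-only set is the
    simplest choice; an OR-set would also support removals."""
    return edges_a | edges_b

def repair(edges, policy="reattach"):
    """Layer 2: correction of orphaned nodes after the set merge,
    e.g. a node added under a concurrently missing parent."""
    nodes = {child for child, _ in edges} | {ROOT}
    fixed = set()
    for child, parent in edges:
        if parent in nodes:
            fixed.add((child, parent))
        elif policy == "reattach":
            fixed.add((child, ROOT))   # orphan reappears under the root
        # policy == "skip": the orphaned node is silently dropped
    return fixed

a = {("docs", ROOT), ("notes.txt", "docs")}
b = {("img.png", "photos")}            # "photos" was never replicated here
print(repair(merge(a, b)))             # img.png reattached under the root
```

Swapping the merge function or the repair policy independently is what gives the programmer the combinatorial control over concurrent behavior that the report describes.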

Performance analysis of a tree-based consistency approach for cloud databases

2012 International Conference on Computing, Networking and Communications (ICNC), 2012

Cloud storage service is currently becoming a very popular solution for medium-sized and startup companies. However, there is still no suitable solution for deploying transactional databases on a cloud platform. Maintaining the ACID properties (Atomicity, Consistency, Isolation, and Durability) is the primary obstacle to the implementation of transactional cloud databases. The main features of cloud computing, namely scalability, availability, and reliability, are achieved by sacrificing consistency. While different forms of consistent states have been introduced, they do not address the needs of many database applications. In this paper we present a tree-based consistency approach, called TBC, that reduces interdependency among replica servers to minimize the response time of cloud databases and to maximize the performance of those applications. Experimental results indicate that our TBC approach trades off availability and consistency against performance.

Synchronizing semantic stores with commutative replicated data types

Proceedings of the 21st International Conference Companion on World Wide Web (WWW '12 Companion), 2012

Social semantic web technologies have made huge amounts of data and information available. Producing knowledge from this information is challenging, and major efforts, like DBpedia, have been made to make it a reality. Linked Data interconnects this information, extending the scope of knowledge production.

Writes that fall in the forest and make no sound: Semantics-based adaptive data consistency

Datastores today rely on distribution and replication to achieve improved performance and fault-tolerance. But the correctness of many applications depends on strong consistency properties, which can impose substantial overheads, since they require coordinating the behavior of multiple nodes. This paper describes a new approach to achieving strong consistency in distributed systems while minimizing communication between nodes. The key insight is to allow the state of the system to be inconsistent during execution, as long as this inconsistency is bounded and does not affect transaction correctness. In contrast to previous work, our approach uses program analysis to extract semantic information about permissible levels of inconsistency and is fully automated. We then employ a novel homeostasis protocol to allow sites to operate independently, without communicating, as long as any inconsistency is governed by appropriate treaties between the nodes. We discuss mechanisms for optimizing treaties based on workload characteristics to minimize communication, as well as a prototype implementation and experiments that demonstrate the benefits of our approach on common transactional benchmarks.

Rather than accepting the usual trade-off between responsiveness and consistency, we demonstrate that by carefully analyzing applications it is possible to achieve the best of both worlds: strong consistency and low latency in the common case. The key idea is to exploit the semantics of the transactions involved in the execution of an application in a way that is safe and completely transparent to programmers. It is well known that strong consistency is not always required to execute transactions correctly [20, 41], and this insight has been exploited in protocols that allow transactions to operate on slightly stale replicas as long as the staleness is "not enough to affect correctness" [5, 41]. This paper takes this basic idea much further and develops mechanisms for automatically extracting safety predicates from application source code. Our homeostasis protocol uses these predicates to allow sites to operate without communicating, as long as any inconsistency is appropriately governed. Unlike prior work, our approach is fully automated and does not require programmers to provide any information about the semantics of transactions.

Example: top-k query. To illustrate the key ideas behind our approach in further detail, consider a top-k query over a distributed datastore; for simplicity, take k = 2. The system consists of a number of item sites that each maintain a collection of (key, value) pairs, representing data such as airline reservations or customer purchases. An aggregator site maintains a list of the top-k items, sorted in descending order by value. Each item site periodically receives new insertions, and the aggregator site updates the top-k list as needed. A simple algorithm is to have each item site communicate every new insertion to the aggregator site, which inserts it into the current top-k list in order and removes the smallest element. However, every insertion then requires a communication round with the aggregator site, even though most inserts are for objects not in the top-k. A better idea is to communicate with the aggregator only when the new value is greater than the minimal value of the current top-k list: each item site maintains a cached copy of the smallest top-k value in a variable min and notifies the aggregator only when an item with a larger value is inserted into its local state. In expectation, most item inserts do not affect the aggregator's behavior, and consequently it is safe for them to remain unobserved by the aggregator site. This improved top-k algorithm is essentially a simplified distributed version of the well-known threshold algorithm for top-k computation [14].
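
A minimal sketch of this improved algorithm, under the assumption of a single aggregator and with illustrative class names, might look as follows. Note that the cached threshold can only lag below the aggregator's true minimum, so a site may occasionally send an unnecessary message but never wrongly stays silent.

```python
import heapq

class Aggregator:
    def __init__(self, k):
        self.k = k
        self.topk = []                       # min-heap of (value, key)

    def offer(self, key, value):
        """Called only when an item site sees value > its cached min;
        returns the new top-k threshold for the site to cache."""
        if len(self.topk) < self.k:
            heapq.heappush(self.topk, (value, key))
        elif value > self.topk[0][0]:
            heapq.heapreplace(self.topk, (value, key))
        # Until the list fills, everything must still be reported.
        return self.topk[0][0] if len(self.topk) == self.k else float("-inf")

class ItemSite:
    def __init__(self, aggregator):
        self.agg = aggregator
        self.min = float("-inf")             # cached top-k threshold
        self.items = {}

    def insert(self, key, value):
        self.items[key] = value
        if value > self.min:                 # below threshold: no message
            self.min = self.agg.offer(key, value)
```

The cached min is exactly the kind of "treaty" the homeostasis protocol governs: each site may act locally as long as its inserts stay below the agreed threshold, and communication is needed only when the treaty would be violated.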