The VLDB Journal DOI 10.1007/s00778-008-0120-3 REGULAR PAPER The RUM-tree: supporting frequent updates in R-trees (original) (raw)

The RUM-tree: supporting frequent updates in R-trees using memos

The VLDB Journal, 2009

The problem of frequently updating multidimensional indexes arises in many location-dependent applications. While the R-tree and its variants are the dominant choices for indexing multi-dimensional objects, the R-tree exhibits inferior performance in the presence of frequent updates. In this paper, we present an R-tree variant, termed the RUM-tree (which stands for R-tree with update memo) that reduces the cost of object updates. The RUM-tree processes updates in a memo-based approach that avoids disk accesses for purging old entries during an update process. Therefore, the cost of an update operation in the RUM-tree is reduced to the cost of only an insert operation. The removal of old object entries is carried out by a garbage cleaner inside the RUM-tree. In this paper, we present the details of the RUM-tree and study its properties. We also address the issues of crash recovery and concurrency control for the RUM-tree. Theoretical analysis and comprehensive experimental evaluation demonstrate that the RUM-tree outperforms other R-tree variants by up to one order of magnitude in scenarios with frequent updates.

The LSM RUM-Tree: A Log Structured Merge R-Tree for Update-intensive Spatial Workloads

2021 IEEE 37th International Conference on Data Engineering (ICDE), 2021

Many applications require update-intensive workloads on spatial objects, e.g., social-network services and sharedriding services that track moving objects (devices). By buffering insert and delete operations in memory, the Log Structured Merge Tree (LSM) has been used widely in various systems because of its ability to handle insert-intensive workloads. While the focus on LSM has been on key-value stores and their optimizations, there is a need to study how to efficiently support LSM-based secondary indexes. We investigate the augmentation of a main-memory-based memo structure into an LSM secondary index structure to handle update-intensive workloads efficiently. We conduct this study in the context of an R-tree-based secondary index. In particular, we introduce the LSM RUM-tree that demonstrates the use of an Update Memo in an LSM-based Rtree to enhance the performance of the R-tree's insert, delete, update, and search operations. The LSM RUM-tree introduces novel strategies to reduce the size of the Update Memo to be a lightweight in-memory structure that is suitable for handling update-intensive workloads without introducing significant overhead. Experimental results using real spatial data demonstrate that the LSM RUM-tree achieves up to 9.6x speedup on update operations and up to 2400x speedup on query processing over the existing LSM R-tree implementations.

R-trees with Update Memos

22nd International Conference on Data Engineering (ICDE'06), 2006

The problern of frequerztly clpdating ~nulti-dimensiorzal uzdexes orises i~z man?. location-dependent applications. While the R-tree and its varianrs are one of the donzinant choices for indexing i7izrlr1-dir17erz~io1zal objects, the R-tree exhibits i1zferiorpe1for17ioizce iiz the presence offrequent updotes. 111 this paper; we preseil? all R-tl-ee variant, ternzed the RUM-tree (stand5 for R-nee with Updote Me~iio) that 177i1zi1nizes the cost of ol7jecr crpdates. The RUM-tree processes updates in a memo-based approoch that avoids disk accesses for purging old entries dcrriizg an update process. Therefore, the cost of an ~lpdate operation in the RUM-tree reduces to the cost of 01zl11 an insert operation. The renzoval of old object entries is carrred out b j a garbage cleaner iizside the RUM-tree. In this popei; rve present the details of the RUM-tree and study its pi-operties. Theoretical a~zalyszs and experiinental evaluatio~z denlo~zstrnte that the RUMtree outpe~$ornzs other R-tree variants by up to a factor of eight in sce~zarios with frequent c~pdates.

Main-memory operation buffering for efficient R-tree update

2007

Emerging communication and sensor technologies enable new applications of database technology that require database systems to efficiently support very high rates of spatial-index updates. Previous works in this area require the availability of large amounts of main memory, do not exploit all the main memory that is indeed available, or do not support some of the standard index operations.

High-concurrency locking in R-trees

… CONFERENCE ON VERY LARGE DATA BASES, 1995

In this paper we present a solution to the problem of concurrent operations in the R-tree, a dynamic access structure capable of storing multidimensional and spatial data. We describe the R-link tree, a variant of the R-tree that adds sibling pointers to nodes, a technique first deployed in Blink trees, to compensatefor concurrent structure modifications. The main obstacle to the use of sibling pointers is the lack of linear ordering among the keys in an R-tree; we overcome this by assigning sequence numbers to nodes that let us reconstruct the "lineage" of a node at any point in time. The search, insert and delete algorithms for R-link trees are designed to completely avoid holding locks during I/O operations and to allow concurrent modifications of the tree structure. In addition, we further describe how to achieve degree 3 consistency with an inexpensive predicate locking mechanism and demonstrate how to make R-link trees recoverable in a write-ahead logging environment. Experiments verify the performance advantage of R-link trees over simpler locking protocols.

The PH-Tree - A Space-Efficient Storage Structure and Multi-Dimensional Index

We propose the PATRICIA-hypercube-tree, or PH-tree, a multi-dimensional data storage and indexing structure. It is based on binary PATRICIA-tries combined with hypercubes for efficient data access. Space efficiency is achieved by combining prefix sharing with a space optimised implementation. This leads to storage space requirements that are comparable or below storage of the same data in non-index structures such as arrays of objects. The storage structure also serves as a multi-dimensional index on all dimensions of the stored data. This enables efficient access to stored data via point and range queries. We explain the concept of the PH-tree and demonstrate the performance of a sample implementation on various datasets and compare it to other spatial indices such as the kD-tree. The experiments show that for larger datasets beyond 10^7 entries, the PH-tree increasingly and consistently outperforms other structures in terms of space efficiency, query performance and update performance. For some highly skewed datasets, it even shows super-constant performance, becoming faster for larger datasets.

GLIP: A Concurrency Control Protocol for Clipping Indexing

2011

The project Concurrency Control Protocol for Clipping Indexing deals with the multidimensional databases. In multidimensional indexing trees, the overlapping of nodes will tend to degrade query performance, as one single point query may need to traverse multiple branches of the tree if the query point is in an overlapped area. Multidimensional databases are beginning to be used in a wide range of applications. To meet this fast-growing demand, the R-tree family is being applied to support fast access to multidimensional data, for which the R+-tree exhibits outstanding search performance. In order to support efficient concurrent access in multiuser environments, concurrency control mechanisms for multidimensional indexing have been proposed. However, these mechanisms cannot be directly applied to the R+-tree because an object in the R+-tree may be indexed in multiple leaves. This paper proposes a concurrency control protocol for R-tree variants with object clipping, namely, Granular Locking for clipping indexing (GLIP). GLIP is the first concurrency control approach specifically designed for the R+-tree and its variants, and it supports efficient concurrent operations with serializable isolation, consistency, and deadlock-free. Experimental tests on both real and synthetic data sets validated the effectiveness and efficiency of the proposed concurrent access framework.

Multiversion concurrency control for the generalized search tree

Concurrency and …, 2009

Many read-intensive systems where fast access to data is more important than the rate at which data can change make use of multidimensional index structures, like the generalized search tree (GiST). Although in these systems the indexed data are rarely updated and read access is highly concurrent, the existing concurrency control mechanisms for multidimensional index structures are based on locking techniques, which cause significant overhead. In this article we present the multiversion-GiST (MVGiST), an inmemory mechanism that extends the GiST with multiversion concurrency control. The MVGiST enables lock-free read access and ensures a consistent view of the index structure throughout a reader's series of queries, by creating lightweight, read-only versions of the GiST that share unchanging nodes among themselves. An example of a system with high read to write ratio, where providing wait-free queries is of utmost importance, is a large-scale directory that indexes web services according to their input and output parameters. A performance evaluation shows that for low update rates, the MVGiST significantly improves scalability w.r.t. the number of concurrent read accesses when compared with a traditional, locking-based concurrency control mechanism. We propose a technique to control memory consumption and confirm through our evaluation that the MVGiST efficiently manages memory.

The log-structured merge-tree (LSM-tree)

Acta Informatica, 1996

High-performance transaction system applications typically insert rows in a History table to provide an activity trace; at the same time the transaction system generates log records for purposes of system recovery. Both types of generated information can benefit from efficient indexing. An example in a well-known setting is the TPC-A benchmark application, modified to support efficient queries on the History for account activity for specific accounts. This requires an index by account-id on the fast-growing History table. Unfortunately, standard disk-based index structures such as the B-tree will effectively double the I/O cost of the transaction to maintain an index such as this in real time, increasing the total system cost up to fifty percent. Clearly a method for maintaining a real-time index at low cost is desirable. The Log-Structured Merge-tree (LSM-tree) is a disk-based data structure designed to provide low-cost indexing for a file experiencing a high rate of record inserts (and deletes) over an extended period. The LSM-tree uses an algorithm that defers and batches index changes, cascading the changes from a memory-based component through one or more disk components in an efficient manner reminiscent of merge sort. During this process all index values are continuously accessible to retrievals (aside from very short locking periods), either through the memory component or one of the disk components. The algorithm has greatly reduced disk arm movements compared to a traditional access methods such as B-trees, and will improve costperformance in domains where disk arm costs for inserts with traditional access methods overwhelm storage media costs. The LSM-tree approach also generalizes to operations other than insert and delete. However, indexed finds requiring immediate response will lose I/O efficiency in some cases, so the LSM-tree is most useful in applications where index inserts are more common than finds that retrieve the entries. This seems to be a common property for History tables and log files, for example. The conclusions of Section 6 compare the hybrid use of memory and disk components in the LSM-tree access method with the commonly understood advantage of the hybrid method to buffer disk pages in memory.

Master-client R-trees: a new parallel R-tree architecture

Proceedings. Eleventh International Conference on Scientific and Statistical Database Management, 1999

Scienti c databases must be able to e ciently run subset retrievals of multi-dimensional data sets. If the data sets are very large signi cant retrieval speedups can be obtained via parallelism. In this paper we present a new parallel distributed shared nothing Rtree architecture. To the best of our knowledge this is the rst signi cant experimental study demonstrating practical application of parallel Rtrees in a shared nothing environment. We provide experimental results demonstrating actual speedups for several synthetic and real data sets. In addition, we conduct experimental studies to investigate the e ect of several declustering strategies and communication parameters.