Range Query Research Papers - Academia.edu
2025, ACM Transactions on Information Systems
A scheme to answer best-match queries from a file containing a collection of objects is described. A best-match query is to find the objects in the file that are closest (according to some (dis)similarity measure) to a given target. Previous work [5, 331] suggests that one can reduce the number of comparisons required to achieve the desired results using the triangle inequality, starting with a data structure for the file that reflects some precomputed intrafile distances. We generalize the technique to allow the optimum use of any given set of precomputed intrafile distances. Some empirical results are presented which illustrate the effectiveness of our scheme, and its performance relative to previous algorithms.
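The triangle-inequality pruning the abstract builds on can be sketched with a single precomputed pivot, a minimal special case of the general scheme (the 1-D metric and all names here are illustrative, not the paper's algorithm):

```python
def best_match(objects, query, dist, pivot, pivot_dists):
    """Find the object closest to `query`, skipping distance computations
    whose triangle-inequality lower bound already exceeds the best found.
    pivot_dists[i] = dist(pivot, objects[i]), precomputed once per file."""
    dq = dist(query, pivot)            # one distance to the pivot
    best, best_d = None, float("inf")
    for obj, dp in zip(objects, pivot_dists):
        # |dist(q, pivot) - dist(pivot, obj)| <= dist(q, obj)
        if abs(dq - dp) >= best_d:
            continue                   # pruned without computing dist(q, obj)
        d = dist(query, obj)
        if d < best_d:
            best, best_d = obj, d
    return best, best_d
```

With more pivots the lower bound tightens, which is exactly the trade-off the paper optimizes for a given set of precomputed intrafile distances.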
2025
Exponential growth in the number of possible strategies with the increase in the number of relations in a query has been identified as a major problem in the field of query optimization of relational databases. Present database systems use exhaustive search to find the best possible strategy. But as the size of a query grows, the exhaustive search method itself becomes quite expensive. Other algorithms, like the A* algorithm, Simulated Annealing, etc., have been suggested as a solution. However, all these algorithms fail to produce the best results, which are necessarily required for query execution. We modified the A* algorithm to produce a randomized form of the algorithm and compared it with the original A* algorithm and exhaustive search. The comparison results show the improved A* algorithm to be almost equivalent in output quality, along with a colossal decrease in search space, in comparison to the exhaustive search method.
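A best-first (A*-with-zero-heuristic, i.e. uniform-cost) search over join orders can be sketched as follows; the `cost_of` interface and the toy cost model are assumptions for illustration, not the paper's cost function:

```python
import heapq
from itertools import count

def best_join_order(rels, cost_of):
    """Best-first search over join orders. cost_of(partial_order) must be
    monotone non-decreasing as the order is extended; the first complete
    order popped from the heap is then optimal."""
    tie = count()  # tiebreaker so the heap never compares frozensets
    heap = [(0, next(tie), (), frozenset(rels))]
    while heap:
        cost, _, order, remaining = heapq.heappop(heap)
        if not remaining:
            return order, cost
        for r in remaining:
            new_order = order + (r,)
            heapq.heappush(
                heap,
                (cost_of(new_order), next(tie), new_order, remaining - {r}))
```

The exhaustive method the abstract criticizes corresponds to enumerating all permutations; the heap lets cheap prefixes be explored first, which is the search-space reduction being measured.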
2025, Canadian Conference on Computational Geometry
We present data structures for triangular emptiness and reporting queries for a planar point set, where the query triangle contains the origin. The data structures use near-linear space and achieve polylogarithmic query times.
2025, arXiv (Cornell University)
The skyline of a set P of points (SKY(P)) consists of the "best" points with respect to minimization or maximization of the attribute values. A point p dominates another point q if p is as good as q in all dimensions and strictly better than q in at least one dimension. In this work, we focus on the static 2-d space and provide expected performance guarantees for 3-sided Range Skyline Queries on the Grid, where N is the cardinality of P, B the size of a disk block, and R the capacity of main memory. We present the MLR-tree (Modified Layered Range-tree), which offers optimal expected cost for finding planar skyline points in a 3-sided query rectangle by scanning only the points contained in SKY(P). In particular, it supports skyline queries in a 3-sided range in O(t · t_PAM(N)) time (O((t/B) · t_PAM(N)) I/Os), where t is the answer size and t_PAM(N) the time required for answering predecessor queries in a PAM (Predecessor Access Method) structure, which is a special component of the MLR-tree and efficiently stores root-to-leaf paths or sub-paths. By choosing PAM structures with O(1) expected time for predecessor queries under discrete µ-random distributions of the x and y coordinates, the MLR-tree supports skyline queries in optimal O(t) expected time (O(t/B) expected number of I/Os) with high probability. The space complexity becomes superlinear and can be reduced to linear for many special practical cases. If we choose a PAM structure with O(1) amortized time for batched predecessor queries (under no assumption on the distributions of the x and y coordinates), the MLR-tree supports batched skyline queries in optimal O(t) amortized time; however, the space becomes exponential. In the dynamic case, the update time complexity is affected by an O(log² N) factor.
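The dominance rule in the first two sentences admits, for 2-d minimization, a simple one-pass computation after sorting; a plain in-memory sketch of the semantics, unrelated to the MLR-tree machinery:

```python
def skyline_2d(points):
    """Skyline of 2-d points under minimization: p dominates q when
    p.x <= q.x and p.y <= q.y, with at least one inequality strict.
    After sorting by x (ties broken by y), a point is on the skyline
    iff its y is strictly below every y seen so far."""
    sky, best_y = [], float("inf")
    for x, y in sorted(points):
        if y < best_y:          # not dominated by any earlier point
            sky.append((x, y))
            best_y = y
    return sky
```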
2025, Proceedings of the 13th International Conference on Database Theory
This work studies the problem of 2-dimensional searching for the 3-sided range query of the form [a, b] × (-∞, c] in both main and external memory, by considering a variety of input distributions. A dynamic linear main memory solution is proposed, which answers 3-sided queries in O(log n + t) worst case time and scales with O(log log n) expected with high probability update time, under continuous µ-random distributions of the x and y coordinates, where n is the current number of stored points and t is the size of the query output. Our expected update bound constitutes a considerable improvement over the O(log n) update time bound achieved by the classic Priority Search Tree of McCreight [23], as well as over the Fusion Priority Search Tree of Willard, which requires O(log n / log log n) time for all operations. Moreover, we externalize this solution, gaining O(log_B n + t/B) worst case and O(log_B log n) amortized expected with high probability I/Os for query and update operations respectively, where B is the disk block size. Then, combining the Modified Priority Search Tree with the Priority Search Tree [23], we achieve a query time of O(log log n + t) expected with high probability and an update time of O(log log n) expected with high probability, under the assumption that the x-coordinates are continuously drawn from a smooth distribution and the y-coordinates are continuously drawn from a more restricted class of distributions. The total space is linear. Finally, we externalize this solution, obtaining a dynamic data structure.
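For reference semantics, the 3-sided query [a, b] × (−∞, c] is classically answered in O(log n + t) with a Priority Search Tree; a compact, unbalanced sketch (not the paper's modified structure):

```python
class PSTNode:
    __slots__ = ("point", "split", "left", "right")

def build_pst(points):
    """Priority search tree: each node holds the minimum-y point of its
    subtree; the remaining points are split around a median x value."""
    if not points:
        return None
    node = PSTNode()
    i = min(range(len(points)), key=lambda j: points[j][1])
    node.point = points[i]
    rest = sorted(points[:i] + points[i + 1:])      # by x, then y
    mid = len(rest) // 2
    node.split = rest[mid][0] if rest else node.point[0]
    node.left = build_pst(rest[:mid])               # x <= split
    node.right = build_pst(rest[mid:])              # x >= split
    return node

def query_pst(node, a, b, c, out):
    """Collect all points in [a, b] x (-inf, c]."""
    if node is None or node.point[1] > c:
        return out          # min-y already above c: whole subtree pruned
    x, y = node.point
    if a <= x <= b:
        out.append((x, y))
    if a <= node.split:
        query_pst(node.left, a, b, c, out)
    if b >= node.split:
        query_pst(node.right, a, b, c, out)
    return out
```

The y-pruning at each node is what bounds the work by the output size t, the property both the main-memory and external variants above preserve.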
2025, Lecture Notes in Computer Science
We present NEFOS (NEsted FOrest of balanced treeS), a new cache-aware indexing scheme that supports insertions and deletions in O(1) worst-case block transfers for rebalancing operations (given an update position) and searching in O(log_B log n) expected block transfers (B = disk block size, n = number of stored elements). The expected search bound holds with high probability for any (unknown) realistic input distribution. Our expected search bound constitutes an improvement over the O(log_B log n) expected bound for search achieved by the ISB-tree (Interpolation Search B-tree), since the latter holds with high probability only for the class of smooth input distributions. We call an unknown distribution realistic if smoothness need not hold over the whole data set but may still appear locally in small spatial neighborhoods. This holds for a variety of real-life non-smooth distributions like skew, Zipfian, power-law, beta, etc. The latter is also verified by an accompanying experimental study. Moreover, NEFOS is a B-parametrized concrete structure, which works for both the I/O and RAM models, without any kind of transformation or adaptation. Also, it is the first time an expected sub-logarithmic bound for the search operation has been achieved for a broad family of non-smooth input distributions.
2025, International Journal of Organizational and Collective Intelligence
In this paper, the authors present a time-efficient approach to index objects moving on the plane in order to answer range queries about their future positions. Each object moves with non-small velocity, meaning that the velocity distribution is skewed (Zipfian) towards a positive lower threshold. This algorithm enhances a previously described solution (Sioutas, Tsakalidis, Tsichlas, Makris, & Manolopoulos, 2007) by accommodating the ISB-tree access method as presented in Kaporis et al. (2005). Experimental evaluation shows the improved performance, scalability, and efficiency of the new algorithm.
2025, Data & Knowledge Engineering
We present a set of time-efficient approaches to index objects moving on the plane to efficiently answer range queries about their future positions. Our algorithms are based on previously described solutions as well as on the employment of efficient access methods. Finally, an experimental evaluation is included that shows the performance, scalability and efficiency of our methods.
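Indexes of this kind answer queries of the form "which objects will lie in a given rectangle at future time t" under a linear motion model. The query semantics (not the index itself) in a few lines, with illustrative names:

```python
def moving_range_query(objects, t, x1, x2, y1, y2):
    """Objects predicted to lie in [x1, x2] x [y1, y2] at future time t,
    under the linear model position(t) = position(0) + velocity * t.
    Each object is (id, (x0, y0), (vx, vy))."""
    hits = []
    for oid, (x0, y0), (vx, vy) in objects:
        x, y = x0 + vx * t, y0 + vy * t
        if x1 <= x <= x2 and y1 <= y <= y2:
            hits.append(oid)
    return hits
```

The indexing problem is to avoid this linear scan by organizing the (position, velocity) pairs so that only candidate objects are examined.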
2025, arXiv (Cornell University)
The problem of recovering (count and sum) range queries over multidimensional data only on the basis of aggregate information on such data is addressed. This problem can be formalized as follows. Suppose that a transformation τ producing a summary from a multidimensional data set is used. Now, given a data set D, a summary S = τ(D) and a range query r on D, the problem consists of studying r by modelling it as a random variable defined over the sample space of all the data sets D′ such that τ(D′) = S. The study of such a random variable, done by the definition of its probability distribution and the computation of its mean value and variance, represents a well-founded, theoretical probabilistic approach for estimating the query only on the basis of the available information (that is, the summary S) without assumptions on the original data.
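A 1-d special case gives the flavor: if the summary is a histogram and we model each bucket's points as independently uniform within the bucket, the count inside a query range is Binomial, with closed-form mean and variance. This is only an illustrative instance of the general framework, not the paper's transformation τ:

```python
def estimate_count(buckets, q_lo, q_hi):
    """Estimate a count range query from a 1-d histogram summary.
    buckets: list of (lo, hi, n). Modelling each of a bucket's n points
    as uniform and independent in [lo, hi), the number falling in the
    query range is Binomial(n, f): mean n*f, variance n*f*(1-f)."""
    mean = var = 0.0
    for lo, hi, n in buckets:
        overlap = max(0.0, min(hi, q_hi) - max(lo, q_lo))
        f = overlap / (hi - lo)
        mean += n * f
        var += n * f * (1 - f)
    return mean, var
```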
2025, Lecture Notes in Computer Science
This paper presents the SkipTree, a new balanced, distributed data structure for storing data with multidimensional keys in a peer-to-peer network. The SkipTree supports range queries as well as single point queries, which are routed in O(log n) hops. The SkipTree is fully decentralized with each node being connected to O(log n) other nodes. The memory usage for maintaining the links at each node is O(log n log log n) on average and O(log² n) in the worst case. Load balance is also guaranteed to be within a constant factor.
2025, Computer Communications
This paper presents a new balanced, distributed data structure for storing data with multidimensional keys in a peer-to-peer network. It supports range queries as well as single point queries, which are routed in O(log n) hops. Our structure, called SkipTree, is fully decentralized with each node being connected to O(log n) other nodes. We propose modifications to the structure, so that the memory usage for maintaining the link structure at each node is reduced from the worst case of O(n) to O(log n log log n) on average and O(log² n) in the worst case. It is also shown that the load balancing is guaranteed to be within a constant factor. Our experimental results verify our theoretical proofs. © 2009 Published by Elsevier B.V.
Litwin et al. modified the original hash-based LH* [16] structure to support range queries in RP* [15,16]. Based on previous distributed data structures like LH* [16], RP* [15] and the Distributed Random Tree (DRT) [12], new data structures based on either hashing or key comparison have been proposed, like Chord [23], Viceroy [18], Koorde [10], Tapestry [25], Pastry [22], PeerDB [20], and P-Grid [1]. Most existing peer-to-peer (P2P) overlays require Θ(log n) links per node in order to achieve O(log n) hops for routing. Viceroy [18], Koorde [10], D2b [6], FissionE [17], and MOORE [7], which are based on DHTs, are the remarkable exceptions in that they achieve O(log n) hops with only O(1) links per node at the cost of restricted or no load balancing. Family Tree [24] is the first overlay network which does not use hashing but supports routing in O(log n) hops with only O(1) links per node. Typically, the systems based on DHTs and hashing lack the range-query operation, locality properties and control over the distribution of keys, due to hashing. In contrast, those based on key comparison, although requiring more complicated load-balancing techniques, do better in these respects. P-Grid [1] by Aberer et al. is one of the systems based on key comparison; it uses a distributed binary tree to partition a single-dimensional space, with network nodes representing the leaves of the tree and each node having a link to some node in every sibling subtree along the path from the root to that node. Gridella [2], a P2P system based on P-Grid working on Gnutella, has also been developed. Other systems like P-Tree [5] have been proposed that provide range queries in single-dimensional space. Besides, some data structures like dB-Trees [9] based on B-Trees have been developed for distributed environments. SkipNet [8], on which our new system relies heavily, is another system for single-dimensional spaces based on an extension to skip lists. We basically extend SkipNet to handle multi-dimensional spaces. G-Grid [21] is a solution proposed for the multidimensional case which is also based on partitioning the space into regions.
2025, 2008 IEEE 24th International Conference on Data Engineering Workshop
The metric space model abstracts many proximity or similarity problems, where the most frequently considered primitives are range and k-nearest neighbor search, leaving out the similarity join, an extremely important primitive. In fact, despite the great attention that this primitive has received in traditional and even multidimensional databases, little has been done for general metric databases. We consider a particular type of similarity join: Given two sets of objects and a distance threshold r, find all the object pairs (one from each set) at distance at most r. For this sake, we devise a new metric index, coined List of Twin Clusters, which indexes both sets jointly (instead of the natural approach of indexing one or both sets independently). Our results show significant speedups over the basic quadratic-time naive alternative. Furthermore, we show that our technique can be easily extended to other similarity join variants, e.g., finding the k-closest pairs.
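The similarity-join definition in the abstract, plus the triangle-inequality filtering that any metric index exploits, can be sketched with a single shared pivot; this is a tiny stand-in for intuition, not the List of Twin Clusters structure itself (all names are illustrative):

```python
def similarity_join(A, B, dist, r, pivot):
    """All pairs (a, b), a in A, b in B, with dist(a, b) <= r.
    One shared pivot gives the lower bound |d(a,p) - d(b,p)| <= d(a,b),
    letting many pairs be discarded without computing their distance."""
    db = [(b, dist(b, pivot)) for b in B]      # precomputed once
    out = []
    for a in A:
        da = dist(a, pivot)
        for b, d_bp in db:
            if abs(da - d_bp) > r:             # triangle-inequality filter
                continue
            if dist(a, b) <= r:
                out.append((a, b))
    return out
```

Without the filter this is exactly the quadratic-time naive alternative the paper's speedups are measured against.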
2025, Lecture Notes in Computer Science
As Internet applications become larger and more complex, the task of managing them becomes overwhelming. "Abnormal" events such as software updates, failures, attacks, and hotspots become frequent. The SELFMAN project is tackling this problem by combining two technologies, namely structured overlay networks and advanced component models, to make the system self-managing. Structured overlay networks (SONs) developed out of peer-to-peer systems and provide robustness, scalability, communication guarantees, and efficiency. Component models provide the framework to extend the self-managing properties of SONs over the whole system. SELFMAN is building a self-managing transactional storage and using it for two application demonstrators: a distributed Wiki and an on-demand media streaming service. This paper provides an introduction and motivation for the ideas underlying SELFMAN and a snapshot of its contributions midway through the project. We explain our methodology for building self-managing systems as networks of interacting feedback loops. We then summarize the work we have done to make SONs a practical basis for our architecture: using an advanced component model, handling network partitions, handling failure suspicions, and doing range queries with load balancing. Finally, we show the design of a self-managing transactional storage on a SON.
2025
Geographical information systems and information retrieval have been very active research fields over the last two decades. Geographical Information Retrieval (GIR) is a new direction combining the two. A better search method provides better results to the end user. In this paper, we present a survey of different aspects of spatial access methods on the basis of query type and region type for spatial search. At the end of the paper, we summarize the relative performance of different access methods with their pros and cons.
2025
Abstract: Scalable and Distributed Data Structures (SDDS) are a class of data structures completely dedicated to distributed environments. They allow the management of large amounts of data while maintaining steady and optimal performance. Several families of ...
2025
Similarity search is a fundamental operation for applications that deal with multimedia data. For a query in a multimedia database, it is meaningless to look for elements exactly equal to the query object. Instead, we need to measure the similarity (or dissimilarity) between the query object and each object of the database. The similarity search problem can be formally defined through the concept of a metric space, which provides a formal framework that is independent of the application domain. In a metric database, objects from a metric space can be stored and similarity queries about them can be answered efficiently. In general, search efficiency is understood as minimizing the number of distance calculations required to answer a query. Therefore, the goal is to preprocess the dataset by building an index, such that queries can be answered with as few distance computations as possible. However, with very large metric databases it is not enough to preprocess the dataset by building an index...
2025
Most search methods in metric spaces assume that the topology of the object collection is reasonably regular. However, there exist nested metric spaces, where objects in the collection can be grouped into clusters or subspaces, in such a way that different dimensions or variables explain the differences between objects inside each subspace. This paper proposes a two-level index to solve search problems in spaces with this topology. The idea is to have a first level with a list of clusters, which are identified and sorted using Sparse Spatial Selection (SSS) and Lists of Clusters techniques, and a second level having an index for each dense cluster, based on pivot selection, using SSS. As future work, we propose adjusting the second-level indexes through dynamic pivot selection, adapting the pivots to the searches performed on the database.
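The SSS pivot-selection rule mentioned here is short enough to state directly: an object is promoted to pivot when it is far (a fraction α of the maximum distance) from every current pivot, so pivots end up spread over the space. A minimal sketch, assuming a known maximum distance; α = 0.35 is a value commonly suggested in the SSS literature:

```python
def sss_pivots(objects, dist, max_dist, alpha=0.35):
    """Sparse Spatial Selection: scan the collection once, keeping an
    object as a new pivot iff it lies at distance >= alpha * max_dist
    from all pivots chosen so far."""
    pivots = []
    for o in objects:
        if all(dist(o, p) >= alpha * max_dist for p in pivots):
            pivots.append(o)
    return pivots
```

The same rule drives both levels of the proposed index: cluster centers at the first level and per-cluster pivots at the second.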
2025
This paper presents a data structure based on Sparse Spatial Selection (SSS) for similarity searching. An algorithm that periodically adjusts the pivots to the observed use of the index is presented; the index is dynamic. In this way, the discriminating power of the pivots improves, achieving the primary objective of an index: reducing the number of distance function evaluations, as shown in the experiments.
2025
Query-by-content by means of similarity search is a fundamental operation for applications that deal with multimedia data. For this kind of query it is meaningless to look for elements exactly equal to the one given as the query. Instead, we need to measure the dissimilarity between the query object and each database object. The metric space model is a paradigm that allows modeling all similarity search problems. Metric databases allow storing objects from a metric space and efficiently performing similarity queries over them, in general by reducing the number of distance evaluations needed. Therefore, the goal is to preprocess a particular dataset in such a way that queries can be answered with as few distance computations as possible. Moreover, for a very large metric database it is not enough to preprocess the dataset by building an index; it is also necessary to speed up the queries via high-performance computing using GPUs. In this work we show an implementation of a pure GPU architecture...
2025
With advancing technology and the growth of the Internet, both the information that can be found on the network and the number of users who access it to look for specific data keep growing. Therefore, it is desirable to have a search system that retrieves information in reasonable time and in an efficient way. In this paper we show two computing paradigms appropriate for the treatment of large amounts of data consisting of objects such as images, text, sound and video, using hybrid computing over MPI+OpenMP and GPGPU. The proposal builds on experience gained in the construction of various indexes and the subsequent search, through them, for multimedia objects.
2025
In a replicated environment, the same data are often distributed to several sites in order to improve data availability, fault tolerance and access speed. However, when a replica is modified by a user, the other replicas become stale. In such a situation a mechanism is needed to keep them consistent, meaning that all the replicated data have the same content. This can be achieved using a synchronization mechanism, which provides the same view of all the replicas by propagating the modifications done on the first replica to the others. This process is usually called update propagation and is the main part of the consistency mechanism. When replicas are read-only, or when no consistency guarantees are required by the users of the replicated system, a synchronization system is not needed. In the present chapter, different replica synchronization techniques are studied. Also, based on the agent-based framework proposed in the previous chapter, an update propagation mechanism is proposed to...
2025, Information Processing Letters
A cost model for the performance of the k-nearest neighbor query in multidimensional data space is presented. Two concepts, the regional average volume and the density function, are introduced to predict the performance for uniform and non-uniform data distributions. The experiment shows that the prediction based on this model is accurate within an acceptable error range in low and mid dimensions.
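One standard ingredient of such cost models, for the uniform case, is the expected radius of the ball enclosing the k nearest neighbors: the ball whose volume is k/n of the (unit) data space. A sketch of just that ingredient, not the paper's full model:

```python
import math

def knn_radius_uniform(n, k, d):
    """Expected k-NN radius for n uniform points in a d-dimensional unit
    volume: solve c_d * r**d = k / n, where c_d = pi^(d/2) / Gamma(d/2 + 1)
    is the volume of the d-dimensional unit ball."""
    c_d = math.pi ** (d / 2) / math.gamma(d / 2 + 1)
    return (k / (n * c_d)) ** (1 / d)
```

The rapid growth of this radius with d is one way to see why such predictions degrade in high dimensions.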
2025, Very Large Data Bases
For a large number of data management problems, it would be very useful to be able to obtain a few samples from a data set, and to use the samples to guess the largest (or smallest) value in the entire data set. Min/max online aggregation, top-k query processing, outlier detection, and distance join are just a few possible applications. This paper details a statistically rigorous, Bayesian approach to attacking this problem. Just as importantly, we demonstrate the utility of our approach by showing how it can be applied to two specific problems that arise in the context of data management.
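To make the task concrete, here is the classical frequentist point estimate of a maximum from a uniform random sample, the UMVU estimator for a Uniform(0, θ) upper endpoint. This is only a stand-in for intuition; the paper develops a full Bayesian treatment instead:

```python
def estimate_max_uniform(sample):
    """Estimate the upper endpoint theta of a Uniform(0, theta) population
    from an i.i.d. sample: theta_hat = max(sample) * (k + 1) / k, which
    corrects the downward bias of the sample maximum."""
    k, m = len(sample), max(sample)
    return m * (k + 1) / k
```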
2025, Springer eBooks
Data integrity can be problematic when integrating and organizing information from many sources. In this paper we describe efficient mechanisms that enable a group of data owners to contribute data sets to an untrusted third-party publisher, who then answers users' queries. Each owner gets a proof from the publisher that his data is properly represented, and each user gets a proof that the answer given to them is correct. This allows owners to be confident that their data is being properly represented and for users to be confident they are getting correct answers. We show that a group of data owners can efficiently certify that an untrusted third party publisher has computed the correct digest of the owners' collected data sets. Users can then verify that the answers they get from the publisher are the same as a fully trusted publisher would provide, or detect if they are not. The results presented support selection and range queries on multi-attribute data sets and are an extension of earlier work on Authentic Publication which assumed that a single trusted owner certified all of the data.
2025, IFIP International Federation for Information Processing
2025, Algorithmica
Query answers from on-line databases can easily be corrupted by hackers or malicious database publishers. Thus it is important to provide mechanisms which allow clients to trust the results from on-line queries. Authentic Publication allows untrusted publishers to securely answer queries from clients on behalf of trusted off-line data owners. Publishers validate answers using hard-to-forge verification objects (VOs), which clients can check efficiently. This approach provides greater scalability, by making it easy to add more publishers, and better security, since on-line publishers don't need to be trusted. To make authentic publication attractive, it is important for the VOs to be small, efficient to compute and efficient to verify. This has led researchers to independently develop several different schemes for efficient VO computation based on specific data structures. Our goal is to develop a unifying framework for these disparate results, leading to a generalized security result. In this paper we characterize a broad class of data structures which we call Search DAGs, and we develop a generalized algorithm for the construction of VOs for Search DAGs. We prove that the VOs thus constructed are secure, and that they are efficient to compute and verify. We demonstrate how this approach easily captures existing work on simple structures such as binary trees, multi-dimensional range trees, tries, and skip lists. Once these are shown to be Search DAGs, the requisite security and efficiency results immediately follow from our general theorems. Going further, we also use Search DAGs to produce and prove the security of authenticated versions of two complex data models for efficient multi-dimensional range searches. This allows efficient VOs to be computed (of size O(log N + T)) for typical 1D and 2D range queries, where the query answer is of size T and the database is of size N. We also show I/O-efficient schemes to construct the VOs.
For a system with disk blocks of size B, we answer 1D and 3-sided range queries and compute the VOs with O(log_B N + T/B) I/O operations using linear size data structures.
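The simplest Search DAG is a Merkle hash tree over a sorted list: the owner signs only the root digest, and a VO for an answer is the list of sibling hashes on its root path. A self-contained sketch using SHA-256 (function names here are illustrative, not the paper's API):

```python
import hashlib

def H(*parts):
    """Hash a sequence of strings/bytes into one hex digest."""
    h = hashlib.sha256()
    for p in parts:
        h.update(p if isinstance(p, bytes) else str(p).encode())
    return h.hexdigest()

def build(leaves):
    """Bottom-up Merkle tree; returns all levels, level[0] = leaf hashes.
    An odd node at a level is paired with itself."""
    level = [H(x) for x in leaves]
    levels = [level]
    while len(level) > 1:
        level = [H(level[i], level[i + 1] if i + 1 < len(level) else level[i])
                 for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def prove(levels, i):
    """VO for leaf i: the sibling hash at every level on the root path."""
    vo = []
    for level in levels[:-1]:
        sib = i ^ 1
        vo.append(level[sib] if sib < len(level) else level[i])
        i //= 2
    return vo

def verify(leaf, i, vo, root):
    """Client-side check: recompute the root from the answer and its VO."""
    h = H(leaf)
    for sib in vo:
        h = H(h, sib) if i % 2 == 0 else H(sib, h)
        i //= 2
    return h == root
```

A VO has one hash per level, O(log N), matching the bound quoted above for the 1-D case; the paper's contribution is proving this pattern secure for a much broader class of structures.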
2025, IFIP International Federation for Information Processing
2025, Algorithmica
Query answers from on-line databases can easily be corrupted by hackers or malicious database publishers. Thus it is important to provide mechanisms which allow clients to trust the results from on-line queries. Authentic Publication... more
2025, Journal of Systems and Software
In several applications, data objects move on pre-defined spatial networks such as road segments, railways, and invisible air routes. Many of these objects exhibit similarity with respect to their traversed paths, and therefore two... more
In several applications, data objects move on pre-defined spatial networks such as road segments, railways, and invisible air routes. Many of these objects exhibit similarity with respect to their traversed paths, and therefore two objects can be correlated based on their motion similarity. Useful information can be retrieved from these correlations and this knowledge can be used to define similarity classes. In this paper, we study similarity search for moving object trajectories in spatial networks. The problem poses some important challenges, since it is quite different from the case where objects are allowed to move freely in any direction without motion restrictions. New similarity measures should be employed to express similarity between two trajectories that do not necessarily share any common sub-path. We define new similarity measures based on spatial and temporal characteristics of trajectories, such that the notion of similarity in space and time is well expressed, and moreover they satisfy the metric properties. In addition, we demonstrate that similarity range queries in trajectories are efficiently supported by utilizing metric-based access methods, such as M-trees.
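The metric properties matter because they enable triangle-inequality pruning in access methods such as the M-tree. A minimal single-pivot sketch of that pruning (illustrative, not the paper's trajectory measures; `euclid` stands in for any metric distance):

```python
import math

def euclid(a, b):
    """Any metric works here; Euclidean distance is used as a stand-in."""
    return math.dist(a, b)

class PivotTable:
    """Precompute distances to one pivot, then prune range-query candidates
    with the triangle inequality: d(q,o) >= |d(q,p) - d(p,o)|."""
    def __init__(self, objects, dist, pivot):
        self.objects, self.dist, self.pivot = objects, dist, pivot
        self.dp = [dist(pivot, o) for o in objects]  # one-time preprocessing

    def range_query(self, q, r):
        dq = self.dist(q, self.pivot)
        hits, computed = [], 0
        for o, d_po in zip(self.objects, self.dp):
            if abs(dq - d_po) > r:
                continue                 # safe skip: d(q,o) must exceed r
            computed += 1                # only now pay for a real distance
            if self.dist(q, o) <= r:
                hits.append(o)
        return hits, computed
```

The query returns exactly the brute-force answer while typically computing far fewer distances; this is the basic mechanism an M-tree applies hierarchically.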
2025, Data & Knowledge Engineering
In contrast to regular queries that are evaluated only once, a continuous query remains active over a period of time and has to be continuously evaluated to provide up-to-date answers. We propose a method for continuous range query... more
In contrast to regular queries that are evaluated only once, a continuous query remains active over a period of time and has to be continuously evaluated to provide up-to-date answers. We propose a method for continuous range query processing for different types of queries, characterized by mobility of objects and/or queries, all of which follow paths in an underlying spatial network. The method assumes an available 2D indexing scheme for indexing spatial network data. An appropriately extended R*-tree, which is primarily used as an indexing scheme for network segments, provides matching of queries and objects according to their locations on the network or their network routes. The method introduces an additional pre-refinement step which generates main-memory data structures to support efficient, incremental reevaluation of continuous range queries in periodically performed refinement steps.
2024
For large-scale distributed systems, designing energy-efficient protocols and services has become significant alongside conventional performance criteria like scalability, reliability, fault-tolerance and security. Due to its... more
For large-scale distributed systems, designing energy-efficient protocols and services has become significant alongside conventional performance criteria like scalability, reliability, fault-tolerance and security. Due to its extensive applicability in diverse areas, we consider the frequent item set discovery problem in this context. A simulation model of ProFID, a distributed protocol for frequent item set discovery in unstructured networks, is developed on PeerSim.
2024, Knowledge and Information Systems
We propose a locally adaptive technique to address the problem of setting the bandwidth parameters for kernel density estimation. Our technique is efficient and can be performed in only two dataset passes. We also show how to apply our... more
We propose a locally adaptive technique to address the problem of setting the bandwidth parameters for kernel density estimation. Our technique is efficient and can be performed in only two dataset passes. We also show how to apply our technique to efficiently solve range query approximation, classification and clustering problems for very large datasets. We validate the efficiency and accuracy of our technique by presenting experimental results on a variety of both synthetic and real datasets.
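As a baseline for the range-query-approximation application mentioned above, a fixed-bandwidth Gaussian KDE already answers 1-D range-count queries without scanning the data at query time; the paper's contribution is choosing the bandwidth locally, which this sketch deliberately does not do:

```python
import math

def kde_range_estimate(data, a, b, bandwidth):
    """Estimate how many points of `data` fall in [a, b] by integrating a
    Gaussian kernel density estimate: each kernel contributes its normal-CDF
    mass over the interval. Fixed (non-adaptive) bandwidth for simplicity."""
    def phi(z):  # standard normal CDF
        return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return sum(phi((b - x) / bandwidth) - phi((a - x) / bandwidth)
               for x in data)
```

In practice the data would be summarized first (e.g., by a sample or sketch) so that the sum runs over far fewer terms than the dataset size; the bandwidth choice then dominates accuracy, which is exactly the parameter the paper tunes locally.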
2024, Sensors
The Wireless Sensor Network similarity search problem has received considerable research attention due to sensor hardware imprecision and environmental parameter variations. Most of the state-of-the-art distributed data centric storage... more
The Wireless Sensor Network similarity search problem has received considerable research attention due to sensor hardware imprecision and environmental parameter variations. Most state-of-the-art distributed data-centric storage (DCS) schemes lack optimization for similarity queries of events. In this paper, a DCS scheme with metric-based similarity searching (DCSMSS) is proposed. DCSMSS takes its motivation from a vector distance index, called iDistance, to transform the issue of similarity searching into the problem of an interval search in one dimension. In addition, a sector-based distance routing algorithm is used to efficiently route messages. Extensive simulation results reveal that DCSMSS is highly efficient and significantly outperforms previous approaches in processing similarity search queries.
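The iDistance transform can be sketched as follows (a simplified, centralized version; the partition constant C is an assumed parameter that must exceed any possible distance so the per-partition key ranges do not overlap):

```python
import bisect
import math

def idistance_build(points, centers, C):
    """Map each point p to a 1-D key i*C + d(p, c_i), where c_i is its
    nearest center. Returns (key, index, point) triples sorted by key."""
    keyed = []
    for idx, p in enumerate(points):
        i = min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
        keyed.append((i * C + math.dist(p, centers[i]), idx, p))
    keyed.sort()
    return keyed

def idistance_range(keyed, centers, q, r, C):
    """Similarity range query: by the triangle inequality, any match stored
    in partition i has key in [i*C + d(q,c_i) - r, i*C + d(q,c_i) + r],
    so only those 1-D intervals are scanned and verified."""
    keys = [k for k, _, _ in keyed]
    seen, out = set(), []
    for i, c in enumerate(centers):
        dq = math.dist(q, c)
        lo = bisect.bisect_left(keys, i * C + max(0.0, dq - r))
        hi = bisect.bisect_right(keys, i * C + dq + r)
        for _, idx, p in keyed[lo:hi]:
            if idx not in seen and math.dist(q, p) <= r:
                seen.add(idx)
                out.append(p)
    return out
```

The interval scans are exact (no false dismissals) because every candidate is verified with a real distance computation; in DCSMSS the 1-D key space is what gets distributed across sensor nodes.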
2024, Springer eBooks
Given an array A of size n, we consider the problem of answering range majority queries: given a query range [i..j] where 1 ≤ i ≤ j ≤ n, return the majority element of the subarray A[i..j] if it exists. We describe a linear space data... more
Given an array A of size n, we consider the problem of answering range majority queries: given a query range [i..j] where 1 ≤ i ≤ j ≤ n, return the majority element of the subarray A[i..j] if it exists. We describe a linear space data structure that answers range majority queries in constant time. We further generalize this problem by defining range α-majority queries: given a query range [i..j], return all the elements in the subarray A[i..j] with frequency greater than α(j − i + 1). We prove an upper bound on the number of α-majorities that can exist in a subarray, assuming that query ranges are restricted to be larger than a given threshold. Using this upper bound, we generalize our range majority data structure to answer range α-majority queries in O(1/α) time using O(n lg(1/α + 1)) space, for any fixed α ∈ (0, 1). This result is interesting since other similar range query problems based on frequency have nearly logarithmic lower bounds on query time when restricted to linear space.
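For reference, the problem statement can be checked against a brute-force baseline that scans the query range in O(j − i + 1) time, as opposed to the paper's O(1/α)-time, O(n lg(1/α + 1))-space structure:

```python
from collections import Counter

def range_alpha_majority(A, i, j, alpha):
    """All elements occurring strictly more than alpha*(j-i+1) times in
    A[i..j] (1-based, inclusive, as in the abstract). Brute force: scans
    the whole range per query. Note there can be at most floor(1/alpha)
    such elements, which is the upper bound the paper exploits."""
    window = A[i - 1:j]
    threshold = alpha * len(window)
    return sorted(x for x, c in Counter(window).items() if c > threshold)
```

With α = 1/2 this reduces to the ordinary range majority query, returning at most one element.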
2024
We study the problem of applying adaptive filters for approximate query processing in a distributed stream environment. We propose filter bound assignment protocols with the objective of reducing communication cost. Most previous works... more
We study the problem of applying adaptive filters for approximate query processing in a distributed stream environment. We propose filter bound assignment protocols with the objective of reducing communication cost. Most previous works focus on value-based queries (e.g., average) with numerical error tolerance. In this paper, we cover entity-based queries (e.g., a nearest neighbor query returns object names rather than a single value). In particular, we study non-value-based tolerance (e.g., the answer to the nearest-neighbor query should rank third or above). We investigate different non-value-based error tolerance definitions and discuss how they are applied to two classes of entity-based queries: non-rank-based and rank-based queries. Extensive experiments show that our protocols achieve significant savings in both communication overhead and server computation.
2024
Cloud computing is a technology that facilitates the storing and managing of data in a decentralized manner. It includes a number of models and provides numerous services. It has many advantages and relatively few disadvantages, which... more
Cloud computing is a technology that facilitates the storing and managing of data in a decentralized manner. It includes a number of models and provides numerous services. It has many advantages and relatively few disadvantages, which makes the move to cloud computing quite attractive. However, since the data is out of the owner's control, concerns have arisen with regard to data confidentiality. Encryption techniques have previously been proposed to provide users with confidentiality in terms of outsourced storage. These encryption algorithms allow for queries to be processed using encrypted data without decryption. However, a number of these encryption algorithms are weak, enabling adversaries to compromise data simply by compromising an algorithm. We propose a combination of encryption algorithms and a distribution system to improve database confidentiality. This scheme distributes the database across the clouds based on the level of security that is provided by the encryption algorithms utilized. A hybrid cloud model is used in this research, which is a combination of public and private clouds, with the critical activities taking place within the private cloud. We analyzed our scheme by designing and conducting an experiment and by comparing our scheme with existing solutions. The results demonstrate that our scheme offers a highly secure approach that provides users with data confidentiality. It also provides acceptable overhead performance and supports query processing.
2024, Symposium on Discrete Algorithms
We present space-time tradeoffs for approximate spherical range counting queries. Given a set S of n data points in R^d along with a positive approximation factor ε, the goal is to preprocess the points so that, given any Euclidean ball... more
We present space-time tradeoffs for approximate spherical range counting queries. Given a set S of n data points in R^d along with a positive approximation factor ε, the goal is to preprocess the points so that, given any Euclidean ball B, we can return the number of points of any subset of S that contains all the points within a (1 − ε)-factor contraction of B, but contains no points that lie outside a (1 + ε)-factor expansion of B. In many applications of range searching it is desirable to offer a tradeoff between space and query time. We present here the first such tradeoffs for approximate range counting queries. Given 0 < ε ≤ 1/2 and a parameter γ, where 2 ≤ γ ≤ 1/ε, we show how to construct a data structure of space O(nγ^d log(1/ε)) that allows us to answer ε-approximate spherical range counting queries in time O(log(nγ) + 1/(εγ)^{d−1}). The data structure can be built in time O(nγ^d log(n/ε) log(1/ε)). Here n, ε, and γ are asymptotic quantities, and the dimension d is assumed to be a fixed constant. At one extreme (low space), this yields a data structure of space O(n log(1/ε)) that can answer approximate range queries in time O(log n + (1/ε)^{d−1}) which, up to a factor of O(log(1/ε)) in space, matches the best known result for approximate spherical range counting queries. At the other extreme (high space), it yields a data structure of space O((n/ε^d) log(1/ε)) that can answer queries in time O(log n + log(1/ε)). This is the fastest known query time for this problem. Our approach is broadly based on methods developed for approximate Voronoi diagrams (AVDs), but it involves a number of significant extensions from the context of nearest neighbor searching to range searching. These include generalizing AVD node-separation properties from leaves to internal nodes of the tree and constructing efficient generator sets through a radial decomposition of space. We have also developed new arguments to analyze the time and space requirements in this more general setting.
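The approximation guarantee above can be stated operationally: a reported count is acceptable iff it lies between the exact counts for the contracted and the expanded ball. A small checker makes this concrete (illustrative code, not part of the paper):

```python
import math

def is_valid_approx_count(points, center, radius, eps, reported):
    """An eps-approximate spherical range count must include every point
    inside the (1-eps)-contracted ball and exclude every point outside the
    (1+eps)-expanded ball; any total between those two bounds is a legal
    answer. Points near the boundary may be counted either way."""
    inner = sum(1 for p in points
                if math.dist(p, center) <= (1 - eps) * radius)
    outer = sum(1 for p in points
                if math.dist(p, center) <= (1 + eps) * radius)
    return inner <= reported <= outer
```

This slack on boundary points is precisely what lets the data structure trade accuracy for sublinear query time.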
2024
We establish two new lower bounds for the halfspace range searching problem: Given a set of n points in R^d, where each point is associated with a weight from a commutative semigroup, compute the semigroup sum of the weights of the... more
We establish two new lower bounds for the halfspace range searching problem: Given a set of n points in R^d, where each point is associated with a weight from a commutative semigroup, compute the semigroup sum of the weights of the points lying within any query halfspace. Letting m denote the space requirements, we prove a lower bound for general semigroups of Ω(n^{1−1/(d+1)}/m^{1/(d+1)}) and for integral semigroups of Ω(n/m^{1/d}). Our lower bounds are proved in the semigroup arithmetic model. Neglecting logarithmic factors, our result for integral semigroups matches the best known upper bound due to Matoušek. Our result for general semigroups improves upon the best known lower bound due to Brönnimann, Chazelle, and Pach. Moreover, Fonseca and Mount have recently shown that, given uniformly distributed points, halfspace range queries over idempotent semigroups can be answered in O(n^{1−1/(d+1)}/m^{1/(d+1)}) time in the semigroup arithmetic model. As our lower bounds are established for uniformly distributed point sets, it follows that they also resolve the computational complexity of halfspace range searching over idempotent semigroups in this important special case.
2024, Lecture Notes in Computer Science
A consistent query protocol (CQP) allows a database owner to publish a very short string c which commits her and everybody else to a particular database D, so that any copy of the database can later be used to answer queries and give... more
A consistent query protocol (CQP) allows a database owner to publish a very short string c which commits her and everybody else to a particular database D, so that any copy of the database can later be used to answer queries and give short proofs that the answers are consistent with the commitment c. Here "commits" means that there is at most one database D that anybody can find (in polynomial time) which is consistent with c. (Unlike in some previous work, this strong guarantee holds even for owners who try to cheat while creating c.) Efficient CQPs for membership and one-dimensional range queries are known [5, 17, 22]: given a query pair a, b ∈ R, the server answers with all the keys in the database which lie in the interval [a, b] and a proof that the answer is correct. This paper explores CQPs for more general types of databases. We put forward a general technique for constructing CQPs for any type of query, assuming the existence of a data structure/algorithm with certain inherent robustness properties that we define (called a data-robust algorithm). We illustrate our technique by constructing an efficient protocol for orthogonal range queries, where the database keys are points in R^d and a query asks for all keys in a rectangle [a_1, b_1] × ... × [a_d, b_d]. Our data-robust algorithm is within an O(log N) factor of the best known standard data structure (a range tree, due to Bentley [2]). We modify our protocol so that it is also private, that is, the proofs leak no information about the database beyond the query answers. We show a generic modification to ensure privacy based on zero-knowledge proofs, and also give a new, more efficient protocol tailored to hash trees.
2024
The use of the join operator in metric spaces leads to what is known as a similarity join, where objects of two datasets are paired if they are somehow similar. We propose a heuristic that solves the 1-NN self-similarity join, that is, a... more
The use of the join operator in metric spaces leads to what is known as a similarity join, where objects of two datasets are paired if they are somehow similar. We propose a heuristic that solves the 1-NN self-similarity join, that is, a similarity join of a dataset with itself, which brings together each element with its nearest neighbor within the same dataset. Solving the problem with a simple brute-force algorithm requires O(n^2) distance calculations, since it compares every element against all others. We propose a simple divide-and-conquer algorithm that gives an approximate solution for the self-similarity join and computes only O(n^{3/2}) distances. We show how the algorithm can be easily modified in order to improve the precision up to 31% (i.e., the percentage of correctly found 1-NNs) and such that 79% of the results are within the 10-NN, with no significant extra distance computations. We present how the algorithm can be executed in parallel and prove that usin...
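A simplified stand-in for the two algorithms compared in the abstract (the fixed-block grouping below is an assumption for illustration, not the authors' exact divide-and-conquer heuristic):

```python
import math

def nn_self_join_brute(points):
    """Exact 1-NN self-join: for each point, index of its nearest other
    point. Costs O(n^2) distance computations."""
    n = len(points)
    return [min((k for k in range(n) if k != i),
                key=lambda k: math.dist(points[i], points[k]))
            for i in range(n)]

def nn_self_join_grouped(points, group_size):
    """Approximate 1-NN self-join: search for each point's NN only within
    its own group. With group_size ~ sqrt(n), each of the n/sqrt(n) groups
    costs O(n) distances, for O(n^{3/2}) total. Requires group_size >= 2
    and a last group of at least two points."""
    nn = [None] * len(points)
    for s in range(0, len(points), group_size):
        idx = range(s, min(s + group_size, len(points)))
        for i in idx:
            nn[i] = min((k for k in idx if k != i),
                        key=lambda k: math.dist(points[i], points[k]))
    return nn
```

The grouped answer can only be as close as or farther than the true nearest neighbor, which is why the paper measures precision (the fraction of correctly found 1-NNs) rather than exactness.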
2024, Lecture Notes in Computer Science
This paper describes the parallelization of the Spatial Approximation Tree. This data structure has been shown to be an efficient index structure for solving range queries in high-dimensional metric space databases. We propose a method... more
This paper describes the parallelization of the Spatial Approximation Tree. This data structure has been shown to be an efficient index structure for solving range queries in high-dimensional metric space databases. We propose a method for load balancing the work performed by the processors. The method is self-tuning and is able to dynamically follow changes in the work-load generated by user queries. Empirical results with different databases show efficient performance in practice. The algorithmic design is based on the use of the bulk-synchronous model of parallel computing.
2024
Many computational applications need to look for information in a database. Nowadays, the predominance of non-conventional databases makes similarity search (i.e., searching for elements of the database that are "similar" to... more
Many computational applications need to look for information in a database. Nowadays, the predominance of non-conventional databases makes similarity search (i.e., searching for elements of the database that are "similar" to a given query) a preponderant concept. The Spatial Approximation Tree has been shown to compare favorably against alternative data structures for similarity searching in metric spaces of medium to high dimensionality ("difficult" spaces) or for queries with low selectivity. However, in the construction process the tree root has been selected at random, and the tree, in both its shape and its performance, is completely determined by this selection. Therefore, we are interested mainly in improving searches in this data structure by trying to select the tree root so as to reflect some characteristics of the metric space to be indexed. We expect that selecting the root in this way allows a better adaptation of the data structure ...
2024, Journal of Discrete Algorithms
The metric space model abstracts many proximity or similarity problems, where the most frequently considered primitives are range and k-nearest neighbor search, leaving out the similarity join, an extremely important primitive. In fact,... more
The metric space model abstracts many proximity or similarity problems, where the most frequently considered primitives are range and k-nearest neighbor search, leaving out the similarity join, an extremely important primitive. In fact, despite the great attention that this primitive has received in traditional and even multidimensional databases, little has been done for general metric databases. We solve two variants of the similarity join problem: (1) range joins: given two sets of objects and a distance threshold r, find all the object pairs (one from each set) at distance at most r; and (2) k-closest pair joins: find the k closest object pairs (one from each set). To this end, we devise a new metric index, coined the List of Twin Clusters (LTC), which indexes both sets jointly, instead of the natural approach of indexing one or both sets independently. Finally, we show how to use the LTC to solve classical range queries. Our results show significant speedups over the basic quadratic-time naive alternative for both join variants, and show that the LTC is competitive with the original list of clusters when solving range queries. Furthermore, we show that our technique has great potential for improvements.
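The cluster-pruning idea behind indexes like the LTC can be sketched with a naive clustering of one input set (center selection here is deliberately crude and illustrative; the LTC's twin-cluster construction is more refined):

```python
import math

def range_join_naive(A, B, r):
    """Baseline range join: test every cross pair, O(|A|*|B|) distances."""
    return {(a, b) for a in A for b in B if math.dist(a, b) <= r}

def range_join_clustered(A, B, r, k=4):
    """Cluster B around k centers, then discard whole clusters via the
    triangle inequality: if d(a, c) - radius(c) > r, then every b in c's
    cluster satisfies d(a, b) >= d(a, c) - d(c, b) > r, so the cluster
    cannot contribute any join pair for a."""
    centers = B[:k]                       # crude, illustrative center choice
    groups = {i: [] for i in range(k)}
    for b in B:
        groups[min(range(k), key=lambda j: math.dist(b, centers[j]))].append(b)
    radii = {i: max((math.dist(centers[i], b) for b in g), default=0.0)
             for i, g in groups.items()}
    out = set()
    for a in A:
        for i, g in groups.items():
            if math.dist(a, centers[i]) - radii[i] > r:
                continue                  # prune the entire cluster
            out.update((a, b) for b in g if math.dist(a, b) <= r)
    return out
```

Pruning is sound (never drops a true pair), so the clustered join returns exactly the naive result while skipping whole groups of distance computations, which is where the reported speedups come from.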