Space-efficient online computation of quantile summaries
An $\epsilon$-approximate quantile summary of a sequence of $N$ elements is a data structure that can answer quantile queries about the sequence to within a precision of $\epsilon N$. We present a new online algorithm for computing $\epsilon$-approximate quantile summaries of very large data sequences. The algorithm has a worst-case space requirement of $O(\frac{1}{\epsilon} \log(\epsilon N))$. This improves upon the previous best result of $O(\frac{1}{\epsilon} \log^2(\epsilon N))$. Moreover, in contrast to earlier deterministic algorithms, our algorithm does not require a priori knowledge of the length of the input sequence. Finally, the actual space bounds obtained on experimental data are significantly better than the worst-case guarantees of our algorithm as well as the observed space requirements of earlier algorithms.

1.1 Quantile Estimation for Database Applications

Recent work (e.g., [8, 9, 12]) has highlighted the importance of quantile estimators for database users and implementors. Quantile estimates are used to estimate the size of intermediate results, allowing query optimizers to estimate the cost of competing plans to resolve database queries. Parallel databases attempt to partition the data into value ranges such that the sizes of all partitions are roughly equal. Quantile estimates can be used to choose the ranges without inspecting the actual data.

Quantile estimates have several other uses in databases as well. User interfaces may estimate result sizes of queries and provide feedback to users. This feedback may prevent expensive and incorrect queries from being issued, and may flag discrepancies between the user's model of the database and its actual content. Quantile estimates are also used by database users to characterize the distribution of real-world data sets.

The existing body of work has also identified particular properties that quantile estimators require in order to be useful for these database applications, properties that may not be strictly necessary when estimating quantiles in other domains.
Some of the desirable properties are as follows.

(1) The algorithm should provide tunable and explicit a priori guarantees on the precision of the approximation. We say that a quantile summary is $\epsilon$-approximate if it can be used to answer any quantile query to within a precision of $\epsilon N$. In other words, for any given rank $r$, an $\epsilon$-approximate quantile summary returns a value whose rank $r'$ is guaranteed to be within the interval $[r - \epsilon N,\; r + \epsilon N]$.

(2) The algorithm should be data independent. Its guarantees should not be affected by the arrival order or distribution of values, nor should it require a priori knowledge of the size of the dataset.

(3) The algorithm should execute in a single pass over the data.

(4) The algorithm should have as small a memory footprint as possible.

We note here that the memory footprint applies to temporary storage during the computation. We can always construct an $\epsilon$-approximate summary of size $O(1/\epsilon)$ as follows. We first construct an
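The precision guarantee in property (1) can be made concrete with a short sketch. The Python below is an illustration only, not the online algorithm this paper presents: it sorts the whole input (so it uses $O(N)$ temporary space and needs $N$ in advance), keeps values at ranks spaced at most $2\epsilon N$ apart, and checks that every rank query is answered to within $\epsilon N$. The names `naive_offline_summary`, `query_rank`, and `is_eps_approximate_answer` are hypothetical helpers introduced here.

```python
import math
from bisect import bisect_left, bisect_right


def naive_offline_summary(data, eps):
    """Illustrative OFFLINE baseline (an assumption of this sketch, not the
    paper's method): sort all data and keep (rank, value) pairs whose
    1-based ranks are at most 2*eps*N apart."""
    s = sorted(data)
    n = len(s)
    step = max(1, math.floor(2 * eps * n))
    ranks = list(range(1, n + 1, step))
    if n > 0 and ranks[-1] != n:
        ranks.append(n)  # always keep the maximum
    return [(rk, s[rk - 1]) for rk in ranks]


def query_rank(summary, r):
    """Return the stored value whose stored rank is closest to r."""
    ranks = [rk for rk, _ in summary]
    i = bisect_left(ranks, r)
    candidates = summary[max(0, i - 1): i + 1]
    return min(candidates, key=lambda pair: abs(pair[0] - r))[1]


def is_eps_approximate_answer(sorted_data, r, value, eps):
    """True if some rank r' of `value` in sorted_data satisfies
    r - eps*N <= r' <= r + eps*N (ranks are 1-based; duplicates of a
    value occupy a contiguous range of ranks)."""
    n = len(sorted_data)
    lowest = bisect_left(sorted_data, value) + 1
    highest = bisect_right(sorted_data, value)
    if highest < lowest:  # value does not occur in the data
        return False
    return lowest <= r + eps * n and highest >= r - eps * n
```

Because consecutive stored ranks differ by at most $2\epsilon N$, every rank $r$ lies within $\epsilon N$ of some stored rank, and the summary holds roughly $1/(2\epsilon)$ entries. The sketch violates properties (2) through (4), which is the gap a single-pass, small-footprint algorithm has to close.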