Dan Suciu - Academia.edu

Papers by Dan Suciu

Integrity Constraints Revisited: From Exact to Approximate Implication

Logical Methods in Computer Science, 2022

Integrity constraints such as functional dependencies (FD) and multi-valued dependencies (MVD) are fundamental in database schema design. Likewise, probabilistic conditional independences (CI) are crucial for reasoning about multivariate probability distributions. The implication problem studies whether a set of constraints (antecedents) implies another constraint (consequent), and has been investigated in both the database and the AI literature, under the assumption that all constraints hold exactly. However, many applications today consider constraints that hold only approximately. In this paper we define an approximate implication as a linear inequality between the degree of satisfaction of the antecedents and consequent, and we study the relaxation problem: when does an exact implication relax to an approximate implication? We use information theory to define the degree of satisfaction, and prove several results. First, we show that any implication from a set of data dependencie...
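
To make the setup concrete, here is a minimal sketch in our own notation (the measures below are the standard information-theoretic ones the abstract alludes to; the symbols h, σ, τ, λ are ours, not quoted from the paper): an FD's degree of satisfaction can be taken to be a conditional entropy, a CI's a conditional mutual information, and an approximate implication is a linear inequality between them.

```latex
% Degree of satisfaction (zero iff the constraint holds exactly):
\[
  h(X \to Y) = H(Y \mid X), \qquad
  h(A \perp B \mid C) = I(A; B \mid C).
\]
% The exact implication \sigma_1, \dots, \sigma_k \models \tau relaxes to an
% approximate implication if, for some constant \lambda \ge 0,
\[
  h(\tau) \;\le\; \lambda \sum_{i=1}^{k} h(\sigma_i),
\]
% so near-satisfaction of the antecedents forces near-satisfaction of \tau.
```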

Degree Sequence Bound For Join Cardinality Estimation

Recent work has demonstrated the catastrophic effects of poor cardinality estimates on query processing time. In particular, underestimating query cardinality can result in overly optimistic query plans which take orders of magnitude longer to complete than one generated with the true cardinality. Cardinality bounding avoids this pitfall by computing a strict upper bound on the query's output size using statistics about the database such as table sizes and degrees, i.e. value frequencies. In this paper, we extend this line of work by proving a novel bound called the Degree Sequence Bound, which takes into account the full degree sequences and the max tuple multiplicity. This bound improves upon previous work incorporating degree constraints, which focused on the maximum degree rather than the degree sequence. Further, we describe how to practically compute this bound using a learned approximation of the true degree sequences.
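
As a minimal sketch of why degree sequences bound join sizes, consider a single equi-join of two tables on one attribute (the paper's bound covers full join queries and the max tuple multiplicity, which this toy version ignores; all names here are ours):

```python
from collections import Counter

def join_upper_bound(r_keys, s_keys):
    """Upper bound on |R JOIN S| over one attribute, using only the two
    degree sequences (value frequencies) and not the values themselves.

    Pairing the descending-sorted degree sequences rank by rank maximizes
    sum(d_R * d_S) over every possible alignment of values (rearrangement
    inequality), so the result is a valid upper bound on the join size."""
    deg_r = sorted(Counter(r_keys).values(), reverse=True)
    deg_s = sorted(Counter(s_keys).values(), reverse=True)
    return sum(dr * ds for dr, ds in zip(deg_r, deg_s))

R = ["a", "a", "a", "b", "c"]   # degree sequence: [3, 1, 1]
S = ["b", "b", "c", "d"]        # degree sequence: [2, 1, 1]
print(join_upper_bound(R, S))   # 3*2 + 1*1 + 1*1 = 8; true join size is 3
```

Because the bound never asks which values actually coincide across tables, it can be computed from per-table statistics alone, which is what makes a learned approximation of the degree sequences usable.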

Editors: Wim Martens and Thomas Zeume

In this paper, we study the communication complexity for the problem of computing a conjunctive query on a large database in a parallel setting with p servers. In contrast to previous work, where upper and lower bounds on the communication were specified for particular structures of data (either data without skew, or data with specific types of skew), in this work we focus on worst-case analysis of the communication cost. The goal is to find worst-case optimal parallel algorithms, similar to the work of [17] for sequential algorithms. We first show that for a single round we can obtain an optimal worst-case algorithm. The optimal load for a conjunctive query q when all relations have size equal to M is O(M/p^{1/ψ*}), where ψ* is a new query-related quantity called the edge quasi-packing number, which is different from both the edge packing number and the edge cover number of the query hypergraph. For multiple rounds, we present algorithms that are optimal for several classes of queries....
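
As a worked instance (our example; these numbers for the triangle query are standard in this line of work):

```latex
% Triangle query Q(x,y,z) = R(x,y), S(y,z), T(z,x), all relations of size M.
% Its fractional edge packing and edge cover numbers are both 3/2, but its
% edge quasi-packing number is \psi^* = 2, so the optimal one-round load is
\[
  L \;=\; \tilde{O}\!\left(M / p^{1/\psi^*}\right)
    \;=\; \tilde{O}\!\left(M / p^{1/2}\right),
\]
% strictly larger than the \tilde{O}(M / p^{2/3}) load achievable with
% more rounds, which is governed by the fractional edge cover instead.
```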

Dagstuhl Seminar Proceedings 05061 Foundations of Semistructured Data

From 06.02.05 to 11.02.05, the Dagstuhl Seminar 05061 "Foundations of Semistructured Data" was held in the International Conference and Research Center (IBFI), Schloss Dagstuhl. During the seminar, several participants presented their current research, and ongoing work and open problems were discussed. Abstracts of the presentations given during the seminar as well as abstracts of seminar results and ideas are put together in this paper. The first section describes the seminar topics and goals in general. Links to extended abstracts or full papers are provided, if available. 05061 Summary "Foundations of Semistructured Data": As in the first seminar on this topic, the aim of the workshop was to bring together people from the areas related to semi-structured data. However, besides the presentation of recent work, this time the main goal was to identify the main lines of a common framework for future foundational work on semi-structured data. These lines of research are summarized b...

Approximate Lifted Inference with Probabilistic Databases

This paper proposes a new approach for approximate evaluation of #P-hard queries with probabilistic databases. In our approach, every query is evaluated entirely in the database engine by evaluating a fixed number of query plans, each providing an upper bound on the true probability, then taking their minimum. We provide an algorithm that takes into account important schema information to enumerate only the minimal necessary plans among all possible plans. Importantly, this algorithm is a strict generalization of all known results of PTIME self-join-free conjunctive queries: a query is safe if and only if our algorithm returns one single plan. We also apply three relational query optimization techniques to evaluate all minimal safe plans very fast. We give a detailed experimental evaluation of our approach and, in the process, provide a new way of thinking about the value of probabilistic methods over non-probabilistic methods for ranking query answers.
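
A schematic sketch of the minimum-over-plans idea (our toy code; the paper's system evaluates each plan inside the database engine rather than in application code):

```python
def ind_or(ps):
    """Extensional 'independent project': probability that at least one
    event holds, computed as if the events were independent. For the
    positively correlated events that shared tuples induce, this can
    only over-estimate, which is what makes each plan an upper bound."""
    out = 1.0
    for p in ps:
        out *= 1.0 - p
    return 1.0 - out

def approximate_probability(plan_bounds):
    """Each enumerated minimal plan yields an upper bound on the true
    query probability; their minimum is the tightest of those bounds."""
    return min(plan_bounds)

# e.g. three minimal plans evaluated to the bounds 0.71, 0.64, and 0.68:
print(approximate_probability([0.71, 0.64, 0.68]))  # -> 0.64
```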

Data Management for Causal Algorithmic Fairness

Fairness is increasingly recognized as a critical component of machine learning systems. However, it is the underlying data on which these systems are trained that often reflects discrimination, suggesting a data management problem. In this paper, we first make a distinction between associational and causal definitions of fairness in the literature and argue that the concept of fairness requires causal reasoning. We then review existing works and identify future opportunities for applying data management techniques to causal algorithmic fairness.

Capuchin: Causal Database Repair for Algorithmic Fairness

Fairness is increasingly recognized as a critical component of machine learning systems. However, it is the underlying data on which these systems are trained that often reflects discrimination, suggesting a database repair problem. Existing treatments of fairness rely on statistical correlations that can be fooled by statistical anomalies, such as Simpson's paradox. Proposals for causality-based definitions of fairness can correctly model some of these situations, but they require specification of the underlying causal models. In this paper, we formalize the situation as a database repair problem, proving sufficient conditions for fair classifiers in terms of admissible variables as opposed to a complete causal model. We show that these conditions correctly capture subtle fairness violations. We then use these conditions as the basis for database repair algorithms that provide provable fairness guarantees about classifiers trained on their training labels. We evaluate our algori...
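
One way to make "fairness violation in the data" measurable (an illustrative metric of ours, not Capuchin's repair algorithm): estimate the conditional mutual information between the sensitive attribute S and the label Y given the admissible variables A. It is zero exactly when the training data satisfies the corresponding conditional independence, which is the kind of condition a repaired database should meet.

```python
import math
from collections import Counter

def conditional_mutual_information(rows):
    """Empirical I(S; Y | A) over rows of (s, y, a) tuples, in bits.
    Zero iff S and Y are independent given A in the empirical
    distribution; larger values indicate a stronger dependence of the
    label on the sensitive attribute beyond the admissible variables."""
    n = len(rows)
    c_sya = Counter(rows)
    c_sa = Counter((s, a) for s, y, a in rows)
    c_ya = Counter((y, a) for s, y, a in rows)
    c_a = Counter(a for s, y, a in rows)
    cmi = 0.0
    for (s, y, a), k in c_sya.items():
        # p(s,y,a) * log2( p(s,y,a) p(a) / (p(s,a) p(y,a)) ), via counts
        cmi += (k / n) * math.log2(k * c_a[a] / (c_sa[(s, a)] * c_ya[(y, a)]))
    return cmi
```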

Affordable Analytics on Expensive Data

In this paper, we outline steps towards supporting “data analysis on a budget” when operating in a setting where data must be bought, possibly periodically. We model the problem, and explore the design choices for analytic applications as well as potentially fruitful algorithmic techniques to reduce the cost of acquiring data. Simulations suggest that order-of-magnitude improvements are possible.

Indexing Heterogeneous Data

We consider the indexing problem for heterogeneous data, where objects are sets of attribute-value pairs, and the queries specify values for an arbitrary subset of the attributes. This problem occurs in a variety of applications, such as searching individual databases, searching entire collections of heterogeneous data sources, locating sources in distributed systems, and indexing large XML documents. To date no efficient data structure is known for such queries. In its most simplified form the problem we address becomes the partial match problem, which has been studied extensively and is known to be computationally hard. We describe here the first practical technique for building such an index. Our basic idea is to precompute certain queries and store their results. User queries are then answered by retrieving the "closest" stored query and removing from its answers all false positives. The crux of the technique consists in choosing which queries to precompute. There are se...
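
A toy rendering of the precompute-and-filter idea (our sketch; all names are hypothetical): stored queries are sets of attribute-value pairs with precomputed answers, a user query is answered from the largest stored subset of its pairs, and false positives are then filtered out.

```python
def matches(obj, query_pairs):
    """True if the object (an attribute -> value dict) contains all pairs."""
    return all(obj.get(a) == v for a, v in query_pairs)

class PartialMatchIndex:
    """Toy version of the precompute-and-filter technique."""

    def __init__(self, objects, stored_queries):
        # The empty query is always stored as a fallback: its answer set
        # is every object, so some stored subset always exists.
        self.results = {frozenset(): list(objects)}
        for q in stored_queries:
            key = frozenset(q.items())
            self.results[key] = [o for o in objects if matches(o, key)]

    def answer(self, query):
        q = set(query.items())
        # "Closest" stored query = largest precomputed subset of q; every
        # answer to q is among its stored answers, so filtering is sound.
        best = max((s for s in self.results if s <= q), key=len)
        return [o for o in self.results[best] if matches(o, q)]
```

The choice of which queries to precompute (the crux the abstract mentions) corresponds here to choosing `stored_queries` so that user queries usually find a large, selective stored subset.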

Principles of Data Management (Abridged)

Variance is a popular and often necessary component of aggregation queries. It is typically used as a secondary measure to ascertain statistical properties of the result, such as its error. Yet, it is more expensive to compute than primary measures such as SUM, MEAN, and COUNT. There exist numerous techniques to compute variance. While the definition of variance implies two passes over the data, other mathematical formulations lead to a single-pass computation. Some single-pass formulations, however, can suffer from severe precision loss, especially for large datasets. In this paper, we study variance implementations in various real-world systems and find that major database systems such as PostgreSQL and most likely System X, a major commercial closed-source database, use a representation that is efficient, but suffers from floating-point precision loss resulting from catastrophic cancellation. We review literature over the past five decades on variance calculation in both the statis...
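
The precision issue is easy to reproduce (a minimal sketch of the two standard formulations, not code from the paper): the sum-of-squares form subtracts two large, nearly equal numbers and cancels catastrophically when the mean is large relative to the spread, while Welford's single-pass update stays stable.

```python
def naive_single_pass_variance(xs):
    """Single-pass E[x^2] - E[x]^2 form: one scan, but numerically
    fragile because ss and s*s/n are huge and nearly equal."""
    n = s = ss = 0
    for x in xs:
        n += 1
        s += x
        ss += x * x
    return (ss - s * s / n) / n

def welford_variance(xs):
    """Welford's single-pass update: numerically stable."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in xs:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return m2 / n

data = [1e9 + x for x in (4.0, 7.0, 13.0, 16.0)]  # true variance = 22.5
print(naive_single_pass_variance(data))  # wildly wrong: cancellation
print(welford_variance(data))            # ~22.5
```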

A Near-Optimal Parallel Algorithm for Joining Binary Relations

We present a constant-round algorithm in the massively parallel computation (MPC) model for evaluating a natural join where every input relation has two attributes. Our algorithm achieves a load of $\tilde{O}(m/p^{1/\rho})$, where $m$ is the total size of the input relations, $p$ is the number of machines, $\rho$ is the join's fractional edge covering number, and $\tilde{O}(\cdot)$ hides a polylogarithmic factor. The load matches a known lower bound up to a polylogarithmic factor. At the core of the proposed algorithm is a new theorem (which we name the isolated cartesian product theorem) that provides fresh insight into the problem's mathematical structure. Our result implies that the subgraph enumeration problem, where the goal is to report all the occurrences of a constant-sized subgraph pattern, can be settled optimally (up to a polylogarithmic factor) in the MPC model.
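
As a worked instance of the load formula (our example, using standard numbers for the triangle pattern):

```latex
% Listing triangles is the join R(x,y) \Join S(y,z) \Join T(z,x).
% Putting weight 1/2 on each atom gives an optimal fractional edge cover,
% so \rho = 3/2 and the algorithm's load is
\[
  \tilde{O}\!\left(m / p^{1/\rho}\right) = \tilde{O}\!\left(m / p^{2/3}\right),
\]
% which matches the known MPC lower bound up to polylogarithmic factors.
```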

Searching Heterogeneous Data

We consider the indexing problem for heterogeneous data, where objects are sets of attribute-value pairs, and the queries specify values for an arbitrary subset of the attributes. This problem occurs in a variety of applications, such as searching individual databases, searching entire collections of heterogeneous data sources, locating sources in distributed systems, and indexing large XML documents. To date no efficient data structure is known for such queries. In its most simplified form the problem we address becomes the partial match problem, which has been studied extensively and is known to be computationally hard. We describe here the first practical technique for building such an index. Our basic idea is to precompute certain queries and store their results. User queries are then answered by retrieving the “closest” stored query and removing from its answers all false positives. The crux of the technique consists in choosing which queries to precompute. There are several desi...

Mining Approximate Acyclic Schemes from Relations

Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, 2020

Exact Model Counting of Query Expressions

ACM Transactions on Database Systems, 2017

We prove exponential lower bounds on the running time of the state-of-the-art exact model counting algorithms—algorithms for exactly computing the number of satisfying assignments, or the satisfying probability, of Boolean formulas. These algorithms can be seen, either directly or indirectly, as building Decision-Decomposable Negation Normal Form (decision-DNNF) representations of the input Boolean formulas. Decision-DNNFs are a special case of d-DNNFs, where d stands for deterministic. We show that any knowledge compilation representations from a class (called DLDDs in this article) that contain decision-DNNFs can be converted into equivalent Free Binary Decision Diagrams (FBDDs), also known as Read-Once Branching Programs, with only a quasi-polynomial increase in representation size. Leveraging known exponential lower bounds for FBDDs, we then obtain similar exponential lower bounds for decision-DNNFs, which imply exponential lower bounds for model-counting algorithms. We also ...

Query-Based Data Pricing

Journal of the ACM, 2015

Data is increasingly being bought and sold online, and Web-based marketplace services have emerged to facilitate these activities. However, current mechanisms for pricing data are very simple: buyers can choose only from a set of explicit views, each with a specific price. In this article, we propose a framework for pricing data on the Internet that, given the price of a few views, allows the price of any query to be derived automatically. We call this capability query-based pricing. We first identify two important properties that the pricing function must satisfy, the arbitrage-free and discount-free properties. Then, we prove that there exists a unique function that satisfies these properties and extends the seller's explicit prices to all queries. Central to our framework is the notion of query determinacy, and in particular instance-based determinacy: we present several results regarding the complexity and properties of it. When both the views and the query are unions of c...
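
A brute-force sketch of the key pricing property (our toy code, not the paper's algorithm; `determines` is a hypothetical oracle for query determinacy, which the paper shows can itself be expensive to decide):

```python
from itertools import combinations

def query_price(priced_views, determines, query):
    """Price a query from the seller's explicitly priced views.

    priced_views: list of (view, price) pairs set by the seller.
    determines:   hypothetical oracle; determines(views, query) is True
                  when the answers to `views` fix the answer to `query`
                  on every database instance.

    Arbitrage-freeness forces the query's price to be at most the total
    price of any view bundle that determines it, so the minimum over all
    such bundles is the natural consistent extension of the view prices."""
    best = float("inf")  # stays infinite if no bundle determines the query
    for r in range(1, len(priced_views) + 1):
        for bundle in combinations(priced_views, r):
            views = [v for v, _ in bundle]
            if determines(views, query):
                best = min(best, sum(price for _, price in bundle))
    return best
```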

A Discussion on Pricing Relational Data

In Search of Elegance in the Theory and Practice of Computation, 2013

Toward practical query pricing with QueryMarket

Proceedings of the 2013 international conference on Management of data - SIGMOD '13, 2013

The database group at the University of Washington

Demonstration of the Myria big data management service

Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, 2014
