Prediction of Horizontal Data Partitioning Through Query Execution Cost Estimation
Related papers
Dynamic Workload-Based Partitioning Algorithms for Continuously Growing Databases
Lecture Notes in Computer Science, 2013
Applications with very large databases, where data items are continuously appended, are becoming increasingly common, so efficient data partitioning is one of the main requirements for good performance. For applications with complex access patterns, e.g. scientific applications, workload-based partitioning can be exploited. However, existing workload-based approaches work statically and cannot be applied to very large databases. In this paper, we propose DynPart and DynPartGroup, two dynamic partitioning algorithms for continuously growing databases. These algorithms efficiently adapt the data partitioning to the arrival of new data elements by taking into account the affinity of new data with queries and fragments. In contrast to existing static approaches, our approach runs in constant time regardless of the database size while achieving very good partitioning efficiency. We validated our solution through experiments over real-world data; the results show its effectiveness.

Data are appended to the catalog database as new observations are performed, and the resulting database size is expected to reach 100 TB soon. Scientists around the globe access the database with queries that may involve a considerable number of attributes. The volume of data that such applications hold poses important data management challenges. In particular, efficient solutions are needed to partition and distribute the data across multiple servers, e.g., in a cluster. An efficient partitioning scheme tries to minimize the number of fragments accessed when executing a query, thus minimizing the overhead of distributed execution. Vertical partitioning solutions, such as column-oriented databases [18], may be useful for physical design on each node but fail to provide efficient distributed partitioning, in particular for applications with high-dimensional queries, where joins would have to be executed by transferring data between nodes. Traditional horizontal partitioning approaches, such as hashing or range-based partitioning, cannot capture the complex access patterns of scientific computing applications, which typically involve complicated relations, including mathematical operations, over a large set of columns, and are difficult to predefine a priori.
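To make the affinity-based placement idea concrete, here is a minimal sketch (not the authors' exact algorithm): a newly arriving item is routed to the fragment whose resident queries it shares most with. The query predicates, fragment structure, and size bound are hypothetical placeholders.

```python
# Sketch of affinity-based dynamic placement (hypothetical simplification of DynPart):
# a new item goes to the fragment whose associated queries it matches most,
# so queries keep touching as few fragments as possible.

def matching_queries(item, queries):
    """Return the ids of queries whose (assumed) predicate the item satisfies."""
    return {qid for qid, predicate in queries.items() if predicate(item)}

def place_item(item, fragments, queries, max_fragment_size):
    """fragments: {frag_id: {"items": [...], "queries": set_of_query_ids}}"""
    item_queries = matching_queries(item, queries)
    best_frag, best_affinity = None, -1
    for frag_id, frag in fragments.items():
        if len(frag["items"]) >= max_fragment_size:
            continue  # skip full fragments
        affinity = len(item_queries & frag["queries"])
        if affinity > best_affinity:
            best_frag, best_affinity = frag_id, affinity
    if best_frag is None:
        raise RuntimeError("all fragments full; a real system would open a new fragment")
    fragments[best_frag]["items"].append(item)
    fragments[best_frag]["queries"] |= item_queries
    return best_frag

# Toy example: two fragments and two range queries over an "x" attribute.
queries = {1: lambda t: t["x"] < 50, 2: lambda t: t["x"] >= 50}
fragments = {"f1": {"items": [], "queries": {1}}, "f2": {"items": [], "queries": {2}}}
place_item({"x": 10}, fragments, queries, max_fragment_size=1000)  # lands in f1
```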
Automated partitioning design in parallel database systems
2011
In recent years, Massively Parallel Processors (MPPs) have gained ground, enabling the processing of vast amounts of data. In such environments, data is partitioned across multiple compute nodes, which results in dramatic performance improvements during parallel query execution. To evaluate certain relational operators in a query correctly, data sometimes needs to be re-partitioned (i.e., moved) across compute nodes. Since data movement operations are much more expensive than relational operations, it is crucial to design a suitable data partitioning strategy that minimizes the cost of such expensive data transfers. A good partitioning strategy strongly depends on how the parallel system would be used. In this paper we present a partitioning advisor that recommends the best partitioning design for an expected workload. Our tool recommends which tables should be replicated (i.e., copied into every compute node) and which ones should be distributed according to specific column(s) so that the cost of evaluating similar workloads is minimized. In contrast to previous work, our techniques are deeply integrated with the underlying parallel query optimizer, which results in more accurate recommendations in a shorter amount of time. Our experimental evaluation using a real MPP system, Microsoft SQL Server 2008 Parallel Data Warehouse, with both real and synthetic workloads shows the effectiveness of the proposed techniques and the importance of deep integration of the partitioning advisor with the underlying query optimizer.
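The search problem such an advisor solves can be sketched as follows; this is an illustrative brute-force enumeration, not the tool itself, and `estimate_cost` is a placeholder for the optimizer-integrated costing the paper relies on.

```python
# Sketch of the partitioning-advisor search space: for every table choose
# "replicate" or "distribute on some column", and keep the design with the
# lowest estimated workload cost.
from itertools import product

def candidate_designs(tables):
    """tables: {table: [columns]} -> iterator over {table: ('replicate',) | ('distribute', col)}"""
    per_table = []
    for table, columns in tables.items():
        options = [("replicate",)] + [("distribute", c) for c in columns]
        per_table.append([(table, opt) for opt in options])
    for combo in product(*per_table):
        yield dict(combo)

def recommend(tables, workload, estimate_cost):
    """estimate_cost(design, workload) stands in for optimizer-based cost estimation."""
    return min(candidate_designs(tables), key=lambda d: estimate_cost(d, workload))
```

In practice the design space is exponential, so real advisors prune it aggressively; the deep optimizer integration described in the paper is what makes the cost estimates accurate enough to guide that search.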
Horizontal Partitioning by Predicate Abstraction and Its Application to Data Warehouse Design
Lecture Notes in Computer Science, 2010
We propose a new method for horizontal partitioning of relations based on predicate abstraction, using a finite set of arbitrary predicates defined over the whole domains of relations. The method is formal and compositional: arbitrary fragments of relations can be partitioned with an arbitrary number of predicates. We apply this partitioning to the problem of finding a suitable design for a relational data warehouse modeled with star schemas, such that the performance of a given workload is optimized. We use a genetic algorithm to generate an appropriate solution for this optimization problem. The experimental results confirm the effectiveness of our approach.
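A minimal sketch of predicate-abstraction partitioning, assuming a made-up relation and predicate set: every tuple is mapped to the fragment identified by the truth values of a fixed, finite set of predicates.

```python
# Sketch of horizontal partitioning by predicate abstraction: each tuple lands in the
# fragment labeled by its truth-value vector, so n predicates yield at most 2**n fragments.
from collections import defaultdict

def partition_by_predicates(rows, predicates):
    fragments = defaultdict(list)
    for row in rows:
        signature = tuple(bool(p(row)) for p in predicates)
        fragments[signature].append(row)
    return dict(fragments)

# Hypothetical example: two predicates over a sales fact table.
predicates = [lambda r: r["amount"] > 100, lambda r: r["region"] == "EU"]
rows = [{"amount": 50, "region": "EU"}, {"amount": 200, "region": "US"}]
print(partition_by_predicates(rows, predicates))
# {(False, True): [...], (True, False): [...]}
```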
An Active System for Dynamic Vertical Partitioning of Relational Databases
2016
Vertical partitioning is a well-known technique to improve query response time in relational databases. It consists of dividing a table into a set of attribute fragments according to the queries run against the table. In dynamic systems the queries tend to change over time, so a dynamic vertical partitioning technique is needed that adapts the fragments to changes in the query patterns in order to avoid long query response times. In this paper, we propose an active system for dynamic vertical partitioning of relational databases, called DYVEP (DYnamic VErtical Partitioning). DYVEP uses active rules to vertically fragment and refragment a database without the intervention of a database administrator (DBA), maintaining acceptable query response times even when the query patterns change. Experiments with the TPC-H benchmark demonstrate efficient query response times.
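The event-condition-action pattern behind such an active system can be sketched as below; this is a hypothetical simplification, and the drift test, thresholds, and refragmentation hook are placeholders, not DYVEP's actual rules.

```python
# Sketch of an active (event-condition-action) repartitioning loop: every logged
# query is an event, a drift in attribute usage is the condition, and vertical
# refragmentation is the action. All specifics below are assumed placeholders.

class ActiveRepartitioner:
    def __init__(self, refragment, drift_threshold=0.3):
        self.refragment = refragment            # action: recompute vertical fragments
        self.drift_threshold = drift_threshold
        self.attribute_counts = {}

    def on_query(self, used_attributes):        # event: a query was executed
        for attr in used_attributes:
            self.attribute_counts[attr] = self.attribute_counts.get(attr, 0) + 1
        if self._pattern_drifted():             # condition: usage pattern changed enough
            self.refragment(self.attribute_counts)
            self.attribute_counts.clear()

    def _pattern_drifted(self):
        # Placeholder drift test: after 100 queries, trigger if usage is heavily skewed.
        total = sum(self.attribute_counts.values())
        if total < 100:
            return False
        return max(self.attribute_counts.values()) / total > self.drift_threshold
```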
Locality-aware Partitioning in Parallel Database Systems
Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, 2015
Parallel database systems horizontally partition large amounts of structured data in order to provide parallel data processing capabilities for analytical workloads in shared-nothing clusters. One major challenge when horizontally partitioning large amounts of data is to reduce the network costs for a given workload and database schema. A common technique to reduce network costs in parallel database systems is to co-partition tables on their join key in order to avoid expensive remote join operations. However, existing partitioning schemes are limited in that respect, since only subsets of tables in complex schemata sharing the same join key can be co-partitioned unless tables are fully replicated. In this paper we present a novel partitioning scheme called predicate-based reference partitioning (or PREF for short) that allows sets of tables to be co-partitioned based on given join predicates. Moreover, based on PREF, we present two automatic partitioning design algorithms to maximize data locality. One algorithm only needs the schema and data, whereas the other additionally takes the workload as input. In our experiments we show that our automated design algorithms can partition database schemata of different complexity and thus help to effectively reduce the runtime of queries under a given workload when compared to existing partitioning approaches.
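A rough sketch of the co-partitioning idea (a simplification, not the paper's algorithm): the referenced table is partitioned on its join key, and each tuple of the referencing table is placed in every partition holding a tuple it joins with, so the join can be evaluated locally. The schema, data, and join predicate below are made-up examples.

```python
# Sketch of predicate-based reference partitioning: co-partition a referencing table
# with an already-partitioned referenced table according to a join predicate.

def hash_partition(rows, key, n_parts):
    parts = [[] for _ in range(n_parts)]
    for row in rows:
        parts[hash(row[key]) % n_parts].append(row)
    return parts

def ref_partition(rows, join_pred, referenced_parts):
    parts = [[] for _ in referenced_parts]
    for row in rows:
        for i, ref_part in enumerate(referenced_parts):
            if any(join_pred(row, ref_row) for ref_row in ref_part):
                parts[i].append(row)   # a tuple may be duplicated across partitions
    return parts

customers = [{"c_id": 1}, {"c_id": 2}, {"c_id": 3}]
orders = [{"o_id": 10, "c_id": 1}, {"o_id": 11, "c_id": 3}]
cust_parts = hash_partition(customers, "c_id", n_parts=2)
order_parts = ref_partition(orders, lambda o, c: o["c_id"] == c["c_id"], cust_parts)
# Every order now sits in the same partition as its customer, making the join local.
```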
Vertical partitioning algorithms for database design
ACM Transactions on Database Systems, 1984
This paper addresses the vertical partitioning of a set of logical records or a relation into fragments. The rationale behind vertical partitioning is to produce fragments, groups of attribute columns, that “closely match” the requirements of transactions. Vertical partitioning is ...
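A common way to make "closely match the requirements of transactions" operational, shown here as a sketch that is not necessarily this paper's exact procedure, is an attribute affinity matrix: count how often pairs of attributes are accessed together, weighted by transaction frequency, then group high-affinity attributes into the same fragment. The toy workload is invented.

```python
# Sketch of the attribute-affinity idea used in vertical partitioning.
from itertools import combinations

def affinity_matrix(attributes, transactions):
    """transactions: list of (accessed_attribute_set, frequency) pairs."""
    aff = {(a, b): 0 for a in attributes for b in attributes}
    for used, freq in transactions:
        for a, b in combinations(sorted(used), 2):
            aff[(a, b)] += freq
            aff[(b, a)] += freq
    return aff

# Toy workload: two transactions touching different attribute groups.
attrs = ["id", "name", "salary", "photo"]
txns = [({"id", "name"}, 50), ({"id", "salary", "photo"}, 20)]
aff = affinity_matrix(attrs, txns)
# High aff[("id", "name")] suggests placing id and name in one fragment,
# while salary and photo cluster into another.
```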
Efficient Partitioning of Large Databases without Query Statistics
Efficient Partitioning of Large Databases without Query Statistics, 2016
An efficient way of improving the performance of a database management system is distributed processing. Distribution of data involves fragmentation (partitioning), replication, and allocation. Previous research provided partitioning based on empirical data about the types and frequencies of queries. These solutions are not suitable at the initial stage of a distributed database, because query statistics are not yet available. In this paper, I present a fragmentation technique, Matrix based Fragmentation (MMF), which can be applied at the initial stage as well as at later stages of distributed databases. Instead of using empirical data, I have developed a matrix, Modified Create, Read, Update and Delete (MCRUD), to partition a large database properly. Allocation of fragments is done simultaneously in the proposed technique, so MMF adds no extra complexity for allocating fragments to the sites of a distributed database, since fragmentation is synchronized with allocation. The performance of a DDBMS can be improved significantly by avoiding frequent remote access and high data transfer among the sites. Results show that the proposed technique can solve the initial partitioning problem of large distributed databases.
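A heavily simplified sketch in the spirit of a CRUD-matrix approach follows; the predicates, sites, operation weights, and allocation rule are all assumptions for illustration and are not taken from the paper.

```python
# Sketch of matrix-based fragmentation and allocation in one step: each candidate
# fragment is defined by a predicate, each (fragment, site) cell accumulates a
# weighted count of the Create/Read/Update/Delete operations the site's applications
# are expected to issue, and the fragment is allocated to its highest-scoring site.

CRUD_WEIGHTS = {"C": 3, "R": 1, "U": 2, "D": 2}   # assumed weights

def build_matrix(fragment_predicates, sites, operations):
    """operations: list of (predicate_name, site, op_letter) tuples, expected or observed."""
    matrix = {(p, s): 0 for p in fragment_predicates for s in sites}
    for pred, site, op in operations:
        matrix[(pred, site)] += CRUD_WEIGHTS[op]
    return matrix

def allocate(fragment_predicates, sites, matrix):
    """Assign each predicate-defined fragment to the site with the largest weighted access."""
    return {p: max(sites, key=lambda s: matrix[(p, s)]) for p in fragment_predicates}

preds = ["age<30", "age>=30"]
sites = ["site_A", "site_B"]
ops = [("age<30", "site_A", "R"), ("age<30", "site_A", "U"), ("age>=30", "site_B", "C")]
print(allocate(preds, sites, build_matrix(preds, sites, ops)))
# {'age<30': 'site_A', 'age>=30': 'site_B'}
```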
Applying Vertical Fragmentation Techniques in Logical Design of Multidimensional Databases
Lecture Notes in Computer Science, 2000
In the context of multidimensional databases implemented on relational DBMSs through star schemas, the most effective technique to enhance performance consists of materializing redundant aggregates called views. In this paper we investigate the problem of vertical fragmentation of views aimed at minimizing the workload response time. Each view includes several measures which are not necessarily always requested together; thus, system performance may be increased by partitioning the views into smaller tables. On the other hand, drill-across queries involve measures taken from two or more views; in this case the access costs may be decreased by unifying these views into larger tables. After formalizing the fragmentation problem as a 0-1 integer linear programming problem, we define a cost function and outline a branch-and-bound algorithm to minimize it. Finally, we demonstrate the usefulness of our approach by presenting a set of experimental results based on the TPC-D benchmark.
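The underlying 0-1 assignment problem can be sketched as below; the brute-force search stands in for the paper's branch-and-bound, and the toy cost model (one unit per fragment a query touches) is a placeholder for the paper's cost function.

```python
# Sketch of the 0-1 decision problem behind vertical view fragmentation: assign each
# measure of a view to exactly one fragment so that the estimated workload cost is minimal.
from itertools import product

def best_fragmentation(measures, max_fragments, queries, cost):
    """queries: list of sets of measures each query reads; cost(assignment, queries) -> float."""
    best, best_cost = None, float("inf")
    for combo in product(range(max_fragments), repeat=len(measures)):
        assignment = dict(zip(measures, combo))      # measure -> fragment id
        c = cost(assignment, queries)
        if c < best_cost:
            best, best_cost = assignment, c
    return best

def toy_cost(assignment, queries):
    # Placeholder: a query pays one unit per distinct fragment it has to touch.
    return sum(len({assignment[m] for m in q}) for q in queries)

measures = ["qty", "revenue", "discount"]
queries = [{"qty", "revenue"}, {"discount"}]
print(best_fragmentation(measures, 2, queries, toy_cost))
# e.g. {'qty': 0, 'revenue': 0, 'discount': 1}: co-accessed measures share a fragment.
```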
Algorithms and support for horizontal class partitioning in object-oriented databases
2000
Horizontal partitioning is a logical database design technique which facilitates efficient execution of queries by reducing the number of irrelevant objects accessed. Given a set of the most frequently executed queries on a class, horizontal partitioning generates horizontal class fragments (each of which is a subset of the object instances of the class) that meet the queries' requirements. There are two types of ...
A framework for database optimization and workload control
2010
Nowadays, database optimizers take advantage of indexes and materialized views (MVs) to produce query execution plans. While indexes and MVs can speed up the execution of queries, there are costs to store and manage them. This paper presents a mathematical model and a greedy implementation to recommend a set of indexes and MVs that optimize the database, given a database workload and a disk space constraint. Our approach is generic, so it can be applied to any relational database system that takes advantage of MVs in plan selection. While it was developed for frequently updated databases, it also produced interesting results on read-only databases and in estimating the behavior of new databases (with predefined workloads).
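A minimal sketch of greedy selection under a space budget follows; the candidate names, benefit values, and sizes are invented, and in the paper the benefit of a candidate would come from the workload and cost model rather than being a fixed number.

```python
# Sketch of greedy index/MV recommendation under a disk-space constraint: repeatedly
# pick the candidate with the best benefit-per-byte ratio that still fits the budget.

def greedy_recommend(candidates, space_budget):
    """candidates: list of dicts with 'name', 'benefit' (est. cost saved) and 'size' (bytes)."""
    chosen, used = [], 0
    ranked = sorted(candidates, key=lambda c: c["benefit"] / c["size"], reverse=True)
    for cand in ranked:
        if used + cand["size"] <= space_budget:
            chosen.append(cand["name"])
            used += cand["size"]
    return chosen

candidates = [
    {"name": "idx_orders_date", "benefit": 120.0, "size": 2_000_000},
    {"name": "mv_sales_by_region", "benefit": 400.0, "size": 80_000_000},
    {"name": "idx_customer_name", "benefit": 30.0, "size": 1_000_000},
]
print(greedy_recommend(candidates, space_budget=10_000_000))
# ['idx_orders_date', 'idx_customer_name'] -- the large MV does not fit the budget.
```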