Volker Markl - Academia.edu

Papers by Volker Markl

Proceedings of the 2018 International Conference on Management of Data, 2018

Query processing on GPU-style coprocessors is severely limited by the movement of data. With teraflops of compute throughput in one device, even high-bandwidth memory cannot provision enough data for a reasonable utilization. Query compilation is a proven technique to improve memory efficiency. However, its inherent tuple-at-a-time processing style does not suit the massively parallel execution model of GPU-style coprocessors. This compromises the improvements in efficiency offered by query compilation. In this paper, we show how query compilation and GPU-style parallelism can nevertheless be made to play in unison. We describe a compiler strategy that merges multiple operations into a single GPU kernel, thereby significantly reducing bandwidth demand. Compared to operator-at-a-time processing, we show reductions of memory access volumes by factors of up to 7.5x, resulting in shorter kernel execution times by factors of up to 9.5x.

The VLDB Journal, 2018

Processor manufacturers build increasingly specialized processors to mitigate the effects of the power wall to deliver improved performance. Currently, database engines are manually optimized for each processor: a costly and error-prone process. In this paper, we propose concepts to enable the database engine to perform per-processor optimization automatically. Our core idea is to create variants of generated code and to learn a fast variant for each processor. We create variants by modifying parallelization strategies, specializing data structures, and applying different code transformations. Our experimental results show that the performance of variants may diverge by up to two orders of magnitude. Therefore, we need to generate custom code for each processor to achieve peak performance. We show that our approach finds a fast custom variant for multi-core CPUs, GPUs, and MICs.

Most of the operations of the relational algebra or of SQL (such as projection with duplicate elimination, joins, ordering, group-by and aggregations) are efficiently processed using a sorted stream of tuples. Often, these operations are combined with restrictions in one or several attributes. Previous research has proposed algorithms for efficiently dealing with this kind of query pattern, which is highly relevant with respect to data warehousing, data mining and GIS (geographic information systems). In this paper, we present a cost model that enables a concise estimation of both memory costs and run-time costs for processing queries with restrictions in multiple attributes that may in addition involve a sort operation. Our cost model considers uniformly distributed UB-trees (universal B-trees) with independent dimensions, and it is derived analytically in three steps, starting with a very simple, perfectly idealized partitioning scheme, moving on to imperfect partitioning schemes,...

Bulk loading is used to efficiently build a table or access structure if a large data set is available at index time, e.g., the spool process of a data warehouse or the creation of intermediate results during query processing. The authors introduce the TempTris algorithm that creates a multidimensional partitioning from a one-dimensionally sorted stream of tuples. To achieve this, TempTris exploits the fact that a one-dimensional order can be used as a partial multidimensional order for the creation of a multidimensional partitioning. In this way, TempTris avoids external sorting for the creation of a multidimensional index. In combination with the Tetris sort algorithm, TempTris can be used to create intermediate query processing results that can be reused, without external sorting, to generate various sort orders. As an example of this new processing technique we propose an efficient algorithm for computing an aggregation lattice. Thus, TempTris can also be used to speed...

The VLDB Journal, 2015

Visual analysis of high-volume numerical data is ubiquitous in many industries, including finance, banking, and discrete manufacturing. Contemporary RDBMS-based systems for visualization of high-volume numerical data have difficulty coping with the hard latency requirements and high ingestion rates of interactive visualizations. Existing solutions for lowering the volume of large data sets disregard the spatial properties of visualizations and result in visualization errors. In this work, we introduce VDDA, a Visualization-Driven Data Aggregation approach that provides high-quality to error-free visualizations of high-volume data sets, at high data reduction rates. Based on the M4 aggregation for producing error-free line charts, we develop a complete set of visualization-driven data aggregation operators for the most common chart types. We describe how to model aggregation-based data reduction at the query level in a visualization-driven query rewriting system. Our approach is generic and applicable to any visualization system that consumes data stored in relational databases. Using real world data sets from high-tech manufacturing, stock markets, and sports analytics domains, we demonstrate that our visualization-driven data aggregation can reduce data volumes by up to two orders of magnitude, while preserving pixel-perfect visualizations of the raw data.
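
The M4 aggregation named in the abstract above can be illustrated with a small sketch. It assumes the commonly described form of M4: the time axis is divided into one group per pixel column, and each group keeps only the tuples with the first and last timestamp and the minimum and maximum value. The function name `m4_aggregate` and its parameters are hypothetical, and the sketch runs in plain Python rather than as a rewritten SQL query.

```python
def m4_aggregate(points, t_start, t_end, width):
    """Reduce (t, v) points to at most four tuples per pixel column.

    Hypothetical helper: `width` is the chart width in pixels and
    [t_start, t_end] is the visible time range.
    """
    if width <= 0 or t_end <= t_start:
        raise ValueError("need a positive width and a non-empty time range")

    groups = {}
    for t, v in points:
        if not (t_start <= t <= t_end):
            continue  # outside the visible range
        # Map the timestamp to a pixel column; clamp t == t_end into the last one.
        col = min(int((t - t_start) / (t_end - t_start) * width), width - 1)
        g = groups.get(col)
        if g is None:
            groups[col] = {"first": (t, v), "last": (t, v),
                           "min": (t, v), "max": (t, v)}
            continue
        if t < g["first"][0]:
            g["first"] = (t, v)
        if t > g["last"][0]:
            g["last"] = (t, v)
        if v < g["min"][1]:
            g["min"] = (t, v)
        if v > g["max"][1]:
            g["max"] = (t, v)

    # Emit the surviving tuples in time order, without duplicates per column.
    reduced = []
    for col in sorted(groups):
        reduced.extend(sorted(set(groups[col].values())))
    return reduced


if __name__ == "__main__":
    series = [(0, 5), (1, 9), (2, 1), (3, 4), (4, 7), (5, 7), (6, 2), (7, 8)]
    print(m4_aggregate(series, t_start=0, t_end=8, width=2))
```

With a two-pixel-wide chart, the tuple `(5, 7)` is dropped because it is neither the first, last, minimum, nor maximum tuple of its pixel column.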

Proceedings 19th International Conference on Data Engineering (Cat. No.03CH37405)

Efficient star query processing is crucial for a performant data warehouse (DW) implementation and much work is available on physical optimization (e.g., indexing and schema design) and logical optimization (e.g., preaggregated materialized views with query rewriting). One important step in the query processing phase is, however, still a bottleneck: the residual join of results from the fact table with the dimension tables in combination with grouping and aggregation. This phase typically consumes between 50% and 80% of the overall processing time. In typical DW scenarios pre-grouping methods only have a limited effect as the grouping is usually specified on the hierarchy levels of the dimension tables and not on the fact table itself. In this paper, we suggest a combination of hierarchical clustering and pre-grouping as we have implemented in the relational DBMS Transbase. Exploiting hierarchy semantics for the pre-grouping of fact table result tuples is several times faster than conventional query processing. The reason for this is that hierarchical pre-grouping reduces the number of join operations significantly. With this method even queries covering a large part of the fact table can be executed within a time span acceptable for interactive query processing.

VLDB '02: Proceedings of the 28th International Conference on Very Large Databases, 2002

Star queries are the most prevalent kind of queries in data warehousing, OLAP and business intelligence applications. Thus, there is an imperative need for efficiently processing star queries. To this end, a new class of fact table organizations has emerged that exploits path-based surrogate keys in order to hierarchically cluster the fact table data of a star schema [DRSN98, MRB99, KS01]. In the context of these new organizations, star query processing changes radically. In this paper, we present a complete abstract processing plan that captures all the necessary steps in evaluating such queries over hierarchically clustered fact tables. Furthermore, we present optimizations for surrogate key processing and a novel early grouping transformation for grouping on the dimension hierarchies. Our algorithms have already been implemented in a commercial relational database management system (RDBMS), and the experimental evaluation, as well as customer feedback, indicate speedups of orders of magnitude for typical star queries in real world applications.
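
One way to picture a path-based surrogate key is sketched below. This is an illustrative assumption, not the encoding used in the paper: each hierarchy level gets a fixed number of bits, and the ordinals along a path (for example region, nation, city) are concatenated into one integer. Under that assumption, a restriction on a hierarchy prefix becomes a contiguous key range. The names `encode_path`, `prefix_range`, and `bits_per_level` are hypothetical.

```python
def encode_path(path, bits_per_level):
    """Concatenate per-level ordinals into a single integer surrogate.

    Assumption for illustration: level i uses bits_per_level[i] bits.
    """
    if len(path) != len(bits_per_level):
        raise ValueError("path and bits_per_level must have the same length")
    key = 0
    for ordinal, bits in zip(path, bits_per_level):
        if not 0 <= ordinal < (1 << bits):
            raise ValueError("ordinal does not fit into its level's bit field")
        key = (key << bits) | ordinal
    return key


def prefix_range(prefix, bits_per_level):
    """Return the inclusive key range [lo, hi] of all paths below a prefix."""
    depth = len(prefix)
    remaining = sum(bits_per_level[depth:])
    head = encode_path(prefix, bits_per_level[:depth])
    lo = head << remaining                 # all lower levels set to 0
    hi = lo | ((1 << remaining) - 1)       # all lower levels set to 1
    return lo, hi


if __name__ == "__main__":
    bits = [2, 3, 4]                       # region, nation, city (hypothetical)
    key = encode_path((1, 2, 3), bits)
    lo, hi = prefix_range((1,), bits)      # every member of region 1
    print(key, lo, hi, lo <= key <= hi)
```

Because all keys under one hierarchy member are adjacent in this encoding, fact table rows sorted or clustered by the key keep members of the same subtree physically close together.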

Proceedings. IDEAS'99. International Database Engineering and Applications Symposium (Cat. No.PR00265)

Data-warehousing applications cope with enormous data sets in the range of Gigabytes and Terabytes. Queries usually either select a very small set of this data or perform aggregations on a fairly large data set. Materialized views storing pre-computed aggregates are used to efficiently process queries with aggregations. This approach increases resource requirements in disk space and slows down updates because of the view maintenance problem. Multidimensional hierarchical clustering (MHC) of OLAP data overcomes these problems while offering more flexibility for aggregation paths. Clustering is introduced as a way to speed up aggregation queries without additional storage cost for materialization. Performance and storage cost of our access method are investigated and compared to current query processing scenarios. In addition, performance measurements on real world data for a typical star schema are presented.

We investigate the usability and performance of the UB-Tree (universal B-Tree) for multidimensional data, as they arise in all relational databases and in particular in data-warehousing and data-mining applications. The UB-Tree is balanced and has all the guaranteed performance characteristics of B-Trees, i.e., it requires linear space for storage and logarithmic time for the basic operations of insertion, retrieval

A multidimensional access method offering significant performance increases by intelligently partitioning the query space is applied to relational database management systems (RDBMS). We introduce a formal model for multidimensional partitioned relations and discuss several typical query patterns. The model identifies the significance of multidimensional range queries and sort operations. The discussion of current access methods gives rise to the need for a multidimensional partitioning of relations. A detailed analysis of space partitioning focussing especially on Z-ordering illustrates the principal benefits of multidimensional indexes. After describing the UB-Tree and its standard algorithms for insertion, deletion, point queries, and range queries, we introduce the spiral algorithm for nearest neighbor queries with UB-Trees and the Tetris algorithm for efficient access to a table in arbitrary sort order. We then describe the complexity of the involved algorithms and give solutions to selected algorithmic problems for a prototype implementation of UB-Trees on top of several RDBMSs. A cost model for sort operations with and without range restrictions is used both for analyzing our algorithms and for comparing UB-Trees with state-of-the-art query processing. Performance comparisons with traditional access methods practically confirm the theoretically expected superiority of UB-Trees and our algorithms over traditional access methods: Query processing in RDBMS is accelerated by several orders of magnitude, while the resource requirements in main memory space and disk space are substantially reduced. Benchmarks on some queries of the TPC-D benchmark as well as the data warehousing scenario of a fruit juice company illustrate the potential impact of our work on relational algebra, SQL, and commercial applications.
The results of this thesis were developed by the author managing the MISTRAL project, a joint research and development project with SAP AG (Germany), Teijin Systems Technology Ltd. (Japan), NEC (Japan), Hitachi (Japan), Gesellschaft für Konsumforschung (Germany), and TransAction Software GmbH (Germany). I thank my supervisor, Prof. Rudolf Bayer, Ph.D., for his support and the many fruitful discussions and inspirations. He helped me a lot with his ideas and his confidence, especially in the beginning, when the feasibility of the work and the road ahead were still unclear. My master students and interns did a large portion of the prototype implementation. In addition they helped with analysis, and carried out many of the performance measurements. I especially thank Nils Frielinghaus, who was the first student who dared to be supervised by me. Our discussions produced ideas, which were crucial for the success of the MISTRAL project. The same holds for the work of Roland Pieringer, who helped to prove the practical benefits of our approach by porting our pilot implementation to Oracle and doing performance measurements at SAP in Walldorf. When the MISTRAL project grew, I was not alone anymore: My team members Robert Fenk, Frank Ramsak, Stefan Sixl, and Martin Zirkel worked on the project with the same enthusiasm as I did. We had numerous inspiring discussions. For this thesis, my colleagues also did a lot of proofreading, which certainly improved the quality of the work.
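
The Z-ordering that the thesis abstract above analyzes can be sketched as bit interleaving: the bits of all coordinates are merged into a single address, so that multidimensional points can be stored in an ordinary one-dimensional order. The sketch assumes non-negative integer coordinates of a fixed bit width; `z_address` and `z_decode` are hypothetical names, and nothing here reproduces the UB-Tree's page management or range query algorithm.

```python
def z_address(coords, bits):
    """Interleave the bits of each coordinate into one Z-address.

    Assumption: every coordinate is an integer in [0, 2**bits).
    The first coordinate contributes the most significant bit of each group.
    """
    for c in coords:
        if not 0 <= c < (1 << bits):
            raise ValueError("coordinate out of range for the given bit width")
    addr = 0
    for bit in range(bits - 1, -1, -1):        # most significant bit first
        for c in coords:
            addr = (addr << 1) | ((c >> bit) & 1)
    return addr


def z_decode(addr, dims, bits):
    """Invert z_address: split a Z-address back into its coordinates."""
    coords = [0] * dims
    total = dims * bits
    for i in range(total):
        bit_val = (addr >> (total - 1 - i)) & 1
        coords[i % dims] = (coords[i % dims] << 1) | bit_val
    return tuple(coords)


if __name__ == "__main__":
    cells = [(x, y) for x in range(4) for y in range(4)]
    # Sorting by Z-address yields the characteristic recursive "Z" traversal.
    for cell in sorted(cells, key=lambda c: z_address(c, bits=2)):
        print(cell, z_address(cell, bits=2))
```

Cells that share a long address prefix lie in the same quadrant of the space, which is why an interval of Z-addresses corresponds to a spatially compact region that can be stored on one page.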

Proceedings 15th International Conference on Data Engineering (Cat. No.99CB36337), 1999

Most operations of the relational algebra or SQL require a sorted stream of tuples for efficient processing. Therefore, processing complex relational queries relies on efficient access to a table in some sort order. In principle, indexes could be used, but they are superior to a full table scan only if the result set is sufficiently restricted in the index attribute. In this paper we present the Tetris algorithm, which utilizes restrictions to process a table in sort order of any attribute without the need of external sorting. The algorithm relies on the space partitioning of a multidimensional access method. A sweep line technique is used to read data in sort order of any attribute, while accessing each disk page of a table only once. Results are produced earlier than with traditional sorting techniques, allowing better response times for interactive applications and pipelined processing of the result set. We describe a prototype implementation of the Tetris algorithm using UB-Trees on top of Oracle 8, define a cost model and present performance measurements for some queries of the TPC-D benchmark.

Proceedings 2000 International Database Engineering and Applications Symposium (Cat. No.PR00789), 2000

This paper considers the issue of bulk loading large data sets for the UB-Tree, a multidimensional index structure. Especially in data warehousing (DW), data mining and OLAP it is necessary to have efficient bulk loading techniques, because loading occurs not continuously, but only from time to time with usually large data sets. We propose two techniques, one for initial loading, which creates a new UB-Tree, and one for incremental loading, which adds data to an existing UB-Tree. Both techniques try to minimize I/O and CPU cost. Measurements with artificial data and data of a commercial data warehouse demonstrate that our algorithms are efficient and able to handle large data sets. Like the UB-Tree itself, they are easily integrated into an RDBMS.

Proceedings International Database Engineering and Applications Symposium

Advanced data warehouses and web databases have set the demand for processing large sets of time ranges, quality classes, fuzzy data, personalized data and extended objects. Since all of these data types can be mapped to intervals, interval indexing can dramatically speed up or even be an enabling technology for these new applications. We introduce a method for managing intervals by indexing the dual space with the UB-Tree. We show that our method is an effective and efficient solution, benefitting from all good characteristics of the UB-Tree, i.e., concurrency control, worst case guarantees for insertion, deletion and update as well as efficient query processing. Our technique can easily be integrated into an RDBMS engine providing the UB-Tree as access method. We also show that our technique is superior to and more flexible than previously suggested techniques.
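
The idea of indexing intervals in a dual space can be sketched as follows. The abstract does not spell out the transformation, so this sketch assumes the common endpoint mapping: an interval [lower, upper] becomes the two-dimensional point (lower, upper), and an intersection query becomes a rectangular range restriction on those points. The function names are hypothetical, and a linear scan stands in for the multidimensional range query that an index such as the UB-Tree would answer.

```python
def to_dual_point(interval):
    """Map a closed interval [lower, upper] to the 2-D point (lower, upper)."""
    lower, upper = interval
    if lower > upper:
        raise ValueError("interval bounds are reversed")
    return (lower, upper)


def intersecting(points, query):
    """Return the dual points whose intervals intersect the query interval.

    [l, u] intersects [q_lo, q_hi] exactly when l <= q_hi and u >= q_lo,
    which is a rectangular (range) restriction in the dual space.
    """
    q_lo, q_hi = query
    return [p for p in points if p[0] <= q_hi and p[1] >= q_lo]


if __name__ == "__main__":
    intervals = [(1, 3), (4, 6), (5, 9), (10, 12)]
    points = [to_dual_point(iv) for iv in intervals]
    print(intersecting(points, query=(3, 5)))
```

Because the query condition restricts each dimension independently, it has the shape of a multidimensional range query, the kind of query a multidimensional index is built to answer.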

This paper introduces hierarchies into relational database systems. The query language undergoes only minimal changes. From the logical perspective, the handling of hierarchies improves, for example, maintenance; from the physical perspective, query processing can be accelerated considerably. For this purpose, metadata in the form of dependency relationships between the hierarchy levels must be available in the DBMS. The goal is to enable these optimizations with as few changes as possible for the user.

The management and query processing of one-dimensional intervals is a special case of extended object handling. One-dimensional intervals play an important role in temporal databases and they can also be used for fuzzy matching, fuzzy logic and measuring quality classes, etc. Most existing multidimensional access methods for extended objects do not address this special problem and most of them are main memory access methods that do not support efficient access to secondary storage. The research in the application of the UB-Tree to extended objects is part of my doctoral work. The contribution of this article is a specific solution for managing and querying one-dimensional intervals with the UB-Tree, a multidimensional extension of the classical B-Tree. The combination of UB-Tree and transformation of extended objects to parameter space is an effective solution for this specific problem.

Only a few multidimensional access methods have made their way into commercial relational DBMS. Even if an RDBMS ships with a multidimensional index, the multidimensional index usually is an add-on like Oracle SDO, which is not integrated into the SQL interpreter, query processor and query optimizer of the DBMS kernel. Our demonstration shows TransBase HyperCube, a commercial RDBMS, whose kernel fully integrates the UB-Tree, a multidimensional extension of the B-Tree. This integration was performed in an ESPRIT project funded by the European Commission. We put the main emphasis of our demonstration on the application of UB-Tree indexes in real-world databases for OLAP. However, we also address general issues of UB-Trees like creation, space requirements, or comparison to other indexing methods.

Multidimensional access methods like the UB-Tree can be used to accelerate almost any query processing operation, if proper query processing algorithms are used: Relational queries or SQL queries consist of restrictions, projections, ordering, grouping and aggregation, and join operations. In the presence of multidimensional restrictions or sorting, multidimensional range query or Tetris algorithms efficiently process these operations. In addition, these algorithms also efficiently support queries that generate some hierarchical restrictions (for instance by following 1:n foreign key relationships). In this paper we investigate the impacts on query processing in RDBMS when using UB-Trees and multidimensional hierarchical clustering for physical data organization. We illustrate the benefits by performance measurements of queries for a star schema from a real world application of an SAP business information warehouse. The performance results reported in this paper were measured with our prototype implementation of UB-Trees on top of Oracle 8. We compare the performance of UB-Trees to native query processing techniques of Oracle, namely access via an index-organized table, which essentially stores a relation in a clustered B*-Tree, and access via a full table scan of an entire relation. In addition, we measure the performance of the intersection of multiple bitmap indexes to answer multidimensional range queries.