Elastic Pipelining in an In-Memory Database Cluster

Scheduling Resources to Multiple Pipelines of One Query in a Main Memory Database Cluster

IEEE Transactions on Knowledge and Data Engineering, 2018

To fully utilize the resources of a main memory database cluster, we additionally take independent parallelism into account and parallelize multiple pipelines of one query. However, scheduling resources to multiple pipelines is an intractable problem. Traditional static approaches may lead to a serious waste of resources and a suboptimal execution order of pipelines, because it is hard to predict the actual data distribution and fluctuating workloads at compile time. In response, we propose a dynamic scheduling algorithm, List with Filling and Preemption (LFPS), based on two novel techniques. (1) Adaptive filling improves resource utilization by issuing extra pipelines to fill idle resource "holes" during execution. (2) Rank-based preemption guarantees that pipelines on the critical path are scheduled first at run time. Interestingly, the latter also helps the former fill idle "holes" as effectively as possible, so that multiple pipelines finish as soon as possible. We implement LFPS in our prototype database system. Under TPC-H workloads, experiments show that our work improves the finish time of parallelizable pipelines from one query by up to 2.5X over a static approach and 2.1X over serialized execution.
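To make the two techniques concrete, here is a minimal sketch of an LFPS-style scheduler. It is not the paper's implementation; the pipeline DAG, cost model, and all names are invented for illustration. Pipelines are ranked by critical-path length, and at every scheduling step the highest-rank ready pipelines occupy the available slots, which yields both filling (idle slots take extra ready pipelines) and preemption (a newly ready critical-path pipeline displaces a lower-rank filler).

```python
# Hypothetical sketch of list scheduling with filling and preemption.
def lfps_schedule(pipelines, slots):
    """pipelines: {pid: (work_units, [dependency_pids])}; slots: CPU slots."""
    deps = {pid: set(d) for pid, (_, d) in pipelines.items()}
    remaining = {pid: w for pid, (w, _) in pipelines.items()}

    # rank(p) = work(p) + max rank over successors (longest path to a sink)
    succ = {pid: [] for pid in pipelines}
    for pid, d in deps.items():
        for q in d:
            succ[q].append(pid)
    rank = {}
    def get_rank(pid):
        if pid not in rank:
            rank[pid] = remaining[pid] + max(
                (get_rank(s) for s in succ[pid]), default=0)
        return rank[pid]
    for pid in pipelines:
        get_rank(pid)

    t, done, trace = 0, set(), []
    while len(done) < len(pipelines):
        ready = [p for p in pipelines if p not in done and deps[p] <= done]
        active = sorted(ready, key=lambda p: -rank[p])[:slots]  # fill + preempt
        step = min(remaining[p] for p in active)  # run until the next completion
        trace.append((t, active))
        for p in active:
            remaining[p] -= step
            if remaining[p] == 0:
                done.add(p)
        t += step
    return t, trace

# Example: a diamond-shaped pipeline DAG from one query, two slots.
dag = {"scan_a": (4, []), "scan_b": (2, []),
       "join": (3, ["scan_a", "scan_b"]), "agg": (1, ["join"])}
finish, trace = lfps_schedule(dag, slots=2)
print(finish, trace)
```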

In-Database Processing and In-Memory Analytics

Computer Communications and Networks, 2015

The rapid growth of "big data" has intensified the problem of data movement in data analytics: large amounts of data must move through the memory hierarchy to the CPU before any computation takes place. To tackle this costly problem, Processing-in-Memory (PIM) inverts traditional data processing by pushing computation to memory, with an impact on both performance and energy efficiency. In this paper, we present an experimental study on processing database SIMD operators in PIM compared to a current x86 processor (using AVX-512 instructions). We discuss the execution-time gap between these architectures. To our knowledge, this is the first experimental study in the database community to discuss the trade-offs of execution time and energy consumption between PIM and x86 for the main query execution models: materialized, vectorized, and pipelined. We also discuss the results of hybrid query scheduling that interleaves the execution of SIMD operators between PIM and x86 hardware. In our results, the hybrid query plan reduced execution time by 45% and energy consumption by more than 2× compared to hardware-specific query plans.
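The hybrid scheduling idea can be sketched as a per-operator device choice under a simple cost model. The cost numbers, operator names, and the greedy policy below are invented for illustration only; the paper's actual scheduler and measurements are not reproduced here. Bandwidth-bound operators tend toward PIM, compute-bound ones toward x86, with a penalty charged when consecutive operators switch devices.

```python
# Hypothetical per-tuple cost estimates: (x86_time, x86_energy, pim_time, pim_energy).
COSTS = {
    "scan":      (1.0, 3.0, 0.6, 1.0),   # bandwidth-bound: PIM tends to win
    "filter":    (0.8, 2.5, 0.7, 1.2),
    "hash_join": (2.0, 4.0, 3.5, 3.0),   # compute-bound: x86 tends to win on time
    "aggregate": (1.2, 3.0, 1.5, 1.5),
}
TRANSFER_TIME, TRANSFER_ENERGY = 0.3, 0.5  # penalty when switching devices

def place(plan, weight_energy=0.5):
    """Greedily assign each operator in pipeline order to x86 or PIM,
    minimizing time + weight_energy * energy, including switch penalties."""
    placement, prev = [], None
    for op in plan:
        xt, xe, pt, pe = COSTS[op]
        scores = {}
        for dev, t, e in (("x86", xt, xe), ("pim", pt, pe)):
            switch = prev is not None and prev != dev
            time = t + (TRANSFER_TIME if switch else 0.0)
            energy = e + (TRANSFER_ENERGY if switch else 0.0)
            scores[dev] = time + weight_energy * energy
        dev = min(scores, key=scores.get)
        placement.append((op, dev))
        prev = dev
    return placement

# The greedy choice interleaves devices along the pipeline:
print(place(["scan", "filter", "hash_join", "aggregate"]))
# [('scan', 'pim'), ('filter', 'pim'), ('hash_join', 'x86'), ('aggregate', 'x86')]
```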

Industrial-strength parallel query optimization: Issues and lessons

Information Systems, 1994

In the industrial context of the EDS project, we have designed and implemented a query optimizer which we have integrated within a parallel database system. The optimizer takes as input a query expressed in ESQL, an extension of SQL with objects and rules, and produces a minimum cost parallel execution plan. Our research agenda has focused on several difficult problems: support of ESQL's advanced features such as path expressions and recursion, modelling of parallel execution spaces and extensibility of the search strategy. In this paper, we give a retrospective on the optimizer project with emphasis on our design goals, research contributions and implementation decisions. We also describe the current optimizer prototype and report on experiments performed with a pilot application. Finally, we present the lessons learned.

Pipelined Query Processing in Coprocessor Environments

Proceedings of the 2018 International Conference on Management of Data, 2018

Query processing on GPU-style coprocessors is severely limited by the movement of data. With teraflops of compute throughput in one device, even high-bandwidth memory cannot supply enough data for reasonable utilization. Query compilation is a proven technique to improve memory efficiency. However, its inherent tuple-at-a-time processing style does not suit the massively parallel execution model of GPU-style coprocessors, which compromises the efficiency gains that query compilation offers. In this paper, we show how query compilation and GPU-style parallelism can nevertheless be made to play in unison. We describe a compiler strategy that merges multiple operations into a single GPU kernel, thereby significantly reducing bandwidth demand. Compared to operator-at-a-time execution, we show reductions in memory access volume by factors of up to 7.5x, resulting in kernel execution times shorter by factors of up to 9.5x.
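The bandwidth argument behind kernel fusion can be illustrated in a few lines. The sketch below uses plain Python standing in for GPU kernels, with invented data; it shows why merging operators into one pass eliminates the materialized intermediates that operator-at-a-time execution must read and write.

```python
data = list(range(1_000))

# Operator-at-a-time: every operator reads and writes a full intermediate.
def operator_at_a_time(xs):
    filtered = [x for x in xs if x % 2 == 0]   # materialized intermediate 1
    mapped   = [x * x for x in filtered]       # materialized intermediate 2
    return sum(mapped)                         # final reduction
    # memory traffic ~ |xs| + 2*|filtered| + |mapped| element transfers

# Fused "kernel": one pass over the input, tuples stay in registers.
def fused(xs):
    acc = 0
    for x in xs:                               # single read of the input
        if x % 2 == 0:
            acc += x * x                       # no intermediate is written
    return acc
    # memory traffic ~ |xs| element transfers

assert operator_at_a_time(data) == fused(data)
```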

Multi-Query Optimization for Parallel Dataflow Systems

2009

Existing parallel dataflow systems are strictly reactive in their optimizations. At best, such approaches approximate the optimal strategy, missing opportunities to optimize across multiple queries and reschedule queries to improve locality. We propose three techniques that improve query execution performance by utilizing high-level knowledge of the workload. The first technique predictively replicates data to improve aggregate read bandwidth and increase locality.

Overtaking CPU DBMSes with a GPU in Whole-Query Analytic Processing with Parallelism-Friendly Execution Plan Optimization

Existing work on accelerating analytic DB query processing with (discrete) GPUs fails to fully realize their potential for speedup through parallelism: published results do not achieve significant speedup over more performant CPU-only DBMSes when processing complete queries. This paper presents a successful effort to better meet this challenge, in the form of a proof-of-concept query processing framework. The framework constitutes a graft onto an existing DBMS, altering some parts of it and replacing its execution engine entirely. It intensively refactors query execution plans, making them better-parallelizable, before executing them on either a CPU or a GPU. This results in a significant speedup even on a CPU, and a further speedup when using a GPU, over the chosen host DBMS (MonetDB), which itself already bests most published results utilizing a GPU for query processing. Finally, we outline concrete future improvements to our results which could cut processing time by half and possibly much more.

Dependency-aware reordering for parallelizing query optimization in multi-core CPUs

2009

State-of-the-art commercial query optimizers employ cost-based optimization and exploit dynamic programming (DP) to find the optimal query execution plan (QEP) without evaluating redundant sub-plans. The number of alternative QEPs enumerated by the DP query optimizer can increase exponentially as the number of joins in the query increases. Recently, by exploiting the coming wave of multi-core processor architectures, a state-of-the-art parallel optimization algorithm, referred to as PDPsva, has been proposed to parallelize the time-consuming DP query optimization process itself. While PDPsva significantly extends the practical use of DP to queries having up to 20-25 tables, it has several limitations: 1) supporting only the size-driven DP enumerator, 2) statically allocating search space, and 3) not fully exploiting parallelism. In this paper, we propose the first generic solution for parallelizing any type of bottom-up optimizer, including the graph-traversal driven type, and for supporting dynamic search allocation and full parallelism. This is a challenging problem, since recently developed, state-of-the-art DP optimizers such as DPccp [21] and DPhyp are very difficult to parallelize due to tangled dependencies in the join pairs they generate. Unless the solution is very carefully devised, a lot of synchronization conflicts are bound to occur. By viewing a serial bottom-up optimizer as one which generates a totally ordered sequence of join pairs in a streaming fashion, we propose a novel concept of dependency-aware reordering, which minimizes waiting time caused by dependencies of join pairs. To maximize parallelism, we also introduce a series of novel performance optimization techniques: 1) pipelining of join pair generation and plan generation; 2) the synchronization-free global MEMO; and 3) threading across dependencies. Through extensive experiments with various query topologies, we show that our solution supports any type of bottom-up optimization, achieving linear speedup for each type. Although our solution is generic, thanks to these sophisticated optimization techniques our parallel optimizer outperforms PDPsva, which is tailored to size-driven enumeration. Experimental results also show that our solution is much more robust than PDPsva with respect to search space allocation.
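The core reordering idea can be illustrated with a small sketch. This is not the paper's algorithm or data structures; the join-pair encoding and the round-based deferral policy below are invented for illustration. A join pair (L, R) can only be processed once plans for both L and R exist in the MEMO; rather than blocking the stream on an unready pair, the sketch moves independent pairs ahead.

```python
from collections import deque

def reorder(stream, base_relations):
    """Reorder a totally ordered stream of join pairs so that pairs whose
    inputs are already in the MEMO run first instead of blocking the stream."""
    memo = {frozenset([r]) for r in base_relations}  # single relations are ready
    waiting, order = deque(stream), []
    while waiting:
        progressed = False
        for _ in range(len(waiting)):
            left, right = waiting.popleft()
            if left in memo and right in memo:     # dependencies satisfied
                order.append((left, right))
                memo.add(left | right)             # plan for the union now exists
                progressed = True
            else:
                waiting.append((left, right))      # defer instead of blocking
        if not progressed:
            raise ValueError("cyclic or unsatisfiable dependencies")
    return order

# (AB ⋈ C) depends on (A ⋈ B); the independent pair (C ⋈ D) is moved ahead.
S = frozenset
pairs = [(S("AB"), S("C")), (S("A"), S("B")), (S("C"), S("D"))]
print(reorder(pairs, "ABCD"))
# [(A ⋈ B), (C ⋈ D), (AB ⋈ C)] in set notation
```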