Inverting Middleware Framework: a Framework for Performance Analysis of Distributed OLAP Benchmark on Clusters of PCs by Filtering and Abstracting Low Level Resource Usage

Scalability and resource usage of an OLAP benchmark on clusters of PCs

2002

Designing clusters of PCs for distributed databases that process OLAP (On Line Analytical Processing) workloads in parallel with good scalability remains a particular challenge, because we lack a deep understanding of the architectural issues around resource usage by standard DBMSs on distributed platforms. To address this problem, we present a novel performance monitoring framework for filtering and abstracting samples of performance data from low-level counters into a high-level performance picture. Our framework is used side by side with the DBMS and delivers many interesting insights about the most critical resource in the different queries and system configurations. As required for a larger distributed hardware/software system, our solution comprises software instrumentation at the OS level, tools for gathering performance-relevant data, and an analytical model for performance evaluation and performance prediction for future platforms. We demonstrate the viability of our approach with an in-depth analysis of distributed TPC-D, a standard OLAP benchmark, running on clusters of commodity PCs. Based on the data provided by our framework, we isolate and resolve a few crucial performance issues of OLAP workloads on clusters. For different queries, we give a workload characterization in terms of resource usage, quantify the optimal scalability, and investigate the impact of the networking speed on the overall application performance. We show that disk performance and CPU speed remain the most critical resource bottlenecks for most queries. Queries with a lot of inter-node communication are limited by the inefficiency of the communication software within the DBMS and not by the raw networking speeds. A systematic performance evaluation constitutes a solid basis for architectural decisions and system optimization in clusters of PCs that are dedicated to large parallel database systems.
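The idea of abstracting low-level counter samples into a high-level picture can be illustrated with a minimal sketch. It is not the framework described in the abstract: the resource names, the sample values, and the helper `dominant_resource` are all hypothetical, and the "abstraction" here is simply averaging per-interval utilization and reporting the busiest resource.

```python
from statistics import mean


def dominant_resource(samples):
    """Return (resource, mean_utilization) for the busiest resource.

    `samples` is a list of dicts, one per sampling interval, mapping a
    resource name to its utilization in [0, 1]. Hypothetical data model.
    """
    resources = samples[0].keys()
    averages = {r: mean(s[r] for s in samples) for r in resources}
    busiest = max(averages, key=averages.get)
    return busiest, averages[busiest]


# Made-up samples for one query on one node, for illustration only.
samples = [
    {"cpu": 0.50, "disk": 0.90, "network": 0.10},
    {"cpu": 0.60, "disk": 0.80, "network": 0.20},
    {"cpu": 0.40, "disk": 1.00, "network": 0.30},
]
resource, utilization = dominant_resource(samples)
print(resource, round(utilization, 2))
```

A real monitor would also have to filter noisy samples and correlate them with query phases, which this sketch leaves out.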

ParGRES: a middleware for executing OLAP queries in parallel

COPPE/UFRJ, 2005

ParGRES is a middleware designed to efficiently process the heavyweight queries typical of OLAP on top of a database cluster. ParGRES achieves query processing speed-up through intra- and inter-query parallelism in a PC cluster environment with database replication and virtual partitioning. It accelerates both individual queries and system throughput. Our experimental results show that ParGRES yields super-linear or near-linear speed-up. The ParGRES middleware preserves application and database autonomy. As a result, it offers a non-intrusive migration path from a sequential to a parallel environment. Currently, ParGRES uses PostgreSQL, but it is not DBMS dependent, and it has a Web administration tool. The main features of ParGRES are: automatic parsing of SQL queries to allow for intra-query parallel execution; query processing with inter- and intra-query parallelism; virtual dynamic partition definition; result composition; update processing; and dynamic load balancing. The main contribution of ParGRES is to combine inter- and intra-query parallelism with dynamic load balancing for virtual partitions, all within an open source, cost-effective solution.
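Virtual partitioning with result composition can be sketched as follows. This is an illustration under stated assumptions, not ParGRES code: the helper names are hypothetical, the query uses TPC-H column names purely as an example, the query is assumed to already contain a WHERE clause, and the number of partitions is assumed not to exceed the number of keys.

```python
def virtual_partitions(lo, hi, n):
    """Split the inclusive key range [lo, hi] into n contiguous sub-ranges."""
    size, extra = divmod(hi - lo + 1, n)
    ranges, start = [], lo
    for i in range(n):
        # The first `extra` sub-ranges take one additional key each.
        end = start + size + (1 if i < extra else 0) - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


def rewrite(query, key, key_range):
    """Restrict `query` to one sub-range by appending a range predicate."""
    lo, hi = key_range
    return f"{query} AND {key} BETWEEN {lo} AND {hi}"


def compose_sum(partials):
    """Compose the final SUM from the partial sums returned by each node."""
    return sum(partials)


query = "SELECT SUM(l_extendedprice) FROM lineitem WHERE l_discount > 0.05"
subqueries = [rewrite(query, "l_orderkey", r)
              for r in virtual_partitions(1, 10, 3)]
for sq in subqueries:
    print(sq)
```

Each sub-query can run on a different replica because every node holds the full database. Aggregates such as AVG need a different composition step (for example, combining per-node SUM and COUNT), which the sketch does not cover.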

Parallel OLAP query processing in database clusters with data replication

Distributed and Parallel Databases, 2009

We consider the problem of improving the performance of OLAP applications in a database cluster (DBC), which is a low cost and effective parallel solution for query processing. Current DBC solutions for OLAP query processing provide for intra-query parallelism only, at the cost of full replication of the database. In this paper, we propose more efficient distributed database design alternatives which combine physical/virtual partitioning with partial replication. We also propose a new load balancing strategy that takes advantage of an adaptive virtual partitioning to redistribute the load to the replicas. Our experimental validation is based on the implementation of our solution on the SmaQSS DBC middleware prototype. Our experimental results using the TPC-H benchmark and a 32-node cluster show very good speedup.

Data Warehousing and OLAP: Improving Query Performance Using Distributed Computing

Data warehouses are used to store large amounts of data. This data is often used for On-Line Analytical Processing (OLAP) where short response times are essential for on-line decision support. One of the most important requirements of a data warehouse server is the query performance. The principal aspect from the user perspective is how quickly the server processes a given query: "the data warehouse must be fast". The main focus of our research is finding adequate solutions to improve query response time of typical OLAP queries and improve scalability using a distributed computation environment that takes advantage of characteristics specific to the OLAP context. Our proposal provides very good performance and scalability even on huge data warehouses.

Adaptive hybrid partitioning for OLAP query processing in a database cluster

International Journal of High Performance Computing and Networking, 2008

OLAP queries are typically heavyweight and ad hoc, thus requiring high storage capacity and processing power. In this paper, we address this problem using a database cluster, which we see as a cost-effective alternative to a tightly-coupled multiprocessor. We propose a solution for efficient OLAP query processing using a simple data-parallel processing technique called adaptive virtual partitioning, which dynamically tunes partition sizes without requiring any knowledge about the database and the DBMS. To validate our solution, we implemented a Java prototype on a 32-node cluster system and ran experiments with typical queries of the TPC-H benchmark. The results show that our solution yields linear, and sometimes superlinear, speedup. In many cases, it outperforms traditional virtual partitioning by factors greater than 10.
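The abstract does not spell out how partition sizes are tuned, so the following is only a simplified sketch of the general idea: grow the sub-range size while the measured time per key stays close to the best observed, and shrink it otherwise. The function names, the growth and tolerance parameters, and the simulated cost model are all assumptions made for illustration.

```python
def adaptive_partition_sizes(total_keys, run_subquery, initial_size=1000,
                             growth=2.0, tolerance=1.25):
    """Process `total_keys` keys in sub-ranges whose size is tuned at run time.

    `run_subquery(start, size)` executes one sub-query and returns its
    elapsed time. Returns the list of sub-range sizes that were used.
    """
    sizes = []
    start, size = 0, initial_size
    best_cost = None  # lowest time per key observed so far
    while start < total_keys:
        size = min(size, total_keys - start)
        elapsed = run_subquery(start, size)
        sizes.append(size)
        start += size
        cost = elapsed / size
        if best_cost is None or cost <= best_cost * tolerance:
            # Still efficient: remember the best cost and try a larger range.
            best_cost = cost if best_cost is None else min(best_cost, cost)
            size = int(size * growth)
        else:
            # Performance degraded: fall back to a smaller range.
            size = max(initial_size, size // 2)
    return sizes


def simulated_subquery(start, size):
    """Hypothetical cost model: fixed overhead plus a per-key cost that
    triples once the sub-range exceeds 8000 keys."""
    per_key = 1e-5 if size <= 8000 else 3e-5
    return 0.01 + size * per_key


sizes = adaptive_partition_sizes(50_000, simulated_subquery)
print(sizes)
```

The point of the sketch is that the tuning loop only observes execution times, which matches the abstract's claim that no knowledge about the database or the DBMS is required.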

In-Depth Analysis of OLAP Query Performance on Heterogeneous Hardware

Datenbank-Spektrum, 2021

Classical database systems are now facing the challenge of processing high-volume data feeds at unprecedented rates as efficiently as possible while also minimizing power consumption. Since CPU-only machines hit their limits, co-processors like GPUs and FPGAs are investigated by database system designers for their distinct capabilities. As a result, database systems over heterogeneous processing architectures are on the rise. In order to better understand their potentials and limitations, in-depth performance analyses are vital. This paper provides interesting performance data by benchmarking a portable operator set for column-based systems on CPU, GPU, and FPGA – all available processing devices within the same system. We consider TPC‑H query Q6 and additionally a hash join to profile the execution across the systems. We show that system memory access and/or buffer management remains the main bottleneck for device integration, and that architecture-specific execution engines and op...

Data Warehousing and OLAP in a Cluster Computer Environment

2001

Decision-oriented technologies, like data warehousing and on-line analytical processing (OLAP) systems, store and handle very large volumes of data, requiring more efficient ways of dealing with them. Recent advances in parallel computing and high-speed networks using a cluster of PCs or workstations (COWs) offer a low-cost solution for providing this scale-up in performance by parallelism of data, and its processing, in the data warehouse. This paper investigates how the star join and data cube operations can be performed in parallel on a cluster of PCs.

Parallel query processing for OLAP in grids

Concurrency and Computation: Practice and Experience, 2008

OLAP query processing is critical for enterprise grids. Capitalizing on our experience with the ParGRES database cluster, we propose a middleware solution, GParGRES, which exploits database replication and inter- and intra-query parallelism to efficiently support OLAP queries in a grid. GParGRES is designed as a wrapper that enables the use of ParGRES in PC clusters of a grid (in our case, Grid5000). Our approach has two levels of query splitting: grid-level splitting, implemented by GParGRES, and node-level splitting, implemented by ParGRES. GParGRES has been partially implemented as database grid services compatible with existing grid solutions such as the open grid service architecture and the Web services resource framework. We give preliminary experimental results obtained with two clusters of Grid5000 using queries of the TPC-H Benchmark. The results show linear or almost linear speedup in query execution as more nodes are added in all tested configurations. [...] databases using Web services and provide transparent support for database queries. Ideally, a grid database solution must respect database autonomy (i.e. avoid database or application migration) while taking advantage of distributed and parallel computing. This can be achieved through the development of a middleware layer between the user applications and the databases. Such a middleware should provide for distributed and parallel query processing with non-intrusive techniques, considering DBMS as black-box components; hence, there is no need for database or application migration.
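The two levels of query splitting can be pictured with a small sketch: a key range is first divided among clusters (the grid level) and each cluster's share is then divided among its nodes (the node level). This is an assumption-laden illustration rather than the GParGRES implementation: the function names are hypothetical, and the grid-level split is simply equal per cluster, ignoring cluster size and load.

```python
def split_range(lo, hi, parts):
    """Split the inclusive key range [lo, hi] into `parts` contiguous sub-ranges."""
    size, extra = divmod(hi - lo + 1, parts)
    ranges, start = [], lo
    for i in range(parts):
        end = start + size + (1 if i < extra else 0) - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


def two_level_split(lo, hi, nodes_per_cluster):
    """Return, for each cluster, the list of key ranges assigned to its nodes.

    `nodes_per_cluster[i]` is the number of nodes in cluster i.
    """
    # Grid level: one contiguous sub-range per cluster.
    cluster_ranges = split_range(lo, hi, len(nodes_per_cluster))
    # Node level: each cluster splits its own sub-range among its nodes.
    return [split_range(c_lo, c_hi, n)
            for (c_lo, c_hi), n in zip(cluster_ranges, nodes_per_cluster)]


plan = two_level_split(1, 100, [2, 4])
for cluster_id, node_ranges in enumerate(plan):
    print(cluster_id, node_ranges)
```

With an equal grid-level split, nodes in the smaller cluster receive larger ranges than nodes in the larger one; a production design would more likely weight the grid-level split by cluster capacity.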

Sidera: A Cluster-Based Server for Online Analytical Processing

2007

Online Analytical Processing (OLAP) has become a primary component of today’s pervasive Decision Support systems. The rich multi-dimensional analysis that OLAP provides allows corporate decision makers to more fully assess and evaluate organizational progress than ever before. However, as the data repositories upon which OLAP is based become larger and larger, single-CPU OLAP servers are often stretched to, or even beyond, their limits. In this paper, we present a comprehensive architectural model for a fully parallelized OLAP server. Our multi-node platform actually consists of a series of largely independent sibling servers that are “glued” together with a lightweight MPI-based Parallel Service Interface (PSI). Physically, we target the commodity-oriented, “shared nothing” Linux cluster, a model that provides an extremely cost-effective alternative to the “shared everything” commercial platforms often used in high-end database environments. Experimental results demonstrate both the viability and robustness of the design.