A New Big Data Benchmark for OLAP Cube Design Using Data Pre-Aggregation Techniques
Related papers
Reduced Quotient Cube: Maximize Query Answering Capacity in OLAP
IEEE Access
The data cube is a critical tool for accelerating online analysis of big data. Because of its exponential space overhead, the quotient cube was proposed as the main data cube compression approach: it significantly reduces the number of data cells by merging cells that are aggregated over the same set of base tuples, i.e., cells that are cover equivalent and form an equivalence class. Nevertheless, efficiently analyzing massive data remains challenging due to high storage consumption. This paper proposes the reduced quotient cube (RQC) based on the following observations: (i) a quotient cube contains equivalence classes of various sizes; (ii) small equivalence classes usually dominate; (iii) large equivalence classes are more capable of answering queries, since they induce more data cells. Unlike the quotient cube, which preserves all equivalence classes with equal priority, the reduced quotient cube preferentially retains those with larger query-answering capacity and smaller space occupancy. We further design efficient construction and querying algorithms for it. Extensive experimental results show that, compared with the quotient cube, the reduced quotient cube occupies only 11.3% of the space while retaining up to 95.9% of the query-answering capacity, and its query time is reduced by 51.24% on average.
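For intuition, here is a minimal Python sketch of cover equivalence on a toy relation: cells whose covered base-tuple sets coincide form one equivalence class, and a reduced cube would keep only the classes that induce many cells per unit of storage. The relation, dimension names, and the simple size-based ranking are illustrative assumptions, not the RQC construction algorithm itself.

```python
from itertools import product
from collections import defaultdict

# Toy illustration of cover equivalence, assuming a tiny relation with
# categorical dimensions; the data and the size-based selection are
# hypothetical, not the paper's exact algorithm.
rows = [
    {"store": "S1", "product": "P1", "city": "C1"},
    {"store": "S1", "product": "P2", "city": "C1"},
    {"store": "S2", "product": "P1", "city": "C2"},
]
dims = ["store", "product", "city"]

def cells():
    """Enumerate all group-by cells; '*' means the dimension is aggregated away."""
    values = {d: sorted({r[d] for r in rows} | {"*"}) for d in dims}
    for combo in product(*(values[d] for d in dims)):
        yield dict(zip(dims, combo))

def cover(cell):
    """Base tuples matched by a cell (its 'cover')."""
    return frozenset(
        i for i, r in enumerate(rows)
        if all(v == "*" or r[d] == v for d, v in cell.items())
    )

# Group cells into cover-equivalence classes: same covered tuple set.
classes = defaultdict(list)
for c in cells():
    cov = cover(c)
    if cov:                       # ignore empty cells
        classes[cov].append(c)

# A quotient cube keeps every class; a reduced quotient cube would keep only
# the classes that answer many cells per unit of storage (large classes first).
ranked = sorted(classes.items(), key=lambda kv: len(kv[1]), reverse=True)
space_budget = 3                  # hypothetical budget: number of classes kept
for cov, cls in ranked[:space_budget]:
    print(f"class covering rows {sorted(cov)} induces {len(cls)} cells")
```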
Improving Query Processing Time of Olap Cube Using Olap Operations
The popularity of the OLAP cube has been growing due to the huge volume of data and the need for ad-hoc analytical queries. Because the OLAP cube provides a multidimensional view of the data, analysis becomes faster and response times improve over relational databases. Performance here is measured in terms of query throughput, i.e., the time a query takes to fetch the appropriate and efficient result. Query processing time is observed to be better with the OLAP cube than with OLTP, but there is still room for further improvement. In this regard, applying OLAP operations on a cube is found to be an appropriate approach to improving the query processing time of the OLAP cube. In this paper, a comparative analysis of the query processing time of the OLAP cube and of the OLAP operations is carried out.
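To make the comparison concrete, the following toy Python sketch (with invented dimensions and data, not the paper's benchmark) contrasts answering a slice by scanning detail rows with answering it from a pre-aggregated cube of group-by totals.

```python
# Illustrative only: a slice over a pre-aggregated cube is a few dictionary
# lookups instead of a full scan of the detail rows.
detail = [("S1", "P1", 2020, 10.0), ("S1", "P2", 2020, 5.0),
          ("S2", "P1", 2021, 7.5), ("S2", "P2", 2021, 3.0)]

# Pre-aggregate once: cube cell (store, year) -> SUM(amount).
cube = {}
for store, _product, year, amount in detail:
    key = (store, year)
    cube[key] = cube.get(key, 0.0) + amount

def slice_from_detail(store):
    """OLTP-style answer: scan every detail row for each query."""
    return sum(a for s, _p, _y, a in detail if s == store)

def slice_from_cube(store):
    """OLAP-style answer: combine only the relevant pre-aggregated cells."""
    return sum(v for (s, _y), v in cube.items() if s == store)

assert slice_from_detail("S1") == slice_from_cube("S1") == 15.0
print(slice_from_cube("S1"))
```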
Constructing OLAP cubes based on queries
Proceedings of the 4th ACM international workshop on Data warehousing and OLAP - DOLAP '01, 2001
An On-Line Analytical Processing (OLAP) user often follows a train of thought, posing a sequence of related queries against the data warehouse. Although their details are not known in advance, the general form of those queries is apparent beforehand. Thus, the user can outline the relevant portion of the data by posing generalised queries against a cube representing the data warehouse.
Data Warehousing and OLAP: Improving Query Performance Using Distributed Computing
Data warehouses are used to store large amounts of data. This data is often used for On-Line Analytical Processing (OLAP), where short response times are essential for on-line decision support. One of the most important requirements of a data warehouse server is query performance; the principal aspect from the user's perspective is how quickly the server processes a given query: "the data warehouse must be fast". The main focus of our research is finding adequate solutions to improve the response time of typical OLAP queries and to improve scalability using a distributed computation environment that takes advantage of characteristics specific to the OLAP context. Our proposal provides very good performance and scalability even on huge data warehouses.
Fast and dynamic OLAP exploration using UDFs
2009
OLAP is a set of database exploratory techniques for efficiently retrieving multiple sets of aggregations from a large dataset. Generally, these techniques have either involved the use of an external OLAP server or required the dataset to be exported to a specialized OLAP tool for more efficient processing. In this work, we show that OLAP techniques can be performed within a modern DBMS without external servers or the exporting of datasets, using standard SQL queries and UDFs. The main challenge of such an approach is that SQL and UDFs are not as flexible as the C language for exploring the OLAP lattice, and it is therefore more difficult to develop optimizations. We compare three different ways of performing OLAP exploration: plain SQL queries, a UDF implementing a lattice structure, and a UDF implementing the star cube structure. We demonstrate how such methods can be used to efficiently explore typical OLAP datasets.
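As a rough illustration of the "plain SQL queries" baseline the paper compares against, the Python sketch below issues one GROUP BY query per node of the OLAP lattice over an in-memory SQLite table; the table and columns are hypothetical, and the UDF-based lattice and star-cube implementations described in the paper are not reproduced here.

```python
import sqlite3
from itertools import combinations

# Hypothetical fact table; the paper's actual schema and UDF code differ.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (store TEXT, product TEXT, city TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [("S1", "P1", "C1", 10.0), ("S1", "P2", "C1", 5.0), ("S2", "P1", "C2", 7.5)],
)

dims = ["store", "product", "city"]

# One GROUP BY query per node of the OLAP lattice (every subset of dimensions).
for k in range(len(dims) + 1):
    for subset in combinations(dims, k):
        cols = ", ".join(subset) if subset else "'ALL'"
        group = f"GROUP BY {', '.join(subset)}" if subset else ""
        sql = f"SELECT {cols}, SUM(amount) FROM sales {group}"
        for row in conn.execute(sql):
            print(subset or ("ALL",), row)
```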
Efficient OLAP query processing in distributed data warehouses
Information Systems, 2003
The success of Internet applications has led to an explosive growth in the demand for bandwidth from Internet Service Providers. Managing an Internet protocol network requires collecting and analyzing network data, such as flow-level traffic statistics. Such analyses can typically be expressed as OLAP queries, e.g., correlated aggregate queries and data cubes. Current-day OLAP tools for this task assume the availability of the data in a centralized data warehouse. However, the inherently distributed nature of data collection and the huge amount of data extracted at each collection point make it impractical to gather all data at a centralized site. One solution is to maintain a distributed data warehouse, consisting of local data warehouses at each collection point and a coordinator site, with most of the processing being performed at the local sites. In this paper, we consider the problem of efficient evaluation of OLAP queries over a distributed data warehouse. We have developed the Skalla system for this task. Skalla translates OLAP queries, specified as certain algebraic expressions, into distributed evaluation plans which are shipped to individual sites. A salient property of our approach is that only partial results are shipped, never parts of the detail data. We propose a variety of optimizations to minimize both the synchronization traffic and the local processing done at each site. We finally present an experimental study based on TPC-R data. Our results demonstrate the scalability of our techniques and quantify the performance benefits of the optimization techniques that have gone into the Skalla system.
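A minimal scatter/gather sketch of the central idea, that each site aggregates its own detail data and ships only partial results to the coordinator, is shown below; the sites, columns, and merge step are invented for illustration and do not reflect Skalla's algebraic plan generation or its optimizations.

```python
from collections import defaultdict

# Hypothetical flow-level data held at two collection points: (city, traffic).
site_data = {
    "site_A": [("C1", 10.0), ("C1", 5.0), ("C2", 7.5)],
    "site_B": [("C2", 3.0), ("C3", 4.0)],
}

def local_partial_aggregate(rows):
    """Each collection point aggregates its own detail data locally."""
    partial = defaultdict(float)
    for city, traffic in rows:
        partial[city] += traffic
    return dict(partial)          # only this small dict crosses the network

def coordinator_merge(partials):
    """The coordinator merges partial results into the final answer."""
    total = defaultdict(float)
    for partial in partials:
        for city, traffic in partial.items():
            total[city] += traffic
    return dict(total)

partials = [local_partial_aggregate(rows) for rows in site_data.values()]
print(coordinator_merge(partials))   # {'C1': 15.0, 'C2': 10.5, 'C3': 4.0}
```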
Complexity Analysis of Query Processing in Distributed OLAP Systems
The success of Internet applications has led to an explosive growth in the demand for data processing in data warehouses. Data warehousing plays an important role and is known as the backbone of today's cloud computing. OLAP systems are used for data analysis in data warehouses, and such analyses can be expressed as OLAP queries, e.g., correlated aggregate queries and data cubes. Nowadays, OLAP analysis is typically performed on a centralized data warehouse. Growth in data volume also affects how the database is queried. The traditional OLTP system is largely replaced by the OLAP system, in which data is stored even for months and years, and this huge amount of data makes user query processing more complex. OLAP queries mainly concern top-level management, which relies on them to make important business decisions, so the efficient execution of queries, i.e., aggregates and data cubes, becomes an essential tool on which management decisions depend. OLAP systems are responsible for making data available from the centralized DWH. In this paper, we analyze issues regarding the efficiency of OLAP queries when a distributed DWH is used. For the efficient processing of user queries, optimization techniques are applied to query data from multiple sites in a distributed database environment.
Proceedings 2004 VLDB Conference, 2004
Data cube has been playing an essential role in fast OLAP (online analytical processing) in many multi-dimensional data warehouses. However, there exist data sets in applications like bioinformatics, statistics, and text processing that are characterized by high dimensionality, e.g., over 100 dimensions, and moderate size, e.g., around 10^6 tuples. No feasible data cube can be constructed with such data sets. In this paper we will address the problem of developing an efficient algorithm to perform OLAP on such data sets. Experience tells us that although data analysis tasks may involve a high dimensional space, most OLAP operations are performed only on a small number of dimensions at a time. Based on this observation, we propose a novel method that computes a thin layer of the data cube together with associated value-list indices. This layer, while being manageable in size, will be capable of supporting flexible and fast OLAP operations in the original high dimensional space. Through experiments we will show that the method has I/O costs that scale nicely with dimensionality. Furthermore, the costs are comparable to that of accessing an existing data cube when full materialization is possible.
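The value-list index idea can be sketched as one inverted index per dimension, mapping each value to the IDs of the base tuples carrying it; a low-dimensional group-by is then answered by intersecting those tuple-ID lists. The toy data and column names below are assumptions, and the paper's actual thin-layer construction is more elaborate.

```python
from collections import defaultdict
from itertools import product

# Hypothetical base relation with a measure column 'expr'; in the target
# applications there would be 100+ dimensions rather than three.
rows = [
    {"gene": "g1", "tissue": "t1", "lab": "L1", "expr": 2.0},
    {"gene": "g1", "tissue": "t2", "lab": "L1", "expr": 3.0},
    {"gene": "g2", "tissue": "t1", "lab": "L2", "expr": 5.0},
]

# Build one value-list index per dimension: value -> set of tuple IDs.
index = defaultdict(lambda: defaultdict(set))
for tid, r in enumerate(rows):
    for dim in ("gene", "tissue", "lab"):
        index[dim][r[dim]].add(tid)

def group_by(dims):
    """Aggregate SUM(expr) over a small subset of the (possibly many) dimensions."""
    result = {}
    for combo in product(*(index[d].keys() for d in dims)):
        tids = set.intersection(*(index[d][v] for d, v in zip(dims, combo)))
        if tids:
            result[combo] = sum(rows[t]["expr"] for t in tids)
    return result

print(group_by(("gene", "tissue")))   # OLAP on 2 of the many dimensions
```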
Selecting and Allocating Cubes in Multi-Node OLAP Systems
Concepts and Competitive Analytics
OLAP queries are characterized by short answering times. Materialized cube views, a pre-aggregation and storage of group-by values, are one of the possible answers to that requirement. However, if all possible views were computed and stored, the amount of materializing time and storage space required would be huge. Selecting the most beneficial set, based on the profile of the queries and observing constraints such as materializing space and maintenance time, a problem denoted as the cube view selection problem, is the condition for an effective OLAP system, with a variety of solutions for centralized approaches. When a distributed OLAP architecture is considered, the problem gets bigger, as we must deal with another dimension: space. Besides the problem of selecting the multidimensional structures, there is now also a node allocation problem; both are a condition for performance. This chapter focuses on distributed OLAP systems, recently introduced, proposing evolutionary algorithms for the selection and allocation of the distributed OLAP cube, using a distributed linear cost model. This model uses an extended aggregation lattice as a framework to capture the distributed semantics, and introduces processing nodes' power and real communication cost parameters, allowing the estimation of query and maintenance costs in time units. Moreover, since we have an OLAP environment with several nodes, we have parallel processing, and the evaluation of the fitness of evolutionary solutions is therefore based on cost estimation algorithms that simulate the execution of parallel tasks, using time units as the cost metric.
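For orientation, the sketch below shows only the classic centralized baseline for the view-selection half of the problem, a greedy choice of cuboids by benefit per unit of space under a linear cost model; the lattice, view sizes, and budget are invented, and the chapter's actual contribution, evolutionary selection plus node allocation over a distributed cost model, is not reproduced here.

```python
# Toy greedy view selection under a storage budget; all numbers are hypothetical.
# view -> (size in rows, set of views it can answer, including itself)
lattice = {
    "store,product,city": (1000, {"store,product,city", "store,product", "product,city",
                                  "store", "ALL"}),
    "store,product":      (400,  {"store,product", "store", "ALL"}),
    "product,city":       (300,  {"product,city", "ALL"}),
    "store":              (50,   {"store", "ALL"}),
    "ALL":                (1,    {"ALL"}),
}
base = "store,product,city"       # the base cuboid is always materialized

def answer_cost(view, materialized):
    """Linear cost model: a query on `view` scans the smallest materialized ancestor."""
    return min(lattice[m][0] for m in materialized if view in lattice[m][1])

def greedy_select(budget_rows):
    materialized, space = {base}, lattice[base][0]
    while True:
        best, best_gain = None, 0.0
        for v, (size, _) in lattice.items():
            if v in materialized or space + size > budget_rows:
                continue
            # Benefit per unit of space: total query-cost reduction / view size.
            gain = sum(max(0, answer_cost(q, materialized) -
                              answer_cost(q, materialized | {v}))
                       for q in lattice) / size
            if gain > best_gain:
                best, best_gain = v, gain
        if best is None:
            return materialized
        materialized.add(best)
        space += lattice[best][0]

print(greedy_select(budget_rows=1500))
```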