Leveraging 24/7 Availability and Performance for Distributed Real-Time Data Warehouses

24/7 Real-Time Data Warehousing: A Tool for Continuous Actionable Knowledge

Technological evolution has redefined many business models. Many decision makers are now required to act in near real time, rather than periodically, on the latest transactional information. Decision-making occurs much more frequently and considers the latest business data. Since data warehouses (DWs) are the core of business intelligence, decision support systems need to deal with 24/7 real-time requirements. Thus, the ability to handle continuous data loading and decision support availability simultaneously is critical for producing continuous actionable knowledge. The main challenge in this context is to manage the DW’s refreshment efficiently when data sources change, restoring consistency and accuracy with those sources while maintaining OLAP availability and database performance. This paper proposes a simple, fast and efficient solution based on database replication and temporary tables to change a traditional enterprise DW into a real-time DW, enabling continuous data loading and OLAP availability on a 24/7 schedule. Experimental evaluations using a real-world DW and the TPC-H decision support benchmark show its advantages and analyze its impact on OLAP performance.

From Data Warehouses to Streaming Warehouses: A Survey on the Challenges for Real-Time Data Warehousing and Available Solutions

International Journal of Computer Applications, 2013

Data Warehouses usually work on historical data. In most cases, the Data Warehouse is loaded with data from operational or transactional systems on a weekly or nightly basis. As today's business decisions are increasingly made in real time, it is only natural that Data Warehouse, Business Intelligence, Decision Support and OLAP systems must quickly begin incorporating real-time data. When shifting from a traditional offline and time-consuming data warehousing system to a real-time system, two important considerations are speeding up the ETL process and the OLAP process. This survey looks into the various challenges involved in building a real-time Data Warehouse and some of the solutions available to overcome them.

Epsilon Equitable Partition: On Scheduling Data Loading and View Maintenance in Soft Real-time Data Warehouses

2009

Data warehouses contain historic data providing information for analytical processing, decision making and data mining tools. However, several business intelligence applications nowadays require access to real-time data to make sound decisions. As a consequence, there is a great demand to incorporate new data from sources into the data warehouse as fast as possible. That motivates the construction of real-time data warehouses. Despite high demand, moving from a traditional data warehouse to a real-time one is not straightforward, since we have to deal with the problem of efficiently scheduling various activities, viz., updates, view maintenance and OLAP transactions, in a timely manner. In addition, OLAP transactions are now associated with deadlines (transaction timeliness) and data freshness requirements (data timeliness). Balancing these two requirements poses another challenge in the real-time data warehousing context. In this paper, we present an efficient technique aimed at updating data and performing view maintenance for real-time data warehouses while still enforcing these two timing requirements for the OLAP transactions. Our proposed approach aims at addressing the issues of applying updates and performing view maintenance while still effectively serving user OLAP queries in a real-time data warehouse environment. Through extensive empirical studies, we demonstrate the efficiency of ORAD in achieving the goal of building a real-time data warehouse.

Query optimisation in real-time data warehouses

International Journal of Intelligent Information and Database Systems, 2019

A real-time data warehouse (RTDW) allows decision makers to analyse fresh data as fast as possible in order to support real-time decision processes. In this paper, we focus on optimisation techniques to speed up query processing. First, we propose an algorithm for the dynamic selection of materialised views (DynaSeV), which selects views from the results of incoming queries. Secondly, we suggest a new update policy to dynamically maintain materialised views. In addition, we propose a novel data partitioning approach for RTDW, called 2LPA-RTDW (Two-Level data Partitioning Approach for RTDW), which allows the amount of data in each partition to be unbalanced. Then, we present our architecture, called the DETL-(m, k)-firm-RTDW architecture (decentralised extract-transform-load approach based on (m, k)-Firm constraints for RTDW), which deals with diversity and disparities in data source systems to reduce the time for ETL. Finally, we evaluate our contributions using the TPC-DS (TPC, 2014) benchmark; the preliminary results are quite promising.
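
The abstract does not define (m, k)-firm constraints. In the real-time scheduling literature the term usually means that at least m out of any k consecutive jobs must meet their deadlines. A minimal checker for that standard definition is sketched below; the function name and the encoding of outcomes as booleans are assumptions for illustration, not part of the paper's architecture.

```python
from collections import deque

def satisfies_mk_firm(outcomes, m, k):
    """Return True if every window of k consecutive outcomes contains
    at least m met deadlines. `outcomes` is an iterable of truthy
    (deadline met) or falsy (deadline missed) values."""
    if not 0 < m <= k:
        raise ValueError("require 0 < m <= k")
    window = deque(maxlen=k)  # sliding window over the last k outcomes
    met = 0                   # number of met deadlines inside the window
    for ok in outcomes:
        if len(window) == k:
            met -= window[0]  # oldest outcome is about to be evicted
        window.append(bool(ok))
        met += bool(ok)
        if len(window) == k and met < m:
            return False
    return True
```

A sequence shorter than k contains no complete window, so this sketch accepts it vacuously.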

Efficient, chunk-replicated node partitioned data warehouses

Proceedings of the 2008 International Symposium on Parallel and Distributed Processing with Applications, ISPA 2008, 2008

Much has been said about efficiently processing data in parallel database servers, and some data warehouse applications must process on the order of tens to hundreds of gigabytes efficiently. Yet, there is no effective approach targeted at using non-dedicated low-cost platforms efficiently in this context. Imagine putting together 10 or 1000 commodity PCs and setting up a data crunching platform for large database-resident data with acceptable performance. There are significant inter-related data layout and processing challenges when the computational, storage and network hardware are heterogeneous and slow. We propose how to place, replicate and load-balance the data efficiently in this context. This work innovates in several respects: being practically as fast as full mirroring without its overhead, and exploring schema, chunk-wise placement, replication and load-balanced processing to be faster and more flexible than previous efforts. Our findings are complemented by an evaluation using TPC-H performance benchmark queries.

A survey of parallel and distributed data Warehouses

International Journal of Data Warehousing and Mining, 2009

Data Warehouses are a crucial technology for competitive organizations in today's globalized world. Size, speed and distributed operation are major challenges for those systems. Many data warehouses are huge and require that queries be processed quickly and efficiently, so parallel solutions are deployed to deliver the necessary efficiency. Distributed operation, on the other hand, concerns global commercial and scientific organizations that need to share their data in a coherent distributed data warehouse. In this paper we review the major concepts, systems and research results behind parallel and distributed data warehouses.

Model and procedure for performance and availability-wise parallel warehouses

Distributed and Parallel Databases, 2009

Consider data warehouses as large data repositories queried for analysis and data mining in a variety of application contexts. A query over such data may take a large amount of time to be processed on a regular PC. Consider partitioning the data across a set of PCs (nodes), with either a parallel database server or any database server at each node and an engine-independent middleware. Nodes and network may not even be fully dedicated to the data warehouse. In such a scenario, care must be taken in handling processing heterogeneity and availability, so we study and propose efficient solutions for this. We concentrate on three main contributions: a performance-wise index, measuring relative performance; a replication-degree; and a flexible chunk-wise organization with on-demand processing. These contributions extend the previous work on de-clustering and replication and are generic in the sense that they can be applied in very different contexts and with different data partitioning approaches. We evaluate their merits with a prototype implementation of the system.
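
The abstract does not describe how chunk-wise, on-demand processing is carried out. A rough simulation of one plausible reading is sketched below: each node asks for the next chunk as soon as it becomes idle, so faster nodes naturally process more chunks. The function name, the speed model (chunks per time unit) and the assumption that any node can process any chunk (for example through replication) are all illustrative and not taken from the paper.

```python
import heapq

def simulate_on_demand(num_chunks, node_speeds):
    """Hand out chunks one at a time to whichever node becomes idle first.

    node_speeds[i] is the assumed throughput of node i in chunks per
    time unit. Returns (chunks_processed_per_node, makespan)."""
    # Heap of (time at which the node becomes idle, node index).
    heap = [(0.0, node) for node in range(len(node_speeds))]
    heapq.heapify(heap)
    done = [0] * len(node_speeds)
    makespan = 0.0
    for _ in range(num_chunks):
        free_at, node = heapq.heappop(heap)
        finish = free_at + 1.0 / node_speeds[node]
        done[node] += 1
        makespan = max(makespan, finish)
        heapq.heappush(heap, (finish, node))
    return done, makespan
```

With two nodes where the first is twice as fast as the second, the first ends up processing about twice as many chunks, without any speed estimate being needed up front.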

Distributed real time database systems: background and literature review

Distributed and Parallel Databases, 2008

Today’s real-time systems (RTS) are characterized by managing large volumes of dispersed data, making real-time distributed data processing a reality. Large business houses need to do distributed processing for many reasons, and they often must do it in order to stay competitive. So, efficient database management algorithms and protocols for accessing and manipulating data are required to satisfy the timing constraints of supported applications. Therefore, new research in distributed real-time database systems (DRTDBS) is needed to investigate possible ways of applying database systems technology to real-time systems. This paper first discusses the performance issues that are important to DRTDBS, and then surveys the research that has been done so far on issues such as priority assignment policy, commit protocols and optimizing the use of memory in non-replicated/replicated environments pertaining to distributed real-time transaction processing. This study provides a foundation for addressing performance issues important for the management of very large real-time data, and pointers to other publications in journals and conference proceedings for further investigation of unanswered research questions.

A Survey of Real-Time Data Warehouse and ETL

2015

Data Warehouses (DWs) are typically designed for efficient processing of read-only analysis queries over large data, allowing only offline updates at night. The current trends of business globalization and online business activities available 24/7 mean that DWs must support the increasing demands for the latest versions of the data. Real-time data warehousing aims to meet the increasing demands of Business Intelligence (BI) for the latest versions of the data. Informed decision-making is required for competitive success in the new global marketplace, which is fraught with uncertainty and rapid technology changes. Decision makers must adjust operational processes, corporate strategies, and business models at lightning speed and must be able to leverage business intelligence instantly and take immediate action. Sound decisions are based on data that is analyzed according to well-defined criteria. Such data typically resides in a DW for purposes of performing statistical and analytical proces...

IJERT-A Novel Approach For Updates In Streaming Data Warehouses By Scalable Scheduling

International Journal of Engineering Research and Technology (IJERT), 2013

https://www.ijert.org/a-novel-approach-for-updates-in-streaming-data-warehouses-by-scalable-scheduling https://www.ijert.org/research/a-novel-approach-for-updates-in-streaming-data-warehouses-by-scalable-scheduling-IJERTV2IS70095.pdf

This project treats the streaming data warehouse update problem as a scheduling problem in which jobs correspond to the processes that load new data into tables and the objective is to minimize data staleness over time. The proposed scheduling framework handles the complications encountered by a stream warehouse: view hierarchies and priorities, data consistency, inability to preempt updates, heterogeneity of update jobs caused by different inter-arrival times and data volumes among sources, and transient overload. The work addresses update scheduling in streaming data warehouses, which combine the features of traditional data warehouses and data stream systems. The need for on-line warehouse refreshment introduces several challenges in the implementation of data warehouse transformations, with respect to their execution time and their overhead on the warehouse processes. The problem with this approach is that new data may arrive on multiple streams, but there is no mechanism for limiting the number of tables that can be updated simultaneously.
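
To illustrate the idea of scheduling updates to reduce staleness, the sketch below greedily picks the tables with the largest priority-weighted staleness, up to a cap on simultaneous updates. This is a simple heuristic written for illustration, not the paper's algorithm; the function name, the table tuple layout and the `max_parallel` cap are assumptions. Because updates cannot be preempted, a scheduler like this would be called only when a slot is free.

```python
def pick_updates(tables, now, max_parallel):
    """Choose which tables to refresh next.

    tables maps a table name to (last_loaded_time, priority, has_pending_data).
    Returns up to max_parallel names, ordered by decreasing
    priority * staleness, where staleness = now - last_loaded_time."""
    candidates = [
        (priority * (now - last_loaded), name)
        for name, (last_loaded, priority, pending) in tables.items()
        if pending  # tables without new data cannot become fresher
    ]
    # Highest weighted staleness first; ties broken by name for determinism.
    candidates.sort(key=lambda c: (-c[0], c[1]))
    return [name for _, name in candidates[:max_parallel]]
```

Capping the result at `max_parallel` is one possible answer to the limitation noted above, where nothing bounds the number of tables updated at the same time.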