Last Minute Notes (LMNs) Data Warehousing (original) (raw)

Last Updated : 24 Jan, 2025

A **Data Warehouse (DW) is a centralized system that stores large amounts of structured data from various sources, optimized for analysis, reporting, and decision-making. Unlike transactional databases, which handle daily operations, a data warehouse focuses on analytical processing. This article covers last minute notes of **Data Warehousing. These Last Minute Notes provide concise and clear summaries of key concepts, offering quick revision on topics like data warehousing, OLAP, and more, helping you grasp essential information effectively and efficiently for exams.

Table of Content

Data Warehousing Basics

Data warehousing involves collecting data from various operational sources, transforming it into a suitable format, and storing it in a central repository. It supports complex queries and analytics by providing historical, integrated, and summarized data for reporting.

**Need for Data Warehousing

  1. **Consolidated Data: Collects data from multiple sources into one place, making it easier for analysis.
  2. **Faster Decision Making: Provides timely and accurate insights for business decisions.
  3. **Historical Analysis: Allows businesses to analyze trends over time.
  4. **Improved Data Quality: Cleans and structures data, ensuring consistency.

**Characteristics of Data Warehousing

  1. **Subject-Oriented: Organized around key business subjects (sales, inventory, finance).
  2. **Integrated: Combines data from multiple sources in a consistent format.
  3. **Time-Variant: Stores historical data to analyze trends over different periods.
  4. **Non-Volatile: Once data is loaded, it’s read-only, ensuring data consistency over time.

**Functions of Data Warehousing

  1. **Data Consolidation: Combines data from different systems into a single data store.
  2. **Data Analysis: Supports complex queries and reports for decision support.
  3. **Data Mining: Extracts patterns and trends from large datasets.
  4. **OLAP: Performs multi-dimensional analysis, allowing data to be viewed from various perspectives.

**Types of Data Warehouses

  1. **Enterprise Data Warehouse (EDW): A centralized warehouse containing all the organizational data for reporting across departments.
  2. **Data Mart: A smaller, focused subset of the data warehouse, catering to specific departments or business units (e.g., marketing, finance).
  3. **Virtual Data Warehouse: A warehouse built using virtual views and real-time queries over the operational database rather than storing physical data.

**Data Warehousing vs DBMS

Feature Data Warehousing DBMS (Database Management System)
Purpose Analytical queries and reporting Transactional processing (CRUD)
Data Type Historical, integrated, and summarized data Current, operational data
Query Type Complex queries, ad-hoc reports Simple queries for data modification
Time-Variance Stores historical data Stores current data

**Data Marts

A **Data Mart is a smaller, more focused subset of a data warehouse, catering to a specific department or business unit (e.g., marketing, finance). It provides faster access to relevant data for departmental analysis.

**Data Warehouse vs Data Mart

Feature Data Warehouse Data Mart
Scope Enterprise-wide Department-specific
Data Volume Larger, more comprehensive Smaller, focused
Query Complexity More complex, cross-departmental analysis Simpler, departmental-focused queries

The **ETL (Extract, Transform, Load) process is fundamental to data warehousing. It involves:

  1. **Extract: Gathering data from various source systems.
  2. **Transform: Cleaning, filtering, and transforming data into a consistent format.
  3. **Load: Storing the transformed data into the data warehouse for analysis.

**Data Warehouse Architecture

Data warehouse architecture consists of three primary layers:

  1. **Bottom Tier: The **data warehouse database where all data is stored.
    • Often implemented using relational databases or specialized columnar databases.
  2. **Middle Tier: **OLAP Servers (Online Analytical Processing), which facilitate complex querying and analysis.
    • **ROLAP (Relational OLAP): Works directly with relational databases.
    • **MOLAP (Multidimensional OLAP): Pre-aggregates data in cubes for fast access.
    • **HOLAP (Hybrid OLAP): Combines both ROLAP and MOLAP.
  3. **Top Tier: **Front-End Tools for querying, reporting, and visualization, allowing users to interact with the data.

**Operational and Informational Systems

Aspect Operational Systems Informational Systems (DW)
Purpose Process daily transactions Analyze historical and summarized data
Data Type Real-time, transactional data Historical, aggregated data
Processing Speed High (real-time operations) Optimized for complex queries and analysis

**Derived Data

OLAP Technology

**OLAP is a technology used for multidimensional data analysis, enabling users to interactively analyze large datasets from multiple perspectives. It allows fast querying and reporting by organizing data in a multidimensional cube, which makes it easier for businesses to derive insights from large data sets.

**Features of OLAP

  1. **Multidimensional Analysis: OLAP provides an interface for analyzing data across multiple dimensions (e.g., time, geography, product).
  2. **Data Cube: A multi-dimensional array where each cell represents a data measure (e.g., sales) for specific dimensions (e.g., region, time).
  3. **Measures: Quantitative data (e.g., sales, profit) stored in the OLAP cube.
  4. **Dimensions: The perspectives (e.g., time, location) from which the data is analyzed.
  5. **Drill-Down and Roll-Up: Drill-down refers to navigating from summary data to more detailed data, while roll-up summarizes detailed data into higher-level aggregates.

**Types of OLAP Operations:

  1. **Slice: Extracts a sub-cube by fixing one or more dimensions to a specific value (e.g., sales for the year 2022).
  2. **Dice: A more specific version of slice, selecting a sub-cube based on multiple conditions.
  3. **Drill-Down: Navigates from summary data to more detailed data.
  4. **Roll-Up: Summarizes data, typically by climbing up the hierarchy of dimensions (e.g., from daily sales to monthly sales).
  5. **Pivot (Rotate): Changes the orientation of the cube to view data from different perspectives.

**Types of OLAP:

  1. **MOLAP (Multidimensional OLAP):
    • MOLAP systems store data in a multidimensional cube format, which allows fast processing of queries by pre-aggregating data.
    • Examples: **Microsoft Analysis Services, IBM Cognos.
    • **Advantages: Faster query performance, better at handling complex calculations, uses pre-aggregation of data.
    • **Disadvantages: More storage space required to store pre-aggregated data.
  2. **ROLAP (Relational OLAP):
    • ROLAP systems store data in relational databases and perform real-time calculations on the data as queries are processed.
    • Examples: **Oracle OLAP, SAP BusinessObjects.
    • **Advantages: Can handle large data sets, doesn't require as much storage as MOLAP.
    • **Disadvantages: Slower query performance compared to MOLAP, complex data transformations required at runtime.
  3. **HOLAP (Hybrid OLAP):
    • HOLAP combines features of both MOLAP and ROLAP, where detailed data is stored in relational databases, and aggregated data is stored in multidimensional cubes.
    • Examples: **Microsoft SQL Server Analysis Services.
    • **Advantages: Faster query performance for aggregated data, efficient storage for detailed data.
    • **Disadvantages: Requires sophisticated data management and integration between the two data storage models.

**Comparison of OLAP Types:

Feature MOLAP ROLAP HOLAP
**Data Storage Multi-dimensional cubes Relational databases Combination of both
**Query Performance Fast due to pre-aggregated data Slower as calculations are done on-demand Balanced, fast for aggregates
**Storage Efficiency Requires more storage space More efficient storage Optimized, depending on data type
**Complexity Simpler to implement More complex to manage Requires integration of both

**OLAP Implementation:

**Challenges in Implementing OLAP:

  1. **Data Integration: Integrating data from multiple heterogeneous sources, especially when dealing with real-time data, can be challenging.
  2. **Data Quality: Ensuring the accuracy, consistency, and completeness of data used for OLAP is critical.
  3. **Storage Requirements: MOLAP systems, in particular, require large storage for pre-aggregated data, which can be costly.
  4. **Scalability: As data grows, OLAP systems need to efficiently handle increasing complexity and volume, especially in ROLAP systems.
  5. **Performance Issues: Complex queries in ROLAP and HOLAP systems can sometimes result in slower performance compared to MOLAP systems.
  6. **Cost: Implementing an OLAP solution can be expensive, particularly with the need for specialized hardware and software.

Data Transformation

Data transformation is a crucial process in data warehousing that involves converting data from its original format into a format suitable for analysis and querying. It ensures consistency, quality, and accessibility of data within a data warehouse. The transformed data can then be used for decision-making, reporting, and data analysis.

**Types of Data Transformation Techniques

Data transformation techniques are used to clean, prepare, and refine raw data for better usability. Below are some of the most important data transformation methods used in data warehousing:

1. Normalization (Standardization)

**Formula for Normalization:

\text{Normalized Value} = \frac{X - \min(X)}{\max(X) - \min(X)}

Where:

**2. Aggregation:

**Example of Aggregation:

If you have sales data for each day, you can aggregate it by month:

**3. Discretization:

**Example:

**4. Sampling:

**Use Case:

Data Warehouse Modelling

Data Warehouse Modeling refers to the process of designing the structure of a data warehouse that efficiently supports data storage and retrieval for analysis and decision-making. The goal of data warehouse modeling is to organize the data in a way that allows fast querying and effective reporting.

Objectives of Data Warehouse Modeling

Advantages of Data Warehouse Modeling

Disadvantages of Data Warehouse Modeling

Elements of Dimensional Modeling

Dimensional modeling is the most popular method of data warehouse design, and it is based on creating fact and dimension tables.

  1. **Fact: The central data elements that are typically numeric and represent business events or transactions (e.g., sales revenue, quantity sold).
  2. **Dimensions: Descriptive attributes related to the facts that provide context for analysis (e.g., product, time, location).
  3. **Measures: The values that are summarized, aggregated, or analyzed within the fact table (e.g., total sales, average profit).

Fact Table

The **Fact Table contains quantitative data for analysis and is typically the central table in a star or snowflake schema.

**Characteristics of a Fact Table

  1. **Contains Measures: The fact table stores numeric data or aggregated measures such as revenue, units sold, etc.
  2. **Foreign Keys: The fact table contains foreign keys that reference the primary key in the dimension tables.
  3. **Granularity: Defines the level of detail stored in the fact table, such as daily, monthly, or yearly data.
  4. **High Volume: Fact tables typically have a large number of rows because they record events or transactions over time.
  5. **Non-volatile: Once data is inserted into the fact table, it is not updated or deleted.

Dimension Table

The **Dimension Table provides descriptive context to the facts and helps break down the data for better analysis.

**Characteristics of a Dimension Table:

  1. **Contains Attributes: Stores descriptive information such as product names, time, customer, or region.
  2. **Primary Key: Each dimension table has a primary key that uniquely identifies each record in the table.
  3. **Low Volume: Compared to the fact table, dimension tables tend to have fewer rows, but more detailed information.
  4. **Hierarchical Structure: Dimensions can be organized in hierarchies (e.g., Date → Month → Quarter → Year, or Product → Category → Subcategory).
  5. **Non-volatile: Similar to fact tables, the data in dimension tables is generally not updated or deleted once loaded.

Schema for Multidimensional Data Models

A **multidimensional data model is a method of organizing data for analytical processing, enabling the efficient retrieval of data for complex queries. The primary goal is to organize data in such a way that users can easily slice, dice, roll-up, and drill-down into data. This is achieved through **schemas like **Star Schema and **Snowflake Schema, both of which are designed to enhance query performance and simplify the structure of the data warehouse.

Star Schema

The **Star Schema is one of the simplest and most popular types of multidimensional data models. It organizes data into a central **fact table connected to several **dimension tables, creating a star-like structure.

**Fact and Dimension Table in Star Schema

  1. **Fact Table:
    • The fact table is the central table in the schema and contains **quantitative data (e.g., sales, revenue, quantities).
    • It includes **foreign keys that reference the **dimension tables and stores **measures or **facts.
    • Example: A fact table could contain data like total sales, units sold, etc., along with foreign keys to dimensions like **time, **product, and **store.
  2. **Dimension Tables:
    • Dimension tables contain **descriptive information (e.g., time, location, product) related to the facts.
    • Each dimension table has a **primary key, and the attributes define the context for the data.
    • Example: A product dimension might contain product name, product category, and product description.

**Characteristics of Star Schema

  1. **Simplicity: The schema is easy to understand and query due to its straightforward design.
  2. **Performance: It offers faster query performance as it avoids the complexity of joins across multiple tables.
  3. **Redundancy: Data is denormalized, leading to redundant data in dimension tables.
  4. **Intuitive: Easy for users to understand and work with, especially for non-technical business users.

**Advantages of Star Schema

  1. **Faster Queries: Since data is denormalized, queries run faster as they do not require complex joins.
  2. **Simple Design: Easy to design and understand, making it suitable for ad-hoc reporting and analysis.
  3. **Simpler to Use: Users can easily navigate and query the data, especially in business intelligence tools.

**Disadvantages of Star Schema

  1. **Redundancy: Denormalization leads to redundancy in the dimension tables, which can increase storage requirements.
  2. **Maintenance: Updating and maintaining the data can become complex because of the redundancy.
  3. **Less Flexible: Less flexibility for storing hierarchical relationships in the dimension tables.

Snowflake Schema

The **Snowflake Schema is a more complex form of the star schema where dimension tables are **normalized, meaning they are broken down into additional tables to reduce redundancy.

**Characteristics of Snowflake Schema

  1. **Normalization: Dimension tables are split into multiple related tables, reducing redundancy.
  2. **Complex Structure: More tables are required compared to the star schema, making the structure more complex.
  3. **Smaller Dimension Tables: Due to normalization, the size of dimension tables is smaller compared to the star schema.
  4. **Better Data Integrity: Reduced redundancy leads to better data integrity and consistency.

**Advantages of Snowflake Schema

  1. **Reduced Redundancy: Normalizing dimension tables reduces duplication of data, saving storage space.
  2. **Better Data Integrity: The schema ensures data consistency across the dimension tables by eliminating redundancy.
  3. **Smaller Data Storage: Since data is split into smaller, more efficient tables, it can lead to better storage management.

**Disadvantages of Snowflake Schema

  1. **Complexity: Snowflake schemas are more complex to design and maintain due to the additional normalization.
  2. **Slower Queries: More joins are required between tables, which can slow down query performance.
  3. **Difficult for Users: Users may find it harder to understand and query due to the more complex table relationships.

**Storage Efficiency in Snowflake Schema

**Concept Hierarchies in Data Warehousing

In **data warehousing, **concept hierarchies are used to represent different levels of abstraction within the data. These hierarchies provide a way to model relationships among data and allow for better data analysis by grouping information at various levels. Hierarchical models are key to organizing data efficiently in a way that makes it easier to retrieve and analyze data based on specific criteria.

**Hierarchical Model:

**Basic Concepts:

  1. **Levels of Abstraction: Concept hierarchies represent data at various levels of abstraction. For example, **year > quarter > month > day for time-related data.
  2. **Parent-Child Relationship: In the hierarchy, each data entity (parent) can have one or more related entities (children), forming a tree structure.
  3. **Aggregation: Data at lower levels can be aggregated to form higher levels. For example, monthly sales can be aggregated to form quarterly or yearly sales.

**Tree-Structure Diagrams:

**Data Retrieval Facility:

**Update Facility:

**Virtual Records:

**Mapping of Hierarchies to Files:

**The IMS Database System:

Measures in Data Warehousing

In data warehousing, measures refer to the quantitative data that is used for analysis and reporting. These measures are the core of any data analysis process, as they provide the numerical information that is aggregated, computed, and analyzed. Measures are typically stored in fact tables and are essential for generating reports, business intelligence insights, and decision-making.

Measures in data warehousing can be classified into three primary categories based on how they are computed and aggregated. These categories help determine how measures are processed during data analysis.

1. Holistic Measures

**2. Distributive Measures:

**3. Algebraic Measures:

Importance of Categorizing Measures