
Last Minute Notes (LMNs) - Data Warehousing

Last Updated : 24 Jan, 2025

A Data Warehouse (DW) is a centralized system that stores large amounts of structured data from various sources, optimized for analysis, reporting, and decision-making. Unlike transactional databases, which handle daily operations, a data warehouse focuses on analytical processing. These Last Minute Notes provide concise summaries of key Data Warehousing concepts, including warehouse architecture, the ETL process, OLAP, and multidimensional schemas, for quick and effective revision before exams.

Data Warehousing Basics

Data warehousing involves collecting data from various operational sources, transforming it into a suitable format, and storing it in a central repository. It supports complex queries and analytics by providing historical, integrated, and summarized data for reporting.

Need for Data Warehousing

  1. Consolidated Data: Collects data from multiple sources into one place, making it easier for analysis.
  2. Faster Decision Making: Provides timely and accurate insights for business decisions.
  3. Historical Analysis: Allows businesses to analyze trends over time.
  4. Improved Data Quality: Cleans and structures data, ensuring consistency.

Characteristics of Data Warehousing

  1. Subject-Oriented: Organized around key business subjects (sales, inventory, finance).
  2. Integrated: Combines data from multiple sources in a consistent format.
  3. Time-Variant: Stores historical data to analyze trends over different periods.
  4. Non-Volatile: Once data is loaded, it’s read-only, ensuring data consistency over time.

Functions of Data Warehousing

  1. Data Consolidation: Combines data from different systems into a single data store.
  2. Data Analysis: Supports complex queries and reports for decision support.
  3. Data Mining: Extracts patterns and trends from large datasets.
  4. OLAP: Performs multi-dimensional analysis, allowing data to be viewed from various perspectives.

Types of Data Warehouses

  1. Enterprise Data Warehouse (EDW): A centralized warehouse containing all the organizational data for reporting across departments.
  2. Data Mart: A smaller, focused subset of the data warehouse, catering to specific departments or business units (e.g., marketing, finance).
  3. Virtual Data Warehouse: A warehouse built using virtual views and real-time queries over the operational database rather than storing physical data.

Data Warehousing vs DBMS

  • Data Warehousing: Optimized for read-heavy operations, analytical queries, and reporting on historical data.
  • DBMS: Optimized for transactional systems, handling daily operations, and supporting write-heavy operations (CRUD operations).

| Feature | Data Warehousing | DBMS (Database Management System) |
|---|---|---|
| Purpose | Analytical queries and reporting | Transactional processing (CRUD) |
| Data Type | Historical, integrated, and summarized data | Current, operational data |
| Query Type | Complex queries, ad-hoc reports | Simple queries for data modification |
| Time-Variance | Stores historical data | Stores current data |

Data Marts

A Data Mart is a smaller, more focused subset of a data warehouse, catering to a specific department or business unit (e.g., marketing, finance). It provides faster access to relevant data for departmental analysis.

Data Warehouse vs Data Mart

  • Data Warehouse: Contains enterprise-wide data for organization-wide analysis.
  • Data Mart: Contains specialized data for specific departments or user groups.

| Feature | Data Warehouse | Data Mart |
|---|---|---|
| Scope | Enterprise-wide | Department-specific |
| Data Volume | Larger, more comprehensive | Smaller, focused |
| Query Complexity | More complex, cross-departmental analysis | Simpler, departmental-focused queries |

ETL Process (Extract, Transform, Load)

The ETL (Extract, Transform, Load) process is fundamental to data warehousing. It involves:

  1. Extract: Gathering data from various source systems.
  2. Transform: Cleaning, filtering, and transforming data into a consistent format.
  3. Load: Storing the transformed data into the data warehouse for analysis.
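
The three steps above can be sketched in Python. This is a minimal illustration using hypothetical in-memory records in place of real source systems and a real warehouse database:

```python
# Hypothetical rows extracted from an operational source system.
raw_orders = [
    {"id": 1, "amount": "100.50", "region": " north "},
    {"id": 2, "amount": "250.00", "region": "South"},
    {"id": 3, "amount": None, "region": "north"},  # dirty row: missing amount
]

def extract(source):
    """Extract: read rows from the source system (here, an in-memory list)."""
    return list(source)

def transform(rows):
    """Transform: drop incomplete rows, cast types, standardize text format."""
    clean = []
    for row in rows:
        if row["amount"] is None:  # filter out dirty records
            continue
        clean.append({
            "id": row["id"],
            "amount": float(row["amount"]),           # consistent numeric type
            "region": row["region"].strip().lower(),  # consistent text format
        })
    return clean

def load(rows, warehouse):
    """Load: append the transformed rows into the warehouse table."""
    warehouse.extend(rows)
    return warehouse

warehouse_table = load(transform(extract(raw_orders)), [])
```

In a real pipeline, extract would read from databases or files and load would write to the warehouse, but the extract-transform-load sequence is the same.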

Data Warehouse Architecture

Data warehouse architecture consists of three primary layers:

  1. Bottom Tier: The data warehouse database where all data is stored.
    • Often implemented using relational databases or specialized columnar databases.
  2. Middle Tier: OLAP Servers (Online Analytical Processing), which facilitate complex querying and analysis.
    • ROLAP (Relational OLAP): Works directly with relational databases.
    • MOLAP (Multidimensional OLAP): Pre-aggregates data in cubes for fast access.
    • HOLAP (Hybrid OLAP): Combines both ROLAP and MOLAP.
  3. Top Tier: Front-End Tools for querying, reporting, and visualization, allowing users to interact with the data.

Operational and Informational Systems

  • Operational Systems: Handle daily transactions (e.g., sales, inventory) and are designed for real-time data processing.
  • Informational Systems (Data Warehouses): Used for querying historical data to make strategic decisions, supporting analytics and business intelligence.

| Aspect | Operational Systems | Informational Systems (DW) |
|---|---|---|
| Purpose | Process daily transactions | Analyze historical and summarized data |
| Data Type | Real-time, transactional data | Historical, aggregated data |
| Processing Speed | High (real-time operations) | Optimized for complex queries and analysis |

Derived Data

  • Derived Data is data that is derived from existing data through processes like aggregation, summarization, and transformation.
  • Examples include: monthly sales totals, average customer ratings, aggregate profit margins.

OLAP Technology

OLAP is a technology used for multidimensional data analysis, enabling users to interactively analyze large datasets from multiple perspectives. It allows fast querying and reporting by organizing data in a multidimensional cube, which makes it easier for businesses to derive insights from large data sets.

Features of OLAP

  1. Multidimensional Analysis: OLAP provides an interface for analyzing data across multiple dimensions (e.g., time, geography, product).
  2. Data Cube: A multi-dimensional array where each cell represents a data measure (e.g., sales) for specific dimensions (e.g., region, time).
  3. Measures: Quantitative data (e.g., sales, profit) stored in the OLAP cube.
  4. Dimensions: The perspectives (e.g., time, location) from which the data is analyzed.
  5. Drill-Down and Roll-Up: Drill-down refers to navigating from summary data to more detailed data, while roll-up summarizes detailed data into higher-level aggregates.

Types of OLAP Operations:

  1. Slice: Extracts a sub-cube by fixing a single dimension to a specific value (e.g., sales for the year 2022).
  2. Dice: Selects a sub-cube by applying conditions on two or more dimensions (e.g., sales of selected products in 2021 and 2022).
  3. Drill-Down: Navigates from summary data to more detailed data.
  4. Roll-Up: Summarizes data, typically by climbing up the hierarchy of dimensions (e.g., from daily sales to monthly sales).
  5. Pivot (Rotate): Changes the orientation of the cube to view data from different perspectives.
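
Three of these operations can be sketched over a tiny cube stored as a Python dict keyed by (region, year, product); the sales figures are hypothetical:

```python
# A tiny data cube: (region, year, product) -> sales amount.
cube = {
    ("North", 2021, "A"): 100, ("North", 2021, "B"): 150,
    ("North", 2022, "A"): 120, ("South", 2021, "A"): 80,
    ("South", 2022, "A"): 90,  ("South", 2022, "B"): 60,
}

def slice_cube(cube, year):
    """Slice: fix the year dimension to a single value."""
    return {(r, p): v for (r, y, p), v in cube.items() if y == year}

def dice_cube(cube, regions, years):
    """Dice: select a sub-cube with conditions on several dimensions."""
    return {k: v for k, v in cube.items() if k[0] in regions and k[1] in years}

def roll_up(cube):
    """Roll-up: aggregate away the product dimension, giving (region, year) totals."""
    totals = {}
    for (r, y, p), v in cube.items():
        totals[(r, y)] = totals.get((r, y), 0) + v
    return totals

sales_2022 = slice_cube(cube, 2022)
region_year_totals = roll_up(cube)
```

Drill-down is the inverse of roll-up (going back to the per-product detail), and pivot is only a change of presentation, so neither needs a separate computation here.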

Types of OLAP:

  1. MOLAP (Multidimensional OLAP):
    • MOLAP systems store data in a multidimensional cube format, which allows fast processing of queries by pre-aggregating data.
    • Examples: Microsoft Analysis Services, IBM Cognos.
    • Advantages: Faster query performance, better at handling complex calculations, uses pre-aggregation of data.
    • Disadvantages: More storage space required to store pre-aggregated data.
  2. ROLAP (Relational OLAP):
    • ROLAP systems store data in relational databases and perform real-time calculations on the data as queries are processed.
    • Examples: Oracle OLAP, SAP BusinessObjects.
    • Advantages: Can handle large data sets, doesn't require as much storage as MOLAP.
    • Disadvantages: Slower query performance compared to MOLAP, complex data transformations required at runtime.
  3. HOLAP (Hybrid OLAP):
    • HOLAP combines features of both MOLAP and ROLAP, where detailed data is stored in relational databases, and aggregated data is stored in multidimensional cubes.
    • Examples: Microsoft SQL Server Analysis Services.
    • Advantages: Faster query performance for aggregated data, efficient storage for detailed data.
    • Disadvantages: Requires sophisticated data management and integration between the two data storage models.

Comparison of OLAP Types:

| Feature | MOLAP | ROLAP | HOLAP |
|---|---|---|---|
| Data Storage | Multidimensional cubes | Relational databases | Combination of both |
| Query Performance | Fast due to pre-aggregated data | Slower, as calculations are done on demand | Balanced; fast for aggregates |
| Storage Efficiency | Requires more storage space | More efficient storage | Optimized, depending on data type |
| Complexity | Simpler to implement | More complex to manage | Requires integration of both |

OLAP Implementation:

  • MOLAP Systems: Implemented using dedicated OLAP servers that store data in multidimensional cubes. Examples include Essbase and Microsoft SQL Server Analysis Services.
  • ROLAP Systems: Use relational databases like Oracle and SQL Server, which perform queries on demand, accessing data stored in tables.
  • HOLAP Systems: A combination of MOLAP and ROLAP, providing a hybrid approach by storing detailed data in relational format and aggregated data in a multidimensional cube. Examples: SQL Server HOLAP.

Challenges in Implementing OLAP:

  1. Data Integration: Integrating data from multiple heterogeneous sources, especially when dealing with real-time data, can be challenging.
  2. Data Quality: Ensuring the accuracy, consistency, and completeness of data used for OLAP is critical.
  3. Storage Requirements: MOLAP systems, in particular, require large storage for pre-aggregated data, which can be costly.
  4. Scalability: As data grows, OLAP systems need to efficiently handle increasing complexity and volume, especially in ROLAP systems.
  5. Performance Issues: Complex queries in ROLAP and HOLAP systems can sometimes result in slower performance compared to MOLAP systems.
  6. Cost: Implementing an OLAP solution can be expensive, particularly with the need for specialized hardware and software.

Data Transformation

Data transformation is a crucial process in data warehousing that involves converting data from its original format into a format suitable for analysis and querying. It ensures consistency, quality, and accessibility of data within a data warehouse. The transformed data can then be used for decision-making, reporting, and data analysis.

Types of Data Transformation Techniques

Data transformation techniques are used to clean, prepare, and refine raw data for better usability. Below are some of the most important data transformation methods used in data warehousing:

1. Normalization (Standardization)

  • Normalization is the process of adjusting the scale of data to fit a specific range, typically between 0 and 1.
  • It is especially useful for algorithms that rely on data scaling (e.g., machine learning models).
  • Standardization (a type of normalization) transforms data so that it has a mean of 0 and a standard deviation of 1.
  • Use Case: Standardizing sales data across different regions for fair comparison.

Formula for Normalization:

Normalized Value = (X - min(X)) / (max(X) - min(X))


Where:

  • X is the original value.
  • min(X) and max(X) are the minimum and maximum values of the data.
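
The formula above can be implemented directly. The sketch below also includes standardization (mean 0, standard deviation 1) using only the standard library:

```python
import statistics

def min_max_normalize(values):
    """Scale values into the [0, 1] range: (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:  # avoid division by zero on constant data
        return [0.0 for _ in values]
    return [(x - lo) / (hi - lo) for x in values]

def standardize(values):
    """Transform values to mean 0 and standard deviation 1 (z-scores)."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # population standard deviation
    return [(x - mean) / sd for x in values]

normalized = min_max_normalize([10, 20, 30])  # -> [0.0, 0.5, 1.0]
```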

2. Aggregation:

  • Aggregation involves combining data from multiple records into a summary format, often using functions like sum, average, count, etc.
  • It helps to reduce data size while retaining important information, making it easier to analyze.
  • Example: Calculating monthly sales from daily transaction data.

Example of Aggregation:

If you have sales data for each day, you can aggregate it by month:

  • Sum of sales per day → Monthly sales
  • Average sales per region → Regional sales performance.
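
A minimal Python sketch of this aggregation, rolling hypothetical daily records up to monthly totals:

```python
from collections import defaultdict

# Hypothetical daily sales records: (date "YYYY-MM-DD", amount).
daily_sales = [
    ("2024-01-05", 100), ("2024-01-20", 150),
    ("2024-02-03", 200), ("2024-02-28", 50),
]

def monthly_totals(rows):
    """Aggregate daily amounts into per-month sums."""
    totals = defaultdict(int)
    for date, amount in rows:
        month = date[:7]  # "YYYY-MM" grouping key
        totals[month] += amount
    return dict(totals)

by_month = monthly_totals(daily_sales)  # {"2024-01": 250, "2024-02": 250}
```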

3. Discretization:

  • Discretization is the process of converting continuous data into discrete categories or ranges.
  • This technique is used when dealing with continuous numerical values that need to be grouped into predefined categories or intervals.
  • Use Case: Converting age values into ranges such as "Under 18", "18-30", "31-50", "Above 50".

Example:

  • Continuous Data: 25, 32, 50
  • Discretized Data: "18-30", "31-50", "Above 50"
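
This binning can be sketched with the age ranges above; the exact bin edges below are assumptions for illustration:

```python
import bisect

def discretize_age(age):
    """Map a continuous age to one of the discrete ranges."""
    bounds = [18, 31, 51]  # assumed bin edges: <18, 18-30, 31-50, >=51
    labels = ["Under 18", "18-30", "31-50", "Above 50"]
    return labels[bisect.bisect_right(bounds, age)]

categories = [discretize_age(a) for a in (25, 32, 50)]
# -> ["18-30", "31-50", "31-50"]
```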

4. Sampling:

  • Sampling involves selecting a subset of data for analysis instead of using the entire dataset.
  • This helps in reducing the computational complexity and time when dealing with very large datasets.
  • Common types of sampling include random sampling, systematic sampling, and stratified sampling.

Use Case:

  • Instead of analyzing all customer transactions, a sample of 10% of transactions may be used to identify purchasing patterns.
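
A sketch of simple random sampling with a fixed seed for reproducibility, over a stand-in dataset:

```python
import random

transactions = list(range(1000))  # stand-in for a large transaction table

def sample_fraction(rows, fraction, seed=42):
    """Random sampling: pick a reproducible subset of the data."""
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    k = int(len(rows) * fraction)
    return rng.sample(rows, k)

subset = sample_fraction(transactions, 0.10)  # 10% sample -> 100 rows
```

Systematic sampling (every n-th row) and stratified sampling (sampling within each group) follow the same idea with a different selection rule.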

Data Warehouse Modelling

Data Warehouse Modeling refers to the process of designing the structure of a data warehouse that efficiently supports data storage and retrieval for analysis and decision-making. The goal of data warehouse modeling is to organize the data in a way that allows fast querying and effective reporting.

Objectives of Data Warehouse Modeling

  • Ease of Access: Design the schema to make it easy for users to retrieve data for analysis.
  • Data Quality: Ensure the data is accurate, consistent, and clean.
  • Optimized Performance: Create a structure that minimizes the time it takes to run queries and reports.
  • Scalability: Design the data warehouse to handle increasing amounts of data as the organization grows.
  • Business Understanding: Model the data according to business needs, ensuring the structure aligns with how users want to analyze data.

Advantages of Data Warehouse Modeling

  • Efficient Data Storage: Helps to organize data efficiently, minimizing redundancy.
  • Improved Query Performance: Optimized structures lead to faster query responses.
  • Simplified Reporting: Simplifies complex reports by providing a clear, organized view of the data.
  • Data Consistency: Ensures consistency across multiple business units and systems.
  • Better Business Insights: Models are designed to align with business strategies, enabling better decision-making.

Disadvantages of Data Warehouse Modeling

  • Complexity: Designing and maintaining an efficient data warehouse model can be complex.
  • Costly: The initial setup and ongoing maintenance of a data warehouse can be expensive.
  • Time Consuming: The process of creating and implementing the model can take a significant amount of time.
  • Data Redundancy: In some cases, excessive denormalization might cause data redundancy.

Elements of Dimensional Modeling

Dimensional modeling is the most popular method of data warehouse design, and it is based on creating fact and dimension tables.

  1. Fact: The central data elements that are typically numeric and represent business events or transactions (e.g., sales revenue, quantity sold).
  2. Dimensions: Descriptive attributes related to the facts that provide context for analysis (e.g., product, time, location).
  3. Measures: The values that are summarized, aggregated, or analyzed within the fact table (e.g., total sales, average profit).

Fact Table

The Fact Table contains quantitative data for analysis and is typically the central table in a star or snowflake schema.

Characteristics of a Fact Table

  1. Contains Measures: The fact table stores numeric data or aggregated measures such as revenue, units sold, etc.
  2. Foreign Keys: The fact table contains foreign keys that reference the primary key in the dimension tables.
  3. Granularity: Defines the level of detail stored in the fact table, such as daily, monthly, or yearly data.
  4. High Volume: Fact tables typically have a large number of rows because they record events or transactions over time.
  5. Non-volatile: Once data is inserted into the fact table, it is not updated or deleted.

Dimension Table

The Dimension Table provides descriptive context to the facts and helps break down the data for better analysis.

Characteristics of a Dimension Table:

  1. Contains Attributes: Stores descriptive information such as product names, time, customer, or region.
  2. Primary Key: Each dimension table has a primary key that uniquely identifies each record in the table.
  3. Low Volume: Compared to the fact table, dimension tables tend to have fewer rows, but more detailed information.
  4. Hierarchical Structure: Dimensions can be organized in hierarchies (e.g., Date → Month → Quarter → Year, or Product → Category → Subcategory).
  5. Relatively Stable: Dimension data changes infrequently compared to fact data; when attributes do change, slowly changing dimension (SCD) techniques are used to manage the updates.

Schema for Multidimensional Data Models

A multidimensional data model is a method of organizing data for analytical processing, enabling the efficient retrieval of data for complex queries. The primary goal is to organize data in such a way that users can easily slice, dice, roll-up, and drill-down into data. This is achieved through schemas like Star Schema and Snowflake Schema, both of which are designed to enhance query performance and simplify the structure of the data warehouse.

Star Schema

The Star Schema is one of the simplest and most popular types of multidimensional data models. It organizes data into a central fact table connected to several dimension tables, creating a star-like structure.

Fact and Dimension Table in Star Schema

  1. Fact Table:
    • The fact table is the central table in the schema and contains quantitative data (e.g., sales, revenue, quantities).
    • It includes foreign keys that reference the dimension tables and stores measures or facts.
    • Example: A fact table could contain data like total sales, units sold, etc., along with foreign keys to dimensions like time, product, and store.
  2. Dimension Tables:
    • Dimension tables contain descriptive information (e.g., time, location, product) related to the facts.
    • Each dimension table has a primary key, and the attributes define the context for the data.
    • Example: A product dimension might contain product name, product category, and product description.
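
The fact/dimension relationship can be illustrated with small in-memory tables (all data hypothetical): the fact table holds foreign keys and measures, and a query joins it to a dimension to add context:

```python
# Dimension tables keyed by surrogate keys.
dim_product = {1: {"name": "Widget", "category": "Tools"},
               2: {"name": "Gadget", "category": "Electronics"}}
dim_store = {10: {"city": "Pune"}, 11: {"city": "Delhi"}}

# Fact table rows hold foreign keys plus the measures.
fact_sales = [
    {"product_key": 1, "store_key": 10, "units": 5, "revenue": 50.0},
    {"product_key": 2, "store_key": 10, "units": 2, "revenue": 80.0},
    {"product_key": 1, "store_key": 11, "units": 3, "revenue": 30.0},
]

def revenue_by_category(facts, products):
    """Join fact rows to the product dimension and sum revenue per category."""
    totals = {}
    for row in facts:
        category = products[row["product_key"]]["category"]
        totals[category] = totals.get(category, 0.0) + row["revenue"]
    return totals

by_category = revenue_by_category(fact_sales, dim_product)
# -> {"Tools": 80.0, "Electronics": 80.0}
```

In SQL this would be a single join between the fact table and the product dimension followed by a GROUP BY on category.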

Characteristics of Star Schema

  1. Simplicity: The schema is easy to understand and query due to its straightforward design.
  2. Performance: It offers faster query performance as it avoids the complexity of joins across multiple tables.
  3. Redundancy: Data is denormalized, leading to redundant data in dimension tables.
  4. Intuitive: Easy for users to understand and work with, especially for non-technical business users.

Advantages of Star Schema

  1. Faster Queries: Since data is denormalized, queries run faster as they do not require complex joins.
  2. Simple Design: Easy to design and understand, making it suitable for ad-hoc reporting and analysis.
  3. Simpler to Use: Users can easily navigate and query the data, especially in business intelligence tools.

Disadvantages of Star Schema

  1. Redundancy: Denormalization leads to redundancy in the dimension tables, which can increase storage requirements.
  2. Maintenance: Updating and maintaining the data can become complex because of the redundancy.
  3. Less Flexible: Less flexibility for storing hierarchical relationships in the dimension tables.

Snowflake Schema

The Snowflake Schema is a more complex form of the star schema where dimension tables are normalized, meaning they are broken down into additional tables to reduce redundancy.

Characteristics of Snowflake Schema

  1. Normalization: Dimension tables are split into multiple related tables, reducing redundancy.
  2. Complex Structure: More tables are required compared to the star schema, making the structure more complex.
  3. Smaller Dimension Tables: Due to normalization, the size of dimension tables is smaller compared to the star schema.
  4. Better Data Integrity: Reduced redundancy leads to better data integrity and consistency.
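
The effect of normalization can be shown by splitting a denormalized product dimension into separate product and category tables (hypothetical data); note the extra join now needed to resolve a product:

```python
# Star schema: one denormalized dimension repeats the category details per product.
star_dim_product = [
    {"key": 1, "name": "Widget", "category": "Tools", "category_desc": "Hand tools"},
    {"key": 2, "name": "Hammer", "category": "Tools", "category_desc": "Hand tools"},
]

# Snowflake schema: the repeated category details move into their own table.
snow_dim_product = [{"key": 1, "name": "Widget", "category_key": 100},
                    {"key": 2, "name": "Hammer", "category_key": 100}]
snow_dim_category = {100: {"category": "Tools", "category_desc": "Hand tools"}}

def product_with_category(product_key):
    """Resolving a product in the snowflake schema needs an extra join."""
    product = next(p for p in snow_dim_product if p["key"] == product_key)
    category = snow_dim_category[product["category_key"]]
    return {**product, **category}
```

The category description is now stored once instead of once per product, which is exactly the storage-vs-join trade-off described above.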

Advantages of Snowflake Schema

  1. Reduced Redundancy: Normalizing dimension tables reduces duplication of data, saving storage space.
  2. Better Data Integrity: The schema ensures data consistency across the dimension tables by eliminating redundancy.
  3. Smaller Data Storage: Since data is split into smaller, more efficient tables, it can lead to better storage management.

Disadvantages of Snowflake Schema

  1. Complexity: Snowflake schemas are more complex to design and maintain due to the additional normalization.
  2. Slower Queries: More joins are required between tables, which can slow down query performance.
  3. Difficult for Users: Users may find it harder to understand and query due to the more complex table relationships.

Storage Efficiency in Snowflake Schema

  • Space-Efficient: The normalization reduces the redundancy and overall size of dimension tables, which can save storage space.
  • Increased Management: However, due to its more complex structure, managing and updating the schema can be more time-consuming.

Concept Hierarchies in Data Warehousing

In data warehousing, concept hierarchies are used to represent different levels of abstraction within the data. These hierarchies provide a way to model relationships among data and allow for better data analysis by grouping information at various levels. Hierarchical models are key to organizing data efficiently in a way that makes it easier to retrieve and analyze data based on specific criteria.

Hierarchical Model:

  • A hierarchical model organizes data in a tree-like structure, where each record is a node connected to other nodes (parent-child relationship). This model is used to represent data in a way that is easy to navigate and understand.
  • Example: In a sales data hierarchy, the levels could include region > country > state > city > store.

Basic Concepts:

  1. Levels of Abstraction: Concept hierarchies represent data at various levels of abstraction. For example, year > quarter > month > day for time-related data.
  2. Parent-Child Relationship: In the hierarchy, each data entity (parent) can have one or more related entities (children), forming a tree structure.
  3. Aggregation: Data at lower levels can be aggregated to form higher levels. For example, monthly sales can be aggregated to form quarterly or yearly sales.
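
Aggregation along a hierarchy can be sketched as repeated roll-ups, here from store to city to region using hypothetical mappings and sales figures:

```python
# Hierarchy mappings: store -> city -> region.
store_to_city = {"S1": "Pune", "S2": "Pune", "S3": "Delhi"}
city_to_region = {"Pune": "West", "Delhi": "North"}

store_sales = {"S1": 100, "S2": 50, "S3": 70}

def roll_up(values, mapping):
    """Aggregate one level of the hierarchy into its parent level."""
    parent_totals = {}
    for child, amount in values.items():
        parent = mapping[child]
        parent_totals[parent] = parent_totals.get(parent, 0) + amount
    return parent_totals

city_sales = roll_up(store_sales, store_to_city)    # {"Pune": 150, "Delhi": 70}
region_sales = roll_up(city_sales, city_to_region)  # {"West": 150, "North": 70}
```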

Tree-Structure Diagrams:

  • A tree-structure diagram visually represents the hierarchy. The top node (root) represents the highest level of abstraction, and child nodes represent lower levels.
  • Example:
    • Root: Organization
    • Level 1: Departments (Sales, Marketing, HR)
    • Level 2: Teams (Sales North, Sales South)

Data Retrieval Facility:

  • Data retrieval in concept hierarchies is simplified because data can be accessed at different levels of the hierarchy.
  • Example: To retrieve sales data for a specific region, you can simply query the region level and retrieve aggregated data, without needing to retrieve data for every individual store.

Update Facility:

  • The update facility ensures that changes to data at one level of the hierarchy are reflected in related levels.
  • For example, updating the sales amount at the store level should update the corresponding region and department levels to maintain consistency.
  • Cascading Updates: Changes made at one level automatically propagate to the levels above or below in the hierarchy.

Virtual Records:

  • Virtual records are used in concept hierarchies to represent abstracted or aggregated data without physically storing it. They can be computed dynamically during data retrieval.
  • Example: A virtual record for total sales in a region might not exist as a physical record but is generated by aggregating sales data from all stores within the region.

Mapping of Hierarchies to Files:

  • Mapping concept hierarchies to files is crucial for efficient data storage and retrieval. In a hierarchical model, data can be mapped to a series of files or tables corresponding to different levels of the hierarchy.
  • Example: A region-level file may contain records for regions, while a store-level file contains records for each store.

The IMS Database System:

  • The IMS (Information Management System) database system is a widely used system that supports hierarchical data models.
  • IMS enables organizations to store and manage hierarchical data and retrieve it efficiently.
  • It is based on the hierarchical data model, representing data as tree-structured parent-child segments that applications navigate through its DL/I (Data Language/I) interface.

Measures in Data Warehousing

In data warehousing, measures refer to the quantitative data that is used for analysis and reporting. These measures are the core of any data analysis process, as they provide the numerical information that is aggregated, computed, and analyzed. Measures are typically stored in fact tables and are essential for generating reports, business intelligence insights, and decision-making.

Measures in data warehousing can be classified into three primary categories based on how they are computed and aggregated. These categories help determine how measures are processed during data analysis.

1. Holistic Measures

  • Definition: Holistic measures cannot be computed by a simple function (like sum or average) applied to subsets of the data. They often require processing the entire dataset.
  • Example: A measure like "Median" or "Mode" is holistic because it requires all data points to be considered before computing.
  • Use Case: Complex metrics, such as calculating the overall growth rate of a company over time, often require holistic measures.

2. Distributive Measures:

  • Definition: Distributive measures can be computed by applying an aggregate function (e.g., sum, average) to a subset of data, and then combining the results across all subsets.
  • Example: Measures like "Sum", "Count", or "Average" are distributive because they can be computed for subgroups of data (e.g., sales for individual stores), and then summed or averaged to get the overall value.
  • Use Case: In sales analysis, you can calculate the total revenue for each store, and then add them up to get the overall sales revenue for all stores.

3. Algebraic Measures:

  • Definition: Algebraic measures are computed by applying an algebraic function to a bounded number of distributive aggregates. For example, the average is algebraic because it equals sum / count, where both sum and count are distributive.
  • Example: "Variance" and "Standard Deviation" are algebraic measures because they involve applying mathematical functions (e.g., sum of squares) to subsets of data and combining them.
  • Use Case: In analyzing the consistency of product sales, variance or standard deviation can provide insights into how sales fluctuate over time or across regions.
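
The difference between distributive and algebraic measures can be shown by combining per-store partial results (hypothetical numbers): sums and counts merge directly, while average and variance are functions of those merged parts:

```python
import math

# Per-store partial results: each entry is (sum, sum_of_squares, count).
# Store 1 saw values [20, 25, 25, 30]; store 2 saw [15, 20, 25].
partials = [(100.0, 2550.0, 4), (60.0, 1250.0, 3)]

# Distributive: global SUM and COUNT are just sums of the partial results.
total = sum(s for s, _, _ in partials)
count = sum(n for _, _, n in partials)
sum_sq = sum(q for _, q, _ in partials)

# Algebraic: AVERAGE and VARIANCE combine a fixed number of distributive parts.
average = total / count
variance = sum_sq / count - average ** 2  # E[X^2] - (E[X])^2
std_dev = math.sqrt(variance)
```

A holistic measure like the median cannot be recovered from such per-store summaries; it needs the full set of values, which is why holistic measures are the most expensive to compute.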

Importance of Categorizing Measures

  • Performance Optimization: Understanding the category of a measure helps in optimizing query performance. Distributive measures are faster and easier to compute as they can be aggregated in parts, while holistic measures require more processing time.
  • Scalability: Categorizing measures allows for scaling data processing, especially when dealing with large datasets. For example, distributive measures can be computed incrementally and parallelly across different servers.
  • Appropriate Computations: By knowing the category of a measure, you can choose the right computation method. For example, algebraic measures may require advanced techniques like sorting or grouping, while distributive measures only need simple aggregation.
