Data Warehousing & Big Data Concepts
Data warehousing is the practice of consolidating large volumes of historical data from disparate sources into a central, analytically optimized repository. Big Data extends this to datasets whose volume, velocity, and variety exceed the capabilities of conventional database tools, demanding distributed processing frameworks like Hadoop and Spark.
This guide covers OLTP vs. OLAP systems, dimensional modeling (star and snowflake schemas), ETL and ELT pipelines, OLAP cube operations, the Hadoop/Spark ecosystem, and a 10-question practice quiz.
1Introduction
In the contemporary data-driven landscape, organizations leverage vast quantities of information to derive actionable insights. This necessitates specialized systems distinct from traditional operational databases. A data warehouse (DW) is a subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management's decision-making process [Inmon, 1992].
Big Data refers to datasets whose size, velocity, and variety exceed the capabilities of conventional database software tools to capture, store, manage, and analyze [Laney, 2001]. It encompasses not only sheer volume but also rapid generation rates and diverse formats, necessitating distributed processing frameworks such as Apache Hadoop and Apache Spark.
A large retail chain uses an OLTP system to process millions of daily sales transactions with full ACID guarantees. Concurrently, a data warehouse aggregates this transactional data nightly to enable business analysts to identify top-selling products by region, analyze seasonal trends, and optimize inventory -- workloads that would cripple the operational database if run directly against it.
Why Separate Systems?
Performance
OLAP queries scan billions of rows. Mixing with OLTP transactions creates resource contention and degrades both workloads.
Schema Design
OLTP favors normalized schemas for write efficiency. OLAP favors denormalized star schemas for read performance.
Historical Data
OLTP systems often purge old data. Data warehouses are non-volatile -- data is kept for years of historical trend analysis.
2Key Definitions
Essential terminology for data warehousing and big data at the university level.
Data Warehouse (DW)
Subject-oriented, integrated, time-variant, and non-volatile repository for management decision-making [Inmon, 1992].
Data Mart (DM)
A subset of the DW focused on a specific department or business function (e.g., Sales Mart, Finance Mart).
Data Lake
Repository storing raw data in its native format (structured, semi-structured, unstructured) with schema-on-read.
ETL
Extract, Transform, Load -- the classic three-phase pipeline for moving data from sources to the warehouse.
ELT
Extract, Load, Transform -- loads raw data into a powerful target first, then transforms within that system.
OLAP
Online Analytical Processing -- tools enabling fast, interactive, multidimensional analysis of historical data.
OLTP
Online Transaction Processing -- operational systems optimized for high-volume short read/write transactions with ACID guarantees.
Star Schema
Central fact table surrounded by denormalized dimension tables, joined via foreign keys. The most common DW model.
Snowflake Schema
Extension of the star schema where dimension tables are normalized into multiple related tables, reducing redundancy.
Fact Table
Central table containing quantitative measures (SalesAmount, Quantity) and foreign keys to all dimension tables.
Dimension Table
Contains descriptive attributes about business entities (product names, customer demographics, calendar dates).
Data Cube
Multidimensional array of data where each axis is a business dimension and cells contain aggregated measures.
SCD (Slowly Changing Dimension)
A dimension whose attributes change over time. Types 1/2/3 handle history by overwriting, adding rows, or adding columns.
HDFS
Hadoop Distributed File System -- stores petabyte-scale files across commodity hardware clusters with replication.
MapReduce
Programming model for parallel batch processing: Map phase (filter/sort) + Reduce phase (aggregate/summarize).
Apache Spark
Fast distributed processing engine using in-memory RDDs, DataFrames, and SparkSQL -- up to 100x faster than MapReduce for iterative workloads.
3OLTP vs. OLAP
The fundamental architectural divide in database systems separates systems optimized for operational transactions from those optimized for analytical queries. Understanding these differences is essential for designing the right system for each workload.
| Characteristic | OLTP | OLAP |
|---|---|---|
| Purpose | Day-to-day operations | Strategic analysis & reporting |
| Operations | INSERT, UPDATE, DELETE | SELECT with aggregations |
| Query Complexity | Simple, short transactions | Complex, long-running queries |
| Data Volume | Current data (small per query) | Historical data (billions of rows) |
| Schema | Normalized (3NF) | Denormalized (star/snowflake) |
| Concurrency | Thousands of users | Dozens of analysts |
| Consistency | Strict ACID | Eventual (often acceptable) |
| Response Time | Milliseconds | Seconds to minutes |
| Examples | E-commerce checkout, ATM | Sales dashboard, forecasting |
OLTP Characteristics
- High write throughput with full ACID
- Normalized schema minimizes redundancy
- Row-oriented storage optimized for updates
- Small result sets per query
- Examples: PostgreSQL, MySQL, Oracle
OLAP Characteristics
- Read-heavy with large table scans
- Denormalized schema reduces joins
- Columnar storage optimized for aggregations
- Large result sets aggregated into summaries
- Examples: Snowflake, BigQuery, Redshift
4Data Warehouse Architecture
A canonical data warehouse follows a three-tier architecture: source systems at the bottom, the warehouse core in the middle, and BI tools at the top.
Data Warehouse Three-Tier Architecture
Data flows from operational source systems through ETL into the warehouse and out to analytical tools.
Three-Tier Breakdown
Tier 1 -- Source Layer
OLTP databases, flat files, ERP/CRM systems, external APIs, and web logs. The staging area temporarily holds extracted data before transformation, isolating source systems from the ETL process.
Tier 2 -- Storage & Integration Layer
The Enterprise Data Warehouse (EDW) storing integrated, transformed, historical data in star/snowflake schemas. Data Marts provide department-specific subsets. The metadata repository tracks lineage, quality rules, and business definitions.
Tier 3 -- Presentation Layer
OLAP tools, dashboards (Tableau, Power BI), ad-hoc query tools, data mining applications, and scheduled reports. Consumers interact only with this layer, never directly with source systems.
Kimball vs. Inmon Approach
Kimball (Bottom-Up)
- Build data marts first per business process
- Integrate via conformed dimensions
- Delivers business value incrementally
- Bus architecture with shared dimensions
Inmon (Top-Down)
- Build centralized EDW first (3NF)
- Derive data marts from the EDW
- Single authoritative source of truth
- Higher upfront investment, more consistent
Metadata Management
Metadata is critical for governance and trust. Technical metadata describes schema, ETL processes, and data lineage. Business metadata provides human-readable definitions and hierarchies. Operational metadata records load times, error logs, and refresh rates. Without robust metadata, analysts cannot trust or interpret the data they are analyzing.
5Star Schema
The star schema is the most prevalent data warehouse model due to its simplicity and query performance. A central fact table is surrounded by multiple dimension tables, each connected via a foreign key relationship -- resembling a star when drawn.
Star Schema
A central FactSales table connected to four dimension tables via foreign keys -- the most common data warehouse model.
Fact Table Measure Types
Additive
Can be summed across ALL dimensions. The most flexible.
e.g., SalesAmount, Quantity
Semi-Additive
Can be summed across some dimensions but NOT time (snapshot values).
e.g., AccountBalance, InventoryLevel
Non-Additive
Cannot be meaningfully summed across any dimension. Use AVG, COUNT, MIN, MAX instead.
e.g., UnitPrice, MarginPercentage
Advantages & Disadvantages
Advantages
- Simple, intuitive structure easy to understand
- Fewer joins -- faster query performance
- Optimized for OLAP star queries
- Easy to add new dimensions
- Business users can navigate without DBA help
Disadvantages
- Data redundancy in dimension tables
- Higher storage consumption for dimensions
- Updates to dimension attributes require care
- Not optimal for write-heavy workloads
Slowly Changing Dimensions (SCD)
SCD Type 1 -- Overwrite
New data overwrites the original. No history preserved. Simple but lossy. Use when history does not matter (e.g., correcting a typo).
SCD Type 2 -- Add New Row
A new record is inserted for each change with StartDate/EndDate and IsCurrent flag. Preserves complete history for “as-of” reporting. Most common choice when historical accuracy matters.
SCD Type 3 -- Add New Column
A new column stores the previous value. Preserves only one prior state. Limited history but no row proliferation. Use when only “before and after” is needed.
6Snowflake Schema
The snowflake schema is an extension of the star schema where dimension tables are normalized into multiple related tables. Instead of one flat DimProduct table, you might have DimProduct → DimProductCategory → DimProductBrand, forming a hierarchical snowflake pattern.
Star Schema
DimProduct ├── ProductKey (PK) ├── ProductName ├── ProductCategory ← denormalized ├── CategoryDesc ← denormalized ├── ProductBrand ← denormalized └── BrandCountry ← denormalized
Snowflake Schema
DimProduct
├── ProductKey (PK)
├── ProductName
├── CategoryKey (FK) → DimCategory
│ ├── CategoryKey (PK)
│ └── CategoryDesc
└── BrandKey (FK) → DimBrand
├── BrandKey (PK)
├── BrandName
└── BrandCountryStar vs. Snowflake Comparison
| Aspect | Star Schema | Snowflake Schema |
|---|---|---|
| Query Complexity | Simple (fewer joins) | Complex (more joins) |
| Query Performance | Faster | Slower |
| Data Redundancy | Higher | Lower |
| Storage (dimensions) | More | Less |
| Maintenance | Simpler | More complex |
| Best For | Most DW scenarios | Very large, stable dimensions |
Star is simple and fast, but fat (redundant). Snowflake is complex and slow, but slim (normalized). In practice, the star schema's performance advantages almost always outweigh the storage savings of the snowflake schema for typical data warehouse query patterns.
7ETL & ELT Pipelines
The data pipeline is the backbone of any data warehouse, responsible for moving, cleaning, and reshaping data from operational sources into the analytical repository.
ETL Pipeline
Three-phase process for moving data from operational sources into the data warehouse.
Read data from disparate source systems.
Full Extraction
Pull all source data
Incremental (CDC)
Pull only changed rows
Sources
OLTP DBs, APIs, files
Clean and reshape data for the warehouse.
Cleaning
Fix nulls, dedup, errors
Standardization
Formats, units, codes
Aggregation
Summarize & derive fields
Denormalization
Flatten for star schema
Write transformed data into the target system.
Batch Loading
Nightly / scheduled loads
Stream Loading
Near real-time ingestion
Targets
DW, Data Mart, Data Lake
Metadata Repository tracks lineage, data quality rules, load times, and business definitions across all three phases.
ETL vs. ELT
ETL (Traditional)
Transform data before loading into the warehouse. Transformation happens in a separate ETL engine.
- Data leaves source systems cleaned
- Good when target has limited compute
- Better for strict data privacy requirements
- Tools: Informatica, SSIS, Talend
ELT (Modern / Cloud)
Load raw data first, then transform within the powerful target system using SQL or Spark.
- Raw data preserved for future use
- Leverages cloud DW compute (Snowflake, BigQuery)
- Faster ingestion, schema-on-read flexibility
- Tools: dbt, Fivetran, Airbyte
Change Data Capture (CDC)
Full extraction re-reads all source data on every run -- simple but resource-intensive. CDC identifies only changed records, enabling efficient incremental loads. Common CDC methods:
Timestamp Columns
Filter by last_modified_date > last_run. Simple but requires source schema support.
Database Triggers
Triggers fire on INSERT/UPDATE/DELETE and write changes to a CDC log table. Real-time but adds write overhead.
Log-Based CDC
Reads the database transaction log (redo log). Zero impact on source performance. Most scalable approach.
Snapshot Comparison
Compares current data with a previous snapshot to identify differences. Accurate but slow for large tables.
8OLAP Operations & Cube
OLAP (Online Analytical Processing) enables fast, interactive, multidimensional analysis. A data cube conceptually represents data as a grid where each axis is a business dimension (Time, Product, Region) and each cell contains aggregated measures (Total Sales, Average Price).
OLAP Data Cube & Operations
A 3D cube with Time, Product, and Region axes. OLAP operations navigate and aggregate across dimensions.
OLAP Operations Explained
Slice -- Select one value on one dimension
Reduces a 3D cube to a 2D plane. Example: “Show all data for Q1 2023 only.” Time dimension is fixed; Product and Region remain free.
Dice -- Select ranges on multiple dimensions
Creates a sub-cube by constraining multiple dimensions simultaneously. Example: “Q1 AND Q2, for Electronics AND Phones, in North AND East.”
Roll-up (Drill-up) -- Aggregate to higher granularity
Climbs up a dimension hierarchy. Example: Daily Sales → Monthly Sales → Annual Sales. Or City → State → Region.
Drill-down -- Navigate to lower granularity
Descends a dimension hierarchy to see more detail. Example: Monthly Sales → Daily Sales. Or Region → State → City → Store.
Pivot (Rotate) -- Swap dimensions across axes
Reorients the report view. Example: From “Products as rows, Regions as columns” to “Regions as rows, Products as columns.”
OLAP Storage Architectures
MOLAP
Pre-aggregated multidimensional arrays. Fastest queries but limited scalability.
ROLAP
Data stored in RDBMS. SQL-based. Highly scalable but slower due to on-the-fly aggregation.
HOLAP
Hybrid: detailed data in RDBMS, aggregates in multidimensional store. Balances speed and scale.
Slice (cut one layer) → Dice (cut a sub-cube) → Roll-up (go higher) → Drill-down (go deeper) → Pivot (rotate the view).
9Big Data & the Hadoop Ecosystem
Big Data is characterized by the 3 V's: Volume (terabytes to exabytes exceeding single-machine capacity), Velocity (real-time or near-real-time data streams), and Variety (structured, semi-structured, and unstructured formats). Additional V's include Veracity (data trustworthiness) and Value (business insight potential).
Hadoop / Spark Ecosystem Layers
Stacked architecture: storage at the base, resource management in the middle, processing engines above, and analytics tools on top.
Hadoop Core Components
HDFS -- Hadoop Distributed File System
Stores very large files by breaking them into blocks (default 128MB) distributed across DataNodes with 3x replication for fault tolerance. The NameNode manages the file system namespace and block locations. The DataNodes store the actual data blocks. Optimized for write-once, read-many batch access patterns.
MapReduce -- Batch Processing
Map phase: each worker independently processes an input split and emits intermediate (key, value) pairs. Shuffle & Sort: all values for the same key are grouped together. Reduce phase: aggregates the grouped values to produce the final output. Limitation: disk-based, poor for iterative algorithms.
YARN -- Resource Management
Separates resource management from data processing, enabling multiple engines (Spark, MapReduce, Flink) to share the same cluster. The ResourceManager allocates cluster resources globally. NodeManagers launch and monitor containers on each machine.
Apache Spark
Spark achieves dramatically faster performance than MapReduce by keeping intermediate data in memory using Resilient Distributed Datasets (RDDs) -- fault-tolerant, immutable, distributed collections that can be cached across iterations.
RDDs
Low-level: transformations (lazy) + actions (eager). Fault-tolerant via lineage graph. Best for unstructured data.
DataFrames
Structured, schema-aware tables. Catalyst Optimizer generates efficient query plans. Higher performance than raw RDDs for structured data.
SparkSQL
Run SQL or HiveQL directly against DataFrames and external sources (Hive, JDBC, Parquet). Bridges SQL skills with big data processing.
Lambda vs. Kappa Architecture
Lambda Architecture
Combines a Batch Layer (accurate, high-latency) with a Speed Layer (approximate, low-latency). A Serving Layer merges both views for queries.
- Pro: high accuracy + low latency
- Con: two codebases to maintain
Kappa Architecture
Eliminates the batch layer. All data -- real-time and historical -- is processed by a single stream processing engine. Historical reprocessing is done by replaying the immutable event log (e.g., Kafka).
- Pro: simpler, less code duplication
- Con: replaying large logs takes time
10Common Mistakes
Running analytical queries directly on the OLTP database
OLTP systems are not optimized for large scans and aggregations. Route analytical queries to a dedicated data warehouse with denormalized schemas and columnar storage. Mixing workloads degrades both.
Applying 3NF normalization to dimension tables in a DW
Data warehouses prioritize query performance over storage efficiency for dimensions. Use a star schema with denormalized dimensions unless redundancy is a critical issue. Every extra join slows analytical queries.
Using SCD Type 1 (overwrite) when historical accuracy is required
If business questions like “What were sales when this customer lived in City A?” matter, you need SCD Type 2. Type 1 destroys history. Always analyze business requirements for historical reporting before choosing an SCD type.
Storing pre-aggregated data in fact tables instead of atomic-level records
Fact tables should capture data at the lowest atomic level (individual transaction line items, not daily totals). Higher-level aggregates can always be computed with GROUP BY or stored as materialized views. Pre-aggregating loses analytical flexibility forever.
Loading data without profiling, cleansing, and validating it
Bad data in = bad analytics out. Invest heavily in data profiling (understand source data characteristics), cleansing (fix errors, nulls, duplicates), and validation rules in the ETL pipeline. Poor data quality destroys analyst trust in the warehouse.
Building data marts with inconsistent dimension definitions
Conformed dimensions (shared across multiple fact tables) are essential for consistent cross-functional analysis. Without them, “January 2023” in the Sales Mart may mean something different from “January 2023” in the Inventory Mart, making drill-across analysis unreliable.
Frequently Asked Questions
- Why can't I just use my existing OLTP database for analytical reporting?
- OLTP databases are optimized for high-volume, concurrent, short write transactions. Running complex analytical queries -- large table scans, aggregations, and multi-table joins -- directly on an OLTP system competes for resources with operational tasks, severely degrading performance and potentially causing outages. Data warehouses are specifically designed and optimized for these analytical workloads with denormalized schemas, columnar storage, and pre-aggregated views.
- What is the primary difference between a Data Warehouse and a Data Lake?
- A Data Warehouse stores structured, cleaned, and transformed data with a predefined schema-on-write, optimized for specific analytical queries and reporting. A Data Lake stores raw, multi-structured data (structured, semi-structured, unstructured) in its native format, deferring schema definition until read-time (schema-on-read). Data lakes offer flexibility for future analytical needs, while data warehouses provide immediate, refined insights. Data often flows from a data lake into a data warehouse after refinement.
- When should I choose ETL versus ELT?
- Choose ETL when your target has limited processing power, data quality issues require extensive cleansing before loading, or compliance mandates transformation before data lands in the final repository. Choose ELT when your target (e.g., Snowflake, BigQuery, or a data lake) has significant scalable processing power, you want schema-on-read flexibility, need to retain raw data for future unknown analytical needs, or you prioritize faster data ingestion.
- How does a surrogate key in a dimension table help in a data warehouse?
- A surrogate key is an artificial, system-generated primary key (typically an integer sequence) replacing the natural key from the source system. Benefits include: independence from source system key changes, enabling SCD Type 2 (a new surrogate key per version), faster joins (integer vs. string keys), and standardization across different source systems with inconsistent key formats.
- What are the main advantages of Apache Spark over traditional MapReduce?
- Spark achieves significantly faster performance (up to 100x for in-memory workloads) by caching intermediate data in RAM across iterations. It supports a wider range of workloads beyond batch processing -- interactive SQL (SparkSQL), real-time streaming (Spark Streaming), machine learning (MLlib), and graph processing (GraphX). It also provides high-level APIs in Python, Scala, Java, and R, making development far easier than raw MapReduce.
- What is the purpose of conformed dimensions in a bus architecture?
- Conformed dimensions are shared across multiple fact tables or data marts with identical definitions and content. They ensure that when different business processes are analyzed together, shared attributes (e.g., DimDate, DimCustomer, DimProduct) have the same meaning. This enables "drill-across" analysis -- combining data from different fact tables consistently -- and is central to Kimball's bus architecture for building an integrated enterprise data warehouse from incremental data marts.
Practice Quiz
Test your understanding of data warehousing and big data concepts -- select the correct answer for each question.
1.Which of the following is NOT a characteristic of a Data Warehouse according to Inmon?
2.Which schema is generally preferred for query performance in data warehouses due to fewer joins?
3.In an ETL process, which phase is responsible for data cleansing, standardization, and aggregation?
4.Which type of Slowly Changing Dimension (SCD) preserves full historical accuracy by adding a new row for each change?
5.Which OLAP operation allows you to change the perspective of the data by swapping dimensions on different axes?
6.Which component of Hadoop is responsible for distributed storage of large files across a cluster?
7.Which of the following is NOT typically a characteristic of NoSQL databases?
8.What is the main advantage of the Kappa Architecture over the Lambda Architecture?
9.A measure like AccountBalance that can be summed across different branches at a specific point in time, but not meaningfully across time periods, is classified as:
10.What is the primary role of a "staging area" in a data warehousing architecture?
Study Tips
- Draw the star schema by hand: Sketch FactSales with four dimension tables and label the foreign keys and measures. Understanding the physical structure makes dimensional modeling intuitive.
- Practice OLAP operations on a spreadsheet: Take a simple 3D dataset (product × region × month) and manually perform slice, dice, roll-up, and drill-down operations to solidify the concepts.
- Trace an ETL pipeline end-to-end: Pick a real-world scenario (e.g., e-commerce orders → sales DW) and design each phase: what to extract, what transformations are needed, and how to load into star schema tables.
- Know the SCD types cold: Exams frequently test when to use Type 1 vs. Type 2 vs. Type 3. Map each to a specific business scenario and memorize: Type 1 = Forget, Type 2 = Keep all, Type 3 = Keep some.