ResourcesDatabaseData Warehousing & Big Data
DatabaseCollege

Data Warehousing & Big Data Concepts

Data warehousing is the practice of consolidating large volumes of historical data from disparate sources into a central, analytically optimized repository. Big Data extends this to datasets whose volume, velocity, and variety exceed the capabilities of conventional database tools, demanding distributed processing frameworks like Hadoop and Spark.

This guide covers OLTP vs. OLAP systems, dimensional modeling (star and snowflake schemas), ETL and ELT pipelines, OLAP cube operations, the Hadoop/Spark ecosystem, and a 10-question practice quiz.

1Introduction

In the contemporary data-driven landscape, organizations leverage vast quantities of information to derive actionable insights. This necessitates specialized systems distinct from traditional operational databases. A data warehouse (DW) is a subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management's decision-making process [Inmon, 1992].

Big Data refers to datasets whose size, velocity, and variety exceed the capabilities of conventional database software tools to capture, store, manage, and analyze [Laney, 2001]. It encompasses not only sheer volume but also rapid generation rates and diverse formats, necessitating distributed processing frameworks such as Apache Hadoop and Apache Spark.

In Practice

A large retail chain uses an OLTP system to process millions of daily sales transactions with full ACID guarantees. Concurrently, a data warehouse aggregates this transactional data nightly to enable business analysts to identify top-selling products by region, analyze seasonal trends, and optimize inventory -- workloads that would cripple the operational database if run directly against it.

Why Separate Systems?

Performance

OLAP queries scan billions of rows. Mixing with OLTP transactions creates resource contention and degrades both workloads.

Schema Design

OLTP favors normalized schemas for write efficiency. OLAP favors denormalized star schemas for read performance.

Historical Data

OLTP systems often purge old data. Data warehouses are non-volatile -- data is kept for years of historical trend analysis.

2Key Definitions

Essential terminology for data warehousing and big data at the university level.

Data Warehouse (DW)

Subject-oriented, integrated, time-variant, and non-volatile repository for management decision-making [Inmon, 1992].

Data Mart (DM)

A subset of the DW focused on a specific department or business function (e.g., Sales Mart, Finance Mart).

Data Lake

Repository storing raw data in its native format (structured, semi-structured, unstructured) with schema-on-read.

ETL

Extract, Transform, Load -- the classic three-phase pipeline for moving data from sources to the warehouse.

ELT

Extract, Load, Transform -- loads raw data into a powerful target first, then transforms within that system.

OLAP

Online Analytical Processing -- tools enabling fast, interactive, multidimensional analysis of historical data.

OLTP

Online Transaction Processing -- operational systems optimized for high-volume short read/write transactions with ACID guarantees.

Star Schema

Central fact table surrounded by denormalized dimension tables, joined via foreign keys. The most common DW model.

Snowflake Schema

Extension of the star schema where dimension tables are normalized into multiple related tables, reducing redundancy.

Fact Table

Central table containing quantitative measures (SalesAmount, Quantity) and foreign keys to all dimension tables.

Dimension Table

Contains descriptive attributes about business entities (product names, customer demographics, calendar dates).

Data Cube

Multidimensional array of data where each axis is a business dimension and cells contain aggregated measures.

SCD (Slowly Changing Dimension)

A dimension whose attributes change over time. Types 1/2/3 handle history by overwriting, adding rows, or adding columns.

HDFS

Hadoop Distributed File System -- stores petabyte-scale files across commodity hardware clusters with replication.

MapReduce

Programming model for parallel batch processing: Map phase (filter/sort) + Reduce phase (aggregate/summarize).

Apache Spark

Fast distributed processing engine using in-memory RDDs, DataFrames, and SparkSQL -- up to 100x faster than MapReduce for iterative workloads.

3OLTP vs. OLAP

The fundamental architectural divide in database systems separates systems optimized for operational transactions from those optimized for analytical queries. Understanding these differences is essential for designing the right system for each workload.

CharacteristicOLTPOLAP
PurposeDay-to-day operationsStrategic analysis & reporting
OperationsINSERT, UPDATE, DELETESELECT with aggregations
Query ComplexitySimple, short transactionsComplex, long-running queries
Data VolumeCurrent data (small per query)Historical data (billions of rows)
SchemaNormalized (3NF)Denormalized (star/snowflake)
ConcurrencyThousands of usersDozens of analysts
ConsistencyStrict ACIDEventual (often acceptable)
Response TimeMillisecondsSeconds to minutes
ExamplesE-commerce checkout, ATMSales dashboard, forecasting

OLTP Characteristics

  • High write throughput with full ACID
  • Normalized schema minimizes redundancy
  • Row-oriented storage optimized for updates
  • Small result sets per query
  • Examples: PostgreSQL, MySQL, Oracle

OLAP Characteristics

  • Read-heavy with large table scans
  • Denormalized schema reduces joins
  • Columnar storage optimized for aggregations
  • Large result sets aggregated into summaries
  • Examples: Snowflake, BigQuery, Redshift

4Data Warehouse Architecture

A canonical data warehouse follows a three-tier architecture: source systems at the bottom, the warehouse core in the middle, and BI tools at the top.

Data Warehouse Three-Tier Architecture

Data flows from operational source systems through ETL into the warehouse and out to analytical tools.

SOURCE SYSTEMSOLTP DatabasesFlat Files / CSVERP / CRMExternal APIsWeb LogsETL LAYERExtract(pull from sources)Staging Area(temp storage)Transform(clean, standardize)Load(write to DW)DATA WAREHOUSEEnterprise DW (EDW)Star / Snowflake SchemaData MartsSalesMartFinanceMartMetadata RepositoryLineage & DefinitionsOLAP Cube LayerPre-aggregated viewsBI TOOLS & REPORTSOLAP ToolsDashboards(Tableau, Power BI)Ad-hoc QueriesData Mining& ML ModelsScheduled ReportsETLQueryTier 1: SourcesIntegrationTier 2: StorageTier 3: Presentation
Three-tier DW architecture: source systems feed ETL, which populates the warehouse, which serves BI tools.

Three-Tier Breakdown

Tier 1 -- Source Layer

OLTP databases, flat files, ERP/CRM systems, external APIs, and web logs. The staging area temporarily holds extracted data before transformation, isolating source systems from the ETL process.

Tier 2 -- Storage & Integration Layer

The Enterprise Data Warehouse (EDW) storing integrated, transformed, historical data in star/snowflake schemas. Data Marts provide department-specific subsets. The metadata repository tracks lineage, quality rules, and business definitions.

Tier 3 -- Presentation Layer

OLAP tools, dashboards (Tableau, Power BI), ad-hoc query tools, data mining applications, and scheduled reports. Consumers interact only with this layer, never directly with source systems.

Kimball vs. Inmon Approach

Kimball (Bottom-Up)

  • Build data marts first per business process
  • Integrate via conformed dimensions
  • Delivers business value incrementally
  • Bus architecture with shared dimensions

Inmon (Top-Down)

  • Build centralized EDW first (3NF)
  • Derive data marts from the EDW
  • Single authoritative source of truth
  • Higher upfront investment, more consistent

Metadata Management

Metadata is critical for governance and trust. Technical metadata describes schema, ETL processes, and data lineage. Business metadata provides human-readable definitions and hierarchies. Operational metadata records load times, error logs, and refresh rates. Without robust metadata, analysts cannot trust or interpret the data they are analyzing.

5Star Schema

The star schema is the most prevalent data warehouse model due to its simplicity and query performance. A central fact table is surrounded by multiple dimension tables, each connected via a foreign key relationship -- resembling a star when drawn.

Star Schema

A central FactSales table connected to four dimension tables via foreign keys -- the most common data warehouse model.

FactSalesPKSalesOrderIDFKProductKeyFKCustomerKeyFKDateKeyFKStoreKeyQuantitySalesAmountDimDatePKDateKeyFullDateDayOfWeekMonthCalendarYearDateKeyDimStorePKStoreKeyStoreNameStoreCityStoreRegionStoreManagerStoreKeyDimProductPKProductKeyProductNameProductCategoryProductBrandProductColorUnitPriceProductKeyDimCustomerPKCustomerKeyCustomerNameCityStateCountryCustomerSegmentCustomerKey
Star schema: one central fact table (blue) with measures, four surrounding dimension tables with descriptive attributes.

Fact Table Measure Types

Additive

Can be summed across ALL dimensions. The most flexible.

e.g., SalesAmount, Quantity

Semi-Additive

Can be summed across some dimensions but NOT time (snapshot values).

e.g., AccountBalance, InventoryLevel

Non-Additive

Cannot be meaningfully summed across any dimension. Use AVG, COUNT, MIN, MAX instead.

e.g., UnitPrice, MarginPercentage

Advantages & Disadvantages

Advantages

  • Simple, intuitive structure easy to understand
  • Fewer joins -- faster query performance
  • Optimized for OLAP star queries
  • Easy to add new dimensions
  • Business users can navigate without DBA help

Disadvantages

  • Data redundancy in dimension tables
  • Higher storage consumption for dimensions
  • Updates to dimension attributes require care
  • Not optimal for write-heavy workloads

Slowly Changing Dimensions (SCD)

SCD Type 1 -- Overwrite

New data overwrites the original. No history preserved. Simple but lossy. Use when history does not matter (e.g., correcting a typo).

SCD Type 2 -- Add New Row

A new record is inserted for each change with StartDate/EndDate and IsCurrent flag. Preserves complete history for “as-of” reporting. Most common choice when historical accuracy matters.

SCD Type 3 -- Add New Column

A new column stores the previous value. Preserves only one prior state. Limited history but no row proliferation. Use when only “before and after” is needed.

6Snowflake Schema

The snowflake schema is an extension of the star schema where dimension tables are normalized into multiple related tables. Instead of one flat DimProduct table, you might have DimProductDimProductCategoryDimProductBrand, forming a hierarchical snowflake pattern.

Star Schema

DimProduct
├── ProductKey (PK)
├── ProductName
├── ProductCategory   ← denormalized
├── CategoryDesc      ← denormalized
├── ProductBrand      ← denormalized
└── BrandCountry      ← denormalized

Snowflake Schema

DimProduct
├── ProductKey (PK)
├── ProductName
├── CategoryKey (FK) → DimCategory
│     ├── CategoryKey (PK)
│     └── CategoryDesc
└── BrandKey (FK) → DimBrand
      ├── BrandKey (PK)
      ├── BrandName
      └── BrandCountry

Star vs. Snowflake Comparison

AspectStar SchemaSnowflake Schema
Query ComplexitySimple (fewer joins)Complex (more joins)
Query PerformanceFasterSlower
Data RedundancyHigherLower
Storage (dimensions)MoreLess
MaintenanceSimplerMore complex
Best ForMost DW scenariosVery large, stable dimensions
Memory Aid

Star is simple and fast, but fat (redundant). Snowflake is complex and slow, but slim (normalized). In practice, the star schema's performance advantages almost always outweigh the storage savings of the snowflake schema for typical data warehouse query patterns.

7ETL & ELT Pipelines

The data pipeline is the backbone of any data warehouse, responsible for moving, cleaning, and reshaping data from operational sources into the analytical repository.

ETL Pipeline

Three-phase process for moving data from operational sources into the data warehouse.

E
Extract

Read data from disparate source systems.

Full Extraction

Pull all source data

Incremental (CDC)

Pull only changed rows

Sources

OLTP DBs, APIs, files

T
Transform

Clean and reshape data for the warehouse.

Cleaning

Fix nulls, dedup, errors

Standardization

Formats, units, codes

Aggregation

Summarize & derive fields

Denormalization

Flatten for star schema

L
Load

Write transformed data into the target system.

Batch Loading

Nightly / scheduled loads

Stream Loading

Near real-time ingestion

Targets

DW, Data Mart, Data Lake

Metadata Repository tracks lineage, data quality rules, load times, and business definitions across all three phases.

ETL pipeline: Extract from sources, Transform in staging, Load into warehouse. ELT reverses Transform and Load steps.

ETL vs. ELT

ETL (Traditional)

Transform data before loading into the warehouse. Transformation happens in a separate ETL engine.

  • Data leaves source systems cleaned
  • Good when target has limited compute
  • Better for strict data privacy requirements
  • Tools: Informatica, SSIS, Talend

ELT (Modern / Cloud)

Load raw data first, then transform within the powerful target system using SQL or Spark.

  • Raw data preserved for future use
  • Leverages cloud DW compute (Snowflake, BigQuery)
  • Faster ingestion, schema-on-read flexibility
  • Tools: dbt, Fivetran, Airbyte

Change Data Capture (CDC)

Full extraction re-reads all source data on every run -- simple but resource-intensive. CDC identifies only changed records, enabling efficient incremental loads. Common CDC methods:

Timestamp Columns

Filter by last_modified_date > last_run. Simple but requires source schema support.

Database Triggers

Triggers fire on INSERT/UPDATE/DELETE and write changes to a CDC log table. Real-time but adds write overhead.

Log-Based CDC

Reads the database transaction log (redo log). Zero impact on source performance. Most scalable approach.

Snapshot Comparison

Compares current data with a previous snapshot to identify differences. Accurate but slow for large tables.

8OLAP Operations & Cube

OLAP (Online Analytical Processing) enables fast, interactive, multidimensional analysis. A data cube conceptually represents data as a grid where each axis is a business dimension (Time, Product, Region) and each cell contains aggregated measures (Total Sales, Average Price).

OLAP Data Cube & Operations

A 3D cube with Time, Product, and Region axes. OLAP operations navigate and aggregate across dimensions.

Q1Q2Q3Q4LaptopPhoneTabletSouthEastWestNorthTIMEPRODUCTREGION$84KOLAP OperationsSliceSelect one dimensionvalue to get a 2D sub-cube. e.g., "Q1 only"DiceSelect ranges acrossmultiple dimensions.e.g., "Q1, Electronics, N"Roll-upAggregate to highergranularity. e.g.,Daily → MonthlyDrill-downNavigate to lowergranularity. e.g.,Monthly → Daily
The OLAP cube: yellow band = Slice (Q1), green cell = Dice (North, Laptop, Q1). Operations annotated on right.

OLAP Operations Explained

Slice -- Select one value on one dimension

Reduces a 3D cube to a 2D plane. Example: “Show all data for Q1 2023 only.” Time dimension is fixed; Product and Region remain free.

Dice -- Select ranges on multiple dimensions

Creates a sub-cube by constraining multiple dimensions simultaneously. Example: “Q1 AND Q2, for Electronics AND Phones, in North AND East.”

Roll-up (Drill-up) -- Aggregate to higher granularity

Climbs up a dimension hierarchy. Example: Daily Sales → Monthly Sales → Annual Sales. Or City → State → Region.

Drill-down -- Navigate to lower granularity

Descends a dimension hierarchy to see more detail. Example: Monthly Sales → Daily Sales. Or Region → State → City → Store.

Pivot (Rotate) -- Swap dimensions across axes

Reorients the report view. Example: From “Products as rows, Regions as columns” to “Regions as rows, Products as columns.”

OLAP Storage Architectures

MOLAP

Pre-aggregated multidimensional arrays. Fastest queries but limited scalability.

ROLAP

Data stored in RDBMS. SQL-based. Highly scalable but slower due to on-the-fly aggregation.

HOLAP

Hybrid: detailed data in RDBMS, aggregates in multidimensional store. Balances speed and scale.

OLAP Operations Mnemonic (S.D.R.D.P)

Slice (cut one layer) → Dice (cut a sub-cube) → Roll-up (go higher) → Drill-down (go deeper) → Pivot (rotate the view).

9Big Data & the Hadoop Ecosystem

Big Data is characterized by the 3 V's: Volume (terabytes to exabytes exceeding single-machine capacity), Velocity (real-time or near-real-time data streams), and Variety (structured, semi-structured, and unstructured formats). Additional V's include Veracity (data trustworthiness) and Value (business insight potential).

Hadoop / Spark Ecosystem Layers

Stacked architecture: storage at the base, resource management in the middle, processing engines above, and analytics tools on top.

LAYER 4 — Applications & Analytics ToolsHivePigHBaseKafkaBI ToolsML LibrariesLAYER 3 — Processing EnginesApache SparkRDDs | DataFrames | SparkSQL | Streaming | MLlibMapReduceMap → Shuffle → ReduceApache FlinkStream-first processingLAYER 2 — Resource ManagementYARNResourceManager + NodeManagersMesosCluster-wide resource sharingKubernetesContainer orchestration for big dataLAYER 1 — Distributed StorageHDFSNameNode + DataNodes (replicated blocks)Amazon S3Object store, cloud-nativeAzure Blob / GCSMulti-cloud object storage
Hadoop/Spark ecosystem: HDFS stores data, YARN manages resources, Spark/MapReduce process it, Hive/Pig query it.

Hadoop Core Components

HDFS -- Hadoop Distributed File System

Stores very large files by breaking them into blocks (default 128MB) distributed across DataNodes with 3x replication for fault tolerance. The NameNode manages the file system namespace and block locations. The DataNodes store the actual data blocks. Optimized for write-once, read-many batch access patterns.

MapReduce -- Batch Processing

Map phase: each worker independently processes an input split and emits intermediate (key, value) pairs. Shuffle & Sort: all values for the same key are grouped together. Reduce phase: aggregates the grouped values to produce the final output. Limitation: disk-based, poor for iterative algorithms.

YARN -- Resource Management

Separates resource management from data processing, enabling multiple engines (Spark, MapReduce, Flink) to share the same cluster. The ResourceManager allocates cluster resources globally. NodeManagers launch and monitor containers on each machine.

Apache Spark

Spark achieves dramatically faster performance than MapReduce by keeping intermediate data in memory using Resilient Distributed Datasets (RDDs) -- fault-tolerant, immutable, distributed collections that can be cached across iterations.

RDDs

Low-level: transformations (lazy) + actions (eager). Fault-tolerant via lineage graph. Best for unstructured data.

DataFrames

Structured, schema-aware tables. Catalyst Optimizer generates efficient query plans. Higher performance than raw RDDs for structured data.

SparkSQL

Run SQL or HiveQL directly against DataFrames and external sources (Hive, JDBC, Parquet). Bridges SQL skills with big data processing.

Lambda vs. Kappa Architecture

Lambda Architecture

Combines a Batch Layer (accurate, high-latency) with a Speed Layer (approximate, low-latency). A Serving Layer merges both views for queries.

  • Pro: high accuracy + low latency
  • Con: two codebases to maintain

Kappa Architecture

Eliminates the batch layer. All data -- real-time and historical -- is processed by a single stream processing engine. Historical reprocessing is done by replaying the immutable event log (e.g., Kafka).

  • Pro: simpler, less code duplication
  • Con: replaying large logs takes time

10Common Mistakes

Confusing OLTP and OLAP Design Goals

Running analytical queries directly on the OLTP database

OLTP systems are not optimized for large scans and aggregations. Route analytical queries to a dedicated data warehouse with denormalized schemas and columnar storage. Mixing workloads degrades both.

Over-Normalizing Dimension Tables

Applying 3NF normalization to dimension tables in a DW

Data warehouses prioritize query performance over storage efficiency for dimensions. Use a star schema with denormalized dimensions unless redundancy is a critical issue. Every extra join slows analytical queries.

Ignoring Slowly Changing Dimensions

Using SCD Type 1 (overwrite) when historical accuracy is required

If business questions like “What were sales when this customer lived in City A?” matter, you need SCD Type 2. Type 1 destroys history. Always analyze business requirements for historical reporting before choosing an SCD type.

Wrong Fact Table Granularity

Storing pre-aggregated data in fact tables instead of atomic-level records

Fact tables should capture data at the lowest atomic level (individual transaction line items, not daily totals). Higher-level aggregates can always be computed with GROUP BY or stored as materialized views. Pre-aggregating loses analytical flexibility forever.

Underestimating Data Quality

Loading data without profiling, cleansing, and validating it

Bad data in = bad analytics out. Invest heavily in data profiling (understand source data characteristics), cleansing (fix errors, nulls, duplicates), and validation rules in the ETL pipeline. Poor data quality destroys analyst trust in the warehouse.

Not Using Conformed Dimensions

Building data marts with inconsistent dimension definitions

Conformed dimensions (shared across multiple fact tables) are essential for consistent cross-functional analysis. Without them, “January 2023” in the Sales Mart may mean something different from “January 2023” in the Inventory Mart, making drill-across analysis unreliable.

Frequently Asked Questions

Why can't I just use my existing OLTP database for analytical reporting?
OLTP databases are optimized for high-volume, concurrent, short write transactions. Running complex analytical queries -- large table scans, aggregations, and multi-table joins -- directly on an OLTP system competes for resources with operational tasks, severely degrading performance and potentially causing outages. Data warehouses are specifically designed and optimized for these analytical workloads with denormalized schemas, columnar storage, and pre-aggregated views.
What is the primary difference between a Data Warehouse and a Data Lake?
A Data Warehouse stores structured, cleaned, and transformed data with a predefined schema-on-write, optimized for specific analytical queries and reporting. A Data Lake stores raw, multi-structured data (structured, semi-structured, unstructured) in its native format, deferring schema definition until read-time (schema-on-read). Data lakes offer flexibility for future analytical needs, while data warehouses provide immediate, refined insights. Data often flows from a data lake into a data warehouse after refinement.
When should I choose ETL versus ELT?
Choose ETL when your target has limited processing power, data quality issues require extensive cleansing before loading, or compliance mandates transformation before data lands in the final repository. Choose ELT when your target (e.g., Snowflake, BigQuery, or a data lake) has significant scalable processing power, you want schema-on-read flexibility, need to retain raw data for future unknown analytical needs, or you prioritize faster data ingestion.
How does a surrogate key in a dimension table help in a data warehouse?
A surrogate key is an artificial, system-generated primary key (typically an integer sequence) replacing the natural key from the source system. Benefits include: independence from source system key changes, enabling SCD Type 2 (a new surrogate key per version), faster joins (integer vs. string keys), and standardization across different source systems with inconsistent key formats.
What are the main advantages of Apache Spark over traditional MapReduce?
Spark achieves significantly faster performance (up to 100x for in-memory workloads) by caching intermediate data in RAM across iterations. It supports a wider range of workloads beyond batch processing -- interactive SQL (SparkSQL), real-time streaming (Spark Streaming), machine learning (MLlib), and graph processing (GraphX). It also provides high-level APIs in Python, Scala, Java, and R, making development far easier than raw MapReduce.
What is the purpose of conformed dimensions in a bus architecture?
Conformed dimensions are shared across multiple fact tables or data marts with identical definitions and content. They ensure that when different business processes are analyzed together, shared attributes (e.g., DimDate, DimCustomer, DimProduct) have the same meaning. This enables "drill-across" analysis -- combining data from different fact tables consistently -- and is central to Kimball's bus architecture for building an integrated enterprise data warehouse from incremental data marts.

Practice Quiz

Test your understanding of data warehousing and big data concepts -- select the correct answer for each question.

1.Which of the following is NOT a characteristic of a Data Warehouse according to Inmon?

2.Which schema is generally preferred for query performance in data warehouses due to fewer joins?

3.In an ETL process, which phase is responsible for data cleansing, standardization, and aggregation?

4.Which type of Slowly Changing Dimension (SCD) preserves full historical accuracy by adding a new row for each change?

5.Which OLAP operation allows you to change the perspective of the data by swapping dimensions on different axes?

6.Which component of Hadoop is responsible for distributed storage of large files across a cluster?

7.Which of the following is NOT typically a characteristic of NoSQL databases?

8.What is the main advantage of the Kappa Architecture over the Lambda Architecture?

9.A measure like AccountBalance that can be summed across different branches at a specific point in time, but not meaningfully across time periods, is classified as:

10.What is the primary role of a "staging area" in a data warehousing architecture?

Study Tips

  • Draw the star schema by hand: Sketch FactSales with four dimension tables and label the foreign keys and measures. Understanding the physical structure makes dimensional modeling intuitive.
  • Practice OLAP operations on a spreadsheet: Take a simple 3D dataset (product × region × month) and manually perform slice, dice, roll-up, and drill-down operations to solidify the concepts.
  • Trace an ETL pipeline end-to-end: Pick a real-world scenario (e.g., e-commerce orders → sales DW) and design each phase: what to extract, what transformations are needed, and how to load into star schema tables.
  • Know the SCD types cold: Exams frequently test when to use Type 1 vs. Type 2 vs. Type 3. Map each to a specific business scenario and memorize: Type 1 = Forget, Type 2 = Keep all, Type 3 = Keep some.

Related Topics