Why can't I just use my existing OLTP database for analytical reporting?

OLTP databases are optimized for high-volume, concurrent, short write transactions. Running complex analytical queries -- large table scans, aggregations, and multi-table joins -- directly on an OLTP system competes for resources with operational tasks, severely degrading performance and potentially causing outages. Data warehouses are specifically designed and optimized for these analytical workloads with denormalized schemas, columnar storage, and pre-aggregated views.

What is the primary difference between a Data Warehouse and a Data Lake?

A Data Warehouse stores structured, cleaned, and transformed data with a predefined schema-on-write, optimized for specific analytical queries and reporting. A Data Lake stores raw, multi-structured data (structured, semi-structured, unstructured) in its native format, deferring schema definition until read-time (schema-on-read). Data lakes offer flexibility for future analytical needs, while data warehouses provide immediate, refined insights. Data often flows from a data lake into a data warehouse after refinement.

When should I choose ETL versus ELT?

Choose ETL when your target has limited processing power, data quality issues require extensive cleansing before loading, or compliance mandates transformation before data lands in the final repository. Choose ELT when your target (e.g., Snowflake, BigQuery, or a data lake) has significant scalable processing power, you want schema-on-read flexibility, need to retain raw data for future unknown analytical needs, or you prioritize faster data ingestion.

How does a surrogate key in a dimension table help in a data warehouse?

A surrogate key is an artificial, system-generated primary key (typically an integer sequence) replacing the natural key from the source system. Benefits include: independence from source system key changes, enabling SCD Type 2 (a new surrogate key per version), faster joins (integer vs. string keys), and standardization across different source systems with inconsistent key formats.

What are the main advantages of Apache Spark over traditional MapReduce?

Spark achieves significantly faster performance (up to 100x for in-memory workloads) by caching intermediate data in RAM across iterations. It supports a wider range of workloads beyond batch processing -- interactive SQL (SparkSQL), real-time streaming (Spark Streaming), machine learning (MLlib), and graph processing (GraphX). It also provides high-level APIs in Python, Scala, Java, and R, making development far easier than raw MapReduce.

What is the purpose of conformed dimensions in a bus architecture?

Conformed dimensions are shared across multiple fact tables or data marts with identical definitions and content. They ensure that when different business processes are analyzed together, shared attributes (e.g., DimDate, DimCustomer, DimProduct) have the same meaning. This enables "drill-across" analysis -- combining data from different fact tables consistently -- and is central to Kimball's bus architecture for building an integrated enterprise data warehouse from incremental data marts.

DatabaseCollege

Data Warehousing & Big Data Concepts

Data warehousing is the practice of consolidating large volumes of historical data from disparate sources into a central, analytically optimized repository. Big Data extends this to datasets whose volume, velocity, and variety exceed the capabilities of conventional database tools, demanding distributed processing frameworks like Hadoop and Spark.

This guide covers OLTP vs. OLAP systems, dimensional modeling (star and snowflake schemas), ETL and ELT pipelines, OLAP cube operations, the Hadoop/Spark ecosystem, and a 10-question practice quiz.

1Introduction

In the contemporary data-driven landscape, organizations leverage vast quantities of information to derive actionable insights. This necessitates specialized systems distinct from traditional operational databases. A data warehouse (DW) is a subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management's decision-making process [Inmon, 1992].

Big Data refers to datasets whose size, velocity, and variety exceed the capabilities of conventional database software tools to capture, store, manage, and analyze [Laney, 2001]. It encompasses not only sheer volume but also rapid generation rates and diverse formats, necessitating distributed processing frameworks such as Apache Hadoop and Apache Spark.

In Practice

A large retail chain uses an OLTP system to process millions of daily sales transactions with full ACID guarantees. Concurrently, a data warehouse aggregates this transactional data nightly to enable business analysts to identify top-selling products by region, analyze seasonal trends, and optimize inventory -- workloads that would cripple the operational database if run directly against it.

Why Separate Systems?

Performance

OLAP queries scan billions of rows. Mixing with OLTP transactions creates resource contention and degrades both workloads.

Schema Design

OLTP favors normalized schemas for write efficiency. OLAP favors denormalized star schemas for read performance.

Historical Data

OLTP systems often purge old data. Data warehouses are non-volatile -- data is kept for years of historical trend analysis.

2Key Definitions

Essential terminology for data warehousing and big data at the university level.

Data Warehouse (DW)

Subject-oriented, integrated, time-variant, and non-volatile repository for management decision-making [Inmon, 1992].

Data Mart (DM)

A subset of the DW focused on a specific department or business function (e.g., Sales Mart, Finance Mart).

Data Lake

Repository storing raw data in its native format (structured, semi-structured, unstructured) with schema-on-read.

ETL

Extract, Transform, Load -- the classic three-phase pipeline for moving data from sources to the warehouse.

ELT

Extract, Load, Transform -- loads raw data into a powerful target first, then transforms within that system.

OLAP

Online Analytical Processing -- tools enabling fast, interactive, multidimensional analysis of historical data.

OLTP

Online Transaction Processing -- operational systems optimized for high-volume short read/write transactions with ACID guarantees.

Star Schema

Central fact table surrounded by denormalized dimension tables, joined via foreign keys. The most common DW model.

Snowflake Schema

Extension of the star schema where dimension tables are normalized into multiple related tables, reducing redundancy.

Fact Table

Central table containing quantitative measures (SalesAmount, Quantity) and foreign keys to all dimension tables.

Dimension Table

Contains descriptive attributes about business entities (product names, customer demographics, calendar dates).

Data Cube

Multidimensional array of data where each axis is a business dimension and cells contain aggregated measures.

SCD (Slowly Changing Dimension)

A dimension whose attributes change over time. Types 1/2/3 handle history by overwriting, adding rows, or adding columns.

HDFS

Hadoop Distributed File System -- stores petabyte-scale files across commodity hardware clusters with replication.

MapReduce

Programming model for parallel batch processing: Map phase (filter/sort) + Reduce phase (aggregate/summarize).

Apache Spark

Fast distributed processing engine using in-memory RDDs, DataFrames, and SparkSQL -- up to 100x faster than MapReduce for iterative workloads.

3OLTP vs. OLAP

The fundamental architectural divide in database systems separates systems optimized for operational transactions from those optimized for analytical queries. Understanding these differences is essential for designing the right system for each workload.

Characteristic	OLTP	OLAP
Purpose	Day-to-day operations	Strategic analysis & reporting
Operations	INSERT, UPDATE, DELETE	SELECT with aggregations
Query Complexity	Simple, short transactions	Complex, long-running queries
Data Volume	Current data (small per query)	Historical data (billions of rows)
Schema	Normalized (3NF)	Denormalized (star/snowflake)
Concurrency	Thousands of users	Dozens of analysts
Consistency	Strict ACID	Eventual (often acceptable)
Response Time	Milliseconds	Seconds to minutes
Examples	E-commerce checkout, ATM	Sales dashboard, forecasting

OLTP Characteristics

High write throughput with full ACID
Normalized schema minimizes redundancy
Row-oriented storage optimized for updates
Small result sets per query
Examples: PostgreSQL, MySQL, Oracle

OLAP Characteristics

Read-heavy with large table scans
Denormalized schema reduces joins
Columnar storage optimized for aggregations
Large result sets aggregated into summaries
Examples: Snowflake, BigQuery, Redshift

4Data Warehouse Architecture

A canonical data warehouse follows a three-tier architecture: source systems at the bottom, the warehouse core in the middle, and BI tools at the top.

Data Warehouse Three-Tier Architecture

Data flows from operational source systems through ETL into the warehouse and out to analytical tools.

Three-tier DW architecture: source systems feed ETL, which populates the warehouse, which serves BI tools.

Three-Tier Breakdown

Tier 1 -- Source Layer

OLTP databases, flat files, ERP/CRM systems, external APIs, and web logs. The staging area temporarily holds extracted data before transformation, isolating source systems from the ETL process.

Tier 2 -- Storage & Integration Layer

The Enterprise Data Warehouse (EDW) storing integrated, transformed, historical data in star/snowflake schemas. Data Marts provide department-specific subsets. The metadata repository tracks lineage, quality rules, and business definitions.

Tier 3 -- Presentation Layer

OLAP tools, dashboards (Tableau, Power BI), ad-hoc query tools, data mining applications, and scheduled reports. Consumers interact only with this layer, never directly with source systems.

Kimball vs. Inmon Approach

Kimball (Bottom-Up)

Build data marts first per business process
Integrate via conformed dimensions
Delivers business value incrementally
Bus architecture with shared dimensions

Inmon (Top-Down)

Build centralized EDW first (3NF)
Derive data marts from the EDW
Single authoritative source of truth
Higher upfront investment, more consistent

Metadata Management

Metadata is critical for governance and trust. Technical metadata describes schema, ETL processes, and data lineage. Business metadata provides human-readable definitions and hierarchies. Operational metadata records load times, error logs, and refresh rates. Without robust metadata, analysts cannot trust or interpret the data they are analyzing.

5Star Schema

The star schema is the most prevalent data warehouse model due to its simplicity and query performance. A central fact table is surrounded by multiple dimension tables, each connected via a foreign key relationship -- resembling a star when drawn.

Star Schema

A central FactSales table connected to four dimension tables via foreign keys -- the most common data warehouse model.

Star schema: one central fact table (blue) with measures, four surrounding dimension tables with descriptive attributes.

Fact Table Measure Types

Additive

Can be summed across ALL dimensions. The most flexible.

e.g., SalesAmount, Quantity

Semi-Additive

Can be summed across some dimensions but NOT time (snapshot values).

e.g., AccountBalance, InventoryLevel

Non-Additive

Cannot be meaningfully summed across any dimension. Use AVG, COUNT, MIN, MAX instead.

e.g., UnitPrice, MarginPercentage

Advantages & Disadvantages

Advantages

Simple, intuitive structure easy to understand
Fewer joins -- faster query performance
Optimized for OLAP star queries
Easy to add new dimensions
Business users can navigate without DBA help

Disadvantages

Data redundancy in dimension tables
Higher storage consumption for dimensions
Updates to dimension attributes require care
Not optimal for write-heavy workloads

Slowly Changing Dimensions (SCD)

SCD Type 1 -- Overwrite

New data overwrites the original. No history preserved. Simple but lossy. Use when history does not matter (e.g., correcting a typo).

SCD Type 2 -- Add New Row

A new record is inserted for each change with StartDate/EndDate and IsCurrent flag. Preserves complete history for “as-of” reporting. Most common choice when historical accuracy matters.

SCD Type 3 -- Add New Column

A new column stores the previous value. Preserves only one prior state. Limited history but no row proliferation. Use when only “before and after” is needed.

6Snowflake Schema

The snowflake schema is an extension of the star schema where dimension tables are normalized into multiple related tables. Instead of one flat DimProduct table, you might have DimProduct → DimProductCategory → DimProductBrand, forming a hierarchical snowflake pattern.

Star Schema

DimProduct
├── ProductKey (PK)
├── ProductName
├── ProductCategory   ← denormalized
├── CategoryDesc      ← denormalized
├── ProductBrand      ← denormalized
└── BrandCountry      ← denormalized

Snowflake Schema

DimProduct
├── ProductKey (PK)
├── ProductName
├── CategoryKey (FK) → DimCategory
│     ├── CategoryKey (PK)
│     └── CategoryDesc
└── BrandKey (FK) → DimBrand
      ├── BrandKey (PK)
      ├── BrandName
      └── BrandCountry

Star vs. Snowflake Comparison

Aspect	Star Schema	Snowflake Schema
Query Complexity	Simple (fewer joins)	Complex (more joins)
Query Performance	Faster	Slower
Data Redundancy	Higher	Lower
Storage (dimensions)	More	Less
Maintenance	Simpler	More complex
Best For	Most DW scenarios	Very large, stable dimensions

Memory Aid

Star is simple and fast, but fat (redundant). Snowflake is complex and slow, but slim (normalized). In practice, the star schema's performance advantages almost always outweigh the storage savings of the snowflake schema for typical data warehouse query patterns.

7ETL & ELT Pipelines

The data pipeline is the backbone of any data warehouse, responsible for moving, cleaning, and reshaping data from operational sources into the analytical repository.

ETL Pipeline

Three-phase process for moving data from operational sources into the data warehouse.

Extract

Read data from disparate source systems.

Full Extraction

Pull all source data

Incremental (CDC)

Pull only changed rows

Sources

OLTP DBs, APIs, files

→

↓

Transform

Clean and reshape data for the warehouse.

Cleaning

Fix nulls, dedup, errors

Standardization

Formats, units, codes

Aggregation

Summarize & derive fields

Denormalization

Flatten for star schema

→

↓

Load

Write transformed data into the target system.

Batch Loading

Nightly / scheduled loads

Stream Loading

Near real-time ingestion

Targets

DW, Data Mart, Data Lake

Metadata Repository tracks lineage, data quality rules, load times, and business definitions across all three phases.

ETL pipeline: Extract from sources, Transform in staging, Load into warehouse. ELT reverses Transform and Load steps.

ETL vs. ELT

ETL (Traditional)

Transform data before loading into the warehouse. Transformation happens in a separate ETL engine.

Data leaves source systems cleaned
Good when target has limited compute
Better for strict data privacy requirements
Tools: Informatica, SSIS, Talend

ELT (Modern / Cloud)

Load raw data first, then transform within the powerful target system using SQL or Spark.

Raw data preserved for future use
Leverages cloud DW compute (Snowflake, BigQuery)
Faster ingestion, schema-on-read flexibility
Tools: dbt, Fivetran, Airbyte

Change Data Capture (CDC)

Full extraction re-reads all source data on every run -- simple but resource-intensive. CDC identifies only changed records, enabling efficient incremental loads. Common CDC methods:

Timestamp Columns

Filter by last_modified_date > last_run. Simple but requires source schema support.

Database Triggers

Triggers fire on INSERT/UPDATE/DELETE and write changes to a CDC log table. Real-time but adds write overhead.

Log-Based CDC

Reads the database transaction log (redo log). Zero impact on source performance. Most scalable approach.

Snapshot Comparison

Compares current data with a previous snapshot to identify differences. Accurate but slow for large tables.

8OLAP Operations & Cube

OLAP (Online Analytical Processing) enables fast, interactive, multidimensional analysis. A data cube conceptually represents data as a grid where each axis is a business dimension (Time, Product, Region) and each cell contains aggregated measures (Total Sales, Average Price).

OLAP Data Cube & Operations

A 3D cube with Time, Product, and Region axes. OLAP operations navigate and aggregate across dimensions.

The OLAP cube: yellow band = Slice (Q1), green cell = Dice (North, Laptop, Q1). Operations annotated on right.

OLAP Operations Explained

Slice -- Select one value on one dimension

Reduces a 3D cube to a 2D plane. Example: “Show all data for Q1 2023 only.” Time dimension is fixed; Product and Region remain free.

Dice -- Select ranges on multiple dimensions

Creates a sub-cube by constraining multiple dimensions simultaneously. Example: “Q1 AND Q2, for Electronics AND Phones, in North AND East.”

Roll-up (Drill-up) -- Aggregate to higher granularity

Climbs up a dimension hierarchy. Example: Daily Sales → Monthly Sales → Annual Sales. Or City → State → Region.

Drill-down -- Navigate to lower granularity

Descends a dimension hierarchy to see more detail. Example: Monthly Sales → Daily Sales. Or Region → State → City → Store.

Pivot (Rotate) -- Swap dimensions across axes

Reorients the report view. Example: From “Products as rows, Regions as columns” to “Regions as rows, Products as columns.”

OLAP Storage Architectures

MOLAP

Pre-aggregated multidimensional arrays. Fastest queries but limited scalability.

ROLAP

Data stored in RDBMS. SQL-based. Highly scalable but slower due to on-the-fly aggregation.

HOLAP

Hybrid: detailed data in RDBMS, aggregates in multidimensional store. Balances speed and scale.

OLAP Operations Mnemonic (S.D.R.D.P)

Slice (cut one layer) → Dice (cut a sub-cube) → Roll-up (go higher) → Drill-down (go deeper) → Pivot (rotate the view).

9Big Data & the Hadoop Ecosystem

Big Data is characterized by the 3 V's: Volume (terabytes to exabytes exceeding single-machine capacity), Velocity (real-time or near-real-time data streams), and Variety (structured, semi-structured, and unstructured formats). Additional V's include Veracity (data trustworthiness) and Value (business insight potential).

Hadoop / Spark Ecosystem Layers

Stacked architecture: storage at the base, resource management in the middle, processing engines above, and analytics tools on top.

Hadoop/Spark ecosystem: HDFS stores data, YARN manages resources, Spark/MapReduce process it, Hive/Pig query it.

Hadoop Core Components

HDFS -- Hadoop Distributed File System

Stores very large files by breaking them into blocks (default 128MB) distributed across DataNodes with 3x replication for fault tolerance. The NameNode manages the file system namespace and block locations. The DataNodes store the actual data blocks. Optimized for write-once, read-many batch access patterns.

MapReduce -- Batch Processing

Map phase: each worker independently processes an input split and emits intermediate (key, value) pairs. Shuffle & Sort: all values for the same key are grouped together. Reduce phase: aggregates the grouped values to produce the final output. Limitation: disk-based, poor for iterative algorithms.

YARN -- Resource Management

Separates resource management from data processing, enabling multiple engines (Spark, MapReduce, Flink) to share the same cluster. The ResourceManager allocates cluster resources globally. NodeManagers launch and monitor containers on each machine.

Apache Spark

Spark achieves dramatically faster performance than MapReduce by keeping intermediate data in memory using Resilient Distributed Datasets (RDDs) -- fault-tolerant, immutable, distributed collections that can be cached across iterations.

RDDs

Low-level: transformations (lazy) + actions (eager). Fault-tolerant via lineage graph. Best for unstructured data.

DataFrames

Structured, schema-aware tables. Catalyst Optimizer generates efficient query plans. Higher performance than raw RDDs for structured data.

SparkSQL

Run SQL or HiveQL directly against DataFrames and external sources (Hive, JDBC, Parquet). Bridges SQL skills with big data processing.

Lambda vs. Kappa Architecture

Lambda Architecture

Combines a Batch Layer (accurate, high-latency) with a Speed Layer (approximate, low-latency). A Serving Layer merges both views for queries.

Pro: high accuracy + low latency
Con: two codebases to maintain

Kappa Architecture

Eliminates the batch layer. All data -- real-time and historical -- is processed by a single stream processing engine. Historical reprocessing is done by replaying the immutable event log (e.g., Kafka).

Pro: simpler, less code duplication
Con: replaying large logs takes time

10Common Mistakes

Confusing OLTP and OLAP Design Goals

Running analytical queries directly on the OLTP database

OLTP systems are not optimized for large scans and aggregations. Route analytical queries to a dedicated data warehouse with denormalized schemas and columnar storage. Mixing workloads degrades both.

Over-Normalizing Dimension Tables

Applying 3NF normalization to dimension tables in a DW

Data warehouses prioritize query performance over storage efficiency for dimensions. Use a star schema with denormalized dimensions unless redundancy is a critical issue. Every extra join slows analytical queries.

Ignoring Slowly Changing Dimensions

Using SCD Type 1 (overwrite) when historical accuracy is required

If business questions like “What were sales when this customer lived in City A?” matter, you need SCD Type 2. Type 1 destroys history. Always analyze business requirements for historical reporting before choosing an SCD type.

Wrong Fact Table Granularity

Storing pre-aggregated data in fact tables instead of atomic-level records

Fact tables should capture data at the lowest atomic level (individual transaction line items, not daily totals). Higher-level aggregates can always be computed with GROUP BY or stored as materialized views. Pre-aggregating loses analytical flexibility forever.

Underestimating Data Quality

Loading data without profiling, cleansing, and validating it

Bad data in = bad analytics out. Invest heavily in data profiling (understand source data characteristics), cleansing (fix errors, nulls, duplicates), and validation rules in the ETL pipeline. Poor data quality destroys analyst trust in the warehouse.

Not Using Conformed Dimensions

Building data marts with inconsistent dimension definitions

Conformed dimensions (shared across multiple fact tables) are essential for consistent cross-functional analysis. Without them, “January 2023” in the Sales Mart may mean something different from “January 2023” in the Inventory Mart, making drill-across analysis unreliable.

Frequently Asked Questions

Why can't I just use my existing OLTP database for analytical reporting?: OLTP databases are optimized for high-volume, concurrent, short write transactions. Running complex analytical queries -- large table scans, aggregations, and multi-table joins -- directly on an OLTP system competes for resources with operational tasks, severely degrading performance and potentially causing outages. Data warehouses are specifically designed and optimized for these analytical workloads with denormalized schemas, columnar storage, and pre-aggregated views.
What is the primary difference between a Data Warehouse and a Data Lake?: A Data Warehouse stores structured, cleaned, and transformed data with a predefined schema-on-write, optimized for specific analytical queries and reporting. A Data Lake stores raw, multi-structured data (structured, semi-structured, unstructured) in its native format, deferring schema definition until read-time (schema-on-read). Data lakes offer flexibility for future analytical needs, while data warehouses provide immediate, refined insights. Data often flows from a data lake into a data warehouse after refinement.
When should I choose ETL versus ELT?: Choose ETL when your target has limited processing power, data quality issues require extensive cleansing before loading, or compliance mandates transformation before data lands in the final repository. Choose ELT when your target (e.g., Snowflake, BigQuery, or a data lake) has significant scalable processing power, you want schema-on-read flexibility, need to retain raw data for future unknown analytical needs, or you prioritize faster data ingestion.
How does a surrogate key in a dimension table help in a data warehouse?: A surrogate key is an artificial, system-generated primary key (typically an integer sequence) replacing the natural key from the source system. Benefits include: independence from source system key changes, enabling SCD Type 2 (a new surrogate key per version), faster joins (integer vs. string keys), and standardization across different source systems with inconsistent key formats.
What are the main advantages of Apache Spark over traditional MapReduce?: Spark achieves significantly faster performance (up to 100x for in-memory workloads) by caching intermediate data in RAM across iterations. It supports a wider range of workloads beyond batch processing -- interactive SQL (SparkSQL), real-time streaming (Spark Streaming), machine learning (MLlib), and graph processing (GraphX). It also provides high-level APIs in Python, Scala, Java, and R, making development far easier than raw MapReduce.
What is the purpose of conformed dimensions in a bus architecture?: Conformed dimensions are shared across multiple fact tables or data marts with identical definitions and content. They ensure that when different business processes are analyzed together, shared attributes (e.g., DimDate, DimCustomer, DimProduct) have the same meaning. This enables "drill-across" analysis -- combining data from different fact tables consistently -- and is central to Kimball's bus architecture for building an integrated enterprise data warehouse from incremental data marts.

Practice Quiz

Test your understanding of data warehousing and big data concepts -- select the correct answer for each question.

1.Which of the following is NOT a characteristic of a Data Warehouse according to Inmon?

2.Which schema is generally preferred for query performance in data warehouses due to fewer joins?

3.In an ETL process, which phase is responsible for data cleansing, standardization, and aggregation?

4.Which type of Slowly Changing Dimension (SCD) preserves full historical accuracy by adding a new row for each change?

5.Which OLAP operation allows you to change the perspective of the data by swapping dimensions on different axes?

6.Which component of Hadoop is responsible for distributed storage of large files across a cluster?

7.Which of the following is NOT typically a characteristic of NoSQL databases?

8.What is the main advantage of the Kappa Architecture over the Lambda Architecture?

9.A measure like AccountBalance that can be summed across different branches at a specific point in time, but not meaningfully across time periods, is classified as:

10.What is the primary role of a "staging area" in a data warehousing architecture?

Study Tips

Draw the star schema by hand: Sketch FactSales with four dimension tables and label the foreign keys and measures. Understanding the physical structure makes dimensional modeling intuitive.
Practice OLAP operations on a spreadsheet: Take a simple 3D dataset (product × region × month) and manually perform slice, dice, roll-up, and drill-down operations to solidify the concepts.
Trace an ETL pipeline end-to-end: Pick a real-world scenario (e.g., e-commerce orders → sales DW) and design each phase: what to extract, what transformations are needed, and how to load into star schema tables.
Know the SCD types cold: Exams frequently test when to use Type 1 vs. Type 2 vs. Type 3. Map each to a specific business scenario and memorize: Type 1 = Forget, Type 2 = Keep all, Type 3 = Keep some.

1Introduction

Why Separate Systems?

2Key Definitions

3OLTP vs. OLAP

4Data Warehouse Architecture

Data Warehouse Three-Tier Architecture

Three-Tier Breakdown

Kimball vs. Inmon Approach

Metadata Management

5Star Schema

Star Schema

Fact Table Measure Types

Advantages & Disadvantages

Slowly Changing Dimensions (SCD)

6Snowflake Schema

Star vs. Snowflake Comparison

7ETL & ELT Pipelines

ETL Pipeline

ETL vs. ELT

Change Data Capture (CDC)

8OLAP Operations & Cube

OLAP Data Cube & Operations

OLAP Operations Explained

OLAP Storage Architectures

9Big Data & the Hadoop Ecosystem

Hadoop / Spark Ecosystem Layers

Hadoop Core Components

Apache Spark

Lambda vs. Kappa Architecture

10Common Mistakes

Frequently Asked Questions

Practice Quiz

Study Tips

Related Topics