When should I use a Hash Index vs. a B+Tree Index?

Use a Hash Index when your queries consist exclusively of equality lookups (WHERE key = value) and you do not need range scans, ORDER BY, or partial key matching. Hash indexes provide O(1) average lookup time for exact matches. Use a B+Tree Index (the default in most DBMS) for virtually all other cases, including range queries, ORDER BY, LIKE prefix matching, and composite key lookups. B+Tree indexes offer O(log N) lookups and efficient range scans via linked leaf nodes.

What is the difference between RANK() and DENSE_RANK()?

Both assign ranks to rows within a partition based on the ORDER BY clause. RANK() leaves gaps after ties: if two rows tie for rank 1, the next row receives rank 3. DENSE_RANK() does not leave gaps: the next row after two rank-1 ties receives rank 2. Use DENSE_RANK() when you need consecutive ranking (e.g., "top 3 products") and RANK() when the gap conveys how many items share a higher rank.

How do I read an EXPLAIN/ANALYZE output?

EXPLAIN shows the planned execution strategy without running the query. EXPLAIN ANALYZE actually executes it and shows real timings. Key things to look for: (1) Scan type -- index scan vs. sequential/full table scan; (2) Join algorithm -- nested loop, hash join, or merge join; (3) Estimated vs. actual row counts (large discrepancies indicate stale statistics); (4) Cost estimates for each node; (5) Sort or hash operations that may spill to disk. Start reading from the innermost (most indented) operation outward.

When should I use a CTE vs. a subquery?

Use a CTE when you need to reference the same derived result set multiple times in one query, when the query logic benefits from being broken into named steps for readability, or when you need recursion (hierarchical traversal). Use a subquery for simple, one-off derived tables. Performance-wise, some DBMS inline CTEs like subqueries; others materialize them. Check your DBMS documentation and test with EXPLAIN.

What is a covering index and when should I create one?

A covering index includes all columns referenced in a query (SELECT, WHERE, ORDER BY, GROUP BY), allowing the DBMS to satisfy the query entirely from the index without accessing the base table. Create one when you have a critical, frequently executed query that reads a small subset of columns from a large table. The trade-off is increased index size and slower writes (INSERT/UPDATE/DELETE), so only create covering indexes for high-impact queries.

Why do my queries slow down even though I have indexes?

Common reasons: (1) The optimizer chooses a full table scan because the query returns a large percentage of rows, making the index less efficient than sequential reading. (2) Index columns are wrapped in functions (e.g., WHERE UPPER(name) = ...), preventing index usage. (3) Statistics are outdated -- run ANALYZE/UPDATE STATISTICS. (4) The query uses OR conditions that prevent index intersection. (5) A composite index exists but the query does not filter on the leading column (left-prefix rule violation). (6) Implicit type conversions cause index bypass.

DatabaseCollege

Advanced SQL & Query Optimization

Advanced SQL and Query Optimization delves into the sophisticated capabilities of SQL beyond basic data manipulation, coupled with the critical techniques employed by database management systems to efficiently execute queries. It involves understanding complex SQL constructs, the internal workings of a query optimizer, and strategies to write performant queries and design optimal database schemas.

This guide covers query processing architecture, index structures (B+Tree, Hash, Bitmap), join algorithms, window functions, CTEs, optimization strategies, SQL walkthroughs, and a 10-question practice quiz.

1Introduction

In the realm of modern database systems, merely writing syntactically correct SQL queries is insufficient for robust application development. As data volumes grow and query complexity escalates, efficient data retrieval becomes paramount. Advanced SQL encompasses sophisticated language constructs that enable complex analytical operations, hierarchical data traversal, and data transformation. Query Optimization is the process by which a DBMS identifies the most efficient method to execute a given SQL query, aiming to minimize resource consumption (CPU, I/O, memory) and latency.

This topic bridges the gap between the declarative nature of SQL (what to retrieve) and the procedural execution within the DBMS (how to retrieve it). It involves deep theoretical concepts from relational algebra, algorithms, data structures, and statistical analysis for cost estimation.

In Practice

Consider a large retail chain analyzing customer purchasing patterns. A simple SELECT * query on a multi-terabyte sales table without proper indexing or an optimized join strategy could take hours or crash the system. By employing window functions to calculate rolling averages and ensuring covering indexes on frequently queried columns, the same analysis can be performed in seconds, enabling real-time business intelligence.

Why It Matters

Industry Applications

Unoptimized queries in high-transaction environments cause slowdowns, timeouts, and resource exhaustion, directly impacting operations.

Theoretical Significance

Draws from algorithms, data structures, graph theory, and statistical inference for heuristic search and cost modeling.

Systems Design

Understanding how declarative SQL is translated into efficient procedural operations informs storage engine and indexing design.

2Key Definitions

Essential terminology for understanding advanced SQL and query optimization at the university level.

Query Optimizer

DBMS component that finds the most efficient execution plan using cost estimation and database statistics

Execution Plan

Step-by-step description of access methods, join algorithms, and operation order for a query

Cost Estimation

Predicting CPU, I/O, and memory consumption of candidate plans using table statistics

Cardinality Estimation

Predicting row counts from relational operators; foundational for cost estimation

Selectivity

Fraction of rows satisfying a predicate; low selectivity (0.01) means highly restrictive

Predicate Pushdown

Applying WHERE conditions as early as possible to reduce intermediate result sizes

Subquery Unnesting

Transforming subqueries into equivalent joins for better optimization opportunities

Query Rewriting

Transforming SQL into an equivalent but more efficient form (e.g., view merging, predicate simplification)

Join Ordering

Sequence in which tables are joined; drastically affects intermediate result sizes and cost

Covering Index

Index containing all columns a query needs, enabling index-only scan without table access

Window Function

SQL function computing a value over a “window” of rows related to the current row, without collapsing rows

Common Table Expression

Temporary named result set defined with WITH clause; supports recursion for hierarchical queries

3Query Processing Architecture

The journey of an SQL query from user input to result set is a multi-stage process handled by the DBMS's query processor. Understanding this pipeline is essential for diagnosing and resolving performance issues.

Query Processing Pipeline

SQL Query

1Parser

Tokenize + syntax check → AST

2Semantic Analyzer

Validate names, types, permissions

3Logical Plan

Relational algebra: σ, π, ⋈, ∪

4Query Optimizer

Cost-based · Statistics · Join ordering

5Query Executor

B+Tree scan, Hash Join, Sort

Result Set

Stage 4 — Optimizer detail: Generates multiple candidate plans (e.g., Seq Scan + Nested Loop vs. Index Scan + Hash Join), estimates cost using table statistics, histograms, and index selectivity, then picks the cheapest plan.

1. Parsing & Lexical Analysis

The SQL string is tokenized into keywords, identifiers, and operators. The parser checks syntax against the SQL grammar and constructs an Abstract Syntax Tree (AST).

2. Semantic Analysis

Verifies that table/column names exist, data types are compatible, and the user has appropriate permissions. Views are expanded and external references resolved.

3. Logical Query Plan

The validated query is translated into relational algebra operators (selection, projection, join, union). Describes what to do, not how.

4. Query Optimization

The optimizer generates multiple candidate physical plans, estimates their cost using database statistics (table sizes, histograms, index selectivity), and selects the cheapest. Modern optimizers are cost-based, using dynamic programming (Selinger algorithm) for join ordering.

5. Physical Plan Selection & Execution

The chosen plan specifies concrete algorithms: B+Tree index scan, hash join, sort-merge join, etc. The query executor runs these operations against the storage engine and returns results.

Key Insight

Cost-Based vs. Rule-Based: Cost-based optimizers use statistics to evaluate many plans and pick the cheapest. Rule-based optimizers apply fixed heuristics (“perform selections before joins”). Cost-based is dominant today because it adapts to actual data distributions, but depends on up-to-date statistics -- always run ANALYZE / UPDATE STATISTICS.

4Index Structures

Indexes are specialized data structures that speed up data retrieval. Choosing the right index type and columns is one of the most impactful optimization decisions.

B+Tree Indexes

The most common index type in relational databases. A balanced tree where all data pointers reside in leaf nodes, which are linked sequentially for efficient range scans.

Building a B+Tree Index

Insert keys one at a time, observe node splits, then search and range scan.

// Click Play or Step Forward to build the B+Tree

RootLeafLeaf link (range scan)ActiveOverflow (split)

Step through to see how a B+Tree is built with inserts, splits, and searches.

Step 0 / 15

Root Node

Entry point for every search; contains keys and pointers to internal nodes.

Internal Nodes

Guide searches with keys and child pointers. High fan-out factor keeps tree height low.

Leaf Nodes

Contain keys + data pointers. Linked sequentially for efficient range scans.

-- B+Tree lookup: O(log_f N) where f = fan-out, N = entries
-- Typically 3-5 levels for millions of records
-- Search: root -> internal nodes -> leaf node

-- Example: Create a B+Tree index
CREATE INDEX idx_orders_date ON Orders (order_date);

-- This index supports:
-- 1. Equality:  WHERE order_date = '2025-01-15'
-- 2. Range:     WHERE order_date BETWEEN '2025-01-01' AND '2025-03-31'
-- 3. ORDER BY:  ORDER BY order_date ASC/DESC
-- 4. Min/Max:   SELECT MIN(order_date), MAX(order_date)

Hash Indexes

Use a hash function to map keys directly to storage buckets. Provide O(1) average lookup for exact matches but cannot support range queries, ORDER BY, or partial key matching.

Hash Index: Insert & Lookup

O(1) exact-match lookup using h(key) = key MOD 4

// Click Play to build a hash index

Bucket 0

204

Bob

Bucket 1

101

Alice

305

Carol

Bucket 2

410

Dave

Bucket 3

503

Eve

Hash indexes map keys directly to buckets for O(1) average lookup.

Step 0 / 8

Bitmap Indexes

Create a bit array for each distinct value in the indexed column. Exceptionally effective for low-cardinality columns (gender, status, boolean flags) with multiple AND/OR conditions, resolved via bitwise operations.

Bitmap Index: Build & Query

Bit vectors for low-cardinality columns, combined with bitwise AND/OR

// Click Play to build bitmap indexes

Row	Name	Status	Dept
1	Alice	Active	Sales
2	Bob	Active	Engineering
3	Carol	Inactive	Sales
4	Dave	Active	Engineering
5	Eve	Inactive	Marketing
6	Frank	Active	Sales

Bitmap Vectors

status="Active"

110101

status="Inactive"

001010

dept="Sales"

101001

dept="Engineering"

010100

dept="Marketing"

000010

Bitmap indexes create a bit vector per distinct value. Combined with bitwise AND/OR for multi-column queries.

Step 0 / 8

Feature	B+Tree	Hash	Bitmap
Equality Lookup	O(log N)	O(1) avg	O(1) bitwise
Range Query	Excellent	Not supported	Limited
ORDER BY	Yes (sorted leaves)	No	No
Best Cardinality	Any	High	Low
Update Cost	Moderate	Low	High

Clustered vs. Non-Clustered & Composite Indexes

Clustered Index

Determines the physical storage order of data rows. Only one per table (typically the primary key). Fast for range queries and sequential access.

Non-Clustered Index

Separate structure with pointers to data rows. Multiple per table. Requires “bookmark lookup” for columns not in the index (unless it is a covering index).

-- Composite index: left-prefix rule
CREATE INDEX idx_cust_name ON Customers (LastName, FirstName);

-- Usable for:
--   WHERE LastName = 'Smith'                   (leading column)
--   WHERE LastName = 'Smith' AND FirstName = 'John'  (full prefix)
-- NOT efficiently usable for:
--   WHERE FirstName = 'John'                   (not leading column)

-- Covering index: avoids table lookup
CREATE INDEX idx_order_cover ON Orders (customer_id)
  INCLUDE (order_date, total_amount);

-- This query uses index-only scan:
SELECT order_date, total_amount
FROM Orders WHERE customer_id = 123;

5Join Algorithms

The choice of join algorithm dramatically affects query performance. The optimizer selects the best algorithm based on table sizes, available indexes, available memory, and the join condition.

Join Ordering: Why It Matters

Same query, different join orders — drastically different intermediate result sizes and costs.

Nested Loop Join

For each row in the outer table, scan the inner table for matching rows. With an index on the inner table's join column, each probe is O(log N).

Best for: Small outer table or highly selective join with index. Cost: O(N * M) without index, O(N * log M) with index.

Hash Join

Build a hash table on the smaller (build) table using the join key, then probe it with each row from the larger (probe) table.

Best for: Large, unsorted tables with equality joins. Cost: O(N + M) if hash table fits in memory.

Sort-Merge Join

Sort both tables on the join key, then merge them in a single pass. Efficient when data is already sorted (e.g., clustered index) or when the result must be ordered.

Best for: Pre-sorted data or when output must be ordered. Cost: O(N log N + M log M) for sorting + O(N + M) for merge.

Algorithm	Best When	Join Type	Memory Need
Nested Loop	Small outer, index on inner	Any (incl. non-equi)	Low
Hash Join	Large unsorted tables	Equi-join only	High (hash table)
Sort-Merge	Pre-sorted or output needs order	Equi-join, range	Moderate

6Window Functions & CTEs

Window Functions

Window functions perform calculations across a set of rows related to the current row, returning a value for each row without collapsing them (unlike GROUP BY). They are defined by the OVER() clause.

Window Functions Step by Step

PARTITION BY → ORDER BY → apply function per row

-- Step 1: Source table
SELECT name, dept, salary
FROM Employees;
-- Step 2: PARTITION BY dept
-- Divides rows into groups
OVER (PARTITION BY dept ...)
-- Step 3: ORDER BY salary DESC
-- Sorts within each partition
OVER (PARTITION BY dept
      ORDER BY salary DESC)
-- Step 4: ROW_NUMBER()
ROW_NUMBER() OVER (
  PARTITION BY dept
  ORDER BY salary DESC
)
-- Step 5: RANK()
RANK() OVER (
  PARTITION BY dept
  ORDER BY salary DESC
)
-- Step 6: DENSE_RANK()
DENSE_RANK() OVER (
  PARTITION BY dept
  ORDER BY salary DESC
)
-- Step 7: Window frame
SUM(salary) OVER (
  PARTITION BY dept
  ORDER BY salary DESC
  ROWS BETWEEN UNBOUNDED PRECEDING
       AND CURRENT ROW
) AS running_total
-- Step 8: LAG()
LAG(salary, 1) OVER (
  PARTITION BY dept
  ORDER BY salary DESC
) AS prev_salary

Name	Dept	Salary
Alice	CS	90,000
Bob	CS	85,000
Carol	CS	90,000
Dave	MATH	75,000
Eve	MATH	70,000

Step through to see how PARTITION BY, ORDER BY, and ranking functions work on each row.

Step 0 / 8

ROW_NUMBER()

Unique sequential integer per row. Always 1, 2, 3... even for ties.

RANK()

Same rank for ties, with gaps. E.g., 1, 1, 3, 4.

DENSE_RANK()

Same rank for ties, no gaps. E.g., 1, 1, 2, 3.

LAG() / LEAD()

Access rows before/after current row within the partition.

Default Frame Warning

When ORDER BY is present, the default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW -- a cumulative window. When ORDER BY is absent, the default is the entire partition. Always specify the frame explicitly to avoid surprises.

Common Table Expressions (CTEs)

CTEs provide temporary named result sets that improve query readability and enable recursive queries for hierarchical data traversal.

-- Non-recursive CTE: breaking complex queries into steps
WITH RegionalSales AS (
    SELECT region, SUM(amount) AS total_sales
    FROM Sales
    GROUP BY region
),
TopRegions AS (
    SELECT region, total_sales
    FROM RegionalSales
    WHERE total_sales > 100000
)
SELECT * FROM TopRegions ORDER BY total_sales DESC;

-- Recursive CTE: traversing an org chart hierarchy
WITH RECURSIVE OrgChart AS (
    -- Anchor: top-level managers (no manager_id)
    SELECT employee_id, name, manager_id, 1 AS level
    FROM Employees
    WHERE manager_id IS NULL

    UNION ALL

    -- Recursive: find direct reports
    SELECT e.employee_id, e.name, e.manager_id, oc.level + 1
    FROM Employees e
    JOIN OrgChart oc ON e.manager_id = oc.employee_id
)
SELECT * FROM OrgChart ORDER BY level, name;

7Query Optimization Strategies

The query optimizer employs a sophisticated array of techniques to transform logical plans into efficient physical execution plans. Understanding these strategies helps you write queries that the optimizer can efficiently execute.

Predicate Pushdown

Move WHERE conditions as close to the data source as possible, filtering rows before joins or aggregations.

-- Instead of filtering after join:
SELECT * FROM (Orders JOIN Customers ON ...) WHERE City = 'London'
-- Optimizer pushes predicate down:
SELECT * FROM Orders JOIN (SELECT * FROM Customers WHERE City = 'London') ON ...

Join Reordering

The optimizer reorders joins to minimize intermediate result set sizes. For N tables, there are N!/2 possible join orders. The Selinger algorithm uses dynamic programming to explore this space efficiently.

Subquery Unnesting

Transforms subqueries into joins, enabling the optimizer to apply join reordering and algorithm selection.

-- Before: correlated subquery (executed once per outer row)
SELECT name FROM Employees
WHERE dept_id IN (SELECT id FROM Departments WHERE location = 'NYC');

-- After: unnested to join (single scan)
SELECT E.name FROM Employees E
JOIN Departments D ON E.dept_id = D.id WHERE D.location = 'NYC';

Constant Folding & Expression Simplification

Pre-evaluate constant expressions at compile time. WHERE price * 1.08 > 100 becomes WHERE price > 92.59. WHERE (X > 5 AND X < 3) simplifies to WHERE FALSE.

UNION ALL vs. UNION DISTINCT

Use UNION ALL when duplicates are acceptable -- it avoids the expensive sort/hash step required for duplicate elimination in UNION (which implies UNION DISTINCT).

Reading EXPLAIN Output

Use EXPLAIN (planned) or EXPLAIN ANALYZE (actual execution) to inspect the optimizer's chosen plan. Look for: Seq Scan on large tables (missing index?), large row estimates mismatching actual rows (stale statistics?), Sort/Hash nodes (memory-intensive), and Nested Loop on large tables (should be Hash Join?).

8SQL Walkthroughs

Introductory

Running Total with Window Function

SELECT
    p.PurchaseID,
    p.CustomerID,
    p.PurchaseDate,
    p.Amount,
    -- Running total per customer, ordered by date
    SUM(p.Amount) OVER (
        PARTITION BY p.CustomerID
        ORDER BY p.PurchaseDate
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS RunningTotalAmount
FROM
    Purchases p
ORDER BY
    p.CustomerID, p.PurchaseDate;

Lines 7-11: SUM() as a window function with explicit frame. PARTITION BY CustomerID restarts the running total for each customer. ORDER BY PurchaseDate defines the accumulation order.

Performance: Dominated by sorting on (CustomerID, PurchaseDate). O(N log N) time. An index on (CustomerID, PurchaseDate) eliminates the sort, reducing to O(N).

Intermediate

Top 3 Products by Sales per Category

WITH ProductSales AS (
    SELECT
        p.ProductID, p.ProductName, p.CategoryID,
        SUM(oi.Quantity) AS TotalQuantitySold
    FROM Products p
    JOIN OrderItems oi ON p.ProductID = oi.ProductID
    GROUP BY p.ProductID, p.ProductName, p.CategoryID
),
RankedProducts AS (
    SELECT *,
        DENSE_RANK() OVER (
            PARTITION BY CategoryID
            ORDER BY TotalQuantitySold DESC
        ) AS SalesRank
    FROM ProductSales
)
SELECT CategoryID, ProductName, TotalQuantitySold, SalesRank
FROM RankedProducts
WHERE SalesRank <= 3
ORDER BY CategoryID, SalesRank;

CTE 1 (ProductSales): Aggregates total quantity sold per product via JOIN + GROUP BY.

CTE 2 (RankedProducts): Uses DENSE_RANK() to rank products within each category. Ties receive the same rank without gaps.

Final SELECT: Filters to top 3 ranks. If two products tie for rank 1, both are included and the next rank is 2 (not 3).

Key insight: DENSE_RANK() is preferred over ROW_NUMBER() for top-N queries when ties should be included.

Advanced

Reading an EXPLAIN Plan & Adding a Covering Index

-- Step 1: Check the execution plan
EXPLAIN ANALYZE
SELECT order_date, total_amount
FROM Orders
WHERE customer_id = 42;

-- Output might show:
-- Seq Scan on orders (cost=0.00..25000.00 rows=500000)
--   Filter: (customer_id = 42)
--   Rows Removed by Filter: 499950
--   Actual rows: 50, loops: 1, time: 1200ms

-- Problem: Full table scan reading 500K rows to find 50!

-- Step 2: Create a covering index
CREATE INDEX idx_orders_cust_cover
ON Orders (customer_id)
INCLUDE (order_date, total_amount);

-- Step 3: Re-check the plan
EXPLAIN ANALYZE
SELECT order_date, total_amount
FROM Orders
WHERE customer_id = 42;

-- New output:
-- Index Only Scan using idx_orders_cust_cover (cost=0.42..8.50)
--   Index Cond: (customer_id = 42)
--   Actual rows: 50, loops: 1, time: 0.3ms

Before: Sequential scan reads all 500K rows, filtering 499,950 (1200ms).

After: Index-only scan reads exactly 50 matching rows directly from the index (0.3ms) -- a 4000x improvement.

Key insight: EXPLAIN ANALYZE reveals the actual execution strategy. Covering indexes eliminate table access entirely.

9Memory Aids

Query Pipeline

“Parse → Plan → Optimize → Execute”

SQL string → AST → Logical plan (relational algebra) → Cost-based optimization → Physical plan → Results.

Index Selection

“B+Tree for ranges, Hash for equals, Bitmap for booleans.”

Join Algorithm Choice

“Nested Loop for small + indexed, Hash for big + unsorted, Merge for pre-sorted.”

Left-Prefix Rule

“A composite index on (A, B, C) is a phone book sorted by last name, then first name, then middle name.”

You can look up by last name alone, or last + first, but not by first name alone.

Window Function Ranking

“ROW_NUMBER = unique jersey numbers. RANK = Olympic medals (ties skip). DENSE_RANK = podium positions (no skipping).”

Optimization Mantra

“Filter early, join small, cover your indexes, and always check EXPLAIN.”

10Common Mistakes

Wrapping Index Columns in Functions

Using WHERE UPPER(name) = 'SMITH' or WHERE YEAR(order_date) = 2025

Applying a function to an indexed column prevents the optimizer from using the index. Instead, use WHERE name = 'Smith' (with case-insensitive collation) or WHERE order_date >= '2025-01-01' AND order_date < '2026-01-01'. Alternatively, create a functional index on UPPER(name).

Ignoring the Left-Prefix Rule

Creating index on (A, B, C) but querying WHERE B = ... or WHERE C = ...

A composite index on (A, B, C) can only be used when the query filters on the leading columns in order: A, (A, B), or (A, B, C). Queries filtering only on B or C cannot use this index efficiently.

Using SELECT * in Production Queries

Retrieving all columns when only a few are needed

SELECT * prevents index-only scans and increases I/O by reading unnecessary columns from the base table. Always specify the exact columns needed.

Stale Statistics

Not running ANALYZE/UPDATE STATISTICS after large data changes

Cost-based optimizers depend on accurate statistics. After bulk inserts, deletes, or schema changes, stale statistics lead to wildly inaccurate cardinality estimates and suboptimal plans. Run ANALYZE regularly.

Confusing Window Function Defaults

Expecting SUM() OVER (ORDER BY x) to compute the total, not a running total

With ORDER BY present, the default frame is cumulative (UNBOUNDED PRECEDING to CURRENT ROW), not the entire partition. For the full partition total, omit ORDER BY or use ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING.

Over-Indexing

Creating an index for every column or every query

Each index adds overhead to every INSERT, UPDATE, and DELETE operation. Indexes consume storage and require maintenance. Focus on indexes that serve the most critical and frequent queries, and consolidate using composite or covering indexes.

Using UNION When UNION ALL Suffices

Paying the cost of duplicate elimination unnecessarily

UNION (which means UNION DISTINCT) requires sorting or hashing the entire combined result to remove duplicates. If you know duplicates are impossible or acceptable, always use UNION ALL.

Frequently Asked Questions

When should I use a Hash Index vs. a B+Tree Index?: Use a Hash Index when your queries consist exclusively of equality lookups (WHERE key = value) and you do not need range scans, ORDER BY, or partial key matching. Hash indexes provide O(1) average lookup time for exact matches. Use a B+Tree Index (the default in most DBMS) for virtually all other cases, including range queries, ORDER BY, LIKE prefix matching, and composite key lookups. B+Tree indexes offer O(log N) lookups and efficient range scans via linked leaf nodes.
What is the difference between RANK() and DENSE_RANK()?: Both assign ranks to rows within a partition based on the ORDER BY clause. RANK() leaves gaps after ties: if two rows tie for rank 1, the next row receives rank 3. DENSE_RANK() does not leave gaps: the next row after two rank-1 ties receives rank 2. Use DENSE_RANK() when you need consecutive ranking (e.g., "top 3 products") and RANK() when the gap conveys how many items share a higher rank.
How do I read an EXPLAIN/ANALYZE output?: EXPLAIN shows the planned execution strategy without running the query. EXPLAIN ANALYZE actually executes it and shows real timings. Key things to look for: (1) Scan type -- index scan vs. sequential/full table scan; (2) Join algorithm -- nested loop, hash join, or merge join; (3) Estimated vs. actual row counts (large discrepancies indicate stale statistics); (4) Cost estimates for each node; (5) Sort or hash operations that may spill to disk. Start reading from the innermost (most indented) operation outward.
When should I use a CTE vs. a subquery?: Use a CTE when you need to reference the same derived result set multiple times in one query, when the query logic benefits from being broken into named steps for readability, or when you need recursion (hierarchical traversal). Use a subquery for simple, one-off derived tables. Performance-wise, some DBMS inline CTEs like subqueries; others materialize them. Check your DBMS documentation and test with EXPLAIN.
What is a covering index and when should I create one?: A covering index includes all columns referenced in a query (SELECT, WHERE, ORDER BY, GROUP BY), allowing the DBMS to satisfy the query entirely from the index without accessing the base table. Create one when you have a critical, frequently executed query that reads a small subset of columns from a large table. The trade-off is increased index size and slower writes (INSERT/UPDATE/DELETE), so only create covering indexes for high-impact queries.
Why do my queries slow down even though I have indexes?: Common reasons: (1) The optimizer chooses a full table scan because the query returns a large percentage of rows, making the index less efficient than sequential reading. (2) Index columns are wrapped in functions (e.g., WHERE UPPER(name) = ...), preventing index usage. (3) Statistics are outdated -- run ANALYZE/UPDATE STATISTICS. (4) The query uses OR conditions that prevent index intersection. (5) A composite index exists but the query does not filter on the leading column (left-prefix rule violation). (6) Implicit type conversions cause index bypass.

Practice Quiz

Test your understanding of advanced SQL and query optimization -- select the correct answer for each question.

1.What is the primary role of a query optimizer in a DBMS?

2.What is the time complexity of a lookup operation in a B+Tree index?

3.Which index type is most suitable for columns with low cardinality (few distinct values) and queries combining multiple AND/OR conditions?

4.What does the "left-prefix" rule for composite indexes mean?

5.What is the key difference between ROW_NUMBER() and DENSE_RANK() window functions?

6.What is predicate pushdown and why is it beneficial?

7.What is a covering index (index-only scan)?

8.What distinguishes a recursive CTE from a non-recursive CTE?

9.Why is cost-based optimization generally preferred over rule-based (heuristic) optimization?

10.When reading an EXPLAIN plan, which of the following indicates that a query is likely performing poorly?

Study Tips

Practice with EXPLAIN ANALYZE: Run queries against a real database and read the execution plan. Identify sequential scans, index scans, join types, and sort operations. Compare plans before and after adding indexes.
Draw B+Tree diagrams: Practice inserting and deleting keys from a B+Tree by hand. Understanding the split and merge mechanics helps you reason about index performance.
Write window functions from scratch: For each ranking function (ROW_NUMBER, RANK, DENSE_RANK), write the same query and compare the output side by side to internalize the differences.
Compare join algorithms: Set up test tables of different sizes and use optimizer hints to force different join algorithms. Measure timing to see when each excels.

1Introduction

Why It Matters

2Key Definitions

3Query Processing Architecture

Query Processing Pipeline

4Index Structures

B+Tree Indexes

Building a B+Tree Index

Hash Indexes

Hash Index: Insert & Lookup

Bitmap Indexes

Bitmap Index: Build & Query

Clustered vs. Non-Clustered & Composite Indexes

5Join Algorithms

Join Ordering: Why It Matters

6Window Functions & CTEs

Window Functions

Window Functions Step by Step

Common Table Expressions (CTEs)

7Query Optimization Strategies

8SQL Walkthroughs

9Memory Aids

10Common Mistakes

Frequently Asked Questions

Practice Quiz

Study Tips

Related Topics