Storage, Indexing & File Structures
Storage, Indexing, and File Structures form the foundational layer of every database management system. This domain governs how data is physically laid out on disk, how it is organised into logical files, and how auxiliary data structures accelerate retrieval.
This guide covers physical storage and buffer management, file organisations (heap, sequential, hash), B+Tree and hash-based indexes, bitmap indexes, advanced techniques (composite, covering, partial indexes), SQL walkthroughs with EXPLAIN ANALYZE, and a 10-question practice quiz.
1Introduction
In the architecture of a DBMS, Storage, Indexing, and File Structures sit at the physical and logical data organisation levels. This layer bridges the logical schema (relations, attributes) and the physical hardware (HDDs, SSDs). Understanding it is a prerequisite for query processing, optimisation, and scalable system design.
The efficiency of a database is critically dependent on these choices. Poorly chosen file organisations or missing indexes can cause catastrophic performance bottlenecks under heavy load. The central trade-off is always between storage space, access time, and update cost.
Modern databases like PostgreSQL, MySQL, and Oracle rely heavily on optimised B+Tree indexes for primary key constraints and frequently queried columns. Choosing the right index type -- B+Tree for range queries, hash for equality -- and verifying its use via EXPLAIN ANALYZE is a daily task for database administrators and performance engineers. A missing index on a large table can turn a 10ms query into a 30-second full table scan.
Storage
Managing physical devices (HDDs, SSDs), allocating disk blocks, and caching pages in main memory via the buffer pool.
Indexing
Auxiliary data structures (B+Trees, hash indexes) that enable rapid record retrieval without scanning the entire data file.
File Structures
How records within a relation are physically organised -- heap, sequential, or hash -- affecting insert, delete, and scan performance.
2Key Definitions
Essential terms for understanding storage, indexing, and file structures at the university level.
Disk Page (Block)
The smallest unit of data transfer between disk and memory. Typically 4 KB, 8 KB, or 16 KB. All I/O is measured in page reads/writes.
Record (Tuple)
A single row in a relation, composed of fields (attributes). Records are stored in pages on disk.
Data File
A collection of records of the same type (same relation) stored on disk. e.g., employees.dat.
Index File
A separate file of (key, pointer) pairs enabling fast lookup into a data file based on a specific attribute.
Heap File
Records stored in no particular order. New records are appended or placed in any free slot. Fast insert, slow search.
Sequential File
Records stored in sorted order by a designated search key. Efficient for range queries, costly to maintain on updates.
Dense Index
An index entry exists for every search key value in the data file. Required for secondary indexes.
Sparse Index
Index entries exist only for some search key values -- typically one per disk page. Only usable with clustered (sorted) files.
Primary Index
An index whose search key specifies the physical sequential order of the file. Usually on the primary key.
Secondary Index
An index on an attribute different from the file's physical ordering key. Must be dense. Requires RID lookups.
Fan-out (f)
The maximum number of child pointers an internal B+Tree node can hold. Higher fan-out = shorter tree = fewer I/Os.
I/O Cost
The number of disk page reads/writes to perform an operation. The dominant performance metric in disk-based databases.
Record ID (RID)
A unique identifier for a record: (PageID, SlotNumber). Secondary indexes store RIDs to locate data records.
Buffer Pool
A designated area of main memory that caches disk pages. Managed by the buffer manager to minimise disk I/O.
3Physical Storage & Disk Organisation
Storage Hierarchy
Faster and smaller at the top; slower and larger at the bottom
Database buffer pools sit in RAM, caching pages from SSD/HDD to minimise costly disk I/O
Disk Access Time Components
Traditional Hard Disk Drives (HDDs) are mechanical devices. Their access time has three components:
Seek Time
Time to move the arm to the correct track. Usually the largest cost component: 5--20 ms for HDDs. Near-zero for SSDs.
Rotational Latency
Time waiting for the sector to rotate under the head. Average = half a full rotation (~4 ms at 7200 RPM).
Transfer Time
Time to actually read/write the data bytes once positioned. The fastest component; sequential I/O amortises this.
Total Access Time = Seek Time + Rotational Latency + Transfer Time Random I/O = 1 x (SeekTime + RotLatency + TransferTime) -- expensive Sequential = 1 x SeekTime + 1 x RotLatency + N x TransferTime -- efficient
Random I/O is far more expensive than sequential I/O. This is why minimising random disk accesses -- through good file organisations, indexes, and buffer management -- is the central goal of database storage design. SSDs dramatically reduce seek time but sequential reads are still faster than random reads due to internal parallelism and prefetching.
RAID Levels
RAID 0 (Striping)
Data striped across disks for parallel I/O. Best performance, but zero redundancy -- any disk failure loses all data.
RAID 1 (Mirroring)
Data duplicated across two disks. Excellent redundancy, fast reads. 50% storage overhead.
RAID 5 (Stripe + Parity)
Data striped with distributed parity. Good balance of performance, redundancy, and cost. Tolerates one disk failure.
RAID 10 (Mirror + Stripe)
Combines RAID 1 and RAID 0. Highest performance and redundancy. Most costly -- 50% storage overhead.
Buffer Pool Management
Since disk I/O is expensive, the DBMS maintains a buffer pool -- a designated area of main memory that caches frequently accessed disk pages. The buffer manager controls fetching, pinning, and eviction of pages.
Buffer Pool Management
Buffer frames cache disk pages in RAM. The LRU policy evicts the least-recently-used clean frame first.
Buffer Frames (Main Memory)
Frame 1
P-12
Frame 2
P-07
Frame 3
P-33
Frame 4
P-01
Frame 5
(free)
Frame 6
P-44
LRU Queue
MRU at top
LRU Policy
Evicts the page that has not been accessed for the longest time. Good general-purpose policy.
CLOCK Policy
Efficient approximation of LRU using a reference bit. Sweeps frames in a circular fashion.
MRU Policy
Evicts the most recently used page. Optimal for nested-loop join patterns where the same pages cycle repeatedly.
4File Organizations
A file organisation dictates how records are physically stored and managed on disk. The choice significantly impacts the efficiency of insert, delete, search, and range scan operations.
Heap File vs Sequential File
Heap: records in arrival order (fast insert, slow search) | Sequential: records sorted by key (efficient range queries)
Heap File
(unordered)
Page 1
Page 2
Page 3
Insert: O(1) | Search: O(N pages)
Sequential File
(sorted by key)
Page 1
Page 2
Page 3
Range: O(log N + K) | Insert: O(N)
Heap File
- Records in arrival order
- Insert: O(1) amortised
- Search: O(N pages) full scan
- Best for: small tables, index-driven access
Sequential File
- Sorted by ordering key
- Range query: O(log N + K)
- Insert: O(N) worst case
- Best for: static tables with range queries
Hash File
- Records in hash buckets
- Equality search: O(1) avg
- Range queries: not supported
- Best for: equality-only lookups
Clustered vs Unclustered
| Property | Clustered | Unclustered |
|---|---|---|
| Physical order | Matches index key order | Independent of index order |
| Indexes per table | Only one | Multiple allowed |
| Range queries | Excellent (sequential reads) | Poor (random I/O per record) |
| Key updates | Expensive (may move records) | Less expensive |
5B+Tree Indexes
The B+Tree is the de facto standard for database indexing. It is a self-balancing tree structure where all data pointers reside in leaf nodes (which are linked), and internal nodes serve purely as a navigation directory.
B+Tree Structure
Internal nodes route searches; leaf nodes hold data and are linked for range scans
Dashed green arrows show the linked list connecting all leaf nodes for sequential range traversal
Key Properties
All leaves at the same depth
Guarantees O(log_f N) I/O cost for any search. The tree is always perfectly height-balanced.
Linked leaf nodes
Leaves form a doubly-linked list, enabling efficient sequential range traversal once the start is found.
High fan-out
Internal nodes hold only keys (no data), maximising the number of child pointers per page and minimising tree height.
Self-balancing
Splits propagate up on insert; merges/redistributions handle deletes. Height grows only at the root.
Operations & I/O Cost
| Operation | I/O Cost | Notes |
|---|---|---|
| Exact search | h I/Os | h = tree height approx log_f(N). One more for unclustered data fetch. |
| Range query | h + K I/Os | h to find start, K leaf pages to scan the range. |
| Insert (no split) | h + 1 I/Os | Traverse to leaf, write modified leaf. |
| Insert (with split) | h + splits | Splits propagate up; root split increases height by 1. |
With a page size of 8 KB and a key+pointer of 8 bytes, fan-out f ≈ 1000. A B+Tree with 1 billion records has height h = log₁₀₀₀(1,000,000,000) = 3. That means just 3 disk I/Os to find any record in a billion-row table.
6Hash-Based Indexes
Hash-based indexes provide very fast O(1) average-case equality lookups. A hash function maps a search key to a bucket, which holds the matching records. They are completely unsuitable for range queries.
Static Hashing
Fixed number of buckets. Overflow chains degrade performance as the table grows. Suitable only for static datasets.
Linear Hashing
Buckets grow gradually using a family of hash functions and a split pointer. No directory needed. Avoids global restructuring.
Extendible Hashing
Uses a directory of bucket pointers. Directory doubles when needed. No overflow chains. O(1) average lookup.
Extendible Hashing
Directory doubles when local depth equals global depth. No overflow chains needed.
Directory
Buckets (capacity=2)
B+Tree vs Hash Index Comparison
| Feature | B+Tree | Hash Index |
|---|---|---|
| Equality search | O(log_f N) | O(1) average |
| Range search | O(log_f N + K) | Not supported |
| Ordering | Sorted leaf nodes | None |
| ORDER BY support | Yes | No |
| Primary use case | General purpose, ranges | Equality lookups only |
7Bitmap Indexes
A bitmap index creates a separate bit vector for each distinct value of a low-cardinality attribute. Each bit corresponds to one record: 1 if the record has that value, 0 otherwise.
-- Example: Gender column with 10 records -- Bitmap for 'M': 1 0 1 0 0 1 0 1 0 1 (records 1,3,6,8,10 are Male) -- Bitmap for 'F': 0 1 0 1 1 0 1 0 1 0 (records 2,4,5,7,9 are Female) -- Query: Gender='M' AND Department='Sales' -- Result = BITWISE_AND(Bitmap_M, Bitmap_Sales) -- This is a single CPU instruction per 64 records!
Ideal for OLAP / Data Warehousing
Complex multi-condition analytics on low-cardinality columns (status, region, gender). Bitwise AND/OR/NOT are extremely fast.
Low Cardinality Only
Works well for columns with few distinct values (e.g., 2--1000). For high-cardinality columns (e.g., salary), the index becomes as large as the data file.
Avoid for OLTP
Each row update requires modifying multiple bitmaps simultaneously. High write contention makes bitmap indexes impractical for transactional workloads.
8Advanced Indexing Techniques
Composite Indexes & Column Ordering
A composite index is built on multiple columns. The column order critically determines which queries can use the index.
-- Index on (department_id, employee_id) WHERE department_id = 10 -- uses index (leading column) WHERE department_id = 10 AND employee_id > 5 -- uses index (both columns) WHERE employee_id = 101 -- cannot use index (skips leading)
Covering Indexes
A covering index includes all columns referenced by a query (WHERE + SELECT), enabling an index-only scan that never touches the data file.
-- Query requires: department_id (filter), employee_id, salary (output) CREATE INDEX idx_covering ON Employees (department_id, employee_id, salary); -- Now this query needs zero data-page I/Os: SELECT employee_id, salary FROM Employees WHERE department_id = 20;
Partial (Filtered) Indexes
A partial index indexes only a subset of rows satisfying a WHERE clause. Smaller size, lower maintenance overhead.
-- Only index active employees -- much smaller than a full index CREATE INDEX idx_active_emp ON Employees (employee_id) WHERE status = 'Active';
Clustered vs Non-Clustered Comparison
| Property | Clustered Index | Non-Clustered Index |
|---|---|---|
| Data order | Leaf level IS the data | Separate structure; stores RIDs |
| Count per table | Maximum 1 | Multiple allowed |
| Range scans | Excellent (sequential) | Poor unless covering |
| Key update cost | High (may relocate data) | Lower |
| Index-only scan | Always (data in leaf) | Only if covering index |
Multi-Dimensional Indexes (R-Trees, Quad-Trees)
Spatial data requires indexes that handle multi-dimensional queries. R-Trees use minimum bounding rectangles (MBRs) to partition space hierarchically -- used by PostGIS for geospatial queries. Quad-Trees subdivide 2D space into quadrants recursively. Both dramatically outperform B+Trees for "find all points within this region" queries.
9SQL Walkthroughs
Index Creation & Equality Query
-- Without index: Sequential Scan (O(N) pages) EXPLAIN ANALYZE SELECT * FROM Employees WHERE department_id = 20; -- Create B+Tree index on department_id CREATE INDEX idx_emp_dept ON Employees (department_id); -- With index: Index Scan (O(log_f N + M) I/Os) EXPLAIN ANALYZE SELECT * FROM Employees WHERE department_id = 20;
Before index: PostgreSQL performs a Sequential Scan -- reads every page in the table. Cost = N_pages I/Os regardless of selectivity.
After index: Optimizer uses an Index Scan on idx_emp_dept. Traverses the B+Tree (3--4 I/Os), then fetches only the matching data pages. Orders of magnitude faster for large tables.
Range Query with B+Tree Index
-- Create index on hire_date CREATE INDEX idx_emp_hire ON Employees (hire_date); -- Range query: B+Tree traverses to start of range, -- then follows linked leaf nodes sequentially EXPLAIN ANALYZE SELECT employee_id, first_name, hire_date FROM Employees WHERE hire_date BETWEEN '2020-01-01' AND '2021-12-31'; -- Without salary index: forces a full Sequential Scan EXPLAIN ANALYZE SELECT employee_id, salary FROM Employees WHERE salary BETWEEN 70000 AND 80000;
hire_date range: B+Tree finds the first matching leaf in O(log_f N) I/Os, then traverses the linked leaf list sequentially -- highly efficient.
salary range (no index): Full table scan required. Every page read to check the condition. Adding a B+Tree on salary would fix this.
Key insight: B+Trees are ideal for ranges precisely because leaf nodes are linked -- the tree gives the start point, then the list gives the rest.
Covering Index & Index-Only Scan
-- Covering index: includes all columns needed by the query CREATE INDEX idx_covering_dept ON Employees (department_id, status, salary); -- Index-Only Scan: zero data-page I/Os EXPLAIN ANALYZE SELECT employee_id, department_id, salary FROM Employees WHERE department_id = 20 AND status = 'Active'; -- Requires Heap Fetch (first_name not in index) EXPLAIN ANALYZE SELECT first_name, department_id, salary FROM Employees WHERE department_id = 20 AND status = 'Active';
Index-Only Scan: All required columns (department_id, status, salary, plus employee_id as implicit row identifier) are in the index. PostgreSQL returns results without touching data pages -- minimum possible I/O.
Heap Fetch required: first_name is not in the covering index, so the DBMS must also fetch the actual data page for each matching row -- adding random I/Os.
Design rule: For frequently executed queries, design covering indexes by including all projected columns. The extra index size is often worth the eliminated data I/Os.
10Common Mistakes
Indexing every column to maximise read performance
Every index must be updated on INSERT, UPDATE, and DELETE -- adding write overhead and disk space. Create indexes judiciously on columns frequently in WHERE, JOIN, ORDER BY, or GROUP BY clauses. Use EXPLAIN ANALYZE to confirm an index is actually used.
Assuming multiple clustered indexes are possible per table
A clustered index physically orders the data on disk -- only one is possible per table. All other indexes are non-clustered and store RID pointers. This distinction fundamentally impacts range query performance and update cost.
Putting a low-selectivity column first in a composite index
Index (A, B, C) supports queries on A, (A,B), or (A,B,C) -- but not B alone or (B,C). Place the most selective column (or the one most often in equality conditions) first. Skipping the leading column forces the optimizer to ignore the index entirely.
Expecting a hash index to speed up BETWEEN or range predicates
Hash indexes have no inherent key ordering. A range query requires scanning all buckets -- equivalent to a full table scan. Always use a B+Tree for range queries, ORDER BY, or any operation requiring sorted access.
Leaving the buffer pool at default size for a production workload
An undersized buffer pool causes excessive cache misses (buffer thrashing), resulting in constant disk I/O. PostgreSQL's shared_buffers should typically be 25% of RAM. Monitor buffer hit ratios and tune the replacement policy for your workload type.
Using heap files for tables with frequent unindexed range queries
Match the file organisation to the dominant access pattern: Heap for index-driven or insert-heavy workloads; Sequential for mostly-static tables with range queries on the ordering key; Hash for equality-only access. A mismatch causes O(N) full scans where O(log N) is possible.
Frequently Asked Questions
- What is the main difference between a B-Tree and a B+Tree, and why is B+Tree preferred in databases?
- In a B-Tree, data pointers can exist in both internal and leaf nodes. In a B+Tree, all data pointers reside only in leaf nodes -- internal nodes serve purely as a directory. B+Trees are preferred for three reasons: (1) leaf nodes are linked, enabling efficient range query traversal without returning to the root; (2) internal nodes hold only keys, giving a higher fan-out and shorter tree height; (3) all searches end uniformly at the leaf level, simplifying the implementation.
- When should I choose a hash index over a B+Tree index?
- Choose a hash index when your primary access pattern involves exact match equality queries (e.g., WHERE employee_id = 123). Hash indexes offer O(1) average-case lookup, which is faster than a B+Tree's O(log_f N). However, hash indexes provide no inherent ordering, making range queries (BETWEEN, >, <) require a full table scan. For any query involving order, ranges, or ORDER BY, a B+Tree is the correct choice.
- What is a "covering index" and how does it improve query performance?
- A covering index is a non-clustered index that includes all columns required by a query -- both in the WHERE clause and the SELECT list. It improves performance by allowing the database to satisfy the query entirely by reading only the index, without accessing the actual data pages. This eliminates potentially many random I/Os, which is especially significant for unclustered indexes where each data fetch involves a separate disk read.
- What are the trade-offs of using a clustered index versus a non-clustered index?
- A clustered index physically orders data records on disk in index key order, making range queries and sequential scans extremely fast since related records are co-located. Only one clustered index is allowed per table. Updates to the clustering key are expensive as they may require physically relocating records. A non-clustered index is a separate structure pointing to data via RIDs -- multiple can exist per table, supporting fast lookups on many attributes. However, each record retrieval may involve a separate random I/O to fetch the actual data page, unless the index is a covering index.
- How does the buffer pool interact with disk I/O, and what is the role of a replacement policy?
- The buffer pool is a cache in main memory holding copies of disk pages. When the DBMS needs a page, it first checks the buffer pool (a buffer hit avoids disk I/O). On a buffer miss, the page is fetched from disk and placed in a free frame. A replacement policy (LRU, CLOCK, MRU) determines which page to evict when the pool is full. The goal is to evict the page least likely to be needed again, maximising the buffer hit rate and minimising slow disk I/Os.
- Why is I/O cost the dominant performance metric in disk-based databases?
- Disk I/O is orders of magnitude slower than CPU operations or RAM access. An HDD seek takes 5-20ms while a CPU instruction takes nanoseconds -- a difference of over a million times. Even SSDs are 10-100x slower than RAM for random access. Because a single query can require thousands of page reads, minimising the number of disk I/O operations (especially random ones) is far more impactful than any CPU optimisation. This is why file organisations, indexes, and buffer management all focus on reducing I/O count.
Practice Quiz
Test your understanding of storage, indexing, and file structures -- select the correct answer for each question.
1.Which of the following disk characteristics is generally the most significant contributor to disk access time in traditional HDDs?
2.What is the primary advantage of a B+Tree over a B-Tree in a database context?
3.Which file organization is best suited for scenarios where records are primarily accessed via exact match equality queries and range queries are rare?
4.A query `SELECT name, email FROM Users WHERE status = 'Active';` could benefit most from which type of index if `status` is a low-cardinality column?
5.What is the purpose of the "dirty bit" in buffer pool management?
6.Which RAID level provides striping for performance but no data redundancy?
7.In the context of a B+Tree, what does "fan-out" (branching factor) refer to?
8.Which of the following statements about a clustered index is TRUE?
9.An index on (department_id, employee_id) would be most efficiently used by which query?
10.Which dynamic hashing technique uses a directory that doubles in size when a bucket split occurs at the global depth?
Study Tips
- Draw the B+Tree manually: Practice inserting keys 1 through 20 into a B+Tree with order d=2. Trace each split and observe how the tree height grows. This builds intuition that no textbook explanation can replace.
- Run EXPLAIN ANALYZE on your own queries: Set up a local PostgreSQL instance, create a table with 100,000 rows, and toggle indexes on/off. Observe the shift from "Seq Scan" to "Index Scan" and "Index Only Scan" in the output.
- Memorise the I/O cost table: Heap scan = N pages. B+Tree search = h I/Os. Hash lookup = 1 I/O average. Range via B+Tree = h + K. These formulas appear in exams and interviews regularly.
- Trace the extendible hashing animation step-by-step: Work through a bucket split on paper -- redraw the directory before and after. The directory doubling is the key insight that separates extendible hashing from linear hashing.