ResourcesDatabaseDistributed Databases
DatabaseCollege

Distributed Databases

A distributed database (DDB) is a collection of logically interrelated data physically distributed across multiple interconnected sites, managed by a Distributed Database Management System (DDBMS). Users interact with the system as if it were a single, centralized database, achieving data transparency.

This guide covers distributed architecture, fragmentation, replication, query processing, concurrency control, Two-Phase Commit, the CAP theorem, Raft & Paxos consensus, and a 10-question practice quiz.

1Introduction

Distributed databases are crucial in modern computing, underpinning large-scale, high-availability systems such as global e-commerce platforms, financial trading systems, social media networks, and IoT data processing pipelines. Their ability to handle massive data volumes across geographical boundaries is indispensable.

Studying DDBs involves fundamental challenges in distributed systems including consistency, availability, fault tolerance, and consensus. Concepts like the CAP theorem, distributed transactions, and consensus algorithms are central to robust distributed systems design.

In Practice

A global e-commerce platform like Amazon utilizes distributed databases extensively. Customer data, product catalogs, and order histories are sharded and replicated across data centers worldwide, ensuring low latency for users in different regions and high availability even if an entire data center fails.

Centralized vs. Distributed

Centralized Database

  • Resides on a single machine or tightly coupled cluster
  • Simpler management and consistency guarantees
  • Atomic operations without distributed coordination
  • Single point of failure, limited scalability

Distributed Database

  • Data spread across multiple interconnected sites
  • Horizontal scalability and geographic distribution
  • High availability through replication
  • Complex consistency, concurrency, and recovery management

2Key Definitions

Essential terminology for understanding distributed databases at the university level.

Distributed Database (DDB)

A collection of logically interrelated data distributed across multiple interconnected sites managed as a single logical system.

DDBMS

Software that manages a distributed database, making distribution transparent to users. Examples: Apache Cassandra, Google Spanner, CockroachDB.

Site Autonomy

The degree to which each local database can operate independently: design autonomy, communication autonomy, and execution autonomy.

Data Fragmentation

Dividing a relation into smaller logical units (fragments). Enables parallel processing and data locality optimization across sites.

Horizontal Fragmentation

Partitioning rows by predicate: R_H = sigma_P(R). Each fragment contains a subset of tuples. Example: ORDERS_Q1, ORDERS_Q2 split by quarter.

Vertical Fragmentation

Partitioning columns, with the primary key duplicated in each fragment to enable reconstruction via natural join.

Data Replication

Storing multiple copies of the same fragment at different sites. Improves availability and read performance but increases write complexity.

Location Transparency

Users need not know the physical location of data. The DDBMS maps logical data names to physical locations via a global directory.

Distributed Transaction

A transaction accessing data at multiple sites. Must adhere to ACID properties globally. Example: cross-bank money transfer involving two sites.

Two-Phase Commit (2PC)

An atomic commitment protocol ensuring all participants in a distributed transaction either commit or abort together. Involves Coordinator and Participants.

CAP Theorem

A distributed store cannot simultaneously guarantee Consistency, Availability, and Partition Tolerance. At most two of the three can be achieved.

Eventual Consistency

If no new updates are made, eventually all accesses return the last updated value. High availability, low latency. Examples: DNS, DynamoDB.

Linearizability

Strongest single-object consistency model: operations appear to execute instantaneously at some point in real time. Provided by Paxos/Raft-based systems.

Consensus Algorithm

A protocol enabling distributed processes to agree on a single value despite failures. Critical for leader election and state machine replication. Examples: Paxos, Raft.

Sharding

A form of horizontal fragmentation where data is partitioned across multiple database instances (shards) using a shard key to distribute load and storage.

Global Schema

A logical schema representing the entire DDB as a single centralized database, providing a unified view to users and applications.

3Architecture

Distributed database architectures vary based on the degree of distribution, autonomy, and heterogeneity of the participating sites.

Distributed Database Architecture

Client ApplicationsGlobal DDBMS LayerQuery OptimizerTransaction ManagerGlobal Directory / CatalogNetwork (LAN / WAN)Site A (New York)Local DBMSLocal DDBMSFragments: CUSTOMER_EASTORDERS_Q1, PRODUCT_CATALOG(r)Site B (London)Local DBMSLocal DDBMSFragments: CUSTOMER_WESTORDERS_Q2, EMPLOYEE(r)Site C (Tokyo)Local DBMSLocal DDBMSFragments: CUSTOMER_APACORDERS_Q3_Q4, EMPLOYEE(r)(r) = replicated fragment

Client-Server Architecture

Clients send requests to server(s), which coordinate with other DDBMS components across sites.

  • Centralized control, easier management
  • Security concentrated at the server
  • Potential bottleneck / single point of failure
  • Most common in traditional enterprise systems

Peer-to-Peer (P2P) Architecture

Each node acts as both client and server. No central authority; peers communicate directly.

  • High availability, no single point of failure
  • Excellent scalability and fault tolerance
  • Complex consistency and query optimization
  • Examples: Cassandra, blockchain ledgers

Data Allocation Strategies

StrategyDescriptionAdvantageDisadvantage
CentralizedAll data at one siteSimple managementSingle point of failure
Full ReplicationAll data at every siteHighest read availabilityVery high write/storage cost
Partial ReplicationSome fragments at multiple sitesBalances availability & costComplex management
Partitioned (Sharded)Each fragment at exactly one siteHigh scalabilityLower availability per fragment

Naming, Directory Services & Federated Databases

In a distributed environment, each data item must have a unique global name (e.g., SiteA.Schema1.TableX). A global directory stores metadata about fragmentation, allocation, and replication, enabling location transparency. It may be centralized (single point of failure risk) or distributed/replicated. Modern systems use coordination services like Apache ZooKeeper or etcd for this purpose.

A Federated Database System (FDBS) integrates multiple autonomous, potentially heterogeneous databases into a single logical system using a wrapper/mediator pattern. Wrappers translate component queries and results to a common model; mediators provide an integrated global schema and query processing. This is ideal for integrating existing legacy systems while preserving local autonomy.

4Fragmentation

Fragmentation divides a global relation into smaller fragments that can be distributed across sites. The goal is to improve query performance by enabling data locality and parallel processing.

Horizontal vs. Vertical Fragmentation

Horizontal Fragmentation(rows split by predicate)EMPLOYEE (original)EmpIDNameSalaryRegion101Alice80KEast102Bob75KEast103Carlos90KWest104Dana70KWestsplit by RegionEMPLOYEE_EASTEmpID Name Salary Region101 Alice 80K East102 Bob 75K EastWHERE Region='East'EMPLOYEE_WESTEmpID Name Salary Region103 Carlos 90K West104 Dana 70K WestWHERE Region='West'Vertical Fragmentation(columns split, PK duplicated)EMPLOYEE (original)EmpIDNameAddressSalaryDept101AliceNY80KEng102BobLA75KMkt103CarlosSF90KEngsplit by columnsEMP_PERSONALEmpID Name Address101 Alice NY102 Bob LAPK + personal colsEMP_FINANCEEmpID Salary Dept101 80K Eng102 75K MktPK + finance colsReconstruct via JOIN on EmpID

Horizontal Fragmentation

Rows split by a predicate. Reconstruct with UNION.

  • Great for regional data partitioning
  • Enables parallel scans across fragments
  • Queries spanning fragments need UNION

Example: CUSTOMER_EAST, CUSTOMER_WEST

Vertical Fragmentation

Columns split, PK duplicated in each fragment. Reconstruct with JOIN.

  • Reduces I/O for column-subset queries
  • Enables sensitive data separation
  • Cross-fragment queries need join

Example: EMP_PERSONAL, EMP_FINANCE

Hybrid Fragmentation

Combines horizontal and vertical. Most flexible approach.

  • Apply horizontal then vertical (or vice versa)
  • Used in complex multi-site deployments
  • Reconstruction is complex

Used in complex, multi-site deployments

Correctness Rules for Fragmentation

Completeness

Every tuple in R must appear in at least one fragment. No data is lost during fragmentation.

Reconstruction

R can be reconstructed from its fragments: UNION (horizontal) or natural JOIN (vertical).

Disjointness

Horizontal fragments are disjoint (no duplicate rows). Vertical fragments share only the PK column(s).

5Replication

Data replication stores multiple copies of the same fragment or relation at different sites. The trade-off is between read performance and availability (improved by replication) versus write cost and consistency complexity (increased by replication).

Synchronous Replication

Primary waits for all replicas to acknowledge before committing.

  • Strong consistency guarantee
  • No data loss on primary failure
  • Higher write latency
  • Availability reduced if a replica is slow/down

Asynchronous Replication

Primary commits immediately; replicas receive updates in the background.

  • Lower write latency
  • Higher write availability
  • Potential data loss on primary failure
  • Eventual consistency model

Replication Strategies

Primary-Replica

One primary handles all writes; replicas serve reads. Most common pattern. Used by MySQL, PostgreSQL.

Multi-Primary

Multiple nodes accept writes. Higher write availability but requires conflict resolution. Used in active-active setups.

Quorum-Based

A write succeeds if W replicas acknowledge. A read succeeds if R replicas respond. W + R > N ensures strong consistency.

Quorum Rule

For a system with N replicas, using write quorum W and read quorum R where W + R > N guarantees that every read set overlaps with every write set, ensuring at least one replica seen during a read has the latest write. Example: N=3, W=2, R=2 (used in Cassandra with QUORUM consistency level).

6Distributed Query Processing

The dominant cost in distributed query execution is network communication. The goal is to minimize total cost: CPU + I/O + communication. The key formula for a single message: Cost = Latency + MessageSize / Bandwidth.

Query Processing Stages

  1. Query Decomposition: Parsing, semantic analysis against global schema, normalization to relational algebra.
  2. Data Localization: Replace global relation names with fragment names; choose optimal replicas.
  3. Global Optimization: Determine join order, data movement strategy, and site assignment for operations.
  4. Local Optimization: Each site's local DBMS optimizes its assigned sub-query.

Distributed Join Strategies

Simple Join (Ship & Join)

Ship the smaller relation to the site holding the larger. Perform local join. Cost dominated by the smaller relation's transfer size.

Semi-Join

Project join attributes from S, ship projection, filter R, ship filtered R, final join at S's site. Significantly reduces data transferred.

Bloom Join

Like semi-join but ships a Bloom filter instead of actual attribute values. Minimal network transfer; may have false positives requiring extra local work.

Data Shipping vs. Query Shipping

Data Shipping

Move data to where the query processor runs. Simple to implement, but expensive when large data volumes must be transferred over the network.

Query Shipping (Function Shipping)

Move the query to where the data resides. Only results are returned. Reduces network traffic; requires capable local DDBMS components at each site.

Optimization Heuristics
  1. Perform local selections and projections first to reduce data size before any transfer.
  2. Use semi-joins or Bloom joins to reduce relation sizes before shipping for joins.
  3. Ship the smaller relation or fragment for simple join operations.
  4. Push filtering predicates as far down the query plan as possible.
  5. Exploit inter-site parallelism where data is independent across fragments.

7Concurrency Control

Ensuring concurrent transactions maintain database consistency across distributed sites is complex due to network delays and partial failures. The key challenge is coordinating lock management or ordering across geographically separated nodes.

Distributed 2PL

Locks acquired at the site where data resides. Global deadlock is possible. Sub-variants: Centralized Lock Manager, Primary Copy 2PL, Decentralized (Voting) 2PL.

Timestamp Ordering

Each transaction gets a unique timestamp. Operations are ordered by timestamp. Deadlock-free but can have high abort rates under contention.

Optimistic CC (OCC)

Transactions proceed without locks. Conflicts validated at commit time. Non-blocking; ideal for low-contention, read-heavy workloads.

Deadlock Prevention: Wait-Die vs. Wound-Wait

SchemeOlder requests lock held by YoungerYounger requests lock held by Older
Wait-Die (non-preemptive)Older waitsYounger dies (aborts & restarts)
Wound-Wait (preemptive)Older wounds younger (forces abort)Younger waits

Distributed Deadlock Detection

Deadlocks are harder to detect in DDBs because the wait-for graph is distributed across multiple sites. A cycle in this global graph signals a deadlock -- but building that global view requires coordination.

Distributed Deadlock Detection

A wait-for cycle spans multiple sites: T1 → T2 → T3 → T1

T1

Site A (New York)

Holds

Lock on R1

Waiting for

R2 (held by T2)

waits for T2
T2

Site B (London)

Holds

Lock on R2

Waiting for

R3 (held by T3)

waits for T3
T3

Site C (Tokyo)

Holds

Lock on R3

Waiting for

R1 (held by T1)

T3 waits for T1 — cycle completes!

Detection: Edge Chasing

  1. T1@SiteA sends probe message along wait-for edge to T2@SiteB
  2. T2@SiteB forwards probe to T3@SiteC
  3. T3@SiteC forwards probe to T1@SiteA
  4. Probe returns to origin → cycle detected!

Resolution: Abort & Restart

Victim selection — abort the transaction with:

  • Lowest priority or least work done
  • Youngest timestamp (wait-die / wound-wait)
  • Fewest locks held
Abort T3 → releases R3 → T2 proceeds → deadlock broken
T1@Site A T2@Site B T3@Site C Wait-for edge

Centralized Detection

A single site collects local wait-for graphs from all sites periodically, builds a global graph, and detects cycles. Simple but a single point of failure with detection latency.

Edge-Chasing (Distributed)

Probe messages travel along wait-for edges across sites. If a probe returns to its origin, a deadlock cycle is detected. Risk of phantom deadlocks when transactions abort during detection.

8Recovery & Two-Phase Commit

Recovery in distributed systems must handle transaction failures, site (system) failures, media failures, and communication failures (network partitions). The key challenge is ensuring atomicity across all sites despite partial failures.

Two-Phase Commit Protocol

Atomic commit across distributed sites

Coordinator

ABORTING

App Server

Participant P1

PREPARED

Site A

Participant P2

ABORTED

Site B

Press Play to see the 2PC protocol in action

Shows the complete 2PC protocol: PREPARE → VOTE → COMMIT/ABORT. Includes both commit and abort paths.
Step 0 / 8

2PC Protocol Details

Phase 1: Prepare (Voting)

  1. Coordinator sends PREPARE to all participants
  2. Each participant writes log records to stable storage
  3. Participants respond VOTE_COMMIT or VOTE_ABORT
  4. Participants in PREPARED state cannot unilaterally abort

Phase 2: Commit (Decision)

  1. If all VOTE_COMMIT: Coordinator writes GLOBAL_COMMIT to its log
  2. If any VOTE_ABORT or timeout: writes GLOBAL_ABORT
  3. Coordinator broadcasts decision to all participants
  4. Participants commit/abort locally, release locks, send ACK
The Blocking Problem

If the coordinator fails after sending PREPARE but before broadcasting the global decision, participants in the PREPARED state block indefinitely, holding locks and resources. This is the fundamental limitation of 2PC. The system cannot make progress until the coordinator recovers or a new one is elected.

Three-Phase Commit (3PC) Overview

3PC introduces a Pre-Commit phase between Prepare and Commit, intended to eliminate the blocking problem. After all participants vote YES, the coordinator sends PRE-COMMIT. Participants acknowledge and move to a "pre-committed" state. Only then does the coordinator send DO-COMMIT.

Improvement over 2PC

If the coordinator fails during Phase 2, a new coordinator can determine the global decision from the states of remaining participants, avoiding blocking in some failure scenarios.

Limitation

3PC requires more messages (3xN vs. 2xN), adding latency. It still blocks under network partitions and does not provide a general solution to the blocking problem in all failure scenarios.

Network Partitions & Split-Brain

A network partition divides the system into isolated subnetworks. The split-brain problem occurs when each partition independently believes it is the only active part and makes conflicting decisions. Solutions include quorum-based protocols (requiring a majority of nodes to proceed), leader election (electing a single leader in the largest partition), and fencing (preventing isolated nodes from performing writes, e.g., STONITH).

9Consensus: Paxos & Raft

Consensus algorithms allow distributed processes to agree on a single value despite node failures and network issues. They are the foundation of strong consistency and leader election in modern distributed systems.

CAP Theorem

CP Systems

Consistency + Partition Tolerance. Sacrifice availability during partitions. Examples: Google Spanner, HBase, systems using 2PC/Paxos.

AP Systems

Availability + Partition Tolerance. Sacrifice consistency (eventual). Examples: Apache Cassandra, DynamoDB, CouchDB.

CA Systems

Consistency + Availability. Only possible without partitions. Effectively centralized or tightly coupled clusters on reliable networks.

Paxos Algorithm

Paxos guarantees safety (never makes incorrect decisions) and liveness (eventually makes a decision if a majority of nodes are available). Three roles: Proposer (proposes values), Acceptor (votes; a quorum must accept), Learner (learns the chosen value).

Phase 1 (Prepare/Promise)

Proposer sends Prepare(N). Acceptors promise not to accept proposals numbered less than N and report any previously accepted proposal.

Phase 2 (Accept/Accepted)

Proposer sends Accept(N, V) to a majority. If N is not less than any promised number, Acceptors accept. A value is chosen when a majority accepts it.

Raft Algorithm

Raft is designed to be more understandable than Paxos while achieving the same fault tolerance. Three roles: Leader (handles all client requests, replicates log entries), Follower (passive, responds to Leader and Candidate), Candidate (a Follower that timed out and is seeking election).

Raft Consensus: Leader Election & Log Replication

AppendEntriesAppendEntriesAppendEntriesAppendEntriesACKACKACKheartbeatS1LeaderS2FollowerS3FollowerS4FollowerS5FollowerTerm 4Leader Log (S1)[1] T1: SET x=1[2] T2: SET y=2[3] T4: SET z=5newStates / RolesLeaderFollowerCandidateAppendEntries (log replication + heartbeat)ACK (majority = committed)3 of 5 ACKs received → entry committed

Leader Election

  1. Follower election timeout expires (no heartbeat from Leader)
  2. Transitions to Candidate, increments term number
  3. Votes for itself, sends RequestVote RPCs to others
  4. If majority responds favorably, becomes Leader
  5. If another leader is found, reverts to Follower

Log Replication

  1. Leader receives client command
  2. Appends entry to its log
  3. Sends AppendEntries RPCs to all Followers
  4. Entry committed when replicated on a majority
  5. Leader applies to state machine, notifies client

Consistency Model Hierarchy

Linearizability

Strongest: real-time atomic ops

Sequential Consistency

Total order, not real-time

Causal Consistency

Causally related ops ordered

Eventual Consistency

Weakest: will converge

Consistency weakens left to right; availability and scalability improve left to right.

Consensus in Practice

Google Chubby

A distributed lock service using Paxos internally. Provides highly available consistent storage for small files and metadata. Used by GFS and Bigtable for master election and coordination.

Apache ZooKeeper

Open-source coordination service using Zab (a Paxos variant). Provides leader election, distributed locks, and configuration management for distributed applications like Kafka and Hadoop.

10Common Mistakes

Confusing 2PC blocking with non-blocking

Assuming 2PC always completes without blocking

2PC is a blocking protocol. If the coordinator fails after participants enter the PREPARED state but before sending the global decision, participants block indefinitely. 3PC attempts to be non-blocking but still has limitations under network partitions.

Misunderstanding CAP theorem implications

Thinking CAP is a choice between C, A, or P individually

CAP states you can guarantee at most two out of three. Since network partitions are inevitable in real distributed systems, Partition Tolerance is almost always required, making the real choice between Consistency and Availability during a partition.

Poor shard key selection

Choosing a shard key with low cardinality or skewed distribution

A poor shard key creates hotspots (disproportionate load on some shards) and data migration difficulties. The shard key should ensure even data distribution and support common query patterns for single-shard lookups.

Ignoring network partition handling

Designing distributed systems that assume a perfect network

Network partitions are inevitable. Systems must explicitly handle them by becoming unavailable (CP) or accepting temporary inconsistency (AP) with conflict resolution strategies. Ignoring them leads to split-brain scenarios and data corruption.

Confusing synchronous and asynchronous replication

Treating async replication as if it guarantees strong consistency

Synchronous replication ensures strong consistency but adds write latency. Asynchronous replication allows lower write latency but can lead to data loss on primary failure and only provides eventual consistency. Applications must be designed for the model they choose.

Misunderstanding eventual consistency guarantees

Assuming eventual consistency means "consistent after a short delay"

Eventual consistency makes no guarantees about when convergence occurs or what order updates will be seen in the interim. Applications must tolerate reading stale data and potentially resolve conflicts. It means "will converge if no new writes occur" -- not "quickly consistent."

Frequently Asked Questions

What is the primary challenge in designing a distributed database compared to a centralized one?
The primary challenge is maintaining data consistency and atomicity of transactions across multiple, potentially failing, and geographically dispersed nodes, especially in the presence of network partitions. This involves complex algorithms for distributed concurrency control, atomic commitment (like 2PC), and recovery, all while minimizing network communication overhead and maximizing availability.
Explain the "blocking problem" of 2PC. Why is it a significant issue?
The blocking problem occurs when the coordinator fails after sending PREPARE messages but before broadcasting the global COMMIT or ABORT decision. Participants that have voted VOTE_COMMIT and entered the PREPARED state are then blocked indefinitely -- they cannot unilaterally commit or abort and must wait for the coordinator to recover. This leads to resource unavailability (locks held indefinitely) and can halt parts of the distributed system.
How does sharding differ from replication, and why are both often used together?
Sharding (horizontal fragmentation) partitions data into distinct subsets across different servers, primarily to achieve scalability by distributing data and query load. Replication creates multiple copies of the same data across different servers, primarily for high availability and read scalability. They are often combined: a dataset is first sharded, and then each shard is replicated across multiple servers to ensure both scalability and fault tolerance.
When would you choose an AP system over a CP system based on the CAP theorem?
You would choose an AP system (Availability + Partition Tolerance) when high availability and continuous operation are paramount, even if it means serving potentially stale data during a network partition. This suits social media feeds, e-commerce catalogs, and IoT ingestion. A CP system (Consistency + Partition Tolerance) is preferred when data correctness is absolute, such as financial transactions, where it is better to be temporarily unavailable than to return incorrect data.
What is the role of a "global directory" or "catalog" in a distributed database?
A global directory acts as the central metadata repository for the entire distributed database. It stores the global schema (logical view of the whole database), fragmentation information (how relations are divided), allocation information (which fragments reside at which sites), and replication details (which fragments are replicated and where). It enables location transparency by mapping logical data names to physical locations, which is crucial for query processing, concurrency control, and recovery.
Briefly compare Paxos and Raft. Which is generally preferred for new implementations?
Both Paxos and Raft are consensus algorithms that provide strong consistency and fault tolerance. Paxos is older, more flexible, but notoriously complex to understand and implement correctly. Raft was designed with understandability as a primary goal, offering a more structured approach with explicit roles (Leader, Follower, Candidate) and clear phases (Leader Election, Log Replication). Due to its relative simplicity and ease of correct implementation, Raft is generally preferred for new distributed systems requiring consensus.

Practice Quiz

Test your understanding of distributed databases -- select the correct answer for each question.

1.Which of the following properties is NOT part of the CAP theorem?

2.In Two-Phase Commit (2PC), after a participant sends VOTE_COMMIT, what state does it enter?

3.Which of the following is the primary benefit of using a semi-join in distributed query processing?

4.A database system that prioritizes high uptime even if it means occasionally returning stale data during a network partition is best described as:

5.Which concurrency control protocol is inherently non-blocking and works best for read-heavy, low-contention workloads?

6.What is the main purpose of data replication in a distributed database?

7.Which of the following best describes horizontal fragmentation?

8.In the Raft consensus algorithm, which role handles all client requests and replicates log entries to other servers?

9.What core problem does Three-Phase Commit (3PC) attempt to address that Two-Phase Commit (2PC) suffers from?

10.What is the main drawback of using a centralized lock manager in a distributed database?

Study Tips

  • Draw the 2PC state machine: Sketch coordinator and participant states (INIT, PREPARED, COMMITTING, COMMITTED, ABORTING) and all message flows. Trace both commit and abort paths including coordinator failure scenarios.
  • Apply the CAP theorem to real systems: Classify well-known systems (Cassandra, Spanner, MySQL, DynamoDB) as CP or AP and justify why. Understand which property is sacrificed during a network partition.
  • Practice fragmentation decomposition: Take a sample table and manually apply horizontal fragmentation by different predicates, then vertical fragmentation. Verify completeness, disjointness, and reconstruction.
  • Trace Raft leader election: Walk through the timeline when a Leader fails -- how a Follower becomes a Candidate, sends RequestVote, collects a majority, and becomes the new Leader. Understand the role of term numbers.
  • Calculate network costs: Given a distributed join scenario, compute the cost of simple join vs. semi-join using Cost = L + M/B and compare them to understand when semi-joins provide the most benefit.

Related Topics