System Design: Key-Value Store

A distributed key-value store is a non-relational database that stores data as key-value pairs across multiple nodes. This document covers the design of a scalable, highly available key-value store suitable for system design interviews.


Table of Contents

  1. Requirements
  2. High-Level Architecture
  3. API Design
  4. Data Partitioning
  5. Replication
  6. Consistency Models
  7. Handling Failures
  8. Storage Engine
  9. System Architecture Diagram
  10. Trade-offs and Design Decisions
  11. Real-World Examples
  12. Interview Tips

Requirements

Functional Requirements

Requirement       Description
put(key, value)   Insert/update a key-value pair
get(key)          Retrieve value by key
delete(key)       Remove a key-value pair

Non-Functional Requirements

Requirement           Target
Scalability           Handle petabytes of data, millions of QPS
High Availability     99.99% uptime (~52.6 min downtime/year)
Low Latency           p99 < 10 ms for reads and writes
Durability            No data loss once acknowledged
Tunable Consistency   Support eventual to strong consistency

Capacity Estimation (Example)

Assumptions:
- 100 million DAU
- Each user: 10 reads, 2 writes per day
- Average key size: 100 bytes
- Average value size: 10 KB

Calculations:
- Read QPS: (100M Γ— 10) / 86400 β‰ˆ 11,574 QPS
- Write QPS: (100M Γ— 2) / 86400 β‰ˆ 2,315 QPS
- Peak QPS (3x): ~35K reads, ~7K writes
- Daily storage: 100M Γ— 2 Γ— 10KB = 2 TB/day
- Yearly storage: ~730 TB (before replication)
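These figures can be reproduced with a quick back-of-the-envelope script (same assumptions as above; decimal units):

```python
# Back-of-the-envelope capacity estimation, using the assumptions above.
DAU = 100_000_000            # daily active users
READS_PER_USER = 10
WRITES_PER_USER = 2
VALUE_SIZE = 10_000          # ~10 KB average value, in bytes (decimal)
SECONDS_PER_DAY = 86_400

read_qps = DAU * READS_PER_USER / SECONDS_PER_DAY     # ~11,574
write_qps = DAU * WRITES_PER_USER / SECONDS_PER_DAY   # ~2,315
peak_read_qps = 3 * read_qps                          # ~35K
daily_storage_tb = DAU * WRITES_PER_USER * VALUE_SIZE / 1e12   # 2.0 TB/day
yearly_storage_tb = daily_storage_tb * 365                     # 730 TB/year
```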

High-Level Architecture

flowchart TB
    subgraph ClientLayer["Client Layer (SDK / HTTP Client / CLI)"]
    end

    ClientLayer --> LB["Load Balancer<br/>(Round Robin / Least Conn)"]

    LB --> C1["Coordinator 1<br/>(Stateless API)"]
    LB --> C2["Coordinator 2<br/>(Stateless API)"]
    LB --> CN["Coordinator N<br/>(Stateless API)"]

    C1 --> Ring["Consistent Hash Ring<br/>(Partition Key β†’ Node Mapping)"]
    C2 --> Ring
    CN --> Ring

    Ring --> NA["Node A<br/>MemTable → SSTable"]
    Ring --> NB["Node B<br/>MemTable → SSTable"]
    Ring --> NC["Node C<br/>MemTable → SSTable"]
    Ring --> ND["Node D<br/>MemTable → SSTable"]
    Ring --> NN["Node N<br/>MemTable → SSTable"]

    NA --> Repl["Replication Layer<br/>(Sync/Async Replication)"]
    NB --> Repl
    NC --> Repl
    ND --> Repl
    NN --> Repl

Components

Component              Responsibility
Client                 SDK that routes requests and handles retries
Coordinator            Stateless node that handles routing and quorum reads/writes
Storage Node           Stores data partitions; handles local reads/writes
Consistent Hash Ring   Maps keys to nodes (see consistent-hashing.md)
Replication Manager    Manages data replication across nodes

API Design

Core Operations

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                         API Layer                               β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                                 β”‚
β”‚  PUT /v1/kv/{key}                                              β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚  β”‚ Request:  { "value": "...", "ttl": 3600 }               β”‚   β”‚
β”‚  β”‚ Response: { "version": 12345, "status": "ok" }          β”‚   β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β”‚                                                                 β”‚
β”‚  GET /v1/kv/{key}                                              β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚  β”‚ Response: { "value": "...", "version": 12345 }          β”‚   β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β”‚                                                                 β”‚
β”‚  DELETE /v1/kv/{key}                                           β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚  β”‚ Response: { "status": "deleted" }                       β”‚   β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β”‚                                                                 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Internal RPC (Node-to-Node)

Operation                         Purpose
put(key, value, context)          Write to local storage with conflict detection
get(key)                          Read from local storage
replicate(key, value, context)    Accept replicated data from a peer
handoff(key_range, target_node)   Transfer a key range to another node
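A minimal single-node sketch of these semantics (class and field names are illustrative, not taken from any real store):

```python
import time

class KVNode:
    """Single-node sketch of the external API; names are illustrative."""

    def __init__(self):
        self._data = {}      # key -> (value, version, expires_at)
        self._version = 0    # monotonically increasing write version

    def put(self, key, value, ttl=None):
        self._version += 1
        expires_at = time.monotonic() + ttl if ttl is not None else None
        self._data[key] = (value, self._version, expires_at)
        return {"version": self._version, "status": "ok"}

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, version, expires_at = entry
        if expires_at is not None and time.monotonic() >= expires_at:
            del self._data[key]          # lazy TTL expiry on read
            return None
        return {"value": value, "version": version}

    def delete(self, key):
        self._data.pop(key, None)
        return {"status": "deleted"}
```

A real coordinator would wrap these calls with routing, quorum logic, and replication, as the following sections describe.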

Data Partitioning

Interview context: After high-level design, the interviewer will ask: β€œHow do you distribute data across multiple nodes?” This is where consistent hashing comes in.

Why Do We Need Partitioning?

A single server can’t store petabytes of data or handle millions of QPS. We need to split data across multiple nodes. But how do we decide which node stores which key?

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                  Partitioning Options                           β”‚
β”‚                                                                 β”‚
β”‚   ❌ Option 1: Simple Modulo Hash                              β”‚
β”‚      node = hash(key) % N                                      β”‚
β”‚      Problem: Adding/removing nodes remaps almost ALL keys!    β”‚
β”‚                                                                 β”‚
β”‚   βœ“ Option 2: Consistent Hashing                               β”‚
β”‚      Only ~1/N keys remapped when nodes change                 β”‚
β”‚                                                                 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

For detailed explanation of consistent hashing, see consistent-hashing.md.
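The remapping problem with modulo hashing is easy to demonstrate empirically (the hash function here is an arbitrary choice):

```python
import hashlib

def node_for(key, n):
    # Stable hash of the key, then simple modulo placement.
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % n

keys = [f"key:{i}" for i in range(10_000)]
moved = sum(node_for(k, 4) != node_for(k, 5) for k in keys)
print(f"{moved / len(keys):.0%} of keys move when growing from 4 to 5 nodes")
# Expect ~80% moved; consistent hashing would move only ~1/5 of keys.
```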

Partitioning Strategy

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                        Hash Ring (2^128 space)                      β”‚
β”‚                                                                     β”‚
β”‚                              0Β°                                     β”‚
β”‚                              β”‚                                      β”‚
β”‚                     A3 ●─────┴─────● B1                            β”‚
β”‚                       /             \                               β”‚
β”‚                      /               \                              β”‚
β”‚               A1 ●──                  ──● B2                       β”‚
β”‚                  β”‚    Virtual Nodes     β”‚                          β”‚
β”‚               C2 ●──                  ──● C1                       β”‚
β”‚                      \               /                              β”‚
β”‚                       \             /                               β”‚
β”‚                     A2 ●───────────● B3                            β”‚
β”‚                             β”‚                                       β”‚
β”‚                            180Β°                                     β”‚
β”‚                                                                     β”‚
β”‚   Physical Nodes:  A (vnodes: A1,A2,A3)                            β”‚
β”‚                    B (vnodes: B1,B2,B3)                            β”‚
β”‚                    C (vnodes: C1,C2,C3)                            β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Key-to-Node Mapping

To find which nodes store a key:

  1. Hash the key: hash("user:123") β†’ position on ring
  2. Walk clockwise: Starting from that position
  3. Collect N distinct physical nodes: Skip virtual nodes pointing to same server
  4. Return node list: First node is coordinator, others are replicas
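The steps above can be sketched with virtual nodes on a sorted ring (MD5 is an arbitrary choice of ring hash; real systems vary):

```python
import bisect
import hashlib

def ring_hash(s):
    # 128-bit ring position, matching the 2^128 space in the diagram.
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes, vnodes=3):
        # Each physical node gets `vnodes` points on the ring.
        self.points = sorted(
            (ring_hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )

    def preference_list(self, key, n=3):
        """Hash the key, walk clockwise, collect n distinct physical nodes."""
        start = bisect.bisect(self.points, (ring_hash(key),))
        found = []
        for i in range(len(self.points)):
            _, node = self.points[(start + i) % len(self.points)]
            if node not in found:
                found.append(node)      # skip vnodes of already-seen servers
            if len(found) == n:
                break
        return found

ring = HashRing(["A", "B", "C", "D"])
print(ring.preference_list("user:123"))  # 3 distinct nodes in ring order
```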

Why Virtual Nodes?

Benefit                  Explanation
Load Balancing           Keys are distributed evenly across nodes
Heterogeneous Hardware   Assign more vnodes to more powerful servers
Smoother Scaling         Adding a node redistributes many small ranges
Faster Recovery          Multiple nodes share the recovery load

Replication

Interview context: After discussing partitioning, the interviewer will likely ask: β€œWhat happens if a node storing data crashes? How do you prevent data loss?”

Why Do We Need Replication?

Partitioning solves scalability - we can store more data by adding nodes. But it introduces a problem:

Without Replication:
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                                                                 β”‚
β”‚   Key "user:123" β†’ hash β†’ Node B                               β”‚
β”‚                                                                 β”‚
β”‚   If Node B crashes β†’ "user:123" is LOST FOREVER!              β”‚
β”‚                                                                 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

With Replication (N=3):
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                                                                 β”‚
β”‚   Key "user:123" β†’ stored on [Node A, Node B, Node C]          β”‚
β”‚                                                                 β”‚
β”‚   If Node B crashes β†’ still have copies on A and C!            β”‚
β”‚                                                                 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Replication gives us:

  • Durability: Data survives node failures
  • Availability: Can read from replicas if primary is down
  • Read scalability: Distribute read load across replicas

Replication Strategy

flowchart TB
    subgraph ReplicationModel["Replication Model (N=3)"]
        Key["Key 'user:123' hashes to position P"]
        Key --> NodeA["Node A (Primary/Coordinator)"]
        NodeA -->|replicate| NodeB["Node B (Replica 1)"]
        NodeB -->|replicate| NodeC["Node C (Replica 2)"]
    end
    Note["Preference List: [A, B, C, D, E] (in ring order)<br/>Active Replicas: First 3 healthy nodes from list"]

Quorum Configuration (NWR)

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                     Quorum Parameters                           β”‚
β”‚                                                                 β”‚
β”‚   N = Total replicas (typically 3)                             β”‚
β”‚   W = Write quorum (nodes that must acknowledge write)         β”‚
β”‚   R = Read quorum (nodes that must respond to read)            β”‚
β”‚                                                                 β”‚
β”‚   Rule: W + R > N guarantees reading latest write              β”‚
β”‚                                                                 β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                                 β”‚
β”‚   Configuration Examples:                                       β”‚
β”‚                                                                 β”‚
β”‚   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚
β”‚   β”‚ Config      β”‚  N  β”‚  W  β”‚  R  β”‚ Use Case              β”‚    β”‚
β”‚   β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€    β”‚
β”‚   β”‚ Strong      β”‚  3  β”‚  3  β”‚  1  β”‚ Write-heavy, strong   β”‚    β”‚
β”‚   β”‚ Consistency β”‚     β”‚     β”‚     β”‚ consistency           β”‚    β”‚
β”‚   β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€    β”‚
β”‚   β”‚ Balanced    β”‚  3  β”‚  2  β”‚  2  β”‚ Default, good balance β”‚    β”‚
β”‚   β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€    β”‚
β”‚   β”‚ Fast Reads  β”‚  3  β”‚  2  β”‚  1  β”‚ Read-heavy workloads  β”‚    β”‚
β”‚   β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€    β”‚
β”‚   β”‚ Eventual    β”‚  3  β”‚  1  β”‚  1  β”‚ Max availability      β”‚    β”‚
β”‚   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚
β”‚                                                                 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Synchronous vs Asynchronous Replication

Synchronous (W=N):

sequenceDiagram
    participant Client
    participant NodeA as Node A
    participant NodeB as Node B
    participant NodeC as Node C

    Client->>NodeA: write
    NodeA->>NodeB: replicate
    NodeA->>NodeC: replicate
    NodeB-->>NodeA: ack
    NodeC-->>NodeA: ack
    NodeA-->>Client: ack (all replicas)

  • Pros: Strong consistency
  • Cons: Higher latency, lower availability

Asynchronous (W=1):

sequenceDiagram
    participant Client
    participant NodeA as Node A
    participant NodeB as Node B
    participant NodeC as Node C

    Client->>NodeA: write
    NodeA-->>Client: ack (immediate)
    NodeA-)NodeB: async replicate
    NodeA-)NodeC: async replicate

  • Pros: Low latency, high availability
  • Cons: Potential data loss, eventual consistency

Consistency Models

Interview context: After discussing replication, a natural follow-up is: β€œWith multiple replicas, how do you keep them in sync? What if two clients write to different replicas at the same time?”

The Consistency Challenge

With replication, we introduced a new problem: keeping replicas consistent.

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                  The Consistency Problem                        β”‚
β”‚                                                                 β”‚
β”‚   Time T1: Client 1 writes x=1 to Node A                       β”‚
β”‚   Time T2: Client 2 writes x=2 to Node B (before A replicates) β”‚
β”‚   Time T3: Client 3 reads from Node C                          β”‚
β”‚                                                                 β”‚
β”‚   What value does Client 3 see?                                 β”‚
β”‚   - Could be x=1 (replicated from A)                           β”‚
β”‚   - Could be x=2 (replicated from B)                           β”‚
β”‚   - Could be nothing (neither replicated yet)                  β”‚
β”‚   - Could be BOTH (conflict!)                                  β”‚
β”‚                                                                 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

This leads to a fundamental trade-off in distributed systems: the CAP Theorem.

CAP Theorem Trade-offs

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                       CAP Theorem                               β”‚
β”‚                                                                 β”‚
β”‚                    Consistency (C)                              β”‚
β”‚                         β–²                                       β”‚
β”‚                        /β”‚\                                      β”‚
β”‚                       / β”‚ \                                     β”‚
β”‚                      /  β”‚  \                                    β”‚
β”‚                     /   β”‚   \                                   β”‚
β”‚                    / CP β”‚ CA \                                  β”‚
β”‚                   /     β”‚     \                                 β”‚
β”‚                  /      β”‚      \                                β”‚
β”‚                 ────────┼────────                               β”‚
β”‚           Availability  β”‚   Partition                           β”‚
β”‚               (A)       β”‚   Tolerance (P)                       β”‚
β”‚                    \ AP β”‚                                       β”‚
β”‚                     \   β”‚                                       β”‚
β”‚                      \  β”‚                                       β”‚
β”‚                       \ β”‚                                       β”‚
β”‚                        \β”‚                                       β”‚
β”‚                                                                 β”‚
β”‚   Key-Value Stores typically choose AP (DynamoDB, Cassandra)   β”‚
β”‚   or CP (HBase, traditional RDBMS)                             β”‚
β”‚                                                                 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Conflict Resolution

When W + R ≤ N, or when two clients write concurrently through different coordinators, replicas can diverge and conflicts arise. Resolution strategies:

1. Last-Write-Wins (LWW)

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                   Last-Write-Wins (LWW)                         β”‚
β”‚                                                                 β”‚
β”‚   Client A writes: key="x", value="A", timestamp=100           β”‚
β”‚   Client B writes: key="x", value="B", timestamp=101           β”‚
β”‚                                                                 β”‚
β”‚   On read: Return value with highest timestamp β†’ "B"           β”‚
β”‚                                                                 β”‚
β”‚   Pros: Simple, deterministic                                   β”‚
β”‚   Cons: Silent data loss                                        β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

2. Vector Clocks

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                      Vector Clocks                              β”‚
β”‚                                                                 β”‚
β”‚   Each write carries a vector: {NodeA: 2, NodeB: 1, NodeC: 3}  β”‚
β”‚                                                                 β”‚
β”‚   Concurrent Write Detection:                                   β”‚
β”‚                                                                 β”‚
β”‚   V1 = {A:1, B:0}   (Client 1 writes via Node A)               β”‚
β”‚   V2 = {A:0, B:1}   (Client 2 writes via Node B)               β”‚
β”‚                                                                 β”‚
β”‚   Neither dominates β†’ CONFLICT DETECTED                         β”‚
β”‚                                                                 β”‚
β”‚   Resolution Options:                                           β”‚
β”‚   1. Return both values to client (application resolves)        β”‚
β”‚   2. Merge values (CRDTs)                                       β”‚
β”‚   3. Use semantic rules (shopping cart = union)                 β”‚
β”‚                                                                 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

3. CRDTs (Conflict-free Replicated Data Types)

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                          CRDTs                                  β”‚
β”‚                                                                 β”‚
β”‚   G-Counter (Grow-only Counter):                               β”‚
β”‚   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚   β”‚ Node A: {A: 5, B: 3, C: 2}  β†’ Total: 10               β”‚   β”‚
β”‚   β”‚ Node B: {A: 4, B: 4, C: 2}  β†’ Total: 10               β”‚   β”‚
β”‚   β”‚                                                        β”‚   β”‚
β”‚   β”‚ Merge: {A: max(5,4), B: max(3,4), C: max(2,2)}        β”‚   β”‚
β”‚   β”‚      = {A: 5, B: 4, C: 2}   β†’ Total: 11               β”‚   β”‚
β”‚   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β”‚                                                                 β”‚
β”‚   Types: G-Counter, PN-Counter, G-Set, OR-Set, LWW-Register   β”‚
β”‚   Used by: Riak, Redis CRDT, Cassandra counters               β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Handling Failures

Interview Tip: When discussing failures, start by identifying what can go wrong, then systematically address each scenario. This shows structured thinking.

The Problem: What Can Go Wrong?

In a distributed system, failures are inevitable. Let’s think about what can fail:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    What Can Fail?                               β”‚
β”‚                                                                 β”‚
β”‚   1. NETWORK ISSUES                                            β”‚
β”‚      - Packet loss, network partition                          β”‚
β”‚      - Node appears dead but is actually alive                 β”‚
β”‚                                                                 β”‚
β”‚   2. NODE FAILURES                                             β”‚
β”‚      - Process crash (temporary - will restart)                β”‚
β”‚      - Server reboot (temporary - minutes)                     β”‚
β”‚      - Hardware failure (permanent - disk dead)                β”‚
β”‚                                                                 β”‚
β”‚   3. DATA CENTER FAILURES                                      β”‚
β”‚      - Power outage, natural disaster                          β”‚
β”‚                                                                 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

This raises several questions we need to answer:

Question                                Why It Matters
How do we know a node failed?           Can't fix what we can't detect
Is it temporary or permanent?           Different strategies are needed
How do writes succeed during failure?   Maintain availability
How do we recover lost replicas?        Maintain durability

Let’s address each question systematically.


7.1 Failure Detection: Gossip Protocol

First question: How do we know a node has failed?

In a distributed system, we can’t rely on a single β€œmaster” to track health - that would be a single point of failure. Instead, we use a decentralized approach where nodes monitor each other.

Why Not Just Use Heartbeats to a Central Server?

❌ Centralized Health Check (Single Point of Failure):

     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
     β”‚  Monitor   β”‚ ← If this dies, no failure detection!
     β””β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜
           β”‚
    β”Œβ”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”
    β–Ό      β–Ό      β–Ό      β–Ό
   [A]    [B]    [C]    [D]


βœ“ Decentralized Gossip (No Single Point of Failure):

   [A] ←──→ [B]
    ↑  β•²  β•±  ↑
    β”‚   β•²β•±   β”‚      Everyone monitors everyone
    β”‚   β•±β•²   β”‚      through random peer exchange
    ↓  β•±  β•²  ↓
   [D] ←──→ [C]

How Gossip Works

Every T seconds (typically 1s), each node:

  1. Picks random peer
  2. Exchanges membership list with heartbeat counters
  3. Merges lists (keep highest heartbeat per node)

sequenceDiagram
    participant NodeA as Node A
    participant NodeB as Node B

    Note over NodeA: {A:10, B:5, C:7, D:3}
    Note over NodeB: {A:9, B:6, C:7, D:4}

    NodeA->>NodeB: gossip exchange
    NodeB-->>NodeA: respond

    Note over NodeA,NodeB: Merged: {A:10, B:6, C:7, D:4}

Node marked DOWN if heartbeat unchanged for T_fail seconds.
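The merge step from the exchange above is just an element-wise max over heartbeat counters:

```python
def gossip_merge(local, remote):
    # Keep the highest heartbeat counter seen for each node.
    merged = dict(local)
    for node, heartbeat in remote.items():
        merged[node] = max(merged.get(node, 0), heartbeat)
    return merged

print(gossip_merge({"A": 10, "B": 5, "C": 7, "D": 3},
                   {"A": 9, "B": 6, "C": 7, "D": 4}))
# {'A': 10, 'B': 6, 'C': 7, 'D': 4}
```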

Failure State Transitions

stateDiagram-v2
    [*] --> HEALTHY
    HEALTHY --> SUSPICIOUS: Missed heartbeats (T > 5s)
    SUSPICIOUS --> DOWN: No recovery (T > 30s)
    DOWN --> HEALTHY: Node recovers
    DOWN --> PERMANENTLY_FAILED: No recovery (T > 10min)
    PERMANENTLY_FAILED --> REMOVED: Admin removes or auto-remove
    REMOVED --> [*]

Timeline:

  • T=0: Node D stops responding
  • T=5s: Marked SUSPICIOUS (missed heartbeats)
  • T=30s: Marked DOWN (hinted handoff activates)
  • T=10min: Marked PERMANENTLY FAILED (re-replication triggers)
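The timeline maps to a simple threshold function (the thresholds are the example values above; real systems make them configurable):

```python
def failure_state(seconds_unresponsive):
    # Thresholds mirror the example timeline; tunable in practice.
    if seconds_unresponsive < 5:
        return "HEALTHY"
    if seconds_unresponsive < 30:
        return "SUSPICIOUS"
    if seconds_unresponsive < 600:
        return "DOWN"
    return "PERMANENTLY_FAILED"

print(failure_state(60))  # DOWN
```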

7.2 Temporary Failures: Hinted Handoff

Second question: How do writes succeed when a node is temporarily down?

Now that we can detect failures, what happens when a write request comes in but one of the target nodes is down?

The Challenge

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                     The Problem                                 β”‚
β”‚                                                                 β”‚
β”‚   Scenario: Client wants to write key "user:123"               β”‚
β”‚   Target nodes: [A, B, C]  (N=3 replicas)                      β”‚
β”‚   But Node B is temporarily DOWN (restarting)                  β”‚
β”‚                                                                 β”‚
β”‚   Options:                                                      β”‚
β”‚   ❌ Option 1: Fail the write                                  β”‚
β”‚      β†’ Bad UX, reduces availability                            β”‚
β”‚                                                                 β”‚
β”‚   ❌ Option 2: Write to only A and C (skip B)                  β”‚
β”‚      β†’ Data lost if B never gets the update                    β”‚
β”‚                                                                 β”‚
β”‚   βœ“ Option 3: Write to A, C, and temporarily store for B      β”‚
β”‚      β†’ This is Hinted Handoff!                                 β”‚
β”‚                                                                 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

The Solution: Hinted Handoff

The key idea is simple: when a target node is down, write to another node with a β€œhint” to forward the data when the target recovers.
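A compact sketch of that idea (the node classes are illustrative stand-ins for RPC targets):

```python
class StorageNode:
    """Illustrative stand-in for a storage node reached over RPC."""
    def __init__(self, name):
        self.name, self.up = name, True
        self.data = {}       # normally owned key-value pairs
        self.hints = []      # (intended_node, key, value) held for handoff

    def put(self, key, value, hint_for=None):
        if not self.up:
            raise ConnectionError(f"{self.name} is down")
        if hint_for:
            self.hints.append((hint_for, key, value))
        else:
            self.data[key] = value

def write_with_handoff(targets, fallback, key, value):
    # Write to every target; a down target's copy goes to `fallback`,
    # tagged with a hint naming the node it was intended for.
    for node in targets:
        try:
            node.put(key, value)
        except ConnectionError:
            fallback.put(key, value, hint_for=node.name)

a, b, c, d = (StorageNode(n) for n in "ABCD")
b.up = False  # Node B is temporarily down
write_with_handoff([a, b, c], fallback=d, key="user:123", value="data")
print(d.hints)  # [('B', 'user:123', 'data')]
```

Once gossip reports B healthy again, D replays its stored hints to B and deletes them, restoring full replication.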

Normal Write Path (All Nodes Healthy)

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Client │────▢│ Node A │────▢│ Node B │────▢│ Node C β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                  β”‚               β”‚               β”‚
                  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                           All 3 receive data
                           W=2 satisfied

Write Path with Node B Down

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚              Hinted Handoff in Action                           β”‚
β”‚                                                                 β”‚
β”‚   Step 1: Coordinator detects Node B is DOWN                   β”‚
β”‚                                                                 β”‚
β”‚   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”     β”‚
β”‚   β”‚ Client │────▢│ Node A │──X──│ Node B β”‚     β”‚ Node C β”‚     β”‚
β”‚   β””β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β”‚ (DOWN) β”‚     β””β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β”‚
β”‚                      β”‚          β””β”€β”€β”€β”€β”€β”€β”€β”€β”˜          β–²          β”‚
β”‚                      β”‚                              β”‚          β”‚
β”‚                      β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜          β”‚
β”‚                            Write to C instead                  β”‚
β”‚                                                                 β”‚
β”‚   Step 2: Write to Node D with hint for B                      β”‚
β”‚                                                                 β”‚
β”‚                      β”‚                                          β”‚
β”‚                      β–Ό                                          β”‚
β”‚                 β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”                                      β”‚
β”‚                 β”‚ Node D β”‚ ← Stores: {                         β”‚
β”‚                 β”‚        β”‚     key: "user:123",                β”‚
β”‚                 β”‚        β”‚     value: "data...",               β”‚
β”‚                 β”‚        β”‚     hint_for: "Node B",             β”‚
β”‚                 β”‚        β”‚     timestamp: 1706123456           β”‚
β”‚                 β”‚        β”‚   }                                  β”‚
β”‚                 β””β”€β”€β”€β”€β”€β”€β”€β”€β”˜                                      β”‚
β”‚                                                                 β”‚
β”‚   Step 3: W=2 satisfied (A + C or A + D), return success       β”‚
β”‚                                                                 β”‚
β”‚   Step 4: When B recovers (detected via gossip)                β”‚
β”‚                                                                 β”‚
β”‚                 β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”                  β”‚
β”‚                 β”‚ Node D │────────▢│ Node B β”‚                  β”‚
β”‚                 β”‚        β”‚ handoff β”‚ (back) β”‚                  β”‚
β”‚                 β””β”€β”€β”€β”€β”€β”€β”€β”€β”˜         β””β”€β”€β”€β”€β”€β”€β”€β”€β”˜                  β”‚
β”‚                      β”‚                  β”‚                       β”‚
β”‚                      β”‚   ACK received   β”‚                       β”‚
β”‚                      │◀─────────────────│                       β”‚
β”‚                      β”‚                                          β”‚
β”‚                      β–Ό                                          β”‚
β”‚                 Delete local hint                               β”‚
β”‚                                                                 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Sloppy Quorum

Hinted handoff enables a sloppy quorum - writes succeed using any W healthy nodes:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    Quorum Comparison                            β”‚
β”‚                                                                 β”‚
β”‚   STRICT QUORUM (Traditional):                                 β”‚
β”‚   ─────────────────────────────                                β”‚
β”‚   Preference list: [A, B, C]                                   β”‚
β”‚   Must write to 2 of [A, B, C] only                           β”‚
β”‚   If B is down β†’ write still OK (A, C available = 2)          β”‚
β”‚   If B and C down β†’ WRITE FAILS (only A available = 1)        β”‚
β”‚                                                                 β”‚
β”‚   SLOPPY QUORUM (With Hinted Handoff):                         β”‚
β”‚   ────────────────────────────────────                         β”‚
β”‚   Preference list: [A, B, C, D, E, ...]                        β”‚
β”‚   Must write to any 2 healthy nodes                            β”‚
β”‚   If B is down β†’ write to A + C (or A + D with hint)          β”‚
β”‚   If B and C down β†’ write to A + D + E (hints for B, C)       β”‚
β”‚                                                                 β”‚
β”‚   Result: Higher write availability                             β”‚
β”‚                                                                 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

How It Works

  1. Write attempt: Target node B is down
  2. Select hint holder: Next healthy node on ring (D)
  3. Store hint: D stores the data + metadata about intended target (B)
  4. Wait for recovery: Gossip detects B is back
  5. Deliver hint: D sends stored data to B
  6. Delete hint: Once delivered, hint is removed
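The flow above can be sketched as coordinator-side logic. This is a minimal, single-process illustration: the ring order, node names, and dict-based storage are stand-ins for real cluster state.

```python
import time

RING = ["A", "B", "C", "D", "E"]  # nodes in ring order (illustrative)
N, W = 3, 2                        # replication factor, write quorum

def sloppy_quorum_write(key, value, healthy, storage, hints):
    """Write with hinted handoff: for each down replica in the
    preference list, store the data on the next healthy node beyond
    the list, tagged with a hint for the intended target."""
    pref = RING[:N]                          # preference list, e.g. [A, B, C]
    backups = iter(n for n in RING[N:] if n in healthy)
    acks = 0
    for target in pref:
        if target in healthy:
            storage.setdefault(target, {})[key] = value
            acks += 1
        else:
            holder = next(backups, None)     # e.g. D stands in for B
            if holder is None:
                continue
            hints.setdefault(holder, []).append({
                "key": key, "value": value,
                "hint_for": target, "timestamp": time.time(),
            })
            acks += 1
    return acks >= W    # success if W replicas (real or hinted) acked

def replay_hints(recovered, storage, hints):
    """When a node recovers, holders forward their hints, then delete them."""
    for pending in hints.values():
        for h in [h for h in pending if h["hint_for"] == recovered]:
            storage.setdefault(recovered, {})[h["key"]] = h["value"]
            pending.remove(h)
```

With B down, a write lands on A and C, plus a hint on D; once B recovers, `replay_hints("B", ...)` delivers the data and clears the hint.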

Trade-offs

| Pros | Cons |
|------|------|
| Higher write availability | Temporary inconsistency |
| Client doesn’t see failures | Storage overhead on hint nodes |
| Automatic recovery | Hints can pile up for long outages |
| No data loss for temp failures | Read-your-writes not guaranteed |

7.3 Permanent Failures: Re-replication

Third question: What if a node never comes back?

Hinted handoff works great for temporary failures. But what if the node’s disk crashed and it’s never coming back? Hints will pile up forever, and more importantly, we’ve lost a replica.

The Challenge

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                     The Problem                                 β”‚
β”‚                                                                 β”‚
β”‚   Scenario: Node B's disk crashes (permanent failure)          β”‚
β”‚                                                                 β”‚
β”‚   Before failure:                                               β”‚
β”‚   Key "x" stored on: [A, B, C]  β†’ 3 copies (safe)              β”‚
β”‚                                                                 β”‚
β”‚   After failure (with only hinted handoff):                    β”‚
β”‚   Key "x" stored on: [A, C] + hint on D β†’ only 2 real copies!  β”‚
β”‚                                                                 β”‚
β”‚   Risk: If A or C also fails β†’ DATA LOSS!                      β”‚
β”‚                                                                 β”‚
β”‚   We need to CREATE A NEW REPLICA to restore N=3               β”‚
β”‚                                                                 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

But How Do We Know It’s Permanent?

Good question! We can’t know for certain, so we use a time threshold:

  • If a node is down for > 10 minutes β†’ assume permanent
  • Trigger re-replication to be safe
  • If the node comes back later, we just have extra copies (harmless)

Why Re-replication is Necessary

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                Why Hinted Handoff Isn't Enough                  β”‚
β”‚                                                                 β”‚
β”‚   Scenario: Node B permanently fails (disk crash)              β”‚
β”‚                                                                 β”‚
β”‚   With only hinted handoff:                                     β”‚
β”‚   - Hints pile up on other nodes                               β”‚
β”‚   - B never comes back to receive them                         β”‚
β”‚   - Data only exists on 2 nodes (A, C) instead of 3           β”‚
β”‚   - If another node fails, DATA LOSS!                          β”‚
β”‚                                                                 β”‚
β”‚   Before B failed:        After B failed (no re-replication):  β”‚
β”‚   Key "x" on [A,B,C]      Key "x" on [A,C] only               β”‚
β”‚   3 copies = safe         2 copies = VULNERABLE               β”‚
β”‚                                                                 β”‚
β”‚   Solution: Detect permanent failure β†’ create new replica      β”‚
β”‚                                                                 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Re-replication Process

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚              Permanent Failure Recovery Flow                    β”‚
β”‚                                                                 β”‚
β”‚   BEFORE (Node B fails permanently):                           β”‚
β”‚                                                                 β”‚
β”‚   Hash Ring:              Data Distribution (N=3):              β”‚
β”‚        ● A                Key "x" β†’ [A, B, C]                  β”‚
β”‚       / β”‚                 Key "y" β†’ [B, C, D]                  β”‚
β”‚      /  β”‚                 Key "z" β†’ [C, D, A]                  β”‚
β”‚     ●   β”‚                                                       β”‚
β”‚     D   β”‚                 Node B owned:                         β”‚
β”‚      \  β”‚                 - Primary: keys in range (A, B]      β”‚
β”‚       \ β”‚                 - Replica: keys from A and D         β”‚
β”‚        ● B ← PERMANENTLY FAILED                                β”‚
β”‚         \                                                       β”‚
β”‚          ● C                                                    β”‚
β”‚                                                                 β”‚
β”‚   AFTER (Re-replication complete):                             β”‚
β”‚                                                                 β”‚
β”‚   Hash Ring:              Data Distribution (N=3):              β”‚
β”‚        ● A                Key "x" β†’ [A, C, D] ← D replaces B   β”‚
β”‚       /β”‚ \                Key "y" β†’ [C, D, E] ← E replaces B   β”‚
β”‚      / β”‚  ● E (new)       Key "z" β†’ [C, D, A]   (unchanged)    β”‚
β”‚     ●  β”‚                                                        β”‚
β”‚     D  β”‚                  Actions taken:                        β”‚
β”‚      \ β”‚                  1. Remove B from ring                 β”‚
β”‚       \β”‚                  2. Identify under-replicated keys     β”‚
β”‚        ● C                3. Stream data to new replicas        β”‚
β”‚                                                                 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Step-by-Step Recovery

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚              Permanent Failure Recovery Steps                  β”‚
β”‚                                                                β”‚
β”‚  1. DETECTION (via Gossip)                                     β”‚
β”‚     └─▢ Node B heartbeat unchanged for >10 minutes            β”‚
β”‚     └─▢ Marked as PERMANENTLY_FAILED                          β”‚
β”‚                                                                β”‚
β”‚  2. RING UPDATE                                                β”‚
β”‚     └─▢ Remove B's virtual nodes from hash ring               β”‚
β”‚     └─▢ B's key ranges now map to next nodes clockwise        β”‚
β”‚     └─▢ Broadcast new ring to all nodes                       β”‚
β”‚                                                                β”‚
β”‚  3. IDENTIFY UNDER-REPLICATED DATA                             β”‚
β”‚     └─▢ For each key range B owned:                           β”‚
β”‚         - Find keys that now have only 2 replicas             β”‚
β”‚         - Add to re-replication queue                         β”‚
β”‚                                                                β”‚
β”‚  4. SELECT NEW REPLICA NODES                                   β”‚
β”‚     └─▢ For each under-replicated key range:                  β”‚
β”‚         - Walk clockwise on ring from key position            β”‚
β”‚         - Pick next node NOT already holding that data        β”‚
β”‚                                                                β”‚
β”‚  5. STREAM DATA (Anti-Entropy)                                 β”‚
β”‚     └─▢ Existing replicas send data to new nodes              β”‚
β”‚     └─▢ Use Merkle trees for efficient diff                   β”‚
β”‚     └─▢ Throttle to prevent network saturation                β”‚
β”‚                                                                β”‚
β”‚  6. VERIFY & COMPLETE                                          β”‚
β”‚     └─▢ New replicas acknowledge receipt                      β”‚
β”‚     └─▢ Replication factor N=3 restored                       β”‚
β”‚     └─▢ Mark recovery complete                                β”‚
β”‚                                                                β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
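Step 4 (selecting a new replica) can be sketched as a clockwise walk on the ring. The MD5-based `ring_position` helper and node names are illustrative assumptions, not a specific system's API.

```python
import hashlib

def ring_position(name):
    """Place a node or key on a 0..2^32 ring by hashing its name."""
    return int(hashlib.md5(name.encode()).hexdigest(), 16) % (2**32)

def select_new_replica(key, ring, current_replicas):
    """Walk clockwise from the key's position and pick the first
    node not already holding this key's data."""
    nodes = sorted(ring, key=ring_position)
    start = ring_position(key)
    # Rotate so the walk begins at the first node at/after the key.
    ordered = [n for n in nodes if ring_position(n) >= start] + \
              [n for n in nodes if ring_position(n) < start]
    for node in ordered:
        if node not in current_replicas:
            return node
    return None  # no candidate: cluster has fewer than N usable nodes
```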

But Wait - How Do We Efficiently Sync Data?

Follow-up question an interviewer might ask: β€œRe-replicating data sounds expensive. Do you copy ALL the data?”

Great question! If Node B had 100GB of data, we don’t want to blindly copy everything. We need to find only the missing keys. But checking every key is O(N) - too slow!

Solution: Merkle Trees - a data structure that lets us find differences in O(log N).

Anti-Entropy with Merkle Trees

Merkle Trees let us efficiently find which keys differ between two nodes:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                  Merkle Trees for Sync                          β”‚
β”‚                                                                 β”‚
β”‚   Problem: Node E needs B's data. Which keys exactly?           β”‚
β”‚            Scanning all keys is O(N) - too slow!               β”‚
β”‚                                                                 β”‚
β”‚   Solution: Compare hash trees to find differences O(log N)    β”‚
β”‚                                                                 β”‚
β”‚        Node C (has data)              Node E (needs data)       β”‚
β”‚                                                                 β”‚
β”‚            H(root)="abc"                 H(root)="xyz"          β”‚
β”‚           /            \                /            \          β”‚
β”‚       H(L)="def"    H(R)="ghi"     H(L)="def"    H(R)="000"    β”‚
β”‚       /     \       /     \        /     \       /     \       β”‚
β”‚     H1      H2    H3      H4     H1      H2    H3      H4      β”‚
β”‚     βœ“       βœ“     βœ—       βœ“      βœ“       βœ“     βœ—       βœ“       β”‚
β”‚                   β”‚                            β”‚                β”‚
β”‚                   └────── Different! β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                β”‚
β”‚                                                                 β”‚
β”‚   Sync Process:                                                 β”‚
β”‚   1. Compare root hashes: "abc" β‰  "xyz" β†’ different            β”‚
β”‚   2. Compare children: H(L) same, H(R) different               β”‚
β”‚   3. Recurse into H(R): H3 different, H4 same                  β”‚
β”‚   4. Only sync keys in H3's range (25% of data)                β”‚
β”‚                                                                 β”‚
β”‚   Complexity: O(log N) comparisons to find differences          β”‚
β”‚                                                                 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Merkle Tree Structure

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    Merkle Tree Example                          β”‚
β”‚                                                                 β”‚
β”‚                        Root Hash                                β”‚
β”‚                     H("abc"+"def")                              β”‚
β”‚                      β”Œβ”€β”€β”€β”€β”΄β”€β”€β”€β”€β”                                β”‚
β”‚                      β”‚ H(root) β”‚                                β”‚
β”‚                      β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”˜                                β”‚
β”‚               β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                         β”‚
β”‚          β”Œβ”€β”€β”€β”€β”΄β”€β”€β”€β”€β”             β”Œβ”€β”€β”€β”€β”΄β”€β”€β”€β”€β”                   β”‚
β”‚          β”‚ H("abc")β”‚             β”‚ H("def")β”‚                   β”‚
β”‚          β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”˜             β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”˜                   β”‚
β”‚         β”Œβ”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”           β”Œβ”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”                  β”‚
β”‚    β”Œβ”€β”€β”€β”€β”΄β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”΄β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”΄β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”΄β”€β”€β”€β”€β”            β”‚
β”‚    β”‚keys 0-25β”‚ β”‚keys26-50β”‚ β”‚keys51-75β”‚ β”‚keys76-99β”‚            β”‚
β”‚    β”‚ H(data) β”‚ β”‚ H(data) β”‚ β”‚ H(data) β”‚ β”‚ H(data) β”‚            β”‚
β”‚    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜            β”‚
β”‚                                                                 β”‚
β”‚   Each leaf = hash of all keys in that range                   β”‚
β”‚   Each parent = hash of children's hashes                      β”‚
β”‚   Any change in data β†’ changes propagate to root               β”‚
β”‚                                                                 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Recovery Process

  1. Detect failure: Gossip protocol marks node as permanently failed (down >10 min)
  2. Update ring: Remove failed node from membership
  3. Find affected ranges: Identify key ranges with < N replicas
  4. Select new replica: Next healthy node in ring not already holding data
  5. Sync using Merkle tree: Compare tree hashes to find differences
  6. Transfer only diffs: Stream only the differing keys (efficient!)

Interviewer might ask: β€œWhy use Merkle trees?”

Merkle trees allow O(log N) comparison of datasets. Compare root hashesβ€”if equal, data is identical. If different, recurse into children to find exactly which ranges differ. Much more efficient than comparing key-by-key.
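A minimal sketch of the comparison, assuming a power-of-two number of leaf key ranges and SHA-256 hashing (both illustrative choices):

```python
import hashlib

def h(data):
    return hashlib.sha256(data.encode()).hexdigest()

def build_merkle(range_contents):
    """Build a binary Merkle tree over key-range contents (power-of-two
    count). Leaves hash each range's data; parents hash their children."""
    level = [h(r) for r in range_contents]
    tree = [level]
    while len(level) > 1:
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        tree.append(level)
    return tree  # tree[-1][0] is the root hash

def diff_ranges(tree_a, tree_b):
    """Recurse from the roots down; return indices of leaf ranges whose
    hashes differ. Identical subtrees are skipped entirely."""
    def recurse(depth, idx):
        if tree_a[depth][idx] == tree_b[depth][idx]:
            return []            # identical subtree: nothing to sync
        if depth == 0:
            return [idx]         # differing leaf: sync this key range
        return recurse(depth - 1, 2 * idx) + recurse(depth - 1, 2 * idx + 1)
    return recurse(len(tree_a) - 1, 0)
```

If only one of four ranges differs, the walk touches one path down the tree and reports just that range, rather than scanning every key.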


7.4 Read Repair

Fourth question: What about replicas that get out of sync over time?

Even without failures, replicas can become inconsistent:

  • A write succeeded on 2 of 3 nodes (W=2), third node was slow
  • Network issues caused a partial update
  • A node was down briefly and missed some writes

Rather than running expensive background sync constantly, we can fix inconsistencies opportunistically during reads. This is called Read Repair.

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                      Read Repair                                β”‚
β”‚                                                                 β”‚
β”‚   During read with R=2, N=3:                                    β”‚
β”‚                                                                 β”‚
β”‚   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”     GET key="x"                                   β”‚
β”‚   β”‚ Client │──────────────────┐                                β”‚
β”‚   β””β”€β”€β”€β”€β”€β”€β”€β”€β”˜                  β”‚                                β”‚
β”‚       β–²                       β–Ό                                β”‚
β”‚       β”‚               β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                         β”‚
β”‚       β”‚               β”‚ Coordinator  β”‚                         β”‚
β”‚       β”‚               β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜                         β”‚
β”‚       β”‚                      β”‚                                  β”‚
β”‚       β”‚         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                    β”‚
β”‚       β”‚         β–Ό            β–Ό            β–Ό                    β”‚
β”‚       β”‚    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”               β”‚
β”‚       β”‚    β”‚ Node A β”‚   β”‚ Node B β”‚   β”‚ Node C β”‚               β”‚
β”‚       β”‚    β”‚ v=1    β”‚   β”‚ v=2    β”‚   β”‚ v=2    β”‚               β”‚
β”‚       β”‚    β””β”€β”€β”€β”€β”¬β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”¬β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”€β”€β”€β”€β”˜               β”‚
β”‚       β”‚         β”‚            β”‚                                  β”‚
β”‚       β”‚         β”‚ return     β”‚ return                          β”‚
β”‚       β”‚         β”‚ (v=1)      β”‚ (v=2)                           β”‚
β”‚       β”‚         β”‚            β”‚                                  β”‚
β”‚       β”‚         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                                  β”‚
β”‚       β”‚                β”‚                                        β”‚
β”‚       β”‚    Coordinator detects v=1 < v=2                       β”‚
β”‚       β”‚    1. Return v=2 to client                             β”‚
β”‚       β”‚    2. Async send v=2 to Node A (repair)                β”‚
β”‚       β”‚                                                         β”‚
β”‚       └─────── value = 2                                        β”‚
β”‚                                                                 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
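The coordinator logic above can be sketched as follows. The `(version, value)` replica layout is an illustrative simplification, and the repair is done inline here, where a real system would issue it asynchronously after responding.

```python
def read_with_repair(key, replicas, R=2):
    """Query R replicas, return the value with the highest version,
    and push that value back to any stale replica (read repair)."""
    responses = {node: replicas[node][key] for node in list(replicas)[:R]}
    newest_version, newest_value = max(responses.values())
    repaired = []
    for node, (version, _) in responses.items():
        if version < newest_version:
            # Async in practice; synchronous here for clarity.
            replicas[node][key] = (newest_version, newest_value)
            repaired.append(node)
    return newest_value, repaired
```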

7.5 Comparison: Temporary vs Permanent Failures

Summary: How do all these mechanisms work together?

Let’s summarize the complete failure handling strategy:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚              Complete Failure Handling Flow                     β”‚
β”‚                                                                 β”‚
β”‚   Node stops responding                                         β”‚
β”‚           β”‚                                                     β”‚
β”‚           β–Ό                                                     β”‚
β”‚   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                                        β”‚
β”‚   β”‚ Gossip detects    β”‚                                        β”‚
β”‚   β”‚ missed heartbeats β”‚                                        β”‚
β”‚   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                                        β”‚
β”‚             β”‚                                                   β”‚
β”‚             β–Ό                                                   β”‚
β”‚   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚
β”‚   β”‚ Mark as DOWN      │────▢│ Hinted Handoff activates   β”‚    β”‚
β”‚   β”‚ (after 30s)       β”‚     β”‚ Writes go to backup nodes  β”‚    β”‚
β”‚   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚
β”‚             β”‚                                                   β”‚
β”‚             β”‚ Node still down after 10+ minutes?               β”‚
β”‚             β”‚                                                   β”‚
β”‚     β”Œβ”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”                                          β”‚
β”‚     β”‚               β”‚                                          β”‚
β”‚     β–Ό               β–Ό                                          β”‚
β”‚   Node          Node stays                                     β”‚
β”‚   recovers      down                                           β”‚
β”‚     β”‚               β”‚                                          β”‚
β”‚     β–Ό               β–Ό                                          β”‚
β”‚   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”       β”‚
β”‚   β”‚ Hints   β”‚   β”‚ Mark PERMANENTLY FAILED             β”‚       β”‚
β”‚   β”‚ replayedβ”‚   β”‚ Remove from ring                    β”‚       β”‚
β”‚   β”‚ to node β”‚   β”‚ Trigger re-replication              β”‚       β”‚
β”‚   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚ Use Merkle trees to sync efficientlyβ”‚       β”‚
β”‚                 β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜       β”‚
β”‚                                                                 β”‚
β”‚   Meanwhile: Read Repair fixes any stale data opportunisticallyβ”‚
β”‚                                                                 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

| Aspect | Temporary Failure | Permanent Failure |
|--------|-------------------|-------------------|
| Detection time | 30 seconds | 10+ minutes |
| Mechanism | Hinted handoff | Re-replication |
| Data location | Hints on other nodes | New full replicas |
| Ring change | No | Yes, node removed |
| Network cost | Low (hints only) | High (full data streaming) |
| Recovery | Automatic on return | Manual or automatic node addition |
| Data risk | Low (hints preserved) | Medium (fewer replicas) |

7.6 Configuration Parameters

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                Failure Handling Configuration                   β”‚
β”‚                                                                 β”‚
β”‚   Parameter                    β”‚ Typical Value β”‚ Description    β”‚
β”‚   ────────────────────────────┼───────────────┼────────────────│
β”‚   gossip_interval             β”‚ 1 second      β”‚ Heartbeat freq β”‚
β”‚   suspicion_threshold         β”‚ 5 seconds     β”‚ Mark suspiciousβ”‚
β”‚   down_threshold              β”‚ 30 seconds    β”‚ Mark DOWN      β”‚
β”‚   permanent_failure_threshold β”‚ 10 minutes    β”‚ Trigger re-rep β”‚
β”‚   hint_ttl                    β”‚ 3 hours       β”‚ Max hint age   β”‚
β”‚   max_hints_per_node          β”‚ 10 GB         β”‚ Hint storage   β”‚
β”‚   streaming_throughput        β”‚ 50 MB/s       β”‚ Re-rep speed   β”‚
β”‚                                                                 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

7.7 Real-World Implementations

| System | Temporary Failure | Permanent Failure |
|--------|-------------------|-------------------|
| Cassandra | Hinted handoff (3h default) | nodetool removenode + streaming |
| DynamoDB | Sloppy quorum + hints | Managed by AWS automatically |
| Riak | Hinted handoff | Automatic with Merkle trees |
| etcd | Raft leader election | Membership reconfiguration |

Storage Engine

Interview context: The interviewer might ask: β€œHow does each node actually store data on disk? How do you optimize for both reads and writes?”

The Storage Challenge

Key-value stores need to handle:

  • Fast writes: Users expect low latency
  • Efficient reads: Quick lookups by key
  • Durability: Data must survive crashes

Traditional B-trees are optimized for reads, but they update pages in place, so writes incur slow random I/O. For write-heavy workloads, we use a different approach.

LSM Tree (Log-Structured Merge Tree)

Most distributed KV stores use LSM trees because they convert random writes into sequential writes, which are much faster.

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                     LSM Tree Architecture                       β”‚
β”‚                                                                 β”‚
β”‚   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚   β”‚                    Write Path                            β”‚  β”‚
β”‚   β”‚                                                          β”‚  β”‚
β”‚   β”‚   Write ──▢ WAL (Write-Ahead Log) ──▢ MemTable          β”‚  β”‚
β”‚   β”‚              (Durability)             (In-Memory)        β”‚  β”‚
β”‚   β”‚                                           β”‚              β”‚  β”‚
β”‚   β”‚                                    When full             β”‚  β”‚
β”‚   β”‚                                           β”‚              β”‚  β”‚
β”‚   β”‚                                           β–Ό              β”‚  β”‚
β”‚   β”‚                                    Flush to SSTable      β”‚  β”‚
β”‚   β”‚                                      (Level 0)           β”‚  β”‚
β”‚   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β”‚                                                                 β”‚
β”‚   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚   β”‚                    Storage Levels                        β”‚  β”‚
β”‚   β”‚                                                          β”‚  β”‚
β”‚   β”‚   MemTable:     [ k1:v1, k3:v3, k5:v5 ]  (sorted)       β”‚  β”‚
β”‚   β”‚                            β”‚                             β”‚  β”‚
β”‚   β”‚                            β–Ό                             β”‚  β”‚
β”‚   β”‚   Level 0:    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”       β”‚  β”‚
β”‚   β”‚               β”‚SSTable 1β”‚ β”‚SSTable 2β”‚ β”‚SSTable 3β”‚       β”‚  β”‚
β”‚   β”‚               β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜       β”‚  β”‚
β”‚   β”‚                            β”‚ compact                     β”‚  β”‚
β”‚   β”‚                            β–Ό                             β”‚  β”‚
β”‚   β”‚   Level 1:    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”‚  β”‚
β”‚   β”‚               β”‚     SSTable (sorted, no overlap)  β”‚     β”‚  β”‚
β”‚   β”‚               β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β”‚  β”‚
β”‚   β”‚                            β”‚ compact                     β”‚  β”‚
β”‚   β”‚                            β–Ό                             β”‚  β”‚
β”‚   β”‚   Level 2:    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚  β”‚
β”‚   β”‚               β”‚   SSTable (10x larger, no overlap)    β”‚ β”‚  β”‚
β”‚   β”‚               β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚  β”‚
β”‚   β”‚                            β”‚                             β”‚  β”‚
β”‚   β”‚                            β–Ό                             β”‚  β”‚
β”‚   β”‚   Level N:    (exponentially larger)                    β”‚  β”‚
β”‚   β”‚                                                          β”‚  β”‚
β”‚   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
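The write path in the diagram above (append to the WAL, insert into the sorted MemTable, flush to an immutable Level-0 SSTable when full) can be sketched in a few lines of Python; names such as `SimpleLSM` and `memtable_limit` are illustrative, not taken from any real engine:

```python
import bisect

class SimpleLSM:
    """Toy LSM engine: WAL append, in-memory MemTable,
    flush to an immutable sorted SSTable when the MemTable fills."""

    def __init__(self, memtable_limit=4):
        self.wal = []            # stand-in for an fsync'd write-ahead log
        self.memtable = {}       # in-memory writes, sorted at flush time
        self.sstables = []       # newest first; each is a sorted list of (key, value)
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.wal.append((key, value))   # 1. durability first
        self.memtable[key] = value      # 2. then the in-memory write
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # The MemTable becomes one sorted, immutable Level-0 SSTable.
        self.sstables.insert(0, sorted(self.memtable.items()))
        self.memtable = {}
        self.wal = []                   # entries are now durable in the SSTable

    def get(self, key):
        if key in self.memtable:        # newest data wins
            return self.memtable[key]
        for table in self.sstables:     # then newest-to-oldest SSTables
            i = bisect.bisect_left(table, (key,))
            if i < len(table) and table[i][0] == key:
                return table[i][1]
        return None
```

A real engine would add compaction across levels and tombstones for deletes; the sketch only shows why writes are sequential (append plus in-memory insert) while reads may touch several files.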

Read Path with Bloom Filters

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                        Read Path                                β”‚
β”‚                                                                 β”‚
β”‚   GET(key) ─┬──▢ Check MemTable ──────────┬──▢ Found? Return   β”‚
β”‚             β”‚         β”‚                    β”‚                    β”‚
β”‚             β”‚         β”‚ Not found          β”‚                    β”‚
β”‚             β”‚         β–Ό                    β”‚                    β”‚
β”‚             β”‚    Check Bloom Filter        β”‚                    β”‚
β”‚             β”‚    for each SSTable          β”‚                    β”‚
β”‚             β”‚         β”‚                    β”‚                    β”‚
β”‚             β”‚    β”Œβ”€β”€β”€β”€β”΄β”€β”€β”€β”€β”               β”‚                    β”‚
β”‚             β”‚    β”‚         β”‚               β”‚                    β”‚
β”‚             β”‚    β–Ό         β–Ό               β”‚                    β”‚
β”‚             β”‚  "Maybe"   "No"              β”‚                    β”‚
β”‚             β”‚    β”‚         β”‚               β”‚                    β”‚
β”‚             β”‚    β”‚         β”‚ Skip          β”‚                    β”‚
β”‚             β”‚    β–Ό         β”‚               β”‚                    β”‚
β”‚             β”‚  Binary      β”‚               β”‚                    β”‚
β”‚             β”‚  Search      β”‚               β”‚                    β”‚
β”‚             β”‚  SSTable     β”‚               β”‚                    β”‚
β”‚             β”‚    β”‚         β”‚               β”‚                    β”‚
β”‚             β”‚    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                    β”‚
β”‚                                                                 β”‚
β”‚   Bloom Filter: Space-efficient probabilistic data structure    β”‚
β”‚   - "Definitely not in set" or "Maybe in set"                  β”‚
β”‚   - False positive rate ~1% with 10 bits per element           β”‚
β”‚   - Saves disk reads for keys that don't exist                 β”‚
β”‚                                                                 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
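The Bloom filter described above can be sketched directly; the sizes (1024 bits, 7 hash probes) and the double-hashing trick are illustrative choices, not mandated by any particular store:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash probes into an m-bit array.
    Answers 'definitely not present' or 'maybe present'."""

    def __init__(self, m_bits=1024, k_hashes=7):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray(m_bits // 8 + 1)

    def _probes(self, key):
        # Derive k probe positions from two digests of the key
        # (the common double-hashing scheme: h1 + i*h2 mod m).
        digest = hashlib.sha256(key.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, key):
        for p in self._probes(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._probes(key))
```

Before binary-searching an SSTable, the read path calls `might_contain`; a False answer is guaranteed correct, so the disk read is skipped entirely.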

SSTable Structure

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    SSTable File Format                          β”‚
β”‚                                                                 β”‚
β”‚   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚   β”‚                    Data Blocks                           β”‚ β”‚
β”‚   β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”           β”‚ β”‚
β”‚   β”‚  β”‚Block 0  β”‚β”‚Block 1  β”‚β”‚Block 2  β”‚β”‚Block N  β”‚           β”‚ β”‚
β”‚   β”‚  β”‚k1:v1    β”‚β”‚k50:v50  β”‚β”‚k100:v100β”‚β”‚...      β”‚           β”‚ β”‚
β”‚   β”‚  β”‚k2:v2    β”‚β”‚k51:v51  β”‚β”‚k101:v101β”‚β”‚         β”‚           β”‚ β”‚
β”‚   β”‚  β”‚...      β”‚β”‚...      β”‚β”‚...      β”‚β”‚         β”‚           β”‚ β”‚
β”‚   β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜           β”‚ β”‚
β”‚   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β”‚   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚   β”‚                    Index Block                           β”‚ β”‚
β”‚   β”‚  Block 0: first_key=k1,  offset=0                       β”‚ β”‚
β”‚   β”‚  Block 1: first_key=k50, offset=4096                    β”‚ β”‚
β”‚   β”‚  Block 2: first_key=k100, offset=8192                   β”‚ β”‚
β”‚   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β”‚   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚   β”‚                    Bloom Filter                          β”‚ β”‚
β”‚   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β”‚   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚   β”‚                    Footer/Metadata                       β”‚ β”‚
β”‚   β”‚  - Index offset                                          β”‚ β”‚
β”‚   β”‚  - Bloom filter offset                                   β”‚ β”‚
β”‚   β”‚  - Compression type                                      β”‚ β”‚
β”‚   β”‚  - Checksum                                              β”‚ β”‚
β”‚   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
β”‚                                                                 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
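The Index Block above is a sparse index: one (first_key, offset) entry per data block, so a lookup needs only one binary search to find the single block that could hold the key. A sketch, with illustrative keys and offsets:

```python
import bisect

# Sparse index as in the Index Block above: one entry per data block.
index_keys = ["apple", "mango", "tiger"]   # first key of each data block
index_offsets = [0, 4096, 8192]            # illustrative byte offsets

def find_block(key):
    """Offset of the only data block that could contain `key`:
    the last block whose first_key <= key."""
    i = bisect.bisect_right(index_keys, key) - 1
    if i < 0:
        return None   # key sorts before every block: not in this SSTable
    return index_offsets[i]
```

A real reader would then seek to that offset, decompress the block, and search within it; the footer's checksums guard against corruption along the way.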

System Architecture Diagram

Complete System View

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                           Distributed Key-Value Store                               β”‚
β”‚                                                                                     β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚  β”‚                              Client SDK                                      β”‚   β”‚
β”‚  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚   β”‚
β”‚  β”‚  β”‚ Routing     β”‚  β”‚ Retry       β”‚  β”‚ Load        β”‚  β”‚ Connection          β”‚ β”‚   β”‚
β”‚  β”‚  β”‚ (Ring Copy) β”‚  β”‚ Logic       β”‚  β”‚ Balancing   β”‚  β”‚ Pooling             β”‚ β”‚   β”‚
β”‚  β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚   β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β”‚                                      β”‚                                              β”‚
β”‚  ═══════════════════════════════════════════════════════════════════════════════   β”‚
β”‚                                      β”‚                                              β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚  β”‚                            Coordinator Layer                                 β”‚   β”‚
β”‚  β”‚                                                                              β”‚   β”‚
β”‚  β”‚    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”     β”‚   β”‚
β”‚  β”‚    β”‚  Request Flow:                                                    β”‚     β”‚   β”‚
β”‚  β”‚    β”‚  1. Parse request                                                β”‚     β”‚   β”‚
β”‚  β”‚    β”‚  2. Hash key β†’ find nodes                                        β”‚     β”‚   β”‚
β”‚  β”‚    β”‚  3. Forward to replicas                                          β”‚     β”‚   β”‚
β”‚  β”‚    β”‚  4. Wait for quorum (W or R)                                     β”‚     β”‚   β”‚
β”‚  β”‚    β”‚  5. Return response (or conflict)                                β”‚     β”‚   β”‚
β”‚  β”‚    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜     β”‚   β”‚
β”‚  β”‚                                                                              β”‚   β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β”‚                                      β”‚                                              β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚  β”‚                              Storage Nodes                                   β”‚   β”‚
β”‚  β”‚                                                                              β”‚   β”‚
β”‚  β”‚   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”             β”‚   β”‚
β”‚  β”‚   β”‚     Node A      β”‚  β”‚     Node B      β”‚  β”‚     Node C      β”‚             β”‚   β”‚
β”‚  β”‚   β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚             β”‚   β”‚
β”‚  β”‚   β”‚  β”‚ MemTable  β”‚  β”‚  β”‚  β”‚ MemTable  β”‚  β”‚  β”‚  β”‚ MemTable  β”‚  β”‚             β”‚   β”‚
β”‚  β”‚   β”‚  β””β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜  β”‚  β”‚  β””β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜  β”‚  β”‚  β””β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜  β”‚             β”‚   β”‚
β”‚  β”‚   β”‚  β”Œβ”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”  β”‚  β”‚  β”Œβ”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”  β”‚  β”‚  β”Œβ”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”  β”‚             β”‚   β”‚
β”‚  β”‚   β”‚  β”‚  SSTable  β”‚  β”‚  β”‚  β”‚  SSTable  β”‚  β”‚  β”‚  β”‚  SSTable  β”‚  β”‚             β”‚   β”‚
β”‚  β”‚   β”‚  β”‚   L0-L6   β”‚  β”‚  β”‚  β”‚   L0-L6   β”‚  β”‚  β”‚  β”‚   L0-L6   β”‚  β”‚             β”‚   β”‚
β”‚  β”‚   β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚  β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚  β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚             β”‚   β”‚
β”‚  β”‚   β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚             β”‚   β”‚
β”‚  β”‚   β”‚  β”‚    WAL    β”‚  β”‚  β”‚  β”‚    WAL    β”‚  β”‚  β”‚  β”‚    WAL    β”‚  β”‚             β”‚   β”‚
β”‚  β”‚   β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚  β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚  β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚             β”‚   β”‚
β”‚  β”‚   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜             β”‚   β”‚
β”‚  β”‚           β”‚                     β”‚                     β”‚                      β”‚   β”‚
β”‚  β”‚           β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                      β”‚   β”‚
β”‚  β”‚                                 β”‚                                            β”‚   β”‚
β”‚  β”‚                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                               β”‚   β”‚
β”‚  β”‚                    β”‚    Replication &        β”‚                               β”‚   β”‚
β”‚  β”‚                    β”‚    Anti-Entropy         β”‚                               β”‚   β”‚
β”‚  β”‚                    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                               β”‚   β”‚
β”‚  β”‚                                                                              β”‚   β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β”‚                                                                                     β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”‚
β”‚  β”‚                           Control Plane                                       β”‚   β”‚
β”‚  β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚   β”‚
β”‚  β”‚  β”‚   Gossip     β”‚  β”‚   Failure    β”‚  β”‚   Ring       β”‚  β”‚   Monitoring     β”‚  β”‚   β”‚
β”‚  β”‚  β”‚   Protocol   β”‚  β”‚   Detector   β”‚  β”‚   Manager    β”‚  β”‚   & Metrics      β”‚  β”‚   β”‚
β”‚  β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚   β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β”‚
β”‚                                                                                     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Write Flow Sequence

sequenceDiagram
    participant Client
    participant Coord as Coordinator
    participant NodeA as Node A
    participant NodeB as Node B
    participant NodeC as Node C

    Client->>Coord: PUT(k,v)
    Note over Coord: hash(k) β†’ [A,B,C]

    par Write to all replicas
        Coord->>NodeA: write(k,v)
        Coord->>NodeB: write(k,v)
        Coord->>NodeC: write(k,v)
    end

    NodeA-->>Coord: ACK
    NodeB-->>Coord: ACK (W=2 reached)
    Coord-->>Client: SUCCESS
    NodeC-->>Coord: ACK (async, W already met)
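The quorum rule in the sequence above (reply SUCCESS once W of the N replicas ack) can be sketched as follows; plain dicts and the `DownNode` class stand in for remote replicas and are purely illustrative:

```python
def quorum_put(replicas, key, value, w):
    """Fan a write out to every replica; the write succeeds iff at
    least `w` of them ack. Dicts stand in for remote-node RPCs here."""
    acks = 0
    for replica in replicas:
        try:
            replica[key] = value   # a real coordinator issues these in parallel
            acks += 1
        except ConnectionError:
            pass                   # an unreachable replica doesn't count toward W
    return acks >= w

class DownNode(dict):
    """Simulates a replica that is currently unreachable."""
    def __setitem__(self, key, value):
        raise ConnectionError("replica down")
```

With N=3 and W=2, one down replica is tolerated; two are not. A production coordinator would also start hinted handoff for the missed replica rather than silently dropping its copy.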

Trade-offs and Design Decisions

Key Design Decisions

| Decision | Options | Trade-off |
|---|---|---|
| Partitioning | Hash vs Range | Hash: even distribution. Range: efficient range queries |
| Replication | Sync vs Async | Sync: stronger consistency. Async: higher availability |
| Consistency | Strong vs Eventual | Strong: correctness. Eventual: performance |
| Storage | LSM vs B-Tree | LSM: fast writes. B-Tree: fast reads |
| Conflict Resolution | LWW vs Vector Clocks | LWW: simple. Vector clocks: accurate causality |

When to Choose What

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    Decision Matrix                              β”‚
β”‚                                                                 β”‚
β”‚   Use Case                      β”‚ Recommended Configuration     β”‚
β”‚   ─────────────────────────────┼──────────────────────────────│
β”‚   Financial transactions       β”‚ Strong consistency (W=R=N)    β”‚
β”‚   Session storage              β”‚ Eventual (W=1, R=1)           β”‚
β”‚   User profiles                β”‚ Balanced (W=2, R=2, N=3)      β”‚
β”‚   Analytics/metrics            β”‚ Fast writes (W=1, R=N)        β”‚
β”‚   Shopping cart                β”‚ CRDTs for merge               β”‚
β”‚   Leaderboards                 β”‚ LWW for simplicity            β”‚
β”‚                                                                 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Consistency vs Latency Trade-off

                    Latency
                      β–²
                      β”‚
                      β”‚              ● W=3, R=3
                      β”‚                 (Strong)
                      β”‚
                      β”‚        ● W=2, R=2
                      β”‚           (Balanced)
                      β”‚
                      β”‚     ● W=1, R=1
                      β”‚        (Eventual)
                      β”‚
                      └──────────────────────────▢ Consistency
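The rule behind this spectrum is quorum overlap: reads are guaranteed to observe the latest acknowledged write exactly when W + R > N, because any read quorum must then share at least one replica with the last write quorum. A one-line check, applied to the configurations from the decision matrix:

```python
def read_sees_latest_write(n, w, r):
    """With N replicas, every read quorum of size R intersects every
    write quorum of size W exactly when W + R > N (pigeonhole)."""
    return w + r > n
```

For N=3: W=1, R=1 gives no overlap guarantee (eventual), while W=2, R=2 and W=3, R=3 do, at increasing latency cost.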

Real-World Examples

| System | Company | Key Features |
|---|---|---|
| DynamoDB | Amazon | Fully managed, auto-scaling, single-digit ms latency |
| Cassandra | Apache | Wide-column, tunable consistency, multi-DC |
| Redis | Redis Labs | In-memory, rich data structures, Cluster mode |
| Riak | Basho | Masterless, CRDTs, high availability |
| etcd | CNCF | Strong consistency (Raft), Kubernetes config |
| RocksDB | Facebook | Embedded, LSM-based, used in many systems |
| TiKV | PingCAP | Distributed, Raft-based, ACID transactions |

Feature Comparison

|              | DynamoDB        | Cassandra       | Redis      | etcd                     |
|--------------|-----------------|-----------------|------------|--------------------------|
| Consistency  | Eventual/Strong | Tunable         | Eventual   | Strong                   |
| Partitioning | Automatic       | Consistent hash | Hash slots | None (single Raft group) |
| Replication  | Multi-AZ        | Multi-DC        | Async      | Raft                     |
| Storage      | B-Tree          | LSM             | In-memory  | bbolt (B-Tree)           |
| Use case     | General purpose | Time-series     | Cache      | Config/metadata          |

Interview Tips

How to Approach the Problem

The key is to show your thought process. Don’t just list solutions; identify each problem first, then solve it.

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚           Interview Flow (45 minutes)                           β”‚
β”‚                                                                 β”‚
β”‚   1. CLARIFY REQUIREMENTS (3-5 min)                            β”‚
β”‚      "Before I dive in, let me make sure I understand..."      β”‚
β”‚      β–‘ Scale: How many keys? QPS? Data size?                   β”‚
β”‚      β–‘ Consistency: Strong or eventual?                        β”‚
β”‚      β–‘ Latency: p99 requirements?                              β”‚
β”‚      β–‘ Availability: 99.9%? 99.99%?                            β”‚
β”‚      β–‘ Features: TTL? Range queries? Transactions?             β”‚
β”‚                                                                 β”‚
β”‚   2. HIGH-LEVEL DESIGN (5-7 min)                               β”‚
β”‚      "Let me start with the basic architecture..."             β”‚
β”‚      β–‘ API design (get, put, delete)                           β”‚
β”‚      β–‘ Single server first, then identify bottlenecks          β”‚
β”‚      β–‘ Add partitioning for scale                              β”‚
β”‚      β–‘ Add replication for durability                          β”‚
β”‚      β–‘ Draw architecture diagram                                β”‚
β”‚                                                                 β”‚
β”‚   3. DEEP DIVE - Show Problem β†’ Solution Flow (20-25 min)      β”‚
β”‚                                                                 β”‚
β”‚      "Now that we have multiple nodes, how do we distribute    β”‚
β”‚       data? Simple modulo hash won't work because..."          β”‚
β”‚      β†’ Consistent hashing                                       β”‚
β”‚                                                                 β”‚
β”‚      "What if a node crashes? We'd lose data..."               β”‚
β”‚      β†’ Replication                                              β”‚
β”‚                                                                 β”‚
β”‚      "With multiple replicas, how do we keep them in sync?"    β”‚
β”‚      β†’ Quorum (NWR), consistency models                        β”‚
β”‚                                                                 β”‚
β”‚      "What happens when a node temporarily fails?"             β”‚
β”‚      β†’ Hinted handoff                                           β”‚
β”‚                                                                 β”‚
β”‚      "What if it never comes back?"                            β”‚
β”‚      β†’ Re-replication with Merkle trees                        β”‚
β”‚                                                                 β”‚
β”‚   4. TRADE-OFFS (3-5 min)                                      β”‚
β”‚      "There are trade-offs to discuss..."                      β”‚
β”‚      β–‘ CAP theorem: We chose AP, here's why...                 β”‚
β”‚      β–‘ Consistency vs latency                                   β”‚
β”‚      β–‘ Memory vs disk                                           β”‚
β”‚                                                                 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Key Phrases That Show Structured Thinking

| Instead of… | Say… |
|---|---|
| β€œWe use consistent hashing” | β€œSimple modulo hashing has a problem: if we add a node, almost all keys move. Consistent hashing solves this by moving only ~1/N of the keys.” |
| β€œWe replicate data” | β€œPartitioning alone is risky: if a node fails, we lose its data. So we replicate each key to N nodes for durability.” |
| β€œWe use hinted handoff” | β€œWhen a node is temporarily down, we have a choice: fail the write or buffer it somewhere. Hinted handoff buffers the write on another node and delivers it once the node recovers.” |

Common Follow-up Questions

| Question | Key Points |
|---|---|
| β€œHow do you handle hot keys?” | Caching, salting the key across partitions, serving reads from replicas |
| β€œHow do you support range queries?” | Use range partitioning instead of hash partitioning |
| β€œHow do you handle large values?” | Chunk values, or store them in blob storage and keep a pointer in the KV store |
| β€œHow do you support transactions?” | 2PC, Paxos/Raft-based consensus, or explicitly don’t support them |
| β€œHow do you handle datacenter failure?” | Multi-DC replication, async with conflict resolution |
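The hot-key answer above (adding a salt to the key) is easy to demonstrate; the fan-out factor of 4 is an arbitrary choice for this sketch:

```python
import random

SALTS = 4   # arbitrary fan-out factor for this sketch

def salted_write_key(key):
    """Writes to a hot key are spread over SALTS physical keys,
    so no single partition absorbs all the traffic."""
    return f"{key}#{random.randrange(SALTS)}"

def read_keys(key):
    """Reads must consult every salted copy (e.g. sum sharded counters)."""
    return [f"{key}#{i}" for i in range(SALTS)]
```

The trade-off is typical of the pattern: writes get cheaper because they hit one random shard, while reads get more expensive because they must fan out across all shards and aggregate.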

Key Metrics to Know

Typical performance targets:
- Read latency: p50 < 1ms, p99 < 10ms
- Write latency: p50 < 5ms, p99 < 50ms
- Availability: 99.99% (~52.6 min downtime/year)
- Replication factor: 3 (tolerates 1 failure)
- Virtual nodes: 100-256 per physical node
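The availability target above is simple arithmetic; at 99.99% the yearly downtime budget works out to roughly 52.6 minutes:

```python
minutes_per_year = 365 * 24 * 60                     # 525,600 minutes (non-leap year)
downtime_budget = minutes_per_year * (1 - 0.9999)    # ~52.6 minutes/year at 99.99%
```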

Summary

A distributed key-value store combines several core concepts:

  1. Partitioning: Consistent hashing distributes data across nodes
  2. Replication: Multiple copies ensure durability and availability
  3. Consistency: Quorum (NWR) provides tunable consistency levels
  4. Failure Handling: Gossip, hinted handoff, and Merkle trees
  5. Storage: LSM trees optimize for write-heavy workloads

The key trade-off is between consistency and availability (CAP theorem). Most modern KV stores choose availability and provide tunable consistency through quorum configuration.


References