System Design: Key-Value Store
A distributed key-value store is a non-relational database that stores data as key-value pairs across multiple nodes. This document covers the design of a scalable, highly available key-value store suitable for system design interviews.
Table of Contents
- Requirements
- High-Level Architecture
- API Design
- Data Partitioning
- Replication
- Consistency Models
- Handling Failures
- Storage Engine
- System Architecture Diagram
- Trade-offs and Design Decisions
- Real-World Examples
- Interview Tips
Requirements
Functional Requirements
| Requirement | Description |
|---|---|
| put(key, value) | Insert/update a key-value pair |
| get(key) | Retrieve value by key |
| delete(key) | Remove a key-value pair |
Non-Functional Requirements
| Requirement | Target |
|---|---|
| Scalability | Handle petabytes of data, millions of QPS |
| High Availability | 99.99% uptime (< 52 mins downtime/year) |
| Low Latency | p99 < 10ms for reads and writes |
| Durability | No data loss once acknowledged |
| Tuneable Consistency | Support eventual to strong consistency |
Capacity Estimation (Example)
Assumptions:
- 100 million DAU
- Each user: 10 reads, 2 writes per day
- Average key size: 100 bytes
- Average value size: 10 KB
Calculations:
- Read QPS: (100M × 10) / 86,400 ≈ 11,574 QPS
- Write QPS: (100M × 2) / 86,400 ≈ 2,315 QPS
- Peak QPS (3×): ~35K reads, ~7K writes
- Daily storage: 100M × 2 × 10 KB = 2 TB/day
- Yearly storage: ~730 TB (before replication)
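The arithmetic above can be checked with a few lines of Python. This is just a back-of-envelope script using the stated assumptions:

```python
# Back-of-envelope capacity estimation (numbers from the assumptions above).
DAU = 100_000_000
READS_PER_USER, WRITES_PER_USER = 10, 2
VALUE_SIZE_KB = 10
SECONDS_PER_DAY = 86_400

read_qps = DAU * READS_PER_USER / SECONDS_PER_DAY
write_qps = DAU * WRITES_PER_USER / SECONDS_PER_DAY
daily_storage_tb = DAU * WRITES_PER_USER * VALUE_SIZE_KB / 1e9  # KB -> TB

print(f"Read QPS:  {read_qps:,.0f}")    # ~11,574
print(f"Write QPS: {write_qps:,.0f}")   # ~2,315
print(f"Peak (3x): {3 * read_qps:,.0f} reads, {3 * write_qps:,.0f} writes")
print(f"Storage:   {daily_storage_tb:.1f} TB/day, ~{daily_storage_tb * 365:.0f} TB/year")
```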
High-Level Architecture
```mermaid
flowchart TB
    subgraph ClientLayer["Client Layer (SDK / HTTP Client / CLI)"]
    end
    ClientLayer --> LB["Load Balancer<br/>(Round Robin / Least Conn)"]
    LB --> C1["Coordinator 1<br/>(Stateless API)"]
    LB --> C2["Coordinator 2<br/>(Stateless API)"]
    LB --> CN["Coordinator N<br/>(Stateless API)"]
    C1 --> Ring["Consistent Hash Ring<br/>(Partition Key → Node Mapping)"]
    C2 --> Ring
    CN --> Ring
    Ring --> NA["Node A<br/>MemTable → SSTable"]
    Ring --> NB["Node B<br/>MemTable → SSTable"]
    Ring --> NC["Node C<br/>MemTable → SSTable"]
    Ring --> ND["Node D<br/>MemTable → SSTable"]
    Ring --> NN["Node N<br/>MemTable → SSTable"]
    NA --> Repl["Replication Layer<br/>(Sync/Async Replication)"]
    NB --> Repl
    NC --> Repl
    ND --> Repl
    NN --> Repl
```
Components
| Component | Responsibility |
|---|---|
| Client | SDK that routes requests and handles retries |
| Coordinator | Stateless node that handles routing, quorum reads/writes |
| Storage Node | Stores data partitions, handles local reads/writes |
| Consistent Hash Ring | Maps keys to nodes (see consistent-hashing.md) |
| Replication Manager | Manages data replication across nodes |
API Design
Core Operations
```
PUT /v1/kv/{key}
  Request:  { "value": "...", "ttl": 3600 }
  Response: { "version": 12345, "status": "ok" }

GET /v1/kv/{key}
  Response: { "value": "...", "version": 12345 }

DELETE /v1/kv/{key}
  Response: { "status": "deleted" }
```
Internal RPC (Node-to-Node)
| Operation | Purpose |
|---|---|
| put(key, value, context) | Write to local storage with conflict detection |
| get(key) | Read from local storage |
| replicate(key, value, context) | Accept replicated data from peer |
| handoff(key_range, target_node) | Transfer key range to another node |
Data Partitioning
Interview context: After the high-level design, the interviewer will ask: "How do you distribute data across multiple nodes?" This is where consistent hashing comes in.
Why Do We Need Partitioning?
A single server can't store petabytes of data or handle millions of QPS. We need to split data across multiple nodes. But how do we decide which node stores which key?
```
Partitioning Options

✗ Option 1: Simple Modulo Hash
     node = hash(key) % N
     Problem: adding/removing nodes remaps almost ALL keys!

✓ Option 2: Consistent Hashing
     Only ~1/N keys remapped when nodes change
```
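To see why modulo hashing is so disruptive, here is a small Python experiment (the key names and use of MD5 are illustrative choices) that counts how many keys move when a cluster grows from 4 to 5 nodes:

```python
# Why modulo hashing breaks on resize: count how many keys change owner
# when the cluster grows from 4 to 5 nodes.
import hashlib

def node_for(key: str, n_nodes: int) -> int:
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % n_nodes

keys = [f"user:{i}" for i in range(10_000)]
moved = sum(node_for(k, 4) != node_for(k, 5) for k in keys)
print(f"{moved / len(keys):.0%} of keys remapped")  # typically ~80%
```

With consistent hashing, only roughly 1/5 of the keys would move in the same scenario.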
For detailed explanation of consistent hashing, see consistent-hashing.md.
Partitioning Strategy
```
Hash Ring (2^128 space)

          0°
     A3 ─────── B1
    /             \
  A1               B2
   │   Virtual     │
  C2    Nodes     C1
    \             /
     A2 ─────── B3
         180°

Physical Nodes: A (vnodes: A1, A2, A3)
                B (vnodes: B1, B2, B3)
                C (vnodes: C1, C2, C3)
```
Key-to-Node Mapping
To find which nodes store a key:
- Hash the key: hash("user:123") → position on the ring
- Walk clockwise: starting from that position
- Collect N distinct physical nodes: skip virtual nodes pointing to the same server
- Return the node list: the first node is the coordinator, the others are replicas
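The lookup above can be sketched in Python. This is an illustrative toy, not the production data structure: node names, the vnode count, and the use of MD5 are arbitrary choices, and `bisect` provides the "walk clockwise" step:

```python
# Minimal consistent-hash ring with virtual nodes (illustrative sketch).
import bisect
import hashlib

def _hash(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, vnodes=3):
        # Each physical node owns several positions ("virtual nodes") on the ring.
        self._points = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )

    def preference_list(self, key: str, n: int = 3):
        # Walk clockwise from the key's position, collecting distinct physical nodes.
        idx = bisect.bisect(self._points, (_hash(key),))
        result = []
        for offset in range(len(self._points)):
            node = self._points[(idx + offset) % len(self._points)][1]
            if node not in result:
                result.append(node)
            if len(result) == n:
                break
        return result

ring = Ring(["A", "B", "C", "D"])
print(ring.preference_list("user:123"))  # first node = coordinator, rest = replicas
```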
Why Virtual Nodes?
| Benefit | Explanation |
|---|---|
| Load Balancing | Keys distributed evenly across nodes |
| Heterogeneous Hardware | More vnodes for powerful servers |
| Smoother Scaling | Adding node redistributes many small ranges |
| Faster Recovery | Multiple nodes share recovery load |
Replication
Interview context: After discussing partitioning, the interviewer will likely ask: "What happens if a node storing data crashes? How do you prevent data loss?"
Why Do We Need Replication?
Partitioning solves scalability - we can store more data by adding nodes. But it introduces a problem:
Without Replication:
```
Key "user:123" → hash → Node B

If Node B crashes → "user:123" is LOST FOREVER!
```
With Replication (N=3):
```
Key "user:123" → stored on [Node A, Node B, Node C]

If Node B crashes → still have copies on A and C!
```
Replication gives us:
- Durability: Data survives node failures
- Availability: Can read from replicas if primary is down
- Read scalability: Distribute read load across replicas
Replication Strategy
```mermaid
flowchart TB
    subgraph ReplicationModel["Replication Model (N=3)"]
        Key["Key 'user:123' hashes to position P"]
        Key --> NodeA["Node A (Primary/Coordinator)"]
        NodeA -->|replicate| NodeB["Node B (Replica 1)"]
        NodeB -->|replicate| NodeC["Node C (Replica 2)"]
    end
    Note["Preference List: [A, B, C, D, E] (in ring order)<br/>Active Replicas: First 3 healthy nodes from list"]
```
Quorum Configuration (NWR)
```
N = Total replicas (typically 3)
W = Write quorum (nodes that must acknowledge a write)
R = Read quorum (nodes that must respond to a read)

Rule: W + R > N guarantees reading the latest write
```

Configuration examples:

| Config | N | W | R | Use Case |
|---|---|---|---|---|
| Strong consistency | 3 | 3 | 1 | Every write hits all replicas; fast, consistent reads |
| Balanced | 3 | 2 | 2 | Default, good balance |
| Fast reads | 3 | 2 | 1 | Read-heavy workloads |
| Eventual | 3 | 1 | 1 | Max availability |
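The W + R > N rule works because any read quorum must then share at least one node with any write quorum, so some replica in every read has seen the latest write. A brute-force Python check (illustrative only) makes the equivalence concrete:

```python
# Verify the quorum-overlap rule: W + R > N  <=>  every read quorum
# intersects every write quorum in at least one node.
from itertools import combinations

def quorums_overlap(n: int, w: int, r: int) -> bool:
    nodes = range(n)
    # Check every possible write quorum against every possible read quorum.
    return all(
        set(wq) & set(rq)
        for wq in combinations(nodes, w)
        for rq in combinations(nodes, r)
    )

assert quorums_overlap(3, 2, 2)        # W + R = 4 > 3: quorums always overlap
assert not quorums_overlap(3, 1, 1)    # W + R = 2 <= 3: a read can miss the write
print("W + R > N  <=>  read and write quorums always intersect")
```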
Synchronous vs Asynchronous Replication
Synchronous (W=N):
```mermaid
sequenceDiagram
    participant Client
    participant NodeA as Node A
    participant NodeB as Node B
    participant NodeC as Node C
    Client->>NodeA: write
    NodeA->>NodeB: replicate
    NodeA->>NodeC: replicate
    NodeB-->>NodeA: ack
    NodeC-->>NodeA: ack
    NodeA-->>Client: ack (all replicas)
```
- Pros: Strong consistency
- Cons: Higher latency, lower availability
Asynchronous (W=1):
```mermaid
sequenceDiagram
    participant Client
    participant NodeA as Node A
    participant NodeB as Node B
    participant NodeC as Node C
    Client->>NodeA: write
    NodeA-->>Client: ack (immediate)
    NodeA-)NodeB: async replicate
    NodeA-)NodeC: async replicate
```
- Pros: Low latency, high availability
- Cons: Potential data loss, eventual consistency
Consistency Models
Interview context: After discussing replication, a natural follow-up is: "With multiple replicas, how do you keep them in sync? What if two clients write to different replicas at the same time?"
The Consistency Challenge
With replication, we introduced a new problem: keeping replicas consistent.
```
The Consistency Problem

Time T1: Client 1 writes x=1 to Node A
Time T2: Client 2 writes x=2 to Node B (before A replicates)
Time T3: Client 3 reads from Node C

What value does Client 3 see?
  - Could be x=1 (replicated from A)
  - Could be x=2 (replicated from B)
  - Could be nothing (neither replicated yet)
  - Could be BOTH (conflict!)
```
This leads to a fundamental trade-off in distributed systems: the CAP Theorem.
CAP Theorem Trade-offs
```
CAP Theorem

              Consistency (C)
                   /\
                  /  \
                 / CP \ CA
                /      \
  Availability ──────── Partition
      (A)        AP     Tolerance (P)

Under a network partition, a system can provide at most
two of the three guarantees at once.

Key-value stores typically choose AP (DynamoDB, Cassandra)
or CP (HBase, traditional RDBMS).
```
Conflict Resolution
When W + R ≤ N, concurrent writes can create conflicts. Resolution strategies:
1. Last-Write-Wins (LWW)
```
Last-Write-Wins (LWW)

Client A writes: key="x", value="A", timestamp=100
Client B writes: key="x", value="B", timestamp=101

On read: return the value with the highest timestamp → "B"

Pros: simple, deterministic
Cons: silent data loss
```
2. Vector Clocks
```
Vector Clocks

Each write carries a vector: {NodeA: 2, NodeB: 1, NodeC: 3}

Concurrent write detection:

  V1 = {A:1, B:0}   (Client 1 writes via Node A)
  V2 = {A:0, B:1}   (Client 2 writes via Node B)

Neither dominates → CONFLICT DETECTED

Resolution options:
  1. Return both values to the client (application resolves)
  2. Merge values (CRDTs)
  3. Use semantic rules (shopping cart = union)
```
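The dominance check above can be sketched in a few lines of Python (node names are illustrative). V1 dominates V2 if it is greater than or equal on every node and strictly greater on at least one; if neither dominates, the writes are concurrent:

```python
# Vector clock comparison (sketch): detect stale vs concurrent writes.
def compare(v1: dict, v2: dict) -> str:
    nodes = set(v1) | set(v2)
    ge = all(v1.get(n, 0) >= v2.get(n, 0) for n in nodes)
    le = all(v1.get(n, 0) <= v2.get(n, 0) for n in nodes)
    if ge and le:
        return "equal"
    if ge:
        return "v1 dominates"   # v2 is stale, safe to discard
    if le:
        return "v2 dominates"
    return "concurrent"         # conflict: surface both values to the application

assert compare({"A": 2, "B": 1}, {"A": 1, "B": 1}) == "v1 dominates"
assert compare({"A": 1, "B": 0}, {"A": 0, "B": 1}) == "concurrent"
```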
3. CRDTs (Conflict-free Replicated Data Types)
```
CRDTs

G-Counter (grow-only counter):

  Node A: {A: 5, B: 3, C: 2} → total: 10
  Node B: {A: 4, B: 4, C: 2} → total: 10

  Merge: {A: max(5,4), B: max(3,4), C: max(2,2)}
       = {A: 5, B: 4, C: 2} → total: 11

Types: G-Counter, PN-Counter, G-Set, OR-Set, LWW-Register
Used by: Riak, Redis CRDT, Cassandra counters
```
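The G-Counter merge above, as a few lines of Python (a sketch, not a production CRDT). Each node increments only its own slot, and merging takes the per-node max, which makes merges commutative, associative, and idempotent:

```python
# G-Counter merge: per-node max, using the example from the diagram above.
def merge(a: dict, b: dict) -> dict:
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}

node_a = {"A": 5, "B": 3, "C": 2}
node_b = {"A": 4, "B": 4, "C": 2}

merged = merge(node_a, node_b)
assert merged == {"A": 5, "B": 4, "C": 2}
assert sum(merged.values()) == 11
assert merge(node_a, node_b) == merge(node_b, node_a)   # commutative
assert merge(merged, merged) == merged                  # idempotent
```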
Handling Failures
Interview Tip: When discussing failures, start by identifying what can go wrong, then systematically address each scenario. This shows structured thinking.
The Problem: What Can Go Wrong?
In a distributed system, failures are inevitable. Let's think about what can fail:
```
What Can Fail?

1. NETWORK ISSUES
   - Packet loss, network partition
   - Node appears dead but is actually alive

2. NODE FAILURES
   - Process crash (temporary: will restart)
   - Server reboot (temporary: minutes)
   - Hardware failure (permanent: disk dead)

3. DATA CENTER FAILURES
   - Power outage, natural disaster
```
This raises several questions we need to answer:
| Question | Why It Matters |
|---|---|
| How do we know a node failed? | Can't fix what we can't detect |
| Is it temporary or permanent? | Different strategies needed |
| How do writes succeed during failure? | Maintain availability |
| How do we recover lost replicas? | Maintain durability |
Let's address each question systematically.
7.1 Failure Detection: Gossip Protocol
First question: How do we know a node has failed?
In a distributed system, we can't rely on a single "master" to track health - that would be a single point of failure. Instead, we use a decentralized approach where nodes monitor each other.
Why Not Just Use Heartbeats to a Central Server?
```
✗ Centralized health check (single point of failure):

        ┌─────────┐
        │ Monitor │ ← if this dies, no failure detection!
        └────┬────┘
      ┌────┬─┴──┬────┐
      ▼    ▼    ▼    ▼
     [A]  [B]  [C]  [D]

✓ Decentralized gossip (no single point of failure):

     [A] ←───→ [B]
      ↑  ╲   ╱  ↑      everyone monitors everyone
      │   ╲ ╱   │      through random peer exchange
      ↓   ╱ ╲   ↓
     [D] ←───→ [C]
```
How Gossip Works
Every T seconds (typically 1s), each node:
- Picks random peer
- Exchanges membership list with heartbeat counters
- Merges lists (keep highest heartbeat per node)
```mermaid
sequenceDiagram
    participant NodeA as Node A
    participant NodeB as Node B
    Note over NodeA: {A:10, B:5, C:7, D:3}
    Note over NodeB: {A:9, B:6, C:7, D:4}
    NodeA->>NodeB: gossip exchange
    NodeB-->>NodeA: respond
    Note over NodeA,NodeB: Merged: {A:10, B:6, C:7, D:4}
```
Node marked DOWN if heartbeat unchanged for T_fail seconds.
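The merge step can be sketched in Python using the counters from the exchange above: each node keeps the highest heartbeat counter it has seen for every peer.

```python
# Gossip merge step (sketch): keep the highest heartbeat counter per peer.
def gossip_merge(local: dict, remote: dict) -> dict:
    peers = set(local) | set(remote)
    return {p: max(local.get(p, 0), remote.get(p, 0)) for p in peers}

node_a_view = {"A": 10, "B": 5, "C": 7, "D": 3}
node_b_view = {"A": 9, "B": 6, "C": 7, "D": 4}

merged = gossip_merge(node_a_view, node_b_view)
assert merged == {"A": 10, "B": 6, "C": 7, "D": 4}
```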
Failure State Transitions
```mermaid
stateDiagram-v2
    [*] --> HEALTHY
    HEALTHY --> SUSPICIOUS: Missed heartbeats (T > 5s)
    SUSPICIOUS --> DOWN: No recovery (T > 30s)
    DOWN --> HEALTHY: Node recovers
    DOWN --> PERMANENTLY_FAILED: No recovery (T > 10min)
    PERMANENTLY_FAILED --> REMOVED: Admin removes or auto-remove
    REMOVED --> [*]
```
Timeline:
- T=0: Node D stops responding
- T=5s: Marked SUSPICIOUS (missed heartbeats)
- T=30s: Marked DOWN (hinted handoff activates)
- T=10min: Marked PERMANENTLY FAILED (re-replication triggers)
7.2 Temporary Failures: Hinted Handoff
Second question: How do writes succeed when a node is temporarily down?
Now that we can detect failures, what happens when a write request comes in but one of the target nodes is down?
The Challenge
```
The Problem

Scenario: client wants to write key "user:123"
Target nodes: [A, B, C] (N=3 replicas)
But Node B is temporarily DOWN (restarting)

Options:
  ✗ Option 1: Fail the write
      Bad UX, reduces availability

  ✗ Option 2: Write to only A and C (skip B)
      Data lost if B never gets the update

  ✓ Option 3: Write to A, C, and temporarily store B's copy elsewhere
      This is Hinted Handoff!
```
The Solution: Hinted Handoff
The key idea is simple: when a target node is down, write to another node with a "hint" to forward the data when the target recovers.
Normal Write Path (All Nodes Healthy)
```
┌────────┐     ┌────────┐     ┌────────┐     ┌────────┐
│ Client │────▶│ Node A │────▶│ Node B │────▶│ Node C │
└────────┘     └────────┘     └────────┘     └────────┘

All 3 nodes receive the data; W=2 satisfied.
```
Write Path with Node B Down
```
Hinted Handoff in Action

Step 1: Coordinator detects Node B is DOWN

        Client ──▶ Node A ──X──▶ Node B (DOWN)
                      └─────────▶ Node C          (write to C instead)

Step 2: Write to Node D with a hint for B

        Node D stores: {
          key: "user:123",
          value: "data...",
          hint_for: "Node B",
          timestamp: 1706123456
        }

Step 3: W=2 satisfied (A + C, or A + D), return success

Step 4: When B recovers (detected via gossip):

        Node D ──handoff──▶ Node B (recovered)
               ◀───ACK─────
        Node D deletes its local hint
```
Sloppy Quorum
Hinted handoff enables a sloppy quorum - writes succeed using any W healthy nodes:
```
Quorum Comparison

STRICT QUORUM (traditional):
  Preference list: [A, B, C]
  Must write to 2 of [A, B, C] only
  If B is down        → write still succeeds (A, C = 2 ≥ W)
  If B and C are down → WRITE FAILS (only A available = 1 < W)

SLOPPY QUORUM (with hinted handoff):
  Preference list: [A, B, C, D, E, ...]
  Must write to any 2 healthy nodes
  If B is down        → write to A + C (or A + D with a hint)
  If B and C are down → write to A + D (hints for B, C)

Result: higher write availability
```
How It Works
- Write attempt: Target node A is down
- Select hint holder: Next healthy node on ring (D)
- Store hint: D stores the data + metadata about intended target (A)
- Wait for recovery: Gossip detects A is back
- Deliver hint: D sends stored data to A
- Delete hint: Once delivered, hint is removed
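The steps above can be sketched as a toy hint store in Python. The `HintStore` class and its in-memory dict are illustrative assumptions, not a real API; a production system persists hints to disk and bounds their size and age:

```python
# Hinted-handoff bookkeeping (illustrative sketch): store hints for a down
# node, then replay and delete them when gossip reports the node recovered.
class HintStore:
    def __init__(self):
        self._hints = {}                      # intended_node -> [(key, value)]

    def store_hint(self, intended_node, key, value):
        self._hints.setdefault(intended_node, []).append((key, value))

    def replay(self, recovered_node, deliver):
        # deliver(key, value) sends the data to the recovered node;
        # the hints are removed once handed off.
        for key, value in self._hints.pop(recovered_node, []):
            deliver(key, value)

store = HintStore()
store.store_hint("NodeB", "user:123", "data...")   # B was down during the write

delivered = []
store.replay("NodeB", lambda k, v: delivered.append((k, v)))
assert delivered == [("user:123", "data...")]
assert "NodeB" not in store._hints                  # hint deleted after handoff
```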
Trade-offs
| Pros | Cons |
|---|---|
| Higher write availability | Temporary inconsistency |
| Client doesn't see failures | Storage overhead on hint nodes |
| Automatic recovery | Hints can pile up for long outages |
| No data loss for temp failures | Read-your-write not guaranteed |
7.3 Permanent Failures: Re-replication
Third question: What if a node never comes back?
Hinted handoff works great for temporary failures. But what if the node's disk crashed and it's never coming back? Hints will pile up forever, and more importantly, we've lost a replica.
The Challenge
```
The Problem

Scenario: Node B's disk crashes (permanent failure)

Before failure:
  Key "x" stored on: [A, B, C]  →  3 copies (safe)

After failure (with only hinted handoff):
  Key "x" stored on: [A, C] + hint on D  →  only 2 real copies!

Risk: if A or C also fails → DATA LOSS!

We need to CREATE A NEW REPLICA to restore N=3.
```
But How Do We Know It's Permanent?
Good question! We can't know for certain, so we use a time threshold:
- If a node is down for > 10 minutes β assume permanent
- Trigger re-replication to be safe
- If the node comes back later, we just have extra copies (harmless)
Why Re-replication is Necessary
```
Why Hinted Handoff Isn't Enough

Scenario: Node B permanently fails (disk crash)

With only hinted handoff:
  - Hints pile up on other nodes
  - B never comes back to receive them
  - Data only exists on 2 nodes (A, C) instead of 3
  - If another node fails, DATA LOSS!

Before B failed:        After B failed (no re-replication):
  Key "x" on [A,B,C]      Key "x" on [A,C] only
  3 copies = safe         2 copies = VULNERABLE

Solution: detect permanent failure → create a new replica
```
Re-replication Process
```
Permanent Failure Recovery Flow

BEFORE (Node B fails permanently):

  Data distribution (N=3):
    Key "x" → [A, B, C]
    Key "y" → [B, C, D]
    Key "z" → [C, D, A]

  Node B owned:
    - Primary: keys in range (A, B]
    - Replica: keys from A and D

AFTER (re-replication complete):

  Data distribution (N=3):
    Key "x" → [A, C, D]   (D replaces B)
    Key "y" → [C, D, E]   (E, a new node, replaces B)
    Key "z" → [C, D, A]   (unchanged)

  Actions taken:
    1. Remove B from the ring
    2. Identify under-replicated keys
    3. Stream data to the new replicas
```
Step-by-Step Recovery
```
Permanent Failure Recovery Steps

1. DETECTION (via gossip)
   - Node B heartbeat unchanged for >10 minutes
   - Marked as PERMANENTLY_FAILED

2. RING UPDATE
   - Remove B's virtual nodes from the hash ring
   - B's key ranges now map to the next nodes clockwise
   - Broadcast the new ring to all nodes

3. IDENTIFY UNDER-REPLICATED DATA
   - For each key range B owned:
       find keys that now have only 2 replicas,
       add them to the re-replication queue

4. SELECT NEW REPLICA NODES
   - For each under-replicated key range:
       walk clockwise on the ring from the key position,
       pick the next node NOT already holding that data

5. STREAM DATA (anti-entropy)
   - Existing replicas send data to the new nodes
   - Use Merkle trees for an efficient diff
   - Throttle to prevent network saturation

6. VERIFY & COMPLETE
   - New replicas acknowledge receipt
   - Replication factor N=3 restored
   - Mark recovery complete
```
But Wait - How Do We Efficiently Sync Data?
Follow-up question an interviewer might ask: "Re-replicating data sounds expensive. Do you copy ALL the data?"
Great question! If Node B had 100 GB of data, we don't want to blindly copy everything. We need to find only the missing keys. But checking every key is O(N) - too slow!
Solution: Merkle Trees - a data structure that lets us find differences in O(log N).
Anti-Entropy with Merkle Trees
Merkle Trees let us efficiently find which keys differ between two nodes:
```
Merkle Trees for Sync

Problem: Node E needs B's data. Which keys exactly?
Scanning all keys is O(N) - too slow!

Solution: compare hash trees to find differences in O(log N)

   Node C (has data)             Node E (needs data)

      H(root)="abc"                 H(root)="xyz"
      /         \                   /         \
 H(L)="def"  H(R)="ghi"        H(L)="def"  H(R)="000"
  /    \      /    \            /    \      /    \
 H1    H2    H3    H4          H1    H2    H3    H4
                └────── different! ──────┘

Sync process:
  1. Compare root hashes: "abc" ≠ "xyz" → different
  2. Compare children: H(L) same, H(R) different
  3. Recurse into H(R): H3 different, H4 same
  4. Only sync keys in H3's range (25% of the data)

Complexity: O(log N) comparisons to find differences
```
Merkle Tree Structure
```
Merkle Tree Example

                H(root) = H( H("abc") + H("def") )
               ┌──────────────┴──────────────┐
           H("abc")                      H("def")
          ┌────┴────┐                   ┌────┴────┐
     keys 0-25   keys 26-50        keys 51-75  keys 76-99
      H(data)     H(data)           H(data)     H(data)

Each leaf   = hash of all keys in that range
Each parent = hash of its children's hashes
Any change in data → changes propagate to the root
```
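A compact Python sketch of the Merkle diff, assuming for simplicity a fixed, power-of-two number of key ranges per replica. It builds a small hash tree over per-range hashes and recurses only into subtrees whose hashes differ:

```python
# Merkle-tree diff (sketch): find which key ranges differ in O(log N) hops.
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_tree(leaves):
    # leaves: per-range hashes; returns the list of levels, root level last.
    levels = [leaves]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([h(prev[i] + prev[i + 1]) for i in range(0, len(prev), 2)])
    return levels

def diff_ranges(tree_a, tree_b):
    # Walk down from the root, keeping only subtrees whose hashes differ.
    suspects = [0] if tree_a[-1] != tree_b[-1] else []
    for level in range(len(tree_a) - 2, -1, -1):
        suspects = [
            c for p in suspects for c in (2 * p, 2 * p + 1)
            if tree_a[level][c] != tree_b[level][c]
        ]
    return suspects                      # leaf indices (key ranges) to sync

ranges_c = [h(b"r0"), h(b"r1"), h(b"r2"), h(b"r3")]   # Node C's 4 key ranges
ranges_e = [h(b"r0"), h(b"r1"), h(b"X"),  h(b"r3")]   # Node E differs in range 2

assert diff_ranges(build_tree(ranges_c), build_tree(ranges_e)) == [2]
```

Only range 2 gets streamed; the other three ranges are skipped after two hash comparisons.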
Recovery Process
- Detect failure: Gossip protocol marks node as permanently failed (down >10 min)
- Update ring: Remove failed node from membership
- Find affected ranges: Identify key ranges with < N replicas
- Select new replica: Next healthy node in ring not already holding data
- Sync using Merkle tree: Compare tree hashes to find differences
- Transfer only diffs: Stream only the differing keys (efficient!)
Interviewer might ask: "Why use Merkle trees?"
Merkle trees allow O(log N) comparison of datasets. Compare root hashes: if they are equal, the data is identical; if different, recurse into the children to find exactly which ranges differ. Much more efficient than comparing key-by-key.
7.4 Read Repair
Fourth question: What about replicas that get out of sync over time?
Even without failures, replicas can become inconsistent:
- A write succeeded on 2 of 3 nodes (W=2), third node was slow
- Network issues caused a partial update
- A node was down briefly and missed some writes
Rather than running expensive background sync constantly, we can fix inconsistencies opportunistically during reads. This is called Read Repair.
```
Read Repair (during a read with R=2, N=3)

1. Client sends GET key="x" to the coordinator.
2. Coordinator queries the replicas:
     Node A returns v=1
     Node B returns v=2
     Node C returns v=2
3. Coordinator detects v=1 < v=2:
     - Return v=2 to the client
     - Asynchronously send v=2 to Node A (repair)
4. Client receives value = 2.
```
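The coordinator's decision can be sketched in Python. Version-number conflict resolution is assumed here for simplicity; a real system may compare vector clocks instead:

```python
# Read-repair decision (sketch): return the newest version among replica
# responses and queue repair writes for any replica that returned a stale one.
def read_with_repair(responses: dict):
    # responses: replica name -> (version, value)
    newest_version, newest_value = max(responses.values())
    stale = [node for node, (v, _) in responses.items() if v < newest_version]
    return newest_value, stale      # stale replicas get an async repair write

value, repairs = read_with_repair({
    "NodeA": (1, "old"),
    "NodeB": (2, "new"),
    "NodeC": (2, "new"),
})
assert value == "new"
assert repairs == ["NodeA"]
```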
7.5 Comparison: Temporary vs Permanent Failures
Summary: How do all these mechanisms work together?
Let's summarize the complete failure handling strategy:
```
Complete Failure Handling Flow

Node stops responding
  │
  ▼
Gossip detects missed heartbeats
  │
  ▼
Mark as DOWN (after 30s) ──▶ hinted handoff activates:
  │                          writes go to backup nodes
  │
  │  Node still down after 10+ minutes?
  │
  ├── Node recovers   ──▶ hints replayed to the node
  │
  └── Node stays down ──▶ mark PERMANENTLY FAILED,
                          remove from ring,
                          trigger re-replication,
                          use Merkle trees to sync efficiently

Meanwhile: read repair fixes any stale data opportunistically.
```
| Aspect | Temporary Failure | Permanent Failure |
|---|---|---|
| Detection time | 30 seconds | 10+ minutes |
| Mechanism | Hinted Handoff | Re-replication |
| Data location | Hints on other nodes | New full replicas |
| Ring change | No | Yes, node removed |
| Network cost | Low (hints only) | High (full data streaming) |
| Recovery | Automatic on return | Manual or auto add node |
| Data risk | Low (hints preserved) | Medium (fewer replicas) |
7.6 Configuration Parameters
| Parameter | Typical Value | Description |
|---|---|---|
| gossip_interval | 1 second | Heartbeat frequency |
| suspicion_threshold | 5 seconds | Mark suspicious |
| down_threshold | 30 seconds | Mark DOWN |
| permanent_failure_threshold | 10 minutes | Trigger re-replication |
| hint_ttl | 3 hours | Max hint age |
| max_hints_per_node | 10 GB | Hint storage cap |
| streaming_throughput | 50 MB/s | Re-replication speed |
7.7 Real-World Implementations
| System | Temporary Failure | Permanent Failure |
|---|---|---|
| Cassandra | Hinted handoff (3h default) | nodetool removenode + streaming |
| DynamoDB | Sloppy quorum + hints | Managed by AWS automatically |
| Riak | Hinted handoff | Automatic with Merkle trees |
| etcd | Raft leader election | Membership reconfiguration |
Storage Engine
Interview context: The interviewer might ask: "How does each node actually store data on disk? How do you optimize for both reads and writes?"
The Storage Challenge
Key-value stores need to handle:
- Fast writes: Users expect low latency
- Efficient reads: Quick lookups by key
- Durability: Data must survive crashes
Traditional B-trees are optimized for reads but writes are slow (random I/O). For write-heavy workloads, we use a different approach.
LSM Tree (Log-Structured Merge Tree)
Most distributed KV stores use LSM trees because they convert random writes into sequential writes, which are much faster.
```
LSM Tree Architecture

Write path:

  Write ──▶ WAL (Write-Ahead Log) ──▶ MemTable (in-memory, sorted)
             (durability)                  │
                                           │ when full
                                           ▼
                                     Flush to SSTable (Level 0)

Storage levels:

  MemTable: [ k1:v1, k3:v3, k5:v5 ]   (sorted)
      │ flush
      ▼
  Level 0:  [SSTable 1] [SSTable 2] [SSTable 3]
      │ compact
      ▼
  Level 1:  [ SSTable (sorted, no overlap) ]
      │ compact
      ▼
  Level 2:  [ SSTable (10x larger, no overlap) ]
      │
      ▼
  Level N:  (exponentially larger)
```
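The MemTable-to-SSTable flow can be sketched as a toy in-memory model. This is illustrative only: a real engine writes immutable sorted runs to disk behind a WAL and compacts them, whereas this sketch keeps everything in memory:

```python
# Toy LSM write/read path (sketch): buffer writes in a MemTable, flush it as
# an immutable sorted run when full, and read newest-first.
class TinyLSM:
    def __init__(self, memtable_limit=3):
        self.memtable = {}
        self.sstables = []                  # newest first; each is a sorted run
        self.limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.limit:
            run = dict(sorted(self.memtable.items()))   # flush as a sorted run
            self.sstables.insert(0, run)
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:            # MemTable always wins
            return self.memtable[key]
        for run in self.sstables:           # then the newest SSTable wins
            if key in run:
                return run[key]
        return None

db = TinyLSM()
for k, v in [("k1", "v1"), ("k3", "v3"), ("k5", "v5"), ("k1", "v1-new")]:
    db.put(k, v)
assert db.get("k1") == "v1-new"             # MemTable shadows the flushed run
assert db.get("k5") == "v5"                 # served from the flushed SSTable
```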
Read Path with Bloom Filters
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Read Path β
β β
β GET(key) ββ¬βββΆ Check MemTable βββββββββββ¬βββΆ Found? Return β
β β β β β
β β β Not found β β
β β βΌ β β
β β Check Bloom Filter β β
β β for each SSTable β β
β β β β β
β β ββββββ΄βββββ β β
β β β β β β
β β βΌ βΌ β β
β β "Maybe" "No" β β
β β β β β β
β β β β Skip β β
β β βΌ β β β
β β Binary β β β
β β Search β β β
β β SSTable β β β
β β β β β β
β β βββββββββββ΄ββββββββββββββββ β
β β
β Bloom Filter: Space-efficient probabilistic data structure β
β - "Definitely not in set" or "Maybe in set" β
β - False positive rate ~1% with 10 bits per element β
β - Saves disk reads for keys that don't exist β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
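The Bloom filter check in the read path can be sketched as below. This is a toy version: it derives its k positions by salting a single SHA-256 hash (real filters use cheaper, independent hash functions) and uses a Python int as the bit array.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: k positions in an m-bit array per key.
    Answers are 'definitely not present' or 'maybe present'."""

    def __init__(self, m_bits=1024, k_hashes=3):
        self.m = m_bits
        self.k = k_hashes
        self.bits = 0  # a Python int doubles as an arbitrary-size bit array

    def _positions(self, key):
        # Salt one strong hash k ways; a compact stand-in for k
        # independent hash functions.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # Any unset bit proves the key was never added.
        return all(self.bits & (1 << pos) for pos in self._positions(key))
```

A reader consults `might_contain` before touching an SSTable: a `False` answer means the key is definitely absent, so the disk read is skipped entirely; a `True` answer still requires the binary search shown above.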
SSTable Structure
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β SSTable File Format β
β β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β Data Blocks β β
β β ββββββββββββββββββββββββββββββββββββββββββββ β β
β β βBlock 0 ββBlock 1 ββBlock 2 ββBlock N β β β
β β βk1:v1 ββk50:v50 ββk100:v100ββ... β β β
β β βk2:v2 ββk51:v51 ββk101:v101ββ β β β
β β β... ββ... ββ... ββ β β β
β β ββββββββββββββββββββββββββββββββββββββββββββ β β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β Index Block β β
β β Block 0: first_key=k1, offset=0 β β
β β Block 1: first_key=k50, offset=4096 β β
β β Block 2: first_key=k100, offset=8192 β β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β Bloom Filter β β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β Footer/Metadata β β
β β - Index offset β β
β β - Bloom filter offset β β
β β - Compression type β β
β β - Checksum β β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
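The file layout above can be illustrated with a minimal writer/reader pair. The JSON-per-block encoding, two-entry blocks, and 8-byte footer are assumptions made for readability; real formats use binary encodings, compression, per-block checksums, and an embedded Bloom filter, none of which are modeled here.

```python
import json

def write_sstable(path, items, block_size=2):
    """Write sorted data blocks, then an index of (first_key, offset)
    pairs, then a fixed-size footer holding the index offset."""
    items = sorted(items)
    index = []
    with open(path, "wb") as f:
        for i in range(0, len(items), block_size):
            block = items[i:i + block_size]
            index.append({"first_key": block[0][0], "offset": f.tell()})
            f.write((json.dumps(block) + "\n").encode())
        index_offset = f.tell()
        f.write((json.dumps(index) + "\n").encode())
        f.write(index_offset.to_bytes(8, "big"))  # footer

def read_sstable(path, key):
    """Footer -> index -> one data block: a single block read per lookup."""
    with open(path, "rb") as f:
        f.seek(-8, 2)  # footer lives in the last 8 bytes
        index_offset = int.from_bytes(f.read(8), "big")
        f.seek(index_offset)
        index = json.loads(f.readline())
        # The key, if present, lives in the last block whose
        # first_key is <= key.
        candidate = None
        for entry in index:
            if entry["first_key"] <= key:
                candidate = entry
        if candidate is None:
            return None
        f.seek(candidate["offset"])
        for k, v in json.loads(f.readline()):
            if k == key:
                return v
    return None
```

The point of the footer-then-index indirection is that a lookup never scans the file: it reads the footer, the index, and exactly one data block.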
System Architecture Diagram
Complete System View
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Distributed Key-Value Store β
β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β Client SDK β β
β β βββββββββββββββ βββββββββββββββ βββββββββββββββ βββββββββββββββββββββββ β β
β β β Routing β β Retry β β Load β β Connection β β β
β β β (Ring Copy) β β Logic β β Balancing β β Pooling β β β
β β βββββββββββββββ βββββββββββββββ βββββββββββββββ βββββββββββββββββββββββ β β
β βββββββββββββββββββββββββββββββββββββ¬ββββββββββββββββββββββββββββββββββββββββββ β
β β β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β β
β βββββββββββββββββββββββββββββββββββββΌββββββββββββββββββββββββββββββββββββββββββ β
β β Coordinator Layer β β
β β β β
β β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β β β Request Flow: β β β
β β β 1. Parse request β β β
β β β 2. Hash key β find nodes β β β
β β β 3. Forward to replicas β β β
β β β 4. Wait for quorum (W or R) β β β
β β β 5. Return response (or conflict) β β β
β β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β β
β β β β
β βββββββββββββββββββββββββββββββββββββ¬ββββββββββββββββββββββββββββββββββββββββββ β
β β β
β βββββββββββββββββββββββββββββββββββββΌββββββββββββββββββββββββββββββββββββββββββ β
β β Storage Nodes β β
β β β β
β β βββββββββββββββββββ βββββββββββββββββββ βββββββββββββββββββ β β
β β β Node A β β Node B β β Node C β β β
β β β βββββββββββββ β β βββββββββββββ β β βββββββββββββ β β β
β β β β MemTable β β β β MemTable β β β β MemTable β β β β
β β β βββββββ¬ββββββ β β βββββββ¬ββββββ β β βββββββ¬ββββββ β β β
β β β βββββββΌββββββ β β βββββββΌββββββ β β βββββββΌββββββ β β β
β β β β SSTable β β β β SSTable β β β β SSTable β β β β
β β β β L0-L6 β β β β L0-L6 β β β β L0-L6 β β β β
β β β βββββββββββββ β β βββββββββββββ β β βββββββββββββ β β β
β β β βββββββββββββ β β βββββββββββββ β β βββββββββββββ β β β
β β β β WAL β β β β WAL β β β β WAL β β β β
β β β βββββββββββββ β β βββββββββββββ β β βββββββββββββ β β β
β β βββββββββββββββββββ βββββββββββββββββββ βββββββββββββββββββ β β
β β β β β β β
β β βββββββββββββββββββββββΌββββββββββββββββββββββ β β
β β β β β
β β ββββββββββββββΌβββββββββββββ β β
β β β Replication & β β β
β β β Anti-Entropy β β β
β β βββββββββββββββββββββββββββ β β
β β β β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β Control Plane β β
β β ββββββββββββββββ ββββββββββββββββ ββββββββββββββββ ββββββββββββββββββββ β β
β β β Gossip β β Failure β β Ring β β Monitoring β β β
β β β Protocol β β Detector β β Manager β β & Metrics β β β
β β ββββββββββββββββ ββββββββββββββββ ββββββββββββββββ ββββββββββββββββββββ β β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Write Flow Sequence
sequenceDiagram
participant Client
participant Coord as Coordinator
participant NodeA as Node A
participant NodeB as Node B
participant NodeC as Node C
Client->>Coord: PUT(k,v)
Note over Coord: hash(k) β [A,B,C]
par Write to all replicas
Coord->>NodeA: write(k,v)
Coord->>NodeB: write(k,v)
Coord->>NodeC: write(k,v)
end
NodeA-->>Coord: ACK
NodeB-->>Coord: ACK (W=2 reached)
Coord-->>Client: SUCCESS
NodeC-->>Coord: ACK (async, W already met)
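The coordinator's side of the sequence above can be sketched as follows. `replicas` here is a list of plain callables standing in for RPC stubs to Nodes A/B/C (an assumption for illustration); the coordinator fans the write out in parallel and declares success as soon as W acknowledgements arrive.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def quorum_write(replicas, key, value, w=2):
    """Send the write to all replicas in parallel; succeed once W
    replicas have acknowledged. A single failed replica does not
    fail the write as long as the quorum is still reachable."""
    acks = 0
    with ThreadPoolExecutor(max_workers=len(replicas)) as pool:
        futures = [pool.submit(replica, key, value) for replica in replicas]
        for fut in as_completed(futures):
            try:
                fut.result()
                acks += 1
            except Exception:
                continue  # tolerate individual replica failures
            if acks >= w:
                # Quorum reached: a real coordinator would return to the
                # client here and let the remaining ACKs arrive async.
                # (This toy version still waits for stragglers when the
                # pool shuts down on exit.)
                return True
    return False  # quorum not reached: surface an error to the client
```

With N=3 and W=2, Node C in the diagram can be slow or down and the client still sees SUCCESS; hinted handoff later repairs the missed replica.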
Trade-offs and Design Decisions
Key Design Decisions
| Decision | Options | Trade-off |
|---|---|---|
| Partitioning | Hash vs Range | Hash: even distribution. Range: range queries |
| Replication | Sync vs Async | Sync: consistency. Async: availability |
| Consistency | Strong vs Eventual | Strong: correctness. Eventual: performance |
| Storage | LSM vs B-Tree | LSM: writes. B-Tree: reads |
| Conflict Resolution | LWW vs Vector Clock | LWW: simple. VC: accurate |
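The conflict-resolution row deserves a concrete sketch. LWW resolves any two versions by timestamp and can silently drop one of two concurrent writes; vector clocks instead detect concurrency and hand the conflict to a merge function or the client. The dict-based version and clock representations are illustrative.

```python
def lww_resolve(a, b):
    """Last-write-wins: keep the version with the later timestamp.
    Simple, but a concurrent write on the losing side is lost."""
    return a if a["ts"] >= b["ts"] else b

def vc_compare(va, vb):
    """Compare two vector clocks (dicts of node -> counter).
    Returns 'a<b', 'a>b', 'equal', or 'concurrent'."""
    nodes = set(va) | set(vb)
    a_le = all(va.get(n, 0) <= vb.get(n, 0) for n in nodes)
    b_le = all(vb.get(n, 0) <= va.get(n, 0) for n in nodes)
    if a_le and b_le:
        return "equal"
    if a_le:
        return "a<b"       # vb descends from va: keep b
    if b_le:
        return "a>b"       # va descends from vb: keep a
    return "concurrent"    # true conflict: merge or ask the client
```

This is exactly the accuracy-vs-simplicity trade-off in the table: LWW always produces an answer, while vector clocks produce the *right* answer but sometimes that answer is "you have a conflict".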
When to Choose What
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Decision Matrix β
β β
β Use Case β Recommended Configuration β
β ββββββββββββββββββββββββββββββΌβββββββββββββββββββββββββββββββ
β Financial transactions β Strong consistency (W=R=N) β
β Session storage β Eventual (W=1, R=1) β
β User profiles β Balanced (W=2, R=2, N=3) β
β Analytics/metrics β Fast writes (W=1, R=N) β
β Shopping cart β CRDTs for merge β
β Leaderboards β LWW for simplicity β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
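Every configuration in the matrix follows one rule: read and write quorums overlap exactly when W + R > N, and an overlapping read quorum is guaranteed to include at least one replica holding the latest acknowledged write. A tiny helper (names illustrative) makes the classification explicit:

```python
def classify_nwr(n, w, r):
    """Classify an NWR configuration. W + R > N means every read
    quorum intersects every write quorum, so reads see the latest
    acknowledged write; otherwise consistency is eventual."""
    if w + r > n:
        return "strong (quorums overlap)"
    return "eventual (quorums may miss each other)"
```

For example, the "User profiles" row (N=3, W=2, R=2) is strong because 2 + 2 > 3, while the "Session storage" row (W=1, R=1 on N=3) is eventual.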
Consistency vs Latency Trade-off
 Latency
    ▲
    │                        ● W=3, R=3
    │                          (Strong)
    │
    │            ● W=2, R=2
    │              (Balanced)
    │
    │  ● W=1, R=1
    │    (Eventual)
    │
    └───────────────────────────▶ Consistency
Real-World Examples
| System | Company | Key Features |
|---|---|---|
| DynamoDB | Amazon | Fully managed, auto-scaling, single-digit ms latency |
| Cassandra | Apache | Wide-column, tunable consistency, multi-DC |
| Redis | Redis Labs | In-memory, data structures, Cluster mode |
| Riak | Basho | Masterless, CRDTs, high availability |
| etcd | CNCF | Strong consistency (Raft), Kubernetes config |
| RocksDB | Meta (Facebook) | Embedded, LSM-based, used in many systems |
| TiKV | PingCAP | Distributed, Raft-based, ACID transactions |
Feature Comparison
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Feature Comparison β
β β
β β DynamoDB β Cassandra β Redis β etcd β β
β ββββββββββββΌβββββββββββΌββββββββββββΌββββββββΌβββββββ€ β
β Consistencyβ Eventual/β Tunable β Event.βStrongβ β
β β Strong β β β β β
β ββββββββββββΌβββββββββββΌββββββββββββΌββββββββΌβββββββ€ β
β Partitions β Auto β Consistentβ Hash β Raft β β
β β β Hash β Slots β β β
β ββββββββββββΌβββββββββββΌββββββββββββΌββββββββΌβββββββ€ β
β Replicationβ Multi-AZ β Multi-DC β Async β Raft β β
β ββββββββββββΌβββββββββββΌββββββββββββΌββββββββΌβββββββ€ β
β Storage β B-Tree β LSM βMemory β BoltDBβ β
β ββββββββββββΌβββββββββββΌββββββββββββΌββββββββΌβββββββ€ β
β Use Case β General β Time- β Cache βConfigβ β
β β β series β β β β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Interview Tips
How to Approach the Problem
The key is to show your thought process. Donβt just list solutions - identify problems first, then solve them.
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Interview Flow (45 minutes) β
β β
β 1. CLARIFY REQUIREMENTS (3-5 min) β
β "Before I dive in, let me make sure I understand..." β
β β‘ Scale: How many keys? QPS? Data size? β
β β‘ Consistency: Strong or eventual? β
β β‘ Latency: p99 requirements? β
β β‘ Availability: 99.9%? 99.99%? β
β β‘ Features: TTL? Range queries? Transactions? β
β β
β 2. HIGH-LEVEL DESIGN (5-7 min) β
β "Let me start with the basic architecture..." β
β β‘ API design (get, put, delete) β
β β‘ Single server first, then identify bottlenecks β
β β‘ Add partitioning for scale β
β β‘ Add replication for durability β
β β‘ Draw architecture diagram β
β β
β 3. DEEP DIVE - Show Problem β Solution Flow (20-25 min) β
β β
β "Now that we have multiple nodes, how do we distribute β
β data? Simple modulo hash won't work because..." β
β β Consistent hashing β
β β
β "What if a node crashes? We'd lose data..." β
β β Replication β
β β
β "With multiple replicas, how do we keep them in sync?" β
β β Quorum (NWR), consistency models β
β β
β "What happens when a node temporarily fails?" β
β β Hinted handoff β
β β
β "What if it never comes back?" β
β β Re-replication with Merkle trees β
β β
β 4. TRADE-OFFS (3-5 min) β
β "There are trade-offs to discuss..." β
β β‘ CAP theorem: We chose AP, here's why... β
β β‘ Consistency vs latency β
β β‘ Memory vs disk β
β β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Key Phrases That Show Structured Thinking
| Instead of⦠| Say⦠|
|---|---|
| βWe use consistent hashingβ | βSimple modulo hashing has a problem: if we add a node, almost all keys move. To solve this, we use consistent hashing which only moves 1/N keys.β |
| βWe replicate dataβ | βPartitioning alone is risky - if a node fails, we lose that data. So we replicate each key to N nodes for durability.β |
| βWe use hinted handoffβ | βWhen a node is temporarily down, we have a choice: fail the write or buffer it somewhere. Hinted handoff lets us buffer on another node and deliver later.β |
Common Follow-up Questions
| Question | Key Points |
|---|---|
| βHow do you handle hotkeys?β | Caching, add salt to key, replicate reads |
| βHow do you support range queries?β | Use range partitioning instead of hash |
| βHow do you handle large values?β | Chunk values, store in blob storage |
| βHow do you support transactions?β | 2PC, Paxos/Raft, or donβt support |
| βHow do you handle datacenter failure?β | Multi-DC replication, async with conflict resolution |
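The "add salt to the key" answer for hot keys can be sketched as follows. Writes pick a random shard suffix so no single partition absorbs all the traffic; reads must fan out to every shard and merge, which is why salting fits write-heavy hot keys (counters, append streams) better than read-heavy ones. The naming scheme and shard count are illustrative.

```python
import random

def salted_key(key, shards=8):
    """Route a write for a hot key to one of `shards` sub-keys,
    spreading write load across partitions."""
    return f"{key}#{random.randrange(shards)}"

def all_shards(key, shards=8):
    """Enumerate every sub-key so a read can fan out and merge."""
    return [f"{key}#{i}" for i in range(shards)]
```

For read-heavy hot keys the better answers from the table are caching in front of the store or serving reads from all N replicas instead of one.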
Key Metrics to Know
Typical performance targets:
- Read latency: p50 < 1ms, p99 < 10ms
- Write latency: p50 < 5ms, p99 < 50ms
- Availability: 99.99% (52 min downtime/year)
- Replication factor: 3 (tolerates 1 failure)
- Virtual nodes: 100-256 per physical node
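The availability target converts directly into a downtime budget, which is worth being able to derive on the spot:

```python
def downtime_minutes_per_year(availability):
    """Yearly downtime budget implied by an availability target,
    assuming a 365-day year."""
    return (1 - availability) * 365 * 24 * 60

# 99.99% availability allows roughly 52.6 minutes of downtime per year;
# 99.9% allows roughly 525.6 minutes (about 8.8 hours).
```

This is where the "four nines means about 52 minutes a year" figure comes from, and why each additional nine cuts the budget by a factor of ten.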
Summary
A distributed key-value store combines several core concepts:
- Partitioning: Consistent hashing distributes data across nodes
- Replication: Multiple copies ensure durability and availability
- Consistency: Quorum (NWR) provides tunable consistency levels
- Failure Handling: Gossip, hinted handoff, and Merkle trees
- Storage: LSM trees optimize for write-heavy workloads
The key trade-off is between consistency and availability (CAP theorem). Most modern KV stores choose availability and provide tunable consistency through quorum configuration.