Design a Chat System

A chat system enables real-time messaging between users. Examples include WhatsApp, Facebook Messenger, Slack, and Discord.


Table of Contents

  1. Requirements
  2. Back of the Envelope Estimation
  3. System APIs
  4. High-Level Design
  5. Communication Protocols
  6. Database Design
  7. Deep Dive
  8. Message Delivery
  9. Key Takeaways
  10. Interview Tips

Requirements

Functional Requirements

  • 1:1 chat: Send and receive messages between two users
  • Group chat: Support group conversations (up to 500 members)
  • Online presence: Show online/offline status
  • Message history: Persist and retrieve chat history
  • Media sharing: Support images, videos, and files

Non-Functional Requirements

| Requirement | Description |
|---|---|
| Low latency | Messages delivered in < 100 ms |
| High availability | 99.99% uptime |
| Scalability | Support 500 million DAU |
| Reliability | No message loss, ordered delivery |

Extended Requirements

  • Push notifications for offline users
  • End-to-end encryption
  • Read receipts and typing indicators
  • Message reactions and replies

Back of the Envelope Estimation

Traffic Estimates

Daily Active Users (DAU): 500 million
Avg messages per user per day: 40
Total messages: 500M × 40 = 20 billion/day
                = ~230,000 messages/second

Peak traffic: 3× average = 700,000 messages/second

Storage Estimates

Avg message size: 100 bytes (text)
Messages per day: 20 billion
Message storage per day: 20B × 100 bytes = 2 TB

Per year: 2 TB × 365 = 730 TB (text only)
Media storage: ~10× text = 7.3 PB/year

Connection Estimates

Concurrent connections: 10% of DAU = 50 million
WebSocket connections per server: 50,000
Servers needed: 50M / 50K = 1,000 servers

Summary

| Metric | Value |
|---|---|
| Messages/second | ~230,000 (avg), ~700,000 (peak) |
| Storage (text) | 730 TB/year |
| Concurrent connections | 50 million |
| WebSocket servers | 1,000+ |
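
The arithmetic behind these figures can be checked with a few lines of Python. This is only a sketch that replays the assumptions stated above (500 million DAU, 40 messages per user, 100-byte messages, 3× peak factor, 10% concurrency, 50,000 connections per server); the function and constant names are illustrative.

```python
# Back-of-the-envelope figures derived from the assumptions listed above.
SECONDS_PER_DAY = 24 * 60 * 60

DAU = 500_000_000
MESSAGES_PER_USER_PER_DAY = 40
AVG_MESSAGE_BYTES = 100
PEAK_FACTOR = 3
CONCURRENT_PERCENT = 10
CONNECTIONS_PER_SERVER = 50_000


def estimate() -> dict:
    messages_per_day = DAU * MESSAGES_PER_USER_PER_DAY
    avg_mps = messages_per_day / SECONDS_PER_DAY
    text_bytes_per_day = messages_per_day * AVG_MESSAGE_BYTES
    concurrent = DAU * CONCURRENT_PERCENT // 100
    return {
        "messages_per_day": messages_per_day,
        "avg_messages_per_second": avg_mps,
        "peak_messages_per_second": avg_mps * PEAK_FACTOR,
        "text_tb_per_day": text_bytes_per_day / 1e12,
        "text_tb_per_year": text_bytes_per_day * 365 / 1e12,
        "concurrent_connections": concurrent,
        # Ceiling division: a partially filled server still counts as a server.
        "websocket_servers": -(-concurrent // CONNECTIONS_PER_SERVER),
    }


if __name__ == "__main__":
    for name, value in estimate().items():
        print(f"{name}: {value:,.0f}")
```

The exact averages come out slightly different from the rounded numbers in the text (roughly 231,000 messages/second on average and 694,000 at peak), which is well within the precision an estimate like this needs.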

System APIs

Send Message

POST /v1/messages

Request:

{
  "sender_id": "user123",
  "receiver_id": "user456",
  "message_type": "text",
  "content": "Hello!",
  "client_msg_id": "uuid-123"
}

Response:

{
  "message_id": "msg789",
  "timestamp": "2024-01-15T10:30:00Z",
  "status": "sent"
}

Send Group Message

POST /v1/groups/{group_id}/messages

Request:

{
  "sender_id": "user123",
  "message_type": "text",
  "content": "Hello everyone!",
  "client_msg_id": "uuid-456"
}

Get Messages

GET /v1/messages?user_id={user_id}&chat_id={chat_id}&before={timestamp}&limit={limit}

Response:

{
  "messages": [
    {
      "message_id": "msg789",
      "sender_id": "user456",
      "content": "Hi there!",
      "timestamp": "2024-01-15T10:29:00Z",
      "status": "read"
    }
  ],
  "has_more": true
}
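
The `before`/`limit`/`has_more` contract of this endpoint is a form of cursor pagination. A minimal in-memory sketch of the server-side logic follows; the `Message` class, the `get_messages` function, and the use of integer millisecond timestamps instead of ISO 8601 strings are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    message_id: str
    sender_id: str
    content: str
    timestamp: int  # assumed: Unix epoch milliseconds


def get_messages(history, before, limit):
    """Return up to `limit` messages strictly older than `before`, newest first."""
    older = sorted(
        (m for m in history if m.timestamp < before),
        key=lambda m: m.timestamp,
        reverse=True,
    )
    return {"messages": older[:limit], "has_more": len(older) > limit}
```

To load the next page, the client passes the timestamp of the last message it received as the new `before` value. If two messages can share a timestamp, a cursor based on a unique, time-ordered message ID would avoid skipping one of them.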

WebSocket Events

// Client → Server
{ "type": "message", "data": {...} }
{ "type": "typing", "chat_id": "chat123" }
{ "type": "ack", "message_id": "msg789" }

// Server → Client
{ "type": "message", "data": {...} }
{ "type": "typing", "user_id": "user456", "chat_id": "chat123" }
{ "type": "presence", "user_id": "user456", "status": "online" }

High-Level Design

flowchart TB
    LB["Load Balancer"]

    LB --> WS["WebSocket Server"]
    LB --> API["API Server (REST)"]
    LB --> PS["Presence Service"]

    WS --> MQ["Message Queue (Kafka)"]
    API --> MQ
    PS --> MQ

    MQ --> CS["Chat Service"]
    MQ --> MDB["Message DB (Cassandra)"]
    MQ --> Cache["User/Session Cache (Redis)"]

Components

| Component | Responsibility |
|---|---|
| WebSocket Server | Maintain persistent connections, real-time messaging |
| API Server | REST APIs for message history, user management |
| Chat Service | Message routing, delivery logic |
| Presence Service | Track user online/offline status |
| Message Queue | Decouple message producers and consumers |
| Push Notification | Notify offline users |

Communication Protocols

Interview context: After the high-level design, the interviewer will ask: “How do clients receive messages in real time? What protocol would you use?”

The Challenge

HTTP is request-response: the client must ask for data. But in a chat system, the server needs to push messages to clients instantly. How do we achieve this?

Protocol Options

| Protocol | How It Works | Pros | Cons |
|---|---|---|---|
| HTTP Polling | Client polls server every N seconds | Simple | High latency, wasteful |
| Long Polling | Server holds request until data available | Near real-time | Connection overhead |
| WebSocket | Persistent bidirectional connection | True real-time | Stateful, complex |
| Server-Sent Events | Server pushes over HTTP | Simple server push | Unidirectional only |

Why WebSocket?

Interviewer might ask: “Why not just use long polling? It’s simpler.”

Long polling limitations:

  • After every response, the client must issue a new HTTP request
  • Header overhead on every request (~800 bytes)
  • Server must track pending requests
  • Harder to implement server → client push efficiently

WebSocket advantages:

  • Single persistent connection (established once)
  • True bidirectional: server can push anytime
  • Low overhead after handshake (~2-6 bytes per frame)
  • Native browser support

When to use long polling: As a fallback when WebSocket is blocked (corporate firewalls, old proxies).
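
The "~2-6 bytes per frame" figure follows from the WebSocket framing rules in RFC 6455: a 2-byte base header, an extended length field for larger payloads, and a 4-byte masking key on client-to-server frames. The helper below is a small sketch of that calculation; the function name is made up for this example.

```python
def ws_frame_overhead(payload_len: int, masked: bool) -> int:
    """Header bytes added to one WebSocket frame (RFC 6455 framing)."""
    if payload_len < 0:
        raise ValueError("payload_len must be non-negative")
    overhead = 2  # FIN/opcode byte + mask bit/7-bit length byte
    if payload_len > 65535:
        overhead += 8  # 64-bit extended payload length
    elif payload_len > 125:
        overhead += 2  # 16-bit extended payload length
    if masked:
        overhead += 4  # masking key, required on client-to-server frames
    return overhead
```

For the 100-byte text message assumed earlier, this gives 2 bytes server-to-client and 6 bytes client-to-server, compared with roughly 800 bytes of HTTP headers per long-polling request.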

WebSocket Architecture

flowchart LR
    UserA["User A (Client)"] <-->|WS| WSCluster["WebSocket Server Cluster"]
    WSCluster <-->|WS| UserB["User B (Client)"]
    WSCluster --> SD["Service Discovery<br/>(user→server)"]

Connection Flow

  1. Client connects to Load Balancer
  2. LB routes to WebSocket server
  3. Server authenticates user (JWT/session)
  4. Register connection in Service Discovery
  5. Subscribe to user’s message channels

sequenceDiagram
    participant Client
    participant WS as WS Server

    Client->>WS: WebSocket Upgrade
    WS-->>Client: 101 Switching Protocols
    Client->>WS: Auth: JWT Token
    WS-->>Client: Auth OK, Connected
    Client->>WS: Messages
    WS->>Client: Messages

Database Design

Interview context: “Now let’s discuss data storage. What database would you use for messages?”

The Challenge

Chat messages have specific access patterns:

  • Write-heavy: 230K messages/second
  • Time-ordered: Messages retrieved in chronological order
  • Per-conversation: Always query by chat_id, then time
  • High volume: 730TB/year of text alone

Why Cassandra for Messages?

Interviewer might ask: “Why not MySQL or PostgreSQL?”

| Requirement | Cassandra | Traditional SQL |
|---|---|---|
| Write throughput | Excellent (distributed) | Limited (single master) |
| Horizontal scaling | Built-in | Complex sharding |
| Time-series queries | Optimized | Requires careful indexing |
| Availability | AP (highly available) | CP (consistency focus) |

Key insight: Messages are append-only and queried by (chat_id, time). This is exactly what Cassandra excels at.

Message Table (Cassandra)

CREATE TABLE messages (
    chat_id     UUID,
    message_id  TIMEUUID,
    sender_id   BIGINT,
    content     TEXT,
    msg_type    TEXT,      -- text, image, video, file
    media_url   TEXT,
    created_at  TIMESTAMP,
    PRIMARY KEY (chat_id, message_id)
) WITH CLUSTERING ORDER BY (message_id DESC);

Chat Table (Cassandra)

CREATE TABLE chats (
    chat_id       UUID PRIMARY KEY,
    chat_type     TEXT,      -- direct, group
    participants  SET<BIGINT>,
    created_at    TIMESTAMP,
    last_message  TEXT,
    last_msg_time TIMESTAMP
);

-- User's chat list
CREATE TABLE user_chats (
    user_id       BIGINT,
    chat_id       UUID,
    unread_count  INT,
    last_read_at  TIMESTAMP,
    PRIMARY KEY (user_id, chat_id)
);

Group Table (PostgreSQL)

CREATE TABLE groups (
    group_id    BIGINT PRIMARY KEY,
    name        VARCHAR(100),
    avatar_url  TEXT,
    owner_id    BIGINT,
    created_at  TIMESTAMP,
    member_count INT DEFAULT 0
);

CREATE TABLE group_members (
    group_id    BIGINT,
    user_id     BIGINT,
    role        VARCHAR(20), -- admin, member
    joined_at   TIMESTAMP,
    PRIMARY KEY (group_id, user_id)
);

CREATE INDEX idx_user_groups ON group_members(user_id);

Session Table (Redis)

# User's WebSocket server location
Key: session:{user_id}
Value: {
    "server_id": "ws-server-42",
    "connected_at": 1705312200,
    "device": "mobile"
}
TTL: Refreshed on heartbeat

# Online status
Key: presence:{user_id}
Value: "online" | "away" | "offline"
TTL: 60 seconds (refreshed by the 30-second heartbeat)

Deep Dive

Interview context: “Let’s dive deeper into some interesting challenges…”

1. Message Delivery Flow (1:1 Chat)

Interviewer might ask: “Walk me through what happens when User A sends a message to User B.”

sequenceDiagram
    participant UserA as User A
    participant WS1 as WS Server 1
    participant CS as Chat Service
    participant Cass as Cassandra
    participant WS2 as WS Server 2
    participant UserB as User B

    UserA->>WS1: Send Message
    WS1->>CS: Forward
    par Save & Respond
        CS->>Cass: Save to Cassandra
        CS-->>UserA: ACK (sent)
    and Find & Deliver
        CS->>WS2: Find B's WS Server
        WS2->>UserB: Message
        UserB-->>WS2: ACK (delivered)
    end
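
The same flow can be sketched in code. The version below runs the steps sequentially rather than in parallel, and every store is an in-memory stand-in (a list for Cassandra, a dict for the Redis session lookup); the `ChatService` class and its attribute names are assumptions for illustration only.

```python
import itertools


class ChatService:
    """In-memory sketch of the 1:1 flow; every store here is a stand-in."""

    def __init__(self):
        self.message_store = []       # stand-in for the Cassandra messages table
        self.sessions = {}            # stand-in for Redis: user_id -> WS server id
        self.server_outboxes = {}     # WS server id -> messages waiting to be pushed
        self.push_notifications = []  # stand-in for the push notification service
        self._ids = itertools.count(1)

    def send(self, sender_id, receiver_id, content):
        message = {
            "message_id": f"msg{next(self._ids)}",
            "sender_id": sender_id,
            "receiver_id": receiver_id,
            "content": content,
        }
        # Persist first, so the "sent" ACK is only returned for a stored message.
        self.message_store.append(message)

        server_id = self.sessions.get(receiver_id)
        if server_id is None:
            # Recipient has no active session: fall back to a push notification.
            self.push_notifications.append((receiver_id, message["message_id"]))
        else:
            # Hand the message to the WS server holding the recipient's connection.
            self.server_outboxes.setdefault(server_id, []).append(message)

        return {"message_id": message["message_id"], "status": "sent"}
```

Storing before acknowledging is the design choice that backs the "no message loss" requirement: if delivery fails afterwards, the message can still be fetched from history.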

2. Message Delivery Flow (Group Chat)

Interviewer might ask: “Group chat with 500 members is different from 1:1. How do you handle it?”

The challenge: One message → 500 deliveries. This is a fan-out problem.

flowchart TB
    UserA["User A (sends)"] --> WS["WS Server"]
    WS --> CS["Chat Service"]
    CS --> Kafka["Message Queue (Kafka)<br/>Topic: group-messages (key: group_id)"]

    Kafka --> UserB["User B (WS 2)"]
    Kafka --> UserC["User C (WS 3)"]
    Kafka --> UserD["User D (offline)"]

    UserD --> Push["Push Notif"]

Why use Kafka for fan-out?

  • Decouples sender from delivery
  • Handles backpressure (slow consumers don’t block sender)
  • Reliable delivery with retries
  • Partitioning by chat_id keeps every message of a conversation on one partition, which preserves per-chat order
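
Two pieces of the fan-out are easy to sketch: choosing a partition from a message key, and splitting group members into online and offline recipients. The code below is an illustration only. `partition_for` uses CRC32 as a stand-in hash (real Kafka clients apply their own hash function), and `fan_out`, `sessions`, and the user IDs are hypothetical names.

```python
import zlib
from collections import defaultdict


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition; equal keys always land on the same partition."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def fan_out(message, members, sessions):
    """Group online recipients by WS server and list offline recipients."""
    by_server = defaultdict(list)
    offline = []
    for user_id in members:
        if user_id == message["sender_id"]:
            continue  # the sender already has the message
        server_id = sessions.get(user_id)
        if server_id is None:
            offline.append(user_id)  # candidates for a push notification
        else:
            by_server[server_id].append(user_id)
    return dict(by_server), offline
```

Grouping recipients by server means each WebSocket server receives one request per message rather than one per member, which matters when many members of a group share a server.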

3. Online Presence System

Interviewer might ask: “How do you show who’s online? This seems simple but gets complex at scale.”

The challenge:

  • 50 million concurrent users
  • Each user has hundreds of friends
  • Status changes must propagate quickly
  • But we can’t broadcast every change to everyone

User connects:

flowchart LR
    Client --> WS["WS Server"]
    WS --> PS["Presence Service"]
    PS --> Redis["Redis PUBLISH<br/>presence: user123=on"]
    Redis --> Subscribers["Subscribed clients receive update<br/>User A's friends → 'User A is now online'"]

Heartbeat mechanism:

flowchart LR
    Client -->|"Heartbeat (every 30s)"| Server
    Server --> Redis["Redis EXPIRE 60s"]
    Redis -.->|"If heartbeat stops → TTL expires"| Offline["User marked offline"]

Why heartbeat + TTL?

  • Handles ungraceful disconnects (network drop, app crash)
  • No need to detect disconnection explicitly
  • Redis handles cleanup automatically via TTL
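
The heartbeat-plus-TTL idea can be shown without Redis. The sketch below keeps an expiry time per user in a dict and takes an injectable clock so the behaviour is easy to test; `PresenceStore` and its method names are assumptions, and the 60-second TTL mirrors the diagram above.

```python
import time


class PresenceStore:
    """In-memory stand-in for Redis keys with a TTL."""

    def __init__(self, ttl_seconds=60, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._expires_at = {}  # user_id -> time at which the user counts as offline

    def heartbeat(self, user_id):
        # Each heartbeat pushes the expiry forward by one full TTL.
        self._expires_at[user_id] = self._clock() + self._ttl

    def is_online(self, user_id):
        expires = self._expires_at.get(user_id)
        if expires is None:
            return False
        if self._clock() >= expires:
            del self._expires_at[user_id]  # lazy cleanup of the expired entry
            return False
        return True
```

With a 30-second heartbeat and a 60-second TTL, a single lost heartbeat does not flip the user to offline; two consecutive misses do.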

4. Message Sync for Multiple Devices

Interviewer might ask: “Users have phones, tablets, and laptops. How do you sync messages across devices?”

flowchart TB
    subgraph Devices["User has multiple devices"]
        Phone
        Tablet
        Laptop
    end

    Phone --> WS["WS Servers (different)"]
    Tablet --> WS
    Laptop --> WS

    WS --> Sessions["User has 3 sessions in Redis"]
    Sessions --> Delivery["Message delivery to ALL sessions"]

Sync strategy:

  1. Each device maintains last_sync_timestamp
  2. On reconnect: fetch messages since last_sync_timestamp
  3. Conflict resolution: server timestamp wins
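
A per-device cursor is enough to express this strategy. In the sketch below, each device is a dict holding its own `last_sync_timestamp`, and messages carry a server-assigned `timestamp`; the function names and the integer timestamps are assumptions for illustration.

```python
def messages_since(history, last_sync_timestamp):
    """Messages with a server timestamp newer than the cursor, oldest first."""
    missed = [m for m in history if m["timestamp"] > last_sync_timestamp]
    return sorted(missed, key=lambda m: m["timestamp"])


def sync_device(device, history):
    """Return what the device missed and advance its cursor."""
    missed = messages_since(history, device["last_sync_timestamp"])
    if missed:
        # Server timestamps are authoritative, so the cursor follows them.
        device["last_sync_timestamp"] = missed[-1]["timestamp"]
    return missed
```

Because each device advances only its own cursor, a phone that was offline for a day and a laptop that was online the whole time can sync independently against the same history.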

5. Message Status Tracking

Message states:

flowchart LR
    Sent --> Stored --> Delivered --> Read

Storage:

| message_id | user_id | status | timestamp |
|---|---|---|---|
| msg123 | user456 | delivered | 10:30:01 |
| msg123 | user456 | read | 10:30:05 |

Read receipt flow:

flowchart LR
    A["User B reads message"] --> B["Client sends ACK"]
    B --> C["Update status"]
    C --> D["Notify User A via WebSocket"]
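
ACKs can arrive late, out of order, or more than once. One way to keep the stored status consistent, sketched below, is to treat the four states as ordered and only ever move forward. This forward-only rule is an assumption layered on the state diagram above, and `MessageStatus` and `advance` are illustrative names.

```python
from enum import IntEnum


class MessageStatus(IntEnum):
    SENT = 1
    STORED = 2
    DELIVERED = 3
    READ = 4


def advance(current: MessageStatus, incoming: MessageStatus) -> MessageStatus:
    """Apply a status update, ignoring any update that would move backwards."""
    return max(current, incoming)
```

With this rule, a "delivered" ACK that arrives after the "read" ACK leaves the message marked as read.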

Message Delivery

Interview context: “Let’s discuss delivery guarantees. This is where things get interesting.”

The Challenge: Delivery Guarantees

Interviewer might ask: “How do you ensure messages are delivered reliably?”

| Guarantee | Challenge | Solution |
|---|---|---|
| At-least-once | Network failures | Retry with exponential backoff |
| Ordering | Concurrent messages | Time-ordered IDs per chat (TimeUUID) |
| No duplicates | Retries cause duplicates | client_msg_id deduplication |

Why Not Exactly-Once?

Interviewer might ask: “Can you guarantee exactly-once delivery?”

The hard truth: Exactly-once delivery cannot be guaranteed over an unreliable network. A sender that receives no ACK cannot tell whether the message or the ACK was lost, so it has to retry. Instead, combine:

  1. At-least-once delivery (guaranteed)
  2. Idempotent processing (client_msg_id dedup)
  3. Result: Effectively exactly-once from user’s perspective
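
The deduplication step can be sketched as an idempotent handler keyed on the sender and the `client_msg_id` from the Send Message API. The in-memory dict stands in for whatever store would hold recently seen IDs; `MessageIngest` and its fields are hypothetical names.

```python
import itertools


class MessageIngest:
    """Idempotent message handler: a retried request stores nothing new."""

    def __init__(self):
        self._seen = {}   # (sender_id, client_msg_id) -> server-assigned message_id
        self.stored = []  # stand-in for the message table
        self._ids = itertools.count(1)

    def handle(self, sender_id, client_msg_id, content):
        key = (sender_id, client_msg_id)
        if key in self._seen:
            # Retry of an already processed message: return the original ID.
            return self._seen[key]
        message_id = f"msg{next(self._ids)}"
        self.stored.append({"message_id": message_id, "content": content})
        self._seen[key] = message_id
        return message_id
```

Scoping the key to the sender means two users who happen to generate the same `client_msg_id` do not suppress each other's messages.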

Retry Strategy

Retry with exponential backoff + jitter:
  Attempt 1: immediate
  Attempt 2: 1s + random(0-1s)
  Attempt 3: 2s + random(0-1s)
  Attempt 4: 4s + random(0-1s)
  After max retries: Dead letter queue + push notification
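
The schedule above can be generated with a short helper. This is a sketch under the same parameters (first attempt immediate, base delay of 1 second doubling each time, up to 1 second of jitter); `retry_delays` and its arguments are illustrative names, and the random source is injectable so the result can be tested.

```python
import random


def retry_delays(max_attempts=4, base=1.0, max_jitter=1.0, rng=random.random):
    """Delay in seconds before each attempt: immediate, then base * 2^n plus jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    delays = [0.0]  # attempt 1 is sent immediately
    for attempt in range(2, max_attempts + 1):
        backoff = base * 2 ** (attempt - 2)  # 1s, 2s, 4s, ...
        delays.append(backoff + rng() * max_jitter)
    return delays
```

The jitter spreads out retries from many clients that failed at the same moment, so they do not all hit the server again at the same time.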

Offline Message Handling

Offline User Flow:

  1. User B is offline
  2. Message arrives for B
  3. Check Redis: B has no active session
  4. Store message in Cassandra (normal flow)
  5. Send push notification via FCM/APNs
  6. When B comes online:
    • Connect to WebSocket
    • Fetch unread messages since last_sync_timestamp
    • Receive new real-time messages

flowchart TB
    UserA["User A"] -->|Send| WS["WS Server"]
    WS --> Check{"B offline?"}
    Check -->|YES| Push["Push Notif Service"]
    Check -->|NO| Deliver["Deliver to B's WS"]

Key Takeaways

Design Decisions Summary

| Decision | Choice | Why |
|---|---|---|
| Protocol | WebSocket | True real-time, bidirectional, low overhead |
| Message storage | Cassandra | High write throughput, time-series optimized |
| Session storage | Redis | Fast lookups, pub/sub for cross-server messaging |
| Message queue | Kafka | Reliable fan-out, ordering guarantees |
| Presence | Redis + TTL | Simple, automatic cleanup on disconnect |

Trade-offs to Discuss

| Decision | Option A | Option B |
|---|---|---|
| Protocol | WebSocket (stateful) | Long polling (stateless) |
| Delivery | At-least-once + dedup | Exactly-once (complex) |
| Group fanout | Sync (simple) | Async via Kafka (scalable) |
| Presence | Push updates | Pull on demand |

Scalability Phases

flowchart TB
    subgraph P1["Phase 1: Small scale (~10K concurrent)"]
        S1["Single WS server + Redis + PostgreSQL"]
    end
    subgraph P2["Phase 2: Medium scale (~1M concurrent)"]
        S2["WS cluster + Redis pub/sub + Cassandra"]
    end
    subgraph P3["Phase 3: Large scale (~50M concurrent)"]
        S3["Multiple DCs + Kafka + sharded everything"]
    end
    P1 --> P2 --> P3

Interview Tips

How to Approach (45 minutes)

1. CLARIFY (3-5 min)
   "1:1 only or groups? How large? Need presence? Read receipts?"

2. HIGH-LEVEL DESIGN (5-7 min)
   Draw: Clients → LB → WS Servers → Message Queue → Database

3. DEEP DIVE (25-30 min)
   - Protocol choice (WebSocket vs alternatives)
   - Message routing (how to find recipient's server)
   - Group chat fan-out strategy
   - Presence system design
   - Offline message handling

4. WRAP UP (5 min)
   - Discuss E2E encryption, push notifications
   - Mention monitoring, rate limiting

Key Phrases That Show Depth

| Instead of… | Say… |
|---|---|
| “We use WebSocket” | “HTTP is request-response, so we need WebSocket for server-initiated push. It’s bidirectional and has low per-message overhead after the initial handshake.” |
| “Store in Cassandra” | “Messages are append-only and queried by (chat_id, time), which fits Cassandra’s time-series model with chat_id as the partition key.” |
| “Use Redis for presence” | “Redis with TTL handles presence elegantly: the heartbeat refreshes the TTL, and if the heartbeat stops, the TTL expires and the user is automatically offline.” |

Common Follow-up Questions

| Question | Key Points |
|---|---|
| “How do you handle 50M connections?” | Horizontal scaling, ~50K connections/server, 1,000+ servers |
| “How do you ensure ordering?” | TimeUUID per chat, Kafka partition by chat_id |
| “How do you handle offline users?” | Store message, send push notification, sync on reconnect |
| “How do you scale group chat?” | Async fan-out via Kafka, not blocking sender |
| “How do you implement E2E encryption?” | Signal protocol; public keys exchanged via server-hosted prekey bundles, private keys stay on devices, so the server can’t read content |