Design a Chat System

A chat system enables real-time messaging between users. Examples include WhatsApp, Facebook Messenger, Slack, and Discord.

Requirements
Back of the Envelope Estimation
System APIs
High-Level Design
Communication Protocols
Database Design
Deep Dive
Message Delivery
Key Takeaways
Interview Tips

Requirements

Functional Requirements

1:1 chat: Send and receive messages between two users
Group chat: Support group conversations (up to 500 members)
Online presence: Show online/offline status
Message history: Persist and retrieve chat history
Media sharing: Support images, videos, and files

Non-Functional Requirements

Requirement	Description
Low latency	Messages delivered in < 100ms
High availability	99.99% uptime
Scalability	Support 500 million DAU
Reliability	No message loss, ordered delivery

Extended Requirements

Push notifications for offline users
End-to-end encryption
Read receipts and typing indicators
Message reactions and replies

Back of the Envelope Estimation

Traffic Estimates

Daily Active Users (DAU): 500 million
Avg messages per user per day: 40
Total messages: 500M × 40 = 20 billion/day
                = ~230,000 messages/second

Peak traffic: 3× average = 700,000 messages/second

Storage Estimates

Avg message size: 100 bytes (text)
Messages per day: 20 billion
Message storage per day: 20B × 100 bytes = 2 TB

Per year: 2 TB × 365 = 730 TB (text only)
Media storage: ~10× text = 7.3 PB/year

Connection Estimates

Concurrent connections: 10% of DAU = 50 million
WebSocket connections per server: 50,000
Servers needed: 50M / 50K = 1,000 servers

Summary

Metric	Value
Messages/second	230,000 (avg), 700,000 (peak)
Storage (text)	730 TB/year
Concurrent connections	50 million
WebSocket servers	1,000+

System APIs

Send Message

POST /v1/messages

Request:

{
  "sender_id": "user123",
  "receiver_id": "user456",
  "message_type": "text",
  "content": "Hello!",
  "client_msg_id": "uuid-123"
}

Response:

{
  "message_id": "msg789",
  "timestamp": "2024-01-15T10:30:00Z",
  "status": "sent"
}

Send Group Message

POST /v1/groups/{group_id}/messages

Request:

{
  "sender_id": "user123",
  "message_type": "text",
  "content": "Hello everyone!",
  "client_msg_id": "uuid-456"
}

Get Messages

GET /v1/messages?user_id={user_id}&chat_id={chat_id}&before={timestamp}&limit={limit}

Response:

{
  "messages": [
    {
      "message_id": "msg789",
      "sender_id": "user456",
      "content": "Hi there!",
      "timestamp": "2024-01-15T10:29:00Z",
      "status": "read"
    }
  ],
  "has_more": true
}

WebSocket Events

// Client → Server
{ "type": "message", "data": {...} }
{ "type": "typing", "chat_id": "chat123" }
{ "type": "ack", "message_id": "msg789" }

// Server → Client
{ "type": "message", "data": {...} }
{ "type": "typing", "user_id": "user456", "chat_id": "chat123" }
{ "type": "presence", "user_id": "user456", "status": "online" }

High-Level Design

flowchart TB
    LB["Load Balancer"]

    LB --> WS["WebSocket Server"]
    LB --> API["API Server (REST)"]
    LB --> PS["Presence Service"]

    WS --> MQ["Message Queue (Kafka)"]
    API --> MQ
    PS --> MQ

    MQ --> CS["Chat Service"]
    MQ --> MDB["Message DB (Cassandra)"]
    MQ --> Cache["User/Session Cache (Redis)"]

Components

Component	Responsibility
WebSocket Server	Maintain persistent connections, real-time messaging
API Server	REST APIs for message history, user management
Chat Service	Message routing, delivery logic
Presence Service	Track user online/offline status
Message Queue	Decouple message producers and consumers
Push Notification	Notify offline users

Communication Protocols

Interview context: After the high-level design, the interviewer will ask: “How do clients receive messages in real-time? What protocol would you use?”

The Challenge

HTTP is request-response: the client must ask for data. But in a chat system, the server needs to push messages to clients instantly. How do we achieve this?

Protocol Options

Protocol	How It Works	Pros	Cons
HTTP Polling	Client polls server every N seconds	Simple	High latency, wasteful
Long Polling	Server holds request until data available	Near real-time	Connection overhead
WebSocket	Persistent bidirectional connection	True real-time	Stateful, complex
Server-Sent Events	Server pushes over HTTP	Simple server push	Unidirectional only

Why WebSocket?

Interviewer might ask: “Why not just use long polling? It’s simpler.”

Long polling limitations:

Each message requires a new HTTP request
Headers overhead on every request (~800 bytes)
Server must track pending requests
Harder to implement server → client push efficiently

WebSocket advantages:

Single persistent connection (established once)
True bidirectional: server can push anytime
Low overhead after handshake (~2-6 bytes per frame)
Native browser support

When to use long polling: As a fallback when WebSocket is blocked (corporate firewalls, old proxies).

WebSocket Architecture

flowchart LR
    UserA["User A (Client)"] <-->|WS| WSCluster["WebSocket Server Cluster"]
    WSCluster <-->|WS| UserB["User B (Client)"]
    WSCluster --> SD["Service Discovery<br/>(user→server)"]

Connection Flow

Client connects to Load Balancer
LB routes to WebSocket server
Server authenticates user (JWT/session)
Register connection in Service Discovery
Subscribe to user’s message channels

sequenceDiagram
    participant Client
    participant WS as WS Server

    Client->>WS: WebSocket Upgrade
    WS-->>Client: 101 Switching Protocols
    Client->>WS: Auth: JWT Token
    WS-->>Client: Auth OK, Connected
    Client<-->WS: Messages

Database Design

Interview context: “Now let’s discuss data storage. What database would you use for messages?”

The Challenge

Chat messages have specific access patterns:

Write-heavy: 230K messages/second
Time-ordered: Messages retrieved in chronological order
Per-conversation: Always query by chat_id, then time
High volume: 730TB/year of text alone

Why Cassandra for Messages?

Interviewer might ask: “Why not MySQL or PostgreSQL?”

Requirement	Cassandra	Traditional SQL
Write throughput	Excellent (distributed)	Limited (single master)
Horizontal scaling	Built-in	Complex sharding
Time-series queries	Optimized	Requires careful indexing
Availability	AP (highly available)	CP (consistency focus)

Key insight: Messages are append-only and queried by (chat_id, time). This is exactly what Cassandra excels at.

Message Table (Cassandra)

CREATE TABLE messages (
    chat_id     UUID,
    message_id  TIMEUUID,
    sender_id   BIGINT,
    content     TEXT,
    msg_type    TEXT,      -- text, image, video, file
    media_url   TEXT,
    created_at  TIMESTAMP,
    PRIMARY KEY (chat_id, message_id)
) WITH CLUSTERING ORDER BY (message_id DESC);

Chat Table (Cassandra)

CREATE TABLE chats (
    chat_id       UUID PRIMARY KEY,
    chat_type     TEXT,      -- direct, group
    participants  SET<BIGINT>,
    created_at    TIMESTAMP,
    last_message  TEXT,
    last_msg_time TIMESTAMP
);

-- User's chat list
CREATE TABLE user_chats (
    user_id       BIGINT,
    chat_id       UUID,
    unread_count  INT,
    last_read_at  TIMESTAMP,
    PRIMARY KEY (user_id, chat_id)
);

Group Table (PostgreSQL)

CREATE TABLE groups (
    group_id    BIGINT PRIMARY KEY,
    name        VARCHAR(100),
    avatar_url  TEXT,
    owner_id    BIGINT,
    created_at  TIMESTAMP,
    member_count INT DEFAULT 0
);

CREATE TABLE group_members (
    group_id    BIGINT,
    user_id     BIGINT,
    role        VARCHAR(20), -- admin, member
    joined_at   TIMESTAMP,
    PRIMARY KEY (group_id, user_id)
);

CREATE INDEX idx_user_groups ON group_members(user_id);

Session Table (Redis)

# User's WebSocket server location
Key: session:{user_id}
Value: {
    "server_id": "ws-server-42",
    "connected_at": 1705312200,
    "device": "mobile"
}
TTL: Refreshed on heartbeat

# Online status
Key: presence:{user_id}
Value: "online" | "away" | "offline"
TTL: 5 minutes (refreshed by heartbeat)

Deep Dive

Interview context: “Let’s dive deeper into some interesting challenges…”

1. Message Delivery Flow (1:1 Chat)

Interviewer might ask: “Walk me through what happens when User A sends a message to User B.”

sequenceDiagram
    participant UserA as User A
    participant WS1 as WS Server 1
    participant CS as Chat Service
    participant Cass as Cassandra
    participant WS2 as WS Server 2
    participant UserB as User B

    UserA->>WS1: Send Message
    WS1->>CS: Forward
    par Save & Respond
        CS->>Cass: Save to Cassandra
        CS-->>UserA: ACK (sent)
    and Find & Deliver
        CS->>WS2: Find B's WS Server
        WS2->>UserB: Message
        UserB-->>WS2: ACK (delivered)
    end

2. Message Delivery Flow (Group Chat)

Interviewer might ask: “Group chat with 500 members is different from 1:1. How do you handle it?”

The challenge: One message → 500 deliveries. This is a fan-out problem.

flowchart TB
    UserA["User A (sends)"] --> WS["WS Server"]
    WS --> CS["Chat Service"]
    CS --> Kafka["Message Queue (Kafka)<br/>Topic: group-messages-{group_id}"]

    Kafka --> UserB["User B (WS 2)"]
    Kafka --> UserC["User C (WS 3)"]
    Kafka --> UserD["User D (offline)"]

    UserD --> Push["Push Notif"]

Why use Kafka for fan-out?

Decouples sender from delivery
Handles backpressure (slow consumers don’t block sender)
Reliable delivery with retries
Partitioning by user_id for ordered delivery

3. Online Presence System

Interviewer might ask: “How do you show who’s online? This seems simple but gets complex at scale.”

The challenge:

50 million concurrent users
Each user has hundreds of friends
Status changes must propagate quickly
But we can’t broadcast every change to everyone

User connects:

flowchart LR
    Client --> WS["WS Server"]
    WS --> PS["Presence Service"]
    PS --> Redis["Redis PUBLISH<br/>presence: user123=on"]
    Redis --> Subscribers["Subscribed clients receive update<br/>User A's friends → 'User A is now online'"]

Heartbeat mechanism:

flowchart LR
    Client -->|"Heartbeat (every 30s)"| Server
    Server --> Redis["Redis EXPIRE 60s"]
    Redis -.->|"If heartbeat stops → TTL expires"| Offline["User marked offline"]

Why heartbeat + TTL?

Handles ungraceful disconnects (network drop, app crash)
No need to detect disconnection explicitly
Redis handles cleanup automatically via TTL

4. Message Sync for Multiple Devices

Interviewer might ask: “Users have phones, tablets, and laptops. How do you sync messages across devices?”

flowchart TB
    subgraph Devices["User has multiple devices"]
        Phone
        Tablet
        Laptop
    end

    Phone --> WS["WS Servers (different)"]
    Tablet --> WS
    Laptop --> WS

    WS --> Sessions["User has 3 sessions in Redis"]
    Sessions --> Delivery["Message delivery to ALL sessions"]

Sync strategy:

Each device maintains last_sync_timestamp
On reconnect: fetch messages since last_sync
Conflict resolution: server timestamp wins

5. Message Status Tracking

Message states:

flowchart LR
    Sent --> Stored --> Delivered --> Read

Storage:

message_id	user_id	status	timestamp
msg123	user456	delivered	10:30:01
msg123	user456	read	10:30:05

Read receipt flow:

flowchart LR
    A["User B reads message"] --> B["Client sends ACK"]
    B --> C["Update status"]
    C --> D["Notify User A via WebSocket"]

Message Delivery

Interview context: “Let’s discuss delivery guarantees. This is where things get interesting.”

The Challenge: Delivery Guarantees

Interviewer might ask: “How do you ensure messages are delivered reliably?”

Guarantee	Challenge	Solution
At-least-once	Network failures	Retry with exponential backoff
Ordering	Concurrent messages	Sequential IDs per chat (TimeUUID)
No duplicates	Retries cause duplicates	client_msg_id deduplication

Why Not Exactly-Once?

Interviewer might ask: “Can you guarantee exactly-once delivery?”

The hard truth: Exactly-once delivery is impossible in distributed systems without significant complexity. Instead:

At-least-once delivery (guaranteed)
Idempotent processing (client_msg_id dedup)
Result: Effectively exactly-once from user’s perspective

Retry Strategy

Retry with exponential backoff + jitter:
  Attempt 1: immediate
  Attempt 2: 1s + random(0-1s)
  Attempt 3: 2s + random(0-1s)
  Attempt 4: 4s + random(0-1s)
  After max retries: Dead letter queue + push notification

Offline Message Handling

Offline User Flow:

User B is offline
Message arrives for B
Check Redis: B has no active session
Store message in Cassandra (normal flow)
Send push notification via FCM/APNs
When B comes online:
- Connect to WebSocket
- Fetch unread messages since last_sync
- Receive new real-time messages

flowchart TB
    UserA["User A"] -->|Send| WS["WS Server"]
    WS --> Check{"B offline?"}
    Check -->|YES| Push["Push Notif Service"]
    Check -->|NO| Deliver["Deliver to B's WS"]

Key Takeaways

Design Decisions Summary

Decision	Choice	Why
Protocol	WebSocket	True real-time, bidirectional, low overhead
Message storage	Cassandra	High write throughput, time-series optimized
Session storage	Redis	Fast lookups, pub/sub for cross-server messaging
Message queue	Kafka	Reliable fan-out, ordering guarantees
Presence	Redis + TTL	Simple, automatic cleanup on disconnect

Trade-offs to Discuss

Decision	Option A	Option B
Protocol	WebSocket (stateful)	Long polling (stateless)
Delivery	At-least-once + dedup	Exactly-once (complex)
Group fanout	Sync (simple)	Async via Kafka (scalable)
Presence	Push updates	Pull on demand

Scalability Phases

flowchart TB
    subgraph P1["Phase 1: Small scale (~10K concurrent)"]
        S1["Single WS server + Redis + PostgreSQL"]
    end
    subgraph P2["Phase 2: Medium scale (~1M concurrent)"]
        S2["WS cluster + Redis pub/sub + Cassandra"]
    end
    subgraph P3["Phase 3: Large scale (~50M concurrent)"]
        S3["Multiple DCs + Kafka + sharded everything"]
    end
    P1 --> P2 --> P3

Interview Tips

How to Approach (45 minutes)

1. CLARIFY (3-5 min)
   "1:1 only or groups? How large? Need presence? Read receipts?"

2. HIGH-LEVEL DESIGN (5-7 min)
   Draw: Clients → LB → WS Servers → Message Queue → Database

3. DEEP DIVE (25-30 min)
   - Protocol choice (WebSocket vs alternatives)
   - Message routing (how to find recipient's server)
   - Group chat fan-out strategy
   - Presence system design
   - Offline message handling

4. WRAP UP (5 min)
   - Discuss E2E encryption, push notifications
   - Mention monitoring, rate limiting

Key Phrases That Show Depth

Instead of…	Say…
“We use WebSocket”	“HTTP is request-response, so we need WebSocket for server-initiated push. It’s bidirectional and has low per-message overhead after the initial handshake.”
“Store in Cassandra”	“Messages are append-only and queried by (chat_id, time) - perfect for Cassandra’s time-series model with chat_id as partition key.”
“Use Redis for presence”	“Redis with TTL handles presence elegantly - heartbeat refreshes TTL, and if heartbeat stops, TTL expires and user is automatically offline.”

Common Follow-up Questions

Question	Key Points
“How handle 50M connections?”	Horizontal scaling, ~50K connections/server, 1000+ servers
“How ensure ordering?”	TimeUUID per chat, Kafka partition by chat_id
“How handle offline users?”	Store message, send push notification, sync on reconnect
“How scale group chat?”	Async fan-out via Kafka, not blocking sender
“How implement E2E encryption?”	Signal protocol, keys exchanged out-of-band, server can’t read

Design News Feed - Similar fanout patterns
Design Key-Value Store - Session storage
Consistent Hashing - Connection distribution

Design a Chat System

Table of Contents

Requirements

Functional Requirements

Non-Functional Requirements

Extended Requirements

Back of the Envelope Estimation

Traffic Estimates

Storage Estimates

Connection Estimates

Summary

System APIs

Send Message

Send Group Message

Get Messages

WebSocket Events

High-Level Design

Components

Communication Protocols

The Challenge

Protocol Options

Why WebSocket?

WebSocket Architecture

Connection Flow

Database Design

The Challenge

Why Cassandra for Messages?

Message Table (Cassandra)

Chat Table (Cassandra)

Group Table (PostgreSQL)

Session Table (Redis)

Deep Dive

1. Message Delivery Flow (1:1 Chat)

2. Message Delivery Flow (Group Chat)

3. Online Presence System

4. Message Sync for Multiple Devices

5. Message Status Tracking

Message Delivery

The Challenge: Delivery Guarantees

Why Not Exactly-Once?

Retry Strategy

Offline Message Handling

Key Takeaways

Design Decisions Summary

Trade-offs to Discuss

Scalability Phases

Interview Tips

How to Approach (45 minutes)

Key Phrases That Show Depth

Common Follow-up Questions

Related Topics