Distinct Substrings
Problem Overview
| Attribute | Value |
|---|---|
| Difficulty | Medium |
| Category | String Algorithms |
| Time Limit | 1 second |
| Key Technique | Suffix Array + LCP Array |
| CSES Link | https://cses.fi/problemset/task/2105 |
Learning Goals
After solving this problem, you will be able to:
- Build a suffix array efficiently using sorting or radix sort
- Construct the LCP (Longest Common Prefix) array from a suffix array
- Apply the formula: distinct substrings = total substrings - sum(LCP)
- Recognize when suffix array + LCP is the optimal approach for substring problems
Problem Statement
Problem: Given a string s, count the number of distinct substrings.
Input:
- Line 1: A string s consisting of lowercase English letters
Output:
- A single integer: the number of distinct substrings
Constraints:
-
1 <= s <= 10^5 - s contains only lowercase English letters (a-z)
Example
Input:
abab
Output:
7
Explanation: The distinct substrings are: “a”, “b”, “ab”, “ba”, “aba”, “bab”, “abab”. Note that “a” and “b” each appear twice in the string, but we count them only once.
Intuition: How to Think About This Problem
Pattern Recognition
Key Question: How can we efficiently count unique substrings without generating all of them?
The insight is that every suffix of a string contains all substrings starting at that position. If we sort all suffixes lexicographically, consecutive suffixes share common prefixes - and these shared prefixes represent duplicate substrings. The LCP array captures exactly this redundancy.
Breaking Down the Problem
- What are we looking for? Count of unique substrings
- What information do we have? A string of length n has n*(n+1)/2 total substrings (with duplicates)
- What’s the relationship between input and output? Distinct count = Total count - Duplicates
Analogies
Think of this like counting unique words in a sorted dictionary. If consecutive words share a common prefix (like “pre-“ in “predict” and “prefer”), we don’t want to double-count that shared part. The LCP tells us exactly how many characters are shared.
Solution 1: Brute Force
Idea
Generate all possible substrings and store them in a set to eliminate duplicates.
Algorithm
- Iterate over all starting positions i from 0 to n-1
- For each starting position, iterate over all ending positions j from i to n-1
- Add substring s[i:j+1] to a hash set
- Return the size of the set
Code
def count_distinct_brute(s: str) -> int:
"""
Brute force: generate all substrings and use a set.
Time: O(n^2) substrings, each taking O(n) to hash
Space: O(n^2) for storing substrings
"""
n = len(s)
substrings = set()
for i in range(n):
for j in range(i + 1, n + 1):
substrings.add(s[i:j])
return len(substrings)
Complexity
| Metric | Value | Explanation |
|---|---|---|
| Time | O(n^3) | O(n^2) substrings, each O(n) to create/hash |
| Space | O(n^2) | Up to n*(n+1)/2 unique substrings stored |
Why This Works (But Is Slow)
Correctness is guaranteed because we enumerate all substrings. However, for n=10^5, this generates up to 5 billion substrings - far too slow.
Solution 2: Optimal Solution (Suffix Array + LCP)
Key Insight
The Trick: Each suffix contributes (length - LCP with previous suffix) new distinct substrings.
A suffix of length L represents L unique substrings (its prefixes). When suffixes are sorted, the LCP with the previous suffix tells us how many of those substrings are duplicates.
Core Formula
Distinct Substrings = n*(n+1)/2 - sum(LCP[i])
Or equivalently, sum over all suffixes: (suffix_length - LCP_with_previous).
Algorithm
- Build the suffix array (sort all suffix starting positions by lexicographic order)
- Build the LCP array (longest common prefix between adjacent suffixes)
- Total substrings = n*(n+1)/2
- Subtract sum of all LCP values
- Return the result
Dry Run Example
Let’s trace through with input s = "abab":
Step 1: Generate all suffixes with their starting indices
Index 0: "abab"
Index 1: "bab"
Index 2: "ab"
Index 3: "b"
Step 2: Sort suffixes lexicographically
Sorted order: "ab" (2), "abab" (0), "b" (3), "bab" (1)
Suffix Array: [2, 0, 3, 1]
Step 3: Compute LCP array (LCP between adjacent sorted suffixes)
LCP[0] = 0 (no previous suffix)
LCP[1] = 2 ("ab" and "abab" share "ab")
LCP[2] = 0 ("abab" and "b" share nothing)
LCP[3] = 1 ("b" and "bab" share "b")
LCP Array: [0, 2, 0, 1]
Step 4: Calculate distinct substrings
Total possible = 4 * 5 / 2 = 10
Sum of LCP = 0 + 2 + 0 + 1 = 3
Distinct = 10 - 3 = 7
Answer: 7
Visual Diagram
String: "abab" (n=4)
Sorted Suffixes: Suffix | Length | LCP | New Substrings
--------|----------|-------|------------------
"ab" | 2 | 0 | 2 ("a", "ab")
"abab" | 4 | 2 | 2 ("aba", "abab")
"b" | 1 | 0 | 1 ("b")
"bab" | 3 | 1 | 2 ("ba", "bab")
--------|----------|-------|------------------
Total | 10 | 3 | 7 distinct
Code (Python)
def count_distinct_substrings(s: str) -> int:
"""
Count distinct substrings using Suffix Array + LCP.
Time: O(n log n) for suffix array construction
Space: O(n) for arrays
"""
n = len(s)
if n == 0:
return 0
# Build suffix array (indices sorted by suffix)
suffix_array = sorted(range(n), key=lambda i: s[i:])
# Build rank array (position of each suffix in sorted order)
rank = [0] * n
for i, sa in enumerate(suffix_array):
rank[sa] = i
# Build LCP array using Kasai's algorithm
lcp = [0] * n
k = 0
for i in range(n):
if rank[i] == 0:
k = 0
continue
j = suffix_array[rank[i] - 1]
while i + k < n and j + k < n and s[i + k] == s[j + k]:
k += 1
lcp[rank[i]] = k
if k > 0:
k -= 1
# Distinct = total - duplicates
total = n * (n + 1) // 2
duplicates = sum(lcp)
return total - duplicates
# Read input and solve
if __name__ == "__main__":
s = input().strip()
print(count_distinct_substrings(s))
Complexity
| Metric | Value | Explanation |
|---|---|---|
| Time | O(n log n) | Suffix array sorting dominates |
| Space | O(n) | Suffix array + rank + LCP arrays |
Common Mistakes
Mistake 1: Integer Overflow
Problem: n*(n+1) can exceed 2^31 for n = 10^5. Fix: Cast to long long before multiplication.
Mistake 2: Forgetting Empty LCP at Position 0
# WRONG - Starting LCP sum from index 0 with undefined value
lcp = compute_lcp()
# Assuming lcp[0] has a meaningful value
# CORRECT - LCP[0] is always 0 (no previous suffix)
lcp[0] = 0 # First suffix has no predecessor
Problem: The first suffix in sorted order has no previous suffix to compare with. Fix: Always set LCP[0] = 0.
Mistake 3: Incorrect Kasai’s Algorithm Implementation
# WRONG - Not decrementing k properly
k = 0
for i in range(n):
# ... compute lcp
k = lcp[rank[i]] # Wrong: should preserve k and decrement
# CORRECT
k = 0
for i in range(n):
# ... compute lcp
if k > 0:
k -= 1 # Key optimization in Kasai's algorithm
Problem: Kasai’s algorithm relies on the property that LCP can decrease by at most 1. Fix: Decrement k by 1 (not reset to 0) after each iteration.
Edge Cases
| Case | Input | Expected Output | Why |
|---|---|---|---|
| Single character | "a" |
1 | Only substring is “a” |
| All same characters | "aaaa" |
4 | “a”, “aa”, “aaa”, “aaaa” |
| All distinct characters | "abcd" |
10 | n*(n+1)/2 = 10, no duplicates |
| Two characters alternating | "abab" |
7 | Some prefixes repeat |
| Empty string | "" |
0 | No substrings possible |
When to Use This Pattern
Use Suffix Array + LCP When:
- Counting distinct substrings
- Finding longest repeated substring
- Comparing all pairs of suffixes efficiently
- Problems involving lexicographic ordering of substrings
Don’t Use When:
- Simple pattern matching (use KMP or Z-algorithm)
- String length is very small (brute force may be simpler)
- You only need to check existence of a specific substring
Pattern Recognition Checklist:
- Need to count/enumerate all substrings? -> Consider Suffix Array
- Looking for repeated patterns in string? -> Consider Suffix Array + LCP
- Need lexicographically k-th substring? -> Consider Suffix Array
- Simple string matching? -> Use KMP or hashing instead
Related Problems
Easier (Do These First)
| Problem | Why It Helps |
|---|---|
| String Matching (CSES 1753) | Basic string matching with KMP/Z-algorithm |
| Finding Borders (CSES 1732) | Understanding string prefixes/suffixes |
Similar Difficulty
| Problem | Key Difference |
|---|---|
| Repeating Substring (CSES 2106) | Find longest substring that appears >= k times |
| Longest Repeating Substring (LC 1062) | Find max LCP in array |
Harder (Do These After)
| Problem | New Concept |
|---|---|
| Substring Distribution (CSES 2110) | Count substrings of each length |
| Distinct Substrings II (LC 940) | Subsequences instead of substrings |
Key Takeaways
- The Core Idea: Suffix Array + LCP lets us count duplicates among all substrings efficiently
- Time Optimization: From O(n^3) brute force to O(n log n) with suffix array
- Space Trade-off: O(n) space for suffix array and LCP array
- Pattern: This is the standard approach for substring counting problems
Practice Checklist
Before moving on, make sure you can:
- Build a suffix array from scratch (sorting-based approach)
- Implement Kasai’s algorithm for LCP construction
- Explain why distinct = total - sum(LCP)
- Implement the full solution in under 15 minutes
Additional Resources
- CP-Algorithms: Suffix Array
- CP-Algorithms: LCP Array (Kasai’s Algorithm)
- CSES Distinct Substrings - Count unique substrings