Distinct Substrings
Problem Overview
| Attribute | Value |
|---|---|
| CSES Link | https://cses.fi/problemset/task/2223 |
| Difficulty | Medium |
| Category | String Algorithms |
| Time Limit | 1 second |
| Key Technique | Suffix Array + LCP Array |
Learning Goals
After solving this problem, you will be able to:
- Build a Suffix Array in O(n log n) time
- Compute the LCP (Longest Common Prefix) array from a Suffix Array
- Use the formula:
distinct_substrings = n*(n+1)/2 - sum(LCP)to count distinct substrings - Understand the relationship between suffix arrays and substring enumeration
Problem Statement
Problem: Given a string of length n, count the number of distinct substrings.
Input:
- A single string s containing only lowercase letters a-z
Output:
- One integer: the number of distinct substrings
Constraints:
- 1 <= n <= 10^5
- s contains only lowercase English letters
Example
Input:
abab
Output:
7
Explanation: The distinct substrings are: “a”, “b”, “ab”, “ba”, “aba”, “bab”, “abab”. Note that “a”, “b”, and “ab” each appear twice but are counted only once.
Intuition: How to Think About This Problem
Pattern Recognition
Key Question: How can we efficiently count unique substrings without storing them all?
The total number of substrings in a string of length n is n*(n+1)/2. If we can figure out how many of these are duplicates, we get our answer.
Breaking Down the Problem
- What are we looking for? Count of unique substrings
- What information do we have? A string of lowercase letters
- What’s the relationship between input and output? Each suffix contains all substrings starting at that position. The LCP between adjacent sorted suffixes tells us the overlap (duplicates).
Key Insight
The Trick: Each suffix of length k contributes k substrings (all its prefixes). When suffixes are sorted, the LCP with the previous suffix tells us how many prefixes are duplicates.
Solution 1: Brute Force (HashSet)
Idea
Generate all substrings and store them in a set to count unique ones.
Algorithm
- For each starting position i from 0 to n-1
- For each ending position j from i to n-1
- Extract substring s[i:j+1] and add to set
- Return set size
Code
def count_distinct_brute(s):
"""
Brute force using a set to store all substrings.
Time: O(n^3) for generation + hashing
Space: O(n^2) to store all substrings
"""
n = len(s)
substrings = set()
for i in range(n):
for j in range(i + 1, n + 1):
substrings.add(s[i:j])
return len(substrings)
Complexity
| Metric | Value | Explanation |
|---|---|---|
| Time | O(n^3) | O(n^2) substrings, each taking O(n) to hash |
| Space | O(n^2) | Storing up to n*(n+1)/2 substrings |
Why This Works (But Is Slow)
Correct because sets eliminate duplicates automatically. However, O(n^3) is too slow for n = 10^5.
Solution 2: Optimal - Suffix Array + LCP
Key Insight
The Trick: Total substrings = n*(n+1)/2. Duplicates = sum of LCP values between adjacent sorted suffixes.
When we sort all suffixes alphabetically, adjacent suffixes with a common prefix represent duplicate substrings. The LCP array tells us exactly how many duplicates each suffix introduces.
Why This Formula Works
Consider string “abab” with sorted suffixes:
- “ab” (pos 2) - contributes 2 substrings: “a”, “ab”
- “abab” (pos 0) - LCP with “ab” is 2, so 2 duplicates. New: 4-2 = 2 substrings
- “b” (pos 3) - LCP with “abab” is 0, so 0 duplicates. New: 1 substring
- “bab” (pos 1) - LCP with “b” is 1, so 1 duplicate. New: 3-1 = 2 substrings
Total = 2 + 2 + 1 + 2 = 7 distinct substrings.
Algorithm
- Build the suffix array (sorted indices of all suffixes)
- Compute the LCP array using Kasai’s algorithm
- Result = n*(n+1)/2 - sum(LCP)
Dry Run Example
Let’s trace through with input s = "abab":
Step 1: List all suffixes with their starting indices
Index 0: "abab"
Index 1: "bab"
Index 2: "ab"
Index 3: "b"
Step 2: Sort suffixes alphabetically
Sorted order: "ab" < "abab" < "b" < "bab"
Suffix Array: [2, 0, 3, 1]
Step 3: Compute LCP array (LCP between adjacent sorted suffixes)
LCP[0] = 0 (no previous suffix)
LCP[1] = LCP("ab", "abab") = 2
LCP[2] = LCP("abab", "b") = 0
LCP[3] = LCP("b", "bab") = 1
Step 4: Apply formula
Total substrings = 4 * 5 / 2 = 10
Sum of LCP = 0 + 2 + 0 + 1 = 3
Distinct = 10 - 3 = 7
Visual Diagram
String: "abab" (n=4)
Sorted Suffixes: Length: LCP with prev: New substrings:
"ab" (idx 2) 2 - 2
"abab" (idx 0) 4 2 2 (4-2)
"b" (idx 3) 1 0 1 (1-0)
"bab" (idx 1) 3 1 2 (3-1)
---
Total: 7
Code (Python)
def build_suffix_array(s):
"""Build suffix array using O(n log n) algorithm."""
n = len(s)
suffixes = [(s[i:], i) for i in range(n)]
suffixes.sort()
return [idx for _, idx in suffixes]
def build_lcp_array(s, sa):
"""Kasai's algorithm for LCP array in O(n)."""
n = len(s)
rank = [0] * n
lcp = [0] * n
for i, suffix_idx in enumerate(sa):
rank[suffix_idx] = i
k = 0
for i in range(n):
if rank[i] == 0:
k = 0
continue
j = sa[rank[i] - 1]
while i + k < n and j + k < n and s[i + k] == s[j + k]:
k += 1
lcp[rank[i]] = k
if k > 0:
k -= 1
return lcp
def count_distinct_substrings(s):
"""
Count distinct substrings using Suffix Array + LCP.
Time: O(n log n) for suffix array, O(n) for LCP
Space: O(n) for arrays
"""
n = len(s)
if n == 0:
return 0
sa = build_suffix_array(s)
lcp = build_lcp_array(s, sa)
total = n * (n + 1) // 2
duplicates = sum(lcp)
return total - duplicates
# Read input and solve
s = input().strip()
print(count_distinct_substrings(s))
Complexity
| Metric | Value | Explanation |
|---|---|---|
| Time | O(n log n) | Dominated by suffix array construction |
| Space | O(n) | Suffix array and LCP array |
Common Mistakes
Mistake 1: Integer Overflow
Problem: n*(n+1) can exceed 2^31 for n = 10^5.
Fix: Cast to long long before multiplication.
Mistake 2: Forgetting LCP[0] = 0
# WRONG - trying to compute LCP for first suffix
lcp[0] = some_computation()
# CORRECT - first sorted suffix has no predecessor
lcp[0] = 0 # Always zero
Problem: The first suffix in sorted order has no previous suffix to compare. Fix: Always set LCP[0] = 0.
Mistake 3: Using Simple Sorting
# WRONG - O(n^2 log n) time
suffixes = [s[i:] for i in range(n)]
suffixes.sort() # Each comparison is O(n)
# CORRECT - Use rank-based sorting for O(n log n)
# (See the full implementation above)
Problem: Comparing strings directly takes O(n) per comparison. Fix: Use the doubling algorithm with ranks.
Edge Cases
| Case | Input | Expected Output | Why |
|---|---|---|---|
| Single character | “a” | 1 | Only one substring |
| All same chars | “aaa” | 3 | “a”, “aa”, “aaa” |
| All unique chars | “abc” | 6 | No duplicates: n*(n+1)/2 |
| Two characters alternating | “abab” | 7 | Multiple repeated substrings |
| Empty string | ”” | 0 | No substrings |
When to Use This Pattern
Use Suffix Array + LCP When:
- Counting distinct substrings
- Finding longest repeated substring
- Computing the number of occurrences of each substring
- Problems involving suffix comparisons
Don’t Use When:
- Simple pattern matching (use KMP or Z-algorithm)
- You only need to check if a specific substring exists (use hashing)
- The string is very short (brute force is simpler)
Pattern Recognition Checklist:
- Need to enumerate or count substrings? -> Consider Suffix Array
- Need to find common prefixes between suffixes? -> Use LCP Array
- Need to handle multiple pattern queries? -> Suffix Array with binary search
Related Problems
Easier (Do These First)
| Problem | Why It Helps |
|---|---|
| String Matching | Basic pattern matching, introduction to string algorithms |
| Finding Borders | Understanding prefix-suffix relationships |
Similar Difficulty
| Problem | Key Difference |
|---|---|
| Repeating Substring | Find the longest substring that appears at least twice |
| Minimal Rotation | Uses suffix array for lexicographic comparison |
Harder (Do These After)
| Problem | New Concept |
|---|---|
| Required Substring | Combines DP with string constraints |
| Substring Distribution | Count substrings of each length |
Key Takeaways
- The Core Idea: Distinct substrings = Total substrings - Duplicates (from LCP sum)
- Time Optimization: From O(n^3) brute force to O(n log n) using suffix array
- Space Trade-off: O(n) extra space for suffix array and LCP array
- Pattern: This is a classic application of suffix array + LCP for substring counting
Practice Checklist
Before moving on, make sure you can:
- Build a suffix array from scratch
- Implement Kasai’s algorithm for LCP array
- Explain why the formula
n*(n+1)/2 - sum(LCP)works - Solve this problem without looking at the solution
- Implement in your preferred language in under 15 minutes