Substring Order I
Problem Overview
| Attribute | Value |
|---|---|
| CSES Link | https://cses.fi/problemset/task/2108 |
| Difficulty | Hard |
| Category | String Algorithms |
| Time Limit | 1 second |
| Key Technique | Suffix Array + LCP Array |
Learning Goals
After solving this problem, you will be able to:
- Build a suffix array and understand its lexicographical properties
- Construct an LCP (Longest Common Prefix) array efficiently
- Use LCP to count distinct substrings without duplication
- Apply binary search to find the k-th element in a cumulative sum array
Problem Statement
Problem: Given a string and a value k, find the k-th lexicographically smallest distinct substring.
Input:
- Line 1: A string s of lowercase letters
- Line 2: An integer k
Output:
- The k-th lexicographically smallest distinct substring
Constraints:
-
1 <= s <= 10^5 - 1 <= k <= number of distinct substrings
Example
Input:
ababab
3
Output:
aba
Explanation: The distinct substrings sorted lexicographically:
- “a”
- “ab”
- “aba” <-- k=3
- “abab”
- “ababab”
- “b”
- “ba”
- “bab”
- “baba”
- “babab”
Intuition: How to Think About This Problem
Pattern Recognition
Key Question: How do we efficiently enumerate all distinct substrings in sorted order?
The suffix array gives us all suffixes sorted lexicographically. Each suffix contributes multiple substrings (its prefixes). The LCP array tells us how many of those prefixes are duplicates from the previous suffix.
Breaking Down the Problem
- What are we looking for? The k-th smallest distinct substring
- What information do we have? A string and a position k
- What’s the relationship? Suffix array orders suffixes; each suffix’s prefixes are substrings in sorted order
Key Insight
For suffix at position i in suffix array:
- Total prefixes = n - SA[i] (all substrings starting at SA[i])
- Duplicate prefixes = LCP[i] (shared with previous suffix)
- New distinct substrings = (n - SA[i]) - LCP[i]
Solution 1: Brute Force
Idea
Generate all substrings, remove duplicates, sort them, and return the k-th one.
Code
def brute_force(s, k):
"""
Time: O(n^2 log n) for generation and sorting
Space: O(n^2) to store all substrings
"""
substrings = set()
n = len(s)
for i in range(n):
for j in range(i + 1, n + 1):
substrings.add(s[i:j])
sorted_subs = sorted(substrings)
return sorted_subs[k - 1]
Complexity
| Metric | Value | Explanation |
|---|---|---|
| Time | O(n^2 log n) | Generate O(n^2) substrings, sort them |
| Space | O(n^2) | Store all distinct substrings |
Why this is too slow: For n = 10^5, we cannot store 10^10 substrings.
Solution 2: Suffix Array + LCP (Optimal)
Key Insight
The Trick: Use suffix array for sorted order and LCP to skip duplicates. Binary search finds which suffix contains the k-th substring.
Algorithm Overview
- Build suffix array SA[] - gives suffixes in sorted order
- Build LCP array - tells overlap with previous suffix
- For each suffix i: new substrings = (n - SA[i]) - LCP[i]
- Build cumulative count array
- Binary search to find which suffix contains k-th substring
- Extract the substring at the correct offset
Dry Run Example
Let’s trace through s = "aba", k = 3:
Step 1: Build Suffix Array
Suffixes: Sorted:
0: "aba" 1: "a" (SA[0] = 2)
1: "ba" 0: "aba" (SA[1] = 0)
2: "a" 1: "ba" (SA[2] = 1)
SA = [2, 0, 1]
Step 2: Build LCP Array
LCP[0] = 0 (no previous suffix)
LCP[1] = 1 (lcp("a", "aba") = "a" -> length 1)
LCP[2] = 0 (lcp("aba", "ba") = "" -> length 0)
LCP = [0, 1, 0]
Step 3: Count New Substrings per Suffix
i=0: suffix "a", new = (3-2) - 0 = 1 -> adds "a"
i=1: suffix "aba", new = (3-0) - 1 = 2 -> adds "ab", "aba"
i=2: suffix "ba", new = (3-1) - 0 = 2 -> adds "b", "ba"
Cumulative: [1, 3, 5]
Step 4: Find k=3
Binary search: k=3 falls in suffix i=1 (cum[1]=3)
Offset within suffix = 3 - 1 = 2 (1 is cum[0])
Length = LCP[1] + offset = 1 + 2 = 3
Result: s[SA[1] : SA[1]+3] = s[0:3] = "aba"
Visual Diagram
String: "aba" (n=3)
Suffix Array (sorted suffixes):
+-------+--------+------+-------+------------------+
| Index | SA[i] | LCP | Suffix| New Substrings |
+-------+--------+------+-------+------------------+
| 0 | 2 | 0 | "a" | 1: "a" |
| 1 | 0 | 1 | "aba" | 2: "ab", "aba" |
| 2 | 1 | 0 | "ba" | 2: "b", "ba" |
+-------+--------+------+-------+------------------+
Cumulative count: [1, 3, 5]
k=1 k=2,3 k=4,5
For k=3: Find suffix 1, extract "aba"
Code (Python)
def build_suffix_array(s):
"""Build suffix array using O(n log n) algorithm."""
n = len(s)
sa = list(range(n))
rank = [ord(c) for c in s]
tmp = [0] * n
k = 1
while k < n:
def key(i):
return (rank[i], rank[i + k] if i + k < n else -1)
sa.sort(key=key)
tmp[sa[0]] = 0
for i in range(1, n):
tmp[sa[i]] = tmp[sa[i-1]]
if key(sa[i]) != key(sa[i-1]):
tmp[sa[i]] += 1
rank = tmp[:]
k *= 1
return sa
def build_lcp(s, sa):
"""Build LCP array using Kasai's algorithm - O(n)."""
n = len(s)
rank = [0] * n
for i in range(n):
rank[sa[i]] = i
lcp = [0] * n
h = 0
for i in range(n):
if rank[i] > 0:
j = sa[rank[i] - 1]
while i + h < n and j + h < n and s[i + h] == s[j + h]:
h += 1
lcp[rank[i]] = h
if h > 0:
h -= 1
return lcp
def solve(s, k):
"""
Find k-th lexicographically smallest distinct substring.
Time: O(n log n) for SA + O(n) for LCP + O(log n) for query
Space: O(n)
"""
n = len(s)
sa = build_suffix_array(s)
lcp = build_lcp(s, sa)
# Build cumulative count of new substrings
cum = []
total = 0
for i in range(n):
new_count = (n - sa[i]) - lcp[i]
total += new_count
cum.append(total)
# Binary search for the suffix containing k-th substring
lo, hi = 0, n - 1
while lo < hi:
mid = (lo + hi) // 2
if cum[mid] < k:
lo = mid + 1
else:
hi = mid
# Calculate substring length
prev_cum = cum[lo - 1] if lo > 0 else 0
offset = k - prev_cum # Position within this suffix's new substrings
length = lcp[lo] + offset
return s[sa[lo]:sa[lo] + length]
# Main
s = input().strip()
k = int(input())
print(solve(s, k))
Complexity
| Metric | Value | Explanation |
|---|---|---|
| Time | O(n log n) | SA construction dominates |
| Space | O(n) | SA, LCP, and cumulative arrays |
Common Mistakes
Mistake 1: Integer Overflow
Problem: With n = 10^5, total distinct substrings can reach ~5 * 10^9.
Fix: Use long long for cumulative counts.
Mistake 2: Off-by-One in Length Calculation
# WRONG
length = lcp[lo] + offset - 1
# CORRECT
length = lcp[lo] + offset
Problem: The offset represents “which new substring” (1-indexed within the suffix). Fix: Add offset directly to LCP without subtracting 1.
Mistake 3: Forgetting LCP[0] = 0
Edge Cases
| Case | Input | Expected Output | Why |
|---|---|---|---|
| Single char | "a", k=1 |
"a" |
Only one substring |
| All same chars | "aaa", k=3 |
"aaa" |
Only 3 distinct: a, aa, aaa |
| k = 1 | "cab", k=1 |
"a" |
Smallest is lexicographically first char |
| Max k | "ab", k=3 |
"b" |
Substrings: a, ab, b |
When to Use This Pattern
Use This Approach When:
- Finding k-th lexicographically smallest substring
- Counting or enumerating distinct substrings
- Problems involving lexicographical ordering of substrings
- Need to avoid O(n^2) substring generation
Don’t Use When:
- Simple pattern matching (use KMP or Z-algorithm)
- Finding longest common substring between two strings (use different SA approach)
- String has special structure that allows simpler solution
Pattern Recognition Checklist:
- Need substrings in sorted order? -> Suffix Array
- Need to count distinct substrings? -> SA + LCP
- Need k-th element in ordered set? -> Binary search on cumulative counts
Related Problems
Similar Difficulty (CSES)
| Problem | Key Difference |
|---|---|
| Substring Order II | Counts duplicates (non-distinct) |
| Distinct Substrings | Count only, no need to find k-th |
| Repeating Substring | Find longest repeating via LCP |
Prerequisite Problems (CSES)
| Problem | Why It Helps |
|---|---|
| String Matching | Basic string algorithm foundation |
| Finding Borders | Understanding string structure |
Harder Extensions
| Problem | New Concept |
|---|---|
| Substring Distribution | Count substrings by length |
| Finding Patterns | Multiple pattern matching |
Key Takeaways
- Core Idea: Suffix array gives sorted suffixes; each suffix’s prefixes are substrings in sorted order
- LCP Optimization: LCP[i] tells how many prefixes of suffix i are duplicates from suffix i-1
- Counting Formula: New distinct substrings from suffix i = (n - SA[i]) - LCP[i]
- Binary Search: Use cumulative counts to locate which suffix contains the k-th substring
Practice Checklist
Before moving on, make sure you can:
- Implement suffix array construction from scratch
- Build LCP array using Kasai’s algorithm
- Explain why LCP helps count distinct substrings
- Derive the substring extraction formula given suffix index and offset
- Handle edge cases with single characters and repeated patterns