Distinct Substrings

Problem Overview

Attribute	Value
Difficulty	Medium
Category	String Algorithms
Time Limit	1 second
Key Technique	Suffix Array + LCP Array
CSES Link	https://cses.fi/problemset/task/2105

Learning Goals

After solving this problem, you will be able to:

Build a suffix array efficiently using sorting or radix sort
Construct the LCP (Longest Common Prefix) array from a suffix array
Apply the formula: distinct substrings = total substrings - sum(LCP)
Recognize when suffix array + LCP is the optimal approach for substring problems

Problem Statement

Problem: Given a string s, count the number of distinct substrings.

Input:

Line 1: A string s consisting of lowercase English letters

Output:

A single integer: the number of distinct substrings

Constraints:

1 <= s <= 10^5
s contains only lowercase English letters (a-z)

Example

Input:
abab

Output:
7

Explanation: The distinct substrings are: “a”, “b”, “ab”, “ba”, “aba”, “bab”, “abab”. Note that “a” and “b” each appear twice in the string, but we count them only once.

Intuition: How to Think About This Problem

Pattern Recognition

Key Question: How can we efficiently count unique substrings without generating all of them?

The insight is that every suffix of a string contains all substrings starting at that position. If we sort all suffixes lexicographically, consecutive suffixes share common prefixes - and these shared prefixes represent duplicate substrings. The LCP array captures exactly this redundancy.

Breaking Down the Problem

What are we looking for? Count of unique substrings
What information do we have? A string of length n has n*(n+1)/2 total substrings (with duplicates)
What’s the relationship between input and output? Distinct count = Total count - Duplicates

Analogies

Think of this like counting unique words in a sorted dictionary. If consecutive words share a common prefix (like “pre-“ in “predict” and “prefer”), we don’t want to double-count that shared part. The LCP tells us exactly how many characters are shared.

Solution 1: Brute Force

Idea

Generate all possible substrings and store them in a set to eliminate duplicates.

Algorithm

Iterate over all starting positions i from 0 to n-1
For each starting position, iterate over all ending positions j from i to n-1
Add substring s[i:j+1] to a hash set
Return the size of the set

Code

def count_distinct_brute(s: str) -> int:
  """
  Brute force: generate all substrings and use a set.

  Time: O(n^2) substrings, each taking O(n) to hash
  Space: O(n^2) for storing substrings
  """
  n = len(s)
  substrings = set()

  for i in range(n):
    for j in range(i + 1, n + 1):
      substrings.add(s[i:j])

  return len(substrings)

Complexity

Metric	Value	Explanation
Time	O(n^3)	O(n^2) substrings, each O(n) to create/hash
Space	O(n^2)	Up to n*(n+1)/2 unique substrings stored

Why This Works (But Is Slow)

Correctness is guaranteed because we enumerate all substrings. However, for n=10^5, this generates up to 5 billion substrings - far too slow.

Solution 2: Optimal Solution (Suffix Array + LCP)

Key Insight

The Trick: Each suffix contributes (length - LCP with previous suffix) new distinct substrings.

A suffix of length L represents L unique substrings (its prefixes). When suffixes are sorted, the LCP with the previous suffix tells us how many of those substrings are duplicates.

Core Formula

Distinct Substrings = n*(n+1)/2 - sum(LCP[i])

Or equivalently, sum over all suffixes: (suffix_length - LCP_with_previous).

Algorithm

Build the suffix array (sort all suffix starting positions by lexicographic order)
Build the LCP array (longest common prefix between adjacent suffixes)
Total substrings = n*(n+1)/2
Subtract sum of all LCP values
Return the result

Dry Run Example

Let’s trace through with input s = "abab":

Step 1: Generate all suffixes with their starting indices
  Index 0: "abab"
  Index 1: "bab"
  Index 2: "ab"
  Index 3: "b"

Step 2: Sort suffixes lexicographically
  Sorted order: "ab" (2), "abab" (0), "b" (3), "bab" (1)
  Suffix Array: [2, 0, 3, 1]

Step 3: Compute LCP array (LCP between adjacent sorted suffixes)
  LCP[0] = 0  (no previous suffix)
  LCP[1] = 2  ("ab" and "abab" share "ab")
  LCP[2] = 0  ("abab" and "b" share nothing)
  LCP[3] = 1  ("b" and "bab" share "b")

  LCP Array: [0, 2, 0, 1]

Step 4: Calculate distinct substrings
  Total possible = 4 * 5 / 2 = 10
  Sum of LCP = 0 + 2 + 0 + 1 = 3
  Distinct = 10 - 3 = 7

Answer: 7

Visual Diagram

String: "abab" (n=4)

Sorted Suffixes:        Suffix  |  Length  |  LCP  |  New Substrings
                        --------|----------|-------|------------------
                        "ab"    |    2     |   0   |  2 ("a", "ab")
                        "abab"  |    4     |   2   |  2 ("aba", "abab")
                        "b"     |    1     |   0   |  1 ("b")
                        "bab"   |    3     |   1   |  2 ("ba", "bab")
                        --------|----------|-------|------------------
                        Total   |   10     |   3   |  7 distinct

Code (Python)

def count_distinct_substrings(s: str) -> int:
  """
  Count distinct substrings using Suffix Array + LCP.

  Time: O(n log n) for suffix array construction
  Space: O(n) for arrays
  """
  n = len(s)
  if n == 0:
    return 0

  # Build suffix array (indices sorted by suffix)
  suffix_array = sorted(range(n), key=lambda i: s[i:])

  # Build rank array (position of each suffix in sorted order)
  rank = [0] * n
  for i, sa in enumerate(suffix_array):
    rank[sa] = i

  # Build LCP array using Kasai's algorithm
  lcp = [0] * n
  k = 0
  for i in range(n):
    if rank[i] == 0:
      k = 0
      continue
    j = suffix_array[rank[i] - 1]
    while i + k < n and j + k < n and s[i + k] == s[j + k]:
      k += 1
    lcp[rank[i]] = k
    if k > 0:
      k -= 1

  # Distinct = total - duplicates
  total = n * (n + 1) // 2
  duplicates = sum(lcp)
  return total - duplicates


# Read input and solve
if __name__ == "__main__":
  s = input().strip()
  print(count_distinct_substrings(s))

Complexity

Metric	Value	Explanation
Time	O(n log n)	Suffix array sorting dominates
Space	O(n)	Suffix array + rank + LCP arrays

Common Mistakes

Mistake 1: Integer Overflow

Problem: n*(n+1) can exceed 2^31 for n = 10^5. Fix: Cast to long long before multiplication.

Mistake 2: Forgetting Empty LCP at Position 0

# WRONG - Starting LCP sum from index 0 with undefined value
lcp = compute_lcp()
# Assuming lcp[0] has a meaningful value

# CORRECT - LCP[0] is always 0 (no previous suffix)
lcp[0] = 0  # First suffix has no predecessor

Problem: The first suffix in sorted order has no previous suffix to compare with. Fix: Always set LCP[0] = 0.

Mistake 3: Incorrect Kasai’s Algorithm Implementation

# WRONG - Not decrementing k properly
k = 0
for i in range(n):
  # ... compute lcp
  k = lcp[rank[i]]  # Wrong: should preserve k and decrement

# CORRECT
k = 0
for i in range(n):
  # ... compute lcp
  if k > 0:
    k -= 1  # Key optimization in Kasai's algorithm

Problem: Kasai’s algorithm relies on the property that LCP can decrease by at most 1. Fix: Decrement k by 1 (not reset to 0) after each iteration.

Edge Cases

Case	Input	Expected Output	Why
Single character	`"a"`	1	Only substring is “a”
All same characters	`"aaaa"`	4	“a”, “aa”, “aaa”, “aaaa”
All distinct characters	`"abcd"`	10	n*(n+1)/2 = 10, no duplicates
Two characters alternating	`"abab"`	7	Some prefixes repeat
Empty string	`""`	0	No substrings possible

When to Use This Pattern

Use Suffix Array + LCP When:

Counting distinct substrings
Finding longest repeated substring
Comparing all pairs of suffixes efficiently
Problems involving lexicographic ordering of substrings

Don’t Use When:

Simple pattern matching (use KMP or Z-algorithm)
String length is very small (brute force may be simpler)
You only need to check existence of a specific substring

Pattern Recognition Checklist:

Need to count/enumerate all substrings? -> Consider Suffix Array
Looking for repeated patterns in string? -> Consider Suffix Array + LCP
Need lexicographically k-th substring? -> Consider Suffix Array
Simple string matching? -> Use KMP or hashing instead

Easier (Do These First)

Problem	Why It Helps
String Matching (CSES 1753)	Basic string matching with KMP/Z-algorithm
Finding Borders (CSES 1732)	Understanding string prefixes/suffixes

Similar Difficulty

Problem	Key Difference
Repeating Substring (CSES 2106)	Find longest substring that appears >= k times
Longest Repeating Substring (LC 1062)	Find max LCP in array

Harder (Do These After)

Problem	New Concept
Substring Distribution (CSES 2110)	Count substrings of each length
Distinct Substrings II (LC 940)	Subsequences instead of substrings

Key Takeaways

The Core Idea: Suffix Array + LCP lets us count duplicates among all substrings efficiently
Time Optimization: From O(n^3) brute force to O(n log n) with suffix array
Space Trade-off: O(n) space for suffix array and LCP array
Pattern: This is the standard approach for substring counting problems

Practice Checklist

Before moving on, make sure you can:

Build a suffix array from scratch (sorting-based approach)
Implement Kasai’s algorithm for LCP construction
Explain why distinct = total - sum(LCP)
Implement the full solution in under 15 minutes

Additional Resources

CP-Algorithms: Suffix Array
CP-Algorithms: LCP Array (Kasai’s Algorithm)
CSES Distinct Substrings - Count unique substrings

Distinct Substrings

Problem Overview

Learning Goals

Problem Statement

Example

Intuition: How to Think About This Problem

Pattern Recognition

Breaking Down the Problem

Analogies

Solution 1: Brute Force

Idea

Algorithm

Code

Complexity

Why This Works (But Is Slow)

Solution 2: Optimal Solution (Suffix Array + LCP)

Key Insight

Core Formula

Algorithm

Dry Run Example

Visual Diagram

Code (Python)

Complexity

Common Mistakes

Mistake 1: Integer Overflow

Mistake 2: Forgetting Empty LCP at Position 0

Mistake 3: Incorrect Kasai’s Algorithm Implementation

Edge Cases

When to Use This Pattern

Use Suffix Array + LCP When:

Don’t Use When:

Pattern Recognition Checklist:

Related Problems

Easier (Do These First)

Similar Difficulty

Harder (Do These After)

Key Takeaways

Practice Checklist

Additional Resources