Distinct Values Queries
Problem Overview
| Attribute | Value |
|---|---|
| Difficulty | Hard |
| Category | Range Queries |
| Time Limit | 1 second |
| Key Technique | Offline Processing + BIT / Mo’s Algorithm |
| CSES Link | Distinct Values Queries |
Learning Goals
After solving this problem, you will be able to:
- Apply offline query processing to transform a hard problem into a manageable one
- Use Binary Indexed Tree (BIT) for efficient range sum queries
- Track “last occurrence” of elements to avoid double counting
- Implement Mo’s algorithm as an alternative approach
- Recognize when sorting queries by right endpoint is beneficial
Problem Statement
Problem: Given an array of n integers and q queries, for each query (l, r), count the number of distinct values in the subarray from index l to r (inclusive).
Input:
- Line 1: Two integers n and q (array size and number of queries)
- Line 2: n integers (the array elements)
- Next q lines: Two integers l and r for each query (1-indexed)
Output:
- q lines: The number of distinct values in each queried range
Constraints:
- 1 <= n, q <= 2 * 10^5
- 1 <= arr[i] <= 10^9
- 1 <= l <= r <= n
Example
Input:
5 3
3 2 3 1 2
1 3
2 4
1 5
Output:
2
3
3
Explanation:
- Query (1,3): Subarray [3, 2, 3] has distinct values {2, 3} = 2 distinct
- Query (2,4): Subarray [2, 3, 1] has distinct values {1, 2, 3} = 3 distinct
- Query (1,5): Subarray [3, 2, 3, 1, 2] has distinct values {1, 2, 3} = 3 distinct
Intuition: How to Think About This Problem
Pattern Recognition
Key Question: How do we count distinct elements efficiently when we cannot merge segment tree nodes?
The critical insight is that counting distinct values is not decomposable - knowing the distinct count in [l, mid] and [mid+1, r] does not help compute the distinct count in [l, r] because elements may appear in both halves.
Breaking Down the Problem
- What are we looking for? Count of unique elements in each range [l, r]
- What information do we have? Static array, multiple queries
- What’s the relationship between input and output? For each (l, r), count elements appearing at least once in that range
The Key Insight: Last Occurrence Tracking
Instead of counting elements directly, we can ask: “For each position i, does the element at i contribute to the distinct count for range [l, r]?”
An element at position i contributes to range [l, r] if:
- l <= i <= r (it’s within the range)
- There is no occurrence of the same element at position j where l <= j < i (it’s the first/leftmost occurrence in the range)
Reformulation: If we mark position i with 1 when arr[i]’s last occurrence before or at position r is exactly at position i, then the distinct count for [l, r] = sum of marks in [l, r].
Solution 1: Brute Force
Idea
For each query, iterate through the range and use a set to count distinct elements.
Algorithm
- For each query (l, r)
- Add all elements in range [l, r] to a set
- Return the size of the set
Code
def solve_brute_force(n, arr, queries):
"""
Brute force: Use a set for each query.
Time: O(q * n)
Space: O(n)
"""
results = []
for l, r in queries:
distinct = set()
for i in range(l - 1, r): # Convert to 0-indexed
distinct.add(arr[i])
results.append(len(distinct))
return results
Complexity
| Metric | Value | Explanation |
|---|---|---|
| Time | O(q * n) | Each query scans up to n elements |
| Space | O(n) | Set can hold up to n elements |
Why This Works (But Is Slow)
Correctness is guaranteed because a set inherently stores only unique elements. However, with q and n up to 2 * 10^5, this gives 4 * 10^10 operations - far too slow.
Solution 2: Offline Processing + BIT (Optimal)
Key Insight
The Trick: Process queries sorted by right endpoint. For each position, only mark the rightmost occurrence of each value up to the current position.
When we process queries in order of increasing right endpoint r:
- As we move r from 1 to n, we update a BIT where position i has value 1 if arr[i] is the rightmost occurrence of that value among positions 1 to r
- For query (l, r), the answer is sum(l, r) in the BIT
Why This Works
When processing a query with right endpoint r:
- For each unique value, exactly one position (its rightmost occurrence up to r) has a 1 in the BIT
- Elements appearing multiple times are counted exactly once
- The range sum [l, r] counts how many “rightmost occurrences” fall within [l, r]
Algorithm
- Store queries as (l, r, original_index), sort by r
- Maintain a BIT and a map: value -> its current marked position
- Process array left to right (position i from 1 to n):
- If arr[i] was previously marked at position p, unmark it (BIT subtract 1 at p)
- Mark position i for arr[i] (BIT add 1 at i)
- Update map: arr[i] -> i
- Answer all queries with r == i: answer = BIT.sum(l, i)
- Return answers in original query order
Dry Run Example
Let’s trace through with arr = [3, 2, 3, 1, 2] and queries [(1,3), (2,4), (1,5)]:
Sorted queries by r: [(1,3,0), (2,4,1), (1,5,2)]
Initial state:
BIT = [0, 0, 0, 0, 0] (conceptually)
last_pos = {}
Step 1: i = 1, arr[1] = 3
3 not in last_pos
BIT.add(1, +1) -> marks position 1
last_pos = {3: 1}
BIT state: [1, 0, 0, 0, 0]
No query with r = 1
Step 2: i = 2, arr[2] = 2
2 not in last_pos
BIT.add(2, +1) -> marks position 2
last_pos = {3: 1, 2: 2}
BIT state: [1, 1, 0, 0, 0]
No query with r = 2
Step 3: i = 3, arr[3] = 3
3 IS in last_pos at position 1
BIT.add(1, -1) -> unmarks position 1
BIT.add(3, +1) -> marks position 3
last_pos = {3: 3, 2: 2}
BIT state: [0, 1, 1, 0, 0]
Query (1, 3): sum(1, 3) = 0 + 1 + 1 = 2 [Answer for query 0]
Step 4: i = 4, arr[4] = 1
1 not in last_pos
BIT.add(4, +1) -> marks position 4
last_pos = {3: 3, 2: 2, 1: 4}
BIT state: [0, 1, 1, 1, 0]
Query (2, 4): sum(2, 4) = 1 + 1 + 1 = 3 [Answer for query 1]
Step 5: i = 5, arr[5] = 2
2 IS in last_pos at position 2
BIT.add(2, -1) -> unmarks position 2
BIT.add(5, +1) -> marks position 5
last_pos = {3: 3, 2: 5, 1: 4}
BIT state: [0, 0, 1, 1, 1]
Query (1, 5): sum(1, 5) = 0 + 0 + 1 + 1 + 1 = 3 [Answer for query 2]
Final answers in original order: [2, 3, 3]
Visual Diagram
Array: [3, 2, 3, 1, 2]
Index: 1 2 3 4 5
Processing at i=5 (final state):
Value 3: last at position 3 -> BIT[3] = 1
Value 2: last at position 5 -> BIT[5] = 1 (moved from position 2)
Value 1: last at position 4 -> BIT[4] = 1
BIT: [0, 0, 1, 1, 1]
1 2 3 4 5
Query (1,5): sum of BIT[1..5] = 3 distinct values
Code (Python)
class BIT:
"""Binary Indexed Tree for point updates and prefix sums."""
def __init__(self, n):
self.n = n
self.tree = [0] * (n + 1)
def update(self, i, delta):
"""Add delta to position i (1-indexed)."""
while i <= self.n:
self.tree[i] += delta
i += i & (-i)
def prefix_sum(self, i):
"""Sum of elements from 1 to i."""
total = 0
while i > 0:
total += self.tree[i]
i -= i & (-i)
return total
def range_sum(self, l, r):
"""Sum of elements from l to r (inclusive)."""
return self.prefix_sum(r) - self.prefix_sum(l - 1)
def solve_offline_bit(n, arr, queries):
"""
Offline processing with BIT.
Time: O((n + q) * log(n))
Space: O(n + q)
"""
# Store queries with original index, sort by right endpoint
indexed_queries = [(l, r, i) for i, (l, r) in enumerate(queries)]
indexed_queries.sort(key=lambda x: x[1]) # Sort by r
bit = BIT(n)
last_pos = {} # value -> last position where it's marked
answers = [0] * len(queries)
query_idx = 0
for i in range(1, n + 1):
val = arr[i - 1] # Convert to 0-indexed array access
# If this value was marked before, unmark it
if val in last_pos:
bit.update(last_pos[val], -1)
# Mark current position
bit.update(i, 1)
last_pos[val] = i
# Answer all queries with right endpoint == i
while query_idx < len(indexed_queries) and indexed_queries[query_idx][1] == i:
l, r, orig_idx = indexed_queries[query_idx]
answers[orig_idx] = bit.range_sum(l, r)
query_idx += 1
return answers
def main():
import sys
input_data = sys.stdin.read().split()
idx = 0
n, q = int(input_data[idx]), int(input_data[idx + 1])
idx += 2
arr = [int(input_data[idx + i]) for i in range(n)]
idx += n
queries = []
for _ in range(q):
l, r = int(input_data[idx]), int(input_data[idx + 1])
queries.append((l, r))
idx += 2
results = solve_offline_bit(n, arr, queries)
print('\n'.join(map(str, results)))
if __name__ == "__main__":
main()
Complexity
| Metric | Value | Explanation |
|---|---|---|
| Time | O((n + q) * log n) | n BIT updates, q BIT range queries |
| Space | O(n + q) | BIT array, hash map, query storage |
Solution 3: Mo’s Algorithm (Alternative)
Key Insight
The Trick: Process queries in a special order that minimizes pointer movement, allowing O(1) amortized add/remove operations.
Mo’s algorithm divides the array into blocks of size sqrt(n) and sorts queries by (block of l, r). This ensures the total pointer movement is O((n + q) * sqrt(n)).
Algorithm
- Choose block size B = sqrt(n)
- Sort queries by (l / B, r)
- Maintain current range [cur_l, cur_r] and frequency map
- For each query, expand/shrink the current range to match the query
- Track distinct count as we add/remove elements
Code (Python)
from collections import defaultdict
import math
def solve_mo(n, arr, queries):
"""
Mo's algorithm for offline range queries.
Time: O((n + q) * sqrt(n))
Space: O(n + q)
"""
block_size = max(1, int(math.sqrt(n)))
# Store queries with original index
indexed_queries = [(l, r, i) for i, (l, r) in enumerate(queries)]
# Sort by (block of l, r)
indexed_queries.sort(key=lambda x: (x[0] // block_size, x[1]))
freq = defaultdict(int)
distinct_count = 0
cur_l, cur_r = 1, 0 # Empty range initially
answers = [0] * len(queries)
def add(idx):
nonlocal distinct_count
val = arr[idx - 1] # Convert to 0-indexed
if freq[val] == 0:
distinct_count += 1
freq[val] += 1
def remove(idx):
nonlocal distinct_count
val = arr[idx - 1]
freq[val] -= 1
if freq[val] == 0:
distinct_count -= 1
for l, r, orig_idx in indexed_queries:
# Expand/shrink to reach [l, r]
while cur_r < r:
cur_r += 1
add(cur_r)
while cur_l > l:
cur_l -= 1
add(cur_l)
while cur_r > r:
remove(cur_r)
cur_r -= 1
while cur_l < l:
remove(cur_l)
cur_l += 1
answers[orig_idx] = distinct_count
return answers
Complexity (Mo’s Algorithm)
| Metric | Value | Explanation |
|---|---|---|
| Time | O((n + q) * sqrt(n)) | Each pointer moves O(n * sqrt(n)) total |
| Space | O(n + q) | Frequency map, query storage |
Common Mistakes
Mistake 1: Processing Queries Online
# WRONG - Cannot answer each query independently in O(log n)
for l, r in queries:
# No efficient way to count distinct in [l, r] online
answer = some_magic_query(l, r) # Doesn't exist!
Problem: Distinct value counting is not decomposable; segment trees cannot merge distinct counts. Fix: Use offline processing to handle queries in a specific order.
Mistake 2: Forgetting to Unmark Previous Occurrence
# WRONG
for i in range(1, n + 1):
bit.update(i, 1) # Just marks everything
last_pos[arr[i]] = i
Problem: An element appearing at positions 2 and 5 would be counted twice. Fix: Always unmark the previous occurrence before marking the new one.
Mistake 3: Wrong Query Sorting
# WRONG - Sorting by left endpoint
indexed_queries.sort(key=lambda x: x[0]) # Sorted by l
Problem: We need to process by right endpoint to maintain the BIT invariant correctly. Fix: Sort queries by right endpoint r.
Mistake 4: Off-by-One in BIT Indexing
# WRONG - Using 0-indexed in BIT
bit.update(0, 1) # BIT is 1-indexed!
Problem: BIT operations assume 1-indexed positions. Fix: Use 1-indexed positions throughout, convert array access carefully.
Edge Cases
| Case | Input | Expected Output | Why |
|---|---|---|---|
| Single element | n=1, arr=[5], query=(1,1) |
1 | One element is always distinct |
| All same | n=5, arr=[3,3,3,3,3], query=(1,5) |
1 | All elements are the same value |
| All different | n=5, arr=[1,2,3,4,5], query=(1,5) |
5 | Each element is unique |
| Large values | arr=[10^9, 10^9-1] |
Works | Hash map handles large keys |
| Single query full range | query=(1,n) |
Count of unique in entire array | Maximum range |
| Many overlapping queries | All queries are (1, n) | Same answer for all | Offline still works |
When to Use This Pattern
Use Offline + BIT When:
- Queries are given upfront (not online/streaming)
- Need O(log n) per query after sorting
- Can afford O(n log n) preprocessing
- Space for storing all queries is acceptable
Use Mo’s Algorithm When:
- Add/remove operations are O(1)
- O(sqrt(n)) per query is acceptable
- Problem involves counting or frequency tracking
- Cannot use offline + BIT due to problem structure
Don’t Use When:
- Queries must be answered online (as they arrive)
- Updates to the array are interleaved with queries
- Need sub-linear query time without preprocessing
Pattern Recognition Checklist:
- Static array with multiple range queries? -> Consider offline processing
- Counting/frequency in ranges? -> Offline + BIT or Mo’s algorithm
- Non-decomposable range function? -> Cannot use standard segment tree
- Need to track “first/last occurrence” in range? -> Offline + BIT
Related Problems
Easier (Do These First)
| Problem | Why It Helps |
|---|---|
| Static Range Sum Queries (CSES) | Basic range queries with prefix sums |
| Range Minimum Queries (CSES) | Sparse table for static RMQ |
Similar Difficulty
| Problem | Key Difference |
|---|---|
| Distinct Values (CSES) | Count subarrays with at most k distinct (sliding window) |
| Range Update Queries (CSES) | BIT with lazy propagation |
| K-query (SPOJ) | Count elements greater than k in range |
Harder (Do These After)
| Problem | New Concept |
|---|---|
| Distinct Values Queries with Updates | Handle array modifications |
| Count on a Tree (SPOJ) | HLD + persistent segment tree |
| Range Distinct Query (D-Query SPOJ) | Classic offline + BIT |
Key Takeaways
- The Core Idea: Transform “count distinct” into “count rightmost occurrences” using offline processing
- Time Optimization: O(q * n) -> O((n + q) * log n) by processing queries in sorted order
- Space Trade-off: O(n + q) space for BIT and query storage
- Pattern: Offline query processing - when online is impossible, sort queries cleverly
Practice Checklist
Before moving on, make sure you can:
- Explain why segment trees cannot directly solve this problem
- Implement offline + BIT solution from scratch
- Trace through the “last occurrence” invariant manually
- Recognize when Mo’s algorithm is a better choice
- Handle coordinate compression if values are very large
Additional Resources
- CP-Algorithms: Fenwick Tree
- CP-Algorithms: Mo’s Algorithm
- CSES Distinct Values Queries - Count distinct in range