This problem asks for the median of a dataset stored in compressed form: a table of numbers and their frequencies. Conceptually, we would have to "decompress" the table, repeating each number according to its frequency, before taking the median. The efficient solution instead uses SQL window functions to find the median without ever constructing the decompressed list. Here's a breakdown of the steps:
1. Calculate cumulative frequencies: SUM(frequency) OVER (ORDER BY num ASC) computes the cumulative frequency from the smallest number upward (rk1), and SUM(frequency) OVER (ORDER BY num DESC) computes it from the largest number downward (rk2). SUM(frequency) OVER () gives the total number of elements (s).
2. Identify median candidates: in the sorted, decompressed list, the median is the middle value (or the average of the two middle values when the total count s is even). A value num is a median candidate exactly when at least half of the elements are less than or equal to it and at least half are greater than or equal to it, i.e. when both rk1 >= s / 2 and rk2 >= s / 2. This picks out the median value(s) without ever materializing the decompressed array (see the worked example after this list).
3. Calculate the median: AVG(num) averages the qualifying value(s) (a single value when the count is odd, the two middle values when it is even), and ROUND(AVG(num), 1) rounds the result to one decimal place, as required.
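As a concrete, hypothetical illustration, suppose the table decompresses to 0, 0, 0, 2, 3, 3, 3 (so s = 7). The cumulative sums would then be:

num | frequency | rk1 | rk2 | s
 0  |     3     |  3  |  7  | 7
 2  |     1     |  4  |  4  | 7
 3  |     3     |  7  |  3  | 7

Only num = 2 satisfies rk1 >= 3.5 AND rk2 >= 3.5 (in MySQL, / performs decimal division, so s / 2 is 3.5), and the median is 2.0. The full query: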
WITH
    t AS (
        SELECT
            *,
            SUM(frequency) OVER (ORDER BY num ASC) AS rk1,  -- elements <= num
            SUM(frequency) OVER (ORDER BY num DESC) AS rk2, -- elements >= num
            SUM(frequency) OVER () AS s                     -- total element count
        FROM Numbers
    )
SELECT
    ROUND(AVG(num), 1) AS median
FROM t
WHERE rk1 >= s / 2 AND rk2 >= s / 2;
The query defines a common table expression (CTE) named t; CTEs help organize complex queries by breaking them down into smaller, manageable parts. The CTE reads the Numbers table and computes the cumulative sums, and the outer query averages the num values where both rk1 and rk2 are greater than or equal to half the total sum, which is exactly the median, rounded to one decimal place.

The time complexity of the query is dominated by the window-function calculations. Window functions typically cost O(N log N) in the worst case (due to sorting), where N is the number of rows in the Numbers table, although database optimizers may use more efficient algorithms in practice. The subsequent filtering and averaging steps are linear, O(N), so the overall time complexity is effectively O(N log N).
The space complexity depends on the size of the intermediate CTE (t), which is proportional to the number of rows in the input table. Hence, the space complexity is O(N).
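To try the query end to end, here is a minimal, hypothetical setup with an even element count, where the filter keeps two rows and AVG averages them (the column names follow the problem's Numbers schema):

-- Hypothetical test data: the decompressed list is 1, 2, 3, 4, so s = 4.
CREATE TABLE Numbers (num INT, frequency INT);
INSERT INTO Numbers (num, frequency) VALUES (1, 1), (2, 1), (3, 1), (4, 1);

-- Running the query above: num = 2 (rk1 = 2, rk2 = 3) and num = 3 (rk1 = 3, rk2 = 2)
-- both satisfy rk1 >= 2 AND rk2 >= 2, so the result is the average of 2 and 3, i.e. 2.5.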