{x}
blog image

Find Median Given Frequency of Numbers

Solution Explanation: Find Median Given Frequency of Numbers

This problem requires calculating the median from a table where numbers and their frequencies are stored. We can't directly compute the median from the aggregated data; we need to "decompress" the table first, expanding each number according to its frequency. The efficient solution uses window functions in SQL to avoid explicit decompression.

Approach

The core idea is to leverage SQL's window functions to efficiently find the median without explicitly constructing the decompressed list of numbers. Here's a breakdown of the steps:

  1. Calculate Cumulative Frequencies: We use the SUM() OVER (ORDER BY num ASC) function to compute the cumulative frequency from the smallest number upwards (rk1). Similarly, SUM() OVER (ORDER BY num DESC) calculates cumulative frequency from the largest number downwards (rk2). The SUM() OVER() calculates the total sum of frequencies (s).

  2. Identify Median Candidates: The median is the middle value (or average of two middle values) in a sorted dataset. If the total number of elements (s) is even, the median is the average of the two middle values. We find the rows where the cumulative frequencies from both ends (rk1 and rk2) are greater than or equal to half of the total count (s/2). This efficiently identifies the median(s) without explicitly creating the entire decompressed array.

  3. Calculate Median: Finally, we use AVG(num) to compute the average of the identified number(s). ROUND(AVG(num), 1) rounds the result to one decimal place, as required.

MySQL Code Explained

WITH
    t AS (
        SELECT
            *,
            SUM(frequency) OVER (ORDER BY num ASC) AS rk1,
            SUM(frequency) OVER (ORDER BY num DESC) AS rk2,
            SUM(frequency) OVER () AS s
        FROM Numbers
    )
SELECT
    ROUND(AVG(num), 1) AS median
FROM t
WHERE rk1 >= s / 2 AND rk2 >= s / 2;
  • WITH t AS (...): This defines a common table expression (CTE) named t. CTEs help organize complex queries by breaking them down into smaller, manageable parts.
  • *SELECT , ... FROM Numbers: This selects all columns from the Numbers table and calculates the cumulative sums.
  • SUM(frequency) OVER (ORDER BY num ASC) AS rk1: This is the cumulative sum of frequencies ordered ascendingly.
  • SUM(frequency) OVER (ORDER BY num DESC) AS rk2: This is the cumulative sum of frequencies ordered descendingly.
  • SUM(frequency) OVER () AS s: This calculates the total sum of all frequencies.
  • SELECT ROUND(AVG(num), 1) AS median FROM t WHERE rk1 >= s / 2 AND rk2 >= s / 2: This selects the average of the num column where both rk1 and rk2 are greater than or equal to half the total sum, effectively finding the median. The result is rounded to one decimal place.

Time Complexity Analysis

The time complexity of this SQL query is dominated by the window function calculations. Window functions typically have a time complexity of O(N log N) in the worst case (due to sorting), where N is the number of rows in the Numbers table. However, database optimizers may use more efficient algorithms, potentially reducing the complexity in practice. The subsequent filtering and averaging steps have a linear time complexity, O(N). Therefore, the overall time complexity is effectively O(N log N).

Space Complexity Analysis

The space complexity depends on the size of the intermediate CTE (t), which is proportional to the number of rows in the input table. Hence, the space complexity is O(N).