{x}
blog image

Game Play Analysis IV

Table: Activity

+--------------+---------+
| Column Name  | Type    |
+--------------+---------+
| player_id    | int     |
| device_id    | int     |
| event_date   | date    |
| games_played | int     |
+--------------+---------+
(player_id, event_date) is the primary key (combination of columns with unique values) of this table.
This table shows the activity of players of some games.
Each row is a record of a player who logged in and played a number of games (possibly 0) before logging out on someday using some device.

 

Write a solution to report the fraction of players that logged in again on the day after the day they first logged in, rounded to 2 decimal places. In other words, you need to count the number of players that logged in for at least two consecutive days starting from their first login date, then divide that number by the total number of players.

The result format is in the following example.

 

Example 1:

Input: 
Activity table:
+-----------+-----------+------------+--------------+
| player_id | device_id | event_date | games_played |
+-----------+-----------+------------+--------------+
| 1         | 2         | 2016-03-01 | 5            |
| 1         | 2         | 2016-03-02 | 6            |
| 2         | 3         | 2017-06-25 | 1            |
| 3         | 1         | 2016-03-02 | 0            |
| 3         | 4         | 2018-07-03 | 5            |
+-----------+-----------+------------+--------------+
Output: 
+-----------+
| fraction  |
+-----------+
| 0.33      |
+-----------+
Explanation: 
Only the player with id 1 logged back in after the first day he had logged in so the answer is 1/3 = 0.33

Solution Explanation for LeetCode 550: Game Play Analysis IV

This problem requires calculating the fraction of players who logged in on consecutive days, starting from their first login. The solution involves finding each player's first login date and then checking if they logged in the following day.

Approach 1: Grouping and Minimum Value + Left Join

This approach uses two steps:

  1. Find the first login date: We group the Activity table by player_id and use the MIN(event_date) aggregate function to determine each player's first login date. This is stored in a subquery (or a temporary DataFrame in Python).

  2. Left Join and Counting: We perform a left join between the result from step 1 (containing player IDs and their first login dates) and the original Activity table. The join condition is that player_id matches and the difference between the first login date and the event_date in the Activity table is -1 day (meaning the next day). A left join ensures that all players from the first step are included, even if they didn't log in the next day. By counting the number of non-NULL event_date values from the joined result, we get the count of players who logged in on the consecutive day. The result is then divided by the total number of unique players to get the fraction.

Time Complexity Analysis:

  • The GROUP BY operation in the subquery takes O(N log N) time in the worst case, where N is the number of rows in the Activity table. (This depends on the specific database implementation; some might optimize it better).
  • The LEFT JOIN operation has a time complexity that depends on the database's join algorithm, but is typically O(M log M + N log N) or better (e.g., hash joins can achieve close to linear time in optimal conditions), where M is the number of rows in the subquery result (number of unique players).
  • The final count and division operations take O(M) time.

Therefore, the overall time complexity is dominated by the GROUP BY and JOIN operations, resulting in approximately O(N log N) in the worst case, where N is the number of rows in the Activity table.

Space Complexity: The space complexity is determined by the space required to store intermediate results (the subquery result and the join result). In the worst case, this could be O(N) in the case where all players have unique login dates.

Approach 2: Window Function (MySQL)

This approach leverages MySQL's window functions for a more concise and potentially efficient solution:

  1. LEAD Function: The LEAD window function is used to get the next event_date for each player, ordered by event_date. The result is compared to the current event_date to determine if the next login was on the consecutive day (difference of 1). This result is stored in a common table expression (CTE) called T.

  2. RANK Function: The RANK window function is used to assign a rank to each player's login events, ordered by event_date. This allows us to easily identify the first login event for each player (rank 1).

  3. Counting and Aggregation: We count the number of players with rank 1 who also had a consecutive login (the st column is 1) and divide this by the total number of unique players.

Time Complexity Analysis:

The window functions (LEAD and RANK) have a time complexity of O(N log N) in the worst case, where N is the number of rows in the Activity table. This is because window functions typically require sorting or other operations with logarithmic time complexity. The subsequent counting and aggregation steps have linear time complexity O(N). Therefore, the overall time complexity is dominated by the window functions, resulting in O(N log N).

Space Complexity: The space complexity is determined by the CTE (T) which can require O(N) space in the worst case to store intermediate results.

Code Implementation (Python and MySQL)

The Python code uses pandas for efficient data manipulation, while the MySQL code directly utilizes SQL queries. The comments in the code further clarify each step.

Python (using pandas):

import pandas as pd
 
def gameplay_analysis(activity: pd.DataFrame) -> pd.DataFrame:
    # Find the first login date for each player
    activity["first"] = activity.groupby("player_id")["event_date"].transform("min")
    
    # Filter for players who logged in on the day after their first login
    activity_2nd_day = activity[activity["first"] + pd.DateOffset(1) == activity["event_date"]]
 
    # Calculate and format the fraction
    fraction = round(len(activity_2nd_day) / activity["player_id"].nunique(), 2)
    return pd.DataFrame({"fraction": [fraction]})
 

MySQL (Approach 2 - using window functions):

WITH T AS (
    SELECT
        player_id,
        DATEDIFF(LEAD(event_date) OVER (PARTITION BY player_id ORDER BY event_date), event_date) = 1 AS st,  -- Check for consecutive login
        RANK() OVER (PARTITION BY player_id ORDER BY event_date) AS rk  -- Rank login events
    FROM Activity
)
SELECT ROUND(COUNT(IF(st = 1, player_id, NULL)) / COUNT(DISTINCT player_id), 2) AS fraction
FROM T
WHERE rk = 1;  -- Consider only the first login event for each player

Both approaches achieve the same result. The choice between them depends on factors like database system capabilities, data size, and performance optimization needs. The window function approach (MySQL) might be more efficient for very large datasets, depending on the database's optimization of window functions. The pandas solution is generally good for medium-sized datasets and offers flexibility in data manipulation.