Table: Activity
+--------------+---------+ | Column Name | Type | +--------------+---------+ | player_id | int | | device_id | int | | event_date | date | | games_played | int | +--------------+---------+ (player_id, event_date) is the primary key (combination of columns with unique values) of this table. This table shows the activity of players of some games. Each row is a record of a player who logged in and played a number of games (possibly 0) before logging out on someday using some device.
Write a solution to report the fraction of players that logged in again on the day after the day they first logged in, rounded to 2 decimal places. In other words, you need to count the number of players that logged in for at least two consecutive days starting from their first login date, then divide that number by the total number of players.
The result format is in the following example.
Example 1:
Input: Activity table: +-----------+-----------+------------+--------------+ | player_id | device_id | event_date | games_played | +-----------+-----------+------------+--------------+ | 1 | 2 | 2016-03-01 | 5 | | 1 | 2 | 2016-03-02 | 6 | | 2 | 3 | 2017-06-25 | 1 | | 3 | 1 | 2016-03-02 | 0 | | 3 | 4 | 2018-07-03 | 5 | +-----------+-----------+------------+--------------+ Output: +-----------+ | fraction | +-----------+ | 0.33 | +-----------+ Explanation: Only the player with id 1 logged back in after the first day he had logged in so the answer is 1/3 = 0.33
This problem requires calculating the fraction of players who logged in on consecutive days, starting from their first login. The solution involves finding each player's first login date and then checking if they logged in the following day.
This approach uses two steps:
Find the first login date: We group the Activity
table by player_id
and use the MIN(event_date)
aggregate function to determine each player's first login date. This is stored in a subquery (or a temporary DataFrame in Python).
Left Join and Counting: We perform a left join between the result from step 1 (containing player IDs and their first login dates) and the original Activity
table. The join condition is that player_id
matches and the difference between the first login date and the event_date
in the Activity
table is -1 day (meaning the next day). A left join ensures that all players from the first step are included, even if they didn't log in the next day. By counting the number of non-NULL event_date
values from the joined result, we get the count of players who logged in on the consecutive day. The result is then divided by the total number of unique players to get the fraction.
Time Complexity Analysis:
GROUP BY
operation in the subquery takes O(N log N) time in the worst case, where N is the number of rows in the Activity
table. (This depends on the specific database implementation; some might optimize it better).LEFT JOIN
operation has a time complexity that depends on the database's join algorithm, but is typically O(M log M + N log N) or better (e.g., hash joins can achieve close to linear time in optimal conditions), where M is the number of rows in the subquery result (number of unique players).Therefore, the overall time complexity is dominated by the GROUP BY
and JOIN
operations, resulting in approximately O(N log N) in the worst case, where N is the number of rows in the Activity
table.
Space Complexity: The space complexity is determined by the space required to store intermediate results (the subquery result and the join result). In the worst case, this could be O(N) in the case where all players have unique login dates.
This approach leverages MySQL's window functions for a more concise and potentially efficient solution:
LEAD Function: The LEAD
window function is used to get the next event_date
for each player, ordered by event_date
. The result is compared to the current event_date
to determine if the next login was on the consecutive day (difference of 1). This result is stored in a common table expression (CTE) called T
.
RANK Function: The RANK
window function is used to assign a rank to each player's login events, ordered by event_date
. This allows us to easily identify the first login event for each player (rank 1).
Counting and Aggregation: We count the number of players with rank 1 who also had a consecutive login (the st
column is 1) and divide this by the total number of unique players.
Time Complexity Analysis:
The window functions (LEAD
and RANK
) have a time complexity of O(N log N) in the worst case, where N is the number of rows in the Activity
table. This is because window functions typically require sorting or other operations with logarithmic time complexity. The subsequent counting and aggregation steps have linear time complexity O(N). Therefore, the overall time complexity is dominated by the window functions, resulting in O(N log N).
Space Complexity: The space complexity is determined by the CTE (T
) which can require O(N) space in the worst case to store intermediate results.
The Python code uses pandas for efficient data manipulation, while the MySQL code directly utilizes SQL queries. The comments in the code further clarify each step.
Python (using pandas):
import pandas as pd
def gameplay_analysis(activity: pd.DataFrame) -> pd.DataFrame:
# Find the first login date for each player
activity["first"] = activity.groupby("player_id")["event_date"].transform("min")
# Filter for players who logged in on the day after their first login
activity_2nd_day = activity[activity["first"] + pd.DateOffset(1) == activity["event_date"]]
# Calculate and format the fraction
fraction = round(len(activity_2nd_day) / activity["player_id"].nunique(), 2)
return pd.DataFrame({"fraction": [fraction]})
MySQL (Approach 2 - using window functions):
WITH T AS (
SELECT
player_id,
DATEDIFF(LEAD(event_date) OVER (PARTITION BY player_id ORDER BY event_date), event_date) = 1 AS st, -- Check for consecutive login
RANK() OVER (PARTITION BY player_id ORDER BY event_date) AS rk -- Rank login events
FROM Activity
)
SELECT ROUND(COUNT(IF(st = 1, player_id, NULL)) / COUNT(DISTINCT player_id), 2) AS fraction
FROM T
WHERE rk = 1; -- Consider only the first login event for each player
Both approaches achieve the same result. The choice between them depends on factors like database system capabilities, data size, and performance optimization needs. The window function approach (MySQL) might be more efficient for very large datasets, depending on the database's optimization of window functions. The pandas solution is generally good for medium-sized datasets and offers flexibility in data manipulation.