Table: Person
+-------------+---------+ | Column Name | Type | +-------------+---------+ | id | int | | email | varchar | +-------------+---------+ id is the primary key (column with unique values) for this table. Each row of this table contains an email. The emails will not contain uppercase letters.
Write a solution to delete all duplicate emails, keeping only one unique email with the smallest id
.
For SQL users, please note that you are supposed to write a DELETE
statement and not a SELECT
one.
For Pandas users, please note that you are supposed to modify Person
in place.
After running your script, the answer shown is the Person
table. The driver will first compile and run your piece of code and then show the Person
table. The final order of the Person
table does not matter.
The result format is in the following example.
Example 1:
Input: Person table: +----+------------------+ | id | email | +----+------------------+ | 1 | john@example.com | | 2 | bob@example.com | | 3 | john@example.com | +----+------------------+ Output: +----+------------------+ | id | email | +----+------------------+ | 1 | john@example.com | | 2 | bob@example.com | +----+------------------+ Explanation: john@example.com is repeated two times. We keep the row with the smallest Id = 1.
This problem requires deleting duplicate email entries from a Person
table, retaining only the entry with the smallest id
for each unique email. Several SQL and Python (Pandas) solutions are presented.
MIN()
and NOT IN
(MySQL and Python)DELETE FROM Person
WHERE id NOT IN (SELECT MIN(id) FROM (SELECT * FROM Person) AS p GROUP BY email);
This solution uses a subquery to find the minimum id
for each email. The outer query then deletes all rows where the id
is not the minimum id
for its corresponding email.
SELECT MIN(id) FROM (SELECT * FROM Person) AS p GROUP BY email
: This subquery selects the minimum id
for each group of rows with the same email. The aliasing AS p
is for readability.WHERE id NOT IN (...)
: This condition filters rows in the Person
table, deleting only those whose id
is not present in the set of minimum id
values.import pandas as pd
def delete_duplicate_emails(person: pd.DataFrame) -> None:
person.sort_values(by="id", ascending=True, inplace=True)
person.drop_duplicates(subset="email", keep="first", inplace=True)
This Pandas solution leverages the sort_values
, drop_duplicates
, and inplace
methods for efficiency.
person.sort_values(by="id", ascending=True, inplace=True)
: Sorts the DataFrame by the id
column in ascending order. This ensures that when duplicates are dropped, the row with the smallest id
is retained.person.drop_duplicates(subset="email", keep="first", inplace=True)
: Drops duplicate rows based on the email
column, keeping only the first occurrence (which is the one with the smallest id
due to the prior sorting). inplace=True
modifies the DataFrame directly.Time Complexity Analysis (Solution 1):
GROUP BY
operation in the subquery has a time complexity of O(N log N) or O(N) depending on the database engine's optimization. The DELETE
operation then takes O(N) time in the worst case to scan the table. Overall, it's approximately O(N log N).sort_values
is typically O(N log N), and drop_duplicates
is O(N). Therefore, the overall time complexity is O(N log N).ROW_NUMBER()
(MySQL)DELETE FROM Person
WHERE
id NOT IN (
SELECT id
FROM
(
SELECT
id,
ROW_NUMBER() OVER (
PARTITION BY email
ORDER BY id
) AS rk
FROM Person
) AS p
WHERE rk = 1
);
This approach utilizes the ROW_NUMBER()
window function to assign a unique rank to each row within each email group, ordered by id
. It then deletes rows where the rank is not 1.
ROW_NUMBER() OVER (PARTITION BY email ORDER BY id)
: This assigns a rank to each row based on its id
within each group of rows having the same email. The row with the smallest id
gets rank 1.WHERE rk = 1
: This filters the results to only include rows with rank 1.Time Complexity Analysis (Solution 2):
The ROW_NUMBER()
function has a time complexity of O(N log N) in the worst case due to sorting within partitions. The DELETE
operation then takes O(N) time. Overall, it's approximately O(N log N).
DELETE p2
FROM
person AS p1
JOIN person AS p2 ON p1.email = p2.email
WHERE
p1.id < p2.id;
This uses a self-join to compare rows with the same email. It deletes p2
(the row with the larger id
) if a row p1
exists with the same email and a smaller id
.
JOIN person AS p2 ON p1.email = p2.email
: Joins the table to itself on the email column.WHERE p1.id < p2.id
: This condition identifies duplicate emails and deletes the row with the larger id
.Time Complexity Analysis (Solution 3):
The self-join has a time complexity of O(N^2) in the worst-case scenario (though often optimized by database systems). The DELETE
operation then takes time proportional to the number of rows to delete. Therefore, the worst-case complexity is O(N^2). However, in practice, with appropriate indexing, it could perform much better than the theoretical worst case.
Summary:
Solution 1 (using MIN()
and NOT IN
) and Solution 2 (ROW_NUMBER()
) generally offer better performance than Solution 3 (self-join), especially for larger datasets, because their time complexity is O(N log N) while Solution 3 is O(N^2) in the worst case. The Pandas solution (Solution 1, Python) provides a clean and efficient approach for in-memory data manipulation. The choice of the best solution depends on the specific database system and dataset size. Solution 1 provides the most straightforward and efficient approach in most cases.