{x}
blog image

Delete Duplicate Emails

Table: Person

+-------------+---------+
| Column Name | Type    |
+-------------+---------+
| id          | int     |
| email       | varchar |
+-------------+---------+
id is the primary key (column with unique values) for this table.
Each row of this table contains an email. The emails will not contain uppercase letters.

 

Write a solution to delete all duplicate emails, keeping only one unique email with the smallest id.

For SQL users, please note that you are supposed to write a DELETE statement and not a SELECT one.

For Pandas users, please note that you are supposed to modify Person in place.

After running your script, the answer shown is the Person table. The driver will first compile and run your piece of code and then show the Person table. The final order of the Person table does not matter.

The result format is in the following example.

 

Example 1:

Input: 
Person table:
+----+------------------+
| id | email            |
+----+------------------+
| 1  | john@example.com |
| 2  | bob@example.com  |
| 3  | john@example.com |
+----+------------------+
Output: 
+----+------------------+
| id | email            |
+----+------------------+
| 1  | john@example.com |
| 2  | bob@example.com  |
+----+------------------+
Explanation: john@example.com is repeated two times. We keep the row with the smallest Id = 1.

Solution Explanations for Deleting Duplicate Emails

This problem requires deleting duplicate email entries from a Person table, retaining only the entry with the smallest id for each unique email. Several SQL and Python (Pandas) solutions are presented.

Solution 1: Using MIN() and NOT IN (MySQL and Python)

MySQL

DELETE FROM Person
WHERE id NOT IN (SELECT MIN(id) FROM (SELECT * FROM Person) AS p GROUP BY email);

This solution uses a subquery to find the minimum id for each email. The outer query then deletes all rows where the id is not the minimum id for its corresponding email.

  • SELECT MIN(id) FROM (SELECT * FROM Person) AS p GROUP BY email: This subquery selects the minimum id for each group of rows with the same email. The aliasing AS p is for readability.
  • WHERE id NOT IN (...): This condition filters rows in the Person table, deleting only those whose id is not present in the set of minimum id values.

Python (Pandas)

import pandas as pd
 
def delete_duplicate_emails(person: pd.DataFrame) -> None:
    person.sort_values(by="id", ascending=True, inplace=True)
    person.drop_duplicates(subset="email", keep="first", inplace=True)

This Pandas solution leverages the sort_values, drop_duplicates, and inplace methods for efficiency.

  • person.sort_values(by="id", ascending=True, inplace=True): Sorts the DataFrame by the id column in ascending order. This ensures that when duplicates are dropped, the row with the smallest id is retained.
  • person.drop_duplicates(subset="email", keep="first", inplace=True): Drops duplicate rows based on the email column, keeping only the first occurrence (which is the one with the smallest id due to the prior sorting). inplace=True modifies the DataFrame directly.

Time Complexity Analysis (Solution 1):

  • MySQL: The GROUP BY operation in the subquery has a time complexity of O(N log N) or O(N) depending on the database engine's optimization. The DELETE operation then takes O(N) time in the worst case to scan the table. Overall, it's approximately O(N log N).
  • Python (Pandas): sort_values is typically O(N log N), and drop_duplicates is O(N). Therefore, the overall time complexity is O(N log N).

Solution 2: Using ROW_NUMBER() (MySQL)

DELETE FROM Person
WHERE
    id NOT IN (
        SELECT id
        FROM
            (
                SELECT
                    id,
                    ROW_NUMBER() OVER (
                        PARTITION BY email
                        ORDER BY id
                    ) AS rk
                FROM Person
            ) AS p
        WHERE rk = 1
    );

This approach utilizes the ROW_NUMBER() window function to assign a unique rank to each row within each email group, ordered by id. It then deletes rows where the rank is not 1.

  • ROW_NUMBER() OVER (PARTITION BY email ORDER BY id): This assigns a rank to each row based on its id within each group of rows having the same email. The row with the smallest id gets rank 1.
  • WHERE rk = 1: This filters the results to only include rows with rank 1.

Time Complexity Analysis (Solution 2):

The ROW_NUMBER() function has a time complexity of O(N log N) in the worst case due to sorting within partitions. The DELETE operation then takes O(N) time. Overall, it's approximately O(N log N).

Solution 3: Self-Join DELETE (MySQL)

DELETE p2
FROM
    person AS p1
    JOIN person AS p2 ON p1.email = p2.email
WHERE
    p1.id < p2.id;

This uses a self-join to compare rows with the same email. It deletes p2 (the row with the larger id) if a row p1 exists with the same email and a smaller id.

  • JOIN person AS p2 ON p1.email = p2.email: Joins the table to itself on the email column.
  • WHERE p1.id < p2.id: This condition identifies duplicate emails and deletes the row with the larger id.

Time Complexity Analysis (Solution 3):

The self-join has a time complexity of O(N^2) in the worst-case scenario (though often optimized by database systems). The DELETE operation then takes time proportional to the number of rows to delete. Therefore, the worst-case complexity is O(N^2). However, in practice, with appropriate indexing, it could perform much better than the theoretical worst case.

Summary:

Solution 1 (using MIN() and NOT IN) and Solution 2 (ROW_NUMBER()) generally offer better performance than Solution 3 (self-join), especially for larger datasets, because their time complexity is O(N log N) while Solution 3 is O(N^2) in the worst case. The Pandas solution (Solution 1, Python) provides a clean and efficient approach for in-memory data manipulation. The choice of the best solution depends on the specific database system and dataset size. Solution 1 provides the most straightforward and efficient approach in most cases.