Table: Person
+-------------+---------+
| Column Name | Type    |
+-------------+---------+
| id          | int     |
| email       | varchar |
+-------------+---------+

id is the primary key (column with unique values) for this table.
Each row of this table contains an email. The emails will not contain uppercase letters.
Write a solution to report all the duplicate emails. Note that it's guaranteed that the email field is not NULL.
Return the result table in any order.
The result format is in the following example.
Example 1:
Input:
Person table:
+----+---------+
| id | email   |
+----+---------+
| 1  | a@b.com |
| 2  | c@d.com |
| 3  | a@b.com |
+----+---------+
Output:
+---------+
| Email   |
+---------+
| a@b.com |
+---------+
Explanation: a@b.com is repeated two times.
The problem asks us to find and report all duplicate emails in the `Person` table. Two efficient approaches are presented: using `GROUP BY` with `HAVING`, and using a self-join.
`GROUP BY` and `HAVING`

This approach leverages SQL's aggregate functions.
1. `GROUP BY email`: This clause groups the rows of the `Person` table by the values in the `email` column, so all rows with the same email address fall into the same group.
2. `COUNT(*)`: This counts the number of rows in each group, i.e., the number of times each email address appears. `COUNT(1)` is equivalent to `COUNT(*)`; most modern database systems treat the two identically.
3. `HAVING COUNT(*) > 1`: This clause filters the groups, keeping only email addresses whose count is greater than 1, i.e., those that appear more than once.
4. `SELECT email`: This returns the `email` column for the groups that satisfy the `HAVING` condition, producing the list of duplicate emails.
Code:
MySQL:
SELECT email
FROM Person
GROUP BY email
HAVING COUNT(*) > 1;
Python (using pandas):
import pandas as pd
def duplicate_emails(person: pd.DataFrame) -> pd.DataFrame:
    # keep=False marks every occurrence of a duplicated email
    dup = person[person.duplicated(subset=['email'], keep=False)]
    # keep it as a DataFrame, collapse repeats, and match the expected output header
    return dup[['email']].drop_duplicates().rename(columns={'email': 'Email'})
The pandas solution first identifies all rows with duplicate emails using `duplicated(subset=['email'], keep=False)`; `keep=False` marks all occurrences of duplicates. It then selects the `email` column, removes the remaining repeats with `.drop_duplicates()`, and renames the column to match the expected output header.
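As a quick sanity check, here is a minimal sketch that runs `duplicate_emails` against the example data above (the DataFrame construction is illustrative, not part of the problem's test harness):

```python
import pandas as pd

# Example data from the problem statement
person = pd.DataFrame({'id': [1, 2, 3],
                       'email': ['a@b.com', 'c@d.com', 'a@b.com']})

print(duplicate_emails(person))
#      Email
# 0  a@b.com
```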
Time Complexity: The `GROUP BY` and `HAVING` approach is O(N log N) in the worst case, where N is the number of rows in the `Person` table, due to the sorting inherent in a `GROUP BY` operation. In practice, many database systems use hash aggregation or an index on the grouped column, so actual performance is often much faster.
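For readers who want the SQL mirrored one-to-one in pandas, here is a sketch of `GROUP BY ... HAVING` expressed with `groupby` and `size` (an alternative to the `duplicated`-based solution above; the frame and column names are illustrative):

```python
import pandas as pd

person = pd.DataFrame({'id': [1, 2, 3],
                       'email': ['a@b.com', 'c@d.com', 'a@b.com']})

# GROUP BY email: count rows per email (COUNT(*))
counts = person.groupby('email').size().reset_index(name='cnt')
# HAVING COUNT(*) > 1: keep emails that appear more than once
dups = counts.loc[counts['cnt'] > 1, ['email']].rename(columns={'email': 'Email'})
print(dups)
#      Email
# 0  a@b.com
```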
Self-Join

This approach uses a self-join to compare each row with every other row.
1. `SELECT DISTINCT p1.email`: This selects the email address from the first copy of the table; `DISTINCT` ensures each duplicate email appears only once in the result.
2. `FROM person AS p1, person AS p2`: This creates a self-join, aliasing the `Person` table as `p1` and `p2`. It forms a Cartesian product, effectively comparing every row in `p1` with every row in `p2`.
3. `WHERE p1.id != p2.id AND p1.email = p2.email`: This is the crucial filtering condition. It ensures that:
   * `p1.id != p2.id`: we are not comparing a row to itself;
   * `p1.email = p2.email`: we only keep pairs of rows with the same email address.
Code (MySQL):
SELECT DISTINCT p1.email
FROM person AS p1, person AS p2
WHERE p1.id != p2.id AND p1.email = p2.email;
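The self-join idea can also be sketched in pandas with `merge`, joining the frame to itself on `email` and discarding self-matches (illustrative only; `id_x`/`id_y` are pandas' default suffixes for overlapping columns):

```python
import pandas as pd

person = pd.DataFrame({'id': [1, 2, 3],
                       'email': ['a@b.com', 'c@d.com', 'a@b.com']})

# Self-join on email (p1.email = p2.email), then drop rows matched to themselves
pairs = person.merge(person, on='email')
dups = pairs.loc[pairs['id_x'] != pairs['id_y'], ['email']].drop_duplicates()
print(dups.rename(columns={'email': 'Email'}))
#      Email
# 1  a@b.com
```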
Time Complexity: The self-join approach is O(N^2) in the worst case, where N is the number of rows in the `Person` table, because it compares every row with every other row. This makes it less efficient than the `GROUP BY` approach for large datasets.
Summary:
| Approach            | Time Complexity | Space Complexity | Notes                                    |
|---------------------|-----------------|------------------|------------------------------------------|
| GROUP BY and HAVING | O(N log N)      | O(N)             | More efficient for large datasets        |
| Self-Join           | O(N^2)          | O(N)             | Less efficient, but conceptually simpler |
The `GROUP BY` and `HAVING` approach is generally preferred for its better time complexity, especially when dealing with large datasets. However, the self-join approach can be easier to understand for those less familiar with aggregate functions.