Given a list paths of directory info, including the directory path and all the files with their contents in that directory, return all the duplicate files in the file system in terms of their paths. You may return the answer in any order.
A group of duplicate files consists of at least two files that have the same content.
A single directory info string in the input list has the following format:
"root/d1/d2/.../dm f1.txt(f1_content) f2.txt(f2_content) ... fn.txt(fn_content)"
It means there are n files (f1.txt, f2.txt ... fn.txt) with content (f1_content, f2_content ... fn_content) respectively in the directory "root/d1/d2/.../dm". Note that n >= 1 and m >= 0. If m = 0, it means the directory is just the root directory.
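As a quick illustration of this format, a single directory info string can be parsed like so (a minimal sketch; the variable names are ours, not part of the problem):

```python
# One directory info string: the directory path followed by
# space-separated "name(content)" entries.
entry = "root/a 1.txt(abcd) 2.txt(efgh)"

parts = entry.split()      # ["root/a", "1.txt(abcd)", "2.txt(efgh)"]
directory = parts[0]       # the directory path

parsed = []
for file_info in parts[1:]:
    name, _, content = file_info.partition("(")  # split name from "(content)"
    parsed.append((directory + "/" + name, content.rstrip(")")))

print(parsed)  # [('root/a/1.txt', 'abcd'), ('root/a/2.txt', 'efgh')]
```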
The output is a list of groups of duplicate file paths. For each group, it contains all the file paths of the files that have the same content. A file path is a string that has the following format:
"directory_path/file_name.txt"
Example 1:
Input: paths = ["root/a 1.txt(abcd) 2.txt(efgh)","root/c 3.txt(abcd)","root/c/d 4.txt(efgh)","root 4.txt(efgh)"]
Output: [["root/a/2.txt","root/c/d/4.txt","root/4.txt"],["root/a/1.txt","root/c/3.txt"]]
Example 2:
Input: paths = ["root/a 1.txt(abcd) 2.txt(efgh)","root/c 3.txt(abcd)","root/c/d 4.txt(efgh)"]
Output: [["root/a/2.txt","root/c/d/4.txt"],["root/a/1.txt","root/c/3.txt"]]
Constraints:
1 <= paths.length <= 2 * 10^4
1 <= paths[i].length <= 3000
1 <= sum(paths[i].length) <= 5 * 10^5
paths[i] consist of English letters, digits, '/', '.', '(', ')', and ' '.
Overview:
This problem asks us to find groups of files within a file system that have identical content. The input is a list of strings, each representing a directory and its files with their content; the output is a list of lists, where each inner list contains the paths of files sharing the same content.
Approach:
The most efficient approach uses a hash map (a dictionary in Python) with file content as keys and lists of file paths as values. We iterate through the input paths, parsing each string to extract the directory path and each file's name and content. The content serves as the key in the hash map, and the corresponding full file path is appended to that key's list. Finally, we keep only the entries whose path list has more than one element, indicating duplicate files.
Time Complexity:
The time complexity is dominated by the iteration through the input paths and the string manipulation within the loop. Splitting, finding indices, and extracting substrings take time proportional to the length of the strings, so the overall time complexity is O(N*M), where N is the number of strings in paths and M is the average length of each string. Hash map insertion and lookup take O(1) time on average.
Space Complexity:
The space complexity is determined by the size of the hash map. Since every file path is stored in the map regardless of how the contents are distributed, the space complexity is O(N*M), where N is the number of strings and M is the average length of a file's path and content. Note that even when all files share the same content, all of the paths are still stored under a single key, so the space usage does not shrink in that case.
Code Explanation (Python):
from collections import defaultdict
from typing import List

class Solution:
    def findDuplicate(self, paths: List[str]) -> List[List[str]]:
        d = defaultdict(list)  # content -> list of full paths; no key-existence checks needed
        for p in paths:
            ps = p.split()  # split the string by spaces
            root_path = ps[0]  # first element is the directory path
            for file_info in ps[1:]:  # the rest are "name(content)" entries
                file_name, file_content = file_info.split('(')  # split by '('
                file_content = file_content[:-1]  # drop the trailing ')'
                full_path = root_path + "/" + file_name  # construct the full path
                d[file_content].append(full_path)  # group the path under its content
        # keep only contents that appear in more than one file
        result = [files for files in d.values() if len(files) > 1]
        return result
The Python code uses a defaultdict to avoid explicit checks for key existence. The split() method handles parsing the input strings, and the final list comprehension concisely filters for duplicates.
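As a sanity check, here is a standalone function version of the same algorithm run on Example 1 (the function name find_duplicate is ours; the class-based version above is what a LeetCode submission would use):

```python
from collections import defaultdict
from typing import List

def find_duplicate(paths: List[str]) -> List[List[str]]:
    d = defaultdict(list)  # content -> list of full file paths
    for p in paths:
        root, *files = p.split()
        for f in files:
            name, _, content = f.partition("(")
            d[content.rstrip(")")].append(root + "/" + name)
    # a group is duplicated only if at least two files share the content
    return [group for group in d.values() if len(group) > 1]

paths = ["root/a 1.txt(abcd) 2.txt(efgh)", "root/c 3.txt(abcd)",
         "root/c/d 4.txt(efgh)", "root 4.txt(efgh)"]
print(find_duplicate(paths))
# Produces the two groups from Example 1 (group order may differ):
# ['root/a/1.txt', 'root/c/3.txt'] and
# ['root/a/2.txt', 'root/c/d/4.txt', 'root/4.txt']
```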
The other languages (Java, C++, Go, TypeScript) follow a similar approach with minor syntactic differences to accommodate their respective language features. The core logic remains consistent across all implementations.