
Find Duplicate File in System

Given a list paths of directory info, including the directory path, and all the files with contents in this directory, return all the duplicate files in the file system in terms of their paths. You may return the answer in any order.

A group of duplicate files consists of at least two files that have the same content.

A single directory info string in the input list has the following format:

  • "root/d1/d2/.../dm f1.txt(f1_content) f2.txt(f2_content) ... fn.txt(fn_content)"

It means there are n files (f1.txt, f2.txt ... fn.txt) with content (f1_content, f2_content ... fn_content) respectively in the directory "root/d1/d2/.../dm". Note that n >= 1 and m >= 0. If m = 0, it means the directory is just the root directory.

The output is a list of groups of duplicate file paths. For each group, it contains all the file paths of the files that have the same content. A file path is a string that has the following format:

  • "directory_path/file_name.txt"

 

Example 1:

Input: paths = ["root/a 1.txt(abcd) 2.txt(efgh)","root/c 3.txt(abcd)","root/c/d 4.txt(efgh)","root 4.txt(efgh)"]
Output: [["root/a/2.txt","root/c/d/4.txt","root/4.txt"],["root/a/1.txt","root/c/3.txt"]]

Example 2:

Input: paths = ["root/a 1.txt(abcd) 2.txt(efgh)","root/c 3.txt(abcd)","root/c/d 4.txt(efgh)"]
Output: [["root/a/2.txt","root/c/d/4.txt"],["root/a/1.txt","root/c/3.txt"]]

 

Constraints:

  • 1 <= paths.length <= 2 * 10^4
  • 1 <= paths[i].length <= 3000
  • 1 <= sum(paths[i].length) <= 5 * 10^5
  • paths[i] consist of English letters, digits, '/', '.', '(', ')', and ' '.
  • You may assume no files or directories share the same name in the same directory.
  • You may assume each given directory info represents a unique directory. A single blank space separates the directory path and file info.

 

Follow up:

  • Imagine you are given a real file system, how will you search files? DFS or BFS?
  • If the file content is very large (GB level), how will you modify your solution?
  • If you can only read the file by 1kb each time, how will you modify your solution?
  • What is the time complexity of your modified solution? What is the most time-consuming part and memory-consuming part of it? How to optimize?
  • How to make sure the duplicated files you find are not false positive?
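
For the large-file and 1 KB-read follow-ups, a common direction is to compare a fixed-size digest of each file instead of its full content, computing the digest incrementally in small chunks. The sketch below assumes a hypothetical helper hash_file_in_chunks and a 1024-byte chunk size; neither is part of the problem statement:

import hashlib

def hash_file_in_chunks(path, chunk_size=1024):
    # Illustrative helper: read the file 1 KB at a time and feed each chunk
    # into an incremental hash, so the full content never sits in memory.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            h.update(chunk)
    return h.hexdigest()

# Files would then be grouped by digest; to avoid false positives from hash
# collisions, candidate groups can be confirmed with a byte-by-byte comparison.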

Solution Explanation for Find Duplicate File in System

This problem involves finding groups of files within a file system that have identical content. The input is a list of strings, each representing a directory and its files with their content. The output is a list of lists, where each inner list contains the paths of files with the same content.

Approach:

The most efficient approach uses a hash map (or dictionary in Python) to store file content as keys and a list of file paths as values. We iterate through the input paths, parsing each string to extract the directory path, file name, and file content. The file content is used as the key in the hash map, and the corresponding file path is added to the value list. Finally, we filter the hash map to include only entries where the value list (paths) has more than one element, indicating duplicate files.
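
For example, after processing the input from Example 1, the content-to-paths map would contain the following entries (sketched here as a plain dict):

d = {
    "abcd": ["root/a/1.txt", "root/c/3.txt"],
    "efgh": ["root/a/2.txt", "root/c/d/4.txt", "root/4.txt"],
}
# Both "abcd" and "efgh" map to two or more paths, so both groups appear in the output.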

Time Complexity:

The time complexity is dominated by iterating over paths and parsing each string: splitting and slicing a string take time proportional to its length, and hash map insertion and lookup are O(1) on average. The overall time complexity is therefore O(N*M), where N is the number of strings in paths and M is the average length of each string; in other words, linear in the total input size.

Space Complexity:

The space complexity is determined by the hash map, which stores each distinct content string as a key and every file path in its value lists. In the worst case this is O(N*M), where N is the number of strings and M is the average length of a path plus its content, i.e., proportional to the total input size. Note that even if all files share the same content, every path must still be stored, so the space used remains linear in the number of files.

Code Explanation (Python):

from collections import defaultdict
from typing import List
 
class Solution:
    def findDuplicate(self, paths: List[str]) -> List[List[str]]:
        d = defaultdict(list)  # maps file content -> list of full file paths
        for p in paths:
            ps = p.split()  # first token is the directory path, the rest are "name(content)" entries
            root_path = ps[0]
            for file_info in ps[1:]:
                file_name, file_content = file_info.split('(')
                file_content = file_content[:-1]  # drop the trailing ')'
                full_path = root_path + "/" + file_name
                d[file_content].append(full_path)
 
        result = [files for files in d.values() if len(files) > 1]  # keep only groups with duplicates
        return result

The Python code uses a defaultdict to avoid explicit key-existence checks, split() to parse the input strings, and a list comprehension to filter out content that appears in only one file.
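
A quick sanity check against Example 1 (the driver code below is illustrative and not part of the original solution):

paths = ["root/a 1.txt(abcd) 2.txt(efgh)", "root/c 3.txt(abcd)",
         "root/c/d 4.txt(efgh)", "root 4.txt(efgh)"]
print(Solution().findDuplicate(paths))
# Prints both duplicate groups (group order may differ from the example):
# [['root/a/1.txt', 'root/c/3.txt'], ['root/a/2.txt', 'root/c/d/4.txt', 'root/4.txt']]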

The other languages (Java, C++, Go, TypeScript) follow a similar approach with minor syntactic differences to accommodate their respective language features. The core logic remains consistent across all implementations.