
1236. Web Crawler

This problem requires implementing a web crawler that explores links within the same hostname as the starting URL. The solution uses Depth-First Search (DFS) to traverse the web pages.

Approach

  1. Hostname Extraction: A helper function host(url) extracts the hostname from a given URL by splitting it on "/": the hostname is the segment after the "http://" or "https://" scheme and before the next "/".

  2. Depth-First Search (DFS): A recursive function dfs(url) performs the crawling.

    • Base Case: If the URL has already been visited, the function returns.
    • Mark as Visited: The current URL is marked as visited to prevent cycles.
    • Add to Result: The current URL is added to the result set.
    • Explore Neighbors: The HtmlParser.getUrls(url) function is called to get links from the current page. For each neighbor link:
      • If the neighbor link's hostname is the same as the current URL's hostname, the dfs function is recursively called on the neighbor.
  3. Initialization and Return: The crawl function initializes a set to store visited URLs and calls dfs on the starting URL. Finally, it converts the set of visited URLs into a list and returns it.
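As a quick illustration of step 1, the hostname extraction can be sketched like this (a minimal standalone version of the host helper used in the solution below):

```python
def host(url: str) -> str:
    # Splitting "http://news.yahoo.com/news" on "/" yields
    # ['http:', '', 'news.yahoo.com', 'news'], so index 2 is the hostname.
    return url.split('/')[2]

print(host("http://news.yahoo.com/news/topics/"))  # → news.yahoo.com
```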

Time and Space Complexity

  • Time Complexity: O(V + E), where V is the number of unique URLs on the target hostname and E is the number of links between them. In the worst case, the algorithm visits every URL once and follows every link once.

  • Space Complexity: O(V), where V is the number of unique URLs. The space is used primarily to store the set of visited URLs; the recursion stack can also reach O(V) depth in the worst case.

Code Implementation (Python)

from typing import List

class Solution:
    def crawl(self, startUrl: str, htmlParser: 'HtmlParser') -> List[str]:
        def host(url):
            # "http://news.yahoo.com/news".split('/') gives
            # ['http:', '', 'news.yahoo.com', 'news'], so index 2 is the hostname.
            return url.split('/')[2]

        def dfs(url):
            if url in visited:
                return  # already crawled; prevents cycles
            visited.add(url)
            result.append(url)
            for next_url in htmlParser.getUrls(url):
                # Only follow links on the same hostname as the start URL.
                if host(next_url) == host(startUrl):
                    dfs(next_url)

        visited = set()
        result = []
        dfs(startUrl)
        return result
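On very deep link chains, the recursive DFS can hit Python's default recursion limit. A hedged sketch of an equivalent iterative BFS version is shown below; the FakeHtmlParser stub is purely illustrative (on LeetCode, the real HtmlParser is supplied by the judge):

```python
from collections import deque
from typing import List

def crawl_bfs(startUrl: str, htmlParser) -> List[str]:
    # Same same-hostname rule as the DFS version, but with an explicit queue
    # instead of the call stack.
    host = lambda url: url.split('/')[2]
    target = host(startUrl)
    visited = {startUrl}
    queue = deque([startUrl])
    while queue:
        url = queue.popleft()
        for nxt in htmlParser.getUrls(url):
            if host(nxt) == target and nxt not in visited:
                visited.add(nxt)
                queue.append(nxt)
    return list(visited)

# Illustrative stand-in for LeetCode's HtmlParser, backed by a dict.
class FakeHtmlParser:
    def __init__(self, graph):
        self.graph = graph
    def getUrls(self, url):
        return self.graph.get(url, [])
```

Both versions visit the same set of URLs; only the traversal order differs.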

Note: url.split('/')[2] raises an IndexError if the URL has no host (for example, a relative link). A more robust approach is to use a dedicated URL parsing library such as urllib.parse. The split-based version does, however, handle both http:// and https:// prefixes, since the hostname is always the third "/"-separated segment.
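For example, Python's standard library urllib.parse handles schemes, ports, and other edge cases more gracefully than manual splitting:

```python
from urllib.parse import urlparse

def host(url: str) -> str:
    # urlparse understands schemes, userinfo, and ports; .hostname
    # returns None for URLs with no network location, hence the fallback.
    return urlparse(url).hostname or ""

print(host("https://news.yahoo.com:8080/news"))  # → news.yahoo.com
```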

Code Implementation (Other Languages)

The approach is essentially identical across languages; the main differences lie in syntax and in the standard-library facilities used for URL parsing and set operations. Java, C++, and Go implementations follow the same DFS logic, with minor variations in how they handle sets and string splitting.