
1236. Web Crawler

This problem requires implementing a web crawler that explores links within the same hostname as the starting URL. The solution uses Depth-First Search (DFS) to traverse the web pages.

Approach

  1. Hostname Extraction: A helper function host(url) extracts the hostname from a given URL by splitting it on "/": the hostname is the segment after the "http://" or "https://" scheme and before the next "/".

  2. Depth-First Search (DFS): A recursive function dfs(url) performs the crawling.

    • Base Case: If the URL has already been visited, the function returns.
    • Mark as Visited: The current URL is marked as visited to prevent cycles.
    • Add to Result: The current URL is added to the result set.
    • Explore Neighbors: The HtmlParser.getUrls(url) function is called to get links from the current page. For each neighbor link:
      • If the neighbor link's hostname is the same as the current URL's hostname, the dfs function is recursively called on the neighbor.
  3. Initialization and Return: The crawl function initializes a set to store visited URLs and calls dfs on the starting URL. Finally, it converts the set of visited URLs into a list and returns it.
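As a quick illustration of step 1, the hostname extraction can be sketched like this (a minimal standalone version of the host helper used in the solution below):

```python
def host(url: str) -> str:
    # Splitting "http://news.yahoo.com/news" on "/" yields
    # ['http:', '', 'news.yahoo.com', 'news'], so index 2 is the hostname.
    return url.split('/')[2]

print(host("http://news.yahoo.com/news/topics/"))  # → news.yahoo.com
```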

Time and Space Complexity

  • Time Complexity: O(V + E), where V is the number of unique URLs on the target hostname and E is the number of links between them. In the worst case, the algorithm visits every URL once and follows every link once.

  • Space Complexity: O(V), where V is the number of unique URLs. The space is used primarily to store the set of visited URLs; the recursion stack can also reach O(V) depth in the worst case.

Code Implementation (Python)

from typing import List

class Solution:
    def crawl(self, startUrl: str, htmlParser: 'HtmlParser') -> List[str]:
        def host(url):
            # "http://news.yahoo.com/news".split('/') gives
            # ['http:', '', 'news.yahoo.com', 'news'], so index 2 is the hostname.
            return url.split('/')[2]

        def dfs(url):
            if url in visited:
                return  # already crawled; prevents cycles
            visited.add(url)
            result.append(url)
            for next_url in htmlParser.getUrls(url):
                # Only follow links on the same hostname as the start URL.
                if host(next_url) == host(startUrl):
                    dfs(next_url)

        visited = set()
        result = []
        dfs(startUrl)
        return result
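On very deep link chains, the recursive DFS can hit Python's default recursion limit. A hedged sketch of an equivalent iterative BFS version is shown below; the FakeHtmlParser stub is purely illustrative (on LeetCode, the real HtmlParser is supplied by the judge):

```python
from collections import deque
from typing import List

def crawl_bfs(startUrl: str, htmlParser) -> List[str]:
    # Same same-hostname rule as the DFS version, but with an explicit queue
    # instead of the call stack.
    host = lambda url: url.split('/')[2]
    target = host(startUrl)
    visited = {startUrl}
    queue = deque([startUrl])
    while queue:
        url = queue.popleft()
        for nxt in htmlParser.getUrls(url):
            if host(nxt) == target and nxt not in visited:
                visited.add(nxt)
                queue.append(nxt)
    return list(visited)

# Illustrative stand-in for LeetCode's HtmlParser, backed by a dict.
class FakeHtmlParser:
    def __init__(self, graph):
        self.graph = graph
    def getUrls(self, url):
        return self.graph.get(url, [])
```

Both versions visit the same set of URLs; only the traversal order differs.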

Note: url.split('/')[2] raises an IndexError if the URL has no host (for example, a relative link). A more robust approach is to use a dedicated URL parsing library such as urllib.parse. The split-based version does, however, handle both http:// and https:// prefixes, since the hostname is always the third "/"-separated segment.
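For example, Python's standard library urllib.parse handles schemes, ports, and other edge cases more gracefully than manual splitting:

```python
from urllib.parse import urlparse

def host(url: str) -> str:
    # urlparse understands schemes, userinfo, and ports; .hostname
    # returns None for URLs with no network location, hence the fallback.
    return urlparse(url).hostname or ""

print(host("https://news.yahoo.com:8080/news"))  # → news.yahoo.com
```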

Code Implementation (Other Languages)

The approach is essentially identical across languages; the main differences lie in syntax and in the standard-library facilities used for URL parsing and set operations. Java, C++, and Go implementations follow the same DFS logic, with minor variations in how they handle sets and string splitting.