Problem Statement
Web Crawler LeetCode Solution – Given a URL startUrl and an interface HtmlParser, implement a web crawler to crawl all links that are under the same hostname as startUrl.
Return all URLs obtained by your web crawler in any order.
Your crawler should:
- Start from the page: startUrl.
- Call HtmlParser.getUrls(url) to get all URLs from a webpage of the given URL.
- Do not crawl the same link twice.
- Explore only the links that are under the same hostname as startUrl.
For a URL such as http://example.org/test, the hostname is example.org. For simplicity's sake, you may assume all URLs use the HTTP protocol without any port specified. For example, the URLs http://leetcode.com/problems and http://leetcode.com/contest are under the same hostname, while the URLs http://example.org/test and http://example.com/abc are not under the same hostname.
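As a quick illustration, here is a minimal sketch of extracting such a hostname in code, assuming the http:// scheme and no port as stated above (the helper name get_hostname is just for illustration):

def get_hostname(url: str) -> str:
    # Drop the "http://" scheme, then keep everything up to the first '/'.
    return url.split("http://")[1].split("/")[0]

print(get_hostname("http://leetcode.com/problems"))  # leetcode.com
print(get_hostname("http://example.com/abc"))        # example.com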
The HtmlParser interface is defined as such:

interface HtmlParser {
    // Return a list of all urls from a webpage of the given url.
    public List<String> getUrls(String url);
}
Below is an example explaining the functionality of the problem. For custom testing purposes you'll have three variables: urls, edges, and startUrl. Notice that you will only have access to startUrl in your code, while urls and edges are not directly accessible to you in code.
Input:
urls = [
  "http://news.yahoo.com",
  "http://news.yahoo.com/news",
  "http://news.yahoo.com/news/topics/",
  "http://news.google.com",
  "http://news.yahoo.com/us"
]
edges = [[2,0],[2,1],[3,2],[3,1],[0,4]]
startUrl = "http://news.yahoo.com/news/topics/"

Output:
[
  "http://news.yahoo.com",
  "http://news.yahoo.com/news",
  "http://news.yahoo.com/news/topics/",
  "http://news.yahoo.com/us"
]
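Since urls and edges live only on the judge's side, one way to exercise a solution locally is to wrap them in a small stand-in for HtmlParser. The MockHtmlParser below is purely hypothetical test scaffolding (not part of the LeetCode API) and assumes an edge [a, b] means page urls[a] links to page urls[b], which is consistent with the example above:

import collections

class MockHtmlParser:
    def __init__(self, urls, edges):
        # Assumed edge direction: [a, b] means urls[a] contains a link to urls[b].
        self.graph = collections.defaultdict(list)
        for a, b in edges:
            self.graph[urls[a]].append(urls[b])

    def getUrls(self, url):
        return self.graph[url]

With this mock, calling Solution().crawl(startUrl, MockHtmlParser(urls, edges)) on the example data should return the four news.yahoo.com URLs in some order.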
Explanation
Each URL can be treated as a node of a directed graph, and every call to the getUrls function of htmlParser returns the outgoing edges of the current node (i.e. URL). A simple BFS or DFS traversal therefore works: for every discovered link we check whether it is valid (its hostname matches that of startUrl) and has not been visited before, and if so we add it to the result and continue crawling from it.
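The solutions below all use BFS with an explicit queue. As a rough sketch of the DFS alternative mentioned above (same visited set and same-hostname check, just recursive), something along these lines would also work:

from typing import List

class Solution:
    def crawl(self, startUrl: str, htmlParser: 'HtmlParser') -> List[str]:
        hostname = startUrl.split("http://")[1].split("/")[0]
        visited = set()

        def dfs(url: str) -> None:
            # Record the page, then recurse into every unseen same-hostname link.
            visited.add(url)
            for nxt in htmlParser.getUrls(url):
                if nxt not in visited and nxt.split("http://")[1].split("/")[0] == hostname:
                    dfs(nxt)

        dfs(startUrl)
        return list(visited)

For very deep chains of links, an iterative DFS (or a higher recursion limit) may be preferable to plain recursion.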
Code
Java Code for Web Crawler LeetCode Solution
class Solution {
    public List<String> crawl(String startUrl, HtmlParser htmlParser) {
        Set<String> set = new HashSet<>();          // URLs already discovered
        Queue<String> queue = new LinkedList<>();   // BFS frontier
        String hostname = getHostname(startUrl);
        queue.offer(startUrl);
        set.add(startUrl);
        while (!queue.isEmpty()) {
            String currentUrl = queue.poll();
            for (String url : htmlParser.getUrls(currentUrl)) {
                // Only enqueue unseen links on the same hostname.
                if (url.contains(hostname) && !set.contains(url)) {
                    queue.offer(url);
                    set.add(url);
                }
            }
        }
        return new ArrayList<String>(set);
    }

    // The hostname is the third piece after splitting on '/',
    // e.g. "http://news.yahoo.com/news" -> ["http:", "", "news.yahoo.com", "news"].
    private String getHostname(String url) {
        String[] ss = url.split("/");
        return ss[2];
    }
}
C++ Code for Web Crawler LeetCode Solution
class Solution {
 public:
  vector<string> crawl(string startUrl, HtmlParser htmlParser) {
    // "http://" has 7 characters, so the first '/' at or after index 7
    // ends the hostname; keep scheme + hostname as the prefix to match on.
    const string hostname = startUrl.substr(0, startUrl.find('/', 7));
    vector<string> q{startUrl};
    unordered_set<string> seen(q.cbegin(), q.cend());
    // BFS where q doubles as both the queue and the answer.
    for (int i = 0; i < q.size(); ++i) {
      string url = q[i];
      for (auto& child : htmlParser.getUrls(url)) {
        if (child.find(hostname) == string::npos || seen.count(child))
          continue;
        q.push_back(child);
        seen.insert(child);
      }
    }
    return q;
  }
};
Python Code for Web Crawler LeetCode Solution
class Solution:
    def crawl(self, startUrl: str, htmlParser: 'HtmlParser') -> List[str]:
        visited = {startUrl}
        # The hostname is the part between "http://" and the next '/'.
        domain = startUrl.split("http://")[1].split("/")[0]
        ans = [startUrl]
        queue = collections.deque([startUrl])
        while queue:
            # Process the current BFS level.
            for _ in range(len(queue)):
                url = queue.popleft()
                for new_url in htmlParser.getUrls(url):
                    if new_url in visited:
                        continue
                    if new_url.split("http://")[1].split("/")[0] != domain:
                        continue
                    ans.append(new_url)
                    visited.add(new_url)
                    queue.append(new_url)
        return ans
Complexity Analysis
Time Complexity
O(N + E), where N is the number of URLs visited and E is the total number of links returned by getUrls, since each URL is dequeued at most once and each link is examined at most once.
Space Complexity
O(N), since the visited set and the queue each hold at most N URLs.
Reference: https://en.wikipedia.org/wiki/Web_crawler