Problem Statement
Web Crawler LeetCode Solution – Given a URL startUrl and an interface HtmlParser, implement a web crawler to crawl all links that are under the same hostname as startUrl.
Return all URLs obtained by your web crawler in any order.
Your crawler should:
- Start from the page: startUrl.
- Call HtmlParser.getUrls(url) to get all URLs from a webpage of the given URL.
- Do not crawl the same link twice.
- Explore only the links that are under the same hostname as startUrl.
For a URL such as http://example.org/test, the hostname is example.org. For simplicity's sake, you may assume all URLs use the HTTP protocol without any port specified. For example, the URLs http://leetcode.com/problems and http://leetcode.com/contest are under the same hostname, while the URLs http://example.org/test and http://example.com/abc are not under the same hostname.
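As a quick illustration, here is a minimal sketch of extracting such a hostname in code, assuming the http:// scheme and no port as stated above (the helper name get_hostname is just for illustration):

def get_hostname(url: str) -> str:
    # Drop the "http://" scheme, then keep everything up to the first '/'.
    return url.split("http://")[1].split("/")[0]

print(get_hostname("http://leetcode.com/problems"))  # leetcode.com
print(get_hostname("http://example.com/abc"))        # example.com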
The HtmlParser interface is defined as such:

interface HtmlParser {
    // Return a list of all urls from a webpage of the given url.
    public List<String> getUrls(String url);
}
Below is an example explaining the functionality of the problem. For custom testing purposes you'll have three variables: urls, edges, and startUrl. Notice that you will only have access to startUrl in your code, while urls and edges are not directly accessible to you in code.
Input:
urls = [
  "http://news.yahoo.com",
  "http://news.yahoo.com/news",
  "http://news.yahoo.com/news/topics/",
  "http://news.google.com",
  "http://news.yahoo.com/us"
]
edges = [[2,0],[2,1],[3,2],[3,1],[0,4]]
startUrl = "http://news.yahoo.com/news/topics/"

Output:
[
  "http://news.yahoo.com",
  "http://news.yahoo.com/news",
  "http://news.yahoo.com/news/topics/",
  "http://news.yahoo.com/us"
]
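Since urls and edges live only on the judge's side, one way to exercise a solution locally is to wrap them in a small stand-in for HtmlParser. The MockHtmlParser below is purely hypothetical test scaffolding (not part of the LeetCode API) and assumes an edge [a, b] means page urls[a] links to page urls[b], which is consistent with the example above:

import collections

class MockHtmlParser:
    def __init__(self, urls, edges):
        # Assumed edge direction: [a, b] means urls[a] contains a link to urls[b].
        self.graph = collections.defaultdict(list)
        for a, b in edges:
            self.graph[urls[a]].append(urls[b])

    def getUrls(self, url):
        return self.graph[url]

With this mock, calling Solution().crawl(startUrl, MockHtmlParser(urls, edges)) on the example data should return the four news.yahoo.com URLs in some order.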
Explanation
Each URL can be treated as a node of a directed graph, and every call to the getUrls function of htmlParser returns the outgoing edges of the current node (i.e. URL). A simple BFS or DFS traversal therefore works: for every discovered link we check whether it is valid (its hostname matches that of startUrl) and has not been visited before, and if so we add it to the result and continue crawling from it.
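The solutions below all use BFS with an explicit queue. As a rough sketch of the DFS alternative mentioned above (same visited set and same-hostname check, just recursive), something along these lines would also work:

from typing import List

class Solution:
    def crawl(self, startUrl: str, htmlParser: 'HtmlParser') -> List[str]:
        hostname = startUrl.split("http://")[1].split("/")[0]
        visited = set()

        def dfs(url: str) -> None:
            # Record the page, then recurse into every unseen same-hostname link.
            visited.add(url)
            for nxt in htmlParser.getUrls(url):
                if nxt not in visited and nxt.split("http://")[1].split("/")[0] == hostname:
                    dfs(nxt)

        dfs(startUrl)
        return list(visited)

For very deep chains of links, an iterative DFS (or a higher recursion limit) may be preferable to plain recursion.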
Code
Java Code for Web Crawler LeetCode Solution
class Solution {
    public List<String> crawl(String startUrl, HtmlParser htmlParser) {
        Set<String> set = new HashSet<>();          // URLs already discovered
        Queue<String> queue = new LinkedList<>();   // BFS frontier
        String hostname = getHostname(startUrl);
        queue.offer(startUrl);
        set.add(startUrl);
        while (!queue.isEmpty()) {
            String currentUrl = queue.poll();
            for (String url : htmlParser.getUrls(currentUrl)) {
                // Only enqueue unseen links on the same hostname.
                if (url.contains(hostname) && !set.contains(url)) {
                    queue.offer(url);
                    set.add(url);
                }
            }
        }
        return new ArrayList<String>(set);
    }

    // The hostname is the third piece after splitting on '/',
    // e.g. "http://news.yahoo.com/news" -> ["http:", "", "news.yahoo.com", "news"].
    private String getHostname(String url) {
        String[] ss = url.split("/");
        return ss[2];
    }
}
C++ Code for Web Crawler LeetCode Solution
class Solution {
 public:
  vector<string> crawl(string startUrl, HtmlParser htmlParser) {
    // "http://" has 7 characters, so the first '/' at or after index 7
    // ends the hostname; keep scheme + hostname as the prefix to match on.
    const string hostname = startUrl.substr(0, startUrl.find('/', 7));
    vector<string> q{startUrl};
    unordered_set<string> seen(q.cbegin(), q.cend());
    // BFS where q doubles as both the queue and the answer.
    for (int i = 0; i < q.size(); ++i) {
      string url = q[i];
      for (auto& child : htmlParser.getUrls(url)) {
        if (child.find(hostname) == string::npos || seen.count(child))
          continue;
        q.push_back(child);
        seen.insert(child);
      }
    }
    return q;
  }
};
Python Code for Web Crawler LeetCode Solution
class Solution:
    def crawl(self, startUrl: str, htmlParser: 'HtmlParser') -> List[str]:
        visited = {startUrl}
        # The hostname is the part between "http://" and the next '/'.
        domain = startUrl.split("http://")[1].split("/")[0]
        ans = [startUrl]
        queue = collections.deque([startUrl])
        while queue:
            # Process the current BFS level.
            for _ in range(len(queue)):
                url = queue.popleft()
                for new_url in htmlParser.getUrls(url):
                    if new_url in visited:
                        continue
                    if new_url.split("http://")[1].split("/")[0] != domain:
                        continue
                    ans.append(new_url)
                    visited.add(new_url)
                    queue.append(new_url)
        return ans
Complexity Analysis
Time Complexity
O(N + E), where N is the number of URLs visited and E is the total number of links returned by getUrls, since each URL is dequeued at most once and each link is examined at most once.
Space Complexity
O(N), since the visited set and the queue each hold at most N URLs.
Reference: https://en.wikipedia.org/wiki/Web_crawler