Multithreaded Web Crawler
Asked at: OpenAI, Anthropic
DESCRIPTION
Build a web crawler that performs breadth-first search (BFS) traversal starting from a seed URL, discovering and fetching all reachable pages within the same domain. Implement concurrent fetching using multithreading to improve crawling speed while maintaining thread safety for shared data structures like the URL queue and visited set. For example, starting from https://example.com, the crawler should visit https://example.com/about and https://example.com/contact but skip https://other.com.
Input: seed URL https://example.com, 4 worker threads
Output: the set of same-domain URLs reachable from the seed, e.g. https://example.com, https://example.com/about, https://example.com/contact
Explanation: The crawler uses BFS to visit pages level-by-level, employs 4 threads for concurrent fetching, and only crawls URLs within the same domain while tracking visited pages to avoid duplicates.
Constraints:
- Only crawl URLs within the same domain as the starting URL
- Use BFS traversal order for discovering URLs
- Implement thread-safe access to shared queue and visited set
- Handle concurrent URL fetching with multiple threads
- Avoid crawling the same URL twice
Understanding the Problem
The challenge combines graph traversal (treating web pages as nodes and links as edges) with concurrent programming. The BFS requirement demands a queue-based approach where URLs are processed level-by-level. The multithreading aspect requires synchronization mechanisms (locks, thread-safe collections) to prevent race conditions when multiple threads access the shared queue and visited set. Additionally, domain filtering must be applied to ensure only same-domain URLs are crawled.
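Before introducing threads, it helps to see the traversal on its own. Below is a minimal single-threaded BFS sketch; fetch_links(url) is a hypothetical placeholder for "download the page and return the links on it", and the concurrent version in the Implementation section keeps this same queue-plus-visited-set structure.

from collections import deque
from urllib.parse import urlparse

def bfs_crawl(start_url, fetch_links):
    # fetch_links(url) -> list of absolute URLs found on the page (assumed helper)
    domain = urlparse(start_url).netloc
    visited = {start_url}
    queue = deque([start_url])

    while queue:
        url = queue.popleft()          # FIFO order gives level-by-level (BFS) traversal
        for link in fetch_links(url):
            if urlparse(link).netloc == domain and link not in visited:
                visited.add(link)      # mark before enqueueing to avoid duplicates
                queue.append(link)
    return visited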
Building Intuition
A naive single-threaded approach would fetch URLs sequentially, making the crawler slow for large sites. A better approach uses a thread pool where multiple worker threads concurrently fetch pages from a shared queue, dramatically reducing total crawl time. For example, fetching 100 pages taking 1 second each would take 100 seconds single-threaded but only ~25 seconds with 4 threads.
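If you want to convince yourself of that speedup, you can simulate the network wait with time.sleep. The sketch below is purely illustrative, but with 4 workers the wall-clock time drops to roughly a quarter of the sequential run.

import time
from concurrent.futures import ThreadPoolExecutor

def fake_fetch(url):
    time.sleep(1)          # stand-in for a 1-second network round trip
    return url

urls = [f"https://example.com/page{i}" for i in range(8)]

start = time.time()
for u in urls:
    fake_fetch(u)
print(f"sequential: {time.time() - start:.1f}s")   # ~8s

start = time.time()
with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(fake_fetch, urls))
print(f"4 threads:  {time.time() - start:.1f}s")   # ~2s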
Web crawlers are fundamental to search engines, site monitoring tools, and data scraping applications. Efficient crawling with proper concurrency can reduce crawl time from hours to minutes for large sites. Understanding thread safety is crucial for building reliable distributed systems that handle shared state correctly.
Common Pitfalls
- Checking the visited set and adding to it as two separate steps, letting two threads queue the same URL
- Forgetting to resolve relative links (e.g. /about) to absolute URLs before filtering by domain
- Treating https://example.com and https://example.com:443 as different domains
- Getting termination wrong: workers exiting while other threads are still adding URLs, or blocking forever on an empty queue
Implementation
URL Queue and Domain Management
Implement the core data structures for BFS traversal: a thread-safe queue.Queue for pending URLs and a set with lock protection for visited URLs. Create a domain extraction function using urllib.parse.urlparse() to normalize and compare domains, ensuring only same-domain URLs are queued. For example, https://example.com/page and https://example.com:443/page should be recognized as the same domain.
from queue import Queue
from threading import Lock
from urllib.parse import urlparse

url_queue = Queue()
visited = set()
visited_lock = Lock()

def get_domain(url):
    parsed = urlparse(url)
    domain = parsed.netloc.lower()
    if ':' in domain:
        host, port = domain.rsplit(':', 1)
        if (parsed.scheme == 'https' and port == '443') or (parsed.scheme == 'http' and port == '80'):
            domain = host
    return f"{parsed.scheme}://{domain}"

def is_same_domain(url1, url2):
    return get_domain(url1) == get_domain(url2)
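With these helpers in place, the port-normalization behavior described above can be checked directly:

# Default ports are stripped, so these two URLs map to the same domain
print(is_same_domain("https://example.com/page", "https://example.com:443/page"))  # True
print(is_same_domain("https://example.com/page", "https://other.com/page"))        # False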
URL Fetching and Parsing Worker
Build the worker thread function that continuously dequeues URLs, fetches page content using requests.get(), and extracts links using BeautifulSoup or regex. Implement error handling for network failures, timeouts, and invalid HTML. Each discovered link should be normalized (converted to absolute URL) and filtered by domain before being added to the queue. For instance, a relative link /about on https://example.com/page becomes https://example.com/about.
import requests
from queue import Empty
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse

def worker(queue, visited, visited_lock, domain, timeout=5):
    while True:
        try:
            # Exit when no new work arrives within a short window
            url = queue.get(timeout=1)
        except Empty:
            break
        try:
            response = requests.get(url, timeout=timeout)
            soup = BeautifulSoup(response.text, 'html.parser')
            for link in soup.find_all('a', href=True):
                # Resolve relative links like /about against the current page
                absolute_url = urljoin(url, link['href'])
                parsed = urlparse(absolute_url)
                normalized = f"{parsed.scheme}://{parsed.netloc}{parsed.path}"
                if parsed.netloc == domain:
                    # Check-and-add under one lock so two threads can't queue the same URL
                    with visited_lock:
                        if normalized not in visited:
                            visited.add(normalized)
                            queue.put(normalized)
        except requests.RequestException:
            pass  # Skip pages that fail to fetch or time out
        finally:
            queue.task_done()
Thread Pool Coordination and Termination
Create the main crawler orchestration that initializes a thread pool using threading.Thread or concurrent.futures.ThreadPoolExecutor, seeds the queue with the starting URL, and manages worker lifecycle. Implement a termination mechanism using a combination of queue empty checks and active thread counting, or use sentinel values to signal workers to exit. For example, when the queue is empty and all threads are idle, the crawl is complete.
import threading
from concurrent.futures import ThreadPoolExecutor
from queue import Queue
from urllib.parse import urlparse

def crawl_orchestrator(start_url, max_workers=5):
    url_queue = Queue()
    visited = set()
    lock = threading.Lock()
    domain = urlparse(start_url).netloc

    # Seed the crawl with the starting URL
    visited.add(start_url)
    url_queue.put(start_url)

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [
            executor.submit(worker, url_queue, visited, lock, domain)
            for _ in range(max_workers)
        ]
        # Wait for all workers to complete
        for future in futures:
            future.result()

    return visited
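Assuming the worker from the previous section is in scope, kicking off a crawl is a single call. This is a sketch of intended usage rather than a production entry point, and the seed URL is just an example.

if __name__ == "__main__":
    pages = crawl_orchestrator("https://example.com", max_workers=4)
    print(f"Crawled {len(pages)} pages:")
    for page in sorted(pages):
        print(page)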
Thread-Safe State Management
Implement synchronization primitives to protect shared resources: use threading.Lock for the visited set to prevent duplicate URL processing, and leverage queue.Queue's built-in thread safety for the URL queue. Add atomic check-and-add operations where a thread checks if a URL is visited and marks it visited in one locked section. For instance, wrap if url not in visited: visited.add(url) in a lock context to prevent race conditions.
import threading
from queue import Queue

class ThreadSafeState:
    def __init__(self):
        self.visited = set()
        self.visited_lock = threading.Lock()
        self.url_queue = Queue()

    def is_visited_and_mark(self, url):
        with self.visited_lock:
            if url in self.visited:
                return True
            self.visited.add(url)
            return False

    def add_url(self, url):
        self.url_queue.put(url)

    def get_url(self, timeout=1):
        return self.url_queue.get(timeout=timeout)
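A worker built on this class no longer needs to touch the lock directly; the atomic check-and-mark does the work. The loop below is a sketch of how a worker might consume it, again assuming a hypothetical fetch_links(url) helper that returns the absolute links found on a page.

from queue import Empty
from urllib.parse import urlparse

def worker_loop(state, domain, fetch_links):
    while True:
        try:
            url = state.get_url(timeout=1)   # give up when no work arrives for a second
        except Empty:
            break
        for link in fetch_links(url):
            # Only the first thread to see a URL gets False back, so each URL is queued exactly once
            if urlparse(link).netloc == domain and not state.is_visited_and_mark(link):
                state.add_url(link)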
What We've Learned
- Pattern: BFS with thread pool - use thread-safe queue for level-order traversal with concurrent workers
- Use Case: Web crawlers, site mappers, link validators, and any graph traversal with I/O-bound operations
- Concurrency: Thread-safe collections + locks prevent race conditions in multi-threaded environments
- Optimization: Multithreading dramatically improves I/O-bound tasks like web crawling by parallelizing network requests
Problems to Practice
This lesson provides foundational understanding of graph traversal using BFS, which is the core algorithm used in the web crawler problem. It covers how to traverse graphs level-by-level, which directly applies to crawling URLs in a breadth-first manner.
medium
This problem involves graph traversal with cycle detection and managing visited nodes, similar to tracking visited URLs in a web crawler. It also requires handling dependencies between nodes, which parallels managing the crawl queue and ensuring no duplicate processing.
easy
This problem requires traversing a graph while maintaining a visited set to avoid infinite loops, which is essential for web crawling. It teaches how to handle graph nodes with multiple connections and track already-processed nodes, directly applicable to URL deduplication in crawlers.
Question Timeline
See when this question was last asked and where, including any notes left by other candidates.
Mid July, 2025
Anthropic
Senior
Mid April, 2025
OpenAI
Staff
Create a web crawler that starts from a given URL and crawls all reachable URLs within the same domain using breadth-first search traversal. Implement multithreading to fetch URLs concurrently for improved efficiency while ensuring thread safety when accessing shared resources like the URL queue and visited set.
Late November, 2024
Anthropic
Mid-level