
Multithreaded Web Crawler

Asked at: Anthropic, OpenAI

DESCRIPTION

Build a web crawler that performs breadth-first search (BFS) traversal starting from a seed URL, discovering and fetching all reachable pages within the same domain. Implement concurrent fetching using multithreading to improve crawling speed while maintaining thread safety for shared data structures like the URL queue and visited set. For example, starting from https://example.com, the crawler should visit https://example.com/about and https://example.com/contact but skip https://other.com.

Input:

start_url = "https://example.com"
max_threads = 4

Page structure:
example.com → [example.com/about, example.com/blog]
example.com/about → [example.com/contact]
example.com/blog → [example.com/about, external.com]

Output:

Crawled URLs (in BFS order):
1. https://example.com
2. https://example.com/about
3. https://example.com/blog
4. https://example.com/contact

Total pages crawled: 4
(external.com skipped - different domain)


Explanation: The crawler uses BFS to visit pages level-by-level, employs 4 threads for concurrent fetching, and only crawls URLs within the same domain while tracking visited pages to avoid duplicates.

Constraints:

  • Only crawl URLs within the same domain as the starting URL
  • Use BFS traversal order for discovering URLs
  • Implement thread-safe access to shared queue and visited set
  • Handle concurrent URL fetching with multiple threads
  • Avoid crawling the same URL twice

Understanding the Problem

The challenge combines graph traversal (treating web pages as nodes and links as edges) with concurrent programming. The BFS requirement demands a queue-based approach where URLs are processed level-by-level. The multithreading aspect requires synchronization mechanisms (locks, thread-safe collections) to prevent race conditions when multiple threads access the shared queue and visited set. Additionally, domain filtering must be applied to ensure only same-domain URLs are crawled.
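
Before adding threads, it helps to see the single-threaded shape of the problem. The sketch below is a minimal baseline, assuming a fetch_links(url) helper (a placeholder for the fetching and parsing logic developed in the Implementation section) that returns the absolute URLs linked from a page; the multithreaded version keeps the same queue-plus-visited-set structure and adds synchronization:

from collections import deque
from urllib.parse import urlparse

def bfs_crawl(start_url, fetch_links):
    # fetch_links(url) is assumed to return the absolute URLs linked from that page
    domain = urlparse(start_url).netloc
    visited = {start_url}
    queue = deque([start_url])
    order = []
    while queue:
        url = queue.popleft()                  # FIFO queue -> level-by-level (BFS) order
        order.append(url)
        for link in fetch_links(url):
            if urlparse(link).netloc == domain and link not in visited:
                visited.add(link)
                queue.append(link)
    return order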

Building Intuition

A naive single-threaded approach fetches URLs sequentially, which makes the crawler slow for large sites. A better approach uses a thread pool in which multiple worker threads concurrently fetch pages from a shared queue, dramatically reducing total crawl time. For example, fetching 100 pages that each take 1 second would take 100 seconds single-threaded but only about 25 seconds with 4 threads.
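
The arithmetic above can be seen directly with a small timing sketch that uses time.sleep as a stand-in for a 1-second network fetch (the URLs here are illustrative placeholders):

import time
from concurrent.futures import ThreadPoolExecutor

def fake_fetch(url):
    time.sleep(1)                              # stand-in for a 1-second network request
    return url

urls = [f"https://example.com/page{i}" for i in range(8)]

start = time.time()
with ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(fake_fetch, urls))
print(f"Fetched {len(results)} pages in {time.time() - start:.1f}s")   # roughly 2s with 4 threads, vs ~8s sequentially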

Web crawlers are fundamental to search engines, site monitoring tools, and data scraping applications. Efficient crawling with proper concurrency can reduce crawl time from hours to minutes for large sites. Understanding thread safety is crucial for building reliable distributed systems that handle shared state correctly.

Common Pitfalls

  • Checking "not in visited" and adding to the set as two separate, unlocked steps, which lets two threads enqueue the same URL
  • Terminating workers the moment the queue looks empty, even though another thread may still be fetching a page that will add more URLs
  • Forgetting to convert relative links to absolute URLs before filtering and enqueueing them
  • Crawling other domains because the same-domain check is applied inconsistently (or not at all)

Implementation

URL Queue and Domain Management

Implement the core data structures for BFS traversal: a thread-safe queue.Queue for pending URLs and a set with lock protection for visited URLs. Create a domain extraction function using urllib.parse.urlparse() to normalize and compare domains, ensuring only same-domain URLs are queued. For example, https://example.com/page and https://example.com:443/page should be recognized as the same domain.

Solution
from queue import Queue
from threading import Lock
from urllib.parse import urlparse

# Shared state for the crawl: a thread-safe queue of pending URLs and a
# lock-protected set of URLs that have already been seen.
url_queue = Queue()
visited = set()
visited_lock = Lock()

def get_domain(url):
    """Return a normalized scheme://host key, dropping default ports."""
    parsed = urlparse(url)
    domain = parsed.netloc.lower()
    if ':' in domain:
        host, port = domain.rsplit(':', 1)
        if (parsed.scheme == 'https' and port == '443') or (parsed.scheme == 'http' and port == '80'):
            domain = host
    return f"{parsed.scheme}://{domain}"

def is_same_domain(url1, url2):
    return get_domain(url1) == get_domain(url2)
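
A quick check of the helpers above, with the expected results in comments:

print(get_domain("https://example.com:443/page"))                       # https://example.com
print(is_same_domain("https://example.com/about",
                     "https://example.com:443/contact"))                # True
print(is_same_domain("https://example.com", "https://other.com"))       # False
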
URL Fetching and Parsing Worker

Build the worker thread function that continuously dequeues URLs, fetches page content using requests.get(), and extracts links using BeautifulSoup or regex. Implement error handling for network failures, timeouts, and invalid HTML. Each discovered link should be normalized (converted to absolute URL) and filtered by domain before being added to the queue. For instance, a relative link /about on https://example.com/page becomes https://example.com/about.

Solution
import requests
from bs4 import BeautifulSoup
from queue import Empty
from urllib.parse import urljoin, urlparse

def worker(queue, visited, visited_lock, domain, timeout=5):
    """Repeatedly take a URL off the queue, fetch it, and enqueue new same-domain links.

    `domain` is the netloc of the start URL (e.g. 'example.com'). The worker exits
    when no URL arrives within `timeout` seconds, i.e. the crawl has drained.
    """
    while True:
        try:
            # queue.Queue is already thread-safe; no lock is needed to dequeue.
            url = queue.get(timeout=timeout)
        except Empty:
            break
        try:
            response = requests.get(url, timeout=timeout)
            soup = BeautifulSoup(response.text, 'html.parser')
            for link in soup.find_all('a', href=True):
                absolute_url = urljoin(url, link['href'])          # resolve relative links
                parsed = urlparse(absolute_url)
                normalized = f"{parsed.scheme}://{parsed.netloc}{parsed.path}"
                if parsed.netloc == domain:                        # same-domain filter
                    with visited_lock:                             # atomic check-and-add
                        if normalized not in visited:
                            visited.add(normalized)
                            queue.put(normalized)
        except requests.RequestException:
            pass                                                   # skip pages that fail or time out
        finally:
            queue.task_done()
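
The link normalization in the worker relies on urllib.parse.urljoin, which resolves a link relative to the page it was found on; a few examples of its behavior:

from urllib.parse import urljoin

print(urljoin("https://example.com/page", "/about"))                # https://example.com/about
print(urljoin("https://example.com/blog/", "post-1"))               # https://example.com/blog/post-1
print(urljoin("https://example.com/page", "https://other.com/x"))   # https://other.com/x (rejected by the domain filter)
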
Thread Pool Coordination and Termination

Create the main crawler orchestration that initializes a thread pool using threading.Thread or concurrent.futures.ThreadPoolExecutor, seeds the queue with the starting URL, and manages worker lifecycle. Implement a termination mechanism using a combination of queue empty checks and active thread counting, or use sentinel values to signal workers to exit. For example, when the queue is empty and all threads are idle, the crawl is complete.

Solution
from concurrent.futures import ThreadPoolExecutor
from queue import Queue
from urllib.parse import urlparse
import threading

def crawl_orchestrator(start_url, max_workers=5):
    url_queue = Queue()
    visited = {start_url}                      # seed URL counts as visited
    lock = threading.Lock()
    domain = urlparse(start_url).netloc        # restrict the crawl to this host
    url_queue.put(start_url)
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Each worker shares the queue, visited set, lock, and domain filter.
        futures = [executor.submit(worker, url_queue, visited, lock, domain)
                   for _ in range(max_workers)]
        # Wait for all workers to finish; surfaces any exception raised inside a worker.
        for future in futures:
            future.result()
    return visited
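
With the worker from the previous step, a crawl is started with a single call, e.g. crawl_orchestrator("https://example.com", max_workers=4), which returns the set of visited URLs.

The step description also mentions sentinel values as an alternative termination mechanism. A minimal sketch of that pattern, assuming a worker loop that treats None as a stop signal (the fetching and parsing body is elided here):

import threading
from queue import Queue

SENTINEL = None                                # special value that tells a worker to exit

def sentinel_worker(queue):
    while True:
        url = queue.get()
        if url is SENTINEL:
            queue.task_done()
            break                              # stop signal received: exit the loop
        # ... fetch url, parse links, enqueue new same-domain URLs ...
        queue.task_done()

def crawl_with_sentinels(start_url, num_workers=4):
    queue = Queue()
    queue.put(start_url)
    threads = [threading.Thread(target=sentinel_worker, args=(queue,))
               for _ in range(num_workers)]
    for t in threads:
        t.start()
    queue.join()                               # blocks until every enqueued URL is marked done
    for _ in threads:
        queue.put(SENTINEL)                    # one sentinel per worker so each exits exactly once
    for t in threads:
        t.join()
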
Thread-Safe State Management

Implement synchronization primitives to protect shared resources: use threading.Lock for the visited set to prevent duplicate URL processing, and leverage queue.Queue's built-in thread safety for the URL queue. Add atomic check-and-add operations where a thread checks if a URL is visited and marks it visited in one locked section. For instance, wrap if url not in visited: visited.add(url) in a lock context to prevent race conditions.

Solution
import threading
from queue import Queue

class ThreadSafeState:
    """Bundles the shared crawl state behind thread-safe operations."""

    def __init__(self):
        self.visited = set()
        self.visited_lock = threading.Lock()
        self.url_queue = Queue()               # Queue is thread-safe on its own

    def is_visited_and_mark(self, url):
        """Atomically check whether `url` was seen and mark it visited if not."""
        with self.visited_lock:
            if url in self.visited:
                return True
            self.visited.add(url)
            return False

    def add_url(self, url):
        self.url_queue.put(url)

    def get_url(self, timeout=1):
        # Raises queue.Empty if nothing arrives within `timeout` seconds.
        return self.url_queue.get(timeout=timeout)
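
As a sketch of how a worker might consume this state (fetch_links is again a placeholder for the fetching and parsing logic), the atomic check-and-mark keeps the duplicate check and the set update inside one locked section:

from queue import Empty

def worker_loop(state, fetch_links):
    while True:
        try:
            url = state.get_url(timeout=1)             # raises Empty if no work arrives in time
        except Empty:
            break
        for link in fetch_links(url):
            if not state.is_visited_and_mark(link):    # atomic check-and-add under the lock
                state.add_url(link)
        state.url_queue.task_done()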

What We've Learned

  • Pattern: BFS with thread pool - use thread-safe queue for level-order traversal with concurrent workers
  • Use Case: Web crawlers, site mappers, link validators, and any graph traversal with I/O-bound operations
  • Concurrency: Thread-safe collections + locks prevent race conditions in multi-threaded environments
  • Optimization: Multithreading dramatically improves I/O-bound tasks like web crawling by parallelizing network requests

Problems to Practice

Breadth-First Search (Lesson, Graphs Overview)

This lesson provides foundational understanding of graph traversal using BFS, which is the core algorithm used in the web crawler problem. It covers how to traverse graphs level-by-level, which directly applies to crawling URLs in a breadth-first manner.

Course Schedule (Medium, Graphs)

This problem involves graph traversal with cycle detection and managing visited nodes, similar to tracking visited URLs in a web crawler. It also requires handling dependencies between nodes, which parallels managing the crawl queue and ensuring no duplicate processing.

Copy Graph (Easy, Depth-First Search)

This problem requires traversing a graph while maintaining a visited set to avoid infinite loops, which is essential for web crawling. It teaches how to handle graph nodes with multiple connections and track already-processed nodes, directly applicable to URL deduplication in crawlers.

Question Timeline

See when this question was last asked and where, including any notes left by other candidates.

  • Late February, 2026: Anthropic (Staff)
  • Late February, 2026: Anthropic (Senior)
  • Mid July, 2025: Anthropic (Senior)
