
Multithreaded Web Crawler

Asked at: Anthropic, OpenAI

DESCRIPTION

Build a web crawler that performs breadth-first search (BFS) traversal starting from a seed URL, discovering and fetching all reachable pages within the same domain. Implement concurrent fetching using multithreading to improve crawling speed while maintaining thread safety for shared data structures like the URL queue and visited set. For example, starting from https://example.com, the crawler should visit https://example.com/about and https://example.com/contact but skip https://other.com.

Input:

start_url = "https://example.com"
max_threads = 4

Page structure:
example.com → [example.com/about, example.com/blog]
example.com/about → [example.com/contact]
example.com/blog → [example.com/about, external.com]

Output:

Crawled URLs (in BFS order):
1. https://example.com
2. https://example.com/about
3. https://example.com/blog
4. https://example.com/contact

Total pages crawled: 4
(external.com skipped - different domain)


Explanation: The crawler uses BFS to visit pages level-by-level, employs 4 threads for concurrent fetching, and only crawls URLs within the same domain while tracking visited pages to avoid duplicates.

Constraints:

  • Only crawl URLs within the same domain as the starting URL
  • Use BFS traversal order for discovering URLs
  • Implement thread-safe access to shared queue and visited set
  • Handle concurrent URL fetching with multiple threads
  • Avoid crawling the same URL twice

Understanding the Problem

The challenge combines graph traversal (treating web pages as nodes and links as edges) with concurrent programming. The BFS requirement demands a queue-based approach where URLs are processed level-by-level. The multithreading aspect requires synchronization mechanisms (locks, thread-safe collections) to prevent race conditions when multiple threads access the shared queue and visited set. Additionally, domain filtering must be applied to ensure only same-domain URLs are crawled.
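
Before adding threads, it helps to see the single-threaded shape of the problem. The sketch below is a minimal baseline, assuming a fetch_links(url) helper (a placeholder for the fetching and parsing logic developed in the Implementation section) that returns the absolute URLs linked from a page; the multithreaded version keeps the same queue-plus-visited-set structure and adds synchronization:

from collections import deque
from urllib.parse import urlparse

def bfs_crawl(start_url, fetch_links):
    # fetch_links(url) is assumed to return the absolute URLs linked from that page
    domain = urlparse(start_url).netloc
    visited = {start_url}
    queue = deque([start_url])
    order = []
    while queue:
        url = queue.popleft()                  # FIFO queue -> level-by-level (BFS) order
        order.append(url)
        for link in fetch_links(url):
            if urlparse(link).netloc == domain and link not in visited:
                visited.add(link)
                queue.append(link)
    return order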

Building Intuition

A naive single-threaded approach fetches URLs sequentially, which makes the crawler slow for large sites. A better approach uses a thread pool in which multiple worker threads concurrently fetch pages from a shared queue, dramatically reducing total crawl time. For example, fetching 100 pages that each take 1 second would take 100 seconds single-threaded but only about 25 seconds with 4 threads.
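
The arithmetic above can be seen directly with a small timing sketch that uses time.sleep as a stand-in for a 1-second network fetch (the URLs here are illustrative placeholders):

import time
from concurrent.futures import ThreadPoolExecutor

def fake_fetch(url):
    time.sleep(1)                              # stand-in for a 1-second network request
    return url

urls = [f"https://example.com/page{i}" for i in range(8)]

start = time.time()
with ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(fake_fetch, urls))
print(f"Fetched {len(results)} pages in {time.time() - start:.1f}s")   # roughly 2s with 4 threads, vs ~8s sequentially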

Web crawlers are fundamental to search engines, site monitoring tools, and data scraping applications. Efficient crawling with proper concurrency can reduce crawl time from hours to minutes for large sites. Understanding thread safety is crucial for building reliable distributed systems that handle shared state correctly.

Common Pitfalls

  • Checking "not in visited" and adding to the set as two separate, unlocked steps, which lets two threads enqueue the same URL
  • Terminating workers the moment the queue looks empty, even though another thread may still be fetching a page that will add more URLs
  • Forgetting to convert relative links to absolute URLs before filtering and enqueueing them
  • Crawling other domains because the same-domain check is applied inconsistently (or not at all)

Implementation

URL Queue and Domain Management

Implement the core data structures for BFS traversal: a thread-safe queue.Queue for pending URLs and a set with lock protection for visited URLs. Create a domain extraction function using urllib.parse.urlparse() to normalize and compare domains, ensuring only same-domain URLs are queued. For example, https://example.com/page and https://example.com:443/page should be recognized as the same domain.

Solution
from queue import Queue
from threading import Lock
from urllib.parse import urlparse

# Shared state for the crawl: a thread-safe queue of pending URLs and a
# lock-protected set of URLs that have already been seen.
url_queue = Queue()
visited = set()
visited_lock = Lock()

def get_domain(url):
    """Return a normalized scheme://host key, dropping default ports."""
    parsed = urlparse(url)
    domain = parsed.netloc.lower()
    if ':' in domain:
        host, port = domain.rsplit(':', 1)
        if (parsed.scheme == 'https' and port == '443') or (parsed.scheme == 'http' and port == '80'):
            domain = host
    return f"{parsed.scheme}://{domain}"

def is_same_domain(url1, url2):
    return get_domain(url1) == get_domain(url2)
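
A quick check of the helpers above, with the expected results in comments:

print(get_domain("https://example.com:443/page"))                       # https://example.com
print(is_same_domain("https://example.com/about",
                     "https://example.com:443/contact"))                # True
print(is_same_domain("https://example.com", "https://other.com"))       # False
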
URL Fetching and Parsing Worker

Build the worker thread function that continuously dequeues URLs, fetches page content using requests.get(), and extracts links using BeautifulSoup or regex. Implement error handling for network failures, timeouts, and invalid HTML. Each discovered link should be normalized (converted to absolute URL) and filtered by domain before being added to the queue. For instance, a relative link /about on https://example.com/page becomes https://example.com/about.

Solution
import requests
from bs4 import BeautifulSoup
from queue import Empty
from urllib.parse import urljoin, urlparse

def worker(queue, visited, visited_lock, domain, timeout=5):
    """Repeatedly take a URL off the queue, fetch it, and enqueue new same-domain links.

    `domain` is the netloc of the start URL (e.g. 'example.com'). The worker exits
    when no URL arrives within `timeout` seconds, i.e. the crawl has drained.
    """
    while True:
        try:
            # queue.Queue is already thread-safe; no lock is needed to dequeue.
            url = queue.get(timeout=timeout)
        except Empty:
            break
        try:
            response = requests.get(url, timeout=timeout)
            soup = BeautifulSoup(response.text, 'html.parser')
            for link in soup.find_all('a', href=True):
                absolute_url = urljoin(url, link['href'])          # resolve relative links
                parsed = urlparse(absolute_url)
                normalized = f"{parsed.scheme}://{parsed.netloc}{parsed.path}"
                if parsed.netloc == domain:                        # same-domain filter
                    with visited_lock:                             # atomic check-and-add
                        if normalized not in visited:
                            visited.add(normalized)
                            queue.put(normalized)
        except requests.RequestException:
            pass                                                   # skip pages that fail or time out
        finally:
            queue.task_done()
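
The link normalization in the worker relies on urllib.parse.urljoin, which resolves a link relative to the page it was found on; a few examples of its behavior:

from urllib.parse import urljoin

print(urljoin("https://example.com/page", "/about"))                # https://example.com/about
print(urljoin("https://example.com/blog/", "post-1"))               # https://example.com/blog/post-1
print(urljoin("https://example.com/page", "https://other.com/x"))   # https://other.com/x (rejected by the domain filter)
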
Thread Pool Coordination and Termination

Create the main crawler orchestration that initializes a thread pool using threading.Thread or concurrent.futures.ThreadPoolExecutor, seeds the queue with the starting URL, and manages worker lifecycle. Implement a termination mechanism using a combination of queue empty checks and active thread counting, or use sentinel values to signal workers to exit. For example, when the queue is empty and all threads are idle, the crawl is complete.

Solution
from concurrent.futures import ThreadPoolExecutor
from queue import Queue
from urllib.parse import urlparse
import threading

def crawl_orchestrator(start_url, max_workers=5):
    url_queue = Queue()
    visited = {start_url}                      # seed URL counts as visited
    lock = threading.Lock()
    domain = urlparse(start_url).netloc        # restrict the crawl to this host
    url_queue.put(start_url)
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Each worker shares the queue, visited set, lock, and domain filter.
        futures = [executor.submit(worker, url_queue, visited, lock, domain)
                   for _ in range(max_workers)]
        # Wait for all workers to finish; surfaces any exception raised inside a worker.
        for future in futures:
            future.result()
    return visited
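
With the worker from the previous step, a crawl is started with a single call, e.g. crawl_orchestrator("https://example.com", max_workers=4), which returns the set of visited URLs.

The step description also mentions sentinel values as an alternative termination mechanism. A minimal sketch of that pattern, assuming a worker loop that treats None as a stop signal (the fetching and parsing body is elided here):

import threading
from queue import Queue

SENTINEL = None                                # special value that tells a worker to exit

def sentinel_worker(queue):
    while True:
        url = queue.get()
        if url is SENTINEL:
            queue.task_done()
            break                              # stop signal received: exit the loop
        # ... fetch url, parse links, enqueue new same-domain URLs ...
        queue.task_done()

def crawl_with_sentinels(start_url, num_workers=4):
    queue = Queue()
    queue.put(start_url)
    threads = [threading.Thread(target=sentinel_worker, args=(queue,))
               for _ in range(num_workers)]
    for t in threads:
        t.start()
    queue.join()                               # blocks until every enqueued URL is marked done
    for _ in threads:
        queue.put(SENTINEL)                    # one sentinel per worker so each exits exactly once
    for t in threads:
        t.join()
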
Thread-Safe State Management

Implement synchronization primitives to protect shared resources: use threading.Lock for the visited set to prevent duplicate URL processing, and leverage queue.Queue's built-in thread safety for the URL queue. Add atomic check-and-add operations where a thread checks if a URL is visited and marks it visited in one locked section. For instance, wrap if url not in visited: visited.add(url) in a lock context to prevent race conditions.

Solution
import threading
from queue import Queue

class ThreadSafeState:
    """Bundles the shared crawl state behind thread-safe operations."""

    def __init__(self):
        self.visited = set()
        self.visited_lock = threading.Lock()
        self.url_queue = Queue()               # Queue is thread-safe on its own

    def is_visited_and_mark(self, url):
        """Atomically check whether `url` was seen and mark it visited if not."""
        with self.visited_lock:
            if url in self.visited:
                return True
            self.visited.add(url)
            return False

    def add_url(self, url):
        self.url_queue.put(url)

    def get_url(self, timeout=1):
        # Raises queue.Empty if nothing arrives within `timeout` seconds.
        return self.url_queue.get(timeout=timeout)
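
As a sketch of how a worker might consume this state (fetch_links is again a placeholder for the fetching and parsing logic), the atomic check-and-mark keeps the duplicate check and the set update inside one locked section:

from queue import Empty

def worker_loop(state, fetch_links):
    while True:
        try:
            url = state.get_url(timeout=1)             # raises Empty if no work arrives in time
        except Empty:
            break
        for link in fetch_links(url):
            if not state.is_visited_and_mark(link):    # atomic check-and-add under the lock
                state.add_url(link)
        state.url_queue.task_done()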

What We've Learned

  • Pattern: BFS with thread pool - use thread-safe queue for level-order traversal with concurrent workers
  • Use Case: Web crawlers, site mappers, link validators, and any graph traversal with I/O-bound operations
  • Concurrency: Thread-safe collections + locks prevent race conditions in multi-threaded environments
  • Optimization: Multithreading dramatically improves I/O-bound tasks like web crawling by parallelizing network requests

Problems to Practice

Breadth-First Search (Lesson, Graphs Overview)

This lesson provides foundational understanding of graph traversal using BFS, which is the core algorithm used in the web crawler problem. It covers how to traverse graphs level-by-level, which directly applies to crawling URLs in a breadth-first manner.

Course Schedule (Medium, Graphs)

This problem involves graph traversal with cycle detection and managing visited nodes, similar to tracking visited URLs in a web crawler. It also requires handling dependencies between nodes, which parallels managing the crawl queue and ensuring no duplicate processing.

Copy Graph (Easy, Depth-First Search)

This problem requires traversing a graph while maintaining a visited set to avoid infinite loops, which is essential for web crawling. It teaches how to handle graph nodes with multiple connections and track already-processed nodes, directly applicable to URL deduplication in crawlers.

Question Timeline

See when this question was last asked and where, including any notes left by other candidates.

  • Late February, 2026: Anthropic (Staff)
  • Late February, 2026: Anthropic (Senior)
  • Mid July, 2025: Anthropic (Senior)
