Design a Web Crawler

Problem Context

🕷️ A web crawler is a program that systematically browses the internet, downloading pages and following links to discover new content.

The goal: Collect raw text from 10 billion web pages in under 5 days to train a Large Language Model (LLM). That's roughly 23,000 pages per second, sustained, around the clock.

Functional Requirements

Web crawlers can do many things: indexing for search, monitoring price changes, archiving the web. It's important that you narrow down scope with your interviewer first.

Core Functional Requirements

FR1: Accept seed URLs and continuously discover new URLs by extracting links from crawled pages.
FR2: Fetch raw HTML content from web pages across the internet.
FR3: Extract plain text and store it in blob storage (S3) for downstream consumption.

Out of Scope

Search indexing and ranking
Structured data extraction (JSON-LD, schema.org)
Image/video downloading
Real-time crawling or freshness guarantees
JavaScript rendering (we'll crawl static HTML only)

Limiting scope to static HTML is a reasonable trade-off. Most of the web's textual content is accessible without JavaScript execution, and rendering adds enormous complexity.

Non-Functional Requirements

Core Non-Functional Requirements

NFR1: Crawl at ~23,000 pages/second sustained throughput.
NFR2: Never lose crawl progress (failures should not require re-crawling).
NFR3: Respect robots.txt and rate limits (max 1 request/second/domain).
NFR4: Avoid processing duplicate content from mirror sites or protocol variations.

Here are our requirements:

Let's build this.

The Set Up

Planning the Approach

The web crawler is fundamentally a data pipeline. URLs go in, text comes out. Between those endpoints, we need to:

Manage the URL Frontier: Which URLs do we need to visit? Which have we already seen?
Fetch Pages: Download HTML while being a good citizen of the web.
Extract Content: Pull out the text and discover new links to crawl.

Defining the Core Entities

URL: A web address to be crawled. Has a status: queued, fetching, fetched, parsed, failed.
Domain: A hostname (example.com). Groups URLs for rate limiting and robots.txt compliance.
Page: The raw HTML content downloaded from a URL.
Extracted Text: The cleaned text content, stripped of HTML tags and scripts.
Link: A URL discovered within a page that gets added back to the frontier.

API Design

Our crawler has a simple external interface for operators, plus internal communication between services.

Operator APIs (External)

These let the team that runs the crawler monitor and control it.

1. Start a Crawl

POST /api/v1/crawls

Request:
{
  "seedUrls": [
    "https://en.wikipedia.org",
    "https://www.bbc.com",
    "https://news.ycombinator.com"
  ],
  "maxDepth": 10,
  "targetPageCount": 10000000000
}

Response:
{
  "crawlId": "crawl_2024_llm_training",
  "status": "running",
  "createdAt": "2024-01-15T00:00:00Z"
}

What happens? The seed URLs get pushed into our URL frontier, and fetcher workers start pulling from it immediately.

2. Check Crawl Progress

GET /api/v1/crawls/{crawlId}/status

Response:
{
  "crawlId": "crawl_2024_llm_training",
  "status": "running",
  "stats": {
    "urlsQueued": 15234567890,
    "urlsFetched": 8234567890,
    "urlsParsed": 8134567890,
    "urlsFailed": 12345678,
    "textStoredGB": 47234
  },
  "throughput": {
    "currentPagesPerSecond": 24103,
    "averagePagesPerSecond": 23456
  }
}

3. Pause or Resume

POST /api/v1/crawls/{crawlId}/pause
POST /api/v1/crawls/{crawlId}/resume

This would be useful for throttling during maintenance or if we're hitting infrastructure limits.

High-Level Design

Let's build this step by step, starting from the simplest working implementation and rebuilding for scale

1) The Simplest Crawler

Starting off, we will have one loop and one URL at a time.

This actually works for a few hundred pages. Then it doesn't. What breaks?

Sequential execution: One URL at a time means ~1 page/second. At that rate, 10 billion pages takes 317 years.
Memory overflow: The URL list grows unbounded as we discover millions of links.
No crash recovery: If the process dies, we start over from scratch.

We need to handle more than one URL.

2) Add a URL Queue: FR1

Instead of an in-memory list, let's use a proper queue. This allows for parallelism.

This is better. The queue persists even if the fetcher crashes. We can run multiple fetchers.

What breaks?

Where does the text go? We're still saving to local disk, which fills up fast.
Where does the HTML go? If parsing fails, we have to re-fetch the whole page.

We need proper storage.

3) Add Blob Storage: FR2, FR3

Let's separate storage from compute. Raw HTML and extracted text both go to S3.

Now we have durable storage! But there's a tiny problem.

What breaks?

Coupled pipeline: The fetcher does BOTH fetching AND parsing. If parsing fails (malformed HTML), we have to re-fetch the page.
Blocked on slow parsing: Fetchers wait while extracting text instead of fetching more pages.

We need to decouple fetch from parse.

4) Separate Fetch and Parse: FR2, FR3

Here, we split the monolithic fetcher into two services.

Why is this better?

Fetcher crashes? HTML is already in S3. Parser picks it up later.
Parser crashes? HTML is still there. Another parser instance retries.
Fetcher can stay busy: It doesn't wait for parsing.

S3 event notifications trigger the parser automatically when new HTML appears. This is cleaner than polling and scales naturally.

What breaks?

No visibility: We have no idea which URLs have been fetched, parsed, or failed.

We need a state tracker.

5) Add Metadata Database: FR1, FR2, FR3

Every URL should have a status. Let's track it with a Metadata DB (DynamoDB), a NoSQL key-value store that handles high write throughput and scales horizontally via consistent hashing.

Now we can:

Query how many URLs are in each state
Resume from exactly where we left off after a crash
Detect stuck URLs that never completed

What breaks?

Still too slow: One fetcher can do maybe 100 pages/second. We need 23,000.

Time to scale horizontally.

6) Scale Fetchers Horizontally: FR2

More workers mean more throughput.

SQS visibility timeouts are key here. When a fetcher pulls a URL, it becomes invisible to other fetchers for N seconds. If the fetcher doesn't delete it (because it crashed), the URL reappears for retry.

What breaks?

We're attacking websites: 230 fetchers hitting the same domain simultaneously will get us blocked.

We have to ensure politeness for our crawler.

7) Complete System: FR1, FR2, FR3 ✅

Let's put it all together. Here's the complete working system:

This is our baseline architecture. We have a complete, working system that satisfies all functional requirements (FR1, FR2, FR3). Now we can address our non-functional requirements and more in the deep dives:

NFR1 (Efficiency): How do we scale DNS lookups?
NFR2 (Fault Tolerance): How do we recover from failures?
NFR3 (Politeness): How do we respect websites?
NFR4 (Deduplication): How do we avoid duplicate content?

Potential Deep Dives

1) How do we respect websites (Politeness)?: NFR3

Every request we make costs the target server CPU, bandwidth, and money. If we crawl too aggressively, websites will:

Block our IP addresses
Return error pages instead of content
(rarely but still possible) take legal action

Robots.txt Compliance

Every website can publish a robots.txt file at its root that specifies crawling rules:

# example.com/robots.txt

User-agent: *
Disallow: /private/
Disallow: /admin/
Crawl-delay: 2

User-agent: GPTBot
Disallow: /

Our crawler must:

Fetch robots.txt before crawling any URL on a domain
Parse and cache the rules in Redis
Check every URL against the rules before fetching

What about domains without robots.txt?

If robots.txt returns 404, assume everything is allowed. If it times out, treat the domain as temporarily unavailable and retry later.

Rate Limiting

Even if robots.txt allows access, we shouldn't flood a domain with requests. Standard practice is 1 request per second per domain.

We use Redis as a per-domain counter. Each domain gets a key that tracks how many requests have been made in the current second.

How it works:

Before fetching example.com, the fetcher runs INCR example.com in Redis
INCR atomically adds 1 to the key and returns the new value. If the key doesn't exist, Redis creates it with value 1
The fetcher also sets EXPIRE 1s on the key (meaning it auto-deletes after 1 second)
If INCR returns 1: You're the first to request this domain in the current second. Proceed with the fetch.
If INCR returns > 1: Another fetcher already hit this domain this second. Wait and retry later.

The INCR and EXPIRE commands must happen atomically (using MULTI/EXEC), otherwise a crash between them could leave a key that never expires, permanently blocking that domain.

What about Crawl-delay in robots.txt?

Some sites specify a custom delay (Crawl-delay: 5 means wait 5 seconds between requests). To respect this:

When parsing robots.txt, store the crawl-delay per domain in Redis (crawl_delay:cnn.com is 5)
When rate limiting, instead of hardcoding EXPIRE 1, look up that domain's delay
Use EXPIRE {crawl_delay} instead

Jitter (Avoiding Thundering Herds)

Imagine the frontier queue suddenly has 100 CNN URLs (from parsing a popular page). 100 fetchers pull them simultaneously, all get rate-limited (from Redis), and all retry after exactly 1 second, creating the same spike again. Add random jitter to each retry delay so fetchers wake up at staggered times.

Implementation: sleep(base_delay + random(0, 500ms)) before each request.

2) How do we recover from failures?: NFR2 (Fault Tolerance)

At 23,000 pages/second with hundreds of workers, failures are guaranteed. Servers crash, but we need to handle this gracefully.

SQS Visibility Timeouts

The URL Frontier uses SQS, which has built-in fault tolerance via visibility timeouts. When a fetcher pulls a message, it becomes invisible to others for N seconds. If the fetcher doesn't delete it (because it crashed), the message reappears automatically.

Decoupling Saves Expensive Work

Remember when we split fetch and parse? This matters for fault tolerance because the network fetch is the expensive part.

Dead Letter Queues

Some URLs will always fail (site permanently down, malformed HTML that crashes the parser). After N retries, move them to a dead letter queue for human review.

Exponential Backoff

When a site returns 429 (rate limited) or 5xx (server error), don't immediately retry. Back off exponentially:

   1st failure: wait 1s
   2nd failure: wait 2s
   3rd failure: wait 4s
   4th failure: wait 8s
   ... and so on

This prevents hammering a struggling server and gives it time to recover.

3) How do we avoid duplicate content?: NFR4 (Deduplication)

The web is full of duplicates:

http://example.com vs https://example.com
example.com/page vs example.com/page?utm_source=twitter
Mirror sites hosting identical content

Crawling duplicates wastes resources and pollutes our training data with repetition.

URL Normalization

Before adding any URL to the frontier, normalize it to catch obvious duplicates.

   BEFORE: HTTP://EXAMPLE.COM:80/path/../other?b=2&a=1#section

   AFTER:  http://example.com/other?a=1&b=2

Normalization rules: lowercase host, remove default port, resolve .., sort query params, strip fragments, remove tracking params (utm_*, fbclid).

4) How do we avoid getting trapped?: NFR1 (Efficiency)

Some websites create infinite loops: calendar pages (/2024/01/01, /2024/01/02, ...), session IDs in URLs, infinite pagination.

One trade-off is over-filtering. There are legitimate sites with millions of pages, such as Wikipedia and Amazon.

What to Expect?

This system is deceptively deep. The basic architecture is straightforward, but the devil is in operational details.

Mid-Level

Breadth-oriented (80/20): Focus on getting the complete pipeline working: URL Frontier to Fetcher to Storage to Parser to Frontier.
Expect guidance on trade-offs: The interviewer will prompt you on questions like "What happens if a fetcher crashes?"
Must demonstrate understanding of: Why we need a queue, why we separate fetch from parse, what robots.txt looks like.
The bar: Complete the HLD, explain the data flow clearly, acknowledge politeness is needed even if details aren't very concrete.

Senior

Balanced (60/40): Go deeper on areas you've worked with. If you've built data pipelines, flex that. If you've dealt with external APIs, talk about rate limiting in depth.
Justify technology choices: "I'd use SQS over Kafka here because we need visibility timeouts for automatic retry, and we don't need message replay or ordering guarantees."
Address failure modes proactively: Before the interviewer asks "What if X fails?", you should be mentioning it.
The bar: Complete system through Diagram 7, deep dive into 2-3 areas (Politeness + Fault Tolerance are essential), discuss scaling numbers (how many fetchers for 23K/sec?).

Staff

Depth-oriented (40/60): Breeze through the architecture in 15 minutes. Spend the rest on the hard parts.
Production-ready thinking: Talk about DNS rate limits before being asked. Mention jitter for thundering herd prevention. Discuss how you'd handle different regions or detect crawler traps.
Drive the conversation: You should be asking clarifying questions: "Do we care about JavaScript-rendered content? What's our budget for DNS resolvers? How fresh does the data need to be?"
The bar: All deep dives addressed substantively. Discussion of operational concerns: monitoring, alerting, cost estimation. Ability to adapt the design for different constraints (stricter politeness, real-time requirements, specific data quality needs).

Do a mock interview of this question with AI & pass your real interview. Good luck! 🕷️

Design Web Crawler

Design a Web Crawler

Problem Context

Functional Requirements

Core Functional Requirements

Out of Scope

Non-Functional Requirements

Core Non-Functional Requirements

The Set Up

Planning the Approach

Defining the Core Entities

API Design

Operator APIs (External)

High-Level Design

1) The Simplest Crawler

2) Add a URL Queue: FR1

3) Add Blob Storage: FR2, FR3

4) Separate Fetch and Parse: FR2, FR3

5) Add Metadata Database: FR1, FR2, FR3

6) Scale Fetchers Horizontally: FR2

7) Complete System: FR1, FR2, FR3 ✅

Potential Deep Dives

1) How do we respect websites (Politeness)?: NFR3

Robots.txt Compliance

Rate Limiting

Jitter (Avoiding Thundering Herds)

2) How do we recover from failures?: NFR2 (Fault Tolerance)

SQS Visibility Timeouts

Decoupling Saves Expensive Work

Dead Letter Queues

Exponential Backoff

3) How do we avoid duplicate content?: NFR4 (Deduplication)

URL Normalization

4) How do we avoid getting trapped?: NFR1 (Efficiency)

What to Expect?

Mid-Level

Senior

Staff