Design Yelp

Problem Context

🍽️ Yelp is a local business discovery platform where users find restaurants, shops, and services based on location, ratings, and reviews.

Functional Requirements

Yelp does a lot: reservations, food delivery, check-ins, photos. But the main product is discovering local businesses and reading reviews. We will focus on that for our article.

Core Functional Requirements

FR1: Users should be able to search for businesses by location and keywords.
FR2: Users should be able to view business details, including aggregated ratings and reviews.
FR3: Users should be able to write reviews and rate businesses.

Out of Scope:

User authentication and profiles.
Reservations and waitlist management (Yelp Reserve).
Food ordering and delivery (Yelp Eat24).
Business owner tools and advertising.
Check-ins and social features.

Note: Yelp displays business photos, but we won't cover image storage or CDN delivery here. Those patterns are explained in YouTube and Instagram. This article focuses on geospatial search and indexing businesses, but if your interviewer asks about image storage, you can take that into account.

Non-Functional Requirements

Core Non-Functional Requirements

NFR1: Search results should return in < 200ms.
NFR2: System should be highly available (99.99%+).
NFR3: System should handle 50K+ searches per second during peak hours.
NFR4: New reviews should be reflected in ratings within 1 minute.

Here's what we have so far:

Let's build Yelp.

The Set Up

The Core Challenge

When someone searches "sushi near downtown Seattle," we need to:

Find all businesses within a geographic boundary (location filter)
Filter to those matching "sushi" (text search)
Rank by some combination of rating, distance, and relevance
Return results in under 200ms

With 200M+ businesses worldwide, we can't scan everything. We need indexes that understand both geography and text.

Core Entities

These entities form the foundation of our data model. The relationship between businesses and reviews drives many of our design decisions.

Business

{
  businessId: "biz_abc123",
  name: "Sakura Sushi",
  categories: ["Japanese", "Sushi Bars"],
  location: {
    address: "123 Pike St, Seattle, WA",
    lat: 47.6097,
    lng: -122.3331
  },
  avgRating: 4.3,
  reviewCount: 847,
  priceRange: "$$"
}

Review

{
  reviewId: "rev_xyz789",
  businessId: "biz_abc123",
  userId: "user_456",
  rating: 5,
  text: "Best salmon sashimi in Seattle! Fresh fish...",
  createdAt: "2026-01-09T14:30:00Z",
  usefulCount: 23
}

Search Query

{
  query: "sushi",
  location: {lat: 47.6062, lng: -122.3321},
  radiusMeters: 5000,
  sortBy: "best_match",      // best_match | rating | distance
  limit: 20
}

API Interface

We can split our APIs into Search (FR1), Business Details (FR2), and Reviews (FR3).

Search API: FR1

GET /api/v1/search?term=sushi&lat=47.6062&lng=-122.3321&radius=5000&limit=20&cursor={cursor}

Response:
{
  "businesses": [
    { "businessId", "name", "rating", "reviewCount", "distance", ... }
  ],
  "nextCursor": "cursor_page2"
}

The search service parses coordinates, queries the geospatial index, filters by text match, ranks by relevance, and returns paginated results.

Business Details API: FR2

GET /api/v1/businesses/{businessId}

Response:
{
  "businessId": "biz_abc123",
  "name": "Sakura Sushi",
  "rating": 4.3,
  "reviewCount": 847,
  "ratingBreakdown": { "5": 412, "4": 289, ... },
  "location": { "address", "city", "lat", "lng" },
  ...
}

GET /api/v1/businesses/{businessId}/reviews?sortBy=recent&limit=20&cursor={cursor}

Response:
{
  "reviews": [
    { "reviewId", "userId", "rating", "text", "createdAt", ... }
  ],
  "nextCursor": "cursor_next"
}

Review API: FR3

POST /api/v1/businesses/{businessId}/reviews
Body: { userId, rating, text }

Response: { reviewId, status, createdAt }

The rating aggregation step is trickier than it looks. Recomputing an average across 10,000 reviews on every new review is expensive. We'll address this in the deep dives.

High-Level Design

We'll start with the simplest working system and fix problems as they arrive.

In an interview, acknowledge you're starting simple. You need to get a working system before optimizing.

1) The Simplest Search: FR1 (Basic Search)

When someone searches "sushi near Seattle," just query the database:

Why does this break?

Full table scan: Every search scans 200M rows
LIKE '%sushi%': Can't use indexes (must check every row)
Distance calculation: Computing haversine for 200M rows is incredibly slow

This might take 30 seconds per query. We need specialized indexes.

2) Add Geospatial Index: FR1 (Location Search)

The key insight: we can divide the world into a grid and index businesses by grid cell. When searching, we only look at cells within the search radius.

What's a geohash?

A geohash encodes latitude/longitude into a short string. Nearby locations share a common prefix:

c23nb62w = exact location
c23nb = city block (~150m * 150m)
c23n = neighborhood (~1km * 1km)
c23 = city area (~5km * 5km)

By indexing businesses by geohash prefix, we can quickly find all businesses in a geographic area.

What breaks?

We can find businesses in an area, but we still need text search for "sushi"
Two separate indexes (geo + text) means we're doing work twice

3) Add Text Search: FR1 (Keyword Matching)

Now we need to filter by keywords. We add a text search layer:

Elasticsearch handles both geospatial queries (geo_distance) and text search (match) in a single index. This is our search layer.

What breaks?

We have search, but no data for business details or reviews
Search returns IDs, but we need full business data

We need to separate our data stores by use case.

4) Separate Read Paths: FR1 + FR2 (Search vs. Details)

Search and details have different access patterns:

Search: Filter millions of records, return 20 results (needs specialized indexes)
Details: Fetch one business by ID (needs fast key-value lookup)

What we added:

Elasticsearch: Handles geographic and text search efficiently
DynamoDB: Fast key-value lookups for business details
Hydration: Search returns IDs, then we fetch full details

What breaks?

Where do reviews come from?
How do new businesses and reviews get into the search index?

5) Add Review Storage: FR2 + FR3 (Reviews)

Reviews are a high-volume write path. Users read reviews on business pages and write new ones. Let's add the review layer:

DynamoDB design for reviews:

Partition Key: businessId (all reviews for one business)
Sort Key: createdAt (sorted by date for pagination)

This lets us efficiently query (Get reviews for business X, sorted by date, page 2)

What breaks?

Updates to business rating and search index are synchronous
If Elasticsearch is slow, review submission takes a long time
How does the rating updater know when a new review is made?

We need asynchronous processing.

6) Add Async Processing: FR3 + NFR4 (Write Path)

Decouple the write path so review submission is fast:

The flow now:

User submits review, we store it in the Reviews DB
Event published to Kafka, we return immediately
Rating Worker recalculates business average (async)
Index Worker updates Elasticsearch (async)

New reviews are reflected in search results swiftly.

Complete System

This is our baseline architecture!

We satisfy all functional requirements:

FR1 ✅ Search businesses by location and keywords (Elasticsearch with geo and text)
FR2 ✅ View business details and reviews (DynamoDB lookups)
FR3 ✅ Write reviews (async rating updates via Kafka)

Now we can address our non-functional requirements in the deep dives:

NFR1 (Latency): How does geospatial indexing actually work?
NFR1 (Latency): How do we paginate results efficiently?
NFR3 (Scale): How do we handle 50K searches/sec?
NFR2 (Availability): What happens when things fail?

Potential Deep Dives

1) How does geospatial search actually work?: NFR1 (Latency)

In Diagram 2, we mentioned geohashing. Let's go deeper.

The problem: Given coordinates (47.6062, -122.3321) and radius 5km, find all businesses in that circle. With 200M businesses, we can't check each one.

Geohashing explained:

Geohashing divides the world into a grid. Each grid cell gets a string identifier. Importantly, nearby cells share prefixes.

Index structure:

Edge case: partial cells

Some cells are fully inside your search radius (include all businesses). Some are fully outside (skip entirely). But cells at the edge are only partially inside that radius, so you must fetch all businesses from that cell and check each one's exact distance.

Elasticsearch handles this automatically with geo_distance queries. Under the hood, it uses a similar grid-based approach.

2) How do we rank search results?: NFR1 (Latency + Relevance)

Returning results in the order they appear in the geo index isn't useful. Users expect the best results first.

For a typical "sushi near me" query (5km radius), we might get 100-500 matching businesses in a dense city. Ranking a few hundred candidates is straightforward. Elasticsearch scores them based on multiple factors:

Ranking factors:

Two-phase ranking (for broader searches):

For a 5km radius search, we rank all 100-500 candidates directly. But for city-wide searches or vague queries like "food" that match thousands of businesses, we use two phases:

Phase 1 (Elasticsearch): Fast filter 8,000 candidates down to 200 using simple scoring.
Phase 2 (Application): Rich ranking of 200 candidates down to 20 results (add personalization).

Most queries don't need this, but two-phase ranking handles edge cases of broader searches.

3) How do we paginate search results?: NFR1 (Latency)

Users scroll through results. We can't return all 500 sushi restaurants at once. There are two approaches:

Offset pagination:

GET /api/v1/search?term=sushi&lat=47.60&lng=-122.33&offset=20&limit=20

"Give me results 21-40"

Simple, but breaks at scale. To get page 50, the database must skip 1000 rows then return 20. The deeper you go, the slower it gets.

Cursor pagination:

GET /api/v1/search?term=sushi&lat=47.60&lng=-122.33&cursor=biz_xyz&limit=20

"Give me 20 results after biz_xyz"

The cursor encodes where we left off (the last result's ID). Elasticsearch jumps directly to that position.

4) How do we handle 50K searches per second?: NFR3 (Scale)

Our search path goes: User to Search Service to Elasticsearch to Business DB to Response.

At 50K requests/sec, we need to scale every layer.

Layer 1: Caching popular queries

We have many searches in for popular areas. "pizza in Manhattan" is searched thousands of times per hour, so we can cache these!

Cache key normalization:

We need to normalize coordinates, or cache misses pile up:

40.7128, -74.0060 normalizes to 40.71, -74.01
This way, nearby searches share cache entries

Layer 2: Elasticsearch sharding

Elasticsearch distributes data across multiple nodes. Each node handles a subset of businesses:

5) What happens when things fail?: NFR2 (Availability)

At 99.99% availability, we can only have 52 minutes of downtime per year. We need redundancy at every layer.

Elasticsearch failover:

DynamoDB availability:

DynamoDB is fully managed with built-in replication across availability zones. If one AZ fails, reads automatically route to another.

Graceful degradation:

If Elasticsearch is slow/down, we can:

Return cached results (even if slightly stale)
Show popular nearby from a pre-computed list
Reduce result count (return 10 instead of 20)

6) How do we efficiently update ratings?: NFR4 (Freshness)

In the HLD, the Index Worker for rankings recalculates the business average. But recomputing AVG(rating) across 10,000 reviews on every new review is O(n) and expensive.

The Solution: Running Totals

Store two more values per business in DynamoDB:

totalRatingSum: Running sum of all ratings
reviewCount: Number of reviews

On new review (rating = 5):

UPDATE businesses 
SET totalRatingSum = totalRatingSum + 5,
    reviewCount = reviewCount + 1,
    avgRating = (totalRatingSum + 5) / (reviewCount + 1)
WHERE businessId = ?

Result:

O(1) update instead of O(n) aggregation
No need to read all reviews
The index worker just does a simple increment and division

What to Expect?

That covered a lot. Here's what matters at each level.

Mid-level

Breadth over Depth (80/20): Cover the end-to-end flow from search query to results. Know that geospatial indexing is required and know the technical details behind it.
Expect Basic Probing: "Why can't we just use SQL for this?" "How do ratings get updated?"
Assisted Driving: You design the search path first, and the interviewer may guide you through the write path.
The Bar: Complete the HLD with complete understanding. Explain why we need specialized indexes (geo + text). Know that async processing exists for writes, and walk through deep dives, even if they aren't super in depth.

Senior

Balanced Breadth & Depth (60/40): You should comfortably explain geohashing, ranking factors, and rating aggregation. Pick 2-3 deep dives to discuss.
Proactive Problem-Solving: You identify the rating aggregation problem before asked. You mention caching for popular queries without prompting.
Articulate Trade-offs: "We could use a QuadTree instead of geohash, but geohash strings work with standard indexes while QuadTrees need custom code."
The Bar: Complete the HLD, and proactively discuss at least 2 deep dives.

Staff

Depth over Breadth (40/60): Breeze through the HLD in 15 minutes (your interviewer assumes you know search basics). Spend time on the hard problems: ranking at scale, handling write spikes, and consistency.
Experience-Backed Decisions: "In my experience with Elasticsearch geo queries, we found that 6-character geohashes work best for 5km searches, but need refinement for exact boundaries."
Full Proactivity: You drive the conversation. You bring up problems the interviewer didn't ask: "What about fake reviews? We need rate limiting and fraud detection on the write path."
The Bar: Address ALL deep dives without prompting. Discuss operational concerns (monitoring search latency p99, when caching would be unnecessary, etc).

Do a mock interview of this question with AI & pass your real interview. Good luck! 🍽️

Design Yelp (Reviews + Search)

Design Yelp

Problem Context

Functional Requirements

Core Functional Requirements

Out of Scope:

Non-Functional Requirements

Core Non-Functional Requirements

The Set Up

The Core Challenge

Core Entities

API Interface

Search API: FR1

Business Details API: FR2

Review API: FR3

High-Level Design

1) The Simplest Search: FR1 (Basic Search)

2) Add Geospatial Index: FR1 (Location Search)

3) Add Text Search: FR1 (Keyword Matching)

4) Separate Read Paths: FR1 + FR2 (Search vs. Details)

5) Add Review Storage: FR2 + FR3 (Reviews)

6) Add Async Processing: FR3 + NFR4 (Write Path)

Complete System

Potential Deep Dives

1) How does geospatial search actually work?: NFR1 (Latency)

2) How do we rank search results?: NFR1 (Latency + Relevance)

3) How do we paginate search results?: NFR1 (Latency)

4) How do we handle 50K searches per second?: NFR3 (Scale)

5) What happens when things fail?: NFR2 (Availability)

6) How do we efficiently update ratings?: NFR4 (Freshness)

What to Expect?

Mid-level

Senior

Staff