Design Job Scheduler (Distributed CRON)

Problem Context

⏰ A job scheduler is a system that executes tasks at specified times or intervals.

Functional Requirements

Core Functional Requirements

FR1: Users should be able to create, update, and delete scheduled jobs.
FR2: The system should trigger jobs at their scheduled time (one-time or recurring).
FR3: Users should be able to view job execution history and status.

Out of Scope:

The actual job execution logic (we trigger a callback URL).
Complex DAG dependencies (that's Airflow's domain).
User authentication and authorization.
Job prioritization and resource allocation.

In an interview, explicitly stating what's out of scope for this system prevents scope creep and shows product thinking to your interviewer.

Non-Functional Requirements

Core Non-Functional Requirements

NFR1: Jobs should fire within 1 second of their scheduled time.
NFR2: Jobs must execute exactly once (no duplicates or misses).
NFR3: System should handle 10M+ scheduled jobs.
NFR4: 99.99% availability with no single point of failure.

Here's what we have so far:

Let's build this system.

The Set Up

Planning the Approach

We have two main paths through the system:

Job Management: Users create and configure jobs (CRUD operations).
Job Execution: The scheduler detects due jobs and triggers them reliably.

The complexity lives in execution: ensuring jobs run on time, exactly once, even when machines fail.

In an interview, start with a working system before optimizing. Note the flaws to you interviewer but also that you will address them in the deep dives.

Defining the Core Entities

Job: The scheduled task itself (ID, name, schedule, callback URL, payload).
Schedule: When to run. This can be either a CRON expression (0 9 * * * = 9am daily) or a one-time timestamp.
Execution: A single run of a job (execution ID, status, start time, end time, result).

CRON expressions follow the format: minute hour day-of-month month day-of-week. For example, */5 * * * * means "every 5 minutes."

API Interface

Our APIs serve two purposes: Job Management (external, for users) and Execution Callbacks (internal, how we trigger jobs).

Job Management APIs (External): FR1, FR3

These are how users interact with the scheduler.

1. Create Job

POST /api/v1/jobs

Request:
{
  "name": "daily-report-generator",
  "schedule": "0 9 * * *",           // CRON: 9am every day
  "callback_url": "https://reports.internal/generate",
  "payload": { "report_type": "sales" }
}

Response:
{
  "job_id": "job_abc123",
  "next_run_at": "2024-01-12T09:00:00Z"
}

The callback_url is where we send an HTTP POST when the job is due.

2. Get Job

GET /api/v1/jobs/{job_id}

Response:
{
  "job_id": "job_abc123",
  "name": "daily-report-generator",
  "schedule": "0 9 * * *",
  "next_run_at": "2024-01-12T09:00:00Z"
}

3. Update Job

PUT /api/v1/jobs/{job_id}

Request:
{
  "schedule": "0 10 * * *"           // Changed to 10am
}

Response:
{
  "next_run_at": "2024-01-12T10:00:00Z"
}

4. List Executions

GET /api/v1/jobs/{job_id}/executions?limit=10

Response:
{
  "executions": [
    {
      "execution_id": "exec_xyz789",
      "status": "success",
      "started_at": "2024-01-11T09:00:00Z",
      "http_status": 200
    },
    {
      "execution_id": "exec_xyz788",
      "status": "failed",
      "started_at": "2024-01-10T09:00:00Z",
      "error": "Connection timeout"
    }
  ]
}

This supports FR3 (viewing execution history).

High-Level Design

Let's build a HLD that satisfies all functional requirements:

FR1: Create/update/delete jobs
FR2: Trigger at scheduled time
FR3: View execution history

We'll start from the basics to get a working system and then we will rebuild to address scale.

1) Single Server Polling

Let's start with one server and a database. The server wakes up periodically, checks for due jobs, and executes them.

This works! Users create jobs (FR1), the server triggers them (FR2), and we could log executions (FR3).

But what breaks?

Polling is inefficient: We're querying ALL jobs every second. At 10M jobs, this is expensive.
Single point of failure: If the server dies, nothing runs.
No exactly-once guarantee: If the server crashes after calling the callback but before updating the database, restarting could trigger the job again.

2) Priority Queue

Instead of scanning all jobs, we keep them sorted by next_run_at. The soonest job is always at the top, so we just check if the top job is due. We use a min-heap (priority queue) to keep jobs sorted.

Now we only process jobs that are actually due efficiently.

But what breaks?

Still a single point of failure: One server crash stops everything.
In-memory queue is volatile: Server restart loses state.
Can't scale: One server can only trigger so many callbacks per second.

3) Distributed Workers: Multiple Schedulers

Let's add multiple scheduler instances for availability and throughput.

Each worker polls the database independently. More workers means more throughput, and if one dies, others continue.

But what breaks?

Duplicate execution: If Worker 1 and Worker 2 both poll at 09:00:00, they might both see job_abc is due and both execute it. Sending two daily reports is not good.

This is the main challenge of distributed schedulers.

4) Distributed Locking

To prevent duplicates, workers must claim a job before executing it. We use a lock so that only one worker can hold the lock at a time.

Redis SET NX EX is atomic, so only one worker wins the lock. The expiration (EX 30) is a safety net. If Worker 1 crashes, the lock auto-releases after 30 seconds.

But what breaks?

Lock contention at scale: With 10M jobs, many becoming due at once (all "hourly" jobs at :00), workers fight over the same jobs. Redis becomes a bottleneck.
Thundering herd: At popular times (top of the hour), many jobs become due simultaneously.

5) Scaling: Partition Jobs Across Workers

Instead of all workers competing for all jobs, we partition jobs so each worker owns a subset. Workers only process their assigned partitions.

Now Worker 1 only queries: SELECT * FROM jobs WHERE partition_id = 0 AND next_run_at <= NOW(). No more lock contention!

But what breaks?

What if Worker 1 dies? Partition 0 jobs stop running.
How do workers know their assignments? We need coordination.
What if we reassign partitions? During rebalancing, both the old worker (zombie) and new owner might execute the same job, causing duplicates.

We'll address these in the deep dives.

6) Complete HLD: FR1, FR2, FR3 ✅

Database Tables:

Jobs Table: job_id, name, schedule, callback_url, payload, next_run_at, partition_id, status, created_at
- The partition_id is calculated using consistent hashing when the job is created
Executions: Not shown in the main diagram, but we can asynchronously save execution results (status, retries, etc.) to a separate table or log store for history and debugging.

This is our baseline architecture. We've satisfied:

FR1 ✅ CRUD via API Service
FR2 ✅ Workers trigger jobs at scheduled time
FR3 ✅ Executions logged and queryable

Now we can address our non-functional requirements in the deep dives:

NFR1 (Accuracy): How do we ensure jobs fire within 1 second?
NFR2 (Exactly-once): How do we guarantee no duplicates even with failures?
NFR3 (Scale): How do we handle 10M+ jobs?
NFR4 (Availability): What happens when workers or the coordinator fail?

Potential Deep Dives

1) How do we guarantee exactly-once execution?: NFR2

Partitions ensure only one worker processes each job, but what if a worker crashes and the partition is reassigned?

The callback service must implement idempotency: store the idempotency key and response for ~24 hours. If the same key arrives again, return the cached response without re-executing.

2) How do we maintain timing accuracy at scale?: NFR1

With 10M jobs, we can't poll them all every second. And jobs clustered at popular times (top of the hour) create spikes.

Problem: Thundering herd at :00

Fix: Time-bucketed pre-fetching

Instead of waiting until 09:00:00 to discover due jobs, workers fetch upcoming jobs in advance and hold them in memory.

Workers continuously pre-fetch. By the time 09:00:00 arrives, jobs are already loaded.

This shifts database load from execution time (when precision matters) to earlier (when we have slack).

Handling updates: If a user updates a job's schedule, the API Service can push an invalidation message to workers. Workers discard the stale pre-fetched job and re-query if needed.

3) How do we handle 10M+ jobs?: NFR3

We covered partitioning in the HLD. Let's go deeper on the strategy.

When jobs are created, we assign them a partition_id using consistent hashing and store it in the jobs table. Workers query only their assigned partitions: SELECT * FROM jobs WHERE partition_id IN (0, 1, 2) AND next_run_at <= NOW().

Option 1: Consistent hashing to distribute jobs across workers (what we use)

Consistent hashing ensures minimal data movement when scaling. See the Distributed KV Store article for a deep dive on how consistent hashing works.

Option 2: Time-range partitioning

Hybrid: Use consistent hashing with enough partitions (like 1024) to spread load. Combined with time-bucketed pre-fetching, this handles scale well.

Database Sharding: The database itself can also be sharded by partition_id using consistent hashing. Each database shard contains jobs for specific partitions. Workers query the appropriate shard based on their assigned partitions.

4) What happens when a worker dies?: NFR4

The coordinator (ZooKeeper/etcd) detects failures and reassigns partitions.

Heartbeats: Workers send periodic heartbeats to the coordinator (ex. every 3 seconds). If the coordinator misses N consecutive heartbeats, it considers the worker dead.

There's a trade-off: shorter heartbeat intervals mean faster failure detection but more overhead. Typical values are 3s heartbeat, 10s failure threshold.

What to Expect?

Let's summarize what you should cover at each level.

Mid-level

Breadth over Depth (80/20): Get to a working system (the HLD). Understand why we need partitioning, even if you can't explain all edge cases.
Expect Probing: The interviewer may ask, "What happens if a worker dies?" or "How do you prevent running a job twice?"
Assisted Driving: The interviewer may guide you through trickier parts.
The Bar: Complete the HLD and demonstrate you understand the key challenges in the deep dives (exactly-once, scale, timing accuracy).

Senior

Balanced Breadth & Depth (60/40): Quickly build the HLD, then dive into 2-3 critical areas based on interviewer questions or NFRs you identify as most important.
Proactive Problem-Solving: You identify issues before the interviewer: "At scale, we'll have thundering herds at popular times. Here's how we handle that..."
Articulate Trade-offs: "Consistent hashing gives us better scaling properties when adding partitions, but time-range partitioning would spread thundering herds more naturally at the cost of complexity."
The Bar: Design the full system and explain solutions to the key challenges (exactly-once semantics, timing accuracy at scale, and failure recovery).

Staff

Depth over Breadth (40/60): Breeze through the HLD in ~15 minutes. Spend time on the hard problems: exactly-once in distributed systems, handling thundering herds, or coordinator failure modes.
Experience-Backed Decisions: Draw from real-world: "At my previous company, we used ___ and hit X problem. Here's how we'd avoid that."
Full Proactivity: You drive the conversation. "Before we go further, let me address how we guarantee exactly-once. Here's what could go wrong and how we'd handle it..."
The Bar: Address all major challenges unprompted. Discuss factors like why idempotency keys are critical even with partitioning.

Practice with AI mock interviews and nail your real one. Good luck! ⏰

Design Job Scheduler (Distributed Cron)

Design Job Scheduler (Distributed CRON)

Problem Context

Functional Requirements

Core Functional Requirements

Out of Scope:

Non-Functional Requirements

Core Non-Functional Requirements

The Set Up

Planning the Approach

Defining the Core Entities

API Interface

Job Management APIs (External): FR1, FR3

High-Level Design

1) Single Server Polling

2) Priority Queue

3) Distributed Workers: Multiple Schedulers

4) Distributed Locking

5) Scaling: Partition Jobs Across Workers

6) Complete HLD: FR1, FR2, FR3 ✅

Potential Deep Dives

1) How do we guarantee exactly-once execution?: NFR2

2) How do we maintain timing accuracy at scale?: NFR1

3) How do we handle 10M+ jobs?: NFR3

4) What happens when a worker dies?: NFR4

What to Expect?

Mid-level

Senior

Staff