Build an AI Agent Task Queue with Resumability

distributed-systems reliability scalability

System Design Deep Dive

AI Agent Task Queue

When agents run for minutes, not milliseconds, every failure must be recoverable from where it stopped

⏱ 14 min read📐 Advanced🏗️ Agentic AI

A user asks an AI agent: “Research our competitor’s pricing, summarize their feature differences, and draft a competitive analysis document.” This is not a 200ms API call. It involves 10-20 tool invocations, multiple LLM completions, file reads, web searches, and intermediate reasoning - spanning 30 seconds to 5 minutes. Midway through step 14 of 20, your Kubernetes pod gets evicted during a node rotation. Without a recovery mechanism, the user waits 5 minutes, gets no result, and the task never completes.

Traditional request-response HTTP is the wrong model for agent workloads. Think of it like the difference between asking a barista for a coffee (30 seconds, synchronous is fine) and commissioning a contractor to renovate your kitchen (weeks, they need to be able to pick up where they left off after a weekend). Agent tasks are contracting work: long-lived, multi-step, interruptible, and expensive enough that you cannot afford to re-run from scratch.

The engineering challenge is threefold: agent state persistence (what has the agent done so far, what messages have been exchanged, what tool results are cached), resumable execution (if a worker dies at step 14, restart from step 14 not step 1), and cost cap enforcement (prevent a runaway agent from spending $500 in one task before anyone notices). Add parallel sub-agent execution for decomposable tasks, and you have a non-trivial distributed systems problem wrapped in an AI infrastructure problem.

We need to solve for durability, resumability, cost control, and parallel execution simultaneously, without making the developer API complex.

Requirements and Constraints

Functional Requirements

Accept task submissions with a goal, constraints (max tokens, max steps, timeout), and optional tool allowlist
Execute agent tasks asynchronously, returning a task_id immediately
Persist full agent state (conversation history, tool call results, step index) between execution steps
Resume task execution from the last successful checkpoint on worker failure or restart
Support spawning parallel sub-agents for decomposable sub-tasks
Enforce per-task token budget and step count caps with graceful shutdown
Provide real-time task status and result retrieval
Produce a complete execution trace for debugging and billing

Non-Functional Requirements

Task pickup latency: p99 under 5 seconds from submission
Resumption latency: p99 under 10 seconds after worker failure
Concurrent active tasks: 10,000 across all workers
Maximum task duration: 24 hours
Checkpoint durability: no more than 1 step of work lost on crash
Token accounting accuracy: within 1% of actual LLM usage
Trace completeness: 100% of tool calls and LLM calls captured

Constraints and Assumptions

Workers are stateless containers (Kubernetes pods) that can be killed anytime
The LLM API is external (OpenAI, Anthropic) - calls are not idempotent by default
Sub-agents share the parent task’s token budget
No human-in-the-loop required for tool approval (configurable but off by default)

High-Level Architecture

AI agent task queue full architecture with orchestrator, workers, and tool layer

The system has four logical layers. The orchestration layer contains the task queue (Redis or SQS), the orchestrator process (which decomposes goals and spawns sub-agents), the state store (durable checkpoint storage), and the cost tracker (token ledger). The agent worker pool is a fleet of stateless containers, each capable of executing any agent task, scaling horizontally based on queue depth. The tool execution layer provides sandboxed access to external capabilities: web search, code execution, file I/O, and external APIs. The trace layer captures every LLM call, tool call, and state transition with full context.

When a task arrives, it enters the queue with a priority level. A worker dequeues it, creates the initial checkpoint, and begins the agent execution loop: call the LLM, receive a tool call or final answer, execute the tool (if any), checkpoint the state, repeat. If the worker dies, the task returns to the queue with its checkpoint pointer intact. The next worker to pick it up reads the checkpoint and continues from the last saved step.

Key Insight

Agent resumability is fundamentally a checkpoint-before-commit pattern: write the checkpoint to durable storage before executing each tool call. If the worker dies after writing the checkpoint but before completing the tool call, the tool call is re-executed with an idempotency key - the result may be a cache hit rather than a new call.

Agent State Persistence

Agent state is everything the agent has “seen” since the task started: the full conversation history (system prompt, user message, all LLM completions, all tool results), the current step index, the accumulated token count, and metadata about each tool call.

import json
from dataclasses import dataclass, field
from typing import List, Optional, Dict, Any

@dataclass
class ToolCallRecord:
    call_id: str
    tool_name: str
    input_args: Dict[str, Any]
    output: Optional[str] = None
    status: str = "pending"  # pending | completed | failed
    tokens_used: int = 0

@dataclass
class AgentCheckpoint:
    task_id: str
    step_index: int
    messages: List[Dict]          # full conversation history
    tool_call_records: List[ToolCallRecord]
    tokens_used: int
    status: str                   # running | paused | completed | failed | cost_exceeded
    created_at: float
    worker_id: Optional[str] = None

def save_checkpoint(checkpoint: AgentCheckpoint, db) -> None:
    db.execute(
        """
        INSERT INTO task_checkpoints (task_id, step_index, state_json, tokens_used, status, updated_at)
        VALUES (%s, %s, %s, %s, %s, NOW())
        ON CONFLICT (task_id) DO UPDATE SET
          step_index  = EXCLUDED.step_index,
          state_json  = EXCLUDED.state_json,
          tokens_used = EXCLUDED.tokens_used,
          status      = EXCLUDED.status,
          updated_at  = EXCLUDED.updated_at
        """,
        (
            checkpoint.task_id,
            checkpoint.step_index,
            json.dumps({
                "messages": checkpoint.messages,
                "tool_call_records": [vars(r) for r in checkpoint.tool_call_records],
            }),
            checkpoint.tokens_used,
            checkpoint.status,
        )
    )

The ON CONFLICT DO UPDATE pattern is critical: checkpoints are upserted, not inserted, so a crash between two checkpoint writes cannot leave two conflicting rows.

Watch Out

Never checkpoint after the LLM call but before the tool call completes. If the worker crashes between those two events, on resume the agent will see the LLM’s tool request in conversation history but no tool result - it will re-request the same tool call, which is correct behavior only if the tool call was idempotent. Write-destructive tool calls (sending emails, making payments) must use idempotency keys checked before execution.

Resumable Execution

The execution loop is the heart of the agent worker. It must be restartable from any checkpoint and handle the state machine transitions correctly.

import anthropic
from typing import Optional

client = anthropic.Anthropic()

def execute_agent_task(task_id: str, db, tool_registry) -> None:
    checkpoint = load_checkpoint(task_id, db)
    if checkpoint is None:
        checkpoint = AgentCheckpoint(
            task_id=task_id, step_index=0, messages=load_task_messages(task_id, db),
            tool_call_records=[], tokens_used=0, status="running", created_at=time.time()
        )

    task_config = load_task_config(task_id, db)
    max_tokens = task_config["max_tokens"]
    max_steps = task_config["max_steps"]

    while checkpoint.step_index < max_steps:
        # Check cost cap before calling LLM
        if checkpoint.tokens_used >= max_tokens:
            checkpoint.status = "cost_exceeded"
            save_checkpoint(checkpoint, db)
            return

        # Call LLM
        response = client.messages.create(
            model="claude-opus-4-8",
            max_tokens=4096,
            messages=checkpoint.messages,
            tools=tool_registry.get_tool_definitions(),
        )

        # Account tokens immediately
        checkpoint.tokens_used += response.usage.input_tokens + response.usage.output_tokens
        checkpoint.step_index += 1

        # Append assistant message to conversation
        checkpoint.messages.append({"role": "assistant", "content": response.content})

        if response.stop_reason == "end_turn":
            checkpoint.status = "completed"
            save_checkpoint(checkpoint, db)
            store_result(task_id, response.content, db)
            return

        if response.stop_reason == "tool_use":
            for block in response.content:
                if block.type != "tool_use":
                    continue
                tool_record = ToolCallRecord(
                    call_id=block.id,
                    tool_name=block.name,
                    input_args=block.input
                )

                # Checkpoint BEFORE tool execution (idempotency safety)
                checkpoint.tool_call_records.append(tool_record)
                save_checkpoint(checkpoint, db)

                # Check if we already have a result (resume case)
                cached = find_tool_result(block.id, checkpoint.tool_call_records)
                if cached is not None:
                    tool_output = cached
                else:
                    tool_output = tool_registry.execute(block.name, block.input, idempotency_key=block.id)
                    tool_record.output = tool_output
                    tool_record.status = "completed"
                    save_checkpoint(checkpoint, db)

                # Append tool result to conversation
                checkpoint.messages.append({
                    "role": "user",
                    "content": [{"type": "tool_result", "tool_use_id": block.id, "content": tool_output}]
                })

The loop checkpoints at two points per step: once before tool execution (recording the tool call intent) and once after (recording the result). On resume, find_tool_result checks if a previous execution already ran this tool call by its call_id - if so, it returns the cached result rather than re-executing.

Real World

Anthropic’s Claude API assigns stable tool_use_id values within a message, which makes them natural idempotency keys for re-execution detection. OpenAI’s Assistants API uses a similar “step” abstraction with persistent step_id values that survive across thread runs.

Tool Call Lifecycle

Tools are the agent’s actuators. Each tool invocation must be: isolated (one tool failure should not kill the agent), observable (full input/output captured in the trace), bounded (timeout enforced), and idempotent where possible.

import subprocess
import uuid
from concurrent.futures import ThreadPoolExecutor, TimeoutError

TOOL_TIMEOUT_SECONDS = 30

def execute_tool_with_isolation(
    tool_name: str,
    tool_input: dict,
    idempotency_key: str,
    tool_registry: dict
) -> str:
    # Check idempotency cache
    cached = idempotency_cache.get(idempotency_key)
    if cached:
        return cached

    tool_fn = tool_registry.get(tool_name)
    if tool_fn is None:
        return f"Error: Unknown tool '{tool_name}'"

    with ThreadPoolExecutor(max_workers=1) as executor:
        future = executor.submit(tool_fn, **tool_input)
        try:
            result = future.result(timeout=TOOL_TIMEOUT_SECONDS)
        except TimeoutError:
            result = f"Error: Tool '{tool_name}' timed out after {TOOL_TIMEOUT_SECONDS}s"
        except Exception as e:
            result = f"Error: {type(e).__name__}: {str(e)}"

    idempotency_cache.set(idempotency_key, result, ttl=3600)
    return result

Code execution tools need stronger isolation: run in a container with no network access, CPU limits, and a process tree kill on timeout. The tool registry defines what each tool is allowed to do:

TOOL_DEFINITIONS = [
    {
        "name": "web_search",
        "description": "Search the web for current information",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {"type": "string"},
                "num_results": {"type": "integer", "default": 5}
            },
            "required": ["query"]
        }
    },
    {
        "name": "execute_python",
        "description": "Execute Python code in a sandboxed environment",
        "input_schema": {
            "type": "object",
            "properties": {
                "code": {"type": "string"},
                "timeout_seconds": {"type": "integer", "default": 10}
            },
            "required": ["code"]
        }
    }
]

Cost Cap Enforcement

Cost enforcement must be proactive (check before each LLM call) and global (count tokens across all sub-agents sharing a parent task budget). A naively implemented cap that only checks after the call is complete will let one expensive completion overshoot the cap by 50-100K tokens.

import redis

r = redis.Redis()

def check_and_reserve_tokens(task_id: str, estimated_tokens: int, hard_cap: int) -> bool:
    """
    Atomically check if adding estimated_tokens would exceed hard_cap.
    Uses Redis INCRBY with a Lua script for atomicity.
    Returns True if reservation succeeded, False if cap would be exceeded.
    """
    lua_script = """
    local current = tonumber(redis.call('GET', KEYS[1])) or 0
    local requested = tonumber(ARGV[1])
    local cap = tonumber(ARGV[2])
    if current + requested > cap then
        return 0
    end
    redis.call('INCRBY', KEYS[1], requested)
    return 1
    """
    key = f"token_budget:{task_id}"
    result = r.eval(lua_script, 1, key, estimated_tokens, hard_cap)
    return result == 1

def release_token_reservation(task_id: str, actual_tokens: int, estimated_tokens: int) -> None:
    """
    After the LLM call, replace the estimate with the actual usage.
    """
    delta = actual_tokens - estimated_tokens
    if delta != 0:
        r.incrby(f"token_budget:{task_id}", delta)

Sub-agents inherit the parent task’s task_id for token accounting. When a sub-agent calls check_and_reserve_tokens, it checks the same Redis key as the parent. This ensures the combined token usage across the entire agent tree stays within the configured budget.

Key Insight

Use an optimistic reservation approach: estimate token usage before the LLM call (based on conversation history length), reserve that estimate, then reconcile with actual usage after the call. This prevents overshoot while allowing forward progress even when estimates are slightly off.

Parallel Sub-Agent Spawning

Long tasks decompose naturally into parallelizable sub-tasks. A research task might split into “search for competitor A pricing,” “search for competitor B pricing,” and “search for competitor C pricing” - three independent web searches that can run simultaneously.

import asyncio
from dataclasses import dataclass

@dataclass
class SubAgentTask:
    parent_task_id: str
    sub_task_id: str
    goal: str
    tool_allowlist: list
    max_tokens: int  # fraction of parent budget

async def spawn_parallel_sub_agents(
    parent_task_id: str,
    sub_goals: list[str],
    tool_allowlist: list,
    per_sub_budget: int,
    db,
    queue
) -> list[str]:
    """
    Returns list of sub_task_ids. Parent waits on all of them.
    """
    sub_task_ids = []
    for goal in sub_goals:
        sub_task = SubAgentTask(
            parent_task_id=parent_task_id,
            sub_task_id=str(uuid.uuid4()),
            goal=goal,
            tool_allowlist=tool_allowlist,
            max_tokens=per_sub_budget
        )
        db.execute(
            "INSERT INTO tasks (id, parent_id, goal, config, status) VALUES (%s, %s, %s, %s, 'pending')",
            (sub_task.sub_task_id, parent_task_id, goal, json.dumps(vars(sub_task)))
        )
        queue.enqueue(sub_task.sub_task_id, priority="normal")
        sub_task_ids.append(sub_task.sub_task_id)

    return sub_task_ids

async def wait_for_sub_agents(
    sub_task_ids: list[str],
    timeout_seconds: int,
    db
) -> dict[str, str]:
    """
    Poll for sub-agent completion. Returns {sub_task_id: result}.
    """
    results = {}
    deadline = asyncio.get_event_loop().time() + timeout_seconds
    pending = set(sub_task_ids)

    while pending and asyncio.get_event_loop().time() < deadline:
        for task_id in list(pending):
            status, result = db.fetchone(
                "SELECT status, result FROM tasks WHERE id = %s", (task_id,)
            )
            if status in ("completed", "failed", "cost_exceeded"):
                results[task_id] = result or f"[{status}]"
                pending.remove(task_id)
        if pending:
            await asyncio.sleep(0.5)

    for task_id in pending:
        results[task_id] = "[timeout]"
    return results

Watch Out

Unlimited sub-agent spawning creates a token amplification attack: a task with a 100K token budget could spawn 100 sub-agents each reserving 100K tokens, consuming 10M tokens before any cap fires. Enforce a maximum sub-agent depth (typically 2-3 levels) and a maximum fan-out per level. Track total budget allocation across all live sub-agents at spawn time, not just actual usage.

Agent Tracing

Every LLM call, tool call, and state transition must be recorded for debugging, billing, and safety auditing. Use OpenTelemetry spans with structured attributes:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider

tracer = trace.get_tracer("agent-worker")

def traced_llm_call(messages: list, task_id: str, step_index: int) -> dict:
    with tracer.start_as_current_span("llm.call") as span:
        span.set_attribute("task.id", task_id)
        span.set_attribute("agent.step", step_index)
        span.set_attribute("llm.model", "claude-opus-4-8")
        span.set_attribute("llm.input_tokens_estimate", estimate_tokens(messages))

        response = client.messages.create(
            model="claude-opus-4-8",
            max_tokens=4096,
            messages=messages,
        )

        span.set_attribute("llm.input_tokens", response.usage.input_tokens)
        span.set_attribute("llm.output_tokens", response.usage.output_tokens)
        span.set_attribute("llm.stop_reason", response.stop_reason)
        return response

def traced_tool_call(tool_name: str, tool_input: dict, task_id: str) -> str:
    with tracer.start_as_current_span(f"tool.{tool_name}") as span:
        span.set_attribute("task.id", task_id)
        span.set_attribute("tool.name", tool_name)
        span.set_attribute("tool.input", json.dumps(tool_input)[:1000])

        result = execute_tool_with_isolation(tool_name, tool_input, task_id, TOOL_REGISTRY)

        span.set_attribute("tool.output_length", len(result))
        span.set_attribute("tool.success", not result.startswith("Error:"))
        return result

Data Model

-- Task registry
CREATE TABLE tasks (
  id           UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  parent_id    UUID REFERENCES tasks(id),
  goal         TEXT NOT NULL,
  config       JSONB NOT NULL,          -- max_tokens, max_steps, tool_allowlist
  status       TEXT NOT NULL DEFAULT 'pending',
  result       TEXT,
  error        TEXT,
  created_at   TIMESTAMPTZ NOT NULL DEFAULT now(),
  started_at   TIMESTAMPTZ,
  completed_at TIMESTAMPTZ,
  depth        INT NOT NULL DEFAULT 0   -- sub-agent nesting depth
);

-- Durable checkpoint per task (one row per task, upserted on each step)
CREATE TABLE task_checkpoints (
  task_id     UUID PRIMARY KEY REFERENCES tasks(id) ON DELETE CASCADE,
  step_index  INT NOT NULL DEFAULT 0,
  state_json  JSONB NOT NULL,           -- messages + tool_call_records
  tokens_used INT NOT NULL DEFAULT 0,
  status      TEXT NOT NULL DEFAULT 'running',
  worker_id   TEXT,
  updated_at  TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- Token usage ledger for billing
CREATE TABLE token_usage (
  id          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  task_id     UUID NOT NULL REFERENCES tasks(id),
  root_task_id UUID NOT NULL,           -- top-level task for billing aggregation
  step_index  INT NOT NULL,
  model       TEXT NOT NULL,
  input_tokens  INT NOT NULL,
  output_tokens INT NOT NULL,
  recorded_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE INDEX token_usage_root_idx ON token_usage (root_task_id);
CREATE INDEX tasks_parent_idx ON tasks (parent_id);
CREATE INDEX tasks_status_idx ON tasks (status) WHERE status IN ('pending', 'running');

The root_task_id on token_usage makes billing aggregation a single indexed query: SELECT SUM(input_tokens + output_tokens) FROM token_usage WHERE root_task_id = $1.

Key Algorithms and Protocols

Visibility Timeout (Queue Lease)

The task queue uses a visibility timeout pattern: when a worker dequeues a task, the task becomes invisible to other workers for visibility_timeout seconds. If the worker crashes without completing or extending the timeout, the task becomes visible again and another worker can pick it up.

VISIBILITY_TIMEOUT = 120  # seconds

def heartbeat_loop(task_id: str, queue, stop_event):
    """
    Extend the queue message visibility every 60s to prevent task re-queuing
    while the worker is actively processing.
    """
    receipt = queue.current_receipt(task_id)
    while not stop_event.is_set():
        queue.extend_visibility(receipt, VISIBILITY_TIMEOUT)
        stop_event.wait(timeout=60)  # extend every 60s

Failure Recovery Protocol

1. Worker W1 dequeues task T at time=0
2. W1 starts heartbeat loop, begins execution
3. At step=8, W1 crashes (pod evicted)
4. Heartbeat stops; visibility timeout expires at time=120
5. Worker W2 dequeues task T (same message, same receipt)
6. W2 calls load_checkpoint(T): gets step_index=8, messages=[...8 steps...]
7. W2 continues from step 9, not step 0
8. W2 completes task, calls queue.delete(receipt)

Key Insight

The checkpoint and the queue message are separate durability domains: the checkpoint lives in Postgres (durable, transactional) and the queue message lives in Redis/SQS (delivery guarantee only). The worker reads the queue message to know which task to pick up, but reads Postgres to know where to resume from.

Scaling and Performance

Worker pool scaling and priority queue architecture

Workers are stateless and scale horizontally. The scale signal is queue depth per priority tier - not CPU or memory. Use a dedicated worker pool per priority tier to prevent high-priority tasks from waiting behind a batch workload.

Given:
  - 10,000 concurrent active tasks
  - Average task duration: 2 minutes
  - Average 10 LLM calls per task, each 2,000 tokens in + 500 out

Token throughput: 10,000 tasks * 10 calls * 2,500 tokens = 250M tokens active
LLM calls/second: 10,000 tasks * 10 calls / 120s = ~833 LLM calls/s
Checkpoint writes/second: 833 * 2 = 1,666 writes/s to Postgres

Postgres checkpoint load:
  - Row size: ~100KB per checkpoint (20 messages * 5KB avg)
  - 1,666 writes/s * 100KB = ~160MB/s write throughput
  - Solution: use a write-heavy Postgres instance (io2 EBS) or partition by task_id

Queue throughput:
  - 10,000 active tasks, avg 120s duration -> 83 task completions/second
  - Redis throughput: easily handles 10K+ operations/second per node

The Postgres checkpoint table is the system’s write bottleneck. Partition by task_id hash (8-16 partitions) to distribute write load. Alternatively, use a document store (DynamoDB, MongoDB) where task_id is the partition key - single-item writes are O(1) regardless of table size.

Real World

Temporal.io is an open-source workflow orchestration system that implements exactly this pattern - durable task execution with automatic replay from the last successful step. Many AI agent frameworks (LangGraph, Inngest) build on similar ideas: write-ahead event logs with deterministic replay semantics.

Failure Modes and Recovery

Agent task state machine with transitions

Failure	Detection	Impact	Recovery
Worker pod eviction	Visibility timeout expires (120s)	Task stalls up to 120s	Re-queued automatically; resumes from last checkpoint
Checkpoint write failure	Exception in save_checkpoint	Task may lose one step	Retry checkpoint write with backoff; halt task if 3 consecutive failures
LLM API rate limit	429 response	Task paused mid-step	Exponential backoff with jitter; max 5 retries before marking failed
Tool execution timeout	TimeoutError after 30s	Single tool step fails	Return error string to agent; agent decides whether to retry
Cost cap exceeded	check_and_reserve_tokens returns False	Task halted	Status set to cost_exceeded; partial result stored; user notified
Sub-agent timeout	wait_for_sub_agents deadline passed	Missing sub-results	Parent continues with available results; marks missing as [timeout]

Watch Out

The most dangerous failure is a “zombie worker”: a worker that lost its queue heartbeat lease (due to network partition) but is still executing and checkpointing. Worker W1 and worker W2 can now both hold checkpoints for the same task and both make LLM calls against its token budget. Fix this with a worker_id claim in the checkpoint: the checkpoint includes the claiming worker’s ID, and a worker always verifies it still holds the claim before writing.

Comparison of Approaches

Approach	Durability	Resume Granularity	Complexity	Best Fit
Stateless HTTP with retry	None	Re-run from start	Low	Sub-5s tasks only
Queue + full state in message	Low (msg size limit)	Per-message (coarse)	Low	Simple pipelines, no sub-agents
Queue + checkpoint in DB	High	Per-step (fine)	Medium	General agent workloads
Event sourcing (Temporal/Inngest)	Very high	Per-event (finest)	High	Complex multi-agent, auditable
Actor model (Erlang/Akka)	High	Per-message	Very high	Telecom / streaming agents

For most enterprise AI agent deployments, the queue + checkpoint in DB approach is the right tradeoff. It handles the durability requirements with familiar tooling (Postgres + Redis), avoids the operational complexity of Temporal, and scales to 10,000+ concurrent tasks on commodity infrastructure.

Key Takeaways

Agent tasks are jobs, not requests: design the execution model around queues, workers, and checkpoints rather than HTTP handlers.
Checkpoint before tool execution: write the intent before the action - this is what makes resumption safe for non-idempotent tool calls.
Cost caps must be atomic: use Redis Lua scripts or database transactions to check-and-increment the token counter; non-atomic implementations will overshoot under concurrent sub-agents.
Sub-agent depth limits prevent token amplification: a budget of 100K tokens should not be recursively assignable to 100 sub-agents each inheriting 100K tokens.
Visibility timeout is your crash detector: set it 2x your expected step duration and run a heartbeat extension loop to prevent false positives during normal long tool calls.
The trace is a first-class artifact: every LLM call and tool call needs a span with input/output captured - it is the primary debugging surface for “why did the agent do that.”
Worker claim IDs prevent zombie execution: the checkpoint must include the ID of the worker holding the execution lease to detect and resolve split-brain worker scenarios.

The non-obvious lesson is that the hard part of building a reliable agent framework is not the AI part - it is the distributed systems part. The same problems that make distributed databases hard (crash recovery, exactly-once semantics, concurrent writers, cost attribution) apply to agent task queues, just wrapped in a new API surface.

Frequently Asked Questions

Q: Why not just use a serverless function with a long timeout instead of a dedicated queue? A: Serverless functions (Lambda, Cloud Run) have maximum execution time limits (15 minutes for Lambda) and no native mechanism for resuming from a checkpoint. They work for short agent tasks but fail for tasks that take 30+ minutes or need to pause for human approval. Queue + worker gives you unlimited task duration, explicit resumability, and per-priority scaling that serverless cannot match.

Q: How do you handle the case where a tool call has side effects (sending an email, creating a ticket)? A: Use idempotency keys derived from the tool_use_id provided by the LLM API. Before executing a write-side-effect tool, check if that idempotency key has already been used (stored in Redis with a 24-hour TTL). If it has, return the cached result without re-executing. The LLM generates stable tool call IDs within a session, making them reliable idempotency keys.

Q: How do you prevent one tenant’s runaway agent from consuming all shared token budget? A: Implement per-tenant token rate limits enforced at the token counter level. Use a Redis key per (tenant_id, time_window) and apply sliding window rate limiting before allowing token reservation. Set separate per-task hard caps and per-tenant-per-hour soft caps. Alert operations when a tenant approaches 80% of their hourly cap.

Q: Why store conversation history in the checkpoint rather than in the LLM provider’s conversation API? A: Provider-hosted conversation state (OpenAI Threads, Claude Projects) creates vendor lock-in and has their own rate limits and pricing. Owning the conversation history gives you: portability across LLM providers, the ability to truncate and summarize old messages (critical for very long tasks), and full audibility without depending on a vendor’s logging API.

Q: How do you handle tasks that need human approval mid-execution? A: Add a requires_approval tool that the LLM can call when it reaches a decision point needing human judgment. Calling this tool transitions the task to the PAUSED state and sends a notification (email, Slack) with the decision context. The task stays paused until a human calls POST /tasks/{id}/resume with an approval or rejection. The worker’s next heartbeat extension fails (task is PAUSED), causing it to gracefully exit. On approval, the task re-enters the queue with the human’s decision appended to the conversation history.

Q: What is the right checkpoint granularity - per LLM call, per tool call, or per multi-step segment? A: Per tool call is the right granularity. Per LLM call is too coarse - you lose the tool intent if the worker crashes before the tool executes. Per multi-step segment is also too coarse and increases retry blast radius. The overhead of writing a Postgres row per tool call is negligible (microseconds) compared to the cost of re-running multiple LLM calls from an older checkpoint.

Interview Questions

Q: Walk me through what happens when a worker pod running an agent task is evicted mid-execution.

Expected depth: Describe the visibility timeout mechanism, checkpoint durability guarantee, and how the next worker reads the checkpoint to determine the resume point. Discuss the zombie worker risk and how worker_id claims in checkpoints resolve it. Address the case where the eviction happens between writing the tool intent checkpoint and receiving the tool result.

Q: How would you enforce a $10 per-task cost cap without allowing overshoot by more than $0.10?

Expected depth: Explain the check-and-reserve pattern using Redis atomic operations. Discuss why non-atomic check-then-increment fails under concurrent sub-agents. Describe the optimistic reservation approach (estimate before call, reconcile after). Calculate the maximum overshoot given average completion token sizes.

Q: Design the token accounting system for a task that spawns 5 parallel sub-agents, each of which spawns 2 more sub-agents.

Expected depth: Explain the root_task_id column for billing aggregation. Discuss budget allocation at spawn time (each sub-agent gets 1/N of remaining parent budget). Describe how a sub-agent at depth 2 checks the root task’s Redis counter. Address the race condition where two sub-agents simultaneously check a counter that is close to the cap.

Q: How would you implement “replay” - letting a user re-run a failed task from step 5 with a different LLM?

Expected depth: Describe loading the checkpoint at step 5, slicing the conversation history to that point, updating the model configuration, and re-queueing the task. Discuss the challenge of replaying tool calls that had side effects (idempotency key collision). Explain why the trace store (not the checkpoint) is needed for perfect replay - the checkpoint only stores what the agent “saw,” not full timing and metadata.

Q: How does your architecture handle a sub-agent that exceeds its token budget while the parent is waiting for its result?

Expected depth: Sub-agent status transitions to cost_exceeded and stores a partial result. Parent’s wait_for_sub_agents receives the [cost_exceeded] marker as the result. Parent LLM sees this in the tool result and decides whether to proceed with partial data, request a re-run with more budget, or fail the overall task. Budget escalation flow: user can approve additional budget via the approval endpoint, which increments the Redis counter.

Premium Content

Unlock the full article along with everything else in the archive — all in one place.

In-depth analysis Expert insights Full archive access

Unlock Full Article