Build an AI Agent Task Queue with Resumability
distributed-systems reliability scalability
System Design Deep Dive
AI Agent Task Queue
When agents run for minutes, not milliseconds, every failure must be recoverable from where it stopped
A user asks an AI agent: “Research our competitor’s pricing, summarize their feature differences, and draft a competitive analysis document.” This is not a 200ms API call. It involves 10-20 tool invocations, multiple LLM completions, file reads, web searches, and intermediate reasoning - spanning 30 seconds to 5 minutes. Midway through step 14 of 20, your Kubernetes pod gets evicted during a node rotation. Without a recovery mechanism, the user waits 5 minutes, gets no result, and the task never completes.
Traditional request-response HTTP is the wrong model for agent workloads. Think of it like the difference between asking a barista for a coffee (30 seconds, synchronous is fine) and commissioning a contractor to renovate your kitchen (weeks, they need to be able to pick up where they left off after a weekend). Agent tasks are contracting work: long-lived, multi-step, interruptible, and expensive enough that you cannot afford to re-run from scratch.
The engineering challenge is threefold: agent state persistence (what has the agent done so far, what messages have been exchanged, what tool results are cached), resumable execution (if a worker dies at step 14, restart from step 14 not step 1), and cost cap enforcement (prevent a runaway agent from spending $500 in one task before anyone notices). Add parallel sub-agent execution for decomposable tasks, and you have a non-trivial distributed systems problem wrapped in an AI infrastructure problem.
We need to solve for durability, resumability, cost control, and parallel execution simultaneously, without making the developer API complex.
Requirements and Constraints
Functional Requirements
- Accept task submissions with a goal, constraints (max tokens, max steps, timeout), and optional tool allowlist
- Execute agent tasks asynchronously, returning a
task_idimmediately - Persist full agent state (conversation history, tool call results, step index) between execution steps
- Resume task execution from the last successful checkpoint on worker failure or restart
- Support spawning parallel sub-agents for decomposable sub-tasks
- Enforce per-task token budget and step count caps with graceful shutdown
- Provide real-time task status and result retrieval
- Produce a complete execution trace for debugging and billing
Non-Functional Requirements
- Task pickup latency: p99 under 5 seconds from submission
- Resumption latency: p99 under 10 seconds after worker failure
- Concurrent active tasks: 10,000 across all workers
- Maximum task duration: 24 hours
- Checkpoint durability: no more than 1 step of work lost on crash
- Token accounting accuracy: within 1% of actual LLM usage
- Trace completeness: 100% of tool calls and LLM calls captured
Constraints and Assumptions
- Workers are stateless containers (Kubernetes pods) that can be killed anytime
- The LLM API is external (OpenAI, Anthropic) - calls are not idempotent by default
- Sub-agents share the parent task’s token budget
- No human-in-the-loop required for tool approval (configurable but off by default)
High-Level Architecture
The system has four logical layers. The orchestration layer contains the task queue (Redis or SQS), the orchestrator process (which decomposes goals and spawns sub-agents), the state store (durable checkpoint storage), and the cost tracker (token ledger). The agent worker pool is a fleet of stateless containers, each capable of executing any agent task, scaling horizontally based on queue depth. The tool execution layer provides sandboxed access to external capabilities: web search, code execution, file I/O, and external APIs. The trace layer captures every LLM call, tool call, and state transition with full context.
When a task arrives, it enters the queue with a priority level. A worker dequeues it, creates the initial checkpoint, and begins the agent execution loop: call the LLM, receive a tool call or final answer, execute the tool (if any), checkpoint the state, repeat. If the worker dies, the task returns to the queue with its checkpoint pointer intact. The next worker to pick it up reads the checkpoint and continues from the last saved step.
Agent resumability is fundamentally a checkpoint-before-commit pattern: write the checkpoint to durable storage before executing each tool call. If the worker dies after writing the checkpoint but before completing the tool call, the tool call is re-executed with an idempotency key - the result may be a cache hit rather than a new call.
Agent State Persistence
Agent state is everything the agent has “seen” since the task started: the full conversation history (system prompt, user message, all LLM completions, all tool results), the current step index, the accumulated token count, and metadata about each tool call.
import json
from dataclasses import dataclass, field
from typing import List, Optional, Dict, Any
@dataclass
class ToolCallRecord:
call_id: str
tool_name: str
input_args: Dict[str, Any]
output: Optional[str] = None
status: str = "pending" # pending | completed | failed
tokens_used: int = 0
@dataclass
class AgentCheckpoint:
task_id: str
step_index: int
messages: List[Dict] # full conversation history
tool_call_records: List[ToolCallRecord]
tokens_used: int
status: str # running | paused | completed | failed | cost_exceeded
created_at: float
worker_id: Optional[str] = None
def save_checkpoint(checkpoint: AgentCheckpoint, db) -> None:
db.execute(
"""
INSERT INTO task_checkpoints (task_id, step_index, state_json, tokens_used, status, updated_at)
VALUES (%s, %s, %s, %s, %s, NOW())
ON CONFLICT (task_id) DO UPDATE SET
step_index = EXCLUDED.step_index,
state_json = EXCLUDED.state_json,
tokens_used = EXCLUDED.tokens_used,
status = EXCLUDED.status,
updated_at = EXCLUDED.updated_at
""",
(
checkpoint.task_id,
checkpoint.step_index,
json.dumps({
"messages": checkpoint.messages,
"tool_call_records": [vars(r) for r in checkpoint.tool_call_records],
}),
checkpoint.tokens_used,
checkpoint.status,
)
)
The ON CONFLICT DO UPDATE pattern is critical: checkpoints are upserted, not inserted, so a crash between two checkpoint writes cannot leave two conflicting rows.
Never checkpoint after the LLM call but before the tool call completes. If the worker crashes between those two events, on resume the agent will see the LLM’s tool request in conversation history but no tool result - it will re-request the same tool call, which is correct behavior only if the tool call was idempotent. Write-destructive tool calls (sending emails, making payments) must use idempotency keys checked before execution.
Resumable Execution
The execution loop is the heart of the agent worker. It must be restartable from any checkpoint and handle the state machine transitions correctly.
import anthropic
from typing import Optional
client = anthropic.Anthropic()
def execute_agent_task(task_id: str, db, tool_registry) -> None:
checkpoint = load_checkpoint(task_id, db)
if checkpoint is None:
checkpoint = AgentCheckpoint(
task_id=task_id, step_index=0, messages=load_task_messages(task_id, db),
tool_call_records=[], tokens_used=0, status="running", created_at=time.time()
)
task_config = load_task_config(task_id, db)
max_tokens = task_config["max_tokens"]
max_steps = task_config["max_steps"]
while checkpoint.step_index < max_steps:
# Check cost cap before calling LLM
if checkpoint.tokens_used >= max_tokens:
checkpoint.status = "cost_exceeded"
save_checkpoint(checkpoint, db)
return
# Call LLM
response = client.messages.create(
model="claude-opus-4-8",
max_tokens=4096,
messages=checkpoint.messages,
tools=tool_registry.get_tool_definitions(),
)
# Account tokens immediately
checkpoint.tokens_used += response.usage.input_tokens + response.usage.output_tokens
checkpoint.step_index += 1
# Append assistant message to conversation
checkpoint.messages.append({"role": "assistant", "content": response.content})
if response.stop_reason == "end_turn":
checkpoint.status = "completed"
save_checkpoint(checkpoint, db)
store_result(task_id, response.content, db)
return
if response.stop_reason == "tool_use":
for block in response.content:
if block.type != "tool_use":
continue
tool_record = ToolCallRecord(
call_id=block.id,
tool_name=block.name,
input_args=block.input
)
# Checkpoint BEFORE tool execution (idempotency safety)
checkpoint.tool_call_records.append(tool_record)
save_checkpoint(checkpoint, db)
# Check if we already have a result (resume case)
cached = find_tool_result(block.id, checkpoint.tool_call_records)
if cached is not None:
tool_output = cached
else:
tool_output = tool_registry.execute(block.name, block.input, idempotency_key=block.id)
tool_record.output = tool_output
tool_record.status = "completed"
save_checkpoint(checkpoint, db)
# Append tool result to conversation
checkpoint.messages.append({
"role": "user",
"content": [{"type": "tool_result", "tool_use_id": block.id, "content": tool_output}]
})
The loop checkpoints at two points per step: once before tool execution (recording the tool call intent) and once after (recording the result). On resume, find_tool_result checks if a previous execution already ran this tool call by its call_id - if so, it returns the cached result rather than re-executing.
Anthropic’s Claude API assigns stable tool_use_id values within a message, which makes them natural idempotency keys for re-execution detection. OpenAI’s Assistants API uses a similar “step” abstraction with persistent step_id values that survive across thread runs.
Tool Call Lifecycle
Tools are the agent’s actuators. Each tool invocation must be: isolated (one tool failure should not kill the agent), observable (full input/output captured in the trace), bounded (timeout enforced), and idempotent where possible.
import subprocess
import uuid
from concurrent.futures import ThreadPoolExecutor, TimeoutError
TOOL_TIMEOUT_SECONDS = 30
def execute_tool_with_isolation(
tool_name: str,
tool_input: dict,
idempotency_key: str,
tool_registry: dict
) -> str:
# Check idempotency cache
cached = idempotency_cache.get(idempotency_key)
if cached:
return cached
tool_fn = tool_registry.get(tool_name)
if tool_fn is None:
return f"Error: Unknown tool '{tool_name}'"
with ThreadPoolExecutor(max_workers=1) as executor:
future = executor.submit(tool_fn, **tool_input)
try:
result = future.result(timeout=TOOL_TIMEOUT_SECONDS)
except TimeoutError:
result = f"Error: Tool '{tool_name}' timed out after {TOOL_TIMEOUT_SECONDS}s"
except Exception as e:
result = f"Error: {type(e).__name__}: {str(e)}"
idempotency_cache.set(idempotency_key, result, ttl=3600)
return result
Code execution tools need stronger isolation: run in a container with no network access, CPU limits, and a process tree kill on timeout. The tool registry defines what each tool is allowed to do:
TOOL_DEFINITIONS = [
{
"name": "web_search",
"description": "Search the web for current information",
"input_schema": {
"type": "object",
"properties": {
"query": {"type": "string"},
"num_results": {"type": "integer", "default": 5}
},
"required": ["query"]
}
},
{
"name": "execute_python",
"description": "Execute Python code in a sandboxed environment",
"input_schema": {
"type": "object",
"properties": {
"code": {"type": "string"},
"timeout_seconds": {"type": "integer", "default": 10}
},
"required": ["code"]
}
}
]
Cost Cap Enforcement
Cost enforcement must be proactive (check before each LLM call) and global (count tokens across all sub-agents sharing a parent task budget). A naively implemented cap that only checks after the call is complete will let one expensive completion overshoot the cap by 50-100K tokens.
import redis
r = redis.Redis()
def check_and_reserve_tokens(task_id: str, estimated_tokens: int, hard_cap: int) -> bool:
"""
Atomically check if adding estimated_tokens would exceed hard_cap.
Uses Redis INCRBY with a Lua script for atomicity.
Returns True if reservation succeeded, False if cap would be exceeded.
"""
lua_script = """
local current = tonumber(redis.call('GET', KEYS[1])) or 0
local requested = tonumber(ARGV[1])
local cap = tonumber(ARGV[2])
if current + requested > cap then
return 0
end
redis.call('INCRBY', KEYS[1], requested)
return 1
"""
key = f"token_budget:{task_id}"
result = r.eval(lua_script, 1, key, estimated_tokens, hard_cap)
return result == 1
def release_token_reservation(task_id: str, actual_tokens: int, estimated_tokens: int) -> None:
"""
After the LLM call, replace the estimate with the actual usage.
"""
delta = actual_tokens - estimated_tokens
if delta != 0:
r.incrby(f"token_budget:{task_id}", delta)
Sub-agents inherit the parent task’s task_id for token accounting. When a sub-agent calls check_and_reserve_tokens, it checks the same Redis key as the parent. This ensures the combined token usage across the entire agent tree stays within the configured budget.
Use an optimistic reservation approach: estimate token usage before the LLM call (based on conversation history length), reserve that estimate, then reconcile with actual usage after the call. This prevents overshoot while allowing forward progress even when estimates are slightly off.
Parallel Sub-Agent Spawning
Long tasks decompose naturally into parallelizable sub-tasks. A research task might split into “search for competitor A pricing,” “search for competitor B pricing,” and “search for competitor C pricing” - three independent web searches that can run simultaneously.
import asyncio
from dataclasses import dataclass
@dataclass
class SubAgentTask:
parent_task_id: str
sub_task_id: str
goal: str
tool_allowlist: list
max_tokens: int # fraction of parent budget
async def spawn_parallel_sub_agents(
parent_task_id: str,
sub_goals: list[str],
tool_allowlist: list,
per_sub_budget: int,
db,
queue
) -> list[str]:
"""
Returns list of sub_task_ids. Parent waits on all of them.
"""
sub_task_ids = []
for goal in sub_goals:
sub_task = SubAgentTask(
parent_task_id=parent_task_id,
sub_task_id=str(uuid.uuid4()),
goal=goal,
tool_allowlist=tool_allowlist,
max_tokens=per_sub_budget
)
db.execute(
"INSERT INTO tasks (id, parent_id, goal, config, status) VALUES (%s, %s, %s, %s, 'pending')",
(sub_task.sub_task_id, parent_task_id, goal, json.dumps(vars(sub_task)))
)
queue.enqueue(sub_task.sub_task_id, priority="normal")
sub_task_ids.append(sub_task.sub_task_id)
return sub_task_ids
async def wait_for_sub_agents(
sub_task_ids: list[str],
timeout_seconds: int,
db
) -> dict[str, str]:
"""
Poll for sub-agent completion. Returns {sub_task_id: result}.
"""
results = {}
deadline = asyncio.get_event_loop().time() + timeout_seconds
pending = set(sub_task_ids)
while pending and asyncio.get_event_loop().time() < deadline:
for task_id in list(pending):
status, result = db.fetchone(
"SELECT status, result FROM tasks WHERE id = %s", (task_id,)
)
if status in ("completed", "failed", "cost_exceeded"):
results[task_id] = result or f"[{status}]"
pending.remove(task_id)
if pending:
await asyncio.sleep(0.5)
for task_id in pending:
results[task_id] = "[timeout]"
return results
Unlimited sub-agent spawning creates a token amplification attack: a task with a 100K token budget could spawn 100 sub-agents each reserving 100K tokens, consuming 10M tokens before any cap fires. Enforce a maximum sub-agent depth (typically 2-3 levels) and a maximum fan-out per level. Track total budget allocation across all live sub-agents at spawn time, not just actual usage.
Agent Tracing
Every LLM call, tool call, and state transition must be recorded for debugging, billing, and safety auditing. Use OpenTelemetry spans with structured attributes:
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
tracer = trace.get_tracer("agent-worker")
def traced_llm_call(messages: list, task_id: str, step_index: int) -> dict:
with tracer.start_as_current_span("llm.call") as span:
span.set_attribute("task.id", task_id)
span.set_attribute("agent.step", step_index)
span.set_attribute("llm.model", "claude-opus-4-8")
span.set_attribute("llm.input_tokens_estimate", estimate_tokens(messages))
response = client.messages.create(
model="claude-opus-4-8",
max_tokens=4096,
messages=messages,
)
span.set_attribute("llm.input_tokens", response.usage.input_tokens)
span.set_attribute("llm.output_tokens", response.usage.output_tokens)
span.set_attribute("llm.stop_reason", response.stop_reason)
return response
def traced_tool_call(tool_name: str, tool_input: dict, task_id: str) -> str:
with tracer.start_as_current_span(f"tool.{tool_name}") as span:
span.set_attribute("task.id", task_id)
span.set_attribute("tool.name", tool_name)
span.set_attribute("tool.input", json.dumps(tool_input)[:1000])
result = execute_tool_with_isolation(tool_name, tool_input, task_id, TOOL_REGISTRY)
span.set_attribute("tool.output_length", len(result))
span.set_attribute("tool.success", not result.startswith("Error:"))
return result
Data Model
-- Task registry
CREATE TABLE tasks (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
parent_id UUID REFERENCES tasks(id),
goal TEXT NOT NULL,
config JSONB NOT NULL, -- max_tokens, max_steps, tool_allowlist
status TEXT NOT NULL DEFAULT 'pending',
result TEXT,
error TEXT,
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
started_at TIMESTAMPTZ,
completed_at TIMESTAMPTZ,
depth INT NOT NULL DEFAULT 0 -- sub-agent nesting depth
);
-- Durable checkpoint per task (one row per task, upserted on each step)
CREATE TABLE task_checkpoints (
task_id UUID PRIMARY KEY REFERENCES tasks(id) ON DELETE CASCADE,
step_index INT NOT NULL DEFAULT 0,
state_json JSONB NOT NULL, -- messages + tool_call_records
tokens_used INT NOT NULL DEFAULT 0,
status TEXT NOT NULL DEFAULT 'running',
worker_id TEXT,
updated_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
-- Token usage ledger for billing
CREATE TABLE token_usage (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
task_id UUID NOT NULL REFERENCES tasks(id),
root_task_id UUID NOT NULL, -- top-level task for billing aggregation
step_index INT NOT NULL,
model TEXT NOT NULL,
input_tokens INT NOT NULL,
output_tokens INT NOT NULL,
recorded_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE INDEX token_usage_root_idx ON token_usage (root_task_id);
CREATE INDEX tasks_parent_idx ON tasks (parent_id);
CREATE INDEX tasks_status_idx ON tasks (status) WHERE status IN ('pending', 'running');
The root_task_id on token_usage makes billing aggregation a single indexed query: SELECT SUM(input_tokens + output_tokens) FROM token_usage WHERE root_task_id = $1.
Key Algorithms and Protocols
Visibility Timeout (Queue Lease)
The task queue uses a visibility timeout pattern: when a worker dequeues a task, the task becomes invisible to other workers for visibility_timeout seconds. If the worker crashes without completing or extending the timeout, the task becomes visible again and another worker can pick it up.
VISIBILITY_TIMEOUT = 120 # seconds
def heartbeat_loop(task_id: str, queue, stop_event):
"""
Extend the queue message visibility every 60s to prevent task re-queuing
while the worker is actively processing.
"""
receipt = queue.current_receipt(task_id)
while not stop_event.is_set():
queue.extend_visibility(receipt, VISIBILITY_TIMEOUT)
stop_event.wait(timeout=60) # extend every 60s
Failure Recovery Protocol
1. Worker W1 dequeues task T at time=0
2. W1 starts heartbeat loop, begins execution
3. At step=8, W1 crashes (pod evicted)
4. Heartbeat stops; visibility timeout expires at time=120
5. Worker W2 dequeues task T (same message, same receipt)
6. W2 calls load_checkpoint(T): gets step_index=8, messages=[...8 steps...]
7. W2 continues from step 9, not step 0
8. W2 completes task, calls queue.delete(receipt)
The checkpoint and the queue message are separate durability domains: the checkpoint lives in Postgres (durable, transactional) and the queue message lives in Redis/SQS (delivery guarantee only). The worker reads the queue message to know which task to pick up, but reads Postgres to know where to resume from.
Scaling and Performance
Workers are stateless and scale horizontally. The scale signal is queue depth per priority tier - not CPU or memory. Use a dedicated worker pool per priority tier to prevent high-priority tasks from waiting behind a batch workload.
Given:
- 10,000 concurrent active tasks
- Average task duration: 2 minutes
- Average 10 LLM calls per task, each 2,000 tokens in + 500 out
Token throughput: 10,000 tasks * 10 calls * 2,500 tokens = 250M tokens active
LLM calls/second: 10,000 tasks * 10 calls / 120s = ~833 LLM calls/s
Checkpoint writes/second: 833 * 2 = 1,666 writes/s to Postgres
Postgres checkpoint load:
- Row size: ~100KB per checkpoint (20 messages * 5KB avg)
- 1,666 writes/s * 100KB = ~160MB/s write throughput
- Solution: use a write-heavy Postgres instance (io2 EBS) or partition by task_id
Queue throughput:
- 10,000 active tasks, avg 120s duration -> 83 task completions/second
- Redis throughput: easily handles 10K+ operations/second per node
The Postgres checkpoint table is the system’s write bottleneck. Partition by task_id hash (8-16 partitions) to distribute write load. Alternatively, use a document store (DynamoDB, MongoDB) where task_id is the partition key - single-item writes are O(1) regardless of table size.
Temporal.io is an open-source workflow orchestration system that implements exactly this pattern - durable task execution with automatic replay from the last successful step. Many AI agent frameworks (LangGraph, Inngest) build on similar ideas: write-ahead event logs with deterministic replay semantics.
Failure Modes and Recovery
| Failure | Detection | Impact | Recovery |
|---|---|---|---|
| Worker pod eviction | Visibility timeout expires (120s) | Task stalls up to 120s | Re-queued automatically; resumes from last checkpoint |
| Checkpoint write failure | Exception in save_checkpoint | Task may lose one step | Retry checkpoint write with backoff; halt task if 3 consecutive failures |
| LLM API rate limit | 429 response | Task paused mid-step | Exponential backoff with jitter; max 5 retries before marking failed |
| Tool execution timeout | TimeoutError after 30s | Single tool step fails | Return error string to agent; agent decides whether to retry |
| Cost cap exceeded | check_and_reserve_tokens returns False | Task halted | Status set to cost_exceeded; partial result stored; user notified |
| Sub-agent timeout | wait_for_sub_agents deadline passed | Missing sub-results | Parent continues with available results; marks missing as [timeout] |
The most dangerous failure is a “zombie worker”: a worker that lost its queue heartbeat lease (due to network partition) but is still executing and checkpointing. Worker W1 and worker W2 can now both hold checkpoints for the same task and both make LLM calls against its token budget. Fix this with a worker_id claim in the checkpoint: the checkpoint includes the claiming worker’s ID, and a worker always verifies it still holds the claim before writing.
Comparison of Approaches
| Approach | Durability | Resume Granularity | Complexity | Best Fit |
|---|---|---|---|---|
| Stateless HTTP with retry | None | Re-run from start | Low | Sub-5s tasks only |
| Queue + full state in message | Low (msg size limit) | Per-message (coarse) | Low | Simple pipelines, no sub-agents |
| Queue + checkpoint in DB | High | Per-step (fine) | Medium | General agent workloads |
| Event sourcing (Temporal/Inngest) | Very high | Per-event (finest) | High | Complex multi-agent, auditable |
| Actor model (Erlang/Akka) | High | Per-message | Very high | Telecom / streaming agents |
For most enterprise AI agent deployments, the queue + checkpoint in DB approach is the right tradeoff. It handles the durability requirements with familiar tooling (Postgres + Redis), avoids the operational complexity of Temporal, and scales to 10,000+ concurrent tasks on commodity infrastructure.
Key Takeaways
- Agent tasks are jobs, not requests: design the execution model around queues, workers, and checkpoints rather than HTTP handlers.
- Checkpoint before tool execution: write the intent before the action - this is what makes resumption safe for non-idempotent tool calls.
- Cost caps must be atomic: use Redis Lua scripts or database transactions to check-and-increment the token counter; non-atomic implementations will overshoot under concurrent sub-agents.
- Sub-agent depth limits prevent token amplification: a budget of 100K tokens should not be recursively assignable to 100 sub-agents each inheriting 100K tokens.
- Visibility timeout is your crash detector: set it 2x your expected step duration and run a heartbeat extension loop to prevent false positives during normal long tool calls.
- The trace is a first-class artifact: every LLM call and tool call needs a span with input/output captured - it is the primary debugging surface for “why did the agent do that.”
- Worker claim IDs prevent zombie execution: the checkpoint must include the ID of the worker holding the execution lease to detect and resolve split-brain worker scenarios.
The non-obvious lesson is that the hard part of building a reliable agent framework is not the AI part - it is the distributed systems part. The same problems that make distributed databases hard (crash recovery, exactly-once semantics, concurrent writers, cost attribution) apply to agent task queues, just wrapped in a new API surface.
Frequently Asked Questions
Q: Why not just use a serverless function with a long timeout instead of a dedicated queue? A: Serverless functions (Lambda, Cloud Run) have maximum execution time limits (15 minutes for Lambda) and no native mechanism for resuming from a checkpoint. They work for short agent tasks but fail for tasks that take 30+ minutes or need to pause for human approval. Queue + worker gives you unlimited task duration, explicit resumability, and per-priority scaling that serverless cannot match.
Q: How do you handle the case where a tool call has side effects (sending an email, creating a ticket)?
A: Use idempotency keys derived from the tool_use_id provided by the LLM API. Before executing a write-side-effect tool, check if that idempotency key has already been used (stored in Redis with a 24-hour TTL). If it has, return the cached result without re-executing. The LLM generates stable tool call IDs within a session, making them reliable idempotency keys.
Q: How do you prevent one tenant’s runaway agent from consuming all shared token budget?
A: Implement per-tenant token rate limits enforced at the token counter level. Use a Redis key per (tenant_id, time_window) and apply sliding window rate limiting before allowing token reservation. Set separate per-task hard caps and per-tenant-per-hour soft caps. Alert operations when a tenant approaches 80% of their hourly cap.
Q: Why store conversation history in the checkpoint rather than in the LLM provider’s conversation API? A: Provider-hosted conversation state (OpenAI Threads, Claude Projects) creates vendor lock-in and has their own rate limits and pricing. Owning the conversation history gives you: portability across LLM providers, the ability to truncate and summarize old messages (critical for very long tasks), and full audibility without depending on a vendor’s logging API.
Q: How do you handle tasks that need human approval mid-execution?
A: Add a requires_approval tool that the LLM can call when it reaches a decision point needing human judgment. Calling this tool transitions the task to the PAUSED state and sends a notification (email, Slack) with the decision context. The task stays paused until a human calls POST /tasks/{id}/resume with an approval or rejection. The worker’s next heartbeat extension fails (task is PAUSED), causing it to gracefully exit. On approval, the task re-enters the queue with the human’s decision appended to the conversation history.
Q: What is the right checkpoint granularity - per LLM call, per tool call, or per multi-step segment? A: Per tool call is the right granularity. Per LLM call is too coarse - you lose the tool intent if the worker crashes before the tool executes. Per multi-step segment is also too coarse and increases retry blast radius. The overhead of writing a Postgres row per tool call is negligible (microseconds) compared to the cost of re-running multiple LLM calls from an older checkpoint.
Interview Questions
Q: Walk me through what happens when a worker pod running an agent task is evicted mid-execution.
Expected depth: Describe the visibility timeout mechanism, checkpoint durability guarantee, and how the next worker reads the checkpoint to determine the resume point. Discuss the zombie worker risk and how worker_id claims in checkpoints resolve it. Address the case where the eviction happens between writing the tool intent checkpoint and receiving the tool result.
Q: How would you enforce a $10 per-task cost cap without allowing overshoot by more than $0.10?
Expected depth: Explain the check-and-reserve pattern using Redis atomic operations. Discuss why non-atomic check-then-increment fails under concurrent sub-agents. Describe the optimistic reservation approach (estimate before call, reconcile after). Calculate the maximum overshoot given average completion token sizes.
Q: Design the token accounting system for a task that spawns 5 parallel sub-agents, each of which spawns 2 more sub-agents.
Expected depth: Explain the root_task_id column for billing aggregation. Discuss budget allocation at spawn time (each sub-agent gets 1/N of remaining parent budget). Describe how a sub-agent at depth 2 checks the root task’s Redis counter. Address the race condition where two sub-agents simultaneously check a counter that is close to the cap.
Q: How would you implement “replay” - letting a user re-run a failed task from step 5 with a different LLM?
Expected depth: Describe loading the checkpoint at step 5, slicing the conversation history to that point, updating the model configuration, and re-queueing the task. Discuss the challenge of replaying tool calls that had side effects (idempotency key collision). Explain why the trace store (not the checkpoint) is needed for perfect replay - the checkpoint only stores what the agent “saw,” not full timing and metadata.
Q: How does your architecture handle a sub-agent that exceeds its token budget while the parent is waiting for its result?
Expected depth: Sub-agent status transitions to cost_exceeded and stores a partial result. Parent’s wait_for_sub_agents receives the [cost_exceeded] marker as the result. Parent LLM sees this in the tool result and decides whether to proceed with partial data, request a re-run with more budget, or fail the overall task. Budget escalation flow: user can approve additional budget via the approval endpoint, which increments the Redis counter.
Premium Content
Unlock the full article along with everything else in the archive — all in one place.