System Design - Reliability

Build an AI Agent Task Queue with Resumability (Premium)

Design an agent execution framework that handles long-running tool calls, persists agent state so tasks can be paused and resumed across server restarts, enforces token and cost caps, and supports parallel sub-agent execution.

#02 Deployment

Build a Blue-Green Deployment System (Premium)

Design a deployment system that can switch 100% of production traffic from the old version to the new version in under 30 seconds and roll back instantly if error rates spike.

#03 Reliability

Build a Chaos Engineering Platform (Premium)

Design a controlled failure injection platform that safely introduces latency, packet loss, and resource exhaustion into production services, enforces blast radius limits, and automatically halts experiments when SLOs degrade.

#04 Deployment

Build a CI/CD Pipeline Orchestrator (Premium)

Design a system that handles 10,000 concurrent build and test jobs, assigns them to workers, streams logs in real time, and ensures no job is lost even if a worker crashes mid-run.

#05 Distributed Systems

Build a Concurrent Device Session Manager (Premium)

Design a session enforcement system that limits streaming to N concurrent devices per subscription plan, kicks the oldest session when the limit is exceeded, and works correctly across distributed servers.

#06 Distributed Systems

Build a Distributed Cron Job Scheduler (Premium)

Design a cron scheduler that guarantees exactly-once execution of jobs across a cluster of nodes, handles missed executions during downtime, and scales to millions of scheduled tasks.

#07 Distributed Systems

Build a Distributed Key-Value Store (Premium)

Design a distributed KV store that partitions data using consistent hashing, replicates each key across N nodes for durability, and handles node failures with tunable consistency guarantees.
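
As a rough illustration of the partitioning and replication idea in this teaser, here is a minimal sketch of a consistent-hash ring that picks N distinct replica nodes for a key. The names `HashRing`, `replicas`, and `vnodes`, the use of SHA-256, and the default of 64 virtual nodes are all assumptions made for the example; they are not taken from the post.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    # Assumption: first 8 bytes of SHA-256 as the ring position.
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


class HashRing:
    """Hypothetical consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=64):
        # Each physical node is placed at `vnodes` positions on the ring.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]
        self._node_count = len(set(nodes))

    def replicas(self, key, n):
        """Return up to n distinct nodes, walking clockwise from the key."""
        n = min(n, self._node_count)
        if n <= 0:
            return []
        start = bisect.bisect_right(self._points, _hash(key))
        owners = []
        for i in range(len(self._ring)):
            node = self._ring[(start + i) % len(self._ring)][1]
            if node not in owners:
                owners.append(node)
                if len(owners) == n:
                    break
        return owners


ring = HashRing(["node-a", "node-b", "node-c", "node-d"])
owners = ring.replicas("user:42", 3)
```

Removing a node that does not own a key leaves that key's replica list unchanged, which is the property that keeps rebalancing local. Tunable consistency (read and write quorums) and failure handling are outside this sketch.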

#08 Distributed Systems

Build a Distributed Lock Manager and Leader Election Service (Premium)

Design a distributed coordination service that provides mutual exclusion across a cluster, implements lease-based locks that expire on node failure, and enables leader election for singleton workloads.
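
To make the lease idea concrete, the following is a single-process sketch of a lock whose ownership expires after a time-to-live unless it is renewed. It is not distributed: a real service would keep this state in a replicated store. The class name `LeaseLock`, its methods, and the monotonically increasing fencing token are assumptions for illustration only.

```python
import time


class LeaseLock:
    """Hypothetical in-memory lease lock; state is NOT replicated."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock          # injectable so expiry can be tested
        self._owner = None
        self._expires_at = 0.0
        self._token = 0              # fencing token, bumped on each grant

    def _held(self):
        return self._owner is not None and self._clock() < self._expires_at

    def acquire(self, owner, ttl):
        """Grant the lease if free or expired; return a token or None."""
        if self._held():
            return None
        self._token += 1
        self._owner = owner
        self._expires_at = self._clock() + ttl
        return self._token

    def renew(self, owner, token, ttl):
        """Extend the lease only for the current, unexpired holder."""
        if self._held() and self._owner == owner and self._token == token:
            self._expires_at = self._clock() + ttl
            return True
        return False

    def release(self, owner, token):
        """Release the lease if the caller still holds this grant."""
        if self._owner == owner and self._token == token:
            self._owner = None
            return True
        return False
```

If a holder crashes and stops renewing, the lease lapses and another node can acquire it. The token lets downstream systems reject writes from a holder whose lease has since been granted to someone else.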

#09 Distributed Systems

Build a Distributed Unique ID Generator (Free)

Design a system that generates globally unique, roughly time-sortable IDs across thousands of nodes with no coordination overhead and zero collisions under any failure scenario.
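
One common way to get roughly time-sortable IDs without coordination is a Snowflake-style layout: timestamp bits, then node bits, then a per-millisecond sequence. The sketch below assumes that layout; the bit widths, the custom epoch `EPOCH_MS`, and the class name `IdGenerator` are illustrative choices, not the post's design. It also assumes each node is given a unique `node_id` out of band, and it simply refuses to issue IDs if the clock moves backwards, so it does not by itself cover every failure scenario.

```python
import threading
import time

EPOCH_MS = 1_600_000_000_000   # hypothetical custom epoch (milliseconds)
NODE_BITS = 10                 # assumed: up to 1024 nodes
SEQ_BITS = 12                  # assumed: up to 4096 IDs per node per ms
MAX_SEQ = (1 << SEQ_BITS) - 1


class IdGenerator:
    """Hypothetical Snowflake-style generator for one node."""

    def __init__(self, node_id, clock=lambda: int(time.time() * 1000)):
        if not 0 <= node_id < (1 << NODE_BITS):
            raise ValueError("node_id out of range")
        self.node_id = node_id
        self._clock = clock
        self._last_ms = -1
        self._seq = 0
        self._lock = threading.Lock()

    def next_id(self):
        with self._lock:
            now = self._clock()
            if now < self._last_ms:
                # Issuing an ID here could repeat an earlier one.
                raise RuntimeError("clock moved backwards")
            if now == self._last_ms:
                self._seq = (self._seq + 1) & MAX_SEQ
                if self._seq == 0:
                    # Sequence exhausted: spin until the next millisecond.
                    while now <= self._last_ms:
                        now = self._clock()
            else:
                self._seq = 0
            self._last_ms = now
            timestamp = now - EPOCH_MS
            return (
                (timestamp << (NODE_BITS + SEQ_BITS))
                | (self.node_id << SEQ_BITS)
                | self._seq
            )
```

IDs from one node are strictly increasing, and IDs from different nodes differ in their node bits, so uniqueness holds as long as node IDs are never shared and the clock guard is respected across restarts.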

#10 Databases

Build a Distributed Write-Ahead Log (Premium)

Design a durable WAL that survives single-node crashes, replicates log entries across nodes before acknowledging writes, and supports point-in-time recovery by replaying from any checkpoint.
