Build a Hot-Cold Tiered Storage System
databases performance cost-optimization
System Design Deep Dive
Hot-Cold Tiered Storage
When NVMe cost meets object storage scale - how to make the tiers invisible to your application
Think of your data as a library. When you first check out a book you keep it on your desk - instantly reachable. As weeks pass and you stop reading it, it migrates to your bookshelf. Eventually it goes to the basement archive. The key insight: when someone desperately needs a basement book, a librarian fetches it for them. The visitor never needs to know it was in the basement at all.
That is exactly the contract a hot-cold tiered storage system must deliver. Applications issue reads and writes against a single logical namespace. Underneath, a policy engine continuously classifies objects by access temperature, NVMe SSDs hold the hot fraction, object storage holds the cold fraction, and a transparent retrieval path handles cache misses without a single line of application code changing.
The engineering tension here is real and non-trivial. NVMe SSDs deliver sub-millisecond random reads and cost roughly $0.10-0.30 per GB per month. S3 Standard-IA costs $0.0125 per GB per month - a 10-24x difference. A system that stores 100 TB total but serves 90% of read traffic from the hot 5% of data can cut storage costs by 80% while keeping p99 read latency under 2ms for the overwhelming majority of requests. But achieving this without leaking complexity to callers requires solving five hard problems simultaneously: access frequency tracking at scale, a tiering policy engine that makes correct promotion and demotion decisions, transparent retrieval for cold objects, migration job design that doesn’t saturate the network, and careful management of rehydration latency for cold misses.
We need to solve for correctness of tier placement, operational safety during migration, and zero application coupling - simultaneously.
Requirements and Constraints
Functional Requirements
- Applications read and write objects via a single unified API endpoint regardless of which tier the data lives on
- The system automatically migrates objects from hot (NVMe SSD) to cold (object storage) based on access patterns and configurable age policies
- Cold objects are transparently retrieved and promoted to the hot tier on read access, with the application receiving the data without error
- A tiering policy engine evaluates objects on a configurable schedule and emits promotion/demotion decisions
- Access frequency counters are maintained per object with sliding-window decay
- The hot tier enforces a capacity ceiling with LFU-based eviction when utilization exceeds a threshold
- Object integrity is verified via checksums before deletion from source tier after copy
Non-Functional Requirements
- Hot tier read latency: p50 < 0.5ms, p99 < 2ms for objects on NVMe SSD
- Cold tier rehydration latency: p50 < 200ms, p99 < 800ms for objects fetched from S3 Standard-IA
- Migration throughput: sustain 100 MB/s of background migration traffic without impacting read SLA
- Metadata store: < 5ms for tier lookup on the read path
- Policy evaluation: full object corpus scanned within a 15-minute window at 10 billion objects
- Durability: 11 nines (S3 SLA) for cold tier; hot tier survives single node failures via RAID-10 or replica
- Availability: 99.99% read availability, cold miss returns degraded response within 30 seconds
- Scale: support up to 10 billion objects totaling 100 TB on the hot tier and 10 PB in cold storage
Constraints and Assumptions
- Application code cannot change - the system must be a drop-in layer below existing storage clients
- Object sizes range from 4 KB to 5 GB; the tiering policy is object-level, not block-level
- Write path always lands on the hot tier first - there is no direct cold write path
- The system is not a filesystem; it does not support
renameor hard links as atomic operations - Cost optimization targets S3 storage costs; S3 API request costs are treated as secondary
- A single region is the primary design scope; multi-region replication is a stretch goal
High-Level Architecture
The system divides cleanly into five layers: the Storage Proxy that intercepts all I/O, the Access Tracker that maintains frequency counters, the Tiering Policy Engine that makes classification decisions, the Migration Engine that physically moves data, and the Metadata Store that records where each object lives.
Every read starts at the Storage Proxy. The proxy performs a metadata lookup to determine the object’s current tier. If the object is hot, the proxy issues a direct read against the NVMe SSD pool and returns data in under 1ms. If the object is cold, the proxy enqueues a rehydration job, optionally returns a 503 with a Retry-After header, and a background worker fetches the object from S3, writes it to the hot tier, and updates metadata. Subsequent reads for that object hit the hot tier.
Every write lands on the hot tier first. The proxy writes to NVMe SSD, records tier=hot in the Metadata Store, and increments the object’s heat_score. The Access Tracker maintains a sliding window of access events per object using a decay function; objects that go unaccessed for configurable periods see their score drop below the cold migration threshold. The Policy Engine, running on a fixed interval, scans objects whose scores have crossed tier boundaries and emits migration decisions to the Migration Scheduler. The scheduler batches these into migration jobs, rate-limits them to avoid network saturation, and hands them to Downgrade or Rehydration Workers depending on direction.
The proxy must never let a cold miss block the read indefinitely - returning a 503 with Retry-After lets the caller retry after rehydration completes, making the cold path observable and bounded rather than a silent multi-second hang.
The Storage Proxy
The Storage Proxy is the only component that callers talk to - it is the system’s face and must add no more than 1-2ms of overhead on the hot path.
The naive design would be to make the proxy stateless: look up every object’s tier in the metadata store on every request, then route accordingly. At 50,000 reads/second, that is 50,000 metadata lookups/second - completely tractable with a hot metadata cache. The non-obvious failure is stale routing: if the metadata cache holds a tier=hot entry for an object that was just evicted, the proxy will attempt to read from NVMe SSD and get a miss. The recovery path must handle this gracefully by falling back to the metadata store and then to cold storage before returning an error.
# Storage proxy read path with tier-aware routing and fallback
import time
from enum import Enum
from dataclasses import dataclass
from typing import Optional
class Tier(Enum):
HOT = "hot"
COLD = "cold"
ARCHIVE = "archive"
@dataclass
class ObjectMeta:
key: str
tier: Tier
size_bytes: int
heat_score: float
last_accessed: float
checksum_sha256: str
class StorageProxy:
def __init__(self, metadata_store, nvme_pool, object_store, rehydration_queue, access_tracker):
self.metadata = metadata_store
self.nvme = nvme_pool
self.s3 = object_store
self.rehydration_q = rehydration_queue
self.access_tracker = access_tracker
# Local LRU cache for metadata - 1M entries, 30s TTL
self._meta_cache = LRUCache(capacity=1_000_000, ttl_seconds=30)
def read(self, key: str, allow_stale: bool = False) -> tuple[bytes, int]:
"""Returns (data, http_status). 503 if cold miss enqueued."""
meta = self._get_meta(key)
if meta is None:
return b"", 404
# Track access regardless of tier
self.access_tracker.record_access(key, meta.size_bytes)
if meta.tier == Tier.HOT:
data = self.nvme.read(key)
if data is not None:
return data, 200
# Stale cache: object was evicted but metadata not yet updated
# Refresh metadata from source of truth and try cold path
meta = self.metadata.get(key, bypass_cache=True)
self._meta_cache.set(key, meta)
if meta.tier in (Tier.COLD, Tier.ARCHIVE):
# Enqueue rehydration and return 503
rehydration_eta_seconds = self._estimate_rehydration_time(meta)
job_id = self.rehydration_q.enqueue(key, priority="interactive")
return b"", 503 # caller checks Retry-After header
return b"", 500
def write(self, key: str, data: bytes) -> int:
"""Always writes to hot tier. Returns HTTP status."""
import hashlib
checksum = hashlib.sha256(data).hexdigest()
# Write to NVMe first
self.nvme.write(key, data)
# Record metadata
meta = ObjectMeta(
key=key,
tier=Tier.HOT,
size_bytes=len(data),
heat_score=100.0, # new writes start at max heat
last_accessed=time.time(),
checksum_sha256=checksum,
)
self.metadata.put(key, meta)
self._meta_cache.set(key, meta)
self.access_tracker.record_write(key)
return 201
def _get_meta(self, key: str) -> Optional[ObjectMeta]:
cached = self._meta_cache.get(key)
if cached:
return cached
meta = self.metadata.get(key)
if meta:
self._meta_cache.set(key, meta)
return meta
def _estimate_rehydration_time(self, meta: ObjectMeta) -> int:
"""Estimate seconds for rehydration based on size and tier."""
if meta.tier == Tier.COLD:
return max(5, meta.size_bytes // (50 * 1024 * 1024)) # ~50MB/s from S3
return max(60, meta.size_bytes // (10 * 1024 * 1024)) # ~10MB/s from Glacier
The proxy must also handle the write-after-eviction race: if a write arrives for a key that the Migration Engine is currently copying to cold storage, the proxy must either block the migration or accept the write and mark the in-progress migration as stale. We use optimistic concurrency via a generation counter in the metadata store: the migration worker only deletes from the hot tier if the generation matches what it read at job creation time.
Never delete the hot copy until S3 PUT has returned 200 and you have verified the checksum. A worker crash between delete and upload confirmation causes permanent data loss. Always copy-then-verify-then-delete, never move.
The Access Tracker
The Access Tracker’s job is simple to state and surprisingly tricky to implement correctly at scale: maintain a heat score per object that decays over time and increases with each access.
Think of the heat score like a radioactive isotope with a configurable half-life. Each access adds energy; time causes exponential decay. An object accessed 100 times last week but zero times this week should rank below an object accessed 5 times today. Simple count-since-creation doesn’t capture this recency weighting.
The standard implementation uses exponential decay: score = score * e^(-lambda * delta_t) + access_increment. At 10 billion objects, maintaining per-object scores in memory is infeasible (even at 16 bytes per score, that is 160 GB). The solution is a Count-Min Sketch for approximate frequency, backed by a time-decayed sliding window using Redis sorted sets for the top-N hottest objects.
# Access frequency tracking with sliding window decay
import time
import math
import redis
class AccessTracker:
def __init__(self, redis_client: redis.Redis, decay_lambda: float = 0.1):
"""
decay_lambda: decay rate per hour. 0.1 = half-life of ~7 hours.
Scores decay as: score *= exp(-lambda * hours_elapsed)
"""
self.r = redis_client
self.decay_lambda = decay_lambda
# Redis sorted set: key=object_key, score=heat_score
self.HEAT_ZSET = "tier:heat_scores"
# Hash: key=object_key, value=last_update_ts (unix float)
self.LAST_UPDATE_HASH = "tier:last_update"
def record_access(self, object_key: str, size_bytes: int) -> float:
"""Increment heat score with exponential decay applied first."""
now = time.time()
pipe = self.r.pipeline()
last_update_raw = self.r.hget(self.LAST_UPDATE_HASH, object_key)
last_update = float(last_update_raw) if last_update_raw else now
hours_elapsed = (now - last_update) / 3600.0
# Fetch current score (default 0 if new)
current_score_raw = self.r.zscore(self.HEAT_ZSET, object_key)
current_score = float(current_score_raw) if current_score_raw else 0.0
# Apply decay then add access increment
# Access increment is 1.0 for small objects, scaled for large ones
access_weight = 1.0 + math.log10(max(1, size_bytes / 4096))
decayed_score = current_score * math.exp(-self.decay_lambda * hours_elapsed)
new_score = decayed_score + access_weight
pipe.zadd(self.HEAT_ZSET, {object_key: new_score})
pipe.hset(self.LAST_UPDATE_HASH, object_key, now)
pipe.execute()
return new_score
def get_score(self, object_key: str) -> float:
"""Get current decayed heat score."""
now = time.time()
score_raw = self.r.zscore(self.HEAT_ZSET, object_key)
if score_raw is None:
return 0.0
last_update_raw = self.r.hget(self.LAST_UPDATE_HASH, object_key)
if last_update_raw is None:
return float(score_raw)
hours_elapsed = (now - float(last_update_raw)) / 3600.0
return float(score_raw) * math.exp(-self.decay_lambda * hours_elapsed)
def get_cold_candidates(self, threshold: float, limit: int = 10000) -> list[str]:
"""Return object keys whose decayed score is below the cold threshold."""
# Use ZRANGEBYSCORE to get lowest-scoring objects efficiently
# Score of 0 = never accessed or fully decayed
return self.r.zrangebyscore(self.HEAT_ZSET, 0, threshold, start=0, num=limit)
def get_hot_candidates_for_promotion(self, threshold: float, limit: int = 5000) -> list[str]:
"""Return cold objects whose score has risen above the hot threshold."""
return self.r.zrangebyscore(self.HEAT_ZSET, threshold, "+inf", start=0, num=limit)
Netflix’s EVCache and Facebook’s Memcached fleet use virtually identical decay-weighted scoring to decide which objects to evict from memory. At Netflix’s scale, the decay lambda is tuned per content type - live-stream chunks decay faster than back-catalog movies because the access tail is much shorter.
For the metadata path, the Access Tracker also feeds a Count-Min Sketch running in each proxy instance for approximate real-time frequency. This lets the proxy make local eviction decisions without a round-trip to Redis for every access. The Redis sorted set is the authoritative source used by the Policy Engine’s periodic scan.
The Tiering Policy Engine
The Policy Engine is the system’s brain: it reads heat scores from the Access Tracker, applies configured rules, and emits tier change decisions. The component internals are shown below.
The non-obvious design requirement is hysteresis: without it, an object oscillating around the hot/cold threshold boundary will be migrated back and forth continuously, burning migration bandwidth and creating metadata churn. Like a thermostat with a dead band, the policy uses separate thresholds for promotion and demotion: an object must fall well below the HOT_DEMOTE_THRESHOLD to be evicted, and it must rise well above the COLD_PROMOTE_THRESHOLD to be rehydrated. The gap between these thresholds is the hysteresis band.
# Tiering policy engine with hysteresis and configurable rules
from dataclasses import dataclass
from enum import Enum
import time
class TierDecision(Enum):
STAY_HOT = "stay_hot"
DEMOTE = "demote"
PROMOTE = "promote"
STAY_COLD = "stay_cold"
@dataclass
class TierPolicy:
# Score thresholds (use separate values to create hysteresis band)
hot_demote_threshold: float = 2.0 # below this: demote hot -> cold
cold_promote_threshold: float = 8.0 # above this: promote cold -> hot
# Age constraints
min_hot_age_seconds: float = 3600.0 # 1 hour min on hot tier
min_cold_age_seconds: float = 86400.0 * 30 # 30 days min on cold tier
# Capacity-triggered eviction (bypass age check)
hot_capacity_override_pct: float = 0.85 # evict if hot tier > 85% full
@dataclass
class ObjectRecord:
key: str
tier: str
heat_score: float
tier_assigned_at: float
size_bytes: int
class PolicyEngine:
def __init__(self, policy: TierPolicy, access_tracker, metadata_store, capacity_monitor):
self.policy = policy
self.tracker = access_tracker
self.metadata = metadata_store
self.capacity = capacity_monitor
# Cooldown registry: key -> last_moved_ts. Prevents rapid flip-flop.
self._cooldown: dict[str, float] = {}
self.COOLDOWN_SECONDS = 1800 # 30 minutes between any moves
def evaluate(self, record: ObjectRecord) -> TierDecision:
now = time.time()
score = self.tracker.get_score(record.key)
time_on_tier = now - record.tier_assigned_at
# Enforce cooldown regardless of other conditions
last_move = self._cooldown.get(record.key, 0)
if now - last_move < self.COOLDOWN_SECONDS:
return TierDecision.STAY_HOT if record.tier == "hot" else TierDecision.STAY_COLD
if record.tier == "hot":
# Normal demotion: score below threshold AND object has been hot long enough
if score < self.policy.hot_demote_threshold and time_on_tier >= self.policy.min_hot_age_seconds:
self._cooldown[record.key] = now
return TierDecision.DEMOTE
# Capacity-triggered eviction: bypass age check but respect cooldown
hot_util = self.capacity.hot_tier_utilization()
if hot_util > self.policy.hot_capacity_override_pct:
# Only evict if this object is below the demote threshold
if score < self.policy.hot_demote_threshold:
self._cooldown[record.key] = now
return TierDecision.DEMOTE
return TierDecision.STAY_HOT
elif record.tier == "cold":
# Promote: score crossed hot threshold AND object has aged enough on cold
if score >= self.policy.cold_promote_threshold and time_on_tier >= self.policy.min_cold_age_seconds:
self._cooldown[record.key] = now
return TierDecision.PROMOTE
return TierDecision.STAY_COLD
# Fallback - unknown tier, treat as cold
return TierDecision.STAY_COLD
def run_sweep(self, batch_size: int = 5000) -> dict[str, int]:
"""Scan objects and emit migration decisions. Returns counts per decision."""
counts = {d.value: 0 for d in TierDecision}
# Process hot candidates for demotion
cold_candidates = self.tracker.get_cold_candidates(
threshold=self.policy.hot_demote_threshold,
limit=batch_size
)
for key in cold_candidates:
record = self.metadata.get(key)
if record and record.tier == "hot":
decision = self.evaluate(record)
counts[decision.value] += 1
if decision == TierDecision.DEMOTE:
self.metadata.enqueue_migration(key, direction="demote")
# Process cold candidates for promotion
hot_candidates = self.tracker.get_hot_candidates_for_promotion(
threshold=self.policy.cold_promote_threshold,
limit=batch_size // 4 # promotions are rare
)
for key in hot_candidates:
record = self.metadata.get(key)
if record and record.tier == "cold":
decision = self.evaluate(record)
counts[decision.value] += 1
if decision == TierDecision.PROMOTE:
self.metadata.enqueue_migration(key, direction="promote")
return counts
The hysteresis band is not optional. Set hot_demote_threshold and cold_promote_threshold at least 4x apart. If an object’s score oscillates around a single shared threshold, you get migration thrashing that can consume 30% of your migration bandwidth on churn with zero net benefit.
Migration Job Design
The Migration Engine physically moves data between tiers. Think of it like a postal sorting system: individual packages (objects) are batched into delivery runs (jobs), each run has a rate limit, and packages are confirmed delivered before being removed from the origin post office.
The data flow through the full request and migration path is illustrated below.
The migration worker must be safe to kill at any point without corrupting data. The key is treating migration as a two-phase operation with a metadata commit:
COPY: Read from source, write to destination with checksum verificationCOMMIT: Update metadata store to reflect new tier and new locationDELETE: Remove from source tier only after metadata commit succeeds
If the worker dies after COPY but before COMMIT, the object exists in both tiers. A reconciliation scan detects this via a migration_state = pending_commit flag in metadata and completes the commit. If the worker dies after COMMIT but before DELETE, the object has been safely moved but the hot tier holds a stale copy. The reconciliation scan detects orphaned hot copies (metadata says cold, hot tier contains the key) and deletes them.
// Migration worker: copy-verify-commit-delete pattern
package migration
import (
"context"
"crypto/sha256"
"fmt"
"io"
"log"
"time"
)
type MigrationJob struct {
JobID string
ObjectKey string
Direction string // "demote" or "promote"
SourceTier string
DestTier string
ExpectedSize int64
ExpectedSHA string
GenerationID int64 // for optimistic concurrency check
CreatedAt time.Time
}
type MigrationWorker struct {
nvme NVMePool
s3 S3Client
metadata MetadataStore
jobQueue JobQueue
auditLog AuditLogger
}
func (w *MigrationWorker) ProcessJob(ctx context.Context, job MigrationJob) error {
// Step 1: Mark job as in-progress
if err := w.metadata.SetMigrationState(job.ObjectKey, "in_progress", job.JobID); err != nil {
return fmt.Errorf("failed to mark in_progress: %w", err)
}
// Step 2: Read from source
var data []byte
var readErr error
if job.Direction == "demote" {
data, readErr = w.nvme.Read(ctx, job.ObjectKey)
} else {
data, readErr = w.s3.GetObject(ctx, job.ObjectKey)
}
if readErr != nil {
w.metadata.SetMigrationState(job.ObjectKey, "failed", job.JobID)
return fmt.Errorf("source read failed: %w", readErr)
}
// Step 3: Verify source checksum
hash := sha256.Sum256(data)
actualSHA := fmt.Sprintf("%x", hash)
if actualSHA != job.ExpectedSHA {
return fmt.Errorf("source checksum mismatch: expected %s got %s", job.ExpectedSHA, actualSHA)
}
// Step 4: Write to destination
var writeErr error
if job.Direction == "demote" {
writeErr = w.s3.PutObject(ctx, job.ObjectKey, data, job.ExpectedSHA)
} else {
writeErr = w.nvme.Write(ctx, job.ObjectKey, data)
}
if writeErr != nil {
w.metadata.SetMigrationState(job.ObjectKey, "failed", job.JobID)
return fmt.Errorf("destination write failed: %w", writeErr)
}
// Step 5: COMMIT - update metadata tier (optimistic concurrency check)
newTier := job.DestTier
if err := w.metadata.CommitTierChange(
job.ObjectKey, newTier, job.GenerationID,
); err != nil {
// GenerationID mismatch = a write arrived for this object during migration
// The source copy is still valid; abort migration to avoid data loss
log.Printf("migration aborted due to concurrent write: key=%s err=%v", job.ObjectKey, err)
// Clean up the destination copy we just wrote
w.cleanupDestination(ctx, job)
return fmt.Errorf("concurrent write conflict: %w", err)
}
// Step 6: DELETE source - only safe after metadata commit
if job.Direction == "demote" {
if err := w.nvme.Delete(ctx, job.ObjectKey); err != nil {
// Non-fatal: orphan cleanup scan will handle this
log.Printf("warning: failed to delete hot copy after demote: key=%s err=%v", job.ObjectKey, err)
}
} else {
// For promotions, keep cold copy for durability - S3 lifecycle rule handles eventual deletion
}
w.auditLog.LogMigration(job, "completed", time.Now())
return nil
}
func (w *MigrationWorker) cleanupDestination(ctx context.Context, job MigrationJob) {
if job.Direction == "demote" {
w.s3.DeleteObject(ctx, job.ObjectKey)
} else {
w.nvme.Delete(ctx, job.ObjectKey)
}
}
Rate limiting the migration workers is critical. Without it, a backlog of demotion jobs can saturate your NVMe read bandwidth or your S3 upload bandwidth, causing hot-tier read latency to spike. The scheduler uses a token bucket with separate buckets for read bandwidth (NVMe) and write bandwidth (S3 egress), with the hot-path reads getting higher priority tokens.
Apache Ozone, the HDFS successor used at scale by Alibaba and Tencent, uses a nearly identical migration job design with generation-based optimistic concurrency. When a file is modified during replication, the replication coordinator detects the generation mismatch and requeues the job rather than committing a stale copy.
Data Model
The metadata store is the system’s single source of truth for object location. Every read and migration decision flows through it, so it must be fast, consistent, and partition-tolerant.
-- Core metadata table: one row per stored object
CREATE TABLE object_metadata (
object_key TEXT NOT NULL,
namespace TEXT NOT NULL DEFAULT 'default',
tier TEXT NOT NULL CHECK (tier IN ('hot', 'cold', 'archive')),
size_bytes BIGINT NOT NULL,
checksum_sha256 CHAR(64) NOT NULL,
generation BIGINT NOT NULL DEFAULT 1, -- for optimistic concurrency
tier_assigned_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
last_accessed_at TIMESTAMPTZ,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
migration_state TEXT CHECK (migration_state IN ('idle', 'in_progress', 'pending_commit', 'failed')),
migration_job_id TEXT, -- active job ID if in_progress
hot_path TEXT, -- NVMe node + local path
cold_path TEXT, -- S3 URI (s3://bucket/prefix/key)
archive_path TEXT, -- Glacier vault ARN + archive ID
PRIMARY KEY (namespace, object_key)
);
-- Index for policy engine: scan objects by tier + last_accessed
CREATE INDEX idx_metadata_tier_accessed
ON object_metadata (tier, last_accessed_at)
WHERE migration_state = 'idle';
-- Index for orphan detection: find objects stuck in migration
CREATE INDEX idx_metadata_migration_state
ON object_metadata (migration_state, tier_assigned_at)
WHERE migration_state IN ('in_progress', 'pending_commit');
-- Partition by namespace for multi-tenant isolation
-- In production this would be partitioned by namespace hash range
-- Access event log (append-only, used by access tracker reconciliation)
CREATE TABLE access_events (
event_id BIGSERIAL PRIMARY KEY,
object_key TEXT NOT NULL,
namespace TEXT NOT NULL,
event_type TEXT NOT NULL CHECK (event_type IN ('read', 'write', 'rehydration')),
occurred_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
size_bytes BIGINT,
source_ip INET,
requester_id TEXT
) PARTITION BY RANGE (occurred_at);
CREATE TABLE access_events_2026_06 PARTITION OF access_events
FOR VALUES FROM ('2026-06-01') TO ('2026-07-01');
-- Migration audit log (append-only)
CREATE TABLE migration_audit (
job_id TEXT PRIMARY KEY,
object_key TEXT NOT NULL,
namespace TEXT NOT NULL,
direction TEXT NOT NULL CHECK (direction IN ('demote', 'promote')),
from_tier TEXT NOT NULL,
to_tier TEXT NOT NULL,
status TEXT NOT NULL CHECK (status IN ('started', 'completed', 'failed', 'aborted')),
bytes_moved BIGINT,
duration_ms INT,
error_message TEXT,
started_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
completed_at TIMESTAMPTZ
);
CREATE INDEX idx_audit_key_time ON migration_audit (object_key, started_at DESC);
The partitioning strategy for the metadata table is critical at 10 billion objects. A single object_metadata table at that scale needs to be sharded. The natural sharding key is (namespace, object_key) with consistent hashing across shards. Within a shard, the policy engine scan uses the idx_metadata_tier_accessed index to efficiently find hot objects that have gone cold without a full table scan.
Store hot_path, cold_path, and archive_path as separate nullable columns, not a single location field. This makes tier transitions atomic - you can verify the cold path exists before nulling the hot path, and the database row always has a valid pointer to at least one tier.
The generation column provides optimistic concurrency for the migration worker. When a migration job is created, the worker reads the current generation. At the COMMIT phase, it executes:
-- Atomic tier commit with optimistic concurrency check
UPDATE object_metadata
SET tier = 'cold',
cold_path = $1,
hot_path = NULL,
migration_state = 'idle',
migration_job_id = NULL,
tier_assigned_at = NOW(),
generation = generation + 1
WHERE namespace = $2
AND object_key = $3
AND generation = $4 -- fails if a concurrent write bumped the generation
AND migration_state = 'in_progress';
-- If 0 rows updated: concurrent write conflict, abort migration
Key Algorithms and Protocols
Exponential Decay Heat Scoring
The heat score algorithm must balance recency and frequency. A pure frequency count favors historically popular objects that are no longer accessed. A pure recency count ignores that an object accessed 1000 times recently is much hotter than one accessed once. Exponential decay combines both:
# Heat score with exponential decay - time complexity O(1) per access
import math
import time
def compute_decayed_score(
current_score: float,
last_update_ts: float,
access_weight: float = 1.0,
decay_lambda: float = 0.1, # per-hour decay rate
) -> float:
"""
Computes new heat score after applying time decay and adding access weight.
Half-life = ln(2) / decay_lambda hours.
With lambda=0.1: half-life ~7 hours.
With lambda=0.02: half-life ~35 hours (for slower-moving data).
"""
hours_elapsed = (time.time() - last_update_ts) / 3600.0
decay_factor = math.exp(-decay_lambda * hours_elapsed)
return current_score * decay_factor + access_weight
def score_after_inactivity(initial_score: float, hours: float, decay_lambda: float = 0.1) -> float:
"""
How much score remains after N hours of inactivity?
Used to pre-compute demotion time without scanning all objects.
Example: score=50, lambda=0.1, hours=69.3 -> score ~= 0.05 (near zero)
"""
return initial_score * math.exp(-decay_lambda * hours)
def time_to_cold_threshold(current_score: float, cold_threshold: float, decay_lambda: float) -> float:
"""
How many hours until this object's score decays to the cold threshold?
Useful for scheduling future policy scans.
Returns hours until demotion if no more accesses occur.
"""
if current_score <= cold_threshold:
return 0.0
# Solve: current_score * exp(-lambda * t) = threshold
# t = -ln(threshold / current_score) / lambda
return -math.log(cold_threshold / current_score) / decay_lambda
The time_to_cold_threshold function is particularly useful for the Policy Engine scheduler: rather than scanning all hot objects every 15 minutes, we can pre-compute when each hot object is predicted to cross the threshold and schedule the scan accordingly. Objects with predicted_demote_time > now + 24h don’t need to be scanned this cycle.
LFU Eviction Under Capacity Pressure
When the hot tier exceeds its capacity ceiling, the eviction logic must select victims quickly without iterating every object. The standard O(1) LFU implementation uses a doubly-linked list of frequency buckets:
# O(1) LFU cache for hot tier eviction tracking
from collections import defaultdict, OrderedDict
class LFUEvictionTracker:
"""
Tracks the minimum-frequency objects for eviction candidates.
Uses the O(1) LFU algorithm by Ketan Shah et al.
"""
def __init__(self, capacity: int):
self.capacity = capacity
self.size = 0
self.min_freq = 0
# key -> (value, freq)
self.key_freq: dict[str, int] = {}
# freq -> OrderedDict of keys (maintains insertion order for same-freq tie-break)
self.freq_keys: dict[int, OrderedDict] = defaultdict(OrderedDict)
def access(self, key: str) -> None:
"""Record an access for key, updating its frequency bucket."""
if key in self.key_freq:
freq = self.key_freq[key]
del self.freq_keys[freq][key]
if not self.freq_keys[freq] and freq == self.min_freq:
self.min_freq += 1
self.key_freq[key] = freq + 1
self.freq_keys[freq + 1][key] = True
else:
self.key_freq[key] = 1
self.freq_keys[1][key] = True
self.min_freq = 1
self.size += 1
def evict(self) -> str:
"""Evict the least-frequently-used key. Returns the evicted key."""
if self.size == 0:
raise ValueError("Nothing to evict")
# Get LFU key (oldest among ties)
lfu_key, _ = self.freq_keys[self.min_freq].popitem(last=False)
del self.key_freq[lfu_key]
self.size -= 1
return lfu_key
def get_eviction_candidates(self, n: int) -> list[str]:
"""Get top-N eviction candidates without actually evicting."""
candidates = []
freq = self.min_freq
while len(candidates) < n:
bucket = self.freq_keys.get(freq, {})
for key in bucket:
candidates.append(key)
if len(candidates) >= n:
break
freq += 1
if freq > max(self.freq_keys.keys(), default=freq):
break
return candidates
The LFU eviction tracker runs in the proxy layer, not in the Policy Engine. The proxy makes instant local eviction decisions under capacity pressure using the in-memory LFU structure, then asynchronously notifies the Migration Scheduler to back-fill with S3 copies. This decouples the write-hot fast path from the slower background migration system.
Consistent Hashing for NVMe Shard Routing
The hot tier is sharded across multiple NVMe SSD nodes using consistent hashing. The key property is that adding or removing a node only redistributes 1/N of the keyspace, not everything:
# Consistent hash ring for NVMe shard routing
import hashlib
import bisect
class ConsistentHashRing:
def __init__(self, virtual_nodes: int = 150):
"""
virtual_nodes: higher = more even distribution, higher memory usage.
150 virtual nodes per physical node gives ~2% standard deviation in load.
"""
self.virtual_nodes = virtual_nodes
self.ring: dict[int, str] = {}
self.sorted_keys: list[int] = []
self.nodes: set[str] = set()
def _hash(self, key: str) -> int:
return int(hashlib.md5(key.encode()).hexdigest(), 16)
def add_node(self, node_id: str) -> None:
"""Add a physical node with virtual_nodes replicas on the ring."""
self.nodes.add(node_id)
for i in range(self.virtual_nodes):
vnode_key = self._hash(f"{node_id}:{i}")
self.ring[vnode_key] = node_id
bisect.insort(self.sorted_keys, vnode_key)
def remove_node(self, node_id: str) -> None:
"""Remove a node - only affects its portion of the ring."""
self.nodes.discard(node_id)
for i in range(self.virtual_nodes):
vnode_key = self._hash(f"{node_id}:{i}")
if vnode_key in self.ring:
del self.ring[vnode_key]
idx = bisect.bisect_left(self.sorted_keys, vnode_key)
self.sorted_keys.pop(idx)
def get_node(self, object_key: str) -> str:
"""Find the NVMe node responsible for this object key."""
if not self.ring:
raise ValueError("Ring is empty")
hash_val = self._hash(object_key)
idx = bisect.bisect(self.sorted_keys, hash_val)
# Wrap around
if idx == len(self.sorted_keys):
idx = 0
return self.ring[self.sorted_keys[idx]]
Scaling and Performance
The system scales differently across its components. The Storage Proxy scales horizontally - add proxy instances behind a load balancer, each with its own local metadata cache. The Access Tracker’s Redis sorted set is the first bottleneck: at 50,000 accesses/second, all updating the same sorted set, Redis single-threaded command processing saturates around 100K ops/second. The solution is key-partitioned Redis shards: partition the sorted set by murmur3(object_key) % N_SHARDS, with each shard running on a separate Redis instance.
The metadata store scales via horizontal sharding by (namespace, object_key). At 10 billion objects with 500 bytes per row, that is 5 TB of metadata - requiring 10-20 shards of a database like CockroachDB or Spanner. The policy engine scan is the most sensitive: it must evaluate billions of objects every 15 minutes. At 10B objects / 900 seconds = 11 million objects/second, which requires the scan to be parallelized across all metadata shards simultaneously with each shard owning its policy evaluation independently.
Capacity Estimation:
Given:
- 10 billion objects total
- Average object size: 1 MB (mix of 4KB configs to 5GB media)
- Hot tier: 5% of objects = 500M objects
- Hot tier storage: 500M objects * 1 MB avg = 500 TB NVMe
(In practice, objects on hot tier skew smaller: ~200 GB avg = 100 TB actual)
- Cold tier: 95% = 9.5B objects = 9.5 PB
- Read QPS: 50,000/s (p99 hot reads), 500/s cold rehydrations
NVMe SSD pool sizing (hot tier):
- 100 TB data
- IOPS needed: 50,000 reads/s at avg 16KB random read
- NVMe: 500K IOPS per node
- Min nodes: 50K / 500K = 0.1 nodes for IOPS (bandwidth-bound, not IOPS-bound)
- Bandwidth: 50K * 16KB = 800 MB/s; NVMe: 3.5 GB/s per node
- Min nodes for bandwidth: 800 / 3500 = 0.23 nodes
- Storage: 100 TB / 4 TB per NVMe node = 25 nodes minimum
- With replication (RAID-10): 50 NVMe nodes
Migration bandwidth:
- Target: demote 1% of hot tier per day = 1 TB/day
- Required: 1 TB / 86400s = 11.6 MB/s sustained
- With burst headroom: provision 100 MB/s (8.7x headroom)
- S3 multi-part upload: 100 MB/s easily achieved with 4 workers
Metadata store sizing:
- 10B objects * 500 bytes/row = 5 TB metadata
- With indexes: ~12 TB
- 16 CockroachDB nodes at 1 TB each with 3x replication = 48 nodes
- Or 4 Spanner instances at 3 TB each
Cost comparison (annual):
- All NVMe: 100 TB * $0.20/GB/mo * 12 = $245,760/yr
- Tiered (5% hot + 95% cold Standard-IA):
Hot: 5 TB * $0.20 * 12 = $12,288
Cold: 95 TB * $0.0125 * 12 = $14,250
Total: $26,538/yr -- 90% savings
The dominant bottleneck at high write volume is the metadata store write path: every write requires a synchronous metadata insert. At 10,000 writes/second, that is 10,000 inserts/second into the metadata table. Batch inserts (group 100 writes per transaction) reduce this to 100 transactions/second at the cost of 50-100ms write latency increase.
Dropbox Magic Pocket, their custom object store, uses precisely this tiering architecture at exabyte scale. The hot tier is a fleet of custom storage servers with NVMe SSDs; the cold tier is S3-compatible storage. Their migration daemon processes roughly 1 PB of demotions per day using a rate-limited job queue that yields to read traffic when the hot-tier disk queue depth exceeds a threshold.
Failure Modes and Recovery
| Failure | Detection | Impact | Recovery |
|---|---|---|---|
| NVMe node crash | Health check timeout < 5s; read errors on proxy | Hot objects on that node return errors; hit cold fallback | RAID-10 survives single drive; node replacement triggers re-replication from cold tier |
| Migration worker crash | Job heartbeat missing > 60s; migration_state='in_progress' for > 5min | Object left in pending migration state; possible duplicate in both tiers | Reconciliation scan resets migration_state to idle after 5min; orphan detector removes duplicates |
| Metadata store partition | Read timeout > 5ms; quorum loss alert | Proxy cannot route requests; all reads return errors | Route to read replicas for reads; queue writes to WAL; restore from Raft quorum |
| S3 outage / throttling | 503 SlowDown from S3 API; rehydration queue depth spike | Cold misses cannot complete; rehydration queue backs up | Exponential backoff on S3 calls; serve cached hot objects normally; alert on queue depth > 10K |
| Hot tier capacity 100% | disk_full errors from NVMe writes; capacity monitor alert | New writes fail; migration jobs cannot complete | Emergency LFU eviction of bottom 10% of hot objects; temporarily suspend promotions |
| Policy engine bug: wrong threshold | Objects rapidly migrating back and forth; high migration bandwidth | CPU and network burn on churn; latency impact on hot tier | Hysteresis cooldown limits damage; circuit breaker on migration rate > 500/min triggers alert |
The most common operational failure is the reconciliation scan falling behind during a mass migration event. If you demote 10% of your hot tier in a short window (e.g., after a decay lambda change), the orphan objects on the hot tier accumulate faster than the reconciliation scan clears them. This causes the hot tier to appear full even though metadata shows objects as cold. Always rate-limit configuration changes to tier thresholds.
Comparison of Approaches
| Approach | Tier Boundary Mechanism | Rehydration Strategy | Complexity | Best Fit |
|---|---|---|---|---|
| Time-based tiering | Fixed age threshold (e.g., demote after 30d) | On-demand async; caller waits | Low | Compliance/archive data with predictable access patterns |
| Frequency-based tiering (this design) | Decayed heat score thresholds | Transparent async with 503 + Retry-After | Medium | Mixed workloads with Zipf-distributed access |
| Workload-hint tiering | Application sets storage-class header | Immediate synchronous fetch | Low | Workloads where the application knows data temperature |
| ML-predicted tiering | Trained model predicts next-access time | Proactive prefetch before access | High | Media platforms with predictable content lifecycle |
| Cost-optimized tiering | Price/performance score including retrieval cost | Budget-aware: cold if retrieval < threshold | Medium | Cost-sensitive workloads with variable access frequency |
The frequency-based approach (this design) is the right default for most engineering teams. Time-based tiering is simpler but creates latency cliffs - data that was accessed yesterday at day 29 and today at day 31 gets a 500ms penalty that confuses callers. Workload-hint tiering requires application changes, violating our zero-coupling constraint. ML-predicted tiering adds operational complexity (model drift, retraining pipelines) that rarely justifies the marginal improvement in tier placement accuracy over a well-tuned decay function.
For workloads with highly predictable access patterns - video platforms where new uploads are hot for 24h then cold forever - time-based tiering is actually superior because it eliminates the policy evaluation overhead and is trivially explainable.
Key Takeaways
- Access frequency tracking requires exponential decay, not cumulative counts - cumulative counts permanently bias toward historically popular objects regardless of current access patterns.
- Hysteresis bands between promotion and demotion thresholds are mandatory - a single shared threshold creates migration thrashing that consumes bandwidth with no net benefit.
- Transparent retrieval means the proxy must handle cold misses gracefully via
503 + Retry-After, not by blocking the read thread for hundreds of milliseconds while rehydration runs. - Migration job safety depends on the copy-verify-commit-delete sequence with optimistic concurrency checks - any shortcut here risks permanent data loss when writes race with migrations.
- Rehydration latency is a user-visible SLA, not a background concern - S3 Standard-IA returns in 50-500ms, S3 Glacier Instant Retrieval in milliseconds to seconds, and S3 Glacier Deep Archive in hours; tier your cold tiers by retrieval SLA requirements.
- Eviction on hot tier during capacity pressure requires a separate fast path - the LFU eviction tracker in the proxy handles immediate capacity relief while the background migration system handles the slow copy-to-cold.
- Cost vs latency tradeoff is the central knob of this system - a lower cold threshold aggressively moves data cold (lower cost, higher cold-miss rate), a higher threshold keeps more data hot (higher cost, fewer cold misses).
- Metadata consistency is the hardest correctness problem - a stale metadata entry pointing to a nonexistent hot-tier location is more dangerous than a cold miss, because it returns an error rather than a latency penalty.
The counter-intuitive lesson from building this system is that the storage tiers themselves are not the hard problem. S3 and NVMe SSDs both work reliably. The hard problem is the metadata plane - specifically, keeping metadata strongly consistent with actual object locations while two workers (migration and rehydration) are simultaneously modifying the same objects. Teams that underinvest in metadata consistency find themselves with phantom objects, duplicate copies, and mysterious 404s on objects they know they wrote.
Frequently Asked Questions
Q: Why not just use S3 Intelligent Tiering and skip building this entirely? A: S3 Intelligent Tiering is excellent for pure S3 workloads and costs $0.0025/1000 objects/month for monitoring. The gap is the hot tier itself: S3 Standard has 50-100ms P99 read latency, while NVMe SSD delivers sub-1ms. For applications with latency SLAs under 5ms, S3 alone cannot serve the hot fraction. Additionally, S3 Intelligent Tiering does not give you control over the tiering thresholds, the policy logic, or the rehydration behavior - this design gives you all three.
Q: What happens when a rehydration is in-progress and the same object is written?
A: The write must win. When the proxy receives a write for a key that has migration_state='in_progress', it writes to the NVMe SSD and bumps the generation counter. When the rehydration worker tries to COMMIT, its optimistic concurrency check fails because the generation no longer matches. The worker aborts the promotion and cleans up the destination copy. The object is now fresh on the hot tier with the new content.
Q: Why not use a local filesystem on the NVMe nodes instead of a distributed pool? A: A local filesystem per NVMe node is simpler to operate but creates hot spots: if one NVMe node holds the top-100 most accessed objects, its bandwidth is saturated while other nodes sit idle. Consistent hashing across the NVMe pool distributes load based on object key, not object temperature. The tradeoff is an extra network hop for hot reads, but at 25 Gbps inter-node bandwidth, this is negligible compared to NVMe local reads.
Q: How do you handle the metadata store becoming the bottleneck?
A: Three mitigations in order of deployment complexity: (1) The proxy-local LRU cache absorbs 95%+ of reads - most objects don’t change tier frequently. (2) Read replicas handle metadata reads, leaving the primary for writes only. (3) Shard the metadata table by consistent hash of (namespace, object_key) - with 16 shards, each shard sees 1/16 of write throughput.
Q: What’s the right decay lambda value and how do you tune it?
A: Lambda controls the effective memory window of the heat score. lambda = 0.1/hour gives a half-life of ~7 hours - roughly two working sessions. For media platforms where content is either viral (accessed millions of times in 24h) or forgotten, use lambda = 0.5/hour (half-life ~1.4h). For enterprise storage where access patterns are more uniform, use lambda = 0.02/hour (half-life ~35h). The right signal is migration churn rate: if more than 5% of your daily migration volume is objects that were migrated in the past 48h, your lambda is too high.
Q: How does this handle multi-tenant workloads where one tenant’s hot objects shouldn’t evict another’s?
A: Use namespace-scoped capacity quotas: each namespace gets a configured allocation of the hot tier. The Policy Engine evaluates eviction pressure per namespace independently. A tenant that fills their hot quota gets namespace-local LFU eviction rather than competing globally. The namespace column in the metadata schema is the partition key for this purpose.
Interview Questions
Q: Walk me through what happens end-to-end when an application reads a file that was migrated to cold storage 2 hours ago.
Expected depth: Cover the full proxy read path: metadata lookup (tier=cold), decision to return 503 vs synchronous fetch based on object size and SLA tier, enqueue of rehydration job with priority=interactive, rehydration worker fetching from S3 with retry logic, checksum verification, write to NVMe hot tier, metadata COMMIT with generation check, and what the application does on receiving 503 (polling, exponential backoff). Discuss the latency budget: S3 Standard-IA fetch at 50-300ms, NVMe write at 1-5ms, metadata update at 2-5ms, total 60-320ms for rehydration before the read can succeed.
Q: The hot tier just hit 95% capacity and new writes are starting to fail. Walk me through your emergency response.
Expected depth: Distinguish immediate mitigation (LFU eviction to drop to 80% - which objects to evict and why LFU not LRU), from root cause investigation (policy engine not demoting fast enough? sudden write spike? decay lambda misconfigured?). Discuss rate of emergency eviction - evicting too fast risks demoting objects that are still actively accessed, causing a cold miss storm. Cover the capacity circuit breaker that should have triggered at 85% and why it might have failed. Discuss the longer-term fix: adjust hot tier capacity or tighten demote threshold.
Q: How would you design the migration rate limiter to be fair across multiple tenants without letting one tenant’s migration storm impact another’s read latency?
Expected depth: Describe a two-level token bucket: global bucket for total NVMe read bandwidth used by migration (e.g., 100 MB/s budget), and per-namespace sub-buckets that share from the global pool. Cover work-stealing when one namespace has no pending migrations - another namespace can borrow their tokens. Discuss the feedback loop: monitor hot-tier read p99 latency in real-time; if p99 spikes above 2ms, the rate limiter automatically reduces the migration token rate by 50% until latency recovers. Name that this is the same backpressure pattern used in Kafka’s consumer group coordinator.
Q: You need to support a “pin” API where applications can mark specific objects as permanently hot (never demote). How does this change the data model and policy engine?
Expected depth: Add pinned=boolean and pin_expires_at=timestamptz columns to object_metadata. The Policy Engine’s evaluate() function checks pinned=true first and returns STAY_HOT unconditionally. Discuss the capacity implications: if 30% of your hot tier is pinned, effective evictable capacity drops to 70% - you must account for pinned bytes in capacity planning. Cover expiring pins (TTL-based) as a safety valve to prevent hot tier from filling with forgotten pins. Discuss the API surface: should pin be a storage-class hint at write time or a separate API call? Consider what happens when a pinned object’s content is updated - the pin should carry forward to the new version.
Premium Content
Unlock the full article along with everything else in the archive — all in one place.