Build a Hot-Cold Tiered Storage System


databases performance cost-optimization

System Design Deep Dive

Hot-Cold Tiered Storage

When NVMe cost meets object storage scale - how to make the tiers invisible to your application

⏱ 14 min read📐 Advanced🏗️ Tiered Storage

Think of your data as a library. When you first check out a book you keep it on your desk - instantly reachable. As weeks pass and you stop reading it, it migrates to your bookshelf. Eventually it goes to the basement archive. The key insight: when someone desperately needs a basement book, a librarian fetches it for them. The visitor never needs to know it was in the basement at all.

That is exactly the contract a hot-cold tiered storage system must deliver. Applications issue reads and writes against a single logical namespace. Underneath, a policy engine continuously classifies objects by access temperature, NVMe SSDs hold the hot fraction, object storage holds the cold fraction, and a transparent retrieval path handles cache misses without a single line of application code changing.

The engineering tension here is real and non-trivial. NVMe SSDs deliver sub-millisecond random reads and cost roughly $0.10-0.30 per GB per month. S3 Standard-IA costs $0.0125 per GB per month - a 10-24x difference. A system that stores 100 TB total but serves 90% of read traffic from the hot 5% of data can cut storage costs by 80% while keeping p99 read latency under 2ms for the overwhelming majority of requests. But achieving this without leaking complexity to callers requires solving five hard problems simultaneously: access frequency tracking at scale, a tiering policy engine that makes correct promotion and demotion decisions, transparent retrieval for cold objects, migration job design that doesn’t saturate the network, and careful management of rehydration latency for cold misses.

We need to solve for correctness of tier placement, operational safety during migration, and zero application coupling - simultaneously.

Requirements and Constraints

Functional Requirements

  • Applications read and write objects via a single unified API endpoint regardless of which tier the data lives on
  • The system automatically migrates objects from hot (NVMe SSD) to cold (object storage) based on access patterns and configurable age policies
  • Cold objects are transparently retrieved and promoted to the hot tier on read access, with the application receiving the data without error
  • A tiering policy engine evaluates objects on a configurable schedule and emits promotion/demotion decisions
  • Access frequency counters are maintained per object with sliding-window decay
  • The hot tier enforces a capacity ceiling with LFU-based eviction when utilization exceeds a threshold
  • Object integrity is verified via checksums before deletion from source tier after copy

Non-Functional Requirements

  • Hot tier read latency: p50 < 0.5ms, p99 < 2ms for objects on NVMe SSD
  • Cold tier rehydration latency: p50 < 200ms, p99 < 800ms for objects fetched from S3 Standard-IA
  • Migration throughput: sustain 100 MB/s of background migration traffic without impacting read SLA
  • Metadata store: < 5ms for tier lookup on the read path
  • Policy evaluation: full object corpus scanned within a 15-minute window at 10 billion objects
  • Durability: 11 nines (S3 SLA) for cold tier; hot tier survives single node failures via RAID-10 or replica
  • Availability: 99.99% read availability, cold miss returns degraded response within 30 seconds
  • Scale: support up to 10 billion objects totaling 100 TB on the hot tier and 10 PB in cold storage

Constraints and Assumptions

  • Application code cannot change - the system must be a drop-in layer below existing storage clients
  • Object sizes range from 4 KB to 5 GB; the tiering policy is object-level, not block-level
  • Write path always lands on the hot tier first - there is no direct cold write path
  • The system is not a filesystem; it does not support rename or hard links as atomic operations
  • Cost optimization targets S3 storage costs; S3 API request costs are treated as secondary
  • A single region is the primary design scope; multi-region replication is a stretch goal

High-Level Architecture

The system divides cleanly into five layers: the Storage Proxy that intercepts all I/O, the Access Tracker that maintains frequency counters, the Tiering Policy Engine that makes classification decisions, the Migration Engine that physically moves data, and the Metadata Store that records where each object lives.

Hot-cold tiered storage system architecture overview showing all major components and data flow

Every read starts at the Storage Proxy. The proxy performs a metadata lookup to determine the object’s current tier. If the object is hot, the proxy issues a direct read against the NVMe SSD pool and returns data in under 1ms. If the object is cold, the proxy enqueues a rehydration job, optionally returns a 503 with a Retry-After header, and a background worker fetches the object from S3, writes it to the hot tier, and updates metadata. Subsequent reads for that object hit the hot tier.

Every write lands on the hot tier first. The proxy writes to NVMe SSD, records tier=hot in the Metadata Store, and increments the object’s heat_score. The Access Tracker maintains a sliding window of access events per object using a decay function; objects that go unaccessed for configurable periods see their score drop below the cold migration threshold. The Policy Engine, running on a fixed interval, scans objects whose scores have crossed tier boundaries and emits migration decisions to the Migration Scheduler. The scheduler batches these into migration jobs, rate-limits them to avoid network saturation, and hands them to Downgrade or Rehydration Workers depending on direction.

Key Insight

The proxy must never let a cold miss block the read indefinitely - returning a 503 with Retry-After lets the caller retry after rehydration completes, making the cold path observable and bounded rather than a silent multi-second hang.

The Storage Proxy

The Storage Proxy is the only component that callers talk to - it is the system’s face and must add no more than 1-2ms of overhead on the hot path.

The naive design would be to make the proxy stateless: look up every object’s tier in the metadata store on every request, then route accordingly. At 50,000 reads/second, that is 50,000 metadata lookups/second - completely tractable with a hot metadata cache. The non-obvious failure is stale routing: if the metadata cache holds a tier=hot entry for an object that was just evicted, the proxy will attempt to read from NVMe SSD and get a miss. The recovery path must handle this gracefully by falling back to the metadata store and then to cold storage before returning an error.

# Storage proxy read path with tier-aware routing and fallback
import time
from enum import Enum
from dataclasses import dataclass
from typing import Optional

class Tier(Enum):
    HOT = "hot"
    COLD = "cold"
    ARCHIVE = "archive"

@dataclass
class ObjectMeta:
    key: str
    tier: Tier
    size_bytes: int
    heat_score: float
    last_accessed: float
    checksum_sha256: str

class StorageProxy:
    def __init__(self, metadata_store, nvme_pool, object_store, rehydration_queue, access_tracker):
        self.metadata = metadata_store
        self.nvme = nvme_pool
        self.s3 = object_store
        self.rehydration_q = rehydration_queue
        self.access_tracker = access_tracker
        # Local LRU cache for metadata - 1M entries, 30s TTL
        self._meta_cache = LRUCache(capacity=1_000_000, ttl_seconds=30)

    def read(self, key: str, allow_stale: bool = False) -> tuple[bytes, int]:
        """Returns (data, http_status). 503 if cold miss enqueued."""
        meta = self._get_meta(key)
        if meta is None:
            return b"", 404

        # Track access regardless of tier
        self.access_tracker.record_access(key, meta.size_bytes)

        if meta.tier == Tier.HOT:
            data = self.nvme.read(key)
            if data is not None:
                return data, 200
            # Stale cache: object was evicted but metadata not yet updated
            # Refresh metadata from source of truth and try cold path
            meta = self.metadata.get(key, bypass_cache=True)
            self._meta_cache.set(key, meta)

        if meta.tier in (Tier.COLD, Tier.ARCHIVE):
            # Enqueue rehydration and return 503
            rehydration_eta_seconds = self._estimate_rehydration_time(meta)
            job_id = self.rehydration_q.enqueue(key, priority="interactive")
            return b"", 503  # caller checks Retry-After header

        return b"", 500

    def write(self, key: str, data: bytes) -> int:
        """Always writes to hot tier. Returns HTTP status."""
        import hashlib
        checksum = hashlib.sha256(data).hexdigest()
        # Write to NVMe first
        self.nvme.write(key, data)
        # Record metadata
        meta = ObjectMeta(
            key=key,
            tier=Tier.HOT,
            size_bytes=len(data),
            heat_score=100.0,  # new writes start at max heat
            last_accessed=time.time(),
            checksum_sha256=checksum,
        )
        self.metadata.put(key, meta)
        self._meta_cache.set(key, meta)
        self.access_tracker.record_write(key)
        return 201

    def _get_meta(self, key: str) -> Optional[ObjectMeta]:
        cached = self._meta_cache.get(key)
        if cached:
            return cached
        meta = self.metadata.get(key)
        if meta:
            self._meta_cache.set(key, meta)
        return meta

    def _estimate_rehydration_time(self, meta: ObjectMeta) -> int:
        """Estimate seconds for rehydration based on size and tier."""
        if meta.tier == Tier.COLD:
            return max(5, meta.size_bytes // (50 * 1024 * 1024))  # ~50MB/s from S3
        return max(60, meta.size_bytes // (10 * 1024 * 1024))  # ~10MB/s from Glacier

The proxy must also handle the write-after-eviction race: if a write arrives for a key that the Migration Engine is currently copying to cold storage, the proxy must either block the migration or accept the write and mark the in-progress migration as stale. We use optimistic concurrency via a generation counter in the metadata store: the migration worker only deletes from the hot tier if the generation matches what it read at job creation time.

Watch Out

Never delete the hot copy until S3 PUT has returned 200 and you have verified the checksum. A worker crash between delete and upload confirmation causes permanent data loss. Always copy-then-verify-then-delete, never move.

The Access Tracker

The Access Tracker’s job is simple to state and surprisingly tricky to implement correctly at scale: maintain a heat score per object that decays over time and increases with each access.

Think of the heat score like a radioactive isotope with a configurable half-life. Each access adds energy; time causes exponential decay. An object accessed 100 times last week but zero times this week should rank below an object accessed 5 times today. Simple count-since-creation doesn’t capture this recency weighting.

The standard implementation uses exponential decay: score = score * e^(-lambda * delta_t) + access_increment. At 10 billion objects, maintaining per-object scores in memory is infeasible (even at 16 bytes per score, that is 160 GB). The solution is a Count-Min Sketch for approximate frequency, backed by a time-decayed sliding window using Redis sorted sets for the top-N hottest objects.

# Access frequency tracking with sliding window decay
import time
import math
import redis

class AccessTracker:
    def __init__(self, redis_client: redis.Redis, decay_lambda: float = 0.1):
        """
        decay_lambda: decay rate per hour. 0.1 = half-life of ~7 hours.
        Scores decay as: score *= exp(-lambda * hours_elapsed)
        """
        self.r = redis_client
        self.decay_lambda = decay_lambda
        # Redis sorted set: key=object_key, score=heat_score
        self.HEAT_ZSET = "tier:heat_scores"
        # Hash: key=object_key, value=last_update_ts (unix float)
        self.LAST_UPDATE_HASH = "tier:last_update"

    def record_access(self, object_key: str, size_bytes: int) -> float:
        """Increment heat score with exponential decay applied first."""
        now = time.time()
        pipe = self.r.pipeline()

        last_update_raw = self.r.hget(self.LAST_UPDATE_HASH, object_key)
        last_update = float(last_update_raw) if last_update_raw else now
        hours_elapsed = (now - last_update) / 3600.0

        # Fetch current score (default 0 if new)
        current_score_raw = self.r.zscore(self.HEAT_ZSET, object_key)
        current_score = float(current_score_raw) if current_score_raw else 0.0

        # Apply decay then add access increment
        # Access increment is 1.0 for small objects, scaled for large ones
        access_weight = 1.0 + math.log10(max(1, size_bytes / 4096))
        decayed_score = current_score * math.exp(-self.decay_lambda * hours_elapsed)
        new_score = decayed_score + access_weight

        pipe.zadd(self.HEAT_ZSET, {object_key: new_score})
        pipe.hset(self.LAST_UPDATE_HASH, object_key, now)
        pipe.execute()
        return new_score

    def get_score(self, object_key: str) -> float:
        """Get current decayed heat score."""
        now = time.time()
        score_raw = self.r.zscore(self.HEAT_ZSET, object_key)
        if score_raw is None:
            return 0.0
        last_update_raw = self.r.hget(self.LAST_UPDATE_HASH, object_key)
        if last_update_raw is None:
            return float(score_raw)
        hours_elapsed = (now - float(last_update_raw)) / 3600.0
        return float(score_raw) * math.exp(-self.decay_lambda * hours_elapsed)

    def get_cold_candidates(self, threshold: float, limit: int = 10000) -> list[str]:
        """Return object keys whose decayed score is below the cold threshold."""
        # Use ZRANGEBYSCORE to get lowest-scoring objects efficiently
        # Score of 0 = never accessed or fully decayed
        return self.r.zrangebyscore(self.HEAT_ZSET, 0, threshold, start=0, num=limit)

    def get_hot_candidates_for_promotion(self, threshold: float, limit: int = 5000) -> list[str]:
        """Return cold objects whose score has risen above the hot threshold."""
        return self.r.zrangebyscore(self.HEAT_ZSET, threshold, "+inf", start=0, num=limit)
Real World

Netflix’s EVCache and Facebook’s Memcached fleet use virtually identical decay-weighted scoring to decide which objects to evict from memory. At Netflix’s scale, the decay lambda is tuned per content type - live-stream chunks decay faster than back-catalog movies because the access tail is much shorter.

For the metadata path, the Access Tracker also feeds a Count-Min Sketch running in each proxy instance for approximate real-time frequency. This lets the proxy make local eviction decisions without a round-trip to Redis for every access. The Redis sorted set is the authoritative source used by the Policy Engine’s periodic scan.

The Tiering Policy Engine

The Policy Engine is the system’s brain: it reads heat scores from the Access Tracker, applies configured rules, and emits tier change decisions. The component internals are shown below.

Tiering policy engine component internals showing heat score calculation, threshold logic, hysteresis guard, and output decisions

The non-obvious design requirement is hysteresis: without it, an object oscillating around the hot/cold threshold boundary will be migrated back and forth continuously, burning migration bandwidth and creating metadata churn. Like a thermostat with a dead band, the policy uses separate thresholds for promotion and demotion: an object must fall well below the HOT_DEMOTE_THRESHOLD to be evicted, and it must rise well above the COLD_PROMOTE_THRESHOLD to be rehydrated. The gap between these thresholds is the hysteresis band.

# Tiering policy engine with hysteresis and configurable rules
from dataclasses import dataclass
from enum import Enum
import time

class TierDecision(Enum):
    STAY_HOT = "stay_hot"
    DEMOTE = "demote"
    PROMOTE = "promote"
    STAY_COLD = "stay_cold"

@dataclass
class TierPolicy:
    # Score thresholds (use separate values to create hysteresis band)
    hot_demote_threshold: float = 2.0     # below this: demote hot -> cold
    cold_promote_threshold: float = 8.0   # above this: promote cold -> hot
    # Age constraints
    min_hot_age_seconds: float = 3600.0   # 1 hour min on hot tier
    min_cold_age_seconds: float = 86400.0 * 30  # 30 days min on cold tier
    # Capacity-triggered eviction (bypass age check)
    hot_capacity_override_pct: float = 0.85  # evict if hot tier > 85% full

@dataclass
class ObjectRecord:
    key: str
    tier: str
    heat_score: float
    tier_assigned_at: float
    size_bytes: int

class PolicyEngine:
    def __init__(self, policy: TierPolicy, access_tracker, metadata_store, capacity_monitor):
        self.policy = policy
        self.tracker = access_tracker
        self.metadata = metadata_store
        self.capacity = capacity_monitor
        # Cooldown registry: key -> last_moved_ts. Prevents rapid flip-flop.
        self._cooldown: dict[str, float] = {}
        self.COOLDOWN_SECONDS = 1800  # 30 minutes between any moves

    def evaluate(self, record: ObjectRecord) -> TierDecision:
        now = time.time()
        score = self.tracker.get_score(record.key)
        time_on_tier = now - record.tier_assigned_at

        # Enforce cooldown regardless of other conditions
        last_move = self._cooldown.get(record.key, 0)
        if now - last_move < self.COOLDOWN_SECONDS:
            return TierDecision.STAY_HOT if record.tier == "hot" else TierDecision.STAY_COLD

        if record.tier == "hot":
            # Normal demotion: score below threshold AND object has been hot long enough
            if score < self.policy.hot_demote_threshold and time_on_tier >= self.policy.min_hot_age_seconds:
                self._cooldown[record.key] = now
                return TierDecision.DEMOTE

            # Capacity-triggered eviction: bypass age check but respect cooldown
            hot_util = self.capacity.hot_tier_utilization()
            if hot_util > self.policy.hot_capacity_override_pct:
                # Only evict if this object is below the demote threshold
                if score < self.policy.hot_demote_threshold:
                    self._cooldown[record.key] = now
                    return TierDecision.DEMOTE

            return TierDecision.STAY_HOT

        elif record.tier == "cold":
            # Promote: score crossed hot threshold AND object has aged enough on cold
            if score >= self.policy.cold_promote_threshold and time_on_tier >= self.policy.min_cold_age_seconds:
                self._cooldown[record.key] = now
                return TierDecision.PROMOTE

            return TierDecision.STAY_COLD

        # Fallback - unknown tier, treat as cold
        return TierDecision.STAY_COLD

    def run_sweep(self, batch_size: int = 5000) -> dict[str, int]:
        """Scan objects and emit migration decisions. Returns counts per decision."""
        counts = {d.value: 0 for d in TierDecision}

        # Process hot candidates for demotion
        cold_candidates = self.tracker.get_cold_candidates(
            threshold=self.policy.hot_demote_threshold,
            limit=batch_size
        )
        for key in cold_candidates:
            record = self.metadata.get(key)
            if record and record.tier == "hot":
                decision = self.evaluate(record)
                counts[decision.value] += 1
                if decision == TierDecision.DEMOTE:
                    self.metadata.enqueue_migration(key, direction="demote")

        # Process cold candidates for promotion
        hot_candidates = self.tracker.get_hot_candidates_for_promotion(
            threshold=self.policy.cold_promote_threshold,
            limit=batch_size // 4  # promotions are rare
        )
        for key in hot_candidates:
            record = self.metadata.get(key)
            if record and record.tier == "cold":
                decision = self.evaluate(record)
                counts[decision.value] += 1
                if decision == TierDecision.PROMOTE:
                    self.metadata.enqueue_migration(key, direction="promote")

        return counts
Key Insight

The hysteresis band is not optional. Set hot_demote_threshold and cold_promote_threshold at least 4x apart. If an object’s score oscillates around a single shared threshold, you get migration thrashing that can consume 30% of your migration bandwidth on churn with zero net benefit.

Migration Job Design

The Migration Engine physically moves data between tiers. Think of it like a postal sorting system: individual packages (objects) are batched into delivery runs (jobs), each run has a rate limit, and packages are confirmed delivered before being removed from the origin post office.

The data flow through the full request and migration path is illustrated below.

Complete data flow showing read path, write path, and background migration path with rehydration

The migration worker must be safe to kill at any point without corrupting data. The key is treating migration as a two-phase operation with a metadata commit:

  1. COPY: Read from source, write to destination with checksum verification
  2. COMMIT: Update metadata store to reflect new tier and new location
  3. DELETE: Remove from source tier only after metadata commit succeeds

If the worker dies after COPY but before COMMIT, the object exists in both tiers. A reconciliation scan detects this via a migration_state = pending_commit flag in metadata and completes the commit. If the worker dies after COMMIT but before DELETE, the object has been safely moved but the hot tier holds a stale copy. The reconciliation scan detects orphaned hot copies (metadata says cold, hot tier contains the key) and deletes them.

// Migration worker: copy-verify-commit-delete pattern
package migration

import (
    "context"
    "crypto/sha256"
    "fmt"
    "io"
    "log"
    "time"
)

type MigrationJob struct {
    JobID         string
    ObjectKey     string
    Direction     string // "demote" or "promote"
    SourceTier    string
    DestTier      string
    ExpectedSize  int64
    ExpectedSHA   string
    GenerationID  int64  // for optimistic concurrency check
    CreatedAt     time.Time
}

type MigrationWorker struct {
    nvme      NVMePool
    s3        S3Client
    metadata  MetadataStore
    jobQueue  JobQueue
    auditLog  AuditLogger
}

func (w *MigrationWorker) ProcessJob(ctx context.Context, job MigrationJob) error {
    // Step 1: Mark job as in-progress
    if err := w.metadata.SetMigrationState(job.ObjectKey, "in_progress", job.JobID); err != nil {
        return fmt.Errorf("failed to mark in_progress: %w", err)
    }

    // Step 2: Read from source
    var data []byte
    var readErr error
    if job.Direction == "demote" {
        data, readErr = w.nvme.Read(ctx, job.ObjectKey)
    } else {
        data, readErr = w.s3.GetObject(ctx, job.ObjectKey)
    }
    if readErr != nil {
        w.metadata.SetMigrationState(job.ObjectKey, "failed", job.JobID)
        return fmt.Errorf("source read failed: %w", readErr)
    }

    // Step 3: Verify source checksum
    hash := sha256.Sum256(data)
    actualSHA := fmt.Sprintf("%x", hash)
    if actualSHA != job.ExpectedSHA {
        return fmt.Errorf("source checksum mismatch: expected %s got %s", job.ExpectedSHA, actualSHA)
    }

    // Step 4: Write to destination
    var writeErr error
    if job.Direction == "demote" {
        writeErr = w.s3.PutObject(ctx, job.ObjectKey, data, job.ExpectedSHA)
    } else {
        writeErr = w.nvme.Write(ctx, job.ObjectKey, data)
    }
    if writeErr != nil {
        w.metadata.SetMigrationState(job.ObjectKey, "failed", job.JobID)
        return fmt.Errorf("destination write failed: %w", writeErr)
    }

    // Step 5: COMMIT - update metadata tier (optimistic concurrency check)
    newTier := job.DestTier
    if err := w.metadata.CommitTierChange(
        job.ObjectKey, newTier, job.GenerationID,
    ); err != nil {
        // GenerationID mismatch = a write arrived for this object during migration
        // The source copy is still valid; abort migration to avoid data loss
        log.Printf("migration aborted due to concurrent write: key=%s err=%v", job.ObjectKey, err)
        // Clean up the destination copy we just wrote
        w.cleanupDestination(ctx, job)
        return fmt.Errorf("concurrent write conflict: %w", err)
    }

    // Step 6: DELETE source - only safe after metadata commit
    if job.Direction == "demote" {
        if err := w.nvme.Delete(ctx, job.ObjectKey); err != nil {
            // Non-fatal: orphan cleanup scan will handle this
            log.Printf("warning: failed to delete hot copy after demote: key=%s err=%v", job.ObjectKey, err)
        }
    } else {
        // For promotions, keep cold copy for durability - S3 lifecycle rule handles eventual deletion
    }

    w.auditLog.LogMigration(job, "completed", time.Now())
    return nil
}

func (w *MigrationWorker) cleanupDestination(ctx context.Context, job MigrationJob) {
    if job.Direction == "demote" {
        w.s3.DeleteObject(ctx, job.ObjectKey)
    } else {
        w.nvme.Delete(ctx, job.ObjectKey)
    }
}

Rate limiting the migration workers is critical. Without it, a backlog of demotion jobs can saturate your NVMe read bandwidth or your S3 upload bandwidth, causing hot-tier read latency to spike. The scheduler uses a token bucket with separate buckets for read bandwidth (NVMe) and write bandwidth (S3 egress), with the hot-path reads getting higher priority tokens.

Real World

Apache Ozone, the HDFS successor used at scale by Alibaba and Tencent, uses a nearly identical migration job design with generation-based optimistic concurrency. When a file is modified during replication, the replication coordinator detects the generation mismatch and requeues the job rather than committing a stale copy.

Data Model

The metadata store is the system’s single source of truth for object location. Every read and migration decision flows through it, so it must be fast, consistent, and partition-tolerant.

-- Core metadata table: one row per stored object
CREATE TABLE object_metadata (
    object_key        TEXT NOT NULL,
    namespace         TEXT NOT NULL DEFAULT 'default',
    tier              TEXT NOT NULL CHECK (tier IN ('hot', 'cold', 'archive')),
    size_bytes        BIGINT NOT NULL,
    checksum_sha256   CHAR(64) NOT NULL,
    generation        BIGINT NOT NULL DEFAULT 1,  -- for optimistic concurrency
    tier_assigned_at  TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    last_accessed_at  TIMESTAMPTZ,
    created_at        TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    migration_state   TEXT CHECK (migration_state IN ('idle', 'in_progress', 'pending_commit', 'failed')),
    migration_job_id  TEXT,                         -- active job ID if in_progress
    hot_path          TEXT,                         -- NVMe node + local path
    cold_path         TEXT,                         -- S3 URI (s3://bucket/prefix/key)
    archive_path      TEXT,                         -- Glacier vault ARN + archive ID
    PRIMARY KEY (namespace, object_key)
);

-- Index for policy engine: scan objects by tier + last_accessed
CREATE INDEX idx_metadata_tier_accessed
    ON object_metadata (tier, last_accessed_at)
    WHERE migration_state = 'idle';

-- Index for orphan detection: find objects stuck in migration
CREATE INDEX idx_metadata_migration_state
    ON object_metadata (migration_state, tier_assigned_at)
    WHERE migration_state IN ('in_progress', 'pending_commit');

-- Partition by namespace for multi-tenant isolation
-- In production this would be partitioned by namespace hash range

-- Access event log (append-only, used by access tracker reconciliation)
CREATE TABLE access_events (
    event_id          BIGSERIAL PRIMARY KEY,
    object_key        TEXT NOT NULL,
    namespace         TEXT NOT NULL,
    event_type        TEXT NOT NULL CHECK (event_type IN ('read', 'write', 'rehydration')),
    occurred_at       TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    size_bytes        BIGINT,
    source_ip         INET,
    requester_id      TEXT
) PARTITION BY RANGE (occurred_at);

CREATE TABLE access_events_2026_06 PARTITION OF access_events
    FOR VALUES FROM ('2026-06-01') TO ('2026-07-01');

-- Migration audit log (append-only)
CREATE TABLE migration_audit (
    job_id            TEXT PRIMARY KEY,
    object_key        TEXT NOT NULL,
    namespace         TEXT NOT NULL,
    direction         TEXT NOT NULL CHECK (direction IN ('demote', 'promote')),
    from_tier         TEXT NOT NULL,
    to_tier           TEXT NOT NULL,
    status            TEXT NOT NULL CHECK (status IN ('started', 'completed', 'failed', 'aborted')),
    bytes_moved       BIGINT,
    duration_ms       INT,
    error_message     TEXT,
    started_at        TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    completed_at      TIMESTAMPTZ
);

CREATE INDEX idx_audit_key_time ON migration_audit (object_key, started_at DESC);

The partitioning strategy for the metadata table is critical at 10 billion objects. A single object_metadata table at that scale needs to be sharded. The natural sharding key is (namespace, object_key) with consistent hashing across shards. Within a shard, the policy engine scan uses the idx_metadata_tier_accessed index to efficiently find hot objects that have gone cold without a full table scan.

Key Insight

Store hot_path, cold_path, and archive_path as separate nullable columns, not a single location field. This makes tier transitions atomic - you can verify the cold path exists before nulling the hot path, and the database row always has a valid pointer to at least one tier.

The generation column provides optimistic concurrency for the migration worker. When a migration job is created, the worker reads the current generation. At the COMMIT phase, it executes:

-- Atomic tier commit with optimistic concurrency check
UPDATE object_metadata
SET tier = 'cold',
    cold_path = $1,
    hot_path = NULL,
    migration_state = 'idle',
    migration_job_id = NULL,
    tier_assigned_at = NOW(),
    generation = generation + 1
WHERE namespace = $2
  AND object_key = $3
  AND generation = $4  -- fails if a concurrent write bumped the generation
  AND migration_state = 'in_progress';
-- If 0 rows updated: concurrent write conflict, abort migration

Key Algorithms and Protocols

Exponential Decay Heat Scoring

The heat score algorithm must balance recency and frequency. A pure frequency count favors historically popular objects that are no longer accessed. A pure recency count ignores that an object accessed 1000 times recently is much hotter than one accessed once. Exponential decay combines both:

# Heat score with exponential decay - time complexity O(1) per access
import math
import time

def compute_decayed_score(
    current_score: float,
    last_update_ts: float,
    access_weight: float = 1.0,
    decay_lambda: float = 0.1,  # per-hour decay rate
) -> float:
    """
    Computes new heat score after applying time decay and adding access weight.
    Half-life = ln(2) / decay_lambda hours.
    With lambda=0.1: half-life ~7 hours.
    With lambda=0.02: half-life ~35 hours (for slower-moving data).
    """
    hours_elapsed = (time.time() - last_update_ts) / 3600.0
    decay_factor = math.exp(-decay_lambda * hours_elapsed)
    return current_score * decay_factor + access_weight

def score_after_inactivity(initial_score: float, hours: float, decay_lambda: float = 0.1) -> float:
    """
    How much score remains after N hours of inactivity?
    Used to pre-compute demotion time without scanning all objects.
    Example: score=50, lambda=0.1, hours=69.3 -> score ~= 0.05 (near zero)
    """
    return initial_score * math.exp(-decay_lambda * hours)

def time_to_cold_threshold(current_score: float, cold_threshold: float, decay_lambda: float) -> float:
    """
    How many hours until this object's score decays to the cold threshold?
    Useful for scheduling future policy scans.
    Returns hours until demotion if no more accesses occur.
    """
    if current_score <= cold_threshold:
        return 0.0
    # Solve: current_score * exp(-lambda * t) = threshold
    # t = -ln(threshold / current_score) / lambda
    return -math.log(cold_threshold / current_score) / decay_lambda

The time_to_cold_threshold function is particularly useful for the Policy Engine scheduler: rather than scanning all hot objects every 15 minutes, we can pre-compute when each hot object is predicted to cross the threshold and schedule the scan accordingly. Objects with predicted_demote_time > now + 24h don’t need to be scanned this cycle.

LFU Eviction Under Capacity Pressure

When the hot tier exceeds its capacity ceiling, the eviction logic must select victims quickly without iterating every object. The standard O(1) LFU implementation uses a doubly-linked list of frequency buckets:

# O(1) LFU cache for hot tier eviction tracking
from collections import defaultdict, OrderedDict

class LFUEvictionTracker:
    """
    Tracks the minimum-frequency objects for eviction candidates.
    Uses the O(1) LFU algorithm by Ketan Shah et al.
    """
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.size = 0
        self.min_freq = 0
        # key -> (value, freq)
        self.key_freq: dict[str, int] = {}
        # freq -> OrderedDict of keys (maintains insertion order for same-freq tie-break)
        self.freq_keys: dict[int, OrderedDict] = defaultdict(OrderedDict)

    def access(self, key: str) -> None:
        """Record an access for key, updating its frequency bucket."""
        if key in self.key_freq:
            freq = self.key_freq[key]
            del self.freq_keys[freq][key]
            if not self.freq_keys[freq] and freq == self.min_freq:
                self.min_freq += 1
            self.key_freq[key] = freq + 1
            self.freq_keys[freq + 1][key] = True
        else:
            self.key_freq[key] = 1
            self.freq_keys[1][key] = True
            self.min_freq = 1
            self.size += 1

    def evict(self) -> str:
        """Evict the least-frequently-used key. Returns the evicted key."""
        if self.size == 0:
            raise ValueError("Nothing to evict")
        # Get LFU key (oldest among ties)
        lfu_key, _ = self.freq_keys[self.min_freq].popitem(last=False)
        del self.key_freq[lfu_key]
        self.size -= 1
        return lfu_key

    def get_eviction_candidates(self, n: int) -> list[str]:
        """Get top-N eviction candidates without actually evicting."""
        candidates = []
        freq = self.min_freq
        while len(candidates) < n:
            bucket = self.freq_keys.get(freq, {})
            for key in bucket:
                candidates.append(key)
                if len(candidates) >= n:
                    break
            freq += 1
            if freq > max(self.freq_keys.keys(), default=freq):
                break
        return candidates
Key Insight

The LFU eviction tracker runs in the proxy layer, not in the Policy Engine. The proxy makes instant local eviction decisions under capacity pressure using the in-memory LFU structure, then asynchronously notifies the Migration Scheduler to back-fill with S3 copies. This decouples the write-hot fast path from the slower background migration system.

Consistent Hashing for NVMe Shard Routing

The hot tier is sharded across multiple NVMe SSD nodes using consistent hashing. The key property is that adding or removing a node only redistributes 1/N of the keyspace, not everything:

# Consistent hash ring for NVMe shard routing
import hashlib
import bisect

class ConsistentHashRing:
    def __init__(self, virtual_nodes: int = 150):
        """
        virtual_nodes: higher = more even distribution, higher memory usage.
        150 virtual nodes per physical node gives ~2% standard deviation in load.
        """
        self.virtual_nodes = virtual_nodes
        self.ring: dict[int, str] = {}
        self.sorted_keys: list[int] = []
        self.nodes: set[str] = set()

    def _hash(self, key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, node_id: str) -> None:
        """Add a physical node with virtual_nodes replicas on the ring."""
        self.nodes.add(node_id)
        for i in range(self.virtual_nodes):
            vnode_key = self._hash(f"{node_id}:{i}")
            self.ring[vnode_key] = node_id
            bisect.insort(self.sorted_keys, vnode_key)

    def remove_node(self, node_id: str) -> None:
        """Remove a node - only affects its portion of the ring."""
        self.nodes.discard(node_id)
        for i in range(self.virtual_nodes):
            vnode_key = self._hash(f"{node_id}:{i}")
            if vnode_key in self.ring:
                del self.ring[vnode_key]
                idx = bisect.bisect_left(self.sorted_keys, vnode_key)
                self.sorted_keys.pop(idx)

    def get_node(self, object_key: str) -> str:
        """Find the NVMe node responsible for this object key."""
        if not self.ring:
            raise ValueError("Ring is empty")
        hash_val = self._hash(object_key)
        idx = bisect.bisect(self.sorted_keys, hash_val)
        # Wrap around
        if idx == len(self.sorted_keys):
            idx = 0
        return self.ring[self.sorted_keys[idx]]

Scaling and Performance

The system scales differently across its components. The Storage Proxy scales horizontally - add proxy instances behind a load balancer, each with its own local metadata cache. The Access Tracker’s Redis sorted set is the first bottleneck: at 50,000 accesses/second, all updating the same sorted set, Redis single-threaded command processing saturates around 100K ops/second. The solution is key-partitioned Redis shards: partition the sorted set by murmur3(object_key) % N_SHARDS, with each shard running on a separate Redis instance.

Multi-region tiered storage scale-out architecture with consistent hash ring sharding and shared cold storage

The metadata store scales via horizontal sharding by (namespace, object_key). At 10 billion objects with 500 bytes per row, that is 5 TB of metadata - requiring 10-20 shards of a database like CockroachDB or Spanner. The policy engine scan is the most sensitive: it must evaluate billions of objects every 15 minutes. At 10B objects / 900 seconds = 11 million objects/second, which requires the scan to be parallelized across all metadata shards simultaneously with each shard owning its policy evaluation independently.

Capacity Estimation:

Given:
  - 10 billion objects total
  - Average object size: 1 MB (mix of 4KB configs to 5GB media)
  - Hot tier: 5% of objects = 500M objects
  - Hot tier storage: 500M objects * 1 MB avg = 500 TB NVMe
    (In practice, objects on hot tier skew smaller: ~200 GB avg = 100 TB actual)
  - Cold tier: 95% = 9.5B objects = 9.5 PB
  - Read QPS: 50,000/s (p99 hot reads), 500/s cold rehydrations

NVMe SSD pool sizing (hot tier):
  - 100 TB data
  - IOPS needed: 50,000 reads/s at avg 16KB random read
  - NVMe: 500K IOPS per node
  - Min nodes: 50K / 500K = 0.1 nodes for IOPS (bandwidth-bound, not IOPS-bound)
  - Bandwidth: 50K * 16KB = 800 MB/s; NVMe: 3.5 GB/s per node
  - Min nodes for bandwidth: 800 / 3500 = 0.23 nodes
  - Storage: 100 TB / 4 TB per NVMe node = 25 nodes minimum
  - With replication (RAID-10): 50 NVMe nodes

Migration bandwidth:
  - Target: demote 1% of hot tier per day = 1 TB/day
  - Required: 1 TB / 86400s = 11.6 MB/s sustained
  - With burst headroom: provision 100 MB/s (8.7x headroom)
  - S3 multi-part upload: 100 MB/s easily achieved with 4 workers

Metadata store sizing:
  - 10B objects * 500 bytes/row = 5 TB metadata
  - With indexes: ~12 TB
  - 16 CockroachDB nodes at 1 TB each with 3x replication = 48 nodes
  - Or 4 Spanner instances at 3 TB each

Cost comparison (annual):
  - All NVMe: 100 TB * $0.20/GB/mo * 12 = $245,760/yr
  - Tiered (5% hot + 95% cold Standard-IA): 
      Hot: 5 TB * $0.20 * 12 = $12,288
      Cold: 95 TB * $0.0125 * 12 = $14,250
      Total: $26,538/yr  -- 90% savings

The dominant bottleneck at high write volume is the metadata store write path: every write requires a synchronous metadata insert. At 10,000 writes/second, that is 10,000 inserts/second into the metadata table. Batch inserts (group 100 writes per transaction) reduce this to 100 transactions/second at the cost of 50-100ms write latency increase.

Real World

Dropbox Magic Pocket, their custom object store, uses precisely this tiering architecture at exabyte scale. The hot tier is a fleet of custom storage servers with NVMe SSDs; the cold tier is S3-compatible storage. Their migration daemon processes roughly 1 PB of demotions per day using a rate-limited job queue that yields to read traffic when the hot-tier disk queue depth exceeds a threshold.

Failure Modes and Recovery

FailureDetectionImpactRecovery
NVMe node crashHealth check timeout < 5s; read errors on proxyHot objects on that node return errors; hit cold fallbackRAID-10 survives single drive; node replacement triggers re-replication from cold tier
Migration worker crashJob heartbeat missing > 60s; migration_state='in_progress' for > 5minObject left in pending migration state; possible duplicate in both tiersReconciliation scan resets migration_state to idle after 5min; orphan detector removes duplicates
Metadata store partitionRead timeout > 5ms; quorum loss alertProxy cannot route requests; all reads return errorsRoute to read replicas for reads; queue writes to WAL; restore from Raft quorum
S3 outage / throttling503 SlowDown from S3 API; rehydration queue depth spikeCold misses cannot complete; rehydration queue backs upExponential backoff on S3 calls; serve cached hot objects normally; alert on queue depth > 10K
Hot tier capacity 100%disk_full errors from NVMe writes; capacity monitor alertNew writes fail; migration jobs cannot completeEmergency LFU eviction of bottom 10% of hot objects; temporarily suspend promotions
Policy engine bug: wrong thresholdObjects rapidly migrating back and forth; high migration bandwidthCPU and network burn on churn; latency impact on hot tierHysteresis cooldown limits damage; circuit breaker on migration rate > 500/min triggers alert
Watch Out

The most common operational failure is the reconciliation scan falling behind during a mass migration event. If you demote 10% of your hot tier in a short window (e.g., after a decay lambda change), the orphan objects on the hot tier accumulate faster than the reconciliation scan clears them. This causes the hot tier to appear full even though metadata shows objects as cold. Always rate-limit configuration changes to tier thresholds.

Comparison of Approaches

ApproachTier Boundary MechanismRehydration StrategyComplexityBest Fit
Time-based tieringFixed age threshold (e.g., demote after 30d)On-demand async; caller waitsLowCompliance/archive data with predictable access patterns
Frequency-based tiering (this design)Decayed heat score thresholdsTransparent async with 503 + Retry-AfterMediumMixed workloads with Zipf-distributed access
Workload-hint tieringApplication sets storage-class headerImmediate synchronous fetchLowWorkloads where the application knows data temperature
ML-predicted tieringTrained model predicts next-access timeProactive prefetch before accessHighMedia platforms with predictable content lifecycle
Cost-optimized tieringPrice/performance score including retrieval costBudget-aware: cold if retrieval < thresholdMediumCost-sensitive workloads with variable access frequency

The frequency-based approach (this design) is the right default for most engineering teams. Time-based tiering is simpler but creates latency cliffs - data that was accessed yesterday at day 29 and today at day 31 gets a 500ms penalty that confuses callers. Workload-hint tiering requires application changes, violating our zero-coupling constraint. ML-predicted tiering adds operational complexity (model drift, retraining pipelines) that rarely justifies the marginal improvement in tier placement accuracy over a well-tuned decay function.

For workloads with highly predictable access patterns - video platforms where new uploads are hot for 24h then cold forever - time-based tiering is actually superior because it eliminates the policy evaluation overhead and is trivially explainable.

Key Takeaways

  • Access frequency tracking requires exponential decay, not cumulative counts - cumulative counts permanently bias toward historically popular objects regardless of current access patterns.
  • Hysteresis bands between promotion and demotion thresholds are mandatory - a single shared threshold creates migration thrashing that consumes bandwidth with no net benefit.
  • Transparent retrieval means the proxy must handle cold misses gracefully via 503 + Retry-After, not by blocking the read thread for hundreds of milliseconds while rehydration runs.
  • Migration job safety depends on the copy-verify-commit-delete sequence with optimistic concurrency checks - any shortcut here risks permanent data loss when writes race with migrations.
  • Rehydration latency is a user-visible SLA, not a background concern - S3 Standard-IA returns in 50-500ms, S3 Glacier Instant Retrieval in milliseconds to seconds, and S3 Glacier Deep Archive in hours; tier your cold tiers by retrieval SLA requirements.
  • Eviction on hot tier during capacity pressure requires a separate fast path - the LFU eviction tracker in the proxy handles immediate capacity relief while the background migration system handles the slow copy-to-cold.
  • Cost vs latency tradeoff is the central knob of this system - a lower cold threshold aggressively moves data cold (lower cost, higher cold-miss rate), a higher threshold keeps more data hot (higher cost, fewer cold misses).
  • Metadata consistency is the hardest correctness problem - a stale metadata entry pointing to a nonexistent hot-tier location is more dangerous than a cold miss, because it returns an error rather than a latency penalty.

The counter-intuitive lesson from building this system is that the storage tiers themselves are not the hard problem. S3 and NVMe SSDs both work reliably. The hard problem is the metadata plane - specifically, keeping metadata strongly consistent with actual object locations while two workers (migration and rehydration) are simultaneously modifying the same objects. Teams that underinvest in metadata consistency find themselves with phantom objects, duplicate copies, and mysterious 404s on objects they know they wrote.

Frequently Asked Questions

Q: Why not just use S3 Intelligent Tiering and skip building this entirely? A: S3 Intelligent Tiering is excellent for pure S3 workloads and costs $0.0025/1000 objects/month for monitoring. The gap is the hot tier itself: S3 Standard has 50-100ms P99 read latency, while NVMe SSD delivers sub-1ms. For applications with latency SLAs under 5ms, S3 alone cannot serve the hot fraction. Additionally, S3 Intelligent Tiering does not give you control over the tiering thresholds, the policy logic, or the rehydration behavior - this design gives you all three.

Q: What happens when a rehydration is in-progress and the same object is written? A: The write must win. When the proxy receives a write for a key that has migration_state='in_progress', it writes to the NVMe SSD and bumps the generation counter. When the rehydration worker tries to COMMIT, its optimistic concurrency check fails because the generation no longer matches. The worker aborts the promotion and cleans up the destination copy. The object is now fresh on the hot tier with the new content.

Q: Why not use a local filesystem on the NVMe nodes instead of a distributed pool? A: A local filesystem per NVMe node is simpler to operate but creates hot spots: if one NVMe node holds the top-100 most accessed objects, its bandwidth is saturated while other nodes sit idle. Consistent hashing across the NVMe pool distributes load based on object key, not object temperature. The tradeoff is an extra network hop for hot reads, but at 25 Gbps inter-node bandwidth, this is negligible compared to NVMe local reads.

Q: How do you handle the metadata store becoming the bottleneck? A: Three mitigations in order of deployment complexity: (1) The proxy-local LRU cache absorbs 95%+ of reads - most objects don’t change tier frequently. (2) Read replicas handle metadata reads, leaving the primary for writes only. (3) Shard the metadata table by consistent hash of (namespace, object_key) - with 16 shards, each shard sees 1/16 of write throughput.

Q: What’s the right decay lambda value and how do you tune it? A: Lambda controls the effective memory window of the heat score. lambda = 0.1/hour gives a half-life of ~7 hours - roughly two working sessions. For media platforms where content is either viral (accessed millions of times in 24h) or forgotten, use lambda = 0.5/hour (half-life ~1.4h). For enterprise storage where access patterns are more uniform, use lambda = 0.02/hour (half-life ~35h). The right signal is migration churn rate: if more than 5% of your daily migration volume is objects that were migrated in the past 48h, your lambda is too high.

Q: How does this handle multi-tenant workloads where one tenant’s hot objects shouldn’t evict another’s? A: Use namespace-scoped capacity quotas: each namespace gets a configured allocation of the hot tier. The Policy Engine evaluates eviction pressure per namespace independently. A tenant that fills their hot quota gets namespace-local LFU eviction rather than competing globally. The namespace column in the metadata schema is the partition key for this purpose.

Interview Questions

Q: Walk me through what happens end-to-end when an application reads a file that was migrated to cold storage 2 hours ago.

Expected depth: Cover the full proxy read path: metadata lookup (tier=cold), decision to return 503 vs synchronous fetch based on object size and SLA tier, enqueue of rehydration job with priority=interactive, rehydration worker fetching from S3 with retry logic, checksum verification, write to NVMe hot tier, metadata COMMIT with generation check, and what the application does on receiving 503 (polling, exponential backoff). Discuss the latency budget: S3 Standard-IA fetch at 50-300ms, NVMe write at 1-5ms, metadata update at 2-5ms, total 60-320ms for rehydration before the read can succeed.

Q: The hot tier just hit 95% capacity and new writes are starting to fail. Walk me through your emergency response.

Expected depth: Distinguish immediate mitigation (LFU eviction to drop to 80% - which objects to evict and why LFU not LRU), from root cause investigation (policy engine not demoting fast enough? sudden write spike? decay lambda misconfigured?). Discuss rate of emergency eviction - evicting too fast risks demoting objects that are still actively accessed, causing a cold miss storm. Cover the capacity circuit breaker that should have triggered at 85% and why it might have failed. Discuss the longer-term fix: adjust hot tier capacity or tighten demote threshold.

Q: How would you design the migration rate limiter to be fair across multiple tenants without letting one tenant’s migration storm impact another’s read latency?

Expected depth: Describe a two-level token bucket: global bucket for total NVMe read bandwidth used by migration (e.g., 100 MB/s budget), and per-namespace sub-buckets that share from the global pool. Cover work-stealing when one namespace has no pending migrations - another namespace can borrow their tokens. Discuss the feedback loop: monitor hot-tier read p99 latency in real-time; if p99 spikes above 2ms, the rate limiter automatically reduces the migration token rate by 50% until latency recovers. Name that this is the same backpressure pattern used in Kafka’s consumer group coordinator.

Q: You need to support a “pin” API where applications can mark specific objects as permanently hot (never demote). How does this change the data model and policy engine?

Expected depth: Add pinned=boolean and pin_expires_at=timestamptz columns to object_metadata. The Policy Engine’s evaluate() function checks pinned=true first and returns STAY_HOT unconditionally. Discuss the capacity implications: if 30% of your hot tier is pinned, effective evictable capacity drops to 70% - you must account for pinned bytes in capacity planning. Cover expiring pins (TTL-based) as a safety valve to prevent hot tier from filling with forgotten pins. Discuss the API surface: should pin be a storage-class hint at write time or a separate API call? Consider what happens when a pinned object’s content is updated - the pin should carry forward to the new version.

Premium Content

Unlock the full article along with everything else in the archive — all in one place.

In-depth analysis Expert insights Full archive access
Unlock Full Article