Build an Infrastructure Cost Attribution System

cost-optimization cloud-infrastructure observability

System Design Deep Dive

Infrastructure Cost Attribution System

Tagging every cloud resource to its owner and catching spend anomalies before the CFO sees the bill

⏱ 14 min read📐 Advanced🏗️ FinOps

Cloud bills arrive once a month and look like a 10,000-line spreadsheet from a foreign country. Every row tells you what was charged but almost nothing about why. The compute line for us-east-1 says $84,000 - but which team ran those instances? Which service spawned that Lambda burst? Which data pipeline forgot to shut down its Redshift cluster? Without answers to those questions, you cannot optimize what you cannot see, you cannot hold teams accountable for costs they don’t know they incur, and you cannot catch the engineer who accidentally left a GPU cluster running over a long weekend.

Think of it like a shared office building where every team uses electricity, water, and HVAC but pays one collective bill. You could split it evenly, but that punishes frugal tenants and lets wasteful ones hide. The right model is submetering: put a meter on every circuit, attribute each kilowatt to the team drawing it, and send individual breakdowns. Cloud cost attribution is exactly this - putting a software meter on every compute instance, storage bucket, network byte, and managed service call, then routing each charge to the owning team in near real-time rather than at month’s end.

The engineering challenge is more subtle than slapping tags on resources. Tags drift: engineers forget them, resources get created by automation without tags, and shared infrastructure like a VPC NAT gateway or a centralized logging cluster genuinely belongs to multiple teams simultaneously. You need a resource tagging strategy that spans manual and automated creation paths. You need cost allocation rules that handle the messy middle - partial ownership, time-bound attribution, and costs that legitimately should be split. And you need shared resource apportionment that distributes infrastructure-level costs fairly without double-counting or under-counting.

On top of the attribution problem sits a temporal one. Cloud providers emit billing data with variable delay - AWS Cost and Usage Reports (CUR) lag 8-24 hours, while usage metrics from CloudWatch can be near real-time. Bridging that gap requires a real-time spend ingestion layer that fuses early signals (usage events) with confirmed cost data (billing records) to give teams hour-granularity visibility rather than day-old snapshots. Layer in anomaly scoring and budget alerting, and we’re solving for attribution accuracy, data freshness, anomaly detection sensitivity, and alert fatigue simultaneously.

Requirements and Constraints

Functional Requirements

Tag every cloud resource (compute, storage, network, managed services) to an owning team and service
Apply allocation rules to split shared resource costs across multiple teams by configurable proportions
Ingest billing events and usage metrics in near real-time (under 30-minute lag from cloud provider emission)
Surface per-team, per-service, and per-tag-dimension cost breakdowns queryable via API and dashboard
Detect anomalous spend spikes and dispatch alerts within 15 minutes of the anomaly onset
Support both showback (informational cost reports) and chargeback (actual internal billing/transfers)
Provide budget management: teams set monthly budgets, system alerts at configurable thresholds (50%, 80%, 100%)
Track untagged resource costs separately and expose them for remediation workflows
Support multi-cloud (AWS, GCP, Azure) with a unified cost model

Non-Functional Requirements

Ingest up to 5 million billing line items per hour across all cloud accounts
Query cost breakdowns with p95 latency under 200ms for 30-day aggregations
Store 24 months of daily-grain cost data and 90 days of hourly-grain data
System availability: 99.9% uptime (8.7 hours/year downtime tolerance)
Attribution accuracy: fewer than 0.1% of costs in the “unattributed” bucket after 48 hours
Anomaly alerts delivered within 15 minutes of a 2-sigma spend deviation
Support 500 teams and 10,000 individually tracked services

Constraints and Assumptions

Cloud provider billing APIs are the source of truth; we do not attempt to re-derive costs from usage
Tag schema is centrally managed (teams cannot create arbitrary tag keys without approval)
Chargeback triggers actual internal fund transfers via a finance system API - outside the scope of this design
We assume a single currency; multi-currency conversion is a downstream concern
We do not handle reserved instance or savings plan optimization recommendations - only attribution

High-Level Architecture

The system has six major components arranged in a pipeline with a feedback loop for tag remediation.

Infrastructure cost attribution system architecture overview

Cloud Source Connectors pull billing and usage data from AWS CUR S3 exports, GCP BigQuery billing exports, and Azure Cost Management APIs. Each connector normalizes vendor-specific schemas into a common CostEvent protobuf.

Tag Ingestion Pipeline is a Kafka-backed stream processor that enriches each CostEvent with team and service tags from the Tag Registry. Resources without tags are routed to a separate dead-letter topic for manual remediation.

Cost Allocation Engine is the core workhorse. It applies allocation rules - splitting shared resource costs by configurable proportions - and writes attributed TeamCostRecord rows to the Cost Store. It also maintains a running tally of unattributed spend.

Shared Resource Apportionment handles resources that are genuinely multi-tenant: NAT gateways, load balancers, centralized log aggregators. It distributes costs by configurable metrics (request count, data transfer volume, CPU utilization) sampled from the metrics tier.

Anomaly Detection Engine runs a rolling z-score model over hourly spend per team, scoring each new cost batch and publishing anomaly events when spend deviates beyond a threshold.

Alert Manager consumes anomaly events and budget-threshold events, deduplicates, rate-limits, and dispatches to Slack, PagerDuty, and email via a notification fanout.

Data flows like this: cloud providers emit billing records, connectors poll or subscribe every 15 minutes, the ingestion pipeline enriches and routes events, the allocation engine attributes and persists costs, and the anomaly detector scores in parallel. The dashboard and cost API read from the Cost Store directly, using pre-computed rollup tables for the common aggregations.

Key Insight

The most important architectural decision is separating tag enrichment from cost allocation - they run at different cadences (enrichment is near real-time, allocation may need to wait for confirmed billing data) and have different failure modes (a missing tag is recoverable; an incorrect split ratio is not).

The Tag Registry and Resource Tagging Strategy

The Tag Registry is the single source of truth for which resource belongs to which team and service. Without it, every component in the system would need its own lookup logic, and drift between copies would corrupt attribution.

Think of it like a library card catalog: every book (resource) has a card (tag record) that says who checked it out (owning team), which section it belongs to (service), and when it was catalogued. The catalog is updated whenever a book moves, and the catalog, not the book’s spine label, is the authoritative reference.

Cost allocation engine component internals

A resource tagging strategy has three enforcement layers. The first is preventive tagging - infrastructure-as-code policies (AWS Service Control Policies, GCP Organization Policies) that reject resource creation if mandatory tags (team, service, env, cost-center) are absent. The second is detective tagging - a nightly scanner that queries cloud asset inventories and flags resources missing required tags. The third is corrective tagging - automated tag propagation that copies tags from parent resources (an EC2 instance inherits the Auto Scaling Group’s tags) and from Terraform state metadata.

-- Tag Registry schema: canonical resource-to-team mapping
CREATE TABLE resource_tags (
  resource_id       VARCHAR(256)  NOT NULL,
  cloud_provider    VARCHAR(16)   NOT NULL CHECK (cloud_provider IN ('aws','gcp','azure')),
  resource_type     VARCHAR(64)   NOT NULL,
  region            VARCHAR(32)   NOT NULL,
  team_id           VARCHAR(64)   NOT NULL REFERENCES teams(id),
  service_id        VARCHAR(64)   NOT NULL REFERENCES services(id),
  cost_center       VARCHAR(32),
  env               VARCHAR(16)   NOT NULL CHECK (env IN ('prod','staging','dev','sandbox')),
  tag_source        VARCHAR(32)   NOT NULL CHECK (tag_source IN ('iac','scanner','propagated','manual')),
  confidence        SMALLINT      NOT NULL DEFAULT 100 CHECK (confidence BETWEEN 0 AND 100),
  effective_from    TIMESTAMPTZ   NOT NULL DEFAULT now(),
  effective_until   TIMESTAMPTZ,
  created_at        TIMESTAMPTZ   NOT NULL DEFAULT now(),
  updated_at        TIMESTAMPTZ   NOT NULL DEFAULT now(),
  PRIMARY KEY (resource_id, cloud_provider, effective_from)
);

CREATE INDEX idx_resource_tags_team  ON resource_tags (team_id, effective_from DESC);
CREATE INDEX idx_resource_tags_svc   ON resource_tags (service_id, effective_from DESC);
CREATE INDEX idx_resource_tags_res   ON resource_tags (resource_id, effective_until NULLS FIRST);

The confidence column is non-obvious but critical. When the system propagates tags from a parent resource, confidence is 80. When a tag comes from reviewed IaC, confidence is 100. When it comes from a heuristic matcher (resource name contains “payments-api”), confidence is 60. Downstream consumers can choose whether to include low-confidence attributions in official reports.

The effective_from / effective_until pair enables temporal tag correctness: if the payments team transfers ownership of a database to the platform team mid-month, the costs before the transfer date correctly attribute to payments and after to platform. Without temporal validity, retroactive tag changes would corrupt historical reports.

Watch Out

Tag propagation from Auto Scaling Groups to EC2 instances takes up to 15 minutes in AWS - during that window, a freshly launched instance has no team tag. If your ingestion pipeline processes billing events during this lag, those events land in the unattributed bucket and stay there unless you run a retroactive re-attribution job.

Real World

Spotify’s cost attribution system, described in their engineering blog, uses a “tag contract” enforced at the Backstage service catalog level: services that aren’t registered in the catalog cannot receive GCP resources. The catalog entry IS the tag source of truth, eliminating the drift between IaC tags and catalog metadata that plagues most organizations.

The Cost Allocation Engine

The Cost Allocation Engine’s job is to take a raw CostEvent (resource X cost Y cents in time window T) and produce one or more TeamCostRecord rows with each team’s share of that cost. For exclusively owned resources, this is a 1:1 mapping. For shared resources, it is a 1:N fan-out with proportional splits that must sum exactly to 100%.

Allocation rules are the business logic that drives this fan-out. A rule looks like: “NAT Gateway nat-04f8a3 is shared by teams payments, orders, and catalog in proportions derived from their outbound data transfer bytes, sampled hourly.” There are three rule types:

Static split: fixed percentages, e.g. platform team pays 40%, all others split the remaining 60% equally
Metric-proportional split: percentages derived from a runtime metric (request count, byte count, CPU share) recalculated each allocation window
Tag-passthrough: the cloud resource’s native tags are authoritative; no rule needed

# Cost allocation engine: core split logic
# Handles both static and metric-proportional rules

from dataclasses import dataclass
from decimal import Decimal, ROUND_HALF_UP
from typing import Optional
import logging

@dataclass
class CostEvent:
    resource_id: str
    cloud_provider: str
    cost_usd_micros: int   # cost in millionths of a dollar to avoid float errors
    window_start: int      # unix epoch seconds
    window_end: int

@dataclass
class AllocationRule:
    rule_id: str
    resource_id: str
    rule_type: str         # "static" | "metric_proportional" | "passthrough"
    splits: dict           # team_id -> Decimal weight (must sum to 1.0 for static)
    metric_key: Optional[str] = None  # e.g. "nat_bytes_out" for metric_proportional

@dataclass
class TeamCostRecord:
    team_id: str
    service_id: str
    resource_id: str
    cost_usd_micros: int
    window_start: int
    window_end: int
    allocation_rule_id: str
    split_fraction: Decimal

def allocate_cost(
    event: CostEvent,
    rule: AllocationRule,
    metric_samples: dict,  # team_id -> float usage value, for metric_proportional rules
    tag_registry,
) -> list[TeamCostRecord]:
    """
    Allocate a CostEvent across teams according to the rule.
    Returns a list of TeamCostRecord; total cost_usd_micros must equal event.cost_usd_micros.
    """
    total_cost = event.cost_usd_micros

    if rule.rule_type == "passthrough":
        tag = tag_registry.get_tag(event.resource_id, event.window_start)
        if tag is None:
            raise UntaggedResourceError(event.resource_id)
        return [TeamCostRecord(
            team_id=tag.team_id, service_id=tag.service_id,
            resource_id=event.resource_id, cost_usd_micros=total_cost,
            window_start=event.window_start, window_end=event.window_end,
            allocation_rule_id="passthrough", split_fraction=Decimal("1.0"),
        )]

    if rule.rule_type == "static":
        weights = rule.splits  # already normalized

    elif rule.rule_type == "metric_proportional":
        total_usage = sum(metric_samples.values())
        if total_usage == 0:
            # Fall back to equal split when no usage data available
            n = len(rule.splits)
            weights = {t: Decimal(1) / Decimal(n) for t in rule.splits}
            logging.warning("metric_proportional fallback to equal split for %s", event.resource_id)
        else:
            weights = {
                t: Decimal(str(v)) / Decimal(str(total_usage))
                for t, v in metric_samples.items()
                if t in rule.splits
            }

    # Integer allocation with remainder assigned to the largest-share team
    # to guarantee sum(allocated) == total_cost with no floating-point leakage
    records = []
    allocated = 0
    items = sorted(weights.items(), key=lambda x: x[1], reverse=True)

    for i, (team_id, fraction) in enumerate(items):
        tag = tag_registry.get_tag_by_team(team_id, event.window_start)
        if i < len(items) - 1:
            share = int(Decimal(str(total_cost)) * fraction.quantize(Decimal("0.0000001"), rounding=ROUND_HALF_UP))
        else:
            share = total_cost - allocated  # last team absorbs rounding residual

        allocated += share
        records.append(TeamCostRecord(
            team_id=team_id, service_id=tag.service_id if tag else "unknown",
            resource_id=event.resource_id, cost_usd_micros=share,
            window_start=event.window_start, window_end=event.window_end,
            allocation_rule_id=rule.rule_id, split_fraction=fraction,
        ))

    assert sum(r.cost_usd_micros for r in records) == total_cost, "Cost split integrity violation"
    return records

The assert at the end is not defensive programming theater - it is a hard invariant check. A single rounding error that causes sum(splits) != total_cost would create either phantom cost (money appearing from nowhere) or silent under-billing. Both are financially material. Run this check in production and alert on failures.

Key Insight

Store costs as integer microdollars (millionths of a dollar), not floats. A float representation of $1.00 can be 0.9999999… in IEEE 754, and those rounding errors compound across millions of events into dollars of attribution error over a month.

Shared Resource Apportionment

Shared resources are the hardest attribution problem. A NAT gateway charges for bytes transferred. A centralized Kafka cluster charges for storage and throughput. A shared RDS Aurora cluster charges for compute and I/O. These resources serve multiple teams but bill to a single cloud account line item.

The apportionment approach mirrors how shared utilities are handled in real estate: you install submeters on each tenant’s circuit (collect usage metrics per consumer), read them at billing time, and allocate the shared meter bill proportional to each tenant’s submeter reading.

Cost attribution data flow from billing event to team record

For each shared resource, we maintain a usage telemetry collector that samples consumption metrics at 5-minute granularity. For a NAT gateway, this is bytes_out per source VPC (correlated to team via VPC tags). For a Kafka cluster, this is bytes_in + bytes_out per consumer group (correlated to team via consumer group naming convention {team}-{service}-{consumer}).

# Shared resource apportionment configuration
# Defines which metric to use for splitting each shared resource type

shared_resources:
  - resource_type: nat_gateway
    metric_source: cloudwatch
    metric_name: BytesOutToDestination
    dimension: SourceSubnet
    aggregation: sum
    lookback_window_minutes: 60
    fallback: equal_split

  - resource_type: aurora_cluster
    metric_source: cloudwatch
    metric_name: DatabaseConnections
    dimension: DBClusterIdentifier
    aggregation: avg
    # For Aurora, track connection count as proxy for compute consumption
    # More accurate: track actual query execution time via Performance Insights
    lookback_window_minutes: 60
    fallback: static_split
    static_splits:
      platform: 0.30
      payments: 0.40
      catalog: 0.30

  - resource_type: kafka_cluster
    metric_source: kafka_jmx
    metric_name: BytesInPerSec
    dimension: topic
    aggregation: sum
    topic_to_team_resolver: topic_prefix  # "payments-*" -> payments team
    lookback_window_minutes: 60
    fallback: equal_split

The fallback field is critical for operational safety. When the metrics pipeline is degraded and usage data is unavailable, falling back to equal_split ensures costs still get attributed rather than accumulating in an unattributed bucket. Teams may dispute the equal split, but zero attribution is worse.

Real World

Netflix’s FinOps team uses a metric they call “resource equivalent units” (REUs) to normalize heterogeneous usage across CPU, memory, and GPU into a single consumption score per service. This sidesteps the question of which metric to use for apportionment by computing a weighted composite that reflects actual resource consumption holistically.

Real-Time Spend Ingestion

Cloud providers do not emit billing events in real time. AWS CUR files land in S3 with 8-24 hour lag. GCP BigQuery exports are daily. Azure Cost Management API has a 12-hour delay. If we relied solely on billing data, teams would see yesterday’s costs at best. For anomaly detection to be useful, we need a signal within 15 minutes of a spend event occurring.

The solution is a dual-track ingestion architecture: one track ingests confirmed billing records (high accuracy, high latency), and a second track ingests usage metrics that can be converted to estimated costs (lower accuracy, near real-time).

For the fast track, AWS CloudWatch metrics and Cost Explorer cost explorer metrics (which have a ~1 hour granularity) give us hourly estimates. GCP Cloud Monitoring billing metrics update every 5 minutes for some resource types. We treat fast-track data as “estimated cost” and mark it with data_quality = 'estimated'. When the confirmed billing record arrives, we reconcile the estimate against the actual and write a correction event.

// Real-time spend ingestion worker
// Reads from Kafka topics: fast-track (estimated) and confirmed-billing
// Writes to the cost store with appropriate quality markers

package ingestion

import (
    "context"
    "fmt"
    "log/slog"
    "time"
)

type DataQuality string

const (
    QualityEstimated  DataQuality = "estimated"
    QualityConfirmed  DataQuality = "confirmed"
    QualityReconciled DataQuality = "reconciled"
)

type CostRecord struct {
    ResourceID     string
    CloudProvider  string
    WindowStart    time.Time
    WindowEnd      time.Time
    CostUSDMicros  int64
    DataQuality    DataQuality
    IngestionTS    time.Time
    ReconcileKey   string // sha256(resource_id + window_start) for dedup
}

type ReconcileResult struct {
    EstimatedCost  int64
    ConfirmedCost  int64
    DeltaMicros    int64
    DeltaPct       float64
}

func (w *IngestionWorker) ReconcileEstimate(
    ctx context.Context,
    confirmed CostRecord,
) (ReconcileResult, error) {
    // Look up the estimated record for the same resource + window
    estimated, err := w.store.GetEstimated(ctx, confirmed.ReconcileKey)
    if err != nil {
        return ReconcileResult{}, fmt.Errorf("lookup estimated: %w", err)
    }

    delta := confirmed.CostUSDMicros - estimated.CostUSDMicros
    var deltaPct float64
    if estimated.CostUSDMicros != 0 {
        deltaPct = float64(delta) / float64(estimated.CostUSDMicros) * 100
    }

    // Persist reconciliation record
    result := ReconcileResult{
        EstimatedCost: estimated.CostUSDMicros,
        ConfirmedCost: confirmed.CostUSDMicros,
        DeltaMicros:   delta,
        DeltaPct:      deltaPct,
    }

    if err := w.store.WriteReconciliation(ctx, confirmed, result); err != nil {
        return result, fmt.Errorf("write reconciliation: %w", err)
    }

    // If estimate was off by more than 20%, emit a model calibration event
    if abs(deltaPct) > 20.0 {
        slog.Warn("large estimation error", "resource", confirmed.ResourceID,
            "delta_pct", deltaPct, "window", confirmed.WindowStart)
        w.metrics.RecordCalibrationNeeded(confirmed.CloudProvider, confirmed.ResourceID)
    }

    return result, nil
}

func abs(x float64) float64 {
    if x < 0 {
        return -x
    }
    return x
}

The reconciliation step is what keeps the fast-track useful. Teams see estimated costs within minutes but understand that the numbers will settle as confirmed billing arrives. The DeltaPct metric feeds back into improving the estimation model - if AWS NAT gateway estimates are consistently 15% below confirmed costs, we apply a 1.15 correction factor to the fast-track estimator.

Watch Out

Never overwrite estimated records with confirmed records in place. Instead, write a separate reconciliation event and recompute aggregates. If you mutate in place, you lose the ability to audit estimation accuracy over time, and rollback becomes nearly impossible if the confirmed record contains an error.

Anomaly Scoring Engine

Anomaly detection on cost data is harder than it looks. Team spending patterns have daily seasonality (lower on weekends), weekly seasonality (month-end spikes from batch jobs), and long-term trends (organic growth). A naive threshold like “alert if cost exceeds $X” generates constant noise as teams grow. You need a model that asks “is this cost unusual given the historical pattern for this team at this time of week and month?”

The algorithm we use is a seasonal decomposition with rolling z-score. For each (team, service, hour-of-week) triple, we maintain a rolling distribution of observed hourly costs over the past 8 weeks. A new observation’s z-score is (observed - rolling_mean) / rolling_std. Scores above 2.5 trigger a warning; above 3.5 trigger a critical alert.

# Anomaly scoring: seasonal rolling z-score per team/service/hour-of-week
# hour_of_week: 0 = Monday 00:00, 167 = Sunday 23:00

import math
from collections import deque
from dataclasses import dataclass, field
from typing import Optional

LOOKBACK_WEEKS = 8
MIN_SAMPLES_FOR_SCORING = 4  # require at least 4 historical data points

@dataclass
class RollingStats:
    """Welford's online algorithm for numerically stable mean/variance."""
    count: int = 0
    mean: float = 0.0
    M2: float = 0.0   # sum of squared deviations from mean
    samples: deque = field(default_factory=lambda: deque(maxlen=LOOKBACK_WEEKS))

    def update(self, value: float) -> None:
        # Evict oldest if at capacity and adjust running stats
        if len(self.samples) == self.samples.maxlen:
            old = self.samples[0]
            self.count -= 1
            old_mean = self.mean
            self.mean -= (old - self.mean) / max(self.count, 1)
            self.M2 -= (old - old_mean) * (old - self.mean)

        self.samples.append(value)
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        delta2 = value - self.mean
        self.M2 += delta * delta2

    @property
    def variance(self) -> float:
        if self.count < 2:
            return 0.0
        return self.M2 / (self.count - 1)

    @property
    def std(self) -> float:
        return math.sqrt(self.variance)

    def z_score(self, value: float) -> Optional[float]:
        if self.count < MIN_SAMPLES_FOR_SCORING:
            return None
        s = self.std
        if s < 1e-9:  # near-zero std: team has perfectly flat spend
            # Use relative threshold instead: flag if 3x above mean
            if self.mean > 0 and value > self.mean * 3.0:
                return 3.0
            return 0.0
        return (value - self.mean) / s


class AnomalyScorer:
    def __init__(self) -> None:
        # Key: (team_id, service_id, hour_of_week)
        self._stats: dict[tuple, RollingStats] = {}

    def _key(self, team_id: str, service_id: str, hour_of_week: int) -> tuple:
        return (team_id, service_id, hour_of_week % 168)

    def ingest(
        self, team_id: str, service_id: str, hour_of_week: int, cost_usd: float
    ) -> Optional[float]:
        """
        Record a new hourly cost observation and return the z-score.
        Returns None if insufficient history for scoring.
        """
        key = self._key(team_id, service_id, hour_of_week)
        if key not in self._stats:
            self._stats[key] = RollingStats()

        stats = self._stats[key]
        z = stats.z_score(cost_usd)  # score BEFORE updating with new value
        stats.update(cost_usd)       # update AFTER scoring to avoid self-comparison
        return z

The key design choice is scoring the value before updating the rolling stats. If we updated first, an extreme value would inflate the mean and reduce the z-score of itself - self-normalizing the anomaly into invisibility.

Key Insight

The near-zero standard deviation case deserves special handling. Teams with perfectly flat spend (e.g., a reserved instance that always charges the same amount) would have std near zero, and any fluctuation at all would produce an infinite z-score. The relative 3x threshold catches genuine anomalies without flooding with false alarms for these stable-spend services.

Budget Alerting and Showback vs Chargeback

Budget alerting is straightforward operationally but complex in terms of policy. The system supports two fundamentally different cost accountability models: showback and chargeback.

In showback, teams receive cost reports and dashboards but no actual financial consequences follow. The organization is building cost visibility culture. Engineers can see that their service costs $12,000/month and compare it to peers. This is where most organizations start.

In chargeback, the infrastructure cost attributions translate into actual internal financial transfers. If the payments team consumed $30,000 of infrastructure in May, that amount is debited from the payments team’s budget and credited back to the central platform budget. This creates strong incentives but requires organizational buy-in, accurate attribution (disputed charges cause friction), and integration with the finance system.

The system supports both models simultaneously per team - some teams run on chargeback, others on showback - because organizations typically migrate incrementally.

-- Budget configuration and alert state
CREATE TABLE team_budgets (
    id              BIGSERIAL PRIMARY KEY,
    team_id         VARCHAR(64)   NOT NULL REFERENCES teams(id),
    budget_period   VARCHAR(8)    NOT NULL CHECK (budget_period IN ('monthly','quarterly')),
    period_start    DATE          NOT NULL,
    budget_usd      NUMERIC(12,2) NOT NULL,
    alert_thresholds JSONB        NOT NULL DEFAULT '[50, 80, 100, 120]',
    -- 50% warning, 80% warning, 100% alert, 120% critical
    accountability_model VARCHAR(16) NOT NULL DEFAULT 'showback'
        CHECK (accountability_model IN ('showback','chargeback')),
    created_at      TIMESTAMPTZ   NOT NULL DEFAULT now(),
    UNIQUE (team_id, budget_period, period_start)
);

CREATE TABLE budget_alerts (
    id              BIGSERIAL PRIMARY KEY,
    team_id         VARCHAR(64)   NOT NULL,
    period_start    DATE          NOT NULL,
    threshold_pct   SMALLINT      NOT NULL,
    triggered_at    TIMESTAMPTZ   NOT NULL DEFAULT now(),
    spend_at_trigger NUMERIC(12,2) NOT NULL,
    budget_amount   NUMERIC(12,2) NOT NULL,
    alert_sent      BOOLEAN       NOT NULL DEFAULT FALSE,
    suppressed      BOOLEAN       NOT NULL DEFAULT FALSE,
    -- suppressed=true if a higher threshold was already triggered this period
    UNIQUE (team_id, period_start, threshold_pct)
);

CREATE INDEX idx_budget_alerts_unsent ON budget_alerts (triggered_at)
    WHERE alert_sent = FALSE AND suppressed = FALSE;

The suppressed column prevents alert storms. If the 50% threshold fires, then 80%, then 100% in rapid succession (e.g., a runaway job), the team should not receive three separate alerts within minutes. The alert manager marks lower thresholds as suppressed once a higher one fires in the same period.

Real World

Airbnb’s FinOps team published a detailed account of their migration from showback to chargeback. The key lesson: chargeback requires “appeal windows” where teams can dispute attributions before the financial transfer executes. Without this, engineers lose trust in the system the first time a shared resource is incorrectly attributed, and the entire cost accountability program collapses politically.

Data Model

The cost store needs to serve two very different access patterns: real-time writes of individual TeamCostRecord events at high throughput, and read queries that aggregate costs across dimensions (team, service, resource type, time range) for dashboards.

The solution is a hybrid storage model: a write-optimized append-only event table captures all raw cost events, and a set of pre-aggregated materialized views serve the common dashboard queries.

-- Core cost event table: append-only, partitioned by month
CREATE TABLE team_cost_events (
    id               BIGSERIAL,
    team_id          VARCHAR(64)   NOT NULL,
    service_id       VARCHAR(64)   NOT NULL,
    resource_id      VARCHAR(256)  NOT NULL,
    resource_type    VARCHAR(64)   NOT NULL,
    cloud_provider   VARCHAR(16)   NOT NULL,
    region           VARCHAR(32)   NOT NULL,
    cost_usd_micros  BIGINT        NOT NULL,
    window_start     TIMESTAMPTZ   NOT NULL,
    window_end       TIMESTAMPTZ   NOT NULL,
    allocation_rule  VARCHAR(128),
    split_fraction   NUMERIC(8,6),
    data_quality     VARCHAR(16)   NOT NULL DEFAULT 'confirmed',
    env              VARCHAR(16)   NOT NULL,
    tags             JSONB,
    created_at       TIMESTAMPTZ   NOT NULL DEFAULT now(),
    PRIMARY KEY (id, window_start)
) PARTITION BY RANGE (window_start);

-- Create monthly partitions programmatically; example for 2026
CREATE TABLE team_cost_events_2026_06
    PARTITION OF team_cost_events
    FOR VALUES FROM ('2026-06-01') TO ('2026-07-01');

-- Indexes on the partition for fast team+time queries
CREATE INDEX idx_tce_team_window ON team_cost_events (team_id, window_start DESC);
CREATE INDEX idx_tce_svc_window  ON team_cost_events (service_id, window_start DESC);
CREATE INDEX idx_tce_res_window  ON team_cost_events (resource_id, window_start DESC);

-- Hourly rollup materialized view: pre-aggregated for dashboard queries
CREATE MATERIALIZED VIEW cost_hourly_rollup AS
SELECT
    team_id,
    service_id,
    resource_type,
    cloud_provider,
    region,
    env,
    date_trunc('hour', window_start) AS hour_bucket,
    SUM(cost_usd_micros)             AS total_cost_usd_micros,
    COUNT(DISTINCT resource_id)      AS resource_count,
    MAX(data_quality)                AS min_data_quality
FROM team_cost_events
GROUP BY 1,2,3,4,5,6,7;

CREATE UNIQUE INDEX ON cost_hourly_rollup (team_id, service_id, resource_type, cloud_provider, region, env, hour_bucket);

-- Daily rollup for 24-month retention tier
CREATE MATERIALIZED VIEW cost_daily_rollup AS
SELECT
    team_id,
    service_id,
    resource_type,
    cloud_provider,
    region,
    env,
    date_trunc('day', window_start)  AS day_bucket,
    SUM(cost_usd_micros)             AS total_cost_usd_micros,
    COUNT(DISTINCT resource_id)      AS resource_count
FROM team_cost_events
GROUP BY 1,2,3,4,5,6,7;

Partitioning by window_start month is the correct sharding key here. Cost queries are almost always time-bounded (“show me June costs for team X”), so partition pruning eliminates irrelevant months entirely. Partition management - creating future partitions, dropping partitions older than the retention window - runs as a nightly maintenance job.

The tags JSONB column on team_cost_events stores auxiliary tags (project, feature, component) that do not have first-class columns but are needed for ad-hoc cost analysis. JSONB indexing via a GIN index enables queries like “show all costs tagged project=checkout-redesign” without requiring schema migrations each time a new tag key is introduced.

Key Insight

Pre-computing hourly and daily rollups as materialized views is the difference between a dashboard that loads in 200ms and one that takes 30 seconds. A 30-day cost breakdown for one team at hourly granularity is 720 rows from the rollup vs. potentially 500,000 rows from the raw event table - the difference compounds as the system ages.

Key Algorithms and Protocols

Integer Cost Splitting with Residual Assignment

The most subtle algorithm in the system is the one that splits a cost among N parties such that the integer-valued shares sum exactly to the original total. Floating point arithmetic cannot guarantee this. Proportional rounding to the nearest cent leaves a residual that must be assigned to exactly one party.

We use the largest remainder method (also called Hamilton’s method in political science - it’s used to apportion congressional seats). Sort parties by their fractional remainder after floor allocation, assign the residual to the party with the largest remainder.

def split_integer_cost(total_micros: int, weights: dict) -> dict:
    """
    Split total_micros across teams by weight using largest remainder method.
    Guarantees sum(result.values()) == total_micros exactly.
    Time: O(n log n) for sort, Space: O(n)

    weights: dict of team_id -> float weight (must sum to 1.0)
    Returns: dict of team_id -> int micros
    """
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 1.0"
    assert total_micros >= 0

    # Step 1: compute exact (non-integer) shares
    exact = {t: w * total_micros for t, w in weights.items()}

    # Step 2: floor each share
    floors = {t: int(v) for t, v in exact.items()}
    remainder = total_micros - sum(floors.values())

    # Step 3: sort by fractional remainder descending, assign extras
    fractions = sorted(
        [(t, exact[t] - floors[t]) for t in exact],
        key=lambda x: x[1],
        reverse=True
    )
    for i in range(remainder):
        floors[fractions[i][0]] += 1

    # Invariant check: must never fail in production
    assert sum(floors.values()) == total_micros, (
        f"split integrity error: sum={sum(floors.values())} != {total_micros}"
    )
    return floors

Time complexity: O(n log n) for the sort. At typical shared-resource cardinality (2-20 teams per resource), this is negligible compared to database I/O. Space complexity: O(n) for the intermediate maps.

The edge case worth noting: when total_micros = 0 (a credit or free tier), all shares are 0 and the algorithm returns zeros for all parties. This is correct; crediting teams with zero-cost shares would generate noise in dashboards.

Z-Score with Welford’s Online Algorithm

Maintaining a rolling z-score over 8 weeks of hourly data means updating statistics for potentially millions of (team, service, hour-of-week) triples. Storing all 8 * 168 = 1,344 raw data points per triple would require substantial memory. Welford’s algorithm computes mean and variance in a single pass with O(1) memory per triple, making it suitable for streaming.

The key property that makes Welford’s work at scale: it is numerically stable. The naive two-pass formula var = sum((x - mean)^2) / n requires knowing the mean first and suffers from catastrophic cancellation when values are large (like costs measured in microdollars). Welford’s updates variance incrementally using only the delta from the running mean, maintaining full precision.

Key Insight

The anomaly scorer uses a rolling window (8 weeks, evicting the oldest sample) rather than an expanding window. This is deliberate: if a team permanently grows their infrastructure, their baseline should shift upward over 8 weeks so they aren’t perpetually anomalous relative to their launch-era spend. A fixed-origin expanding window would never stop flagging growing teams as anomalous.

Allocation Rule Conflict Resolution

When multiple rules match a resource (e.g., a resource tagged to the platform team but also matching a shared-resource rule for a NAT gateway), the system must resolve the conflict deterministically. Rules carry a priority field (lower number = higher priority). If priorities tie, the most specific rule wins - a rule matching resource_id exactly beats one matching resource_type.

def resolve_rule_conflicts(
    resource_id: str,
    resource_type: str,
    candidate_rules: list,
) -> "AllocationRule":
    """
    Given multiple applicable rules, return the one with highest priority.
    Tie-breaking: exact resource_id match > resource_type match.
    """
    def rule_score(rule) -> tuple:
        specificity = 1 if rule.resource_id == resource_id else 0
        return (rule.priority, -specificity)  # lower priority number = better

    return min(candidate_rules, key=rule_score)

Scaling and Performance

Horizontal scaling via partitioned allocation workers

The cost allocation engine scales horizontally by partitioning the team namespace. Each allocation worker owns a shard of teams (e.g., Worker A handles teams A-F, Worker B handles G-M). Incoming cost events are routed to the correct worker by a consistent hash on team_id. This guarantees that all events for a given team are processed by the same worker, enabling in-process state for the anomaly scorer’s rolling statistics without distributed locking.

Shared resource apportionment is the exception: it requires cross-team coordination to gather usage metrics before splitting costs. The apportionment service runs as a separate fleet that polls all workers for their metered usage, computes splits, and publishes pre-computed split fractions for the allocation workers to consume.

Capacity Estimation:

Given:
  - 5,000 cloud accounts across 3 providers
  - 2,000 cost events per account per hour (conservative for AWS CUR)
  - 10,000,000 cost events per hour total
  - 80 bytes per CostEvent (after normalization)

Ingestion throughput:
  10M events/hr = ~2,800 events/sec
  Kafka topic: 10 partitions, each handling 280 events/sec (well within limits)

Storage (raw events, 90-day hourly retention):
  10M events/hr * 80 bytes * 24 hr * 90 days = ~1.73 TB
  With 3x replication: ~5.2 TB

Storage (daily rollups, 24-month retention):
  10M events/hr / 24 hr = 417K unique (team, service, resource) records/day
  * 100 bytes per rollup row * 730 days = ~30 GB (negligible)

Query throughput (dashboard):
  500 teams * 10 dashboard loads/hr = 5,000 queries/hr = ~1.4 queries/sec
  All queries hit materialized views; p95 latency < 200ms target is achievable
  with 4-node read replica cluster on the rollup tables

Anomaly scorer state:
  500 teams * 100 services/team * 168 hr/week * 200 bytes/RollingStats = ~1.7 GB RAM
  Fits comfortably in a single Redis instance; use Redis hash per (team,service)

The dominant bottleneck at scale is not write throughput (Kafka handles that) but rollup freshness. Materialized view refresh in Postgres is an exclusive lock operation - refreshing a view with 500M rows takes minutes and blocks reads during that window. The solution is incremental refresh using a timestamp watermark: only process events newer than last_refresh_ts, appending to the rollup rather than recomputing from scratch.

Real World

Datadog’s cost attribution engine uses a tiered storage approach: ClickHouse for raw event aggregation (optimized for analytical GROUP BY queries), Redis for the anomaly scorer state, and a thin Postgres layer for budget configuration and alert state. ClickHouse’s columnar storage enables 10x compression of cost event data compared to row-oriented Postgres, dramatically reducing the 90-day retention storage requirement.

Failure Modes and Recovery

Failure	Detection	Impact	Recovery
Cloud connector outage (API rate limit or credential expiry)	Connector health metric drops to zero; Kafka lag grows	Cost events stop flowing; estimated spend visible but confirmed billing delayed	Rotate credentials automatically; exponential backoff retry; alert on connector lag > 2 hours
Tag Registry unavailable during ingestion	Enrichment step returns errors; events routed to dead-letter Kafka topic	All new events land in unattributed bucket; attribution accuracy degrades	Tag Registry backed by read replicas; enrichment retries from dead-letter queue when Registry recovers; events are not lost
Allocation rule conflict (two rules both claim 100% of same resource)	Integrity assertion in allocate_cost() fires; event sent to error queue	Resource cost double-counted across teams	Rule validation pipeline rejects conflicting rules on creation; on detection, remove newer rule and re-process affected events from Kafka replay
Clock skew between allocation worker and cost store	Events with future timestamps appear; window aggregations produce gaps	Hourly rollups show gaps for affected windows	Use event time (not processing time) for all windowing; accept events up to 2 hours late; rollups compute at event-time + 2hr watermark
Anomaly scorer state lost (Redis failure)	Score output drops to null; alert volume drops to zero	No anomaly alerts until state rebuilds	Redis persistent AOF + daily RDB snapshots; on restart, replay 8 weeks of hourly cost events to rebuild RollingStats in < 30 minutes
Large batch re-processing (manual rule change affects past data)	Re-attribution job queued; Kafka consumer group offset rewound	Temporary cost double-counting if new and old records coexist	Use generation counter on cost records; rollup queries filter to max generation per (resource, window); old records logically superseded

Watch Out

The most common operational failure mode is not a component crash - it is silent data quality degradation. A cloud connector that returns HTTP 200 but returns truncated billing data (AWS CUR files can have missing line items during re-processing windows) will quietly under-attribute costs with no alerts. Implement a daily reconciliation check that compares total attributed spend to the cloud provider’s reported total, and alert if the gap exceeds 1%.

Comparison of Approaches

Approach	Latency	Attribution Accuracy	Operational Complexity	Best Fit
Native cloud cost tagging only (AWS cost allocation tags, GCP labels)	24h+ lag (billing export)	Medium - tags on resources only, no shared apportionment	Low - cloud-native, no custom system	Small orgs, single cloud, no shared infrastructure
Third-party FinOps platforms (Apptio, CloudHealth, FOCUS)	12-24h lag	Medium-high - rule engines included but rigid	Low to medium - SaaS, integration effort only	Mid-size orgs willing to pay per-seat licensing
Custom attribution system (this design)	15-30 min (fast track)	High - full control over rules, metric-proportional splits	High - 6 components to operate, schema migrations	Large orgs, multi-cloud, complex shared infrastructure, chargeback requirements
Sidecar-based cost telemetry (eBPF probes per pod, k8s cost attribution)	Near real-time (< 5 min)	Very high for k8s workloads - pod-level granularity	Very high - requires agent rollout across all nodes	Kubernetes-native orgs, container-first workloads
Cloud provider native tools only (AWS Cost Explorer, GCP Cost Management)	24h+ lag	Low - no cross-account, no shared resource splitting	Very low - zero infrastructure	Startups, single cloud, < 20 engineers

The right choice for an organization running 500+ services across 3 cloud providers with chargeback requirements is the custom attribution system. The 15-30 minute latency enables anomaly alerting, the metric-proportional splits are necessary for shared infrastructure fairness, and the chargeback accounting model requires integration hooks that SaaS FinOps tools rarely expose. Start with a third-party tool to build organizational cost visibility culture, then migrate to a custom system once chargeback requirements emerge.

Key Takeaways

Resource tagging strategy is a three-layer enforcement problem: preventive (IaC policies reject untagged resources), detective (nightly scanner flags drift), and corrective (automated propagation from parents and catalog metadata).
Shared resource apportionment requires metered usage data per consumer, not just cost data per resource - this is the most common gap in naive tagging approaches.
Store costs as integer microdollars, never floats, and use the largest remainder method for splits to guarantee zero rounding leakage across millions of events.
Dual-track ingestion (estimated fast track + confirmed slow track with reconciliation) is the only way to provide both real-time anomaly detection and accurate month-end billing.
Temporal tag validity (effective_from / effective_until on tag records) is mandatory for correct historical attribution after team ownership transfers.
Showback and chargeback require different organizational infrastructure: chargeback needs appeal windows and finance system integration that showback does not.
Anomaly scoring needs seasonal adjustment and a rolling baseline window - a flat z-score threshold misses weekend-normalized spikes and permanently flags growing teams.
Reconciliation beats mutation: when confirmed billing data arrives, write a reconciliation event rather than overwriting estimated records, preserving auditability of estimation accuracy.

The counter-intuitive lesson from this design: the hardest problem is not the engineering - it is the organizational one. A technically perfect attribution system that teams do not trust (because of one bad shared-resource split early on) will be abandoned in favor of a spreadsheet. Build the appeal workflow and the transparency UI before you build the chargeback integration. Accuracy debates need a venue before money actually moves.

Frequently Asked Questions

Q: Why not just use AWS Cost Allocation Tags and call it done? A: AWS cost allocation tags only attribute costs at the resource level - they cannot split a single resource’s cost across multiple teams. More critically, they have no concept of shared infrastructure apportionment: a NAT gateway shared by 10 teams bills entirely to the account where it lives, with no mechanism to distribute that cost to consumers. For organizations with shared platform infrastructure and chargeback requirements, native tags alone produce attribution errors of 20-40% of total spend.

Q: How do you handle containers and Kubernetes workloads where EC2 nodes are shared across pods from different teams? A: This is the hardest class of shared resource in practice. The solution is a Kubernetes cost attribution layer (tools like Kubecost or OpenCost, or a custom eBPF sidecar approach) that runs alongside this system. K8s attribution computes per-namespace (team) cost shares based on CPU request + memory request fractions, then reports those as pre-attributed cost events to the ingestion pipeline. The attribution system treats k8s as a source connector, not as something it handles internally.

Q: Why not use a streaming SQL engine like Apache Flink for the allocation instead of a custom worker fleet? A: Flink would handle the stateless enrichment and allocation steps well, but the anomaly scorer’s rolling statistics are difficult to express in SQL’s window function model - you need the last 8 weeks of the same (team, service, hour-of-week) triple, which is not a standard sliding window. The custom worker approach allows in-process RollingStats objects per key without distributed state management overhead. That said, Flink is a reasonable choice for organizations that already operate it; the allocation logic maps naturally to a Flink KeyedProcessFunction.

Q: Can the system handle retroactive rule changes without corrupting historical data? A: Yes, via the generation counter pattern. When a rule changes, all cost events affected by that rule are re-processed from the Kafka event log and written as new records with an incremented generation field. The rollup materialized views are re-computed using MAX(generation) per (resource, window) to use only the latest interpretation. Old-generation records are retained for audit but excluded from dashboards. This makes rule changes safely reversible.

Q: How do you prevent alert fatigue when a team has a legitimate large deployment that triggers anomaly alerts? A: Two mechanisms. First, teams can declare “planned spend events” via the budget API: “on 2026-06-15 we’re launching a new region, expected cost increase of $50K/month.” The anomaly scorer shifts the baseline upward for the declared window. Second, once a team acknowledges an anomaly alert, the system suppresses re-alerts for the same (team, service) combination for 4 hours unless the z-score increases by more than 0.5 beyond the acknowledged level.

Q: Why not just alert on absolute cost thresholds instead of z-scores? A: Absolute thresholds require constant manual maintenance as teams grow. A team that legitimately scaled from $10K/month to $100K/month over a year would need their threshold updated every quarter or generate constant alerts. Z-scores auto-adjust to the team’s current spending pattern. The only case where absolute thresholds are better is for very new teams with fewer than 4 weeks of history - the anomaly scorer correctly returns null z-scores until it has sufficient history, and during that bootstrap period you fall back to a conservative absolute threshold.

Interview Questions

Q: Walk me through how you’d attribute the cost of a shared NAT gateway to the teams that use it.

Expected depth: Explain that NAT gateway billing has two components (hourly rate + data processing charge). The hourly rate can be split equally or by some agreed proportion; the data processing charge should be split proportional to bytes_out per source subnet (correlated to teams via VPC/subnet tags). Discuss how to collect bytes_out per subnet from CloudWatch at 1-minute granularity, aggregate to the billing window, normalize to fractions, and apply those fractions to the data processing line item from the billing export. Cover the fallback case when CloudWatch data is unavailable.

Q: How would you detect that a team’s cost is anomalously high without generating constant false alarms for teams that are legitimately growing fast?

Expected depth: Discuss seasonal decomposition (daily and weekly patterns in cloud costs), rolling baseline windows (8 weeks gives seasonal context without being too sensitive to recent growth), Welford’s online algorithm for memory-efficient rolling stats, the z-score threshold selection tradeoff (higher threshold = fewer false alarms but missed real anomalies), and the near-zero standard deviation edge case for flat-spend services. Bonus: discuss how planned spend events and acknowledgment suppression reduce alert fatigue.

Q: A team transferred ownership of a database to another team mid-month. How does your system handle this so that costs before and after the transfer date are attributed correctly?

Expected depth: Describe the temporal tag validity model (effective_from / effective_until on resource_tags). Explain that the allocation engine looks up the active tag record at window_start of each cost event, not the current tag. Discuss the retroactive correction job that runs when a tag change affects past unbilled windows. Cover the audit trail requirement for chargeback (finance disputes require showing which team owned the resource at the time the cost was incurred).

Q: The cost attribution system needs to process 10 million cost events per hour. How do you scale the allocation engine horizontally without distributed locking on the anomaly scorer state?

Expected depth: Describe consistent hashing on team_id to route all events for a given team to the same worker (no cross-worker coordination needed for per-team stats). Explain that shared resource apportionment runs as a separate pre-computation step that publishes split fractions for workers to consume (pull model, not push). Discuss the Redis-backed persistence of RollingStats for anomaly scorer state, with AOF persistence for recovery. Cover the rebalancing problem when workers are added/removed (consistent hash ring minimizes reshuffling, but the anomaly scorer needs to rebuild state for newly assigned team keys from Kafka replay).

Q: How do you reconcile the estimated costs from your fast-track ingestion with confirmed billing data that arrives 24 hours later, without corrupting the dashboards teams are actively viewing?

Expected depth: Describe the append-only reconciliation event pattern (never overwrite estimated records, write a separate reconciliation event). Explain that the hourly rollup materialized view uses a MAX(data_quality) ordering (confirmed > estimated) when multiple records exist for the same (team, service, resource, window). Discuss the dashboard indicator that shows whether displayed data is estimated or confirmed. Cover the model calibration feedback loop: large estimation errors (> 20% delta) trigger an alert to improve the fast-track cost model for that resource type.

Premium Content

Unlock the full article along with everything else in the archive — all in one place.

In-depth analysis Expert insights Full archive access

Unlock Full Article