Build an Infrastructure Cost Attribution System
cost-optimization cloud-infrastructure observability
System Design Deep Dive
Infrastructure Cost Attribution System
Tagging every cloud resource to its owner and catching spend anomalies before the CFO sees the bill
Cloud bills arrive once a month and look like a 10,000-line spreadsheet from a foreign country. Every row tells you what was charged but almost nothing about why. The compute line for us-east-1 says $84,000 - but which team ran those instances? Which service spawned that Lambda burst? Which data pipeline forgot to shut down its Redshift cluster? Without answers to those questions, you cannot optimize what you cannot see, you cannot hold teams accountable for costs they don’t know they incur, and you cannot catch the engineer who accidentally left a GPU cluster running over a long weekend.
Think of it like a shared office building where every team uses electricity, water, and HVAC but pays one collective bill. You could split it evenly, but that punishes frugal tenants and lets wasteful ones hide. The right model is submetering: put a meter on every circuit, attribute each kilowatt to the team drawing it, and send individual breakdowns. Cloud cost attribution is exactly this - putting a software meter on every compute instance, storage bucket, network byte, and managed service call, then routing each charge to the owning team in near real-time rather than at month’s end.
The engineering challenge is more subtle than slapping tags on resources. Tags drift: engineers forget them, resources get created by automation without tags, and shared infrastructure like a VPC NAT gateway or a centralized logging cluster genuinely belongs to multiple teams simultaneously. You need a resource tagging strategy that spans manual and automated creation paths. You need cost allocation rules that handle the messy middle - partial ownership, time-bound attribution, and costs that legitimately should be split. And you need shared resource apportionment that distributes infrastructure-level costs fairly without double-counting or under-counting.
On top of the attribution problem sits a temporal one. Cloud providers emit billing data with variable delay - AWS Cost and Usage Reports (CUR) lag 8-24 hours, while usage metrics from CloudWatch can be near real-time. Bridging that gap requires a real-time spend ingestion layer that fuses early signals (usage events) with confirmed cost data (billing records) to give teams hour-granularity visibility rather than day-old snapshots. Layer in anomaly scoring and budget alerting, and we’re solving for attribution accuracy, data freshness, anomaly detection sensitivity, and alert fatigue simultaneously.
Requirements and Constraints
Functional Requirements
- Tag every cloud resource (compute, storage, network, managed services) to an owning team and service
- Apply allocation rules to split shared resource costs across multiple teams by configurable proportions
- Ingest billing events and usage metrics in near real-time (under 30-minute lag from cloud provider emission)
- Surface per-team, per-service, and per-tag-dimension cost breakdowns queryable via API and dashboard
- Detect anomalous spend spikes and dispatch alerts within 15 minutes of the anomaly onset
- Support both showback (informational cost reports) and chargeback (actual internal billing/transfers)
- Provide budget management: teams set monthly budgets, system alerts at configurable thresholds (50%, 80%, 100%)
- Track untagged resource costs separately and expose them for remediation workflows
- Support multi-cloud (AWS, GCP, Azure) with a unified cost model
Non-Functional Requirements
- Ingest up to 5 million billing line items per hour across all cloud accounts
- Query cost breakdowns with p95 latency under 200ms for 30-day aggregations
- Store 24 months of daily-grain cost data and 90 days of hourly-grain data
- System availability: 99.9% uptime (8.7 hours/year downtime tolerance)
- Attribution accuracy: fewer than 0.1% of costs in the “unattributed” bucket after 48 hours
- Anomaly alerts delivered within 15 minutes of a 2-sigma spend deviation
- Support 500 teams and 10,000 individually tracked services
Constraints and Assumptions
- Cloud provider billing APIs are the source of truth; we do not attempt to re-derive costs from usage
- Tag schema is centrally managed (teams cannot create arbitrary tag keys without approval)
- Chargeback triggers actual internal fund transfers via a finance system API - outside the scope of this design
- We assume a single currency; multi-currency conversion is a downstream concern
- We do not handle reserved instance or savings plan optimization recommendations - only attribution
High-Level Architecture
The system has six major components arranged in a pipeline with a feedback loop for tag remediation.
Cloud Source Connectors pull billing and usage data from AWS CUR S3 exports, GCP BigQuery billing exports, and Azure Cost Management APIs. Each connector normalizes vendor-specific schemas into a common CostEvent protobuf.
Tag Ingestion Pipeline is a Kafka-backed stream processor that enriches each CostEvent with team and service tags from the Tag Registry. Resources without tags are routed to a separate dead-letter topic for manual remediation.
Cost Allocation Engine is the core workhorse. It applies allocation rules - splitting shared resource costs by configurable proportions - and writes attributed TeamCostRecord rows to the Cost Store. It also maintains a running tally of unattributed spend.
Shared Resource Apportionment handles resources that are genuinely multi-tenant: NAT gateways, load balancers, centralized log aggregators. It distributes costs by configurable metrics (request count, data transfer volume, CPU utilization) sampled from the metrics tier.
Anomaly Detection Engine runs a rolling z-score model over hourly spend per team, scoring each new cost batch and publishing anomaly events when spend deviates beyond a threshold.
Alert Manager consumes anomaly events and budget-threshold events, deduplicates, rate-limits, and dispatches to Slack, PagerDuty, and email via a notification fanout.
Data flows like this: cloud providers emit billing records, connectors poll or subscribe every 15 minutes, the ingestion pipeline enriches and routes events, the allocation engine attributes and persists costs, and the anomaly detector scores in parallel. The dashboard and cost API read from the Cost Store directly, using pre-computed rollup tables for the common aggregations.
The most important architectural decision is separating tag enrichment from cost allocation - they run at different cadences (enrichment is near real-time, allocation may need to wait for confirmed billing data) and have different failure modes (a missing tag is recoverable; an incorrect split ratio is not).
The Tag Registry and Resource Tagging Strategy
The Tag Registry is the single source of truth for which resource belongs to which team and service. Without it, every component in the system would need its own lookup logic, and drift between copies would corrupt attribution.
Think of it like a library card catalog: every book (resource) has a card (tag record) that says who checked it out (owning team), which section it belongs to (service), and when it was catalogued. The catalog is updated whenever a book moves, and the catalog, not the book’s spine label, is the authoritative reference.
A resource tagging strategy has three enforcement layers. The first is preventive tagging - infrastructure-as-code policies (AWS Service Control Policies, GCP Organization Policies) that reject resource creation if mandatory tags (team, service, env, cost-center) are absent. The second is detective tagging - a nightly scanner that queries cloud asset inventories and flags resources missing required tags. The third is corrective tagging - automated tag propagation that copies tags from parent resources (an EC2 instance inherits the Auto Scaling Group’s tags) and from Terraform state metadata.
-- Tag Registry schema: canonical resource-to-team mapping
CREATE TABLE resource_tags (
resource_id VARCHAR(256) NOT NULL,
cloud_provider VARCHAR(16) NOT NULL CHECK (cloud_provider IN ('aws','gcp','azure')),
resource_type VARCHAR(64) NOT NULL,
region VARCHAR(32) NOT NULL,
team_id VARCHAR(64) NOT NULL REFERENCES teams(id),
service_id VARCHAR(64) NOT NULL REFERENCES services(id),
cost_center VARCHAR(32),
env VARCHAR(16) NOT NULL CHECK (env IN ('prod','staging','dev','sandbox')),
tag_source VARCHAR(32) NOT NULL CHECK (tag_source IN ('iac','scanner','propagated','manual')),
confidence SMALLINT NOT NULL DEFAULT 100 CHECK (confidence BETWEEN 0 AND 100),
effective_from TIMESTAMPTZ NOT NULL DEFAULT now(),
effective_until TIMESTAMPTZ,
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT now(),
PRIMARY KEY (resource_id, cloud_provider, effective_from)
);
CREATE INDEX idx_resource_tags_team ON resource_tags (team_id, effective_from DESC);
CREATE INDEX idx_resource_tags_svc ON resource_tags (service_id, effective_from DESC);
CREATE INDEX idx_resource_tags_res ON resource_tags (resource_id, effective_until NULLS FIRST);
The confidence column is non-obvious but critical. When the system propagates tags from a parent resource, confidence is 80. When a tag comes from reviewed IaC, confidence is 100. When it comes from a heuristic matcher (resource name contains “payments-api”), confidence is 60. Downstream consumers can choose whether to include low-confidence attributions in official reports.
The effective_from / effective_until pair enables temporal tag correctness: if the payments team transfers ownership of a database to the platform team mid-month, the costs before the transfer date correctly attribute to payments and after to platform. Without temporal validity, retroactive tag changes would corrupt historical reports.
Tag propagation from Auto Scaling Groups to EC2 instances takes up to 15 minutes in AWS - during that window, a freshly launched instance has no team tag. If your ingestion pipeline processes billing events during this lag, those events land in the unattributed bucket and stay there unless you run a retroactive re-attribution job.
Spotify’s cost attribution system, described in their engineering blog, uses a “tag contract” enforced at the Backstage service catalog level: services that aren’t registered in the catalog cannot receive GCP resources. The catalog entry IS the tag source of truth, eliminating the drift between IaC tags and catalog metadata that plagues most organizations.
The Cost Allocation Engine
The Cost Allocation Engine’s job is to take a raw CostEvent (resource X cost Y cents in time window T) and produce one or more TeamCostRecord rows with each team’s share of that cost. For exclusively owned resources, this is a 1:1 mapping. For shared resources, it is a 1:N fan-out with proportional splits that must sum exactly to 100%.
Allocation rules are the business logic that drives this fan-out. A rule looks like: “NAT Gateway nat-04f8a3 is shared by teams payments, orders, and catalog in proportions derived from their outbound data transfer bytes, sampled hourly.” There are three rule types:
- Static split: fixed percentages, e.g. platform team pays 40%, all others split the remaining 60% equally
- Metric-proportional split: percentages derived from a runtime metric (request count, byte count, CPU share) recalculated each allocation window
- Tag-passthrough: the cloud resource’s native tags are authoritative; no rule needed
# Cost allocation engine: core split logic
# Handles both static and metric-proportional rules
from dataclasses import dataclass
from decimal import Decimal, ROUND_HALF_UP
from typing import Optional
import logging
@dataclass
class CostEvent:
resource_id: str
cloud_provider: str
cost_usd_micros: int # cost in millionths of a dollar to avoid float errors
window_start: int # unix epoch seconds
window_end: int
@dataclass
class AllocationRule:
rule_id: str
resource_id: str
rule_type: str # "static" | "metric_proportional" | "passthrough"
splits: dict # team_id -> Decimal weight (must sum to 1.0 for static)
metric_key: Optional[str] = None # e.g. "nat_bytes_out" for metric_proportional
@dataclass
class TeamCostRecord:
team_id: str
service_id: str
resource_id: str
cost_usd_micros: int
window_start: int
window_end: int
allocation_rule_id: str
split_fraction: Decimal
def allocate_cost(
event: CostEvent,
rule: AllocationRule,
metric_samples: dict, # team_id -> float usage value, for metric_proportional rules
tag_registry,
) -> list[TeamCostRecord]:
"""
Allocate a CostEvent across teams according to the rule.
Returns a list of TeamCostRecord; total cost_usd_micros must equal event.cost_usd_micros.
"""
total_cost = event.cost_usd_micros
if rule.rule_type == "passthrough":
tag = tag_registry.get_tag(event.resource_id, event.window_start)
if tag is None:
raise UntaggedResourceError(event.resource_id)
return [TeamCostRecord(
team_id=tag.team_id, service_id=tag.service_id,
resource_id=event.resource_id, cost_usd_micros=total_cost,
window_start=event.window_start, window_end=event.window_end,
allocation_rule_id="passthrough", split_fraction=Decimal("1.0"),
)]
if rule.rule_type == "static":
weights = rule.splits # already normalized
elif rule.rule_type == "metric_proportional":
total_usage = sum(metric_samples.values())
if total_usage == 0:
# Fall back to equal split when no usage data available
n = len(rule.splits)
weights = {t: Decimal(1) / Decimal(n) for t in rule.splits}
logging.warning("metric_proportional fallback to equal split for %s", event.resource_id)
else:
weights = {
t: Decimal(str(v)) / Decimal(str(total_usage))
for t, v in metric_samples.items()
if t in rule.splits
}
# Integer allocation with remainder assigned to the largest-share team
# to guarantee sum(allocated) == total_cost with no floating-point leakage
records = []
allocated = 0
items = sorted(weights.items(), key=lambda x: x[1], reverse=True)
for i, (team_id, fraction) in enumerate(items):
tag = tag_registry.get_tag_by_team(team_id, event.window_start)
if i < len(items) - 1:
share = int(Decimal(str(total_cost)) * fraction.quantize(Decimal("0.0000001"), rounding=ROUND_HALF_UP))
else:
share = total_cost - allocated # last team absorbs rounding residual
allocated += share
records.append(TeamCostRecord(
team_id=team_id, service_id=tag.service_id if tag else "unknown",
resource_id=event.resource_id, cost_usd_micros=share,
window_start=event.window_start, window_end=event.window_end,
allocation_rule_id=rule.rule_id, split_fraction=fraction,
))
assert sum(r.cost_usd_micros for r in records) == total_cost, "Cost split integrity violation"
return records
The assert at the end is not defensive programming theater - it is a hard invariant check. A single rounding error that causes sum(splits) != total_cost would create either phantom cost (money appearing from nowhere) or silent under-billing. Both are financially material. Run this check in production and alert on failures.
Store costs as integer microdollars (millionths of a dollar), not floats. A float representation of $1.00 can be 0.9999999… in IEEE 754, and those rounding errors compound across millions of events into dollars of attribution error over a month.
Shared Resource Apportionment
Shared resources are the hardest attribution problem. A NAT gateway charges for bytes transferred. A centralized Kafka cluster charges for storage and throughput. A shared RDS Aurora cluster charges for compute and I/O. These resources serve multiple teams but bill to a single cloud account line item.
The apportionment approach mirrors how shared utilities are handled in real estate: you install submeters on each tenant’s circuit (collect usage metrics per consumer), read them at billing time, and allocate the shared meter bill proportional to each tenant’s submeter reading.
For each shared resource, we maintain a usage telemetry collector that samples consumption metrics at 5-minute granularity. For a NAT gateway, this is bytes_out per source VPC (correlated to team via VPC tags). For a Kafka cluster, this is bytes_in + bytes_out per consumer group (correlated to team via consumer group naming convention {team}-{service}-{consumer}).
# Shared resource apportionment configuration
# Defines which metric to use for splitting each shared resource type
shared_resources:
- resource_type: nat_gateway
metric_source: cloudwatch
metric_name: BytesOutToDestination
dimension: SourceSubnet
aggregation: sum
lookback_window_minutes: 60
fallback: equal_split
- resource_type: aurora_cluster
metric_source: cloudwatch
metric_name: DatabaseConnections
dimension: DBClusterIdentifier
aggregation: avg
# For Aurora, track connection count as proxy for compute consumption
# More accurate: track actual query execution time via Performance Insights
lookback_window_minutes: 60
fallback: static_split
static_splits:
platform: 0.30
payments: 0.40
catalog: 0.30
- resource_type: kafka_cluster
metric_source: kafka_jmx
metric_name: BytesInPerSec
dimension: topic
aggregation: sum
topic_to_team_resolver: topic_prefix # "payments-*" -> payments team
lookback_window_minutes: 60
fallback: equal_split
The fallback field is critical for operational safety. When the metrics pipeline is degraded and usage data is unavailable, falling back to equal_split ensures costs still get attributed rather than accumulating in an unattributed bucket. Teams may dispute the equal split, but zero attribution is worse.
Netflix’s FinOps team uses a metric they call “resource equivalent units” (REUs) to normalize heterogeneous usage across CPU, memory, and GPU into a single consumption score per service. This sidesteps the question of which metric to use for apportionment by computing a weighted composite that reflects actual resource consumption holistically.
Real-Time Spend Ingestion
Cloud providers do not emit billing events in real time. AWS CUR files land in S3 with 8-24 hour lag. GCP BigQuery exports are daily. Azure Cost Management API has a 12-hour delay. If we relied solely on billing data, teams would see yesterday’s costs at best. For anomaly detection to be useful, we need a signal within 15 minutes of a spend event occurring.
The solution is a dual-track ingestion architecture: one track ingests confirmed billing records (high accuracy, high latency), and a second track ingests usage metrics that can be converted to estimated costs (lower accuracy, near real-time).
For the fast track, AWS CloudWatch metrics and Cost Explorer cost explorer metrics (which have a ~1 hour granularity) give us hourly estimates. GCP Cloud Monitoring billing metrics update every 5 minutes for some resource types. We treat fast-track data as “estimated cost” and mark it with data_quality = 'estimated'. When the confirmed billing record arrives, we reconcile the estimate against the actual and write a correction event.
// Real-time spend ingestion worker
// Reads from Kafka topics: fast-track (estimated) and confirmed-billing
// Writes to the cost store with appropriate quality markers
package ingestion
import (
"context"
"fmt"
"log/slog"
"time"
)
type DataQuality string
const (
QualityEstimated DataQuality = "estimated"
QualityConfirmed DataQuality = "confirmed"
QualityReconciled DataQuality = "reconciled"
)
type CostRecord struct {
ResourceID string
CloudProvider string
WindowStart time.Time
WindowEnd time.Time
CostUSDMicros int64
DataQuality DataQuality
IngestionTS time.Time
ReconcileKey string // sha256(resource_id + window_start) for dedup
}
type ReconcileResult struct {
EstimatedCost int64
ConfirmedCost int64
DeltaMicros int64
DeltaPct float64
}
func (w *IngestionWorker) ReconcileEstimate(
ctx context.Context,
confirmed CostRecord,
) (ReconcileResult, error) {
// Look up the estimated record for the same resource + window
estimated, err := w.store.GetEstimated(ctx, confirmed.ReconcileKey)
if err != nil {
return ReconcileResult{}, fmt.Errorf("lookup estimated: %w", err)
}
delta := confirmed.CostUSDMicros - estimated.CostUSDMicros
var deltaPct float64
if estimated.CostUSDMicros != 0 {
deltaPct = float64(delta) / float64(estimated.CostUSDMicros) * 100
}
// Persist reconciliation record
result := ReconcileResult{
EstimatedCost: estimated.CostUSDMicros,
ConfirmedCost: confirmed.CostUSDMicros,
DeltaMicros: delta,
DeltaPct: deltaPct,
}
if err := w.store.WriteReconciliation(ctx, confirmed, result); err != nil {
return result, fmt.Errorf("write reconciliation: %w", err)
}
// If estimate was off by more than 20%, emit a model calibration event
if abs(deltaPct) > 20.0 {
slog.Warn("large estimation error", "resource", confirmed.ResourceID,
"delta_pct", deltaPct, "window", confirmed.WindowStart)
w.metrics.RecordCalibrationNeeded(confirmed.CloudProvider, confirmed.ResourceID)
}
return result, nil
}
func abs(x float64) float64 {
if x < 0 {
return -x
}
return x
}
The reconciliation step is what keeps the fast-track useful. Teams see estimated costs within minutes but understand that the numbers will settle as confirmed billing arrives. The DeltaPct metric feeds back into improving the estimation model - if AWS NAT gateway estimates are consistently 15% below confirmed costs, we apply a 1.15 correction factor to the fast-track estimator.
Never overwrite estimated records with confirmed records in place. Instead, write a separate reconciliation event and recompute aggregates. If you mutate in place, you lose the ability to audit estimation accuracy over time, and rollback becomes nearly impossible if the confirmed record contains an error.
Anomaly Scoring Engine
Anomaly detection on cost data is harder than it looks. Team spending patterns have daily seasonality (lower on weekends), weekly seasonality (month-end spikes from batch jobs), and long-term trends (organic growth). A naive threshold like “alert if cost exceeds $X” generates constant noise as teams grow. You need a model that asks “is this cost unusual given the historical pattern for this team at this time of week and month?”
The algorithm we use is a seasonal decomposition with rolling z-score. For each (team, service, hour-of-week) triple, we maintain a rolling distribution of observed hourly costs over the past 8 weeks. A new observation’s z-score is (observed - rolling_mean) / rolling_std. Scores above 2.5 trigger a warning; above 3.5 trigger a critical alert.
# Anomaly scoring: seasonal rolling z-score per team/service/hour-of-week
# hour_of_week: 0 = Monday 00:00, 167 = Sunday 23:00
import math
from collections import deque
from dataclasses import dataclass, field
from typing import Optional
LOOKBACK_WEEKS = 8
MIN_SAMPLES_FOR_SCORING = 4 # require at least 4 historical data points
@dataclass
class RollingStats:
"""Welford's online algorithm for numerically stable mean/variance."""
count: int = 0
mean: float = 0.0
M2: float = 0.0 # sum of squared deviations from mean
samples: deque = field(default_factory=lambda: deque(maxlen=LOOKBACK_WEEKS))
def update(self, value: float) -> None:
# Evict oldest if at capacity and adjust running stats
if len(self.samples) == self.samples.maxlen:
old = self.samples[0]
self.count -= 1
old_mean = self.mean
self.mean -= (old - self.mean) / max(self.count, 1)
self.M2 -= (old - old_mean) * (old - self.mean)
self.samples.append(value)
self.count += 1
delta = value - self.mean
self.mean += delta / self.count
delta2 = value - self.mean
self.M2 += delta * delta2
@property
def variance(self) -> float:
if self.count < 2:
return 0.0
return self.M2 / (self.count - 1)
@property
def std(self) -> float:
return math.sqrt(self.variance)
def z_score(self, value: float) -> Optional[float]:
if self.count < MIN_SAMPLES_FOR_SCORING:
return None
s = self.std
if s < 1e-9: # near-zero std: team has perfectly flat spend
# Use relative threshold instead: flag if 3x above mean
if self.mean > 0 and value > self.mean * 3.0:
return 3.0
return 0.0
return (value - self.mean) / s
class AnomalyScorer:
def __init__(self) -> None:
# Key: (team_id, service_id, hour_of_week)
self._stats: dict[tuple, RollingStats] = {}
def _key(self, team_id: str, service_id: str, hour_of_week: int) -> tuple:
return (team_id, service_id, hour_of_week % 168)
def ingest(
self, team_id: str, service_id: str, hour_of_week: int, cost_usd: float
) -> Optional[float]:
"""
Record a new hourly cost observation and return the z-score.
Returns None if insufficient history for scoring.
"""
key = self._key(team_id, service_id, hour_of_week)
if key not in self._stats:
self._stats[key] = RollingStats()
stats = self._stats[key]
z = stats.z_score(cost_usd) # score BEFORE updating with new value
stats.update(cost_usd) # update AFTER scoring to avoid self-comparison
return z
The key design choice is scoring the value before updating the rolling stats. If we updated first, an extreme value would inflate the mean and reduce the z-score of itself - self-normalizing the anomaly into invisibility.
The near-zero standard deviation case deserves special handling. Teams with perfectly flat spend (e.g., a reserved instance that always charges the same amount) would have std near zero, and any fluctuation at all would produce an infinite z-score. The relative 3x threshold catches genuine anomalies without flooding with false alarms for these stable-spend services.
Budget Alerting and Showback vs Chargeback
Budget alerting is straightforward operationally but complex in terms of policy. The system supports two fundamentally different cost accountability models: showback and chargeback.
In showback, teams receive cost reports and dashboards but no actual financial consequences follow. The organization is building cost visibility culture. Engineers can see that their service costs $12,000/month and compare it to peers. This is where most organizations start.
In chargeback, the infrastructure cost attributions translate into actual internal financial transfers. If the payments team consumed $30,000 of infrastructure in May, that amount is debited from the payments team’s budget and credited back to the central platform budget. This creates strong incentives but requires organizational buy-in, accurate attribution (disputed charges cause friction), and integration with the finance system.
The system supports both models simultaneously per team - some teams run on chargeback, others on showback - because organizations typically migrate incrementally.
-- Budget configuration and alert state
CREATE TABLE team_budgets (
id BIGSERIAL PRIMARY KEY,
team_id VARCHAR(64) NOT NULL REFERENCES teams(id),
budget_period VARCHAR(8) NOT NULL CHECK (budget_period IN ('monthly','quarterly')),
period_start DATE NOT NULL,
budget_usd NUMERIC(12,2) NOT NULL,
alert_thresholds JSONB NOT NULL DEFAULT '[50, 80, 100, 120]',
-- 50% warning, 80% warning, 100% alert, 120% critical
accountability_model VARCHAR(16) NOT NULL DEFAULT 'showback'
CHECK (accountability_model IN ('showback','chargeback')),
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
UNIQUE (team_id, budget_period, period_start)
);
CREATE TABLE budget_alerts (
id BIGSERIAL PRIMARY KEY,
team_id VARCHAR(64) NOT NULL,
period_start DATE NOT NULL,
threshold_pct SMALLINT NOT NULL,
triggered_at TIMESTAMPTZ NOT NULL DEFAULT now(),
spend_at_trigger NUMERIC(12,2) NOT NULL,
budget_amount NUMERIC(12,2) NOT NULL,
alert_sent BOOLEAN NOT NULL DEFAULT FALSE,
suppressed BOOLEAN NOT NULL DEFAULT FALSE,
-- suppressed=true if a higher threshold was already triggered this period
UNIQUE (team_id, period_start, threshold_pct)
);
CREATE INDEX idx_budget_alerts_unsent ON budget_alerts (triggered_at)
WHERE alert_sent = FALSE AND suppressed = FALSE;
The suppressed column prevents alert storms. If the 50% threshold fires, then 80%, then 100% in rapid succession (e.g., a runaway job), the team should not receive three separate alerts within minutes. The alert manager marks lower thresholds as suppressed once a higher one fires in the same period.
Airbnb’s FinOps team published a detailed account of their migration from showback to chargeback. The key lesson: chargeback requires “appeal windows” where teams can dispute attributions before the financial transfer executes. Without this, engineers lose trust in the system the first time a shared resource is incorrectly attributed, and the entire cost accountability program collapses politically.
Data Model
The cost store needs to serve two very different access patterns: real-time writes of individual TeamCostRecord events at high throughput, and read queries that aggregate costs across dimensions (team, service, resource type, time range) for dashboards.
The solution is a hybrid storage model: a write-optimized append-only event table captures all raw cost events, and a set of pre-aggregated materialized views serve the common dashboard queries.
-- Core cost event table: append-only, partitioned by month
CREATE TABLE team_cost_events (
id BIGSERIAL,
team_id VARCHAR(64) NOT NULL,
service_id VARCHAR(64) NOT NULL,
resource_id VARCHAR(256) NOT NULL,
resource_type VARCHAR(64) NOT NULL,
cloud_provider VARCHAR(16) NOT NULL,
region VARCHAR(32) NOT NULL,
cost_usd_micros BIGINT NOT NULL,
window_start TIMESTAMPTZ NOT NULL,
window_end TIMESTAMPTZ NOT NULL,
allocation_rule VARCHAR(128),
split_fraction NUMERIC(8,6),
data_quality VARCHAR(16) NOT NULL DEFAULT 'confirmed',
env VARCHAR(16) NOT NULL,
tags JSONB,
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
PRIMARY KEY (id, window_start)
) PARTITION BY RANGE (window_start);
-- Create monthly partitions programmatically; example for 2026
CREATE TABLE team_cost_events_2026_06
PARTITION OF team_cost_events
FOR VALUES FROM ('2026-06-01') TO ('2026-07-01');
-- Indexes on the partition for fast team+time queries
CREATE INDEX idx_tce_team_window ON team_cost_events (team_id, window_start DESC);
CREATE INDEX idx_tce_svc_window ON team_cost_events (service_id, window_start DESC);
CREATE INDEX idx_tce_res_window ON team_cost_events (resource_id, window_start DESC);
-- Hourly rollup materialized view: pre-aggregated for dashboard queries
CREATE MATERIALIZED VIEW cost_hourly_rollup AS
SELECT
team_id,
service_id,
resource_type,
cloud_provider,
region,
env,
date_trunc('hour', window_start) AS hour_bucket,
SUM(cost_usd_micros) AS total_cost_usd_micros,
COUNT(DISTINCT resource_id) AS resource_count,
MAX(data_quality) AS min_data_quality
FROM team_cost_events
GROUP BY 1,2,3,4,5,6,7;
CREATE UNIQUE INDEX ON cost_hourly_rollup (team_id, service_id, resource_type, cloud_provider, region, env, hour_bucket);
-- Daily rollup for 24-month retention tier
CREATE MATERIALIZED VIEW cost_daily_rollup AS
SELECT
team_id,
service_id,
resource_type,
cloud_provider,
region,
env,
date_trunc('day', window_start) AS day_bucket,
SUM(cost_usd_micros) AS total_cost_usd_micros,
COUNT(DISTINCT resource_id) AS resource_count
FROM team_cost_events
GROUP BY 1,2,3,4,5,6,7;
Partitioning by window_start month is the correct sharding key here. Cost queries are almost always time-bounded (“show me June costs for team X”), so partition pruning eliminates irrelevant months entirely. Partition management - creating future partitions, dropping partitions older than the retention window - runs as a nightly maintenance job.
The tags JSONB column on team_cost_events stores auxiliary tags (project, feature, component) that do not have first-class columns but are needed for ad-hoc cost analysis. JSONB indexing via a GIN index enables queries like “show all costs tagged project=checkout-redesign” without requiring schema migrations each time a new tag key is introduced.
Pre-computing hourly and daily rollups as materialized views is the difference between a dashboard that loads in 200ms and one that takes 30 seconds. A 30-day cost breakdown for one team at hourly granularity is 720 rows from the rollup vs. potentially 500,000 rows from the raw event table - the difference compounds as the system ages.
Key Algorithms and Protocols
Integer Cost Splitting with Residual Assignment
The most subtle algorithm in the system is the one that splits a cost among N parties such that the integer-valued shares sum exactly to the original total. Floating point arithmetic cannot guarantee this. Proportional rounding to the nearest cent leaves a residual that must be assigned to exactly one party.
We use the largest remainder method (also called Hamilton’s method in political science - it’s used to apportion congressional seats). Sort parties by their fractional remainder after floor allocation, assign the residual to the party with the largest remainder.
def split_integer_cost(total_micros: int, weights: dict) -> dict:
"""
Split total_micros across teams by weight using largest remainder method.
Guarantees sum(result.values()) == total_micros exactly.
Time: O(n log n) for sort, Space: O(n)
weights: dict of team_id -> float weight (must sum to 1.0)
Returns: dict of team_id -> int micros
"""
assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 1.0"
assert total_micros >= 0
# Step 1: compute exact (non-integer) shares
exact = {t: w * total_micros for t, w in weights.items()}
# Step 2: floor each share
floors = {t: int(v) for t, v in exact.items()}
remainder = total_micros - sum(floors.values())
# Step 3: sort by fractional remainder descending, assign extras
fractions = sorted(
[(t, exact[t] - floors[t]) for t in exact],
key=lambda x: x[1],
reverse=True
)
for i in range(remainder):
floors[fractions[i][0]] += 1
# Invariant check: must never fail in production
assert sum(floors.values()) == total_micros, (
f"split integrity error: sum={sum(floors.values())} != {total_micros}"
)
return floors
Time complexity: O(n log n) for the sort. At typical shared-resource cardinality (2-20 teams per resource), this is negligible compared to database I/O. Space complexity: O(n) for the intermediate maps.
The edge case worth noting: when total_micros = 0 (a credit or free tier), all shares are 0 and the algorithm returns zeros for all parties. This is correct; crediting teams with zero-cost shares would generate noise in dashboards.
Z-Score with Welford’s Online Algorithm
Maintaining a rolling z-score over 8 weeks of hourly data means updating statistics for potentially millions of (team, service, hour-of-week) triples. Storing all 8 * 168 = 1,344 raw data points per triple would require substantial memory. Welford’s algorithm computes mean and variance in a single pass with O(1) memory per triple, making it suitable for streaming.
The key property that makes Welford’s work at scale: it is numerically stable. The naive two-pass formula var = sum((x - mean)^2) / n requires knowing the mean first and suffers from catastrophic cancellation when values are large (like costs measured in microdollars). Welford’s updates variance incrementally using only the delta from the running mean, maintaining full precision.
The anomaly scorer uses a rolling window (8 weeks, evicting the oldest sample) rather than an expanding window. This is deliberate: if a team permanently grows their infrastructure, their baseline should shift upward over 8 weeks so they aren’t perpetually anomalous relative to their launch-era spend. A fixed-origin expanding window would never stop flagging growing teams as anomalous.
Allocation Rule Conflict Resolution
When multiple rules match a resource (e.g., a resource tagged to the platform team but also matching a shared-resource rule for a NAT gateway), the system must resolve the conflict deterministically. Rules carry a priority field (lower number = higher priority). If priorities tie, the most specific rule wins - a rule matching resource_id exactly beats one matching resource_type.
def resolve_rule_conflicts(
resource_id: str,
resource_type: str,
candidate_rules: list,
) -> "AllocationRule":
"""
Given multiple applicable rules, return the one with highest priority.
Tie-breaking: exact resource_id match > resource_type match.
"""
def rule_score(rule) -> tuple:
specificity = 1 if rule.resource_id == resource_id else 0
return (rule.priority, -specificity) # lower priority number = better
return min(candidate_rules, key=rule_score)
Scaling and Performance
The cost allocation engine scales horizontally by partitioning the team namespace. Each allocation worker owns a shard of teams (e.g., Worker A handles teams A-F, Worker B handles G-M). Incoming cost events are routed to the correct worker by a consistent hash on team_id. This guarantees that all events for a given team are processed by the same worker, enabling in-process state for the anomaly scorer’s rolling statistics without distributed locking.
Shared resource apportionment is the exception: it requires cross-team coordination to gather usage metrics before splitting costs. The apportionment service runs as a separate fleet that polls all workers for their metered usage, computes splits, and publishes pre-computed split fractions for the allocation workers to consume.
Capacity Estimation:
Given:
- 5,000 cloud accounts across 3 providers
- 2,000 cost events per account per hour (conservative for AWS CUR)
- 10,000,000 cost events per hour total
- 80 bytes per CostEvent (after normalization)
Ingestion throughput:
10M events/hr = ~2,800 events/sec
Kafka topic: 10 partitions, each handling 280 events/sec (well within limits)
Storage (raw events, 90-day hourly retention):
10M events/hr * 80 bytes * 24 hr * 90 days = ~1.73 TB
With 3x replication: ~5.2 TB
Storage (daily rollups, 24-month retention):
10M events/hr / 24 hr = 417K unique (team, service, resource) records/day
* 100 bytes per rollup row * 730 days = ~30 GB (negligible)
Query throughput (dashboard):
500 teams * 10 dashboard loads/hr = 5,000 queries/hr = ~1.4 queries/sec
All queries hit materialized views; p95 latency < 200ms target is achievable
with 4-node read replica cluster on the rollup tables
Anomaly scorer state:
500 teams * 100 services/team * 168 hr/week * 200 bytes/RollingStats = ~1.7 GB RAM
Fits comfortably in a single Redis instance; use Redis hash per (team,service)
The dominant bottleneck at scale is not write throughput (Kafka handles that) but rollup freshness. Materialized view refresh in Postgres is an exclusive lock operation - refreshing a view with 500M rows takes minutes and blocks reads during that window. The solution is incremental refresh using a timestamp watermark: only process events newer than last_refresh_ts, appending to the rollup rather than recomputing from scratch.
Datadog’s cost attribution engine uses a tiered storage approach: ClickHouse for raw event aggregation (optimized for analytical GROUP BY queries), Redis for the anomaly scorer state, and a thin Postgres layer for budget configuration and alert state. ClickHouse’s columnar storage enables 10x compression of cost event data compared to row-oriented Postgres, dramatically reducing the 90-day retention storage requirement.
Failure Modes and Recovery
| Failure | Detection | Impact | Recovery |
|---|---|---|---|
| Cloud connector outage (API rate limit or credential expiry) | Connector health metric drops to zero; Kafka lag grows | Cost events stop flowing; estimated spend visible but confirmed billing delayed | Rotate credentials automatically; exponential backoff retry; alert on connector lag > 2 hours |
| Tag Registry unavailable during ingestion | Enrichment step returns errors; events routed to dead-letter Kafka topic | All new events land in unattributed bucket; attribution accuracy degrades | Tag Registry backed by read replicas; enrichment retries from dead-letter queue when Registry recovers; events are not lost |
| Allocation rule conflict (two rules both claim 100% of same resource) | Integrity assertion in allocate_cost() fires; event sent to error queue | Resource cost double-counted across teams | Rule validation pipeline rejects conflicting rules on creation; on detection, remove newer rule and re-process affected events from Kafka replay |
| Clock skew between allocation worker and cost store | Events with future timestamps appear; window aggregations produce gaps | Hourly rollups show gaps for affected windows | Use event time (not processing time) for all windowing; accept events up to 2 hours late; rollups compute at event-time + 2hr watermark |
| Anomaly scorer state lost (Redis failure) | Score output drops to null; alert volume drops to zero | No anomaly alerts until state rebuilds | Redis persistent AOF + daily RDB snapshots; on restart, replay 8 weeks of hourly cost events to rebuild RollingStats in < 30 minutes |
| Large batch re-processing (manual rule change affects past data) | Re-attribution job queued; Kafka consumer group offset rewound | Temporary cost double-counting if new and old records coexist | Use generation counter on cost records; rollup queries filter to max generation per (resource, window); old records logically superseded |
The most common operational failure mode is not a component crash - it is silent data quality degradation. A cloud connector that returns HTTP 200 but returns truncated billing data (AWS CUR files can have missing line items during re-processing windows) will quietly under-attribute costs with no alerts. Implement a daily reconciliation check that compares total attributed spend to the cloud provider’s reported total, and alert if the gap exceeds 1%.
Comparison of Approaches
| Approach | Latency | Attribution Accuracy | Operational Complexity | Best Fit |
|---|---|---|---|---|
| Native cloud cost tagging only (AWS cost allocation tags, GCP labels) | 24h+ lag (billing export) | Medium - tags on resources only, no shared apportionment | Low - cloud-native, no custom system | Small orgs, single cloud, no shared infrastructure |
| Third-party FinOps platforms (Apptio, CloudHealth, FOCUS) | 12-24h lag | Medium-high - rule engines included but rigid | Low to medium - SaaS, integration effort only | Mid-size orgs willing to pay per-seat licensing |
| Custom attribution system (this design) | 15-30 min (fast track) | High - full control over rules, metric-proportional splits | High - 6 components to operate, schema migrations | Large orgs, multi-cloud, complex shared infrastructure, chargeback requirements |
| Sidecar-based cost telemetry (eBPF probes per pod, k8s cost attribution) | Near real-time (< 5 min) | Very high for k8s workloads - pod-level granularity | Very high - requires agent rollout across all nodes | Kubernetes-native orgs, container-first workloads |
| Cloud provider native tools only (AWS Cost Explorer, GCP Cost Management) | 24h+ lag | Low - no cross-account, no shared resource splitting | Very low - zero infrastructure | Startups, single cloud, < 20 engineers |
The right choice for an organization running 500+ services across 3 cloud providers with chargeback requirements is the custom attribution system. The 15-30 minute latency enables anomaly alerting, the metric-proportional splits are necessary for shared infrastructure fairness, and the chargeback accounting model requires integration hooks that SaaS FinOps tools rarely expose. Start with a third-party tool to build organizational cost visibility culture, then migrate to a custom system once chargeback requirements emerge.
Key Takeaways
- Resource tagging strategy is a three-layer enforcement problem: preventive (IaC policies reject untagged resources), detective (nightly scanner flags drift), and corrective (automated propagation from parents and catalog metadata).
- Shared resource apportionment requires metered usage data per consumer, not just cost data per resource - this is the most common gap in naive tagging approaches.
- Store costs as integer microdollars, never floats, and use the largest remainder method for splits to guarantee zero rounding leakage across millions of events.
- Dual-track ingestion (estimated fast track + confirmed slow track with reconciliation) is the only way to provide both real-time anomaly detection and accurate month-end billing.
- Temporal tag validity (effective_from / effective_until on tag records) is mandatory for correct historical attribution after team ownership transfers.
- Showback and chargeback require different organizational infrastructure: chargeback needs appeal windows and finance system integration that showback does not.
- Anomaly scoring needs seasonal adjustment and a rolling baseline window - a flat z-score threshold misses weekend-normalized spikes and permanently flags growing teams.
- Reconciliation beats mutation: when confirmed billing data arrives, write a reconciliation event rather than overwriting estimated records, preserving auditability of estimation accuracy.
The counter-intuitive lesson from this design: the hardest problem is not the engineering - it is the organizational one. A technically perfect attribution system that teams do not trust (because of one bad shared-resource split early on) will be abandoned in favor of a spreadsheet. Build the appeal workflow and the transparency UI before you build the chargeback integration. Accuracy debates need a venue before money actually moves.
Frequently Asked Questions
Q: Why not just use AWS Cost Allocation Tags and call it done? A: AWS cost allocation tags only attribute costs at the resource level - they cannot split a single resource’s cost across multiple teams. More critically, they have no concept of shared infrastructure apportionment: a NAT gateway shared by 10 teams bills entirely to the account where it lives, with no mechanism to distribute that cost to consumers. For organizations with shared platform infrastructure and chargeback requirements, native tags alone produce attribution errors of 20-40% of total spend.
Q: How do you handle containers and Kubernetes workloads where EC2 nodes are shared across pods from different teams? A: This is the hardest class of shared resource in practice. The solution is a Kubernetes cost attribution layer (tools like Kubecost or OpenCost, or a custom eBPF sidecar approach) that runs alongside this system. K8s attribution computes per-namespace (team) cost shares based on CPU request + memory request fractions, then reports those as pre-attributed cost events to the ingestion pipeline. The attribution system treats k8s as a source connector, not as something it handles internally.
Q: Why not use a streaming SQL engine like Apache Flink for the allocation instead of a custom worker fleet? A: Flink would handle the stateless enrichment and allocation steps well, but the anomaly scorer’s rolling statistics are difficult to express in SQL’s window function model - you need the last 8 weeks of the same (team, service, hour-of-week) triple, which is not a standard sliding window. The custom worker approach allows in-process RollingStats objects per key without distributed state management overhead. That said, Flink is a reasonable choice for organizations that already operate it; the allocation logic maps naturally to a Flink KeyedProcessFunction.
Q: Can the system handle retroactive rule changes without corrupting historical data?
A: Yes, via the generation counter pattern. When a rule changes, all cost events affected by that rule are re-processed from the Kafka event log and written as new records with an incremented generation field. The rollup materialized views are re-computed using MAX(generation) per (resource, window) to use only the latest interpretation. Old-generation records are retained for audit but excluded from dashboards. This makes rule changes safely reversible.
Q: How do you prevent alert fatigue when a team has a legitimate large deployment that triggers anomaly alerts? A: Two mechanisms. First, teams can declare “planned spend events” via the budget API: “on 2026-06-15 we’re launching a new region, expected cost increase of $50K/month.” The anomaly scorer shifts the baseline upward for the declared window. Second, once a team acknowledges an anomaly alert, the system suppresses re-alerts for the same (team, service) combination for 4 hours unless the z-score increases by more than 0.5 beyond the acknowledged level.
Q: Why not just alert on absolute cost thresholds instead of z-scores?
A: Absolute thresholds require constant manual maintenance as teams grow. A team that legitimately scaled from $10K/month to $100K/month over a year would need their threshold updated every quarter or generate constant alerts. Z-scores auto-adjust to the team’s current spending pattern. The only case where absolute thresholds are better is for very new teams with fewer than 4 weeks of history - the anomaly scorer correctly returns null z-scores until it has sufficient history, and during that bootstrap period you fall back to a conservative absolute threshold.
Interview Questions
Q: Walk me through how you’d attribute the cost of a shared NAT gateway to the teams that use it.
Expected depth: Explain that NAT gateway billing has two components (hourly rate + data processing charge). The hourly rate can be split equally or by some agreed proportion; the data processing charge should be split proportional to bytes_out per source subnet (correlated to teams via VPC/subnet tags). Discuss how to collect bytes_out per subnet from CloudWatch at 1-minute granularity, aggregate to the billing window, normalize to fractions, and apply those fractions to the data processing line item from the billing export. Cover the fallback case when CloudWatch data is unavailable.
Q: How would you detect that a team’s cost is anomalously high without generating constant false alarms for teams that are legitimately growing fast?
Expected depth: Discuss seasonal decomposition (daily and weekly patterns in cloud costs), rolling baseline windows (8 weeks gives seasonal context without being too sensitive to recent growth), Welford’s online algorithm for memory-efficient rolling stats, the z-score threshold selection tradeoff (higher threshold = fewer false alarms but missed real anomalies), and the near-zero standard deviation edge case for flat-spend services. Bonus: discuss how planned spend events and acknowledgment suppression reduce alert fatigue.
Q: A team transferred ownership of a database to another team mid-month. How does your system handle this so that costs before and after the transfer date are attributed correctly?
Expected depth: Describe the temporal tag validity model (effective_from / effective_until on resource_tags). Explain that the allocation engine looks up the active tag record at window_start of each cost event, not the current tag. Discuss the retroactive correction job that runs when a tag change affects past unbilled windows. Cover the audit trail requirement for chargeback (finance disputes require showing which team owned the resource at the time the cost was incurred).
Q: The cost attribution system needs to process 10 million cost events per hour. How do you scale the allocation engine horizontally without distributed locking on the anomaly scorer state?
Expected depth: Describe consistent hashing on team_id to route all events for a given team to the same worker (no cross-worker coordination needed for per-team stats). Explain that shared resource apportionment runs as a separate pre-computation step that publishes split fractions for workers to consume (pull model, not push). Discuss the Redis-backed persistence of RollingStats for anomaly scorer state, with AOF persistence for recovery. Cover the rebalancing problem when workers are added/removed (consistent hash ring minimizes reshuffling, but the anomaly scorer needs to rebuild state for newly assigned team keys from Kafka replay).
Q: How do you reconcile the estimated costs from your fast-track ingestion with confirmed billing data that arrives 24 hours later, without corrupting the dashboards teams are actively viewing?
Expected depth: Describe the append-only reconciliation event pattern (never overwrite estimated records, write a separate reconciliation event). Explain that the hourly rollup materialized view uses a MAX(data_quality) ordering (confirmed > estimated) when multiple records exist for the same (team, service, resource, window). Discuss the dashboard indicator that shows whether displayed data is estimated or confirmed. Cover the model calibration feedback loop: large estimation errors (> 20% delta) trigger an alert to improve the fast-track cost model for that resource type.
Premium Content
Unlock the full article along with everything else in the archive — all in one place.