Build a Content Moderation Pipeline
security data-engineering microservices
System Design Deep Dive
Content Moderation Pipeline
Balancing precision and recall at scale - where over-removal silences legitimate voices and under-removal enables real harm
Every piece of content uploaded to a large platform - a comment, a photo, a four-hour livestream - needs a decision: does it violate policy? At a platform receiving 10 million items per day, no human team can review everything before it goes live. The alternative, letting everything through and acting reactively, means harmful content spreads for hours before removal. The engineering challenge is building a system that makes accurate, policy-consistent decisions at machine speed while keeping a human in the loop for the cases where machine confidence is genuinely uncertain.
Think of it like a postal inspection system sorting millions of packages. The vast majority go straight through automated scanners and leave the facility in seconds. A package that triggers a scan alert gets routed to a secondary inspection lane where a human examiner makes the final call. A tiny fraction gets escalated to a senior inspector. The challenge is calibrating the scanner’s sensitivity: too loose and prohibited items get through, too tight and you paralyze the entire facility with false alarms.
The hard part in content moderation is that the “package” is multi-modal. A post might combine benign text with a harmful image, or require cultural context to disambiguate satire from incitement. A text-only classifier misses coordinated image-based abuse campaigns. An image-only classifier misreads ironic memes. The system needs a classifier ensemble that fuses signals across modalities, a confidence thresholding layer that routes each item to the right decision lane, and a human review queue that handles the uncertain middle band - all while tracking false positive rates obsessively to avoid silencing legitimate users.
We need to solve for classification accuracy across three modalities, sub-200ms latency on the automated path, human reviewer throughput at scale, a fair and auditable appeal workflow, and policy consistency across classifier versions - simultaneously.
Requirements and Constraints
Functional Requirements
- Evaluate text, image, and video content against a versioned policy ruleset with per-category confidence thresholds
- Route each item to one of three outcomes: auto-approve, auto-remove, or human review based on ensemble confidence scores
- Provide an appeal workflow where users can contest removals, with senior reviewer assignment and 3-day SLA
- Support policy version updates that take effect within 30 seconds of publish and can be applied retroactively to borderline items from the last 7 days
- Expose a real-time false positive rate dashboard segmented by category, modality, and policy version
Non-Functional Requirements
- Throughput: 10 million items per day (approximately 116 items per second sustained, with 5x burst capacity)
- Latency: text classification P99 under 150ms, image classification P99 under 250ms, video moderation SLA 10 minutes
- Human review SLA: items must receive a reviewer decision within 4 hours of queue entry
- Availability: 99.9% uptime for the moderation pipeline; classifier outages must not block content uploads
- False positive rate: under 1% of auto-removed items should be reinstated on appeal
- Policy propagation: new policy versions must reach all classifier nodes within 30 seconds of publish
- Compliance: GDPR data retention controls, DMCA takedown workflow, no public storage of CSAM perceptual hashes
Constraints
- The upload critical path must not block on moderation decisions - content goes live optimistically
- Human reviewers see the content plus the policy excerpt, not raw classifier scores
- Appeals are handled by a separate reviewer pool from initial review to prevent anchoring bias
- Retroactive policy enforcement applies only to currently live content, never to already-removed items
High-Level Architecture
The system separates the upload path from the moderation path entirely. Content is accepted at upload time and a moderation event is placed on a Kafka topic. The pipeline picks up the event, dispatches it to the classifier ensemble, applies confidence thresholds to determine routing, and either auto-approves, auto-removes, or enqueues the item for human review. Six major components carry out this flow.
The Content Ingestion API accepts uploads, acknowledges immediately, and publishes a content.submitted event to Kafka partitioned by content type. The upload path never waits for a moderation decision.
The Moderation Router consumes from Kafka and dispatches content to the appropriate classifier workers. Text and image content goes to synchronous classifier pools. Video content goes to an async video analysis queue with a separate consumer group and longer timeout budget.
The Classifier Ensemble runs text, image, and video specialist classifiers in parallel, producing confidence scores per policy category per modality. Each classifier is a dedicated ML model trained on policy-specific labeled data.
The Decision Engine receives ensemble outputs, applies per-category confidence thresholds, and determines the routing outcome. It writes an immutable moderation event record to the audit store and executes the action.
The Human Review System is a priority queue backed by Redis sorted sets, a reviewer assignment service, and a reviewer workstation API. It handles the uncertain confidence band between auto-approve and auto-remove.
The Policy Registry is the source of truth for threshold configuration. It publishes policy version updates to all classifier and decision engine nodes via a push notification channel with a 30-second propagation SLA.
Text and image classification run synchronously against the upload event with sub-250ms latency targets. Video classification is fully async with a 10-minute SLA - the video goes live immediately and is suppressed if the analysis returns a remove decision. This asymmetry is intentional: synchronous video analysis would require massive GPU headroom to handle upload bursts, while async analysis amortizes GPU costs across time.
The Classifier Ensemble
The classifier ensemble is the core of the system. Its job is to produce a confidence score between 0 and 1 for each policy violation category, for each modality in the submitted content.
A naive approach trains a single model on all content types. This fails because the signal distributions are completely different: text classification relies on token sequences and contextual embeddings; image classification relies on convolutional features and object recognition; video requires temporal coherence analysis across frames plus audio transcription. A single model trained across all three modalities underperforms specialists trained on modality-specific labeled data.
Text Classifier uses a fine-tuned DeBERTa-v3 transformer with a multi-label classification head, producing scores for each policy category: hate speech, harassment, spam, self-harm promotion, misinformation. Text preprocessing normalizes Unicode evasion attacks - replacing homoglyph substitutions, stripping invisible separators, expanding common character-substitution obfuscation - before the model sees the input.
Image Classifier uses a ResNet-50 backbone with a multi-label classification head for general policy categories. A separate CLIP-based model handles context-dependent violations where the image meaning depends on accompanying text. A perceptual hash pre-filter (pHash with Hamming distance threshold) catches exact and near-duplicate re-uploads of known-bad images before any classifier is invoked - matching known CSAM hashes against the PhotoDNA database without storing the hashes publicly.
Video Classifier samples frames at 1 fps for initial screening, running the image classifier on each sampled frame. Audio is transcribed using Whisper ASR and fed to the text classifier. If frame-level scores exceed a sampling threshold, the system upgrades to dense frame analysis at 4 fps for the flagged segment. Temporal coherence signals - a single flagged frame in an otherwise clean video - are weighted differently than sustained flagged segments.
The ensemble fusion combines per-modality outputs into a single routing decision. The code below shows how multi-modal scores are combined with veto category handling for the most severe content categories.
# Classifier ensemble fusion - combining multi-modal scores into a routing decision
from dataclasses import dataclass
from typing import Optional
import math
@dataclass
class ModalityScore:
score: float # 0.0 to 1.0 violation probability
category: str # e.g. "hate_speech", "nudity", "violence"
confidence: float # model confidence in this prediction
model_version: str
@dataclass
class EnsembleResult:
fused_score: float
dominant_category: str
modality_scores: dict
routing: str # "approve", "review", "remove"
is_veto: bool # True if specialist model triggered hard block
VETO_CATEGORIES = {"csam", "terrorism_incitement"}
VETO_THRESHOLD = 0.70 # lower bar for worst-case content
THRESHOLDS = {
"approve": 0.30,
"remove": 0.70,
}
MODALITY_WEIGHTS = {
"text": 0.35,
"image": 0.45,
"video": 0.20,
}
def fuse_scores(
text_scores: list[ModalityScore],
image_scores: list[ModalityScore],
video_scores: Optional[list[ModalityScore]] = None,
) -> EnsembleResult:
"""
Combine multi-modal classifier outputs into a single routing decision.
Veto categories (CSAM, terrorism) bypass weighted average and trigger
immediate removal if any single modality exceeds VETO_THRESHOLD.
"""
all_scores = []
if text_scores:
all_scores.extend([(s, "text") for s in text_scores])
if image_scores:
all_scores.extend([(s, "image") for s in image_scores])
if video_scores:
all_scores.extend([(s, "video") for s in video_scores])
# Check veto categories first - any hit triggers immediate remove
for score, modality in all_scores:
if score.category in VETO_CATEGORIES and score.score >= VETO_THRESHOLD:
return EnsembleResult(
fused_score=score.score,
dominant_category=score.category,
modality_scores=_group_by_modality(all_scores),
routing="remove",
is_veto=True,
)
# Weighted average per category across modalities
category_scores: dict[str, list[float]] = {}
for score, modality in all_scores:
weight = MODALITY_WEIGHTS.get(modality, 0.33)
weighted = score.score * weight * score.confidence
category_scores.setdefault(score.category, []).append(weighted)
if not category_scores:
return EnsembleResult(0.0, "none", {}, "approve", False)
# Take max across categories, sum within category (multiple signals compound)
category_finals = {
cat: min(1.0, sum(scores))
for cat, scores in category_scores.items()
}
dominant_cat = max(category_finals, key=category_finals.get)
fused = category_finals[dominant_cat]
if fused >= THRESHOLDS["remove"]:
routing = "remove"
elif fused >= THRESHOLDS["approve"]:
routing = "review"
else:
routing = "approve"
return EnsembleResult(
fused_score=fused,
dominant_category=dominant_cat,
modality_scores=_group_by_modality(all_scores),
routing=routing,
is_veto=False,
)
def _group_by_modality(all_scores):
grouped = {}
for score, modality in all_scores:
grouped.setdefault(modality, []).append(score.score)
return {k: max(v) for k, v in grouped.items()}
YouTube and Meta both run classifier ensembles where CSAM and terrorism content bypass the normal weighted-average fusion path and trigger immediate removal via a veto mechanism the moment any single specialist model exceeds its threshold. This asymmetric handling exists because the cost of a false negative in these categories is categorically higher than in any other category - the veto path accepts a higher false positive rate in exchange for near-zero false negative rate on the worst content.
Confidence Thresholding
Confidence thresholding is where most content moderation systems make their biggest architectural mistake: they define a single global threshold and apply it uniformly across all categories. In practice, different policy categories have fundamentally different cost asymmetries between false positives and false negatives.
For child safety material, a false negative (missed violation) is catastrophic and a false positive (wrongly removed content) is merely inconvenient. The auto-remove threshold should be low - content above 0.30 confidence gets removed automatically. For political commentary that might be considered harassment, a false positive that silences legitimate speech is a serious harm. The auto-remove threshold should be high, and the human review band should be wide, routing a larger fraction of borderline cases to a human rather than auto-removing them.
# Cost-sensitive threshold optimization per policy category
# Thresholds are calibrated offline against labeled hold-out sets
import numpy as np
from sklearn.metrics import confusion_matrix
# Per-category threshold configuration
CATEGORY_THRESHOLDS = {
"csam": {
"auto_remove": 0.30, # Very low bar - false negative cost is catastrophic
"human_review": 0.10, # Almost everything goes to human review
"fp_cost": 1.0, # Cost units: wrong removal
"fn_cost": 500.0, # Cost units: missed violation (much higher)
},
"hate_speech": {
"auto_remove": 0.85, # High bar - false positives silence legitimate speech
"human_review": 0.45, # Wide review band for borderline political content
"fp_cost": 10.0,
"fn_cost": 25.0,
},
"spam": {
"auto_remove": 0.80,
"human_review": 0.50,
"fp_cost": 2.0,
"fn_cost": 5.0,
},
"graphic_violence": {
"auto_remove": 0.75,
"human_review": 0.40,
"fp_cost": 5.0,
"fn_cost": 30.0,
},
}
def optimize_threshold(
scores: np.ndarray,
labels: np.ndarray,
fp_cost: float,
fn_cost: float,
fpr_cap: float = 0.01, # Hard cap: FPR must not exceed 1%
) -> float:
"""
Find the threshold minimizing total cost while respecting the FPR cap.
Run per-category on a labeled hold-out set before each model rollout.
"""
thresholds = np.linspace(0.01, 0.99, 990)
best_threshold = 0.5
best_cost = float("inf")
total_negatives = np.sum(labels == 0)
for t in thresholds:
preds = (scores >= t).astype(int)
tn, fp, fn, tp = confusion_matrix(labels, preds).ravel()
fpr = fp / total_negatives if total_negatives > 0 else 0.0
if fpr > fpr_cap:
continue # Reject thresholds that violate the hard FPR cap
cost = (fp * fp_cost) + (fn * fn_cost)
if cost < best_cost:
best_cost = cost
best_threshold = t
return best_threshold
Thresholds calibrated on your training distribution will drift as content patterns evolve and adversarial evasion techniques spread. A threshold that produces 0.8% FPR on your validation set may produce 3% FPR six months later after attackers have adapted their language to stay near the decision boundary. Run threshold re-calibration as a scheduled job every 30 days using fresh labeled data from reviewer decisions, and alert when rolling FPR exceeds 1.5x the baseline established at last calibration.
Human Review Queue Design
The human review queue holds items in the uncertain confidence band - content that scored too low to auto-remove but too high to auto-approve. Queue design determines how quickly the most harmful borderline content reaches a reviewer and how efficiently reviewer capacity is used.
Queue priority is not first-in-first-out. Items are scored by a composite priority function:
priority = (virality_score * 0.4) + (severity_score * 0.4) + (sla_urgency * 0.2)
Virality score reflects how many users have already seen this content. Harmful content that has already reached 50,000 users causes more ongoing harm per hour of delay than content with 5 views. Severity score is derived from the category: CSAM and terrorism are severity 1.0, graphic violence is 0.8, hate speech is 0.6, spam is 0.2. SLA urgency starts at 0.0 and increases linearly as the item approaches its 4-hour SLA deadline, reaching 1.0 at 30 minutes remaining.
The priority queue is backed by a Redis sorted set. The score stored is the composite priority value. Reviewer workstations call ZPOPMAX to claim the next highest-priority item available in their certified category pool.
// Human review queue - priority insertion and reviewer assignment
package reviewqueue
import (
"context"
"encoding/json"
"fmt"
"time"
"github.com/redis/go-redis/v9"
)
type ReviewItem struct {
ItemID string `json:"item_id"`
ContentID string `json:"content_id"`
Category string `json:"category"`
FusedScore float64 `json:"fused_score"`
PolicyVersion string `json:"policy_version"`
ViralityScore float64 `json:"virality_score"`
SeverityScore float64 `json:"severity_score"`
EnqueuedAt time.Time `json:"enqueued_at"`
SLADeadline time.Time `json:"sla_deadline"`
}
type PriorityQueue struct {
rdb *redis.Client
}
func (q *PriorityQueue) Enqueue(ctx context.Context, item ReviewItem) error {
// Compute initial priority score (SLA urgency starts at 0)
priority := (item.ViralityScore * 0.4) + (item.SeverityScore * 0.4)
// Serialize item metadata for storage
data, err := json.Marshal(item)
if err != nil {
return fmt.Errorf("marshal item: %w", err)
}
// Store item details in hash, add to category-specific sorted set
pipe := q.rdb.Pipeline()
pipe.HSet(ctx, "review:items:"+item.ItemID, "data", string(data))
pipe.ZAdd(ctx, "review:queue:"+item.Category, redis.Z{
Score: priority,
Member: item.ItemID,
})
// Also add to global queue for cross-category SLA monitoring
pipe.ZAdd(ctx, "review:queue:all", redis.Z{
Score: priority,
Member: item.ItemID,
})
_, err = pipe.Exec(ctx)
return err
}
// ClaimNextItem is called by reviewer workstation to pull the highest-priority
// item from the reviewer's certified category queues.
func (q *PriorityQueue) ClaimNextItem(
ctx context.Context,
reviewerID string,
certifiedCategories []string,
) (*ReviewItem, error) {
for _, category := range certifiedCategories {
queueKey := "review:queue:" + category
// ZPOPMAX atomically removes and returns the highest-scored member
results, err := q.rdb.ZPopMax(ctx, queueKey, 1).Result()
if err != nil || len(results) == 0 {
continue
}
itemID := results[0].Member.(string)
// Fetch item metadata
data, err := q.rdb.HGet(ctx, "review:items:"+itemID, "data").Result()
if err != nil {
return nil, fmt.Errorf("fetch item data: %w", err)
}
var item ReviewItem
if err := json.Unmarshal([]byte(data), &item); err != nil {
return nil, fmt.Errorf("unmarshal item: %w", err)
}
// Record assignment
q.rdb.HSet(ctx, "review:items:"+itemID,
"assigned_to", reviewerID,
"assigned_at", time.Now().Unix(),
"status", "assigned",
)
return &item, nil
}
return nil, nil // No items available for this reviewer's certified categories
}
Reviewer assignment uses skill-based routing. CSAM reviewers are a separate certified pool with mandatory psychological support protocols and rotation limits. Hate speech and political content reviewers are certified separately from general content reviewers. A reviewer is only shown items in their certified categories. CSAM review happens in an isolated workstation environment with no external network access.
Priority-based queuing outperforms FIFO for one non-obvious reason: the SLA clock on a high-virality item starts the moment it is uploaded, not the moment it enters the review queue. FIFO gives equal weight to a video seen by 200,000 users and a comment seen by 3 users. Priority scoring ensures the high-virality item reaches a reviewer within minutes regardless of queue depth, while low-virality items wait in order without causing measurable harm.
Appeal Workflow
When a user believes their content was wrongly removed, they submit an appeal. The appeal creates a new case linked to the original moderation record and routes to a senior reviewer pool separate from the initial review team. Having the same person review their own decision produces poor second-opinion quality due to consistency bias.
# Appeal state machine - tracks state transitions from submission to resolution
from enum import Enum
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Optional
class AppealStatus(Enum):
OPEN = "open"
UNDER_REVIEW = "under_review"
DECIDED_REINSTATE = "decided_reinstate"
DECIDED_UPHOLD = "decided_uphold"
ESCALATED = "escalated"
POLICY_TEAM_REVIEW = "policy_team_review"
CLOSED = "closed"
@dataclass
class Appeal:
appeal_id: str
content_id: str
original_action: str # "auto_remove" or "human_remove"
original_item_id: Optional[str] # review_queue item if human_remove
submitted_at: datetime
user_statement: str
status: AppealStatus = AppealStatus.OPEN
assigned_to: Optional[str] = None
decision: Optional[str] = None
decided_at: Optional[datetime] = None
decision_note: Optional[str] = None
sla_deadline: datetime = field(init=False)
def __post_init__(self):
self.sla_deadline = self.submitted_at + timedelta(days=3)
class AppealStateMachine:
"""
Valid transitions:
OPEN -> UNDER_REVIEW (when senior reviewer claims the appeal)
UNDER_REVIEW -> DECIDED_REINSTATE (reviewer upholds appeal, content restored)
UNDER_REVIEW -> DECIDED_UPHOLD (reviewer upholds removal)
UNDER_REVIEW -> ESCALATED (genuinely ambiguous policy interpretation)
ESCALATED -> POLICY_TEAM_REVIEW (policy team takes over)
POLICY_TEAM_REVIEW -> CLOSED (final decision by policy team)
DECIDED_* -> CLOSED (after user notification sent)
"""
VALID_TRANSITIONS = {
AppealStatus.OPEN: [AppealStatus.UNDER_REVIEW],
AppealStatus.UNDER_REVIEW: [
AppealStatus.DECIDED_REINSTATE,
AppealStatus.DECIDED_UPHOLD,
AppealStatus.ESCALATED,
],
AppealStatus.ESCALATED: [AppealStatus.POLICY_TEAM_REVIEW],
AppealStatus.POLICY_TEAM_REVIEW: [AppealStatus.CLOSED],
AppealStatus.DECIDED_REINSTATE: [AppealStatus.CLOSED],
AppealStatus.DECIDED_UPHOLD: [AppealStatus.CLOSED],
}
def transition(self, appeal: Appeal, new_status: AppealStatus,
reviewer_id: str, note: str = "") -> Appeal:
allowed = self.VALID_TRANSITIONS.get(appeal.status, [])
if new_status not in allowed:
raise ValueError(
f"Invalid transition: {appeal.status} -> {new_status}"
)
appeal.status = new_status
if new_status in (
AppealStatus.DECIDED_REINSTATE,
AppealStatus.DECIDED_UPHOLD,
AppealStatus.CLOSED,
):
appeal.decided_at = datetime.utcnow()
appeal.decision_note = note
# If reinstated, content goes back live + negative training signal enqueued
if new_status == AppealStatus.DECIDED_REINSTATE:
self._reinstate_content(appeal.content_id)
self._enqueue_negative_training_signal(appeal)
return appeal
def _reinstate_content(self, content_id: str):
# Set content status back to live in content store
# Publish content.reinstated event to Kafka for downstream cache invalidation
pass
def _enqueue_negative_training_signal(self, appeal: Appeal):
# The original removal was a false positive - feed back to classifier training
# This signal will be included in the next training batch as a negative example
pass
When an appeal is upheld and content is reinstated, two things happen: the content status is set back to live and a negative training signal is enqueued. The original removal decision was a false positive, and the classifier that produced it needs to learn from this outcome. The signal is batched with other reviewer decisions and flows into the weekly classifier retraining pipeline.
The appeal reviewer must not see the original reviewer’s decision note before making their own assessment. Meta’s internal research found that showing appeal reviewers the original decision increased their agreement rate with that decision by over 30 percentage points - anchoring bias effectively defeats the purpose of a second review. The appeal workstation deliberately hides the original outcome until after the appeal reviewer submits their decision.
False Positive Rate Monitoring
False positive rate monitoring is not a dashboard afterthought. It is the primary operational signal for threshold tuning, classifier rollout decisions, and policy change validation. The FPR target is under 1% of auto-removals reinstated on appeal.
FPR is computed per category, per modality, and per policy version. A baseline FPR is established for each classifier version during model validation. Automated alerts fire when rolling FPR exceeds 1.5x baseline for any category over a 1-hour window.
-- FPR dashboard query: appeal-confirmed false positives per category over time
-- Used for threshold tuning decisions and classifier rollout gates
SELECT
me.triggering_category,
me.policy_version,
date_trunc('hour', me.evaluated_at) AS hour_bucket,
COUNT(*) FILTER (WHERE a.decision = 'reinstate') AS false_positives,
COUNT(*) AS total_auto_removals,
ROUND(
100.0 * COUNT(*) FILTER (WHERE a.decision = 'reinstate') / NULLIF(COUNT(*), 0),
2
) AS fpr_pct
FROM moderation_decisions me
LEFT JOIN appeals a
ON a.content_id = me.content_id
AND a.status IN ('decided_reinstate', 'closed')
WHERE
me.routing_action = 'auto_remove'
AND me.evaluated_at >= NOW() - INTERVAL '7 days'
GROUP BY
me.triggering_category,
me.policy_version,
date_trunc('hour', me.evaluated_at)
ORDER BY
hour_bucket DESC,
fpr_pct DESC;
The feedback loop from reviewer decisions back to classifier training is a double-edged sword. Reviewer decisions are high-quality labels, but the reviewer population is not a random sample of all content - they only see the uncertain middle band. Training exclusively on reviewer decisions would bias the classifier to be uncertain about content similar to what reviewers have historically seen, while potentially drifting on content far above or below the review band.
Feedback loops from reviewer decisions can amplify demographic and cultural biases in the reviewer pool. If reviewers from a particular region consistently flag political speech from a minority community at higher rates, that signal flows into training data and makes the classifier more aggressive toward that community’s speech over time. Monitor FPR segmented by language, region, and content creator demographics. Alert when any segment’s FPR diverges more than 2 standard deviations from its 90-day baseline.
Data Model
-- Content moderation core schema
-- All tables use append-only decision records for audit integrity
CREATE TABLE content (
id BIGINT PRIMARY KEY,
uploader_id BIGINT NOT NULL,
content_type VARCHAR(16) NOT NULL, -- text, image, video, composite
uploaded_at TIMESTAMPTZ NOT NULL,
status VARCHAR(32) NOT NULL DEFAULT 'live',
-- live, removed_auto, removed_human, removed_appeal_uphold, reinstated
current_policy_version VARCHAR(32)
) PARTITION BY RANGE (uploaded_at);
-- Immutable moderation decision record - never updated, only inserted
CREATE TABLE moderation_decisions (
decision_id BIGSERIAL PRIMARY KEY,
content_id BIGINT NOT NULL REFERENCES content(id),
policy_version VARCHAR(32) NOT NULL,
evaluated_at TIMESTAMPTZ NOT NULL,
text_scores JSONB, -- [{category, score, model_version, confidence}]
image_scores JSONB,
video_scores JSONB,
fused_scores JSONB, -- {category: fused_score}
routing_action VARCHAR(32) NOT NULL, -- auto_approve, auto_remove, human_review
triggering_category VARCHAR(64),
triggering_score FLOAT,
is_veto BOOLEAN NOT NULL DEFAULT FALSE,
ensemble_version VARCHAR(32) NOT NULL -- version of the ensemble config
) PARTITION BY RANGE (evaluated_at);
CREATE INDEX idx_mod_decisions_content
ON moderation_decisions (content_id, evaluated_at DESC);
CREATE INDEX idx_mod_decisions_policy_action
ON moderation_decisions (policy_version, routing_action, triggering_category);
-- Review queue items
CREATE TABLE review_queue_items (
item_id BIGSERIAL PRIMARY KEY,
content_id BIGINT NOT NULL REFERENCES content(id),
decision_id BIGINT NOT NULL REFERENCES moderation_decisions(decision_id),
category VARCHAR(64) NOT NULL,
fused_score FLOAT NOT NULL,
virality_score FLOAT NOT NULL DEFAULT 0.0,
severity_score FLOAT NOT NULL,
priority_score FLOAT NOT NULL,
policy_version VARCHAR(32) NOT NULL,
enqueued_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
sla_deadline TIMESTAMPTZ NOT NULL,
assigned_to BIGINT,
assigned_at TIMESTAMPTZ,
status VARCHAR(32) NOT NULL DEFAULT 'pending',
decision VARCHAR(32),
decided_at TIMESTAMPTZ,
decision_note TEXT
);
-- Appeals - state machine for user-initiated review requests
CREATE TABLE appeals (
appeal_id BIGSERIAL PRIMARY KEY,
content_id BIGINT NOT NULL REFERENCES content(id),
original_action VARCHAR(32) NOT NULL,
original_item_id BIGINT REFERENCES review_queue_items(item_id),
submitted_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
sla_deadline TIMESTAMPTZ NOT NULL,
user_statement TEXT,
status VARCHAR(32) NOT NULL DEFAULT 'open',
assigned_to BIGINT,
decision VARCHAR(32),
decided_at TIMESTAMPTZ,
decision_note TEXT
);
-- Immutable policy version snapshots
CREATE TABLE policy_versions (
version_id VARCHAR(32) PRIMARY KEY,
released_at TIMESTAMPTZ NOT NULL,
thresholds JSONB NOT NULL, -- full per-category threshold config
description TEXT,
released_by BIGINT NOT NULL,
is_active BOOLEAN NOT NULL DEFAULT FALSE
);
Moderation decisions are immutable - the record is written once and never updated. This is essential for audit integrity: regulators and legal teams need to know exactly what decision was made, under which policy version, and why. The content status (live, removed, reinstated) is tracked separately in the content table. A decision record represents what the system decided at a point in time; content status represents the current state resulting from all decisions and appeals combined.
Policy Version Management
Policy versions are immutable snapshots of the threshold configuration. Every moderation decision is tagged with the policy version active at evaluation time. This decoupling enables two operations that would be impossible if thresholds were stored as mutable configuration: audit queries against historical decisions and retroactive re-evaluation when thresholds change.
# Policy version YAML - defines all category thresholds for a named version
# Stored in policy registry, loaded by Decision Engine nodes on activation
version: "2026.06.14-v3"
released_at: "2026-06-14T09:00:00Z"
description: "Tightened hate speech threshold after Q2 review, widened spam review band"
categories:
csam:
auto_remove: 0.30
human_review: 0.10
veto: true
veto_threshold: 0.70
terrorism_incitement:
auto_remove: 0.40
human_review: 0.15
veto: true
veto_threshold: 0.70
hate_speech:
auto_remove: 0.82 # was 0.85 in previous version
human_review: 0.42
graphic_violence:
auto_remove: 0.75
human_review: 0.40
spam:
auto_remove: 0.80
human_review: 0.40 # was 0.50 - wider review band
self_harm:
auto_remove: 0.60
human_review: 0.30
retroactive_reeval:
enabled: true
lookback_days: 7
categories_to_reeval: ["hate_speech", "spam"] # only changed categories
When a new policy version is published, the Policy Registry pushes it to all classifier and decision engine nodes via a pub/sub channel. The propagation SLA is 30 seconds - nodes must acknowledge receipt and activate the new version within that window. After activation, the retroactive re-evaluation job queries moderation_decisions for live content evaluated in the last 7 days under the previous version where the stored scores would produce a different outcome under the new thresholds. These items are re-queued for action (auto-remove or human review as appropriate) without re-running classifiers, because the stored scores are sufficient to make the new decision.
Twitter/X and Meta both version their content policies publicly and face significant scrutiny when policy changes result in waves of removals or reinstatements of existing content. The retroactive re-evaluation window (7 days in this design) is a policy decision as much as a technical one - too short and content that narrowly escaped the old threshold continues to live, too long and a policy tightening triggers mass removals of old content that users have already engaged with. Most platforms limit retroactive enforcement to 24-72 hours for this reason.
Scaling and Performance
Back-of-Envelope Capacity Estimation
---------------------------------------
Input: 10M items/day = 116 items/second average, 580 items/second at 5x peak
Text classifier (DeBERTa-v3):
- Average latency: 40ms per item on GPU
- Throughput per GPU: 25 items/second
- At 580 TPS peak: 580 / 25 = 24 GPU instances for text
- Memory per instance: 4GB model weights, 8GB GPU RAM minimum
Image classifier (ResNet-50):
- Average latency: 60ms per item on GPU
- Throughput per GPU: 16 items/second
- Assume 70% of content has images: 580 * 0.7 = 406 image items/sec peak
- At 406 TPS: 406 / 16 = 26 GPU instances for image
Video classifier (async):
- 10M items/day, assume 5% are video: 500K videos/day
- 500K / 86400 = 5.8 videos/second average
- 1fps sampling for 60-second video = 60 frames per video
- 5.8 videos/sec * 60 frames = 348 frames/second through image classifier
- 348 / 16 = 22 additional GPU instances for video (shared with image pool off-peak)
Review queue (Redis):
- Items in queue: 10% of 10M/day enter review = 1M items/day in queue
- At 4h SLA: queue depth = 1M / 6 = ~167K items at any time
- Redis sorted set at 167K members: trivial for Redis
Human reviewers:
- 1M review items/day
- Reviewer capacity: 200 items/hour (generous estimate for complex content)
- Reviewers needed: 1M / (200 * 8h shift) = 625 reviewers
- With 3 shifts: 625 / 3 = 208 concurrent reviewers
Audit store (PostgreSQL):
- 10M decisions/day = 116 writes/second average
- 2KB per decision record = 232 KB/second = 20GB/day
- Partition by day, archive to object storage after 30 days
Text and image classifier workers are stateless Kafka consumers that scale horizontally by adding consumer group members. Kafka partitioning by content type ensures each worker pool receives only its relevant content type. When consumer lag on the text topic exceeds 5,000 messages, the autoscaler provisions additional GPU instances - typically within 2 minutes.
GPU batching is essential for image classification efficiency. Rather than invoking the classifier for each image individually (wasting GPU memory bandwidth), the image worker collects items into batches of 32 and runs a single batched inference call. At 40ms per single inference and 60ms per batch of 32, batching provides a 20x throughput improvement per GPU dollar.
Meta’s content moderation infrastructure runs on a fleet of custom AI accelerator chips (MTIA) specifically designed for inference workloads. At their scale (billions of posts per day across all surfaces), the cost difference between general GPU inference and purpose-built inference silicon is significant enough to justify custom chip development. For most platforms, A100 or H100 GPU instances with aggressive batching achieves the necessary cost-per-inference ratio without custom silicon.
Failure Modes and Recovery
| Failure | Detection | Impact | Recovery |
|---|---|---|---|
| Classifier node crash | Kafka consumer lag spike + health check failure within 10s | Items accumulate in Kafka topic; no new decisions | Orchestrator auto-restarts pod; backlog processed on recovery; no data loss (Kafka retains events) |
| Redis review queue failure | Queue health check + item enqueue error rate spike | New items cannot be enqueued for human review; auto-approve and auto-remove paths unaffected | Failover to Redis replica; in-flight items re-enqueued from moderation_decisions on recovery |
| Policy registry unavailable | Classifier nodes report config fetch failure | Nodes continue with last-known-good policy version; new policy cannot propagate | Policy registry HA with 3 replicas; maximum blast radius is delayed propagation of new policy version |
| GPU out of memory during batched inference | Worker process crash / OOM signal | Batch dropped; items return to Kafka for retry | Reduce batch size on restart; alert on sustained OOM; auto-scale to add instances if load is the cause |
| Kafka consumer lag spike | Lag monitor alert when queue exceeds 10,000 messages | Moderation latency increases; video SLA at risk | Auto-scale worker pool; if content type specific, scale that type’s consumer group; alert if lag exceeds 1M messages |
The most dangerous failure mode is a silent classifier degradation where the model continues to produce outputs but its score distribution shifts significantly - scores compress toward the midpoint, causing more content to fall into the human review band than normal. This does not trigger latency alerts or error rate alerts. It shows up only as a spike in review queue depth, which may be attributed to a traffic increase rather than a classifier problem. Monitor the distribution of classifier output scores (mean, standard deviation, fraction of scores in each decile) and alert on distribution shift, not just error rates.
Comparison of Approaches
| Approach | Accuracy | Latency | Cost | Scales To | Best For |
|---|---|---|---|---|---|
| Pure rules (keyword blocklist) | Low - trivially evaded | Sub-ms | Very low | Any volume | Legacy systems, initial MVP |
| Single ML model (text only) | Medium - misses image/video abuse | 50-100ms | Low | 1M items/day | Text-heavy platforms with limited ML infra |
| Multi-modal ensemble (this design) | High - fuses signals across modalities | 150-250ms | High (GPU fleet) | 100M+ items/day | Large platforms with diverse content types |
| Human-only review | Very high for complex context | Minutes to hours | Extremely high | 10K-100K items/day | High-stakes low-volume (ID verification, appeals) |
| Hybrid pipeline (ensemble + human) | Highest overall | Auto path 250ms, human path 4h | High but efficient | 10M+ items/day | Production platforms at scale |
The hybrid pipeline is the right choice for platforms at scale. Rules-only approaches are trivially evaded by adversarial actors who substitute characters, use image text, or coordinate across accounts. Human-only review does not scale economically past a few hundred thousand items per day. The multi-modal ensemble handles the clear cases at machine speed while routing the genuinely uncertain content to humans who bring contextual judgment the model cannot replicate. The key engineering discipline is maintaining that human review band at a manageable fraction of total volume - if thresholds drift and 30% of content enters human review, the cost model collapses. FPR monitoring and threshold discipline keep the human review band narrow.
Key Takeaways
- Decouple upload from moderation: the upload path must never block on classifier decisions; a classifier outage degrades moderation latency, not user experience.
- Specialist ensemble beats generalist: train separate models per modality trained on modality-specific policy violation datasets - a single multimodal model underperforms specialists because feature distributions are completely different across text, image, and video.
- Veto categories bypass weighted fusion: CSAM and terrorism content should trigger immediate removal if any single modality exceeds a low threshold - the false negative cost in these categories is categorically higher than all others combined.
- Per-category thresholds are non-negotiable: child safety and political speech have opposite false positive versus false negative cost asymmetries; a single global threshold produces an unacceptable tradeoff for at least one category.
- Store raw scores, not routing decisions: classifier scores are immutable facts about the content; routing decisions are scores plus thresholds; decoupling them enables retroactive policy application without re-running classifiers.
- Priority queue beats FIFO for reviewer assignment: high-virality borderline content causes more harm per hour of delay than zero-virality content; composite priority scoring ensures reviewers always address the highest-impact item next.
- FPR monitoring is an operational control surface: a 1.5x spike in false positive rate is a signal to re-calibrate thresholds or investigate classifier drift before user complaints reach critical volume.
- Appeal reviewers must not see the original decision: anchoring bias from knowing the original outcome increases agreement with that outcome by 30+ percentage points, defeating the purpose of a second review.
The counter-intuitive lesson from building content moderation at scale is that the hardest engineering problem is not catching harmful content - it is managing false positives. At 116 items per second, even a 0.5% false positive rate on auto-removals means one wrongly removed item every 17 seconds. The system’s ability to detect violations is table stakes; its ability to not suppress legitimate content is the engineering discipline that determines whether users trust the platform enough to keep posting.
Frequently Asked Questions
Q: Why not use a single large multi-modal model like Flamingo or GPT-4V instead of specialist models? A: Large multi-modal foundation models produce excellent general representations but underperform specialist models on fine-grained policy violation classification for three reasons. First, they are not trained on your specific policy categories with your specific labeled violation data - your hate speech definition differs from Meta’s differs from YouTube’s. Second, inference latency for billion-parameter models is 500ms to several seconds, far outside the 250ms budget. Third, fine-tuning specialist models on policy-specific data consistently outperforms prompt engineering a foundation model on the same classification task, especially for edge cases near the policy boundary.
Q: Why not just use user reports as the primary moderation signal instead of classifiers? A: User reports are an excellent supplemental signal but a poor primary mechanism for three reasons: reporting rate is extremely low (most users scroll past violations without reporting), reports can be weaponized by coordinated groups to mass-report legitimate speech from political opponents, and some categories (child safety, imminent self-harm) must be acted on immediately regardless of whether anyone reports them. Reports are best used to boost priority score in the human review queue, accelerating reviewer attention to content that has already been flagged by the community.
Q: Why use Redis sorted sets for the priority queue instead of a dedicated queue like SQS with priority attributes?
A: SQS priority support is limited to a small number of priority tiers, not a continuous score. The priority function here is a continuous float combining virality, severity, and SLA urgency - standard queue systems cannot represent this natively. Redis sorted sets support arbitrary float scores and atomic ZPOPMAX operations with O(log N) complexity. The reviewer claim operation is a single atomic Redis command, not a multi-step claim-then-delete pattern that would require distributed locking in a traditional queue.
Q: How do you handle adversarial evasion where users deliberately craft content to fool classifiers? A: Defense in depth across three layers. Text preprocessing normalizes Unicode evasion, invisible characters, and homoglyph substitution before the classifier sees input. Image classifiers use adversarial training data including known evasion techniques (slight image perturbations, text embedded in images, steganographic encoding). Perceptual hash pre-filtering catches re-uploads of known-bad content regardless of minor pixel perturbations. Behavioral signals - account age, prior violation history, velocity of borderline-scored uploads - feed a parallel risk scorer that can route content to human review independent of classifier confidence scores alone.
Q: What happens to stored classifier scores when a model is retrained?
A: Stored scores in moderation_decisions are tagged with both the policy version and the ensemble version (which includes the specific model versions used). When a model is retrained and a new ensemble version is deployed, historical decisions remain tagged with the old ensemble version. The retroactive re-evaluation job only replays decisions using stored scores against new thresholds - it does not re-run classifiers. If you want to re-score historical content with the new model, that is a separate offline batch process, and its results are written as new decision records with the new ensemble version, not as updates to old records.
Q: Why is the false positive rate target 1% rather than 0.1%? A: The 1% target is a business and operational constraint, not an engineering limit. At 10M items/day with a 10% auto-remove rate, 1M items are removed daily. A 1% FPR means 10,000 wrongly removed items per day that must be caught by appeal or proactive audit. A 0.1% FPR would require significantly higher auto-remove thresholds, which would push more genuinely violating content into the human review band, increasing reviewer costs by 20-30%. The target balances false positive harm to users against the cost of routing more content to human review. Most large platforms operate between 0.5% and 2% FPR on auto-removes, with the specific target driven by regulatory environment and product risk tolerance.
Interview Questions
Q: How would you design the confidence threshold system to minimize both over-removal and under-removal simultaneously? Expected depth: Per-category threshold bands (auto-remove, human-review, auto-approve), asymmetric cost of false positives versus false negatives per category with specific examples (child safety versus political speech), offline threshold calibration using cost-sensitive optimization against labeled hold-out sets, production FPR monitoring with automated 1.5x baseline alerts, and automatic threshold adjustment candidates triggered by FPR spikes pending human review.
Q: The classifier ensemble produces scores from three modalities. How do you fuse them into a single routing decision? Expected depth: Explain veto category handling (CSAM and terrorism bypass weighted average - any single modality hit triggers remove), weighted average fusion for general categories with modality weights reflecting signal quality (image 0.45 > text 0.35 > video 0.20 for most categories), handling missing modalities gracefully when a classifier times out, why fusion takes weighted sum of confidence-adjusted scores rather than simple max (max ignores signal strength, sum compounds correlated signals), and the distinction between fusion at inference time versus ensemble training.
Q: A new policy version changes the hate speech threshold from 0.85 to 0.82. How do you apply this retroactively to existing live content?
Expected depth: Query moderation_decisions for live content evaluated in the last 7 days under the old policy version where stored fused_scores for hate speech are between 0.82 and 0.85 (would now trigger auto-remove but did not under old threshold). Filter for content where content.status = 'live' only. Batch process these items through the decision engine using stored scores against new thresholds - no classifier re-invocation needed. Write new decision records tagged with the new policy version. Discuss why retroactive enforcement window is bounded to 7 days (policy decision about fairness to users) and why already-removed content is excluded.
Q: How would you design the system to handle a coordinated upload campaign - 50,000 nearly-identical harmful images uploaded from different accounts in 5 minutes? Expected depth: Perceptual hash pre-filtering with an LSH index catches near-duplicate detection in O(1) per lookup. Once the first image in the campaign is confirmed harmful (via classifier or human reviewer), its pHash and all computed neighbor hashes are added to the known-bad Redis index. All subsequent near-duplicate uploads match the known-bad index and are removed without classifier invocation. Account-level velocity signals (uploads per minute per account, new account creation rate) feed a parallel account-level risk scorer. The combination of hash propagation and account signals catches coordinated campaigns significantly faster than waiting for each individual image to route through the full classifier pipeline.
Q: How would you measure classifier quality over time given that most content is never labeled after the initial automated decision? Expected depth: Appeal outcomes provide sparse but high-quality labels (reinstated auto-removals are confirmed false positives). Human reviewer decisions on the uncertain band provide a labeled sample of the borderline region. Periodic sampling of auto-approved content for human audit provides false negative rate estimates (send 0.1% of approved content to random reviewer audit). A/B threshold testing against downstream metrics (user report rates, repeat violation rates from the same accounts) provides behavioral validation. Discuss label shift challenges when adversarial behavior evolves faster than labeled datasets, and the importance of maintaining a held-out evaluation set that is refreshed quarterly with newly labeled examples from across the full score range.
Premium Content
Unlock the full article along with everything else in the archive — all in one place.