Build a Podcast Transcription and Chapter Detection Pipeline



System Design Deep Dive

Podcast Transcription Pipeline

How do you transcribe hours of audio, detect chapter breaks, and make it searchable - all before the listener finishes the first episode?


Transcribing a podcast is like turning a live lecture into a textbook with chapters and a searchable index - and doing it while the ink is still wet. A 90-minute episode contains roughly 13,500 words of spoken audio. Transcribing that file sequentially through a GPU-accelerated ASR model running at 20x real-time takes around 4.5 minutes, before diarization, chapter detection, or indexing have even started. That sounds fast, but it leaves almost nothing of a 5-minute latency budget, and the time grows linearly with episode length. At 50,000 hours of audio uploaded per day, the fleet also needs on the order of a hundred GPUs busy around the clock on average, and several hundred during burst uploads. The system must parallelize within each episode as well as across episodes.

The first tension is between chunking for speed and accuracy at boundaries. If you split a 90-minute audio file into 30-second chunks for parallel processing, words that span a chunk boundary get cut in half. “Unbelievable” becomes “Unbelie-” at the end of chunk 12 and “-vable” at the start of chunk 13. The fix is overlapping chunks - extend each chunk by 3 seconds on each side - but now the stitch worker must resolve duplicate transcriptions of the overlap region without doubling words.

The second tension is between chapter detection quality and latency. The best chapter boundary detection uses a large language model that reads the full transcript and identifies semantic shifts. But a 90-minute transcript runs to roughly 13,500 words, which can exceed the context window of smaller or older models, and waiting for full transcription before running chapter detection defeats the goal of having chapters available within 5 minutes of upload. The system must detect boundaries incrementally, operating on sliding windows of transcript as chunks complete.

The third tension is cost versus latency. A100 GPUs that process 30-second audio chunks in under 2 seconds cost significantly more per hour than CPU instances. For popular podcasts where listeners expect near-instant transcripts, the GPU cost is justified. For a long-tail episode with 200 expected listeners, it is not. We need tiered processing: GPU-accelerated for popular or premium content, batch CPU processing for the long tail. We solve for chunked audio processing, ASR model integration, speaker diarization, topic boundary detection, transcript indexing, pipeline parallelism, and cost-latency tradeoff simultaneously.

Requirements and Constraints

Functional Requirements

  • Transcribe audio uploaded to the platform with word-level timestamps
  • Detect natural topic boundaries and generate chapter markers with titles
  • Identify speaker transitions and label segments (Speaker 1, Speaker 2, …)
  • Make the transcript full-text searchable within 10 minutes of upload completion
  • Support episodes up to 4 hours in length

Non-Functional Requirements

  • Transcription available within 5 minutes for episodes under 60 minutes
  • Chapter detection accuracy: at least 70% of human-labeled boundaries identified
  • Search index freshness: under 10 minutes from upload completion
  • Daily throughput: 50,000 hours of audio per day
  • ASR word error rate (WER): under 8% for clean studio audio
  • System availability: 99.9% (degraded mode acceptable, total outage not)

Constraints and Assumptions

  • Audio is in MP3, AAC, or WAV format; video is supported by extracting the audio track first
  • Speaker identification is probabilistic - we label “Speaker 1, Speaker 2” not actual names without voice enrollment
  • Chapter titles are generated heuristically from keyword extraction, not via LLM (to control latency)
  • Transcription cost is a first-class constraint: GPU time must be tracked per episode

High-Level Architecture

The pipeline has five processing stages that run with as much parallelism as possible. Audio arrives at an Upload API, is written to Object Storage (S3), and triggers a Job Queue entry. The Job Queue fans work out to five parallel processor types: a Chunker that splits audio into overlapping 30-second segments, an ASR Worker Pool that transcribes each chunk on GPU, a Diarization Worker that identifies speakers, a Chapter Detector that finds topic boundaries, and an Index Writer that writes the completed transcript to Elasticsearch.

Figure: Podcast transcription pipeline high-level architecture

The key architectural insight is that chunking and ASR run in parallel with diarization. Diarization requires the full audio file (it needs global speaker context to label consistently), so it runs as a separate parallel job against the original file rather than waiting for chunk transcripts. When both the ASR stitch and diarization output are ready, a Merge Worker aligns them by timestamp to produce speaker-labeled transcript segments.

Chapter detection can begin as soon as the first 5 minutes of transcript is available. It processes the transcript in a sliding window, emitting preliminary chapter boundaries that are refined as more transcript arrives. The final chapter list is produced when the full transcript is assembled.

Key Insight

The entire pipeline is event-driven via pub/sub: each stage emits a completion event that triggers the next stage, rather than using polling or synchronous chaining. This means a 10-minute episode and a 4-hour episode use the same code path - the parallelism scales with the episode length automatically.
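As a rough illustration of completion events driving the pipeline, here is a minimal in-memory sketch. The `StageDispatcher` class and the stage names are hypothetical, the per-chunk ASR fan-out is collapsed into a single `asr` stage, and a real deployment would receive these events from the pub/sub system rather than from direct method calls.

```python
class StageDispatcher:
    """Tracks stage dependencies and reports which stages a completion event unblocks."""

    def __init__(self, depends_on: dict[str, list[str]]) -> None:
        self.depends_on = depends_on
        self.done: set[str] = set()
        self.started: list[str] = []

    def _start_ready(self) -> list[str]:
        # A stage is ready when it has not started yet and all its dependencies are done
        newly_started = [
            stage
            for stage, deps in self.depends_on.items()
            if stage not in self.started and all(d in self.done for d in deps)
        ]
        self.started.extend(newly_started)
        return newly_started

    def start(self) -> list[str]:
        """Kick off every stage that has no dependencies."""
        return self._start_ready()

    def on_completed(self, stage: str) -> list[str]:
        """Handle a completion event and return the stages it unblocks."""
        self.done.add(stage)
        return self._start_ready()


dispatcher = StageDispatcher({
    "chunk_audio": [],
    "diarize": [],
    "asr": ["chunk_audio"],
    "stitch": ["asr"],
    "merge": ["stitch", "diarize"],
})
```

Nothing in the dispatcher depends on episode length: a longer episode simply produces more chunk-level events before `asr` is reported complete.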

Chunked Audio Processing

Chunked audio processing splits a long audio file into overlapping segments that can be transcribed in parallel without losing words at boundaries.

The chunker reads the uploaded file from S3, splits it into 30-second segments with a 3-second overlap on each side, and writes each chunk back to S3 as an individual file. Chunk starts advance by 30 seconds, so a 90-minute episode generates (90 * 60 / 30) = 180 chunks; with the padding, each chunk is up to 36 seconds long (33 seconds for the first and last, which have a neighbor on only one side).

# Audio chunker - splits audio into overlapping segments for parallel ASR
import boto3
from pydub import AudioSegment
import io

def chunk_audio(
    s3_bucket: str,
    s3_key: str,
    episode_id: str,
    chunk_duration_ms: int = 30_000,
    overlap_ms: int = 3_000,
) -> list[dict]:
    s3 = boto3.client("s3")
    obj = s3.get_object(Bucket=s3_bucket, Key=s3_key)
    audio = AudioSegment.from_file(io.BytesIO(obj["Body"].read()))

    chunks = []
    offset = 0
    chunk_index = 0

    while offset < len(audio):
        start = max(0, offset - overlap_ms)
        end = min(len(audio), offset + chunk_duration_ms + overlap_ms)
        chunk = audio[start:end]

        # Export as 16kHz mono WAV for Whisper
        buf = io.BytesIO()
        chunk.set_frame_rate(16000).set_channels(1).export(buf, format="wav")
        buf.seek(0)

        chunk_key = f"chunks/{episode_id}/{chunk_index:04d}.wav"
        s3.put_object(Bucket=s3_bucket, Key=chunk_key, Body=buf.read())

        chunks.append({
            "chunk_index": chunk_index,
            "s3_key": chunk_key,
            "audio_start_ms": offset,    # offset in original file (without overlap)
            "overlap_start_ms": start,   # actual start of chunk audio
            "overlap_end_ms": end,
        })

        offset += chunk_duration_ms
        chunk_index += 1

    return chunks

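The audio_start_ms field records where this chunk’s own content begins in the original file (excluding the overlap padding), while overlap_start_ms records where the exported chunk audio actually begins. Whisper reports timestamps relative to the start of the chunk file, so the stitch worker converts them to episode time by adding overlap_start_ms, and must then drop the words that two neighboring chunks both transcribed in the overlap.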
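The window arithmetic can be checked without decoding any audio. The sketch below reproduces only the boundary computation from the chunker; `ChunkWindow` and `plan_chunk_windows` are illustrative names, not part of the pipeline code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkWindow:
    chunk_index: int
    audio_start_ms: int    # nominal start, without overlap padding
    overlap_start_ms: int  # where the exported chunk audio actually begins
    overlap_end_ms: int    # where the exported chunk audio actually ends


def plan_chunk_windows(
    duration_ms: int,
    chunk_duration_ms: int = 30_000,
    overlap_ms: int = 3_000,
) -> list[ChunkWindow]:
    windows = []
    offset = 0
    while offset < duration_ms:
        windows.append(ChunkWindow(
            chunk_index=len(windows),
            audio_start_ms=offset,
            overlap_start_ms=max(0, offset - overlap_ms),
            overlap_end_ms=min(duration_ms, offset + chunk_duration_ms + overlap_ms),
        ))
        offset += chunk_duration_ms
    return windows


# A 90-minute episode: 180 windows, none longer than 36 seconds
windows = plan_chunk_windows(90 * 60 * 1000)
longest_ms = max(w.overlap_end_ms - w.overlap_start_ms for w in windows)
```

For interior chunks, `overlap_start_ms` is 3 seconds earlier than `audio_start_ms`, which is exactly the offset that chunk-relative timestamps have to be corrected by.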

Watch Out

Do not split at fixed byte offsets - split by time. MP3 and AAC are compressed, frame-based formats that are often encoded at a variable bitrate, so a fixed byte offset does not correspond to a fixed time offset and may land in the middle of a frame. Always decode to PCM first, then split by sample count, then re-encode the chunks.

ASR Model Integration

The ASR Worker is a GPU pod running Whisper (large-v3 or equivalent). It receives a chunk S3 key from the Job Queue, downloads the audio, runs inference, and writes the word-timestamped output back to S3.

Figure: ASR worker internals

# ASR worker - runs Whisper on a 30s audio chunk and writes output to S3
import whisper
import boto3
import json
import io

model = whisper.load_model("large-v3", device="cuda")

def transcribe_chunk(chunk_s3_key: str, episode_id: str, chunk_index: int) -> dict:
    s3 = boto3.client("s3")
    obj = s3.get_object(Bucket="podcast-audio", Key=chunk_s3_key)
    audio_bytes = obj["Body"].read()

    # Write to temp file (Whisper requires file path or numpy array)
    tmp_path = f"/tmp/{episode_id}_{chunk_index}.wav"
    with open(tmp_path, "wb") as f:
        f.write(audio_bytes)

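    # temperature=0.0 selects deterministic beam search; Whisper only applies
    # best_of when sampling (temperature > 0), so it has no effect here.
    result = model.transcribe(
        tmp_path,
        language="en",
        word_timestamps=True,
        beam_size=5,
        best_of=5,
        temperature=0.0,
    )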

    output = {
        "chunk_index": chunk_index,
        "episode_id": episode_id,
        "text": result["text"],
        "segments": [
            {
                "start": seg["start"],
                "end": seg["end"],
                "text": seg["text"],
                "words": seg.get("words", []),
            }
            for seg in result["segments"]
        ],
    }

    out_key = f"transcripts/{episode_id}/chunk_{chunk_index:04d}.json"
    s3.put_object(
        Bucket="podcast-audio",
        Key=out_key,
        Body=json.dumps(output).encode(),
    )
    return output

Each pod processes one chunk at a time. A single A100 GPU transcribes a 30-second chunk in approximately 1.5 seconds (20x real-time). Batching several chunks into one forward pass would improve GPU utilization, but every chunk in the batch then waits for the slowest one, which adds latency variance; for latency-sensitive episodes we prefer small batches or batch=1.

Real World

Descript (the podcast editing platform) published that they run Whisper at 30x real-time on A100s with batching enabled for archival work. For near-real-time use cases they run at 20x real-time with batch=1 to reduce tail latency. Apple Podcasts uses a similar chunked architecture with custom ASR models optimized for podcast audio characteristics.

Speaker Diarization

Speaker diarization answers the question “who spoke when?” It is a separate concern from transcription - the ASR model converts audio to text, but diarization assigns speaker labels to time intervals.

Diarization runs on the full original audio file (not the chunks) because speaker embedding models need global context to consistently identify that “the voice at 5:42” is the same person as “the voice at 47:12.”

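# Speaker diarization using pyannote.audio
import os

import boto3
import torch
from pyannote.audio import Pipeline

# Read the Hugging Face access token from the environment instead of hard-coding it
diarization_pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token=os.environ["HF_TOKEN"],
)
# pyannote.audio 3.x expects a torch.device here, not a plain string
diarization_pipeline.to(torch.device("cuda"))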

def diarize_episode(episode_id: str, s3_key: str) -> list[dict]:
    s3 = boto3.client("s3")
    obj = s3.get_object(Bucket="podcast-audio", Key=s3_key)
    tmp_path = f"/tmp/{episode_id}_full.wav"
    with open(tmp_path, "wb") as f:
        f.write(obj["Body"].read())

    diarization = diarization_pipeline(tmp_path)

    segments = []
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        segments.append({
            "start": turn.start,
            "end": turn.end,
            "speaker": speaker,  # e.g. "SPEAKER_00", "SPEAKER_01"
        })

    return segments

def merge_transcript_with_diarization(
    words: list[dict],             # stitched words: [{text, start_ms, end_ms, ...}]
    speaker_segments: list[dict],  # diarization turns: [{start, end, speaker}], in seconds
) -> list[dict]:
    merged = []
    for word in words:
        # Stitched words are in milliseconds; diarization turns are in seconds
        word_mid = (word["start_ms"] + word["end_ms"]) / 2 / 1000.0
        speaker = "SPEAKER_UNKNOWN"
        for seg in speaker_segments:
            if seg["start"] <= word_mid <= seg["end"]:
                speaker = seg["speaker"]
                break
        merged.append({**word, "speaker": speaker})
    return merged

Figure: Pipeline data flow from audio upload to indexed transcript

Key Insight

Diarization and ASR are independent - they run in parallel on the same audio file. Merging happens at the word level by timestamp: for each word in the ASR output, find the diarization segment whose time range contains the word’s midpoint. The nested-loop merge shown above is O(n*m) in the worst case; because both lists are sorted by time, a single two-pointer sweep brings it down to O(n+m).
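Since both lists are sorted by time, the midpoint lookup can be done in one linear pass. The sketch below is one way to write it, assuming stitched words carry `start_ms`/`end_ms` in milliseconds, diarization turns carry `start`/`end` in seconds, and turns do not overlap each other; `assign_speakers` is an illustrative name.

```python
def assign_speakers(words: list[dict], speaker_segments: list[dict]) -> list[dict]:
    """Label each word with the speaker whose turn contains the word's midpoint.

    Assumes both inputs are sorted by start time and that speaker turns do not
    overlap, so the segment pointer only ever moves forward.
    """
    merged = []
    j = 0
    for word in words:
        mid_s = (word["start_ms"] + word["end_ms"]) / 2 / 1000.0
        # Skip turns that ended before this word's midpoint
        while j < len(speaker_segments) and speaker_segments[j]["end"] < mid_s:
            j += 1
        speaker = "SPEAKER_UNKNOWN"
        if j < len(speaker_segments) and speaker_segments[j]["start"] <= mid_s:
            speaker = speaker_segments[j]["speaker"]
        merged.append({**word, "speaker": speaker})
    return merged


sample_words = [
    {"text": "hi", "start_ms": 0, "end_ms": 400},
    {"text": "there", "start_ms": 500, "end_ms": 900},
    {"text": "um", "start_ms": 1400, "end_ms": 1600},    # falls in a gap between turns
    {"text": "yes", "start_ms": 2100, "end_ms": 2300},
    {"text": "bye", "start_ms": 9000, "end_ms": 9200},   # after the last turn
]
sample_turns = [
    {"start": 0.0, "end": 1.0, "speaker": "SPEAKER_00"},
    {"start": 2.0, "end": 3.0, "speaker": "SPEAKER_01"},
]
labeled = assign_speakers(sample_words, sample_turns)
```

Overlapped speech, where two turns cover the same instant, would need a different tie-breaking rule than this sketch provides.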

Topic Boundary Detection

Topic boundary detection identifies the moments where the conversation shifts from one subject to another - the chapter boundaries. These are not silences or speaker changes; they are semantic transitions that require understanding the content.

The approach uses embedding-based cosine similarity: embed each sentence of the transcript, then look for points where consecutive sentences have a large drop in similarity - a “semantic cliff” that signals a topic shift.

# Topic boundary detection using sentence embeddings and cosine similarity drops
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def detect_chapter_boundaries(
    sentences: list[dict],          # [{text, start_ms, end_ms}]
    window_size: int = 3,           # sentences on each side of candidate boundary
    threshold: float = 0.35,        # cosine distance threshold for a boundary
    min_chapter_ms: int = 120_000,  # 2 minutes minimum chapter length
) -> list[dict]:
    if len(sentences) < window_size * 2 + 1:
        return []

    texts = [s["text"] for s in sentences]
    embeddings = embedder.encode(texts, batch_size=64, normalize_embeddings=True)

    boundaries = []
    last_boundary_ms = 0

    for i in range(window_size, len(sentences) - window_size):
        # Average embedding of sentences before and after the candidate point
        before = embeddings[i - window_size : i].mean(axis=0)
        after = embeddings[i : i + window_size].mean(axis=0)

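        # Cosine distance between before and after windows.
        # The mean of unit vectors is not itself unit-length, so re-normalize first.
        before = before / np.linalg.norm(before)
        after = after / np.linalg.norm(after)
        cos_sim = float(np.dot(before, after))
        cos_dist = 1.0 - cos_sim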

        candidate_ms = sentences[i]["start_ms"]
        if cos_dist >= threshold and (candidate_ms - last_boundary_ms) >= min_chapter_ms:
            boundaries.append({
                "sentence_index": i,
                "start_ms": candidate_ms,
                "cos_distance": round(cos_dist, 4),
                "title": extract_chapter_title(sentences[i - window_size : i + window_size]),
            })
            last_boundary_ms = candidate_ms

    return boundaries

def extract_chapter_title(sentences: list[dict]) -> str:
    # Simple keyword extraction: most frequent non-stopword words of 4+ letters
    import re
    from collections import Counter
    stopwords = {"the", "a", "an", "and", "or", "but", "in", "on", "at", "to", "for", "of", "is", "it", "that", "this"}
    text = " ".join(s["text"] for s in sentences)
    words = re.findall(r"\b[A-Za-z]{4,}\b", text.lower())
    freq = Counter(w for w in words if w not in stopwords)
    top_words = [w for w, _ in freq.most_common(3)]
    return " / ".join(w.title() for w in top_words) if top_words else "Chapter"

The window_size=3 means we look at the 3 sentences before and after each candidate point. A larger window smooths noise but misses short topic transitions. The threshold=0.35 is tuned empirically - lower values detect more boundaries (many false positives), higher values miss real boundaries.

Watch Out

Do not rely on punctuation for sentence boundaries in podcast transcripts. Many ASR models produce a stream of words with no sentence breaks, and even when a model such as Whisper emits punctuation, it can be inconsistent on fast, overlapping conversational speech. A more robust approach is to segment the transcript into pseudo-sentences by splitting on long pauses (gaps of 500ms or more between words). Whisper’s word timestamps make this straightforward.
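A minimal sketch of pause-based segmentation follows, assuming words shaped like the stitch output (`text`, `start_ms`, `end_ms`). The `max_words` cap is an added assumption so that a long stretch of unbroken speech does not become one enormous pseudo-sentence; the function names are illustrative.

```python
def _close(words: list[dict]) -> dict:
    """Collapse a run of words into one pseudo-sentence record."""
    return {
        "text": " ".join(w["text"].strip() for w in words),
        "start_ms": words[0]["start_ms"],
        "end_ms": words[-1]["end_ms"],
    }


def split_into_pseudo_sentences(
    words: list[dict],
    min_gap_ms: int = 500,
    max_words: int = 40,   # hypothetical safety cap, not from the original design
) -> list[dict]:
    sentences = []
    current: list[dict] = []
    for word in words:
        if current:
            gap_ms = word["start_ms"] - current[-1]["end_ms"]
            if gap_ms >= min_gap_ms or len(current) >= max_words:
                sentences.append(_close(current))
                current = []
        current.append(word)
    if current:
        sentences.append(_close(current))
    return sentences


sample = [
    {"text": " So", "start_ms": 0, "end_ms": 200},
    {"text": " welcome", "start_ms": 250, "end_ms": 600},
    {"text": " back", "start_ms": 650, "end_ms": 900},
    {"text": " Today", "start_ms": 1500, "end_ms": 1800},   # 600 ms pause before this word
    {"text": " we", "start_ms": 1850, "end_ms": 1950},
    {"text": " talk", "start_ms": 2000, "end_ms": 2300},
]
sentences = split_into_pseudo_sentences(sample)
```

The output records have the `{text, start_ms, end_ms}` shape that `detect_chapter_boundaries` takes as input.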

Transcript Indexing

Once the full transcript is assembled and speaker-labeled, it needs to be indexed for search. The search use case is “find the moment in this episode where they talked about X” - a phrase search that returns a timestamp, not just a document hit.

# Elasticsearch mapping for podcast transcript segments
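TRANSCRIPT_MAPPING = {
    "mappings": {
        "properties": {
            "episode_id":   {"type": "keyword"},
            "podcast_id":   {"type": "keyword"},
            "start_ms":     {"type": "long"},
            "end_ms":       {"type": "long"},
            "speaker":      {"type": "keyword"},
            "text":         {"type": "text", "analyzer": "transcript_english"},
            "chapter_id":   {"type": "keyword"},
            "published_at": {"type": "date"},
        }
    },
    "settings": {
        "number_of_shards": 5,
        "number_of_replicas": 1,
        "analysis": {
            # Custom token filters must be defined before an analyzer can reference them
            "filter": {
                "english_stop": {"type": "stop", "stopwords": "_english_"},
                "english_stemmer": {"type": "stemmer", "language": "english"},
            },
            "analyzer": {
                # Named distinctly so it does not shadow the built-in "english" analyzer
                "transcript_english": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "english_stop", "english_stemmer"],
                }
            },
        },
    },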
}

def index_transcript_segments(
    es_client,
    episode_id: str,
    segments: list[dict],  # [{start_ms, end_ms, speaker, text, chapter_id}]
    batch_size: int = 500,
) -> int:
    from elasticsearch.helpers import bulk
    actions = [
        {
            "_index": "podcast_transcripts",
            "_id": f"{episode_id}:{seg['start_ms']}",
            "_source": {
                "episode_id": episode_id,
                "start_ms": seg["start_ms"],
                "end_ms": seg["end_ms"],
                "speaker": seg.get("speaker", "SPEAKER_UNKNOWN"),
                "text": seg["text"],
                "chapter_id": seg.get("chapter_id"),
            },
        }
        for seg in segments
    ]

    success, errors = bulk(es_client, actions, chunk_size=batch_size, raise_on_error=False)
    return success

Documents are indexed at the segment level (roughly 10-30 words per segment) rather than the word level to keep index size manageable, while still allowing timestamp-precision search results.

Key Insight

Index at the segment granularity, not the word granularity. Word-level indexing inflates the index by 10x with no query quality benefit for podcast search. Users search for topics (“machine learning”), not individual words - segment-level indexing with phrase queries handles this correctly.
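To make the "find this moment" lookup concrete, here is a sketch of a query body that could be sent to the transcript index. It uses standard Elasticsearch query DSL (`bool`, `match_phrase`, `term`); the helper name `build_moment_query`, the `slop` value, and the result size are assumptions chosen for illustration.

```python
def build_moment_query(episode_id: str, phrase: str, size: int = 5) -> dict:
    """Build a phrase query scoped to one episode, returning timestamped segments."""
    return {
        "size": size,
        "query": {
            "bool": {
                # slop=1 tolerates one intervening or dropped word in the phrase
                "must": [{"match_phrase": {"text": {"query": phrase, "slop": 1}}}],
                # Filter context: restricts to the episode without affecting scoring
                "filter": [{"term": {"episode_id": episode_id}}],
            }
        },
        # Best matches first; ties broken by position in the episode
        "sort": ["_score", {"start_ms": "asc"}],
        "_source": ["start_ms", "end_ms", "speaker", "text", "chapter_id"],
    }


query = build_moment_query("ep-123", "machine learning")
```

Each hit's `start_ms` is the seek position to hand to the player, which is why segment-level documents are enough for this use case.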

Pipeline Parallelism

The pipeline has a dependency graph that determines which stages can run in parallel:

  • Chunker runs first (depends on: upload complete)
  • ASR Workers run in parallel across all chunks (depend on: Chunker for each chunk)
  • Diarization runs in parallel with ASR (depends on: upload complete, independently of Chunker)
  • Stitch Worker runs after all ASR chunks complete
  • Merge Worker runs after Stitch + Diarization both complete
  • Chapter Detection produces its final boundaries from the merged transcript (preliminary boundaries can be emitted earlier from the sliding window)
  • Index Writer runs after Chapter Detection

# Job DAG definition (simplified Airflow-style)
pipeline:
  name: podcast_transcription
  steps:
    - id: chunk_audio
      depends_on: []
      worker: chunker
      timeout_minutes: 5

    - id: diarize
      depends_on: []
      worker: diarization_gpu
      timeout_minutes: 20

    - id: asr_chunk_{i}
      depends_on: [chunk_audio]
      worker: asr_gpu
      parallelism: num_chunks    # one job per chunk
      timeout_minutes: 3

    - id: stitch_transcript
      depends_on: [asr_chunk_*]  # wait for all chunks
      worker: stitch
      timeout_minutes: 2

    - id: merge_diarization
      depends_on: [stitch_transcript, diarize]
      worker: merge
      timeout_minutes: 2

    - id: detect_chapters
      depends_on: [merge_diarization]
      worker: chapter_detector
      timeout_minutes: 3

    - id: index_search
      depends_on: [detect_chapters]
      worker: index_writer
      timeout_minutes: 3

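For a 60-minute episode (120 chunks), the critical path is: chunk_audio (2min) + asr_chunk (2min, parallel) + stitch (1min) + merge (1min) + detect_chapters (1min) + index (1min) = approximately 8 minutes. The stitched transcript itself is ready after the first three steps, at about 5 minutes, in line with the transcription target, and indexing finishes inside the 10-minute search freshness target. On GPU, diarization stays off the critical path as long as it finishes before stitching does. On the CPU tier at 3x real-time it takes about 20 minutes for the same episode, so there the merge step waits on diarization rather than on ASR.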
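The critical-path sum can be reproduced by computing the earliest finish time of each step in the DAG. In this sketch the step durations are the per-step estimates quoted above; the 4-minute GPU diarization time is an assumed value for illustration, and the 20-minute figure is 60 minutes of audio at the 3x real-time CPU rate used elsewhere in this design.

```python
def finish_times(steps: dict[str, tuple[float, list[str]]]) -> dict[str, float]:
    """Earliest finish time (minutes) per step: own duration plus slowest dependency."""
    memo: dict[str, float] = {}

    def finish(name: str) -> float:
        if name not in memo:
            duration, deps = steps[name]
            memo[name] = duration + max((finish(d) for d in deps), default=0)
        return memo[name]

    return {name: finish(name) for name in steps}


def pipeline_steps(diarize_min: float) -> dict[str, tuple[float, list[str]]]:
    return {
        "chunk_audio": (2, []),
        "diarize": (diarize_min, []),
        "asr_chunks": (2, ["chunk_audio"]),          # all chunks in parallel
        "stitch_transcript": (1, ["asr_chunks"]),
        "merge_diarization": (1, ["stitch_transcript", "diarize"]),
        "detect_chapters": (1, ["merge_diarization"]),
        "index_search": (1, ["detect_chapters"]),
    }


gpu_tier = finish_times(pipeline_steps(diarize_min=4))    # assumed GPU diarization time
cpu_tier = finish_times(pipeline_steps(diarize_min=20))   # 60 min of audio at 3x real-time
```

With the GPU assumption the index step finishes at minute 8; with CPU diarization the same DAG finishes at minute 23, because diarization becomes the longest branch into the merge step.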

Cost-Latency Tradeoff

GPU instance cost is the dominant expense. An A100 instance costs approximately $3.50/hour. Processing a 60-minute episode requires 120 chunk-ASR jobs, each taking 1.5 seconds - 180 seconds of GPU time, or $0.175 per episode at full A100 rate. At 50,000 hours/day = 50,000 episodes of average 60 minutes, daily GPU cost is approximately $8,750.

Given:
  - 50,000 hours of audio per day
  - Average episode: 60 minutes
  - Episodes per day: 50,000
  - ASR: 1.5s per 30s chunk on A100
  - GPU time per episode: 120 chunks x 1.5s = 180s = 3 GPU-minutes

GPU compute cost (A100 at $3.50/hr):
  - Per episode: 3/60 hours x $3.50 = $0.175
  - Daily: 50,000 x $0.175 = $8,750/day

Storage cost (S3):
  - Raw audio: 50,000 episodes x 60MB avg = 3TB/day
  - At $0.023/GB-month: 3,000 x $0.023 = $69/month for each day of uploads (cumulative while retained)
  - Transcript JSON: ~500KB per episode = 25GB/day = negligible

Elasticsearch index:
  - ~5KB per indexed segment, ~200 segments per episode = 1MB per episode
  - Daily ingest: 50GB; at $0.10/GB-month = $5/month for each day of ingest

Cost tiering reduces GPU spend significantly. Episodes expected to draw between 100 and 1,000 plays are processed on spot GPU instances at a 70% discount. Episodes with fewer than 100 expected plays use batch CPU transcription (Whisper runs on CPU at 3x real-time - 20 minutes for a 60-minute episode) with no latency SLA guarantee.
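The routing rule and its effect on per-episode ASR cost can be sketched as follows. The thresholds, the $3.50/hour A100 rate, the 70% spot discount, and the 1.5 seconds per 30-second chunk come from the figures above; the function names are illustrative, the tier labels reuse the `processing_tier` values from the data model, and CPU cost is left out because no CPU price is given.

```python
import math

A100_HOURLY_USD = 3.50
SPOT_DISCOUNT = 0.70
ASR_SECONDS_PER_CHUNK = 1.5
CHUNK_SECONDS = 30


def choose_tier(expected_plays: int) -> str:
    if expected_plays < 100:
        return "batch"      # CPU transcription, no latency SLA
    if expected_plays < 1_000:
        return "standard"   # spot GPU
    return "premium"        # on-demand GPU


def asr_gpu_cost_usd(duration_min: float, tier: str) -> float:
    """GPU cost of the ASR step only; diarization and idle time are not included."""
    if tier == "batch":
        return 0.0  # runs on CPU; CPU pricing is not modeled here
    chunks = math.ceil(duration_min * 60 / CHUNK_SECONDS)
    gpu_hours = chunks * ASR_SECONDS_PER_CHUNK / 3600
    rate = A100_HOURLY_USD
    if tier == "standard":
        rate *= 1 - SPOT_DISCOUNT
    return round(gpu_hours * rate, 4)


premium_cost = asr_gpu_cost_usd(60, choose_tier(50_000))
standard_cost = asr_gpu_cost_usd(60, choose_tier(500))
```

A 60-minute episode costs about $0.175 of GPU time on the premium tier and about $0.0525 on spot instances under these assumptions.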

Figure: Horizontal scaling of the transcription worker fleet

Real World

Spotify’s podcast transcription system (launched in 2023) processes over 5 million episode hours per year. They published that they use a combination of their own ASR models and third-party APIs for different quality tiers, with cost-based routing similar to this design. YouTube’s automatic captions system uses a similar chunked pipeline with Google’s own ASR infrastructure.

Data Model

-- Episode metadata - one row per uploaded podcast episode
CREATE TABLE podcast_episodes (
    episode_id      UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    podcast_id      UUID NOT NULL,
    title           TEXT NOT NULL,
    audio_s3_key    TEXT NOT NULL,
    duration_ms     BIGINT,
    uploaded_at     TIMESTAMPTZ DEFAULT now(),
    transcript_status TEXT DEFAULT 'pending'
        CHECK (transcript_status IN ('pending','chunking','transcribing','stitching','indexed','failed')),
    processing_tier TEXT DEFAULT 'standard'
        CHECK (processing_tier IN ('premium','standard','batch'))
);

CREATE INDEX idx_episodes_podcast ON podcast_episodes(podcast_id, uploaded_at DESC);
CREATE INDEX idx_episodes_status ON podcast_episodes(transcript_status) WHERE transcript_status != 'indexed';

-- Transcription jobs - tracks per-chunk work
CREATE TABLE transcription_jobs (
    job_id          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    episode_id      UUID NOT NULL REFERENCES podcast_episodes(episode_id),
    chunk_index     INT NOT NULL,
    audio_s3_key    TEXT NOT NULL,
    audio_start_ms  BIGINT NOT NULL,
    audio_end_ms    BIGINT NOT NULL,
    status          TEXT DEFAULT 'queued',
    worker_id       TEXT,
    started_at      TIMESTAMPTZ,
    completed_at    TIMESTAMPTZ,
    output_s3_key   TEXT,
    UNIQUE(episode_id, chunk_index)
);

CREATE INDEX idx_jobs_episode ON transcription_jobs(episode_id, chunk_index);

-- Assembled transcript segments - word-level, with speaker labels
CREATE TABLE transcript_segments (
    segment_id      BIGSERIAL,
    episode_id      UUID NOT NULL,
    start_ms        BIGINT NOT NULL,
    end_ms          BIGINT NOT NULL,
    text            TEXT NOT NULL,
    speaker         TEXT,
    confidence      FLOAT,
    chapter_id      UUID,
    -- A primary key on a partitioned table must include the partition key
    PRIMARY KEY (episode_id, segment_id)
) PARTITION BY LIST (episode_id);  -- partition per episode for fast deletes

CREATE INDEX idx_segments_episode_time ON transcript_segments(episode_id, start_ms);

-- Chapter markers - auto-detected topic boundaries
CREATE TABLE chapter_markers (
    chapter_id      UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    episode_id      UUID NOT NULL REFERENCES podcast_episodes(episode_id),
    start_ms        BIGINT NOT NULL,
    title           TEXT,
    cos_distance    FLOAT,  -- boundary strength signal
    created_at      TIMESTAMPTZ DEFAULT now()
);

CREATE INDEX idx_chapters_episode ON chapter_markers(episode_id, start_ms);

-- Speaker segments - diarization output
CREATE TABLE speaker_segments (
    id              BIGSERIAL PRIMARY KEY,
    episode_id      UUID NOT NULL,
    speaker_label   TEXT NOT NULL,
    start_ms        BIGINT NOT NULL,
    end_ms          BIGINT NOT NULL
);

CREATE INDEX idx_speakers_episode_time ON speaker_segments(episode_id, start_ms);

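The transcript_segments table is list-partitioned by episode_id so that deleting a retranscribed episode drops one partition rather than running a slow DELETE WHERE episode_id = ? on a large table. The trade-off is partition count: at tens of thousands of new episodes per day, one partition per episode quickly reaches a number that PostgreSQL's planner handles poorly, so in practice a partition usually has to cover a batch of episodes (for example, hash or date-range partitions), at the price of a targeted DELETE within that partition.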

Key Algorithms and Protocols

Transcript Stitch from Overlapping Chunks

The stitch algorithm merges the overlapping chunk transcripts (120 for a 60-minute episode) into a single coherent transcript without duplicating words in the overlap regions:

# Stitch overlapping chunk transcripts into a single word stream
def stitch_chunks(
    chunks: list[dict],  # each has {chunk_index, overlap_start_ms, segments:[{words}]}
) -> list[dict]:
    result_words = []
    last_word_end_ms = 0

    for chunk in sorted(chunks, key=lambda c: c["chunk_index"]):
        # Whisper timestamps are relative to the start of the chunk file, which
        # begins at overlap_start_ms (audio_start_ms minus the leading padding)
        base_ms = chunk["overlap_start_ms"]
        for seg in chunk["segments"]:
            for word in seg.get("words", []):
                # Convert chunk-relative timestamps to episode-absolute timestamps
                abs_start = base_ms + int(word["start"] * 1000)
                abs_end = base_ms + int(word["end"] * 1000)

                # Skip words that fall in the overlap with the previous chunk
                if abs_start < last_word_end_ms - 100:  # 100ms tolerance
                    continue

                result_words.append({
                    "text": word["word"],
                    "start_ms": abs_start,
                    "end_ms": abs_end,
                    "confidence": word.get("probability", 1.0),
                })
                last_word_end_ms = abs_end

    return result_words

The key is abs_start < last_word_end_ms - 100: any word whose absolute start time is earlier than the last accepted word’s end time (minus a 100ms tolerance) is in the overlap and is skipped.
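One limitation of the last-word-end rule is that a word truncated at the very end of a chunk is accepted before the next chunk's complete version of it is seen. A possible variant, sketched here as an alternative rather than as the pipeline's actual method, lets each chunk contribute only the words that start inside its own nominal 30-second window. It assumes each chunk record carries both `audio_start_ms` and `overlap_start_ms` from the chunker; `stitch_by_nominal_window` is an illustrative name.

```python
def stitch_by_nominal_window(chunks: list[dict], chunk_duration_ms: int = 30_000) -> list[dict]:
    """Keep a word only from the chunk whose nominal window contains its start time."""
    result = []
    for chunk in sorted(chunks, key=lambda c: c["chunk_index"]):
        own_start = chunk["audio_start_ms"]
        own_end = own_start + chunk_duration_ms
        base_ms = chunk["overlap_start_ms"]  # chunk-relative times are measured from here
        for seg in chunk["segments"]:
            for word in seg.get("words", []):
                abs_start = base_ms + round(word["start"] * 1000)
                abs_end = base_ms + round(word["end"] * 1000)
                if own_start <= abs_start < own_end:
                    result.append({"text": word["word"], "start_ms": abs_start, "end_ms": abs_end})
    return result


# Two neighboring chunks that both transcribed the words around the 30 s boundary
chunk0 = {"chunk_index": 0, "audio_start_ms": 0, "overlap_start_ms": 0, "segments": [{"words": [
    {"word": " hello", "start": 29.0, "end": 29.5},
    {"word": " unbelievable", "start": 29.8, "end": 30.6},
    {"word": " truly", "start": 30.7, "end": 31.0},
]}]}
chunk1 = {"chunk_index": 1, "audio_start_ms": 30_000, "overlap_start_ms": 27_000, "segments": [{"words": [
    {"word": " hello", "start": 2.0, "end": 2.5},
    {"word": " unbelievable", "start": 2.8, "end": 3.6},
    {"word": " truly", "start": 3.7, "end": 4.0},
    {"word": " great", "start": 4.1, "end": 4.5},
]}]}
stitched = stitch_by_nominal_window([chunk1, chunk0])
```

Every kept word sits at least one overlap width away from the edge of the audio it was decoded from. The remaining risk is timestamp jitter: if two chunks disagree about whether a word starts just before or just after the boundary, it can still be dropped or duplicated, so a small tolerance check may be needed.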

Cosine Similarity Drop for Boundary Scoring

# Compute per-sentence similarity drop signal for chapter boundary detection
def similarity_drop_signal(embeddings: np.ndarray, window: int = 3) -> np.ndarray:
    n = len(embeddings)
    drops = np.zeros(n)
    for i in range(window, n - window):
        before = embeddings[max(0, i - window):i].mean(axis=0)
        after  = embeddings[i:min(n, i + window)].mean(axis=0)
        drops[i] = 1.0 - float(np.dot(before / np.linalg.norm(before),
                                       after  / np.linalg.norm(after)))
    return drops  # peaks in this array = chapter boundaries

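Time complexity: O(n * window) vector operations, where n is the number of sentences. For a 90-minute episode with ~1,800 sentences and window=3, the scoring pass runs in under 50ms on CPU; computing the sentence embeddings themselves is a separate, larger cost.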
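Turning the drop signal into chapter boundaries still requires picking peaks. One simple option, sketched here under the same 0.35 threshold and 2-minute minimum chapter length used earlier, is to take local maxima above the threshold and accept them strongest-first while enforcing the minimum spacing. The function name and the greedy selection rule are illustrative choices, not something the design above prescribes.

```python
def pick_boundaries(
    drops: list[float],
    start_ms: list[int],
    threshold: float = 0.35,
    min_chapter_ms: int = 120_000,
) -> list[int]:
    """Return sentence indices chosen as chapter boundaries, in time order."""
    # Candidates are local maxima of the drop signal that clear the threshold
    candidates = [
        i for i in range(1, len(drops) - 1)
        if drops[i] >= threshold and drops[i] >= drops[i - 1] and drops[i] > drops[i + 1]
    ]
    chosen: list[int] = []
    # Strongest drops win when two candidates are closer than the minimum chapter length
    for i in sorted(candidates, key=lambda idx: drops[idx], reverse=True):
        far_from_start = start_ms[i] - start_ms[0] >= min_chapter_ms
        far_from_chosen = all(abs(start_ms[i] - start_ms[j]) >= min_chapter_ms for j in chosen)
        if far_from_start and far_from_chosen:
            chosen.append(i)
    return sorted(chosen)


drops = [0.0, 0.1, 0.5, 0.2, 0.1, 0.6, 0.4, 0.1, 0.45, 0.1]
starts = [0, 60_000, 130_000, 190_000, 250_000, 300_000, 330_000, 400_000, 460_000, 520_000]
boundaries = pick_boundaries(drops, starts)

# Two peaks only 100 s apart: the stronger one is kept
close_peaks = pick_boundaries([0.0, 0.5, 0.1, 0.6, 0.1], [0, 150_000, 200_000, 250_000, 300_000])
```

Compared with accepting the first index that crosses the threshold, this favors the sharpest transition in each neighborhood. It needs the whole signal up front, so it suits the final pass better than the incremental one.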

Scaling and Performance

Throughput sizing for 50,000 hours/day upload rate:

Peak upload rate (assuming 3x burst factor):
  - Average: 50,000 hours/day = 2,083 hours/hour = 34.7 hours/minute
  - Peak: 34.7 * 3 = 104 episode-hours/minute

ASR GPU requirement at peak:
  - 104 episode-hours/minute = 6,240 episode-minutes/minute
  - Each GPU-minute handles 20x real-time = 20 audio-minutes
  - GPUs needed: 6,240 / 20 = 312 A100 GPUs at peak

Auto-scaling behavior:
  - Target: 80% GPU utilization on each pod
  - Scale-out trigger: job queue depth > 100 per pod
  - Scale-in: queue depth < 10 for 5 consecutive minutes
  - Warm-up time for new GPU pod: ~90 seconds (model load)

Diarization (CPU-based at non-premium tier):
  - 1 CPU-minute processes 3 real-minutes of audio
  - 104 episode-hours/minute = 6,240 audio-minutes/minute
  - CPU cores needed: 6,240 / 3 = 2,080 vCPUs at peak
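The same sizing can be written as a small function so the inputs are easy to vary. The defaults are the figures above; applying the 80% utilization target to the GPU count is an added step that the arithmetic above does not take, and `peak_capacity` is an illustrative name. Because nothing is rounded along the way, the results land one or a few units above the rounded figures (313 rather than 312 GPUs, 2,084 rather than 2,080 vCPUs).

```python
import math


def peak_capacity(
    hours_per_day: float = 50_000,
    burst_factor: float = 3.0,
    asr_realtime_factor: float = 20.0,
    cpu_diarization_realtime_factor: float = 3.0,
    target_utilization: float = 0.80,
) -> dict[str, float]:
    # Audio-minutes arriving per wall-clock minute at peak
    peak_audio_min_per_min = hours_per_day * burst_factor * 60 / (24 * 60)
    gpus_fully_busy = peak_audio_min_per_min / asr_realtime_factor
    return {
        "peak_audio_min_per_min": peak_audio_min_per_min,
        "gpus_fully_busy": math.ceil(gpus_fully_busy),
        # Headroom so that each GPU runs at the target utilization instead of 100%
        "gpus_at_target_utilization": math.ceil(gpus_fully_busy / target_utilization),
        "diarization_vcpus": math.ceil(peak_audio_min_per_min / cpu_diarization_realtime_factor),
    }


sizing = peak_capacity()
```

Holding GPUs at 80% utilization rather than 100% raises the peak requirement to roughly 390 A100s, which is the number an autoscaler ceiling would more plausibly be set from.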

Real World

Spotify uses Kubernetes horizontal pod autoscaling for their transcription workers with GPU-aware scheduling. Their blog post on podcast transcription scaling noted that GPU warm-up latency (model load time) was a critical bottleneck - they solved it by keeping a minimum of 10 warm GPU pods running at all times regardless of queue depth.

Failure Modes and Recovery

| Failure | Detection | Impact | Recovery |
| --- | --- | --- | --- |
| ASR worker crash mid-chunk | Job heartbeat timeout (30s); job returns to queue | Chunk not transcribed; stitch blocks waiting | Chunk job re-queued with exponential backoff; idempotent output key prevents double-write |
| Chunk ordering mismatch in stitch | Monotonicity check: abs_start should increase; gap > 5s between consecutive words triggers alert | Garbled transcript around the boundary | Stitch worker detects and re-requests affected chunks from ASR |
| Diarization timeout (full audio) | Job timeout (20min); alert fired | Episode transcribed without speaker labels; published with “Unknown Speaker” | Async retry of diarization; once complete, a merge patch job updates the segment speaker labels |
| Elasticsearch write failure | Bulk index returns error count > 0 | Episode missing from search | Index writer retries with exponential backoff; dead-letter queue for persistent failures |
| S3 upload failure (chunk write) | PutObject raises exception | Chunk not available for ASR workers | Chunker retries S3 write up to 3 times; if all fail, episode job is marked failed and retried from scratch after 10 minutes |

Watch Out

The most common production failure is partial completion: 119 of 120 chunks transcribe successfully but chunk 67 fails permanently (corrupted audio segment). The stitch worker waits indefinitely for chunk 67. Always set a stitch timeout: if all chunks except N are complete and the wait exceeds 2x the expected processing time, proceed with a gap marker (“transcript unavailable for this segment”) rather than blocking the entire episode indefinitely.
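The timeout rule can be expressed as a small decision function. This is a sketch under the assumptions stated in the paragraph above (proceed once the wait exceeds 2x the expected processing time); the function name, the returned action labels, and the gap-record shape are hypothetical.

```python
GAP_TEXT = "[transcript unavailable for this segment]"


def plan_stitch(
    total_chunks: int,
    completed: set[int],
    waited_s: float,
    expected_s: float,
    chunk_duration_ms: int = 30_000,
) -> dict:
    """Decide whether to stitch now, keep waiting, or stitch around missing chunks."""
    missing = sorted(set(range(total_chunks)) - completed)
    if not missing:
        return {"action": "stitch", "gaps": []}
    if waited_s <= 2 * expected_s:
        return {"action": "wait", "gaps": []}
    # Give up on the stragglers and mark the time ranges they would have covered
    gaps = [
        {
            "start_ms": i * chunk_duration_ms,
            "end_ms": (i + 1) * chunk_duration_ms,
            "text": GAP_TEXT,
        }
        for i in missing
    ]
    return {"action": "stitch_with_gaps", "gaps": gaps}


done = set(range(120)) - {67}
still_waiting = plan_stitch(120, done, waited_s=100, expected_s=120)
gave_up = plan_stitch(120, done, waited_s=400, expected_s=120)
```

The gap record uses the nominal window of the missing chunk, so the rest of the transcript keeps its original timestamps and the gap can be patched later if a retry succeeds.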

Comparison of Approaches

| Approach | Latency | Cost | Cold Start WER | Diarization | Best Fit |
| --- | --- | --- | --- | --- | --- |
| Cloud ASR (Google Speech-to-Text) | Low (2min) | High ($0.006/min) | Low (8-10%) | Separate API | Small volumes, no infra |
| Self-Hosted Whisper (sequential) | High (20min for 60min ep) | Low | Medium (7-9%) | Separate job | Batch / archival transcription |
| Self-Hosted Whisper + Parallel Chunks (this design) | Medium (5min) | Medium | Medium (7-9%) | Parallel | Production platform at scale |
| Streaming ASR (incremental) | Very Low (real-time) | High | High (12-15%) | None | Live captions, not podcast |
| Federated Transcription (user device) | Variable | Near-zero server cost | Varies | None | Privacy-sensitive use cases |

The self-hosted parallel chunking approach wins for a platform at scale because it gives the best cost-to-latency ratio without vendor lock-in. Cloud ASR is simpler to operate but 3-5x more expensive per minute of audio and provides no model customization. Streaming ASR has higher WER because it cannot use bidirectional context - the model can only see audio that has already happened.

Key Takeaways

  • Chunked audio processing with overlapping boundaries enables O(n/P) transcription latency where P is the number of parallel GPU workers.
  • ASR model integration via Whisper provides word-level timestamps for free - use them to enable timestamp-precision search and stitch alignment.
  • Speaker diarization runs independently of ASR on the full audio file; merging by timestamp at word granularity produces accurate speaker-labeled transcripts.
  • Topic boundary detection via cosine similarity drops on sentence embeddings is a reliable, low-latency alternative to full-transcript LLM chapter detection.
  • Transcript indexing at segment granularity balances index size and search precision for the podcast “find this moment” use case.
  • Pipeline parallelism across chunking, ASR, and diarization is the critical design choice - without it, latency scales linearly with episode length.
  • Cost-latency tiering (GPU for premium, spot GPU for standard, CPU for batch) reduces daily GPU spend by 40-60% on long-tail content.

The counter-intuitive lesson is that the chapter detection problem is harder than the transcription problem. ASR quality is well-understood and can be benchmarked with WER. Chapter boundary quality is subjective - two humans labeling the same episode often agree on only 60-70% of boundaries. Designing for 70% recall against human labels is the right target, but communicating that to stakeholders requires reframing: chapters are a navigation aid, not a perfect summary. An imperfect chapter at the right place is more useful than no chapter at all.

Frequently Asked Questions

Q: Why not use a single streaming ASR approach instead of chunking? A: Streaming ASR processes audio in real-time, producing captions as audio arrives. It has two problems for podcast transcription: WER is 3-5 percentage points higher than batch ASR because the model cannot use bidirectional context, and it requires a persistent connection to the audio source which is incompatible with pre-recorded uploads. Chunked batch ASR is strictly better for accuracy on pre-recorded content.

Q: Why not run chapter detection with an LLM (GPT-4) for better quality? A: A 90-minute transcript is approximately 13,500 words, roughly 18,000 tokens, which exceeds smaller LLM context windows and forces the transcript to be split. More importantly, an LLM call adds 10-30 seconds of latency and significant cost per episode. The cosine-similarity approach runs in under 1 second on CPU and produces results within 5-10% of LLM quality for chapter boundary detection. Use LLM chapter generation only for premium tier where users expect AI-generated chapter summaries, not just boundary markers.

Q: How do you handle non-English podcasts? A: Whisper is multilingual - pass language=None to auto-detect, or use explicit language codes. The accuracy drops for lower-resource languages (Swahili, Yoruba) compared to English. For diarization, pyannote.audio is language-agnostic since it operates on acoustic features. Chapter detection via sentence embeddings requires a multilingual model (e.g., paraphrase-multilingual-MiniLM-L12-v2).

Q: Why partition transcript_segments by episode_id instead of by time? A: The primary query pattern is “give me all segments for episode X” not “give me all segments from time T1 to T2 across all episodes.” Time-based partitioning would scatter a single episode’s segments across many partitions, slowing the most common query. List partitioning by episode_id means each episode’s 200-500 segments are in one partition, and deleting a re-transcribed episode drops that partition instantly.

Q: How do you detect music and ads versus spoken content for more accurate chapter detection? A: Music and ad segments appear as periods of high audio energy with low speech probability from the VAD (Voice Activity Detection) preprocessor. These segments are tagged as SEGMENT_TYPE=music or SEGMENT_TYPE=ad rather than passed to the ASR model. Chapter boundaries are only detected within SEGMENT_TYPE=speech runs, which avoids false chapter boundaries at every music transition.
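
Restricting chapter detection to speech runs can be sketched as a grouping step over the tagged segments. The `segment_type` key and the `speech_runs` helper are hypothetical names chosen for this example; the VAD tagging itself is assumed to have happened upstream.

```python
from itertools import groupby


def speech_runs(segments):
    """Split time-ordered segments into runs of consecutive speech.

    Music and ad segments act as separators and are dropped, so chapter
    detection never compares embeddings across a music or ad break.
    """
    return [
        list(group)
        for seg_type, group in groupby(segments, key=lambda s: s["segment_type"])
        if seg_type == "speech"
    ]


segments = [
    {"segment_type": "speech", "start_ms": 0},
    {"segment_type": "speech", "start_ms": 30_000},
    {"segment_type": "music", "start_ms": 60_000},
    {"segment_type": "speech", "start_ms": 75_000},
    {"segment_type": "ad", "start_ms": 105_000},
    {"segment_type": "speech", "start_ms": 135_000},
]
runs = speech_runs(segments)
```

Each run can then be passed to the boundary detector on its own.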

Q: What is the false-positive rate for chapter boundaries and how do you tune it? A: With threshold=0.35 and min_chapter_ms=120000, typical episodes produce 8-15 chapter boundaries. The false positive rate (the share of detected boundaries where humans say there is no topic shift) is approximately 20-30%, or roughly one in four detected boundaries. Increasing the threshold to 0.45 reduces false positives to about 10% but misses 30% of real boundaries. The threshold is a slider between precision and recall; 0.35 optimizes for recall (don’t miss real chapters) at the cost of precision.
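
To make the two tuning knobs concrete, here is a minimal sketch of a drop-based detector. It assumes the threshold applies to the similarity drop (1 minus the cosine similarity between adjacent segment embeddings), which matches the behavior described above where a higher threshold yields fewer boundaries. The function names and the toy 2-D embeddings are illustrative only; real sentence embeddings have hundreds of dimensions.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def detect_boundaries(segments, threshold=0.35, min_chapter_ms=120_000):
    """Return the start_ms of every segment that opens a new chapter.

    segments: time-ordered list of (start_ms, embedding) pairs.
    A boundary is emitted when the similarity drop versus the previous
    segment reaches `threshold` and the current chapter is already at
    least `min_chapter_ms` long.
    """
    boundaries = []
    last_boundary_ms = segments[0][0] if segments else 0
    for (_, prev_emb), (start_ms, emb) in zip(segments, segments[1:]):
        drop = 1.0 - cosine(prev_emb, emb)
        if drop >= threshold and start_ms - last_boundary_ms >= min_chapter_ms:
            boundaries.append(start_ms)
            last_boundary_ms = start_ms
    return boundaries


toy = [
    (0, [1.0, 0.0]),
    (60_000, [0.98, 0.2]),    # same topic, tiny drop
    (130_000, [0.0, 1.0]),    # large drop, chapter long enough
    (150_000, [1.0, 0.0]),    # large drop, but only 20s into the chapter
    (300_000, [1.0, 0.1]),    # same topic again
]
default_boundaries = detect_boundaries(toy)
no_min_length = detect_boundaries(toy, min_chapter_ms=0)
```

With the defaults, only the shift at 130,000 ms is kept; the shift at 150,000 ms is suppressed by `min_chapter_ms`, which is how the minimum length trades a few missed boundaries for fewer spurious ones.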

Interview Questions

Q: A 3-hour podcast uploads and the transcript is due within 5 minutes. Walk through how your pipeline achieves this. Expected depth: 360 chunks at 30s each, 120 ASR workers in parallel, each chunk taking 1.5s on A100 = 3 waves of chunks, about 4.5s total ASR time on critical path (1.5s only if the fleet scales to 360 workers); overhead from chunk upload to S3 + queue propagation adds 30-60s; stitch + merge + index adds 2-3 minutes; total under 5 minutes with sufficient GPU fleet.
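
The latency budget in this answer is easy to check with a few lines of arithmetic. The sketch below assumes chunks are scheduled in waves across a fixed worker pool and ignores the overlap padding on each chunk; the function name and the choice of the upper-bound figures for queueing and post-processing are assumptions for the example.

```python
import math


def asr_critical_path(episode_s, chunk_s=30, workers=120, per_chunk_s=1.5):
    """Return (chunks, waves, asr_seconds) for a wave-scheduled ASR stage."""
    chunks = math.ceil(episode_s / chunk_s)
    waves = math.ceil(chunks / workers)   # each worker handles one chunk per wave
    return chunks, waves, waves * per_chunk_s


chunks, waves, asr_s = asr_critical_path(3 * 3600)

queue_s = 60       # upper bound for S3 upload + queue propagation
post_s = 3 * 60    # upper bound for stitch + merge + index
total_s = asr_s + queue_s + post_s

# With one worker per chunk the ASR stage collapses to a single wave.
_, _, asr_full_fleet_s = asr_critical_path(3 * 3600, workers=360)
```

With 120 workers the 360 chunks run in 3 waves, so ASR contributes about 4.5s and the worst-case total is 244.5s, inside the 5-minute target. The ASR stage is a small slice of the budget; queueing and post-processing dominate.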

Q: How do you handle a chunk whose ASR output has very low confidence (average log-prob below -2.0)? Expected depth: Flag low-confidence chunks in the output JSON; stitch worker marks those segments with confidence=low; retry with beam search (for example beam_size=5 at temperature=0), falling back to sampling at temperature=0.2, rather than greedy decoding; surface to UI as “[unclear]” rather than low-quality text; alert if more than 10% of chunks in an episode have low confidence.
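
The flagging and alerting steps can be sketched as a pure function over chunk results. The dictionary keys (`chunk_id`, `text`, `avg_logprob`, `display_text`) and the `flag_low_confidence` helper are assumptions for illustration; the retry step is omitted, so this represents the state after any retry has already run.

```python
LOW_CONF_LOGPROB = -2.0   # chunks below this average log-prob are low confidence
ALERT_FRACTION = 0.10     # alert when more than 10% of chunks are low confidence


def flag_low_confidence(chunks):
    """Annotate chunks with a confidence label and decide whether to alert.

    chunks: list of dicts with 'chunk_id', 'text', and 'avg_logprob'.
    Returns (annotated_chunks, alert).
    """
    annotated = []
    low_count = 0
    for chunk in chunks:
        is_low = chunk["avg_logprob"] < LOW_CONF_LOGPROB
        low_count += is_low
        annotated.append({
            **chunk,
            "confidence": "low" if is_low else "ok",
            # The UI shows a placeholder instead of low-quality text.
            "display_text": "[unclear]" if is_low else chunk["text"],
        })
    alert = bool(chunks) and low_count / len(chunks) > ALERT_FRACTION
    return annotated, alert


episode = [{"chunk_id": i, "text": f"chunk {i}", "avg_logprob": -0.4} for i in range(10)]
episode[3]["avg_logprob"] = -2.6
annotated, alert = flag_low_confidence(episode)
```

Keeping the original `text` alongside `display_text` means the raw output remains available for debugging or a later re-transcription.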

Q: The diarization model incorrectly merges two similar-sounding speakers throughout a 2-hour interview. How do you fix this without re-processing the full audio? Expected depth: Diarization models embed speaker characteristics and cluster them, so similar voices get merged. The fix requires re-running diarization with a num_speakers=2 hint (if known), or using voice enrollment (provide reference clips of each speaker). Diarization itself cannot be fixed without re-running it on the full audio; however, the ASR output is reused and the merge worker is idempotent, so re-running only the diarize + merge steps avoids the GPU transcription cost.

Q: The Elasticsearch index grows by 50GB per day. After 30 days, query latency starts degrading. How do you address this? Expected depth: Time-based index lifecycle policy in Elasticsearch: episodes older than 90 days move to warm tier (fewer replicas, slower disks), older than 1 year to cold tier (S3 backing). Force-merge segments on older shards to reduce segment count. Shard allocation tuning: keep active index at 30GB per shard. Consider index aliases to keep search transparent across the tier transitions.

Q: Design the cost attribution system so each podcast creator is billed for their transcription usage. Expected depth: Tag every GPU job with podcast_id and episode_id; log GPU-seconds per job to a metering service; aggregate by podcast_id per billing cycle; handle spot instance pricing (cost per job varies); implement a credit system for free-tier creators; expose per-episode cost breakdown in creator dashboard.
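
The metering and aggregation steps can be sketched as follows. This is an illustration under stated assumptions: the job record fields, the `bill_by_podcast` function, the per-job hourly rate (to model spot versus on-demand pricing), and all prices are hypothetical.

```python
from collections import defaultdict


def bill_by_podcast(jobs, free_credit_usd=None):
    """Aggregate GPU job costs per episode and per podcast.

    jobs: iterable of dicts with 'podcast_id', 'episode_id', 'gpu_seconds',
          and 'usd_per_gpu_hour' (the rate in effect when the job ran).
    free_credit_usd: optional {podcast_id: credit} applied to the invoice.
    Returns (per_episode_cost, invoice_by_podcast).
    """
    per_episode = defaultdict(float)
    for job in jobs:
        cost = job["gpu_seconds"] / 3600 * job["usd_per_gpu_hour"]
        per_episode[(job["podcast_id"], job["episode_id"])] += cost

    per_podcast = defaultdict(float)
    for (podcast_id, _episode_id), cost in per_episode.items():
        per_podcast[podcast_id] += cost

    credits = free_credit_usd or {}
    invoice = {
        podcast_id: round(max(0.0, cost - credits.get(podcast_id, 0.0)), 4)
        for podcast_id, cost in per_podcast.items()
    }
    return dict(per_episode), invoice


jobs = [
    {"podcast_id": "p1", "episode_id": "e1", "gpu_seconds": 1800, "usd_per_gpu_hour": 3.0},
    {"podcast_id": "p1", "episode_id": "e1", "gpu_seconds": 1800, "usd_per_gpu_hour": 1.0},
    {"podcast_id": "p2", "episode_id": "e9", "gpu_seconds": 3600, "usd_per_gpu_hour": 1.0},
]
per_episode, invoice = bill_by_podcast(jobs, free_credit_usd={"p2": 5.0})
```

Recording the rate on each job, rather than applying one rate at aggregation time, is what lets spot and on-demand jobs for the same episode be billed correctly; the per-episode totals feed the creator dashboard breakdown.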
