Build a Podcast Transcription and Chapter Detection Pipeline
How do you transcribe hours of audio, detect chapter breaks, and make it searchable - all before the listener finishes the first episode?
Transcribing a podcast is like turning a live lecture into a textbook with chapters and a searchable index - and doing it while the ink is still wet. A 90-minute episode contains roughly 13,500 spoken words. Transcribing that file sequentially through a GPU-accelerated ASR model running at 20x real-time takes about 4.5 minutes. That sounds fast, but at 50,000 hours of audio uploaded per day, sequential processing would require hundreds of GPU instances running 24 hours a day with no slack for burst uploads, and latency would still grow linearly with episode length. The system must parallelize.
The first tension is between chunking for speed and accuracy at boundaries. If you split a 90-minute audio file into 30-second chunks for parallel processing, words that span a chunk boundary get cut in half. “Unbelievable” becomes “Unbelie-” at the end of chunk 12 and “-vable” at the start of chunk 13. The fix is overlapping chunks - extend each chunk by 3 seconds on each side - but now the stitch worker must resolve duplicate transcriptions of the overlap region without doubling words.
The second tension is between chapter detection quality and latency. The best chapter boundary detection uses a large language model that reads the full transcript and identifies semantic shifts. But a 90-minute transcript can exceed the context window of smaller models, and waiting for full transcription before running chapter detection defeats the goal of having chapters available within 5 minutes of upload. The system must detect boundaries incrementally, operating on sliding windows of transcript as chunks complete.
The third tension is cost versus latency. A100 GPUs that process 30-second audio chunks in under 2 seconds cost significantly more per hour than CPU instances. For popular podcasts where listeners expect near-instant transcripts, the GPU cost is justified. For a long-tail episode with 200 expected listeners, it is not. We need tiered processing: GPU-accelerated for popular or premium content, batch CPU processing for the long tail. The design therefore has to address chunked audio processing, ASR model integration, speaker diarization, topic boundary detection, transcript indexing, pipeline parallelism, and the cost-latency tradeoff together.
Requirements and Constraints
Functional Requirements
- Transcribe audio uploaded to the platform with word-level timestamps
- Detect natural topic boundaries and generate chapter markers with titles
- Identify speaker transitions and label segments (Speaker 1, Speaker 2, …)
- Make the transcript full-text searchable within 10 minutes of upload completion
- Support episodes up to 4 hours in length
Non-Functional Requirements
- Transcription available within 5 minutes for episodes under 60 minutes
- Chapter detection accuracy: at least 70% of human-labeled boundaries identified
- Search index freshness: under 10 minutes from upload completion
- Daily throughput: 50,000 hours of audio per day
- ASR word error rate (WER): under 8% for clean studio audio
- System availability: 99.9% (degraded mode acceptable, total outage not)
Constraints and Assumptions
- Audio is in MP3, AAC, or WAV format; video is supported by extracting the audio track first
- Speaker identification is probabilistic: without voice enrollment we label speakers generically (“Speaker 1”, “Speaker 2”), not by name
- Chapter titles are generated heuristically from keyword extraction, not via LLM (to control latency)
- Transcription cost is a first-class constraint: GPU time must be tracked per episode
High-Level Architecture
The pipeline has five processing stages that run with as much parallelism as possible. Audio arrives at an Upload API, is written to Object Storage (S3), and triggers a Job Queue entry. The Job Queue fans work out to five parallel processor types: a Chunker that splits audio into overlapping 30-second segments, an ASR Worker Pool that transcribes each chunk on GPU, a Diarization Worker that identifies speakers, a Chapter Detector that finds topic boundaries, and an Index Writer that writes the completed transcript to Elasticsearch.
The key architectural insight is that chunking and ASR run in parallel with diarization. Diarization requires the full audio file (it needs global speaker context to label consistently), so it runs as a separate parallel job against the original file rather than waiting for chunk transcripts. When both the ASR stitch and diarization output are ready, a Merge Worker aligns them by timestamp to produce speaker-labeled transcript segments.
Chapter detection can begin as soon as the first 5 minutes of transcript is available. It processes the transcript in a sliding window, emitting preliminary chapter boundaries that are refined as more transcript arrives. The final chapter list is produced when the full transcript is assembled.
The entire pipeline is event-driven via pub/sub: each stage emits a completion event that triggers the next stage, rather than using polling or synchronous chaining. This means a 10-minute episode and a 4-hour episode use the same code path - the parallelism scales with the episode length automatically.
Chunked Audio Processing
Chunked audio processing splits a long audio file into overlapping segments that can be transcribed in parallel without losing words at boundaries.
The chunker reads the uploaded file from S3, splits it into 30-second segments with a 3-second overlap on each side, and writes each chunk back to S3 as an individual file. A 90-minute episode generates 90 * 60 / 30 = 180 chunks; with the overlap, an interior chunk holds 36 seconds of audio, and the first and last hold up to 33.
# Audio chunker - splits audio into overlapping segments for parallel ASR
import io

import boto3
from pydub import AudioSegment


def chunk_audio(
    s3_bucket: str,
    s3_key: str,
    episode_id: str,
    chunk_duration_ms: int = 30_000,
    overlap_ms: int = 3_000,
) -> list[dict]:
    s3 = boto3.client("s3")
    obj = s3.get_object(Bucket=s3_bucket, Key=s3_key)
    audio = AudioSegment.from_file(io.BytesIO(obj["Body"].read()))
    chunks = []
    offset = 0
    chunk_index = 0
    while offset < len(audio):  # len() of an AudioSegment is its duration in ms
        start = max(0, offset - overlap_ms)
        end = min(len(audio), offset + chunk_duration_ms + overlap_ms)
        chunk = audio[start:end]
        # Export as 16kHz mono WAV for Whisper
        buf = io.BytesIO()
        chunk.set_frame_rate(16000).set_channels(1).export(buf, format="wav")
        buf.seek(0)
        chunk_key = f"chunks/{episode_id}/{chunk_index:04d}.wav"
        s3.put_object(Bucket=s3_bucket, Key=chunk_key, Body=buf.read())
        chunks.append({
            "chunk_index": chunk_index,
            "s3_key": chunk_key,
            "audio_start_ms": offset,    # offset in original file (without overlap)
            "overlap_start_ms": start,   # actual start of chunk audio
            "overlap_end_ms": end,
        })
        offset += chunk_duration_ms
        chunk_index += 1
    return chunks
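The audio_start_ms field records where this chunk’s own content begins in the original file (excluding the overlap padding), while overlap_start_ms records where the chunk’s audio actually starts. Whisper’s word timestamps are relative to the start of the chunk file, so the stitch worker must add overlap_start_ms (not audio_start_ms) to convert them to episode time; it then uses the overlap region to drop duplicated words.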
Do not split at fixed byte offsets - split by time. MP3 and AAC are compressed, frame-based formats that are often encoded at a variable bitrate, so a fixed byte offset does not correspond to a fixed time offset. Always decode to PCM first, then split by sample count, then re-encode the chunks.
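The window arithmetic of the chunker can be checked without decoding any audio. The following is a minimal sketch that mirrors the loop in chunk_audio; plan_chunks is a hypothetical helper name, not part of the pipeline above.

```python
def plan_chunks(duration_ms: int, chunk_ms: int = 30_000, overlap_ms: int = 3_000) -> list[dict]:
    """Return the time windows the chunker would cut, without touching any audio."""
    plan = []
    offset = 0
    while offset < duration_ms:
        plan.append({
            "chunk_index": len(plan),
            "audio_start_ms": offset,                          # start of the chunk's own content
            "overlap_start_ms": max(0, offset - overlap_ms),   # first sample actually in the chunk
            "overlap_end_ms": min(duration_ms, offset + chunk_ms + overlap_ms),
        })
        offset += chunk_ms
    return plan


plan = plan_chunks(90 * 60 * 1000)  # a 90-minute episode
print(len(plan), plan[1])
```

Printing the plan for a few episode lengths is a cheap way to confirm that interior chunks carry 36 seconds of audio while the first and last are clipped to the file bounds.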
ASR Model Integration
The ASR Worker is a GPU pod running Whisper (large-v3 or equivalent). It receives a chunk S3 key from the Job Queue, downloads the audio, runs inference, and writes the word-timestamped output back to S3.
# ASR worker - runs Whisper on a 30s audio chunk and writes output to S3
import json
import os

import boto3
import whisper

model = whisper.load_model("large-v3", device="cuda")


def transcribe_chunk(chunk_s3_key: str, episode_id: str, chunk_index: int) -> dict:
    s3 = boto3.client("s3")
    obj = s3.get_object(Bucket="podcast-audio", Key=chunk_s3_key)
    audio_bytes = obj["Body"].read()
    # Write to temp file (Whisper requires file path or numpy array)
    tmp_path = f"/tmp/{episode_id}_{chunk_index}.wav"
    with open(tmp_path, "wb") as f:
        f.write(audio_bytes)
    try:
        result = model.transcribe(
            tmp_path,
            language="en",
            word_timestamps=True,
            beam_size=5,      # beam search is used because temperature is 0
            best_of=5,        # only takes effect when sampling (temperature > 0)
            temperature=0.0,
        )
    finally:
        os.remove(tmp_path)  # do not let chunk files accumulate on the pod
    output = {
        "chunk_index": chunk_index,
        "episode_id": episode_id,
        "text": result["text"],
        "segments": [
            {
                "start": seg["start"],
                "end": seg["end"],
                "text": seg["text"],
                "words": seg.get("words", []),
            }
            for seg in result["segments"]
        ],
    }
    out_key = f"transcripts/{episode_id}/chunk_{chunk_index:04d}.json"
    s3.put_object(
        Bucket="podcast-audio",
        Key=out_key,
        Body=json.dumps(output).encode(),
    )
    return output
Each pod processes one chunk at a time. A single A100 GPU transcribes a 30-second chunk in approximately 1.5 seconds (20x real-time). Batching several chunks into one forward pass would improve GPU utilization but adds latency variance; for latency-sensitive episodes we prefer small batches or batch=1.
Descript (the podcast editing platform) published that they run Whisper at 30x real-time on A100s with batching enabled for archival work. For near-real-time use cases they run at 20x real-time with batch=1 to reduce tail latency. Apple Podcasts uses a similar chunked architecture with custom ASR models optimized for podcast audio characteristics.
Speaker Diarization
Speaker diarization answers the question “who spoke when?” It is a separate concern from transcription - the ASR model converts audio to text, but diarization assigns speaker labels to time intervals.
Diarization runs on the full original audio file (not the chunks) because speaker embedding models need global context to consistently identify that “the voice at 5:42” is the same person as “the voice at 47:12.”
# Speaker diarization using pyannote.audio
import os

import boto3
import torch
from pyannote.audio import Pipeline

diarization_pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token=os.environ["HF_TOKEN"],  # read the Hugging Face token from the environment
)
diarization_pipeline.to(torch.device("cuda"))  # pyannote 3.x expects a torch.device, not a string


def diarize_episode(episode_id: str, s3_key: str) -> list[dict]:
    s3 = boto3.client("s3")
    obj = s3.get_object(Bucket="podcast-audio", Key=s3_key)
    tmp_path = f"/tmp/{episode_id}_full.wav"
    with open(tmp_path, "wb") as f:
        f.write(obj["Body"].read())
    diarization = diarization_pipeline(tmp_path)
    segments = []
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        segments.append({
            "start_ms": int(turn.start * 1000),  # pyannote reports seconds; the stitched words use ms
            "end_ms": int(turn.end * 1000),
            "speaker": speaker,  # e.g. "SPEAKER_00", "SPEAKER_01"
        })
    return segments


def merge_transcript_with_diarization(
    transcript_words: list[dict],   # stitched words: [{text, start_ms, end_ms}]
    speaker_segments: list[dict],   # diarization turns: [{start_ms, end_ms, speaker}]
) -> list[dict]:
    merged = []
    for word in transcript_words:
        word_mid = (word["start_ms"] + word["end_ms"]) / 2
        speaker = "SPEAKER_UNKNOWN"
        # Label the word with the turn that contains its midpoint
        for seg in speaker_segments:
            if seg["start_ms"] <= word_mid <= seg["end_ms"]:
                speaker = seg["speaker"]
                break
        merged.append({**word, "speaker": speaker})
    return merged
Diarization and ASR are independent - they run in parallel on the same audio file. Merging happens at the word level by timestamp: for each word in the ASR output, find the diarization segment whose time range contains the word’s midpoint. The nested loop shown is O(n*m) in the worst case; because both lists are sorted by time, a single forward sweep over both reduces this to O(n+m).
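Because both inputs are sorted by time, the merge can also be written as one forward sweep. This is a sketch under stated assumptions: times are in milliseconds under start_ms/end_ms keys (as in the stitched word stream), speaker turns do not overlap, and merge_sorted is a hypothetical name.

```python
def merge_sorted(words: list[dict], turns: list[dict]) -> list[dict]:
    """Assign a speaker to each word with a single pass over both sorted lists."""
    merged = []
    t = 0  # index of the first turn that could still contain the current word
    for word in words:
        mid = (word["start_ms"] + word["end_ms"]) / 2
        # Skip turns that ended before this word's midpoint; they cannot match later words either.
        while t < len(turns) and turns[t]["end_ms"] < mid:
            t += 1
        if t < len(turns) and turns[t]["start_ms"] <= mid:
            speaker = turns[t]["speaker"]
        else:
            speaker = "SPEAKER_UNKNOWN"  # word falls in a gap between turns
        merged.append({**word, "speaker": speaker})
    return merged
```

Overlapping speech (two people talking at once) breaks the non-overlap assumption and would need a rule for choosing between candidate turns.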
Topic Boundary Detection
Topic boundary detection identifies the moments where the conversation shifts from one subject to another - the chapter boundaries. These are not silences or speaker changes; they are semantic transitions that require understanding the content.
The approach uses embedding-based cosine similarity: embed each sentence of the transcript, then look for points where consecutive sentences have a large drop in similarity - a “semantic cliff” that signals a topic shift.
# Topic boundary detection using sentence embeddings and cosine similarity drops
import re
from collections import Counter

import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")


def detect_chapter_boundaries(
    sentences: list[dict],           # [{text, start_ms, end_ms}]
    window_size: int = 3,            # sentences on each side of candidate boundary
    threshold: float = 0.35,         # cosine distance threshold for a boundary
    min_chapter_ms: int = 120_000,   # 2 minutes minimum chapter length
) -> list[dict]:
    if len(sentences) < window_size * 2 + 1:
        return []
    texts = [s["text"] for s in sentences]
    embeddings = embedder.encode(texts, batch_size=64, normalize_embeddings=True)
    boundaries = []
    last_boundary_ms = 0
    for i in range(window_size, len(sentences) - window_size):
        # Average embedding of sentences before and after the candidate point
        before = embeddings[i - window_size : i].mean(axis=0)
        after = embeddings[i : i + window_size].mean(axis=0)
        # The mean of unit vectors is not unit length, so re-normalize before the dot product
        before = before / np.linalg.norm(before)
        after = after / np.linalg.norm(after)
        # Cosine distance between before and after windows
        cos_sim = float(np.dot(before, after))
        cos_dist = 1.0 - cos_sim
        candidate_ms = sentences[i]["start_ms"]
        if cos_dist >= threshold and (candidate_ms - last_boundary_ms) >= min_chapter_ms:
            boundaries.append({
                "sentence_index": i,
                "start_ms": candidate_ms,
                "cos_distance": round(cos_dist, 4),
                "title": extract_chapter_title(sentences[i - window_size : i + window_size]),
            })
            last_boundary_ms = candidate_ms
    return boundaries


def extract_chapter_title(sentences: list[dict]) -> str:
    # Simple keyword extraction: most frequent non-stopword words of 4+ letters
    # (no part-of-speech tagging is done, so these are not guaranteed to be nouns)
    stopwords = {"the", "a", "an", "and", "or", "but", "in", "on", "at", "to", "for", "of", "is", "it", "that", "this"}
    text = " ".join(s["text"] for s in sentences)
    words = re.findall(r"\b[A-Za-z]{4,}\b", text.lower())
    freq = Counter(w for w in words if w not in stopwords)
    top_words = [w for w, _ in freq.most_common(3)]
    return " / ".join(w.title() for w in top_words) if top_words else "Chapter"
The window_size=3 means we look at the 3 sentences before and after each candidate point. A larger window smooths noise but misses short topic transitions. The threshold=0.35 is tuned empirically - lower values detect more boundaries (many false positives), higher values miss real boundaries.
Do not rely on punctuation for sentence boundaries in podcast transcripts. Whisper does emit punctuation, but it is inconsistent on long conversational turns, and many other ASR models produce an unpunctuated stream of words. A more robust approach is to segment the transcript into pseudo-sentences by splitting on long pauses (gaps of 500ms or more between words). Whisper’s word timestamps make this straightforward.
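The pause-based segmentation can be sketched in a few lines. The input is assumed to be the stitched word stream with text, start_ms and end_ms keys; split_on_pauses is a hypothetical helper name, and the 500ms default is the gap suggested above.

```python
def split_on_pauses(words: list[dict], min_gap_ms: int = 500) -> list[dict]:
    """Group a time-ordered word stream into pseudo-sentences at long silences."""
    groups: list[list[dict]] = []
    current: list[dict] = []
    for word in words:
        # A gap of min_gap_ms or more since the previous word closes the current group.
        if current and word["start_ms"] - current[-1]["end_ms"] >= min_gap_ms:
            groups.append(current)
            current = []
        current.append(word)
    if current:
        groups.append(current)
    return [
        {
            "text": " ".join(w["text"].strip() for w in group),
            "start_ms": group[0]["start_ms"],
            "end_ms": group[-1]["end_ms"],
        }
        for group in groups
    ]
```

The output has the {text, start_ms, end_ms} shape that detect_chapter_boundaries expects. Fast talkers may produce very long groups, so a maximum group length is a reasonable extra guard.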
Transcript Indexing
Once the full transcript is assembled and speaker-labeled, it needs to be indexed for search. The search use case is “find the moment in this episode where they talked about X” - a phrase search that returns a timestamp, not just a document hit.
# Elasticsearch mapping for podcast transcript segments
import logging

from elasticsearch.helpers import bulk

TRANSCRIPT_MAPPING = {
    "mappings": {
        "properties": {
            "episode_id": {"type": "keyword"},
            "podcast_id": {"type": "keyword"},
            "start_ms": {"type": "long"},
            "end_ms": {"type": "long"},
            "speaker": {"type": "keyword"},
            "text": {"type": "text", "analyzer": "transcript_english"},
            "chapter_id": {"type": "keyword"},
            "published_at": {"type": "date"},
        }
    },
    "settings": {
        "number_of_shards": 5,
        "number_of_replicas": 1,
        "analysis": {
            # Custom token filters must be defined before an analyzer can reference them
            "filter": {
                "english_stop": {"type": "stop", "stopwords": "_english_"},
                "english_stemmer": {"type": "stemmer", "language": "english"},
            },
            "analyzer": {
                # Named transcript_english to avoid shadowing the built-in "english" analyzer
                "transcript_english": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "english_stop", "english_stemmer"],
                }
            },
        },
    },
}


def index_transcript_segments(
    es_client,
    episode_id: str,
    segments: list[dict],  # [{start_ms, end_ms, speaker, text, chapter_id}]
    batch_size: int = 500,
) -> int:
    actions = [
        {
            "_index": "podcast_transcripts",
            "_id": f"{episode_id}:{seg['start_ms']}",  # deterministic id makes retries idempotent
            "_source": {
                "episode_id": episode_id,
                "start_ms": seg["start_ms"],
                "end_ms": seg["end_ms"],
                "speaker": seg.get("speaker", "SPEAKER_UNKNOWN"),
                "text": seg["text"],
                "chapter_id": seg.get("chapter_id"),
            },
        }
        for seg in segments
    ]
    success, errors = bulk(es_client, actions, chunk_size=batch_size, raise_on_error=False)
    if errors:
        # Do not drop failures silently; the caller compares the count and retries
        logging.warning("%d of %d segments failed to index for %s", len(errors), len(actions), episode_id)
    return success
Documents are indexed at the segment level (roughly 10-30 words per segment) rather than the word level to keep index size manageable, while still allowing timestamp-precision search results.
Index at the segment granularity, not the word granularity. Word-level indexing inflates the index by 10x with no query quality benefit for podcast search. Users search for topics (“machine learning”), not individual words - segment-level indexing with phrase queries handles this correctly.
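A “find the moment” search against this mapping could look like the following. This is a sketch of one possible query body rather than the platform’s actual query: build_moment_query is a hypothetical helper, and the slop value of 2 is an assumed tolerance for filler words between phrase terms.

```python
def build_moment_query(episode_id: str, phrase: str, size: int = 10) -> dict:
    """Build an Elasticsearch request body that returns timestamped hits for a phrase."""
    return {
        "size": size,
        "query": {
            "bool": {
                # Phrase match on the analyzed transcript text
                "must": [{"match_phrase": {"text": {"query": phrase, "slop": 2}}}],
                # Exact filter on the keyword field; does not affect scoring
                "filter": [{"term": {"episode_id": episode_id}}],
            }
        },
        # Earliest mention first, so the player can jump to the first hit
        "sort": [{"start_ms": "asc"}],
        "_source": ["start_ms", "end_ms", "speaker", "text"],
    }


body = build_moment_query("ep-123", "machine learning")
```

Each hit carries start_ms, which is the seek position to hand to the audio player.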
Pipeline Parallelism
The pipeline has a dependency graph that determines which stages can run in parallel:
- Chunker runs first (depends on: upload complete)
- ASR Workers run in parallel across all chunks (depend on: Chunker for each chunk)
- Diarization runs in parallel with ASR (depends on: upload complete, independently of Chunker)
- Stitch Worker runs after all ASR chunks complete
- Merge Worker runs after Stitch + Diarization both complete
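- Chapter Detection produces its final boundaries from the merged transcript (preliminary boundaries can be emitted earlier from the sliding window over partial transcript)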
- Index Writer runs after Chapter Detection
# Job DAG definition (simplified Airflow-style)
pipeline:
  name: podcast_transcription
  steps:
    - id: chunk_audio
      depends_on: []
      worker: chunker
      timeout_minutes: 5
    - id: diarize
      depends_on: []
      worker: diarization_gpu
      timeout_minutes: 20
    - id: asr_chunk_{i}
      depends_on: [chunk_audio]
      worker: asr_gpu
      parallelism: num_chunks  # one job per chunk
      timeout_minutes: 3
    - id: stitch_transcript
      depends_on: [asr_chunk_*]  # wait for all chunks
      worker: stitch
      timeout_minutes: 2
    - id: merge_diarization
      depends_on: [stitch_transcript, diarize]
      worker: merge
      timeout_minutes: 2
    - id: detect_chapters
      depends_on: [merge_diarization]
      worker: chapter_detector
      timeout_minutes: 3
    - id: index_search
      depends_on: [detect_chapters]
      worker: index_writer
      timeout_minutes: 3
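For a 60-minute episode (120 chunks), the critical path is: chunk_audio (2min) + asr_chunk (2min, parallel) + stitch (1min) + merge (1min) + detect_chapters (1min) + index (1min) = approximately 8 minutes. The stitched transcript is therefore ready about 5 minutes after upload, which is what the 5-minute transcription target measures, and the search index is ready at about 8 minutes, inside the 10-minute freshness target. Diarization stays off the critical path as long as it finishes before the stitch does.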
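The critical-path arithmetic can be reproduced by walking the dependency graph. In this sketch the step durations are the estimates quoted above, except the diarize figure, which is an assumed placeholder because the text gives none; finish_minute is a hypothetical helper.

```python
from functools import lru_cache

# Estimated minutes per step for a 60-minute episode
DURATION_MIN = {
    "chunk_audio": 2, "asr_chunk": 2, "stitch_transcript": 1,
    "merge_diarization": 1, "detect_chapters": 1, "index_search": 1,
    "diarize": 3,  # assumption: not stated in the text
}
DEPENDS_ON = {
    "chunk_audio": [],
    "diarize": [],
    "asr_chunk": ["chunk_audio"],          # all chunks run as one parallel wave
    "stitch_transcript": ["asr_chunk"],
    "merge_diarization": ["stitch_transcript", "diarize"],
    "detect_chapters": ["merge_diarization"],
    "index_search": ["detect_chapters"],
}


@lru_cache(maxsize=None)
def finish_minute(step: str) -> int:
    """Earliest minute at which a step can finish: latest dependency plus its own duration."""
    ready = max((finish_minute(dep) for dep in DEPENDS_ON[step]), default=0)
    return ready + DURATION_MIN[step]


for step in DEPENDS_ON:
    print(f"{step}: done at minute {finish_minute(step)}")
```

Raising the diarize duration above 5 in this model pushes merge_diarization later, which is the point at which diarization joins the critical path.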
Cost-Latency Tradeoff
GPU instance cost is the dominant expense. An A100 instance costs approximately $3.50/hour. Processing a 60-minute episode requires 120 chunk-ASR jobs, each taking 1.5 seconds - 180 seconds of GPU time, or $0.175 per episode at full A100 rate. At 50,000 hours/day = 50,000 episodes of average 60 minutes, daily GPU cost is approximately $8,750.
Given:
- 50,000 hours of audio per day
- Average episode: 60 minutes
- Episodes per day: 50,000
- ASR: 1.5s per 30s chunk on A100
- GPU time per episode: 120 chunks x 1.5s = 180s = 3 GPU-minutes
GPU compute cost (A100 at $3.50/hr):
- Per episode: 3/60 hours x $3.50 = $0.175
- Daily: 50,000 x $0.175 = $8,750/day
Storage cost (S3):
- Raw audio: 50,000 episodes x 60MB avg = 3TB/day
- At $0.023/GB-month: 3,000 x $0.023 = $69/month for each day of uploads (about $2,070/month after 30 days of retention)
- Transcript JSON: ~500KB per episode = 25GB/day = negligible
Elasticsearch index:
- ~5KB per indexed segment, ~200 segments per episode = 1MB per episode
- Daily ingest: 50GB; at $0.10/GB-month = $5/month for each day of ingest (about $150/month after 30 days of retention)
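The GPU figures above follow from a short calculation. This sketch uses only the numbers stated in the estimate (1.5s per 30s chunk, $3.50 per A100-hour, 50,000 episodes per day); the function name is hypothetical.

```python
def gpu_cost_per_episode(
    episode_minutes: float = 60.0,
    chunk_seconds: float = 30.0,
    asr_seconds_per_chunk: float = 1.5,
    gpu_dollars_per_hour: float = 3.50,
) -> float:
    """GPU cost of transcribing one episode, ignoring idle time and overlap padding."""
    chunks = episode_minutes * 60 / chunk_seconds
    gpu_hours = chunks * asr_seconds_per_chunk / 3600
    return gpu_hours * gpu_dollars_per_hour


per_episode = gpu_cost_per_episode()
daily = per_episode * 50_000
print(f"per episode: ${per_episode:.3f}, daily: ${daily:,.0f}")
```

The model charges only for busy GPU seconds. A fleet running at the 80% utilization target would cost proportionally more per episode.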
Cost tiering reduces GPU spend significantly. Episodes with fewer than 1,000 plays are processed on spot GPU instances at 70% discount. Episodes with fewer than 100 expected plays use batch CPU transcription (Whisper runs on CPU at 3x real-time - 20 minutes for a 60-minute episode) with no latency SLA guarantee.
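The routing rule can be expressed as a small function. The play-count thresholds are the ones stated above; mapping them onto the premium, standard and batch tier names, and the hardware labels themselves, are assumptions made for illustration.

```python
def choose_tier(expected_plays: int) -> tuple[str, str]:
    """Map an episode's expected play count to (processing_tier, hardware)."""
    if expected_plays < 100:
        return ("batch", "cpu")            # no latency SLA
    if expected_plays < 1_000:
        return ("standard", "spot_gpu")    # discounted, may be preempted
    return ("premium", "on_demand_gpu")    # lowest latency


for plays in (50, 500, 50_000):
    print(plays, choose_tier(plays))
```

How expected_plays is predicted (for example from the show’s recent episodes) is outside this sketch and would need its own model.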
Spotify’s podcast transcription system (launched in 2023) processes over 5 million episode hours per year. They published that they use a combination of their own ASR models and third-party APIs for different quality tiers, with cost-based routing similar to this design. YouTube’s automatic captions system uses a similar chunked pipeline with Google’s own ASR infrastructure.
Data Model
-- Episode metadata - one row per uploaded podcast episode
CREATE TABLE podcast_episodes (
    episode_id        UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    podcast_id        UUID NOT NULL,
    title             TEXT NOT NULL,
    audio_s3_key      TEXT NOT NULL,
    duration_ms       BIGINT,
    uploaded_at       TIMESTAMPTZ DEFAULT now(),
    transcript_status TEXT DEFAULT 'pending'
        CHECK (transcript_status IN ('pending','chunking','transcribing','stitching','indexed','failed')),
    processing_tier   TEXT DEFAULT 'standard'
        CHECK (processing_tier IN ('premium','standard','batch'))
);
CREATE INDEX idx_episodes_podcast ON podcast_episodes(podcast_id, uploaded_at DESC);
CREATE INDEX idx_episodes_status ON podcast_episodes(transcript_status) WHERE transcript_status != 'indexed';

-- Transcription jobs - tracks per-chunk work
CREATE TABLE transcription_jobs (
    job_id         UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    episode_id     UUID NOT NULL REFERENCES podcast_episodes(episode_id),
    chunk_index    INT NOT NULL,
    audio_s3_key   TEXT NOT NULL,
    audio_start_ms BIGINT NOT NULL,
    audio_end_ms   BIGINT NOT NULL,
    status         TEXT DEFAULT 'queued',
    worker_id      TEXT,
    started_at     TIMESTAMPTZ,
    completed_at   TIMESTAMPTZ,
    output_s3_key  TEXT,
    UNIQUE(episode_id, chunk_index)  -- also provides the (episode_id, chunk_index) lookup index
);

-- Assembled transcript segments - segment-level, with speaker labels
CREATE TABLE transcript_segments (
    segment_id BIGSERIAL,
    episode_id UUID NOT NULL,
    start_ms   BIGINT NOT NULL,
    end_ms     BIGINT NOT NULL,
    text       TEXT NOT NULL,
    speaker    TEXT,
    confidence FLOAT,
    chapter_id UUID,
    -- PostgreSQL requires the partition key to be part of the primary key
    PRIMARY KEY (episode_id, segment_id)
) PARTITION BY LIST (episode_id); -- partition per episode for fast deletes
CREATE INDEX idx_segments_episode_time ON transcript_segments(episode_id, start_ms);

-- Chapter markers - auto-detected topic boundaries
CREATE TABLE chapter_markers (
    chapter_id   UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    episode_id   UUID NOT NULL REFERENCES podcast_episodes(episode_id),
    start_ms     BIGINT NOT NULL,
    title        TEXT,
    cos_distance FLOAT, -- boundary strength signal
    created_at   TIMESTAMPTZ DEFAULT now()
);
CREATE INDEX idx_chapters_episode ON chapter_markers(episode_id, start_ms);

-- Speaker segments - diarization output
CREATE TABLE speaker_segments (
    id            BIGSERIAL PRIMARY KEY,
    episode_id    UUID NOT NULL,
    speaker_label TEXT NOT NULL,
    start_ms      BIGINT NOT NULL,
    end_ms        BIGINT NOT NULL
);
CREATE INDEX idx_speakers_episode_time ON speaker_segments(episode_id, start_ms);
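The transcript_segments table is list-partitioned by episode_id so that deleting a retranscribed episode drops one partition rather than running a slow DELETE WHERE episode_id = ? on a large table. The trade-off is that a partition must be created for each episode at ingest time, and PostgreSQL query planning slows as the partition count climbs into the thousands, so at 50,000 episodes per day a production schema would more likely hash-partition on episode_id or group many episodes into each partition.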
Key Algorithms and Protocols
Transcript Stitch from Overlapping Chunks
The stitch algorithm merges 120 overlapping chunk transcripts into a single coherent transcript without duplicating words in the overlap regions:
# Stitch overlapping chunk transcripts into a single word stream
def stitch_chunks(
    chunks: list[dict],  # each has {chunk_index, overlap_start_ms, segments: [{words}]}
) -> list[dict]:
    result_words = []
    last_word_end_ms = 0
    for chunk in sorted(chunks, key=lambda c: c["chunk_index"]):
        # Whisper timestamps are relative to the first sample of the chunk file,
        # which sits at overlap_start_ms (not audio_start_ms) in the episode
        base_ms = chunk["overlap_start_ms"]
        for seg in chunk["segments"]:
            for word in seg.get("words", []):
                # Convert chunk-relative timestamps to episode-absolute timestamps
                abs_start = base_ms + int(word["start"] * 1000)
                abs_end = base_ms + int(word["end"] * 1000)
                # Skip words that fall in the overlap with the previous chunk
                if abs_start < last_word_end_ms - 100:  # 100ms tolerance
                    continue
                result_words.append({
                    "text": word["word"],
                    "start_ms": abs_start,
                    "end_ms": abs_end,
                    "confidence": word.get("probability", 1.0),
                })
                last_word_end_ms = abs_end
    return result_words
The key is abs_start < last_word_end_ms - 100: any word whose absolute start time is earlier than the last accepted word’s end time (minus a 100ms tolerance) is in the overlap and is skipped.
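One weakness of a last-accepted-word rule is that the final word of a chunk may itself be truncated at the chunk edge. An alternative, shown here as a sketch and not as the stitch worker’s actual method, is midpoint ownership: each chunk keeps only words whose midpoint lies inside its own 30-second core, so words in the padding always come from the chunk that heard them in full. It assumes each chunk dict carries audio_start_ms and overlap_start_ms from the chunker alongside the ASR segments.

```python
def stitch_by_ownership(chunks: list[dict], chunk_duration_ms: int = 30_000) -> list[dict]:
    """Keep each word only in the chunk whose core window contains the word's midpoint."""
    words_out = []
    for chunk in sorted(chunks, key=lambda c: c["chunk_index"]):
        core_start = chunk["audio_start_ms"]
        core_end = core_start + chunk_duration_ms
        base_ms = chunk["overlap_start_ms"]  # chunk-relative time zero in episode time
        for seg in chunk["segments"]:
            for word in seg.get("words", []):
                abs_start = base_ms + int(word["start"] * 1000)
                abs_end = base_ms + int(word["end"] * 1000)
                midpoint = (abs_start + abs_end) / 2
                if core_start <= midpoint < core_end:
                    words_out.append({"text": word["word"], "start_ms": abs_start, "end_ms": abs_end})
    return words_out
```

Because the two chunks may timestamp the same word slightly differently, a word whose midpoint sits right on a core boundary can still be kept twice or dropped; a small de-duplication pass on adjacent identical words covers that case.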
Cosine Similarity Drop for Boundary Scoring
# Compute per-sentence similarity drop signal for chapter boundary detection
import numpy as np


def similarity_drop_signal(embeddings: np.ndarray, window: int = 3) -> np.ndarray:
    n = len(embeddings)
    drops = np.zeros(n)
    for i in range(window, n - window):
        # Mean embedding of the window before and after sentence i
        before = embeddings[i - window:i].mean(axis=0)
        after = embeddings[i:i + window].mean(axis=0)
        # Normalize the means so the dot product is a true cosine similarity
        drops[i] = 1.0 - float(np.dot(before / np.linalg.norm(before),
                                      after / np.linalg.norm(after)))
    return drops  # peaks in this array = chapter boundaries
Time complexity: O(n * window) where n is the number of sentences. For a 90-minute episode with ~1,800 sentences and window=3, this runs in under 50ms on CPU.
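Turning the drop signal into boundaries means picking its peaks, not every index above the threshold. The sketch below is one plain-Python way to do that; pick_peaks is a hypothetical helper, and min_gap is expressed in sentences as an assumption (the detector above enforces its minimum spacing in milliseconds).

```python
def pick_peaks(drops: list[float], threshold: float = 0.35, min_gap: int = 10) -> list[int]:
    """Return indices of local maxima at or above threshold, at least min_gap apart."""
    candidates = [
        i for i in range(1, len(drops) - 1)
        if drops[i] >= threshold and drops[i] >= drops[i - 1] and drops[i] > drops[i + 1]
    ]
    chosen: list[int] = []
    # Visit the strongest peaks first so a weak early peak cannot block a strong later one.
    for i in sorted(candidates, key=lambda idx: drops[idx], reverse=True):
        if all(abs(i - j) >= min_gap for j in chosen):
            chosen.append(i)
    return sorted(chosen)
```

Selecting by strength differs from the first-past-the-threshold rule in detect_chapter_boundaries, which accepts the earliest qualifying sentence even when a sharper transition follows a few sentences later.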
Scaling and Performance
Throughput sizing for 50,000 hours/day upload rate:
Peak upload rate (assuming 3x burst factor):
- Average: 50,000 hours/day = 2,083 hours/hour = 34.7 hours/minute
- Peak: 34.7 * 3 = 104 episode-hours/minute
ASR GPU requirement at peak:
- 104 episode-hours/minute = 6,240 episode-minutes/minute
- Each GPU-minute handles 20x real-time = 20 audio-minutes
- GPUs needed: 6,240 / 20 = 312 A100 GPUs at peak
Auto-scaling behavior:
- Target: 80% GPU utilization on each pod
- Scale-out trigger: job queue depth > 100 per pod
- Scale-in: queue depth < 10 for 5 consecutive minutes
- Warm-up time for new GPU pod: ~90 seconds (model load)
Diarization (CPU-based at non-premium tier):
- 1 CPU-minute processes 3 real-minutes of audio
- 104 episode-hours/minute = 6,240 audio-minutes/minute
- CPU cores needed: 6,240 / 3 = 2,080 vCPUs at peak
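The sizing above can be reproduced with a short calculation. This sketch uses the stated inputs (50,000 hours/day, 3x burst, 20x real-time ASR, 3x real-time CPU diarization); peak_fleet is a hypothetical name. It carries full precision, so its results are slightly higher than the figures above, which round 104.17 audio-hours per minute down to 104.

```python
import math


def peak_fleet(
    hours_per_day: int = 50_000,
    burst: int = 3,
    asr_speedup: int = 20,      # audio-minutes transcribed per GPU-minute
    diarize_speedup: int = 3,   # audio-minutes diarized per CPU-minute
) -> dict:
    """Workers needed to keep up with peak upload rate in real time."""
    peak_audio_min_per_min = hours_per_day * 60 * burst / (24 * 60)
    return {
        "peak_audio_min_per_min": peak_audio_min_per_min,
        "asr_gpus": math.ceil(peak_audio_min_per_min / asr_speedup),
        "diarization_vcpus": math.ceil(peak_audio_min_per_min / diarize_speedup),
    }


print(peak_fleet())
```

Dividing the GPU count by the 80% utilization target gives the fleet size the autoscaler would actually need to provision.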
Spotify uses Kubernetes horizontal pod autoscaling for their transcription workers with GPU-aware scheduling. Their blog post on podcast transcription scaling noted that GPU warm-up latency (model load time) was a critical bottleneck - they solved it by keeping a minimum of 10 warm GPU pods running at all times regardless of queue depth.
Failure Modes and Recovery
| Failure | Detection | Impact | Recovery |
|---|---|---|---|
| ASR worker crash mid-chunk | Job heartbeat timeout (30s); job returns to queue | Chunk not transcribed; stitch blocks waiting | Chunk job re-queued with exponential backoff; idempotent output key prevents double-write |
| Chunk ordering mismatch in stitch | Monotonicity check: abs_start should increase; gap > 5s between consecutive words triggers alert | Garbled transcript around the boundary | Stitch worker detects and re-requests affected chunks from ASR |
| Diarization timeout (full audio) | Job timeout (20min); alert fired | Episode transcribed without speaker labels; published with “Unknown Speaker” | Async retry of diarization; once complete, a merge patch job updates the segment speaker labels |
| Elasticsearch write failure | Bulk index returns error count > 0 | Episode missing from search | Index writer retries with exponential backoff; dead-letter queue for persistent failures |
| S3 upload failure (chunk write) | PutObject raises exception | Chunk not available for ASR workers | Chunker retries S3 write up to 3 times; if all fail, episode job is marked failed and retried from scratch after 10 minutes |
The most common production failure is partial completion: 119 of 120 chunks transcribe successfully but chunk 67 fails permanently (corrupted audio segment). The stitch worker waits indefinitely for chunk 67. Always set a stitch timeout: if all chunks except N are complete and the wait exceeds 2x the expected processing time, proceed with a gap marker (“transcript unavailable for this segment”) rather than blocking the entire episode indefinitely.
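The stitch timeout rule can be written as a small decision function. This is a sketch under assumptions: stitch_decision and its action names are hypothetical, and max_missing (how many absent chunks are tolerable before the episode is failed instead of published with gaps) is a parameter the text leaves open.

```python
def stitch_decision(
    total_chunks: int,
    completed: set[int],
    waited_s: float,
    expected_s: float,
    max_missing: int = 1,
) -> dict:
    """Decide whether the stitch worker should wait, stitch, stitch with gaps, or fail."""
    missing = sorted(set(range(total_chunks)) - completed)
    if not missing:
        return {"action": "stitch", "gaps": []}
    if waited_s <= 2 * expected_s:
        return {"action": "wait", "gaps": missing}
    if len(missing) <= max_missing:
        # Publish with a "transcript unavailable for this segment" marker per gap
        return {"action": "stitch_with_gaps", "gaps": missing}
    return {"action": "fail", "gaps": missing}
```

Calling it on each chunk-completion event and on a periodic timer keeps the stitch worker from blocking on a chunk that will never arrive.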
Comparison of Approaches
| Approach | Latency | Cost | WER | Diarization | Best Fit |
|---|---|---|---|---|---|
| Cloud ASR (Google Speech-to-Text) | Low (2min) | High ($0.006/min) | 8-10% | Separate API | Small volumes, no infra |
| Self-Hosted Whisper (sequential) | High (20min for 60min ep) | Low | 7-9% | Separate job | Batch / archival transcription |
| Self-Hosted Whisper + Parallel Chunks (this design) | Medium (5min) | Medium | 7-9% | Parallel | Production platform at scale |
| Streaming ASR (incremental) | Very Low (real-time) | High | 12-15% | None | Live captions, not podcast |
| Federated Transcription (user device) | Variable | Near-zero server cost | Varies | None | Privacy-sensitive use cases |
The self-hosted parallel chunking approach wins for a platform at scale because it gives the best cost-to-latency ratio without vendor lock-in. Cloud ASR is simpler to operate but more expensive per minute of audio (at the listed $0.006/min a 60-minute episode costs $0.36, roughly twice the $0.175 of on-demand GPU time estimated earlier, and the gap widens once spot and CPU tiers are used) and provides no model customization. Streaming ASR has higher WER because it cannot use bidirectional context - the model can only see audio that has already happened.
Key Takeaways
- Chunked audio processing with overlapping boundaries enables O(n/P) transcription latency where P is the number of parallel GPU workers.
- ASR model integration via Whisper provides word-level timestamps for free - use them to enable timestamp-precision search and stitch alignment.
- Speaker diarization runs independently of ASR on the full audio file; merging by timestamp at word granularity produces accurate speaker-labeled transcripts.
- Topic boundary detection via cosine similarity drops on sentence embeddings is a reliable, low-latency alternative to full-transcript LLM chapter detection.
- Transcript indexing at segment granularity balances index size and search precision for the podcast “find this moment” use case.
- Pipeline parallelism across chunking, ASR, and diarization is the critical design choice - without it, latency scales linearly with episode length.
- Cost-latency tiering (GPU for premium, spot GPU for standard, CPU for batch) reduces daily GPU spend by 40-60% on long-tail content.
The counter-intuitive lesson is that the chapter detection problem is harder than the transcription problem. ASR quality is well-understood and can be benchmarked with WER. Chapter boundary quality is subjective - two humans labeling the same episode often agree on only 60-70% of boundaries. Designing for 70% recall against human labels is the right target, but communicating that to stakeholders requires reframing: chapters are a navigation aid, not a perfect summary. An imperfect chapter at the right place is more useful than no chapter at all.
Frequently Asked Questions
Q: Why not use a single streaming ASR approach instead of chunking?
A: Streaming ASR processes audio in real-time, producing captions as audio arrives. It has two problems for podcast transcription: WER is 3-5 percentage points higher than batch ASR because the model cannot use bidirectional context, and it requires a persistent connection to the audio source which is incompatible with pre-recorded uploads. Chunked batch ASR is strictly better for accuracy on pre-recorded content.
Q: Why not run chapter detection with an LLM (GPT-4) for better quality?
A: A 90-minute transcript is approximately 13,500 words, which can exceed the context window of smaller models and then requires splitting. More importantly, an LLM call adds 10-30 seconds of latency and significant cost per episode. The cosine-similarity approach runs in under 1 second on CPU and produces results within 5-10% of LLM quality for chapter boundary detection. Use LLM chapter generation only for premium tier where users expect AI-generated chapter summaries, not just boundary markers.
Q: How do you handle non-English podcasts?
A: Whisper is multilingual - pass language=None to auto-detect, or use explicit language codes. The accuracy drops for lower-resource languages (Swahili, Yoruba) compared to English. For diarization, pyannote.audio is language-agnostic since it operates on acoustic features. Chapter detection via sentence embeddings requires a multilingual model (e.g., paraphrase-multilingual-MiniLM-L12-v2).
Q: Why partition transcript_segments by episode_id instead of by time?
A: The primary query pattern is “give me all segments for episode X” not “give me all segments from time T1 to T2 across all episodes.” Time-based partitioning would scatter a single episode’s segments across many partitions, slowing the most common query. List partitioning by episode_id means each episode’s 200-500 segments are in one partition, and deleting a re-transcribed episode drops that partition instantly.
Q: How do you detect music and ads versus spoken content for more accurate chapter detection?
A: Music and ad segments appear as periods of high audio energy with low speech probability from the VAD (Voice Activity Detection) preprocessor. These segments are tagged as SEGMENT_TYPE=music or SEGMENT_TYPE=ad rather than passed to the ASR model. Chapter boundaries are only detected within SEGMENT_TYPE=speech runs, which avoids false chapter boundaries at every music transition.
Q: What is the false-positive rate for chapter boundaries and how do you tune it?
A: With threshold=0.35 and min_chapter_ms=120000, typical episodes produce 8-15 chapter boundaries. False positive rate (a boundary where humans say there is none) is approximately 20-30% - one in four detected boundaries is not a real topic shift. Increasing the threshold to 0.45 reduces false positives to 10% but misses 30% of real boundaries. The threshold is a slider between precision and recall; 0.35 optimizes for recall (don’t miss real chapters) at the cost of precision.
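Tuning the threshold requires scoring detected boundaries against human labels. The sketch below is one simple way to compute precision and recall; the greedy one-to-one matching and the 30-second tolerance are assumptions, since the text does not say how close a detected boundary must be to count as a hit.

```python
def boundary_scores(
    predicted_ms: list[int],
    reference_ms: list[int],
    tolerance_ms: int = 30_000,
) -> dict:
    """Greedy one-to-one matching of predicted boundaries to human-labeled ones."""
    unmatched = sorted(reference_ms)
    true_pos = 0
    for p in sorted(predicted_ms):
        # Take the earliest unmatched reference boundary within tolerance, if any.
        hit = next((r for r in unmatched if abs(r - p) <= tolerance_ms), None)
        if hit is not None:
            unmatched.remove(hit)
            true_pos += 1
    precision = true_pos / len(predicted_ms) if predicted_ms else 0.0
    recall = true_pos / len(reference_ms) if reference_ms else 0.0
    return {"precision": precision, "recall": recall}
```

Sweeping the threshold and plotting these two numbers against each other gives the precision-recall curve on which 0.35 and 0.45 are two operating points.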
Interview Questions
Q: A 3-hour podcast uploads and the transcript is due within 5 minutes. Walk through how your pipeline achieves this.
Expected depth: 360 chunks at 30s each; with 120 ASR workers in parallel that is 3 waves of 1.5s each on A100, so about 4.5s of ASR time on the critical path; overhead from chunk upload to S3 + queue propagation adds 30-60s; stitch + merge + index adds 2-3 minutes; total under 5 minutes with sufficient GPU fleet.
Q: How do you handle a chunk whose ASR output has very low confidence (average log-prob below -2.0)?
Expected depth: Flag low-confidence chunks in the output JSON; stitch worker marks those segments with confidence=low; retry with temperature fallback (for example temperature=0.2, which switches Whisper from beam search to sampling) rather than repeating the deterministic pass; surface to UI as “[unclear]” rather than low-quality text; alert if more than 10% of chunks in an episode have low confidence.
Q: The diarization model incorrectly merges two similar-sounding speakers throughout a 2-hour interview. How do you fix this without re-processing the full audio?
Expected depth: Diarization models embed speaker characteristics and cluster - similar voices get merged. Fix requires re-running diarization with num_speakers=2 hint (if known), or using voice enrollment (provide reference clips of each speaker). Cannot fix without re-running diarization on the full audio; however, the merge worker is idempotent so re-running just the diarize + merge steps is cheap.
Q: The Elasticsearch index grows by 50GB per day. After 30 days, query latency starts degrading. How do you address this?
Expected depth: Time-based index lifecycle policy in Elasticsearch: episodes older than 90 days move to warm tier (fewer replicas, slower disks), older than 1 year to cold tier (S3 backing). Force-merge segments on older shards to reduce segment count. Shard allocation tuning: keep active index at 30GB per shard. Consider index aliases to keep search transparent across the tier transitions.
Q: Design the cost attribution system so each podcast creator is billed for their transcription usage.
Expected depth: Tag every GPU job with podcast_id and episode_id; log GPU-seconds per job to a metering service; aggregate by podcast_id per billing cycle; handle spot instance pricing (cost per job varies); implement a credit system for free-tier creators; expose per-episode cost breakdown in creator dashboard.