Removing 95% of podcast ads with transcript segmentation and LLMs

2 benbowler 1 9/12/2025, 11:27:01 AM benbowler.com ↗

Comments (1)

benbowler · 29m ago
I’ve been listening to podcasts for 15+ years. Ads used to be short and host-read. Now, some shows I follow have 15+ minutes of loud, compressed ads per hour.

I built a system to strip them out automatically. It takes a podcast feed, processes each episode, and outputs an ad-free feed compatible with any player.
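
For anyone wondering what "outputs an ad-free feed" could look like in practice, here is a minimal sketch, not the actual implementation. It assumes the feed is standard RSS 2.0 and that cleaned audio is served from a hypothetical proxy endpoint (`proxy_base` and the `src` query parameter are made up for illustration). Only the enclosure URLs change, so the result should still load in any player.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote

# Keep the common iTunes prefix readable when the feed is re-serialized.
ET.register_namespace("itunes", "http://www.itunes.com/dtds/podcast-1.0.dtd")


def rewrite_feed(rss_xml: str, proxy_base: str) -> str:
    """Point every <enclosure> at a (hypothetical) ad-free audio endpoint."""
    root = ET.fromstring(rss_xml)
    for enclosure in root.iter("enclosure"):
        original = enclosure.get("url")
        if original:
            # The original URL travels as a query parameter so the proxy
            # knows which episode to fetch and process.
            cleaned = f"{proxy_base}?src={quote(original, safe='')}"
            enclosure.set("url", cleaned)
    return ET.tostring(root, encoding="unicode")
```

Everything else in the feed (titles, GUIDs, artwork) passes through untouched, which keeps subscriptions and episode state stable in the player.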

What didn’t work:

Full-transcript one-shot prompting: LLMs would return a few timestamps, then stop—context was too broad.

Keyword-based detection: High false-positive and false-negative rates, especially with “house ads” and blended sponsor mentions.

What worked:

Segmentation + local scoring: Split transcripts into overlapping windows. Ask the LLM for “ad likelihood” per window—short prompts keep context tight.

Multi-head prompting: Separate prompts for (a) brand ads (URLs, promo codes, sponsor language) and (b) cross-promos. The cross-promo path compares segments to the show’s own notes/description to spot “subscribe to X podcast” segments.

Feedback loop: Users can flag missed ads; reported brand/podcast names bias future runs.

Post-processing: Merge adjacent detections, ignore <10s blips, smooth cut boundaries.

Speaker diarization (WhisperX): Detects voice/tone shifts to distinguish “host in-topic” from “host reading copy.”
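
A rough sketch of the windowing, multi-prompt scoring, and post-processing steps, under stated assumptions. The transcript is taken to be a list of timestamped segments. Each "head" is passed in as a plain callable that returns an ad likelihood between 0 and 1, standing in for an LLM call. Apart from the 10-second minimum, every number here (window size, step, threshold, merge gap) is an illustrative placeholder, not a value from the real system.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, List, Sequence, Tuple


@dataclass
class Segment:
    start: float  # seconds
    end: float    # seconds
    text: str


Span = Tuple[float, float]
Scorer = Callable[[str], float]  # stand-in for one LLM prompt ("head")


def windows(items: Sequence, size: int = 6, step: int = 3) -> Iterator[Sequence]:
    """Yield overlapping runs of `size` items, advancing by `step` (step <= size)."""
    i = 0
    while i < len(items):
        yield items[i:i + size]
        if i + size >= len(items):
            break  # the last window already reaches the end
        i += step


def detect_ads(segments: Sequence[Segment], scorers: Sequence[Scorer],
               threshold: float = 0.5, size: int = 6, step: int = 3) -> List[Span]:
    """Score each window with every head and keep windows above the threshold."""
    spans: List[Span] = []
    for chunk in windows(segments, size, step):
        text = " ".join(s.text for s in chunk)
        # A window counts as an ad if any head (brand, cross-promo, ...) fires.
        if max(scorer(text) for scorer in scorers) >= threshold:
            spans.append((chunk[0].start, chunk[-1].end))
    return spans


def postprocess(spans: Sequence[Span], max_gap: float = 2.0,
                min_len: float = 10.0) -> List[Span]:
    """Merge overlapping or near-adjacent detections, then drop short blips."""
    merged: List[Span] = []
    for start, end in sorted(spans):
        if merged and start - merged[-1][1] <= max_gap:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return [(s, e) for s, e in merged if e - s >= min_len]
```

One caveat with this naive version: because windows overlap, a flagged span extends into the neighbouring content on both sides, so the cut boundaries still need a separate tightening pass before any audio is removed.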

Across interviews, daily news, and narrative shows, this consistently removes ~95% of ads. The remaining ~5% are sponsor mentions woven directly into content, which are hard to detect by design.

Infra: hosted on DigitalOcean; inference runs on Modal.com.

Full write-up (with prompts, heuristics, and some failure cases): https://PodcastAdBlock.app/blog/building-podcast-adblock

Curious if others have tackled similar problems—especially around hard-to-detect “native” ads or more efficient diarization approaches.