We built audio/video RAG

5 mkauffman23 4 7/22/2025, 3:02:37 PM ragie.ai ↗

Comments (4)

rdegges · 5h ago
Great article. This may be my all-time favorite deep dive post on RAG strategies.

It’s super interesting to me how the process of fully making audio/video searchable requires so much processing. Like, extracting the audio and video, transcribing the audio, chunking the video into 15-sec scenes and describing them visually, etc.

I wonder if as a test you could use the video descriptions, run them as a prompt through something like Veo, then stitch them together into something close to the original. Wild.

mkauffman23 · 4h ago
I have no idea how accurate the reconstruction would be, but it would make for a wild experiment!

mkauffman23 · 6h ago
In this blog post we detail the API design and technical decisions we made when adding audio/video support to Ragie's RAG service. We explore some of the approaches we tried and the rationale behind what we landed on. Worth a read if you're building similar systems.

Here's a TLDR:
- Built a full pipeline that processes audio/video → transcription + vision descriptions → chunking → indexing
- Audio: faster-whisper with large-v3-turbo (4x faster than vanilla Whisper)
- Video: Chose Vision LLM descriptions over native multimodal embeddings (2x faster, 6x cheaper, better results)
- 15-second video chunks hit the sweet spot for detail vs context
- Source attribution with direct links to exact timestamps
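For anyone who wants to picture the chunking step, here is a minimal sketch of grouping timed items (transcript segments or per-scene descriptions) into fixed 15-second windows. It is an illustration only, not Ragie's actual implementation: the `Segment` and `Chunk` types and the `chunk_segments` helper are hypothetical names, and assigning each item to a window by its start time is an assumed simplification.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """A timed piece of text (hypothetical type), times in seconds."""
    start: float
    end: float
    text: str


@dataclass
class Chunk:
    """A fixed-length window holding the text of the segments that start in it."""
    start: float
    end: float
    text: str


def chunk_segments(segments, window=15.0):
    """Group segments into fixed windows, keyed by each segment's start time.

    Assumption: a segment that straddles a boundary stays in the window
    where it starts. Empty windows are skipped.
    """
    buckets = {}
    for seg in segments:
        index = int(seg.start // window)
        buckets.setdefault(index, []).append(seg.text.strip())
    return [
        Chunk(start=i * window, end=(i + 1) * window, text=" ".join(texts))
        for i, texts in sorted(buckets.items())
    ]
```

Each chunk keeps its `start` and `end`, which is what makes timestamp-level attribution possible later in the pipeline.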

Happy to answer any further questions folks might have!

bobremeika · 6h ago
Source attribution with direct links to exact timestamps is truly unique when it comes to A/V RAG solutions.
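
As a rough illustration of what a timestamped source link can look like, here is a small sketch that appends a temporal fragment (`#t=start,end`, the W3C Media Fragments form) to a media URL. This is an assumption about one possible format, not a description of how Ragie builds its links; `timestamp_link` and `_format_seconds` are hypothetical helpers.

```python
from urllib.parse import urldefrag


def _format_seconds(value):
    """Render seconds with up to millisecond precision, without trailing zeros."""
    return f"{value:.3f}".rstrip("0").rstrip(".")


def timestamp_link(media_url, start, end=None):
    """Return media_url with a temporal fragment such as #t=15,30.

    Any fragment already present on the URL is replaced.
    """
    base, _ = urldefrag(media_url)
    fragment = f"t={_format_seconds(start)}"
    if end is not None:
        fragment += f",{_format_seconds(end)}"
    return f"{base}#{fragment}"
```

Paired with a chunk's start and end times, a link like this would let a retrieved result point back at the relevant stretch of the recording.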