Show HN: I built a desktop app that indexes your media locally

4 correa_brian 5 8/11/2025, 5:04:29 PM meetcosmos.com ↗

Comments (5)

correa_brian · 5h ago
Hey everyone, I'm Brian, one of the makers of Cosmos, a desktop app that makes your entire media collection, including external hard drives, searchable using local ML models.

With your catalog indexed, you can use existing content to generate videos (text-to-video and image-to-video) using Veo 3. To try this out you'll need to bring your own Gemini API key. Obviously this part is not private since you are using Google's AI, but the generations get saved to your desktop and imo it's less clunky than the Google Videos UI. We also added a prompt pre-processing step to enrich the original user input: we use Gemini to create a structured JSON prompt that includes detailed information on lighting, audio, characters, and mood, to name a few. In my experience this makes it easier to preserve continuity in your scenes.
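A heavily simplified sketch of what an enrichment step like that can look like, assuming the google-generativeai Python SDK; the schema fields beyond lighting/audio/characters/mood are illustrative, not the exact ones Cosmos uses:

```python
# Illustrative only: ask Gemini to turn a short user idea into a structured JSON
# prompt before handing it to the video model. Field names besides
# lighting/audio/characters/mood are hypothetical.
import json
import google.generativeai as genai

genai.configure(api_key="YOUR_GEMINI_API_KEY")  # bring-your-own key, as in the app

ENRICHMENT_INSTRUCTIONS = (
    "Expand the user's idea into a structured video prompt. "
    "Return JSON with keys: scene, characters, lighting, audio, mood, camera."
)

def enrich_prompt(user_input: str) -> dict:
    model = genai.GenerativeModel("gemini-1.5-flash")
    resp = model.generate_content(
        f"{ENRICHMENT_INSTRUCTIONS}\nUser idea: {user_input}",
        generation_config={"response_mime_type": "application/json"},
    )
    return json.loads(resp.text)

print(json.dumps(enrich_prompt("a lighthouse keeper at dawn"), indent=2))
```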

I want to experiment with some local generation models soon so Cosmos can function 100% offline (I've read good things about Wan 2.1 and Stable Diffusion). I really like working with local models (we also use Whisper for audio-to-text transcription) and think that, long-term, everyone will want at least some portion of their data managed by private, offline models.
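For reference, local transcription with the open-source whisper package is only a few lines (purely illustrative, not the exact setup in Cosmos):

```python
# Purely illustrative local audio-to-text pass with the open-source whisper package;
# everything runs offline once the model weights are downloaded.
import whisper

model = whisper.load_model("base")
result = model.transcribe("/path/to/clip.mp4")  # whisper uses ffmpeg to pull the audio track
for seg in result["segments"]:
    print(f"[{seg['start']:7.1f}s - {seg['end']:7.1f}s] {seg['text'].strip()}")
```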

If you are curious about building something like this for yourself, below is a rough outline (with a toy end-to-end sketch after the list):

- Pick a platform or a cross-platform tool for your build (we started with Electron and eventually moved to Tauri).
- Select your ML models. There are plenty of open-source image and text embedding models (CLIP, SigLIP, Nomic).
- Design a media processing pipeline that won't fry your users' computers (pro tip: you're going to want to throttle indexing when CPU utilization gets too high).
- Experiment with well-known open-source media tools like ImageMagick and FFmpeg. They're more than enough to extract frames, clip videos, or do anything else you might want in your pre/post-processing.
- Database choice: there are lots of options, but in my experience simpler is better. We started with Redis (it was overkill) and eventually migrated to SQLite with a vector embedding extension. I haven't tried Qdrant, Pinecone, or ChromaDB, but SQLite works great for this use case.
- If you want to support online AI platforms like OpenAI or Anthropic, you'll need to manage API keys and HTTP requests to those services (or maybe MCP? Don't know much about that yet).
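Here's the toy end-to-end sketch of those indexing/search pieces. Assumptions (not the actual Cosmos stack): ffmpeg on PATH, sentence-transformers' clip-ViT-B-32 for image/text embeddings, psutil for the CPU throttle, and plain sqlite3 with brute-force cosine search standing in for a proper vector extension.

```python
# Rough, self-contained sketch of the kind of pipeline described above.
import sqlite3, subprocess, tempfile, time
from pathlib import Path

import numpy as np
import psutil
from PIL import Image
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("clip-ViT-B-32")   # embeds both images and text
db = sqlite3.connect("media_index.db")
db.execute("CREATE TABLE IF NOT EXISTS frames (path TEXT, ts REAL, emb BLOB)")

def wait_for_cpu(max_pct: float = 80.0):
    # Throttle indexing so we don't fry the user's machine.
    while psutil.cpu_percent(interval=1) > max_pct:
        time.sleep(5)

def extract_frames(video: str, every_s: int = 5) -> list[tuple[float, Path]]:
    # Use ffmpeg to dump one frame every `every_s` seconds into a temp dir.
    out = Path(tempfile.mkdtemp())
    subprocess.run(
        ["ffmpeg", "-loglevel", "error", "-i", video,
         "-vf", f"fps=1/{every_s}", str(out / "frame_%05d.jpg")],
        check=True,
    )
    frames = sorted(out.glob("frame_*.jpg"))
    return [(i * every_s, p) for i, p in enumerate(frames)]

def index_video(video: str):
    # Embed each sampled frame and store the vector as a float32 blob.
    for ts, frame in extract_frames(video):
        wait_for_cpu()
        emb = model.encode(Image.open(frame), normalize_embeddings=True)
        db.execute("INSERT INTO frames VALUES (?, ?, ?)",
                   (video, ts, emb.astype(np.float32).tobytes()))
    db.commit()

def search(query: str, k: int = 5):
    # Brute-force cosine similarity; a vector extension would do this in SQL.
    q = model.encode(query, normalize_embeddings=True)
    rows = db.execute("SELECT path, ts, emb FROM frames").fetchall()
    scored = [(float(np.dot(q, np.frombuffer(e, dtype=np.float32))), p, t)
              for p, t, e in rows]
    return sorted(scored, reverse=True)[:k]

index_video("/path/to/movie.mp4")
print(search("a lighthouse at dawn"))
```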

Demo https://www.youtube.com/watch?v=qHPl_n-HlP4

kevintouati · 4h ago
Interesting angle. Does the search use semantic embeddings so it can surface clips by concept rather than filename/metadata? If it nails the retrieval part, that could be the real differentiator.
correa_brian · 4h ago
Yeah, exactly. We capture the semantic meaning of each frame, and that complements the filename/metadata search, so both options work.
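A toy illustration of what "both options" can mean, reusing the db and search helpers from the indexing sketch above; the fixed filename boost is made up:

```python
# Hypothetical hybrid lookup: semantic score from frame embeddings plus a plain
# filename LIKE match, merged into one ranked result list.
def hybrid_search(query: str, k: int = 10):
    semantic = search(query, k)  # (score, path, timestamp) tuples from the sketch above
    by_name = db.execute(
        "SELECT DISTINCT path FROM frames WHERE path LIKE ?", (f"%{query}%",)
    ).fetchall()
    merged = {path: score for score, path, _ in semantic}
    for (path,) in by_name:
        # Filename hits get a fixed boost so exact names still surface first.
        merged[path] = merged.get(path, 0.0) + 1.0
    return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)[:k]
```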
akshaymalik1995 · 3h ago
Do we have any metrics on the time taken to index media files or the latency for performing semantic searches on them?
correa_brian · 3h ago
I'm on an M2 and it takes <5 minutes to index a 2hr movie. If you're trying to index a lot of media at once, we queue it up. We also do smart sampling to detect similar frames, so a video that's mostly two talking heads processes faster than one with a lot of different shots; in that case the audio is more valuable than the visuals anyway.

The semantic search queries typically take 100-250ms.
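One simple way to get that smart-sampling behavior (illustrative only, not the actual Cosmos code) is to skip a frame when its embedding is nearly identical to the last one kept; the 0.97 threshold is an arbitrary guess:

```python
# Keep a frame only if it differs enough from the previous kept frame, so a
# static talking-head shot yields far fewer embeddings than a fast-cut montage.
import numpy as np

def sample_frames(frame_embeddings: list[np.ndarray], threshold: float = 0.97):
    kept, last = [], None
    for emb in frame_embeddings:
        emb = emb / np.linalg.norm(emb)
        if last is None or float(np.dot(emb, last)) < threshold:
            kept.append(emb)
            last = emb
    return kept
```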