Hey everyone, I'm Brian, one of the makers of Cosmos, a desktop app that uses local ML models to make your entire media collection, including external hard drives, searchable.
With your catalog indexed, you can use existing content to generate videos (text-to-video and image-to-video) using Veo 3. To try this out you'll need to bring your own Gemini API key. Obviously this part is not private since you are using Google's AI, but the generations get saved to your desktop and imo it's less clunky than the Google Videos UI. We also added a prompt pre-processing step to enrich the original user input: we use Gemini to create a structured JSON prompt that includes detailed information on lighting, audio, characters, and mood, to name a few. In my experience this makes it easier to preserve continuity in your scenes.
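Here is a minimal sketch of what a pre-processing step like that could look like. The key names come from the fields mentioned above, but the `scene` key, the instruction wording, and both function names are assumptions made for illustration, not the actual Cosmos schema. The call to the model itself is left out on purpose; you would send the text from `build_enrichment_request` to whichever text model you use and pass its reply to `parse_enriched_prompt`.

```python
import json

# Keys taken from the fields named in the post. The real schema is not
# public, so treat this list (and the extra "scene" key) as an assumption.
REQUIRED_FIELDS = ("lighting", "audio", "characters", "mood")


def build_enrichment_request(user_prompt: str) -> str:
    """Build the instruction text to send to a text model (hypothetical wording)."""
    fields = ", ".join(REQUIRED_FIELDS)
    return (
        "Expand the following video idea into a JSON object with the keys "
        f"{fields}, plus a 'scene' key holding a one-sentence description. "
        "Reply with JSON only.\n\n"
        f"Idea: {user_prompt}"
    )


def parse_enriched_prompt(model_reply: str) -> dict:
    """Parse the model's reply and check that every expected key is present."""
    data = json.loads(model_reply)  # raises ValueError on invalid JSON
    if not isinstance(data, dict):
        raise ValueError("enriched prompt must be a JSON object")
    missing = [key for key in (*REQUIRED_FIELDS, "scene") if key not in data]
    if missing:
        raise ValueError(f"enriched prompt is missing keys: {missing}")
    return data
```

Validating the reply before handing it to the video model is cheap, and keeping the parsed structure around lets you reuse the same characters and lighting across several generations.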
I want to experiment with some local generation models soon so Cosmos can function 100% offline (I've read good things about Wan 2.1 and Stable Diffusion). I really like working with local models (we also use Whisper for audio-to-text transcription) and think long-term everyone will want at least some portion of their data managed by private, offline models.
If you are curious about building something like this for yourself, below is a rough outline:
- Pick a platform or a cross-platform tool for your build (we started with Electron and eventually moved to Tauri)
- Select your ML models. There are plenty of open-source image and text embedding models (CLIP, SigLIP, Nomic)
- Design a media processing pipeline that won't fry your users' computer (pro tip: you're going to want to throttle indexing when CPU utilization gets too high)
- Experiment with well-known open-source media tools like ImageMagick and FFmpeg. These are more than enough to extract frames, clip videos, or do anything else you might want with a piece of media in your pre/post-processing
- Database choice: There are lots of choices for DBs, but in my experience simpler is better. We started with Redis (it was overkill) and eventually migrated to SQLite with a vector embedding extension. Haven't tried Qdrant, Pinecone, or ChromaDB, but SQLite works great for this use case.
- If you want to support online AI platforms like OpenAI or Anthropic then you'll need to manage API keys and HTTP requests to these services (or maybe MCP? Don't know much about that yet).
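To make the outline above a bit more concrete, here is a rough sketch of a throttled indexing loop that stores embeddings in SQLite and searches them. A few loud assumptions: the `embed` function is a stand-in for whatever model you picked, the CPU reading is injected as a callable (something like `lambda: psutil.cpu_percent(interval=0.1)` would work if you install psutil), and the search is a plain brute-force cosine scan rather than the vector extension mentioned above. Table and function names are made up for the example.

```python
import sqlite3
import time
from array import array
from math import sqrt
from typing import Callable, Iterable, Iterator, List, Sequence

Embedder = Callable[[str], Sequence[float]]  # hypothetical: file path -> embedding


def throttled(items: Iterable[str], cpu_percent: Callable[[], float],
              max_cpu: float = 80.0, pause_s: float = 1.0,
              sleep: Callable[[float], None] = time.sleep) -> Iterator[str]:
    """Yield items one at a time, pausing while CPU use is above max_cpu."""
    for item in items:
        while cpu_percent() > max_cpu:
            sleep(pause_s)
        yield item


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS media "
               "(path TEXT PRIMARY KEY, embedding BLOB NOT NULL)")
    return db


def index_files(db: sqlite3.Connection, paths: Iterable[str],
                embed: Embedder, cpu_percent: Callable[[], float]) -> None:
    """Embed each file and store the vector as a packed float blob."""
    for path in throttled(paths, cpu_percent):
        blob = array("f", embed(path)).tobytes()
        db.execute("INSERT OR REPLACE INTO media VALUES (?, ?)", (path, blob))
    db.commit()


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(db: sqlite3.Connection, query_vec: Sequence[float], k: int = 5) -> List[str]:
    """Brute-force scan: score every stored vector and return the top-k paths."""
    scored = []
    for path, blob in db.execute("SELECT path, embedding FROM media"):
        vec = array("f")
        vec.frombytes(blob)
        scored.append((cosine(query_vec, vec), path))
    scored.sort(reverse=True)
    return [path for _, path in scored[:k]]
```

A linear scan like this is fine for a few thousand items. Past that, a vector extension or an approximate index is the natural next step, and the table layout barely has to change.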
Demo https://www.youtube.com/watch?v=qHPl_n-HlP4