Show HN: Experimental HN Discussion to Animated Video Pipeline
I spent 2 days playing with generating a "live action talking animals" 90-second short, 100% automated from that discussion.
The result is the YouTube link for this post.
---
Here is a more detailed breakdown:
Late Sunday night, around 2am, I had a question and a wild thought.
Could I point Gemini Deep Research at https://news.ycombinator.com/item?id=44685011 and ask it to analyze the discussion, reduce it to 5 - 7 personas, and extract key quotes, themes, etc.?
If it could, the next idea was to generate a short 30-second social video illustrating the discussion.
If I avoided curation, cherry-picking, and manual intervention... would it be any good? All effort went into prompts and context, not review or massaging.
I then spent Monday and Tuesday on the idea in a time-boxed spike.
I ended up with the following process:
- Gemini Deep Research ran over the HN discussion and produced the report
- Using the report, a Gemini Pro prompt output characters and the themes
- For each character, a Gemini Pro prompt chose an animal and described their character in detail
- Using the character descriptions and the themes, a Gemini Pro prompt output a time-coded script with actions, sound effects, and dialog
- For each character, Gemini Pro produced an image prompt, fed into Imagen 4, which output a character sheet showing that character from multiple angles with all of their accessories and quirks
- With the script, Gemini Pro split it into scene files
- For each scene and its character descriptions, Gemini Pro with a JSON schema produced the video generation prompt (sketched just after this list)
- For each scene and character sheet, Imagen 4 produced the first-frame image for that scene
- Given the first frame image and the video generation prompt, Veo 3 produced an 8 second clip which includes voice, sounds, etc.
- Some dialog ran longer than 8 seconds and required manual steps: using ffmpeg to grab the last video frame, save it as a .png, and use it as the first frame of a second extension clip (also sketched below)
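To give a flavor of the scene-prompt step, here is a minimal sketch of that kind of structured-output call using the google-genai Python SDK. The schema fields, file names, and model id are illustrative placeholders, not my exact setup:

    from google import genai
    from pydantic import BaseModel

    # Illustrative schema -- the real one had more fields.
    class ScenePrompt(BaseModel):
        scene_id: str
        setting: str
        characters: list[str]
        action: str
        dialog: str
        sound_effects: list[str]

    scene_text = open("scene_03.txt").read()
    character_descriptions = open("characters.txt").read()

    client = genai.Client()  # reads the API key from the environment

    response = client.models.generate_content(
        model="gemini-2.5-pro",
        contents=scene_text + "\n\n" + character_descriptions,
        config={
            "response_mime_type": "application/json",
            "response_schema": ScenePrompt,
        },
    )
    scene_prompt = response.parsed  # a ScenePrompt instance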
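The manual frame-extension step was basically an ffmpeg one-liner wrapped in a script, roughly like this (file names illustrative):

    import subprocess

    # Decode the last second of the Veo clip and keep overwriting the output
    # image, so the final write is (approximately) the last frame.
    subprocess.run([
        "ffmpeg", "-sseof", "-1", "-i", "scene_03.mp4",
        "-update", "1", "scene_03_last_frame.png",
    ], check=True)

That .png then became the first frame of the extension clip in Flow.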
For the LLM steps, I always used the first result (no re-rolls).
For Imagen 4 I generated between 2 and 4 outputs and manually chose the best. In 100% of these samples either the first or the second was acceptable.
For Veo 3 I generated between 2 and 8 outputs and manually chose the best. Voice continuity was the biggest challenge.
I originally had complicated plans for a 9:16 mobile-ratio video where the screen would be broken up into 3 panels. Tuesday at 2pm, I abandoned this and went with a simple linear approach. I slapped things into iMovie and got it done.
I was impressed with Gemini 2.5 Pro's ability to understand the 3-panel layout and use the verticality in its directions. It had characters looking up and down like in Hollywood Squares.
This experiment was as minimally "Cherry Picked" as possible. I'm impressed with the quality of the LLM, image, and video generation output.
Lastly, I had forgotten to have the LLM art direct the opening and closing frames, so I made some stuff up and finished the project.
I learned a lot and this was a fun experiment. It was a mixture of automation and manual steps. When I did this, you could not automate Imagen 4 with a character sheet, nor Veo 3 with frame-to-video; for each of those I had to manually use Whisk and Flow (their respective UIs).
I wrote 5 scripts. I ended up with about 142 artifacts (input and output files).