Show HN: HuMo AI – Multi-modal human-centric video generator (text+image+audio)
What it does

• Text→Video with controllable motion and scene composition.
• Image→Video to animate a still with natural movement and camera motion.
• Audio-visual sync for speech-driven lip movement and rhythm-matched motion.
• Multi-modal fusion: combine text + reference images + audio in one run (see the sketch after this list).
• Export-ready output: high resolution (up to 4K) in common aspect ratios.
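To make the fusion part concrete, here's a minimal sketch of what one combined run could look like from Python. Everything in it is hypothetical: the endpoint, field names, and the generate_video helper are placeholders for illustration, not the actual HuMo API.

    # Hypothetical client sketch -- names and endpoint are illustrative,
    # not the real HuMo API.
    from pathlib import Path
    import requests

    def generate_video(prompt: str, ref_image: Path, audio: Path,
                       resolution: str = "1920x1080") -> bytes:
        """Send one multi-modal request: text + reference image + audio."""
        with ref_image.open("rb") as img, audio.open("rb") as wav:
            resp = requests.post(
                "https://api.humo.example/v1/generate",  # placeholder URL
                data={"prompt": prompt, "resolution": resolution},
                files={"reference_image": img, "audio": wav},
            )
        resp.raise_for_status()
        return resp.content  # encoded video bytes

    video = generate_video(
        "a product explainer host speaking to camera",
        Path("host.png"), Path("voiceover.wav"))
    Path("out.mp4").write_bytes(video)

The point is the shape of the workflow: one request carries all three modalities instead of chaining separate T2V, motion, and lip-sync tools.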
Why

Creative teams often juggle multiple tools: one for T2V, another for in-between motion, a third for lip-sync. I wanted a single studio that keeps identity consistent and aligns visuals with audio, which is useful for product explainers, character spots, and quick social posts.
What’s different

• Built around multi-modal conditioning rather than single-input T2V.
• Emphasis on identity/subject preservation across the whole clip.
• Frame-level audio alignment for more natural lips and motion (sketched below).
• Workflow extras like shot lists and playbooks to speed up iteration.
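On frame-level alignment: the idea is that each video frame is conditioned on the slice of audio that co-occurs with it, rather than on a single clip-level embedding. A rough sketch of that bookkeeping, with assumed shapes and rates (this is the general technique, not HuMo's internals):

    # Sketch of frame-level audio alignment -- assumed approach,
    # not HuMo's actual implementation.
    import numpy as np

    def align_audio_to_frames(audio_feats: np.ndarray, audio_hz: float,
                              n_frames: int, fps: float) -> np.ndarray:
        """Pool audio features into one conditioning vector per video frame.

        audio_feats: (T_audio, D) features sampled at audio_hz (e.g. 50 Hz).
        Returns (n_frames, D), one vector per frame.
        """
        per_frame = []
        for i in range(n_frames):
            # Audio window covering this frame's time span [i/fps, (i+1)/fps).
            start = int(i / fps * audio_hz)
            end = max(start + 1, int((i + 1) / fps * audio_hz))
            per_frame.append(audio_feats[start:end].mean(axis=0))
        return np.stack(per_frame)

    feats = np.random.randn(250, 768)  # 5 s of 50 Hz audio features
    cond = align_audio_to_frames(feats, 50.0, n_frames=120, fps=24.0)
    assert cond.shape == (120, 768)    # one conditioning vector per frame

Conditioning per frame this way is what lets lip shapes and motion track the rhythm of the audio instead of drifting against it.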