Show HN: Generative Graphics LLM Benchmark
GGBench is a benchmarking platform to compare graphics produced by AI—especially code that renders visuals (still and animated). You vote on side-by-side outputs, and we rank models with an ELO system (the same concept chess uses). It’s simple, transparent, and designed to push the whole field forward.
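Concretely, each vote is treated like a chess game between the two models: the winner's rating goes up, the loser's goes down, weighted by how surprising the result was. A minimal sketch of that update in JavaScript (the K-factor and starting ratings here are illustrative assumptions, not necessarily GGBench's actual settings):

    // Minimal Elo update after one head-to-head vote.
    // K = 32 and ratings around 1500 are common defaults, used here only for illustration.
    const K = 32;

    function expectedScore(ratingA, ratingB) {
      // Probability that A beats B under the Elo model.
      return 1 / (1 + Math.pow(10, (ratingB - ratingA) / 400));
    }

    function updateRatings(ratingA, ratingB, scoreA) {
      // scoreA: 1 if A won the vote, 0 if B won, 0.5 for a tie.
      const expectedA = expectedScore(ratingA, ratingB);
      const newA = ratingA + K * (scoreA - expectedA);
      const newB = ratingB + K * ((1 - scoreA) - (1 - expectedA));
      return [newA, newB];
    }

    // Example: model A (1500) upsets model B (1520).
    console.log(updateRatings(1500, 1520, 1)); // A gains, B loses the same amount

An upset against a much higher-rated model moves the ratings more than an expected win, which is part of what makes this kind of leaderboard hard to game with easy matchups.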
Why this benchmark needs to exist
Most LLMs are fantastic at writing code, but they're basically flying blind: they can't actually see what they produce. They output p5.js or Three.js scripts and call it a day. Whether that code renders beautifully, or at all, has been tricky to evaluate at scale.
Meanwhile, video diffusion models chew through compute to produce every pixel of every frame. Cool tech, but expensive, slow to iterate, and hard to control. Code-rendered graphics (p5.js, Three.js, shaders, SVG) are the opposite: cheap, fast, and precise. If I want a circle at (100, 100) moving at 2 px/frame, I can guarantee it. Determinism and control matter.
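To make that concrete, the moving-circle case is a handful of p5.js lines, and every frame is fully specified by the code (a rough sketch, not an actual GGBench prompt or output):

    // p5.js: a circle starting at (100, 100), moving 2 px per frame along x.
    // Deterministic: frame N looks identical on every run.
    let x = 100;

    function setup() {
      createCanvas(400, 200);
    }

    function draw() {
      background(20);
      circle(x, 100, 30); // diameter 30, centered at (x, 100)
      x += 2;             // exactly 2 px per frame
    }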
GGBench is my way to compare that code-first approach across models, in public, with ELO-based rankings that are fair and hard to game.
What we measure (and why)
We're testing an LLM's ability to write code that renders:
Still images: composition, color, structure, style fidelity.
Animations: motion quality, timing, easing, coherence across frames.

Because this is code, we also get repeatability (same prompt, same seed, same output), plus fine-grained prompts that check understanding of geometry, physics-ish motion, layering, and programmatic art patterns.
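In p5.js, repeatability mostly comes down to pinning the random and noise seeds before drawing, so a sketch renders the identical image on every run. A minimal example (the seed value and the pattern itself are arbitrary):

    // p5.js: fix the seeds so the "random" pattern is identical on every run.
    function setup() {
      createCanvas(400, 400);
      randomSeed(42);  // same seed => same sequence from random()
      noiseSeed(42);   // same seed => same Perlin noise field
      noLoop();        // render a single still frame

      background(255);
      for (let i = 0; i < 200; i++) {
        const x = random(width);
        const y = random(height);
        const d = 4 + noise(x * 0.01, y * 0.01) * 20; // noise-driven diameter
        circle(x, y, d);
      }
    }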
Categories you can explore
Early categories include:
Cityscape – skylines, parallax scrolling, night/day cycles
Nature – particles, waves, noise fields, flocking
Abstract – generative patterns, symmetry, tiling, shaders

You can filter the leaderboard by category to see which models excel where. Some models are great at structured geometry; others shine with texture and noise.
Why start with p5.js (and what's next)
I love p5.js: it's approachable, expressive, and perfect for rapid iteration. But GGBench isn't stopping there. We're adding:
Three.js for 3D scenes and camera choreography
WebGL/Shaders for low-level effects and speed
SVG for clean vector output and accessibility

As we broaden render engines, the benchmark will reveal which models can generalize across 2D, 3D, and shader-based tasks.
What this unlocks for the community
For developers: a clear target to optimize against, with transparent signals on what's working.
For researchers: a public dataset of prompts, code, outcomes, and votes for deeper analysis.
For creators: a practical way to choose the right model for your aesthetic and your stack.

Roadmap (near-term)
Three.js categories and 3D animation tests
Per-prompt and per-engine breakdowns
Submission flow for new models and agents
"Tournament mode" with fixed prompts and seeds
API for automated match scheduling and result ingestion

Jump in
If you're curious where today's models stand, or you just enjoy generative art, come help shape the rankings.
Start Voting: pick your favorite in head-to-heads
View Leaderboard: see which models are rising
About: read how everything is calculated (and verify it yourself)

This is GGBench's first public step. It's opinionated (code first, community first, transparency first) because that's what I think will actually move AI graphics forward. If that resonates with you, I'd love your votes, your critiques, and your pull requests.
Would love any feedback on this!