Launch HN: Design Arena (YC S25) – Head-to-head AI benchmark for aesthetics
(Btw, when we say real users we mean real users, so you may get a captcha on the site. Sorry, but we have to use every bot protection available! We only want human ratings, for obvious reasons.)
Here’s a demo video: https://www.youtube.com/watch?v=vPyEQnuVgeI
We didn’t set out to build this - we were actually working on an AI game engine. But we found that models sucked at look-and-feel. Even when the output code was functional, most visual aspects lacked the soul that makes great graphics feel alive.
So we built a this-or-that game, just for ourselves, to figure out which generated outputs had the best graphics. To our surprise, that turned out to be more exciting than the original idea—it turns out this is a widespread problem! We did a Show HN a month ago (https://news.ycombinator.com/item?id=44542578) and that was partly what convinced us to make this benchmark thing our actual product.
State-of-the-art models might be winning IMO gold, but they are still putting white text on a white background. There needs to be some measurement of what’s good and what isn’t (yes, there is such a thing as good design!), and it sure isn’t going to come from LLMs.
We come from engineering backgrounds (Apple and Nvidia) with a love for design; we know when we like or dislike something, even when we can’t say why. This-or-that / hot-or-not games are made for domains like this: Design Arena’s goal is to make everything stupidly simple so humans can just do the easy part: like-vs.-dislike. Which also turns out to be the valuable part, because what’s easiest for humans is actually the part that the AIs can’t currently do.
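For readers wondering how a pile of like-vs.-dislike votes can become a leaderboard, here is a minimal sketch using Elo-style updates. This is an assumption for illustration: the post does not say which rating method Design Arena uses, and the model names, `K`, and `BASE` below are hypothetical.

```python
from collections import defaultdict

K = 32          # update step size (assumed; a common Elo default)
BASE = 1000.0   # starting rating for every model (assumed)

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def rate(votes):
    """votes: iterable of (winner, loser) pairs, one per human pick."""
    ratings = defaultdict(lambda: BASE)
    for winner, loser in votes:
        e_w = expected_score(ratings[winner], ratings[loser])
        delta = K * (1.0 - e_w)   # an upset win moves ratings more
        ratings[winner] += delta
        ratings[loser] -= delta
    return dict(ratings)

# Hypothetical votes: each tuple is one head-to-head pick by a human.
votes = [("model_a", "model_b"), ("model_a", "model_c"), ("model_c", "model_b")]
leaderboard = sorted(rate(votes).items(), key=lambda kv: kv[1], reverse=True)
for name, rating in leaderboard:
    print(f"{name}: {rating:.1f}")
```

The point of the sketch is that each voter only answers "which one is better?", and the numeric ranking falls out of many such answers. Elo depends on vote order; a batch fit such as Bradley-Terry is a common alternative.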
Since our Show HN, we’ve expanded from our initial set of ~25 LLMs to 54 LLMs, 12 image models, 4 video models, 22 audio models, and 22 vibe-coding tools (like Lovable, Bolt, v0, Firebase Studio, and more). In this last category, we’ve been surprised to find that agentic tools like Devin, which were not specifically marketed as vibe-coders, performed exceedingly well, outperforming dedicated builder tools like Lovable, v0, and Bolt.
Our users are mostly devs who want to spin up a frontend, or designers who want to spin up design variants faster. In both cases, Design Arena provides a quick way to find out which options are better than others. The dev or designer still needs to make the final calls, because there’s no substitute for good judgment. But this type of format can really help.
We plan to make money by offering version testing as a service to companies that need to quantify improvements in their product between builds.
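As a rough illustration of what "quantifying improvements between builds" could look like with head-to-head votes: treat each vote as new build vs. old build, then put a confidence interval on the new build's win rate. This is a sketch under stated assumptions, not a description of the actual service. The vote counts are made up, ties are assumed to be excluded, and the Wilson score interval is one standard choice among several.

```python
import math

def wilson_interval(wins: int, total: int, z: float = 1.96):
    """Wilson score interval for a win rate; z=1.96 gives roughly 95% coverage."""
    if total == 0:
        raise ValueError("need at least one vote")
    p = wins / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half

# Hypothetical numbers: the new build won 132 of 200 head-to-head votes.
new_build_wins, total_votes = 132, 200
low, high = wilson_interval(new_build_wins, total_votes)

# Call it an improvement only if the whole interval sits above 50%.
improved = low > 0.5
print(f"win rate {new_build_wins / total_votes:.2f}, "
      f"interval ({low:.3f}, {high:.3f}), improved={improved}")
```

If the interval straddles 0.5, the honest answer is "not enough votes yet" rather than "no difference".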
This is the first time we’ve ever worked on something like this! We’d love to learn from you all and look forward to your feedback.
- Contests will often be won not by the entry that best adhered to the prompt, but by the best-looking one. This happened in the contest "Input Prompt Build a brutalist website to a typeface maker," which I got recently. The winning entry had megawatt-bright magenta and yellow, which shouldn't appear anywhere near brutalism, and in other design aspects had almost no connection to brutalism either -- but it was the most attractive of the bunch.
- The approach only gets you to a local maximum. Current LLMs aren't very good designers, as you say, so contests will involve picking between mostly middling entries. You'd want a design that's, say, a 9 or a 10 on a 10-point scale -- but some 95% of the entry distribution will probably be between 5.5 and 7.5 or so, and that's what users will get to pick from.
I definitely agree with your second point. One idea we're experimenting with is adding a human baseline, in which the models are benchmarked against human-generated designs as well.
As for good game dev prompts, here's one from a user that made a pretty fun game: Make asteroids with 2 computers playing against each other on one screen. There should be asteroids flying and 2 ships being controlled by 2 computers. Pay attention to thoroughly implementing the logic to make the ships avoid asteroids at all costs. Absolutely no user input should be necessary, no click to start, no click to restart. The game starts automatically on load and automatically restarts when either computer is dead. The ships should survive as long as possible. The ships should fly around, avoid asteroids as a priority, but also shoot asteroids and each other. Make ships and asteroids positions random each time. Asteroids should split when shot. The goal is to create a robust algorithm for ships so they can survive as long as possible. The game should be playable at 500x500 screen resolution.
Here's another one. Game out for me what would happen if I developed an LLM benchmark that was scientifically robust, but it showed that some small open-weights model outperformed Anthropic, Google or OpenAI's best offerings. Like from the POV of a startup - let's say that happened.
Sure, it can make great-looking images, but nothing can make a nice-looking poster or basic page layout.
I’m waiting for someone to solve this. I’m not even sure it takes AI; it might just be programmatic.
(Show HN: https://news.ycombinator.com/item?id=44542578)