Ask HN: What benchmarks are you using to judge AI models?
4 points by cowpig | 2 comments | 4/30/2025, 9:32:00 PM
There are so many models, and so many new ones released all the time, that I have a hard time knowing which ones are worth testing firsthand. What benchmarks have you found to be especially indicative of real-world performance?
I use:
* Aider's Polyglot benchmark, which seems to be a decent indicator of which models will be good at coding:
https://aider.chat/docs/leaderboards/
* OpenRouter usage rankings, which I take as a proxy for a model's popularity and, by extension, its utility:
https://openrouter.ai/rankings
* LLM-Stats, which collects charts for a lot of benchmarks:
https://llm-stats.com/
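If you want to collapse several of these leaderboards into one rough ordering, a simple min-max normalize-and-average works. Here's a minimal Python sketch; the model names and scores are made-up placeholders, not real leaderboard data, and "openrouter_share" is just a hypothetical stand-in for usage numbers:

```python
# Hypothetical scores (NOT real leaderboard numbers), keyed by model,
# with one entry per benchmark; higher is assumed better on each.
scores = {
    "model-a": {"polyglot": 72.0, "openrouter_share": 18.0},
    "model-b": {"polyglot": 55.0, "openrouter_share": 30.0},
    "model-c": {"polyglot": 64.0, "openrouter_share": 9.0},
}

def rank_models(scores):
    """Min-max normalize each benchmark across models, then average.

    Returns a best-first list of (average_normalized_score, model) tuples.
    """
    benchmarks = {b for per_model in scores.values() for b in per_model}
    totals = {m: 0.0 for m in scores}
    for b in benchmarks:
        vals = [scores[m][b] for m in scores]
        lo, hi = min(vals), max(vals)
        span = (hi - lo) or 1.0  # avoid divide-by-zero when all scores tie
        for m in scores:
            totals[m] += (scores[m][b] - lo) / span
    return sorted(
        ((totals[m] / len(benchmarks), m) for m in scores), reverse=True
    )

for avg, model in rank_models(scores):
    print(f"{model}: {avg:.2f}")
```

This treats every benchmark as equally important, which is a real limitation: if you care mostly about coding, you'd weight Polyglot more heavily before averaging.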
I couldn't care less about benchmarks - I know what these models are capable of from personal experience.
Just pick one and use it. The ones you’ve heard of (if you are not obsessively refreshing AI model rankings pages) are basically the same.
I’m sure I’ll get a ton of pushback that the one somebody loves is obviously so much better than the other one, but whatever.
Just give me OpenAI's most popular model, their fastest model, and their newest model. I'll pick among those three based on what I'm prioritizing in the moment (speed, depth, everyday use).