The Leaderboard Illusion

57 pongogogo 13 4/30/2025, 7:58:24 AM arxiv.org

Comments (13)

pongogogo · 1h ago
I think this is a really interesting paper from Cohere. It feels like, at this point in time, you can't trust any public benchmark; you really need your own private evals.
unkulunkulu · 1h ago
Sounds like classic inequality observed everywhere. Success leads to attention leads to more success.

Why spend evaluation resources on outsiders? Everyone wants to know exactly who is first, second, etc.; after #10, it's "do your own evaluation if this is important to you."

Thus, we have this inequality.

cainxinth · 1h ago
So attention is all you need?
ukuina · 1h ago
Bravo!
boxed · 1h ago
Is it? Sounds to me like they run the same experiment many times and keep the "best" results. Which is cheating, or, if the same thing is done in biomedical research, research fraud.
sumtechguy · 3m ago
Back in the Slashdot days I would experiment with steering conversations. This was possible due to the way Slashdot would rank and show its posts. Anything below a 3 would not change anything. But if you could get in early AND get a +5 on your post, you could drive exactly what the conversation was about, especially if you were engaged a bit and were willing to reply to a few other posts.

Basically, get in early and get a high rank and you are usually going to 'win'. It does not work every time, but it had a very high success rate. I probably should have studied it a bit more. My theory is that any stack-ranking algorithm is susceptible to this. I also suspect it works decently well because of the way people create puppet accounts to up-rank things on different platforms. But, you know, I'd need numbers to back that up...
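The "early lead compounds" theory above can be sketched as a toy rich-get-richer simulation (all parameters here are invented for illustration, not taken from Slashdot): ten posts of identical quality, where each new reader upvotes a post with probability proportional to its current score, so a small early boost tends to snowball.

```python
import random

random.seed(1)

def simulate(scores, readers=1000):
    """Each reader upvotes one post, chosen with probability
    proportional to its current score (a Polya-urn-style dynamic)."""
    scores = list(scores)
    for _ in range(readers):
        pick = random.choices(range(len(scores)), weights=scores)[0]
        scores[pick] += 1
    return scores

# Ten posts of equal quality; post 0 starts with a small early boost.
final = simulate([5] + [1] * 9)
print(final)  # post 0 usually ends up far ahead of the pack
```

This is only a caricature of a real ranking algorithm, but it shows the claimed mechanism: the outcome is driven largely by the initial boost, not by any quality difference between posts.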

ekidd · 50m ago
Also, I've been hearing a lot of complaints that Chatbot Arena tends to favor:

- Lots of bullet points in every response.

- Emoji.

...even at the expense of accurate answers. And I'm beginning to wonder if the sycophantic behavior of recent models ("That's a brilliant and profound idea") is also being driven by Arena scores.

Perhaps LLM users actually do want lots of bullets, emoji and fawning praise. But this seems like a perverse dynamic, similar to the way that social media users often engage more with content that outrages them.

kozikow · 9m ago
More than that: at this point it feels to me that arenas are getting too focused on fitting user preferences rather than measuring actual model quality.

In reality I prefer different models for different things, and quite often it's because model X is tuned to return more of what I prefer. E.g. Gemini usually tends to be the best for non-English, ChatGPT works better for me personally for health questions, ...

jmmcd · 1h ago
Absolutely devastating for the credibility of FAIR.
aredox · 1h ago
The fact that these big LLM developers devote a significant amount of effort to gaming benchmarks is a big show of confidence that they are making progress towards AGI and will recoup those billions of dollars and man-hours /s
amelius · 1h ago
Are the benchmark prompts public, and isn't that where the problem lies?
StevenWaterman · 36m ago
No: even if the benchmarks are private, it's still an issue, because you can overfit to the benchmark by trying X random variations of the model and picking the one that performs best on it.

It's similar to how I could pass any multiple-choice exam if you let me keep attempting it and told me my overall score at the end of each attempt, even if you never told me which answers were right or wrong.
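The selection effect described above is easy to demonstrate with a toy simulation (the benchmark size, variant count, and accuracy below are made-up numbers): 200 equally mediocre model variants are scored on a private 100-question benchmark, and only the best score is reported.

```python
import random

random.seed(0)

N_QUESTIONS = 100     # hypothetical private benchmark size
N_VARIANTS = 200      # number of model variants tried
TRUE_ACCURACY = 0.5   # every variant is equally mediocre

def benchmark_score():
    """Score one variant: each question is answered correctly with the
    same underlying probability -- no variant is genuinely better."""
    return sum(random.random() < TRUE_ACCURACY for _ in range(N_QUESTIONS))

scores = [benchmark_score() for _ in range(N_VARIANTS)]

print(f"average variant score: {sum(scores) / N_VARIANTS:.1f}/100")
print(f"best-of-{N_VARIANTS} score: {max(scores)}/100")
```

The average score hovers around the true 50%, but the reported best-of-200 score lands well above it, even though nothing was learned about the private questions: the leaderboard entry measures selection pressure, not model quality.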

leto_ii · 1h ago
Is this sarcasm? Otherwise I'm not sure how that follows. It seems more reasonable to believe that they're hitting walls and switching to PR and productizing.