Personally I think LLM benchmarks make agents worse. All these companies chase the benchmarks, overfit, and think being able to cheat at the math olympiad is gonna get us to AGI. Instead researchers should peer in and get me an agent that can reliably count the number of "i"'s in mississippi.
upperhalfplane · 1h ago
I don't quite think they cheat at math olympiads, but there are obviously blind spots on the unspectacular tasks. That being said, Mississippi is both a good and a bad question to ask. On the one hand it's "the bare minimum" to require; on the other hand, is it really a feat? Like, most models can write a piece of code that computes it. If you give me a task I'm not built to solve (like counting the number of i's in a text), the smart thing is to write a program to count them, which LLMs can do.
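For concreteness, a minimal sketch of the kind of program meant here, in plain Python (my own example, not taken from any particular model's output):

    # Count occurrences of "i" in a word -- trivial to write,
    # even for a model that can't reliably count letters "in its head".
    word = "mississippi"
    count = word.count("i")
    print(f'the word "{word}" contains {count} "i"s')  # -> 4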
The best way to measure intelligence is probably whether a model knows its own strengths and weaknesses and deals with them efficiently, and that ability is what an eval should really be measuring.
tianlong · 1h ago
What's the TL;DR on how to solve that benchmark problem?