I wrote an essay outlining why common AI benchmarks are not terribly useful, arguing that we should mostly rely on normal user experience instead.
Key reasons:
1) Most questions are not simply ‘wrong’ or ‘right’
2) Most user problems are poorly defined
3) Agents are getting popular, and they combine and compound both of these problems