Exploring LLM Evaluation by Using Games

3 Yuxuan_Zhang13 1 6/30/2025, 8:59:20 PM lmgame.org

Comments (1)

Yuxuan_Zhang13 · 6h ago
Pokémon Red is becoming a go-to benchmark for testing advanced AIs such as Gemini. But is Pokémon Red really a good eval? We study this problem and identify three issues: 1⃣ Navigation tasks are too hard. 2⃣ Combat control is too simple. 3⃣ Raising a strong Pokémon team is slow and expensive as an eval.

We find that most of these problems are not fundamental to games themselves, but stem from how the games have been used. We believe game-as-an-eval remains a compelling and underutilized evaluation strategy.

We introduce Lmgame Bench to standardize game-as-an-eval. More details and findings are in our blog post: https://lmgame.org/#/blog/pokemon_red