Evaluating LLMs Playing Text Adventures

54 todsacerdoti 31 8/12/2025, 3:19:35 PM entropicthoughts.com ↗

Comments (31)

henriquegodoy · 52m ago

Looking at this evaluation it's pretty fascinating how badly these models perform even on decades old games that almost certainly have walkthroughs scattered all over their training data. Like, you'd think they'd at least brute force their way through the early game mechanics by now, but honestly this kinda validates something I've been thinking about like real intelligence isn't just about having seen the answers before, it's about being good at games and specifically new situations where you can't just pattern match your way out

This is exactly why something like arc-agi-3 feels so important right now. Instead of static benchmarks that these models can basically brute force with enough training data, like designing around interactive environments where you actually need to perceive, decide, and act over multiple steps without prior instructions, that shift from "can you reproduce known patterns" to "can you figure out new patterns" seems like the real test of intelligence.

What's clever about the game environment approach is that it captures something fundamental about human intelligence that static benchmarks miss entirely, like, when humans encounter a new game, we explore, form plans, remember what worked, adjust our strategy all that interactive reasoning over time that these text adventure results show llms are terrible at, we need systems that can actually understand and adapt to new situations, not just really good autocomplete engines that happen to know a lot of trivia.

da_chicken · 6m ago

I saw it somewhere else recently, but the idea is that LLMs are language models, not world models. This seems like a perfect example of that. You need a world model to navigate a text game.

Otherwise, how can you determine that "North" is a context change, but not always a context change.

msgodel · 7m ago

I've been experimenting with this as well with the goal of using it for robotics. I don't think this will be as hard to train for as people think though.

It's interesting he wrote a separate program to wrap the z-machine interpreter. I integrated my wrapper directly into my pytorch training program.

SquibblesRedux · 56m ago

This is another great example of how LLMs are not really any sort of AI, or even proper knowledge representation. Not saying they don't have their uses (like souped up search and permutation generators), but definitely not something that resembles intelligence.

nonethewiser · 28m ago

While I agree, it's still shocking how far next token prediction gets us to looking like intelligence. It's amazing we need examples such as this to demonstrate it.

SquibblesRedux · 18m ago

Another way to think about it is how interesting it is that humans can be so easily influenced by strings of words. (Or images, or sounds.) I suppose I would characterize it as so many people being earnestly vulnerable. It all makes me think of Kahneman's [0] System 1 (fast) and System 2 (slow) thinking.

[0] "Thinking, Fast and Slow" https://en.wikipedia.org/wiki/Thinking,_Fast_and_Slow

andai · 27m ago

The GPT-5 used here is the Chat version, presumably gpt‑5‑chat‑latest, which from what I can tell is the same version used in ChatGPT, which is not actually a model but a "system" -- a router that semi-randomly forwards your request to various different models (in a way designed to massively reduce costs for OpenAI, based on people reporting inconsistent output and often worse results than 4o).

So from this it seems that not only would many of these requests not touch a reasoning model (or as it works now, have reasoning set to "minimal"?), but they're probably being routed to a mini or nano model?

It would make more sense, I think, to test on gpt-5 itself (and ideally the -mini and -nano as well), and perhaps with different reasoning effort, because that makes a big difference in many evals.

andrewla · 1h ago

The article links to a previous article discussing methodology for this. The prompting is pretty extensive.

It is difficult here to separate out how much of this could be fixed or improved by better prompting. A better baseline might be to just give the LLM direct access to the text adventure, so that everything the LLM replies is given to the game directly. I suspect that the LLMs would do poorly on this task, but would undoubtedly improve over time and generations.

EDIT: Just started playing 9:05 with GPT-4 with no prompting and it did quite poorly; kept trying to explain to me what was going on with the ever more complex errors it would get. Put in a one line "You are playing a text adventure game" and off it went -- it took a shower and got dressed and drove to work.

jameshart · 1h ago

Nothing in the article mentioned how good the LLMs were at even entering valid text adventure commands into the games.

If an LLM responds to “You are standing in an open field west of a white house” with “okay, I’m going to walk up to the house”, and just gets back “THAT SENTENCE ISN'T ONE I RECOGNIZE”, it’s not going to make much progress.

throwawayoldie · 1h ago

"You're absolutely right, that's not a sentence you recognize..."

lottaFLOPS · 1h ago

related research that was also announced this week: https://www.textquests.ai/

1970-01-01 · 1h ago

Very interesting how they all clearly suck at it. Even with hints, they can't understand the task enough to complete the game.

abraxas · 51m ago

that's a great tracker. How often is the laderboard updated?

throwawayoldie · 2h ago

My takeaway is: LLMs are not great at text adventures, even when those text adventures are decades old and have multiple walkthroughs available on the Internet. Slow clap.

benlivengood · 1h ago

Wouldn't playthroughs for these games be potentially in the pretraining corpus for all of these models?

quesera · 39m ago

Reproducing specific chunks of long form text from distilled (inherently lossy) model data is not something that I would expect LLMs to be good at.

And of course, there's no actual reasoning or logic going on, so they cannot compete in this context with a curious 12 year old, either.

throwawayoldie · 1h ago

As a longtime IF fan, I can basically guarantee there are.

fzzzy · 1h ago

I tried this earlier this year. I wrote a tool that let an llm play Zork. It was pretty fun.

bongodongobob · 14m ago

Did you do anything special? I tried this with just copy and paste with GPT-4o and it was absolutely terrible at it. It usually ended up spamming help in a loop and trying commands that didn't exist.

wiz21c · 41m ago

adventure games require spatial reasoning (although text based), requires understanding puns, requires cultural references, etc. For me they really need human-intelligence to be solved (heck, they've been designed like that).

I find it funny that some AI do very good score on ARC-AI but fails at these games...

ForHackernews · 2h ago

What blogging software is this with the sidenotes?

hombre_fatal · 2h ago

Noticed it was written in org mode with custom css so I found this post on their site: https://entropicthoughts.com/new-and-improved-now-powered-by...

the_af · 2h ago

I know they define "achievements" in order to measure "how well" the LLM plays the game, and by definition this is arbitrary. As an experiment, I cannot argue with this.

However, I must point out the kind of "modern" (relatively speaking) adventure games mentioned in the article -- which are more accurately called "interactive fiction" by the community -- is not very suitable for this kind of experiment. Why? Well, because so many of them are exploratory/experimental, and not at all about "winning" (unlike, say, "Colossal Cave Adventure", where there is a clear goal).

You cannot automate (via LLM) "playing" them, because they are all about the thoughts and emotions (and maybe shocked laughter) they elicit in human players. This cannot be automated.

If you think I'm being snobby, consider this: the first game TFA mentions is "9:05". Now, you can set goals for a bot to play this game, but truly -- if you've played the game -- you know this would be completely missing the point. You cannot "win" this game, it's all about subverting expectations, and about replaying it once you've seen the first, most straightforward ending, and having a laugh about it.

Saying more will spoil the game :)

(And do note there's no such thing as "spoiling a game" for an LLM, which is precisely the reason they cannot truly "play" these games!)

fmbb · 46m ago

Of course you can automate ”having fun” and ”being entertained”. That is if you believe humanity will ever build artificial intelligence.

drdeca · 28m ago

A p-zombie would not have fun or be entertained, only act like it does. I don’t think AGI requires being unlike a p-zombie in this way.

the_af · 8m ago

> Of course you can automate ”having fun” and ”being entertained”

This seems like begging the question to me.

I don't think there's a mechanistic (as in "token predictor") procedure to generate the emotions of having fun, or being surprised, or amazed. It's not on me to demonstrate it cannot be done, it's on them to demonstrate it can.

But to be clear, I don't think the author of TFA is making this claim either. They are simply approaching IF games from a "problem solving" perspective -- they don't claim this has anything to do with fun or AGI -- and what I'm arguing is that this mechanistic approach to IF games, i.e. "problem solving", only touches on a small subset of what makes people want to play these games. They are often (not all, as the author rightly corrects me, but often) about generating surprise and amazement in the player, something that cannot be done to an LLM.

(Note I'm also not dismissing the author's experiment. As an experiment it's interesting and, I'd argue, fun for the author).

Current, state of the art LLMs cannot feel amazement, or nothing else really (and, I argue, no LLM in the current tech branch will ever can). I hope this isn't a controversial statement.

kqr · 1h ago

I disagree. Lockout, Dreamhold, Lost Pig, and So Far are new games but in the old style. Plundered Hearts is literally one of the old games (though ahead of its time).

I'll grant you that 9:05 and For a Change are somewhat more modern: the former has easy puzzles, the latter very abstract puzzles.

I disagree new text adventures are not about puzzles and winning. They come in all kinds of flavours these days. Even games like 9:05 pace their narrative with traditional puzzles, meaning we can measure forward progress just the same. And to be fair, LLMs are so bad at these games that in these articles, I'm merely trying to get them to navigate the world at all.

If anything, I'd argue Adventure is a bad example of the genre you refer to. It was (by design) more of a caving simulator/sandbox with optinal loot than a game with progress toward a goal.

dfan · 1h ago

As the author of For A Change, I am astonished that anyone would think it was a good testbed for an LLM text adventure solver. It's fun that they tried, though.

kqr · 1h ago

Thank you for making it. The imagery of it is striking and comes back to me every now and then. I cannot unhear "a high wall is not high to be measured in units of length, but of angle" -- beautifully put.

The idea was that it'd be good example of having to navigate somewhat foreign but internally consistent worlds, an essential text adventure skill.

dfan · 52m ago

Ha, I didn't realize that I was replying to the person who wrote the post!

The audience I had in mind when writing it was people who were already quite experienced in playing interactive fiction and could then be challenged in a new way while bringing their old skills to bear. So it's sort of a second-level game in that respect (so is 9:05, in different ways, as someone else mentioned).

the_af · 1h ago

We will have to agree to disagree, if you'll allow me the cliche.

I didn't use Adventure as an example of IF, it belongs in the older "text adventure" genre. Which is why I thought it would be more fitting to test LLMs, since it's not about experiences but about maxing points.

I think there's nothing to "solve" that an LLM can solve about IF. This genre of games, in its modern expression, is about breaking boundaries and expectations, and making the player enjoy this. Sometimes the fun is simply seeing different endings and how they relate to each other. Since LLMs cannot experience joy or surprise, and can only mechanically navigate the game (maybe "explore all possible end states" is a goal?), they cannot "play" it. Before you object: I'm aware you didn't claim the LLMs are really playing the game!

But here's a test for your set of LLMs: how would they "win" at "Rematch"? This game is about repeatedly dying, understanding what's happening, and stringing together a single sentence that will break the cycle and win the game. Can any LLM do this, a straightforward puzzle? I'd be impressed!

Claude Sonnet 4 now supports 1M tokens of context (anthropic.com)

Claude vs. Gemini: Testing on 1M Tokens of Context (every.to)

Show HN: Building a web search engine from scratch with 3B neural embeddings (blog.wilsonl.in)

Show HN: Omnara – Run Claude Code from Anywhere (github.com)

Launch HN: Design Arena (YC S25) – Head-to-head AI benchmark for aesthetics

Evaluating LLMs Playing Text Adventures (entropicthoughts.com)

Multimodal WFH setup: flight SIM, EE lab, and music studio in 60sqft/5.5M² (sdo.group)

Weave (YC W25) is hiring a founding AI engineer (ycombinator.com)

Training language models to be warm and empathetic makes them less reliable (arxiv.org)

Nexus: An Open-Source AI Router for Governance, Control and Observability (nexusrouter.com)

Why are there so many rationalist cults? (asteriskmag.com)

RISC-V single-board computer for less than 40 euros (heise.de)

Australian court finds Apple, Google guilty of being anticompetitive (ghacks.net)

StarDict sends X11 clipboard to remote servers (lwn.net)

Enlisting in the Fight Against Link Rot (jszym.com)

Wikipedia loses challenge against Online Safety Act (bbc.com)

Modos Paper Monitor – Open-hardware e-paper monitor and dev kit (crowdsupply.com)

The "high-level CPU" challenge (yosefk.com)

I tried every todo app and ended up with a .txt file (al3rez.com)

GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models [pdf] (arxiv.org)

A Spellchecker Used to Be a Major Feat of Software Engineering (prog21.dadgum.com)

The Ancient Art and Intimate Craft of Artificial Eyes (thereader.mitpress.mit.edu)

GitHub is (again) having issues (githubstatus.com)

Monero appears to be in the midst of a successful 51% attack (twitter.com)

Perplexity Makes Longshot $34.5B Offer for Chrome (wsj.com)

That viral video of a 'deactivated' Tesla Cybertruck is a fake (theverge.com)

Qodo CLI agent scores 71.2% on SWE-bench Verified (qodo.ai)

The Article in the Most Languages (en.wikipedia.org)

Claude Code is all you need (dwyer.co.za)

Artificial biosensor can better measure the body's main stress hormone (medicalxpress.com)

GitHub is no longer independent at Microsoft after CEO resignation (theverge.com)

Why We Migrated from Neon to PlanetScale (blog.opensecret.cloud)

The ex-CIA agents deciding Facebook's content policy (2022) (mronline.org)

Starbucks in Korea asks customers to stop bringing in printers/desktop computers (fortune.com)

New 3D Laser Scanner Developed for Harvesting Robots (uni-wuerzburg.de)

All known 49-year-old Apple-1 computers (apple1registry.com)

Show HN: I built an offline, open‑source desktop Pixel Art Editor in Python (github.com)

OpenSSH Post-Quantum Cryptography (openssh.com)

High-severity WinRAR 0-day exploited for weeks by 2 groups (arstechnica.com)

Generic drugs, dirty plants, and FDA exemptions (propublica.org)

Undefined Behavior in C and C++ (2024) (russellw.github.io)

Weathering Software Winter (2022) (100r.co)

What does it mean to be thirsty? (quantamagazine.org)

FreeBSD Scheduling on Hybrid CPUs (wiki.freebsd.org)

Neki – Sharded Postgres by the team behind Vitess (planetscale.com)

Radicle 1.3.0 (radicle.xyz)

Perplexity makes bold $34.5B bid for Google's Chrome browser (reuters.com)

The History of Windows XP (abortretry.fail)

The value of institutional memory (timharford.com)

Launch HN: Halluminate (YC S25) – Simulating the internet to train computer use

Evaluating LLMs Playing Text Adventures

Comments (31)