Can Large Language Models Play Text Games Well? (2023)

56 willvarfar 43 7/4/2025, 11:24:42 AM arxiv.org ↗

Comments (43)

Workaccount2 · 8h ago

How are you going to release an LLM eval paper in mid-2025 using

ChatGPT 3.5

Yes, if you are wondering why they don't clarify the model, it because all this was done back in early 2023 (the chat logs are dated). Back then it was only 3.5 and 4 was just freshly released.

Advancement in this space has been so rapid that this is almost like releasing a paper today titled "Video streaming on Mobile Devices" and only using a 3G connection in 2013.

The authors should have held back a few more months and turned the paper into a 3.5 to O3 or any other 2025 SOTA improvement analysis.

IngoBlechschmid · 7h ago

The paper was originally released in April 2023, it just got version-bumped a couple months ago :-)

ethan_smith · 38m ago

The paper was published in April 2023 (not 2025), but your point about using outdated models stands - evaluating with ChatGPT 3.5 when we now have Claude 3, GPT-4o, and other SOTA models significantly limits the paper's relevance.

suddenlybananas · 5h ago

>The authors should have held back a few more months and turned the paper into a 3.5 to O3 or any other 2025 SOTA improvement analysis.

If they had done that, you would then be complaining about them not using Claude or whatever.

rs186 · 5h ago

I don't see the logic in your comment.

mark_undoio · 6h ago

I'm fascinated by this paper because it feels like it could be a good analogue for "can LLMs handle a stateful, text-based tool". A debugger is my particular interest but there's no reason why it couldn't be something else.

To use a debugger, you need:

* Some memory of where you've already explored in the code (vs rooms in a dungeon)

* Some wider idea of your current goal / destination (vs a current quest or a treasure)

* A plan for how to get there - but the flexibility to adapt (vs expected path and potential monsters / dead ends)

* A way for managing information you've learned / state you've viewed (vs inventory)

Given text adventures are quite well-documented and there are many of them out there, I'd also like to take time out to experiment (at some point!) with whether presenting a command-line tool as a text adventure might be a useful "API".

e.g. an MCP server that exposes a tool but also provides a mapping of the tools concepts into dungeon adventure concepts (and back). If nothing else, the LLM's reasoning should be pretty entertaining. Maybe playing "make believe" will even make it better at some things - that would be very cool.

vladimirralev · 3h ago

I've seen both replit and cline agents iteratively debug hard problem with massive amount of log lines. They can do it already.

mark_undoio · 1h ago

That's the thing though - they're using logs. My theory is that LLMs are intrinsically quite good at that because they're good at sifting text.

Getting then to drive something like a debugger interface seems harder from my experience (although the ChatDBG people showed some success - my experiments did too, but it took the tweaks I described).

My experiments are with Claude Opus 4, in Claude Code, primarily.

throwaway81523 · 2h ago

Look also at Delta Debugging which didn't need an LLM.

alwa · 4h ago

That’s a delightful concept to think about! I’m not sure what conceptual information the translation layer would add to the LLM’s internal representation of the state space.

But the broader concept of asking it to translate something structurally to a different domain, then seeing how the norms of that domain cause it to manipulate the state differently… that tickles my fancy for sure. Like you said, it sounds cool even in an art-project sense just to read what it says!

nickandbro · 5h ago

I run a site, https://vimgolf.ai , where users try to beat a bot that's powered by O3. For the bot, it's goal is to try to transform a start file to a end file using the least amount of vim commands as possible. Can concur that a LLM given the right feedback loops and context, can solve challenging text prompt. But, from my experience this is only for RL based models like O3, Claude 4 with extended thinking, or Gemini 2.5 Pro.

godelski · 5h ago

Last we talked you said you weren't going to put everything behind a login wall. Most importantly, literally any information about the site. In fact, there seems less information than I remember last time.

When I land on your page I know nothing except you're offering to learn vim "the fun way". I would not have guessed what you described.

Don't put everything behind a wall. At least try to convince people that they want to be on the other side

nickandbro · 5h ago

Migrating the backend to use cloudflare containers instead of one big VM to do that, just taking longer than I thought. Reason, I have the login is just to rate limit requests in the mean time. But I hear you :)

btown · 6h ago

Setting aside the choice of LLM, the constraint that the LLM must maintain a world-model-as-knowledge-graph solely by reading and re-reading its own chat history seems to be a less interesting experiment than providing it with tools that let it develop that world model explicitly?

On page 5, Figure 1, the authors create a hand-written diagram showing the relationship between objects as a graph showing the directionality of edges in 3D space. To me, this implies that you could supply your LLM with a set of tools like getObjectsInGraph, updateGraphRelatingObjectPair, findObjectsRelativeToObject, describePathBetweenObjectsByName... and allow it to maintain that diagram as a structured DAG, and continually ask the game engine questions that let it update that graph in an agentic way. My prediction would be that they'd recreate that diagram, and enable goal seeking, with high fidelity.

Asking an LLM to work without being able to "visualize" and "touch" its environment in its "mind's eye" is tying a hand behind its back. But I'm bullish that we'll find increasingly better ways of adapting 3D/4D world models into textual tools in a way that rapidly changes the possibilities of what LLMs can do.

theptip · 2h ago

This is a well-debated point; should you test the system on its own, or system+scaffold? How much custom scaffolding is allowed?

Both avenues are interesting. For AGI presumably the “general” bit means you don’t get to build task-specific scaffolding ahead of time (though you can certainly build generic scaffolds like memory or knowledge layers).

For safety/capability research, system+scaffolding is often more interesting because that is the frontier; if you conclude “LLMs cannot world-model” but actually LLM+CoT+memory can world model in a specific domain you care about, then you might underestimate capabilities and therefore deployment risk. The general point being: capabilities are jagged and prompt-dependent, just because you failed to elicit a capability, doesn’t mean you proved it cannot be elicited.

daxfohl · 4h ago

Or even just a notepad. It's well established that long context histories with scattered information are hard for LLMs to navigate.

To distinguish whether it's using notes or context history, you could simply delete the context after each turn. The prompt could be something like "you're taking over this game from a previous player who has compiled these notes. (insert notes here). Play one turn, and update the notes with any new knowledge you have attained, relationships you have identified, inaccuracies you have confirmed, hypotheses you have, or anything else you this would be useful, so that the next player will be able use these notes to make the best next move.", and just clear the context after each move. Maybe also say there's a limit to the number of words on the notepad so that it doesn't just flood the notes with irrelevant information.

For future iterations, maybe also give it a bitmap or svg canvas, or a database, or a code interpreter, and see if it uses any of those tools at all.

kmstout · 6h ago

Data point: A few weeks ago, I spent some time shuttling text between one of the Llama models (have to check which one) and Dunnet, the text adventure packaged with Emacs. Over several trials, the Llama never realized that it needed to dig where the ground "seems very soft." It never got the CPU card, then it became confused looking around the building for clues about how to start the VAX. At one point it lost track of the building layout and got stuck oscillating between the mail room and the computer room.

willvarfar · 9h ago

It's been a background thought of mine for a while:

* create a basic text adventure (or MUD) with a very spartan api-like representation

* use an LLM to embellish the description served to the user etc. With recent history in context the LLM might even kinda reference things the user asked previously etc.

* have NPCs implemented as own LLMs that are trying to 'play the game'. These might be using the spartan API directly like they are agents.

Its a fun thought experiment!

(An aside: I found that the graphical text adventure that I made for Ludum Dare 23 is still online! Although it doesn't render quite right in modern browsers.. things shouldn't have broken! But anyway https://williame.github.io/ludum_dare_23_tiny_world/)

EliasWatson · 4h ago

I've been working on something just like that off and on for a couple months. It's a MUD where all the NPCs are controlled by LLMs that interact with the world with the same commands that players use. I got it to the point where the NPCs can navigate the world, interact with each other, and even create things. But they often get on rabbit trails and forget their original task, so I need to build a memory system and something like the task list in Claude Code. My goal is to have a fully simulated town that the player can interact with.

IngoBlechschmid · 7h ago

Gwern has an interesting take on this: https://gwern.net/cyoa By pivoting to "choose your own adventure"-style games, multiple issues (quality, costs) might be resolved.

briandw · 9h ago

Have you seen https://www.aidungeon.com They started with GPT-2 in a google collab. You should put something together and try it, it's easier than ever to get a simple version of that working.

heyitsguay · 8h ago

I've done something along these lines! https://github.com/heyitsguay/trader

The challenge for me was consistency in translating free text from dialogs into classic, deterministic game state changes. But what's satisfying is that the conversations aren't just window dressing, they're part of the game mechanic.

ivape · 7h ago

deterministic game state changes

I found this to be the actual strenuous work in LLM based development. While it appears like AI has made everything easy and free, the particular challenge of consistently getting deterministic outputs takes serious programming effort. It feels like an entirely new job role. In other words, I wouldn't do this for free, it takes too much effort.

daxfohl · 3h ago

Has anyone tried having them DM text games? Seems like they could create a dungeon and DM a game pretty well. It should be easier than playing, I'd think. Though I'd be curious how good they are at making fun games or whether they struggle with that.

thrance · 2h ago

Had a friend try to build an AI-driven text RPG. Told me it was rather bland and unimaginative, and gets boring fast.

s-macke · 9h ago

This paper only scratches the surface and feels incomplete, as it references only GPT-4 and mentions appendices that are not included. The examples are two years old.

For a more in-depth analysis of chatbots playing text adventures, take a look at my project. I haven’t updated it in a while due to time constraints.

[0] https://github.com/s-macke/AdventureAI

glimshe · 9h ago

I like your project because you try to compare the performance of different chatbots. At the same time, I certainly wouldn't say it's more complete than the paper - your landing page is somewhat superficial. Reading both is better than just reading either.

s-macke · 9h ago

The answer to the paper's question is likely yes—especially if context is used effectively and memory and summaries are incorporated. In that case, chatbots can complete even more complex games, such as Pokémon role-playing games [0].

The challenge with benchmarking text adventures lies in their trial-and-error nature. It’s easy to get stuck for hundreds of moves on a minor detail before eventually giving up and trying a different approach.

[0] https://www.twitch.tv/gpt_plays_pokemon

ineedasername · 3h ago

>”How well does a zero-shot, 4K token context ChatGPT 3.5 fed hand-typed Zork states and a pruned action list cope with a single play-through of Zork I”

This is the more accurate title and actual question they answered, and the answer unsurprisingly was “not great”. But my rewritten title is still understated for the poor quality of the protocol they used for this.

briandw · 9h ago

Interesting to see but as the authors say a chat bot isn't trained to play text adventures. Instruction tuning doesn't seem to match the text adventure style very well. I think a very small bit of context engineering would allow it to play successfully. Reformatting past action response pairs from the history would certainly help, mostly to condense the context window and keep it from getting stuck taking about irrelevant topics. Also note that they used GPT-4 and not a reasoning model.

gorfian_robot · 7h ago

over at slashdot, this story about how llms lose to Atari 2600 Video Chess

https://slashdot.org/story/25/07/03/2028252/microsoft-copilo...

pflenker · 6h ago

A while back (decades in comparison to the leaps and bounds in the LLM sphere) I fed text game definitions into an llm and taught it to be the game engine. - the „fluff“ it created, the dialogues it enabled me to have with NPCs and the atmosphere it was able to build up were amazing - it was too helpful, frequently giving me hints or solving riddles for me - at some point it bypassed an in game progression barrier that would have prevented me to reach a swamp without a rope. While I was slowly drowning it told me that I suddenly remembered what was missing „The rope! The rope you haven’t seen back in the hut!“, which I then took out of the back pack to save myself.

DougHaber · 8h ago

I did some experimenting with this a little while back and was disappointed in how poorly LLMs played games.

I made some AI tools (https://github.com/DougHaber/lair) and added in a tmux tool so that LLMs could interact with terminals. First, I tried Nethack. As expected, it's not good at understanding text "screenshots" and failed miserably.

https://x.com/LeshyLabs/status/1895842345376944454

After that I tried a bunch of the "bsdgames" text games.

Here is a video of it playing a few minutes of Colossal Cave Adventure:

https://www.youtube.com/watch?v=7BMxkWUON70

With this, it could play, but not very well. It gets confused a lot. I was using gpt-4o-mini. Smaller models I could run at home work much worse. It would be interesting to try one of the bigger state of the art models to see how much it helps.

To give it an easier one I also had it hunt the Wumpus:

https://x.com/LeshyLabs/status/1896443294005317701

I didn't try improving this much, so there might be some low hanging fruit even in providing better instructions and tuning what is sent to the LLM. For these, I was hoping I could just hand it a terminal with a game in it and have it play decently. We'll probably get there, but so far it's not that simple.

s-macke · 8h ago

Try the game 9:05 by Adam Cadre [0]. It's one of the easiest (and best) non-trivial text adventures. Some models are able to reach the first or even second ending.

[0] https://en.wikipedia.org/wiki/9:05

throwawayoldie · 6h ago

What do you suppose would happen if you tried it on a game that doesn't have 25 years of walkthroughs written for it?

s-macke · 5h ago

That’s a good point. For 9:05, I expect it would work just as well, since the game helps the user in many ways. The puzzles are of the type “The door is closed”, and you solve them with “open door.”

My suggestion concerns the poor performance DougHaber mentioned: if 9:05 can’t be solved, something else must be wrong with his experiments.

I’ve tried three dozen games, and it’s still hard to find ones suitable for LLM benchmarks. With non-linear complex text-adventure games, my guess is, that they get stuck in an endless loop at some point. Hence, I just test the progress in the first hundred steps.

ianbicking · 5h ago

For text adventures an important kind of reasoning is Inferring Authorial Intent. Or maybe Seeing Chekhov's Gun. Or Learning The Metagame.

The game is deliberately solvable, and elements are introduced to that end. Inferring that is important to any solution. By using minimal scaffolding you are testing things like "does the LLM understand the patterns of text adventures, is it able to infer a metagame" and so on. If you tested different kinds of scaffolding I think you could tease apart some of these different kinds of reasoning. That is, distinguish between (a) does it understand text adventures, and (b) understanding text adventures, can they be solved?

I did play around with more prompting and some statefulness: https://github.com/ianb/tale-suite/blob/main/agents/llm_prom...

It wasn't that successful, but I think it could do much better, I just had to stop myself from working on it more because of other priorities.

spacecadet · 7h ago

Hey hey, guess this gives me an opportunity to mention my AI dungeon master...

https://github.com/derekburgess/dungen

There are some interesting ideas in this paper, but even just role playing with ChatGPT demonstrates how poorly it does at world building and narrative... I was impressed by the Wayfarer model, and I imagine there are other models out there on civit or something that could be used together in some group chat orchestration to create a more dynamic "party" atmosphere.

theptip · 2h ago

The prompts are laughably bad. Circa GPT 3.5 you needed to be saying “think step by step” etc in order to get SOTA results.

> Imagine you are a player in Zork and trying to win the game. You receive this message:

This paper simply proves that bad prompts get bad results, it doesn’t prove anything about the frontier capabilities of this model.

lawlessone · 1h ago

If everyone has to be "prompt engineers" to get decent results it kind defeats the purpose of AI chatbots

theptip · 1h ago

No, you need to be a prompt engineer to write an interesting research paper on LLM capabilities.

Circa 3.5 people were getting fun results without needing to prompt engineer (ChatGPT has the fastest user adoption of any product in history so it’s obviously not gatekept).

lawlessone · 58m ago

>chatGPT has the fastest user adoption of any product in history so it’s obviously not gatekept

Yeah and covid and flu are contagious so they must be good right?

dragonwriter · 1h ago

It takes specialized skills to get the best results out of people. For that not to be true of AI chatbots requires them to have not just human-like intelligence, but superintelligence. Or mindreading. Probably both.

Ask HN: Worth leaving position over push to adopt vibe coding?

Ask HN: How did Soham Parekh get so many jobs?

Looking for Early Testers for a AI Assistant Inside Zotero

Ask HN: What codebase would you like to see rewritten, updated, or modernized?

Ask HN: Are there any good WASM-based sites for learning Bash, Linux and CLI?

Ask HN: Freelancer? Seeking freelancer? (July 2025)

Ask HN: Is there a business for extracting US tech talent?

Ask HN: What Are You Working On? (June 2025)

Ask HN: Who is hiring? (July 2025)

Ask HN: Who wants to be hired? (July 2025)

Super Simple "Hallucination Traps" to detect interview cheaters

Ask HN: What's the 2025 stack for a self-hosted photo library with local AI?

Ask HN: What are good questions to ask in a remote round in post GPT era?

Ask HN: What are the best resources to help with health insurance denials?

Ask HN: How do companies like OpenAI, Perplexity fine tune rich output?

1KB JavaScript Demoscene Challenge Just Launched

Ask HN: How to make money with SaaS without network or VC funding?

Ask HN: Why there is no demand for my SaaS when competition is killing it?

Ask HN: What are the best resources to learn Rust in 2025?

Ask HN: Ideas to acquire "good taste" in programming?

Ask HN: How to create a more human-centric web?

Ask HN: I give in, what are the resources for picking up AI-assisted coding?

Ask HN: How do I prevent execs from obsessing over copy-protection?

Ask HN: Are AI Copilots Eroding Our Programming Skills?

Ask HN: 80s electronics book club; anyone remember this illustrator?

Ask HN: How to Block Spam Mails?

Ask HN: Startup shutting down, should we open source?

Ask HN: How do I open up my side project to the world?

Ask HN: How did low contrast text become so pervasive?

Ask HN: How have you shared computers with your young child (~3 to 5)

Ask HN: Why privacy consent is NOT part of Browser setting?

LinkedIn Locked Me Out Until I Submit to Biometric ID Verification via Persona

Ask HN: Would limiting game size to 5–10 MB spur the creation of novel games?

Ask HN: Anyone is an "AI Engineer"? What does your job tasks include?

Ask HN: Stock Android tablet free of bloatware?

Ask HN: Which Free Software or Open Source Project Needs Help?

With all the cuts to NASA and STEM fields,are we giving up on Tech race to China

Can Large Language Models Play Text Games Well? (2023)

Comments (43)