Looking at this evaluation it's pretty fascinating how badly these models perform even on decades old games that almost certainly have walkthroughs scattered all over their training data. You'd think they'd at least brute force their way through the early game mechanics by now, but honestly this kind of validates something I've been thinking about: real intelligence isn't just about having seen the answers before, it's about being good at games and specifically new situations where you can't just pattern match your way out.
This is exactly why something like ARC-AGI-3 feels so important right now. Instead of static benchmarks that these models can basically brute force with enough training data, it's designed around interactive environments where you actually need to perceive, decide, and act over multiple steps without prior instructions. That shift from "can you reproduce known patterns" to "can you figure out new patterns" seems like the real test of intelligence.
What's clever about the game environment approach is that it captures something fundamental about human intelligence that static benchmarks miss entirely. When humans encounter a new game, we explore, form plans, remember what worked, and adjust our strategy: all the interactive reasoning over time that these text adventure results show LLMs are terrible at. We need systems that can actually understand and adapt to new situations, not just really good autocomplete engines that happen to know a lot of trivia.
godelski · 16h ago
> real intelligence isn't just about having seen the answers before, it's about being good at games and specifically new situations where you can't just pattern match your way out
It is insane to me that so many people believe intelligence is measurable by pure question-and-answer testing. There are hundreds of years of discussion about how limited this is for measuring human intelligence. I'm sure we all know someone who's a really good test taker but who you wouldn't consider particularly bright. I'm sure every single one of us also knows someone in the other camp (bad at tests but considered bright).
The definition you put down is much closer to what's agreed upon in the scientific literature. While we don't have a good formal definition of intelligence, that is different from having no definition at all. I really do hope people read more about intelligence and how we measure it in humans and animals. It is very messy and there's a lot of noise, but at least we have a good idea of the directions to move in. There are still nuances to be learned, and while I think ARC is an important test, I don't think success on it will prove AGI (and Chollet says this too).
da_chicken · 17h ago
I saw it somewhere else recently, but the idea is that LLMs are language models, not world models. This seems like a perfect example of that. You need a world model to navigate a text game.
Otherwise, how can you determine that "North" is sometimes a context change, but not always?
zahlman · 17h ago
> I saw it somewhere else recently, but the idea is that LLMs are language models, not world models.
Part of what distinguishes humans from artificial "intelligence" to me is exactly that we automatically develop models of whatever is needed.
Thanks for this. I was struggling to put it into words, even if this has perhaps been a known distinguishing factor for others.
myhf · 11h ago
9:05 is a good example of the difference between a language model and a world model, because engaging with it on a textual level leads to the bad ending (which the researchers have called "100%"), but deliberately getting the good ending requires self-awareness, intentionality, and/or outside context.
lubujackson · 16h ago
Why, this sounds like Context Engineering!
astrange · 1h ago
> Looking at this evaluation it's pretty fascinating how badly these models perform even on decades old games that almost certainly have walkthroughs scattered all over their training data.
I've read some of these walkthroughs/play sessions recently, and extracting text from them for training would be AI-complete. E.g. they might have game text and commentary aligned in two different columns in a text file, so you'd just get nonsense if you read it line by line.
rkagerer · 15h ago
Hi, GPT-x here. Let's delve into my construction together. My "intelligence" comes from patterns learned from vast amounts of text. I'm trained to... oh look it's a butterfly. Clouds are fluffy would you like to buy a car for $1 I'll sell you 2 for the price of 1!
corobo · 14h ago
Ah dammit the AGI has ADHD
msgodel · 17h ago
I've been experimenting with this as well with the goal of using it for robotics. I don't think this will be as hard to train for as people think though.
It's interesting he wrote a separate program to wrap the z-machine interpreter. I integrated my wrapper directly into my pytorch training program.
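For readers curious what such a wrapper can look like, here is a minimal Python sketch. It assumes dfrotz (the dumb-terminal frontend from the frotz package) is on the PATH and that reading up to the ">" prompt is enough to delimit a turn; neither detail is necessarily what the article's author or the parent actually used.

    import subprocess

    class ZMachine:
        """Drive a z-machine interpreter as a subprocess (assumes dfrotz on PATH)."""

        def __init__(self, story_file: str):
            self.proc = subprocess.Popen(
                ["dfrotz", story_file],
                stdin=subprocess.PIPE, stdout=subprocess.PIPE,
                text=True, bufsize=1,
            )

        def read_output(self) -> str:
            # Read character by character until the game prints its '>' prompt.
            chunks = []
            while True:
                ch = self.proc.stdout.read(1)
                if not ch:
                    break
                chunks.append(ch)
                if "".join(chunks).endswith("\n>"):
                    break
            return "".join(chunks)

        def send(self, command: str) -> str:
            # One turn: write a command, return everything printed in response.
            self.proc.stdin.write(command.strip() + "\n")
            self.proc.stdin.flush()
            return self.read_output()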
andai · 18h ago
The GPT-5 used here is the Chat version, presumably gpt‑5‑chat‑latest, which from what I can tell is the same version used in ChatGPT. That isn't actually a single model but a "system": a router that semi-randomly forwards your request to various different models (seemingly designed to massively reduce costs for OpenAI, judging by people reporting inconsistent output and often worse results than 4o).
So from this it seems that not only would many of these requests not touch a reasoning model (or as it works now, have reasoning set to "minimal"?), but they're probably being routed to a mini or nano model?
It would make more sense, I think, to test on gpt-5 itself (and ideally the -mini and -nano as well), and perhaps with different reasoning effort, because that makes a big difference in many evals.
EDIT: Yeah the Chat router is busted big time. It fails to apply thinking even for problems that obviously call for it (analyzing financial reports). You have to add "Think hard." to the end of the prompt, or explicitly switch to the Thinking model in the UI.
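For what it's worth, here is a rough sketch of what pinning the model and reasoning effort through the API (rather than the ChatGPT router) might look like. It assumes the current OpenAI Python SDK's Responses API and that gpt-5 accepts a reasoning-effort parameter there; the exact names are my assumption, not something from the article.

    from openai import OpenAI

    client = OpenAI()
    game_output = ("West of House\n"
                   "You are standing in an open field west of a white house.")

    # Same game turn at three explicit effort levels, no router involved.
    for effort in ("minimal", "medium", "high"):
        response = client.responses.create(
            model="gpt-5",                      # the API model, not gpt-5-chat-latest
            reasoning={"effort": effort},
            input="You are playing a text adventure game.\n\n" + game_output
                  + "\n\nReply with a single game command.",
        )
        print(effort, "->", response.output_text)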
kqr · 17h ago
This is correct, and was the reason I made sure to always append "Chat" to the end of "GPT-5". I should perhaps have been more clear about this. The reason I settled for the lesser router is I don't have access to the full GPT-5, which would have been a much better baseline, I agree.
andai · 17h ago
Do they require a driver's license to use it? They asked for my ID for o3 Pro a few months ago.
kqr · 17h ago
That's the step at which I gave up, anyway.
varenc · 14h ago
> Yeah the Chat router is busted big time... You have to add "Think hard." to the end of the prompt, or explicitly switch to the Thinking model in the UI.
I don't really get this gripe? It seems no different than before, except now it will sometimes opt into thinking harder by itself. If you know you want CoT reasoning you just select gpt5-thinking, no different than choosing o4-mini/o3 like before.
SquibblesRedux · 18h ago
This is another great example of how LLMs are not really any sort of AI, or even proper knowledge representation. Not saying they don't have their uses (like souped up search and permutation generators), but definitely not something that resembles intelligence.
nonethewiser · 18h ago
While I agree, it's still shocking how far next-token prediction gets us toward something that looks like intelligence. It's amazing we need examples such as this to demonstrate it.
SquibblesRedux · 18h ago
Another way to think about it is how interesting it is that humans can be so easily influenced by strings of words. (Or images, or sounds.) I suppose I would characterize it as so many people being earnestly vulnerable. It all makes me think of Kahneman's [0] System 1 (fast) and System 2 (slow) thinking.

[0] "Thinking, Fast and Slow" https://en.wikipedia.org/wiki/Thinking,_Fast_and_Slow

It is kinda shocking, but I'm sure ELIZA was too for many people back then. It just took less time to realize what was going on there.
seanwilson · 17h ago
I won't be surprised if LLMs get good at puzzle-heavy text adventures once more attention is turned to this.
I've found that for text adventures based on item manipulation, variations of the same puzzles appear again and again, because there's a limit to how many obscure-but-not-too-obscure item puzzles you can come up with. So training would cover exact matches of the same puzzle as well as variations, like different ways of opening locked doors.
Puzzles like key + door, crowbar + panel, dog + food, coin + vending machine, vampire + garlic, etc. You can obscure or layer puzzles, like changing the garlic into garlic bread, which would still work on the vampire, so there are logical connections to make but often nothing too crazy.
A lot of the difficulty in these games comes from not noticing or forgetting about clues/hints and potential puzzles because there's so much going on, which is less likely to trip up a computer.
You can already ask LLMs "in a game: 20 ways to open a door if I don't have the key", "how to get past an angry guard dog" or "I'm carrying X, Y, and Z, how do I open a door", and it'll list lots of ways that are seen in games, so it's going to be good at matching that with the current list of objects you're carrying, items in the world, and so on.
Another comment mentions how the AI needs a world model that transforms as actions are performed, but you need something similar to reason about maths proofs and code, where you have to keep track of the current state/context. And most adventure games don't require you to plan many steps in advance anyway. They're often about figuring out which item to combine/use with which other item next (where only one combination works), and navigating to the room that contains the latter item first.
So it feels like most of the parts are already there to me, and it's more about getting the right prompts and presenting the world in the right format e.g. maintaining a table of items, clues, and open puzzles, to look for connections and matches, and maintaining a map.
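As a concrete sketch of that last idea (a table of items, clues, and open puzzles re-serialized into the prompt each turn), something like the following. The field names are illustrative, not taken from the article:

    from dataclasses import dataclass, field

    @dataclass
    class WorldState:
        inventory: list = field(default_factory=list)
        items_seen: dict = field(default_factory=dict)     # item -> room it was seen in
        clues: list = field(default_factory=list)
        open_puzzles: list = field(default_factory=list)
        room_map: dict = field(default_factory=dict)        # room -> {direction: room}

        def to_prompt(self) -> str:
            # Re-serialized and prepended to the game transcript every turn.
            return (f"Inventory: {', '.join(self.inventory) or 'nothing'}\n"
                    f"Items seen: {self.items_seen}\n"
                    f"Clues: {self.clues}\n"
                    f"Open puzzles: {self.open_puzzles}\n"
                    f"Map: {self.room_map}\n")

    state = WorldState()
    state.inventory.append("garlic bread")
    state.open_puzzles.append("vampire blocks the cellar door")
    print(state.to_prompt())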
Getting LLMs to get good at variations of The Witness would be interesting, where the rules have to be learned through trial and error, and combined.
jlarocco · 6h ago
Doesn't it kind of defeat the purpose, though?
If you have to train the AIs on every specialized new problem, and then you have to babysit them as you apply them to similar problems, why even bother?
It's not really intelligent in any real sense.
jononor · 1h ago
Automation can be useful and valuable (economically) even if not intelligent.
Heck, from a big-picture view of solving a problem (say, manufacturing something), a solution/process/workflow that requires less intelligence may be the preferable one, if such a solution can be found. It can be expected to be cheaper, more robust, and more repeatable.
jameshart · 19h ago
Nothing in the article mentioned how good the LLMs were at even entering valid text adventure commands into the games.
If an LLM responds to “You are standing in an open field west of a white house” with “okay, I’m going to walk up to the house”, and just gets back “THAT SENTENCE ISN'T ONE I RECOGNIZE”, it’s not going to make much progress.
throwawayoldie · 19h ago
"You're absolutely right, that's not a sentence you recognize..."
kqr · 17h ago
The previous article (linked in this one) gives an idea of that.
jameshart · 17h ago
I did see that. But since that focused really on how Claude handled that particular prompt format, it's not clear whether the LLMs that scored low here were just failing to produce valid input, struggling to handle that specific prompt/output structure, or doing fine at basically operating the text adventure but struggling to build a world model and solve problems.
kqr · 17h ago
Ah, I see what you mean. Yeah, there was too much output from too many models at once (combined with not enough spare time) to really perform useful qualitative analysis on all the models' performance.
andrewla · 19h ago
The article links to a previous article discussing methodology for this. The prompting is pretty extensive.
It is difficult here to separate out how much of this could be fixed or improved by better prompting. A better baseline might be to just give the LLM direct access to the text adventure, so that everything the LLM replies is given to the game directly. I suspect that the LLMs would do poorly on this task, but would undoubtedly improve over time and generations.
EDIT: Just started playing 9:05 with GPT-4 with no prompting and it did quite poorly; kept trying to explain to me what was going on with the ever more complex errors it would get. Put in a one line "You are playing a text adventure game" and off it went -- it took a shower and got dressed and drove to work.
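A minimal sketch of that "direct access" baseline, where every model reply is piped verbatim into the game with only a one-line system prompt. The model name, message format, and send_to_game hook are assumptions; send_to_game would be any function that feeds a command to the interpreter and returns the new game text.

    from openai import OpenAI

    def play(send_to_game, first_output, turns=50, model="gpt-4o"):
        # Whatever the model says is given to the game directly, no parsing layer.
        client = OpenAI()
        history = [{"role": "system",
                    "content": "You are playing a text adventure game. "
                               "Reply with exactly one game command per turn."}]
        observation = first_output
        for _ in range(turns):                      # fixed command budget
            history.append({"role": "user", "content": observation})
            reply = client.chat.completions.create(model=model, messages=history)
            command = reply.choices[0].message.content.strip()
            history.append({"role": "assistant", "content": command})
            observation = send_to_game(command)
            print(">", command, "\n", observation)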
throwawayoldie · 20h ago
My takeaway is: LLMs are not great at text adventures, even when those text adventures are decades old and have multiple walkthroughs available on the Internet. Slow clap.
gibbitz · 11h ago
This study raises the question: why do we play games? Do we play to win or to enjoy ourselves? Why design a machine to do what we should be enjoying? This goes for writing, creating art, coding. Wanting a machine to win is the desire to achieve a goal without doing the work to earn it. Same for making art or writing novels. The point of these things (growth and achievement) is lost when done by a machine. I want to see this done with investment, legal strategy or business management. These are better suited to LLMs than what we're making them do, but I'd venture that those who are profiting from LLMs right now would profit less if they were replaced by LLMs by their boards.
tjr · 11h ago
I imagine that pitting LLMs against computer games is itself an enjoyable activity.
Generally speaking, people play games for fun, and I suspect that will continue. Even if an LLM can beat all humans at computer games, it doesn't matter. We will continue to enjoy playing them. Computers, pre-LLM, could already out-play humans in many cases.
Other activities mentioned -- writing, art, coding, etc. -- can indeed be fun, but they are also activities that people have been paid to do. It seems that there is incentive to create LLMs that can do an at least adequate job of these tasks for less money than humans are paid, so that that money is rerouted to LLM companies instead of human workers. I imagine humans will continue to write, create art, and even code, without any financial incentive, though probably less.
(I personally remain unpersuaded that LLMs will do away with paid creative work altogether, but there's clearly a lot of interest in trying to maximize what LLMs can do.)
standardly · 15h ago
LLMs work really well for open-ended role-playing sessions, but not so much games with strict rules.
They just can't seem to grasp what would make a choice a "wrong" choice in a text-based adventure game, so the game ends up having no ending. You have to hard-code failure events, or you just never get anything like "you chose to attack the wizard, but he's level 99, dummy, so you died - game over!". The model just accepts whatever choice you make, ad infinitum.
My best session was one in which I had the AI give me 4 dialogue options to choose from. I never "beat" the game, and we never solved the mystery - it just kept going further down the rabbit hole. But it was surprisingly enjoyable, and replayable! A larger framework just needs to be written to keep the tires between the lines and to hard-code certain game rules - what's under the hood is already quite good for narratives imo.
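To sketch what hard-coding those rules around an LLM narrator could look like (the rule contents here are invented for illustration, not taken from any existing framework):

    # Check the player's action against explicit failure rules first;
    # only let the LLM narrate freely when no rule fires.
    FAILURE_RULES = [
        (lambda action, state: "attack" in action and state.get("wizard_level", 0) >= 99,
         "The level 99 wizard vaporizes you. Game over."),
        (lambda action, state: "jump" in action and state.get("location") == "cliff edge",
         "You step into thin air. Game over."),
    ]

    def resolve_turn(action, state, narrate):
        """Return (narration, game_over); `narrate(action, state)` is the LLM call."""
        for rule, ending in FAILURE_RULES:
            if rule(action.lower(), state):
                return ending, True
        return narrate(action, state), False

    # The hard-coded rule fires no matter how agreeable the model is.
    text, over = resolve_turn("attack the wizard", {"wizard_level": 99},
                              lambda a, s: "(LLM narration here)")
    print(text, over)   # -> "The level 99 wizard vaporizes you. Game over." True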
They seem to be going for a much simpler route of just giving the LLM a full transcript of the game with its own reasoning interspersed. I didn't have much luck with that, and I'm worried it might not be effective once we're into the hundreds of turns because of inadvertent context poisoning. It seems like this might indeed be what happens, given the slowing of progress indicated in the paper.
1970-01-01 · 18h ago
Very interesting how they all clearly suck at it. Even with hints, they can't understand the task enough to complete the game.
abraxas · 18h ago
That's a great tracker. How often is the leaderboard updated?
kolinko · 6h ago
I’m missing two things from the article:
- the testing prompt (were the LLMs instructed to progress in the game, as opposed to just explore? The author said smarter LLMs were more likely to explore)
- a benchmark against humans
8f2ab37a-ed6c · 9h ago
Are we anywhere near someone being able to play a D&D or WoD type of game session with an LLM somewhere in the mix, perhaps generating a new and interesting adventure every time? Or is this still science fiction for now?
ileonichwiesz · 4h ago
“Somewhere in the mix”? Sure. I’ve seen TTRPG folks experimenting with LLMs for about as long as LLMs have been around. A couple years later it still can’t really run a session, but you might find it useful for generating some details of an adventure.
The big issue really is making its output interesting. As usual, an LLM will default to the most generic outcomes and settings. For a TTRPG to be fun you usually want surprise, drama, creativity - it can’t do any of those.
Wouldn't playthroughs for these games be potentially in the pretraining corpus for all of these models?
quesera · 18h ago
Reproducing specific chunks of long form text from distilled (inherently lossy) model data is not something that I would expect LLMs to be good at.
And of course, there's no actual reasoning or logic going on, so they cannot compete in this context with a curious 12 year old, either.
throwawayoldie · 19h ago
As a longtime IF fan, I can basically guarantee there are.
wiz21c · 18h ago
Adventure games require spatial reasoning (although text-based), understanding puns, cultural references, etc. For me they really need human intelligence to be solved (heck, they've been designed that way).
I find it funny that some AIs score very well on ARC-AGI but fail at these games...
fzzzy · 19h ago
I tried this earlier this year. I wrote a tool that let an llm play Zork. It was pretty fun.
bongodongobob · 18h ago
Did you do anything special? I tried this with just copy and paste with GPT-4o and it was absolutely terrible at it. It usually ended up spamming help in a loop and trying commands that didn't exist.
fzzzy · 14h ago
I have my own agent loop that I wrote, and I gave it a tool which it uses to send input to the parser. I also had a step which took the previous output and generated an image for it. It was just a toy, but it was pretty fun.
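For readers who haven't built one of these: the "tool" part can be as small as a single function schema the model is allowed to call each turn. A hypothetical sketch in the OpenAI function-calling style (not the parent's actual implementation):

    SEND_COMMAND_TOOL = {
        "type": "function",
        "function": {
            "name": "send_command",
            "description": "Send one command to the text adventure parser "
                           "and get back the game's response.",
            "parameters": {
                "type": "object",
                "properties": {
                    "command": {"type": "string",
                                "description": "A single parser command, e.g. 'open mailbox'."},
                },
                "required": ["command"],
            },
        },
    }
    # The agent loop passes tools=[SEND_COMMAND_TOOL] on each request, executes the
    # tool call against the interpreter, and appends the game's output to the history.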
daxfohl · 7h ago
A while ago I tried something similar but tried to boil it down to the simplest thing I could come up with. I ended up making a standard maze into a first-person perspective where it unfolds one step at a time, and seeing if a model could solve it without re-entering areas it had already fully explored. They all failed.
Setup: a maze generator generates a square maze and puts the start and end on opposite corners. It doesn't show the full maze to the LLM; it just has the LLM explore it one square at a time like a text adventure. It tells the LLM which directions from its current position have walls (relative directions: front, back, left, right). The LLM then chooses between move forward, turn left, and turn right. That's pretty much it.
First attempt: just maintain all the above in a chat, step by step. It'd get lost pretty quickly and start re-exploring already-explored areas quite readily. Not very surprising, as we all know they can get lost in long chat threads. The chat model seemed to just go forward or turn right forever (which can work in some mazes), whereas the thinking model did seem to avoid backtracking until it hit a T-junction on a wrong path, where it always seemed to go back and forth forever.
Second attempt: After each step, tell the LLM to "externalize" everything it knew about the maze, and then feed that to a brand new LLM context. The idea was to avoid long chat context problems and see if the LLM could adequately represent its internal state and knowledge such that a "new" LLM could take over. This really didn't end up working any better. The biggest problem was that sometimes it would think that "turn left" would also change the position, and sometimes not. There were other issues too, so I didn't go much further with this approach.
Third attempt: Tell the LLM the premise of the game, and tell it to create a python state machine that stores all the state information it would need to represent its progress through the maze, and then to emit specific keywords when it needed to interact with it (and I added some code that served as a proxy). This also didn't work great. The state machine was close, but one thing it always forgot to do was relate index with direction. So if it's "in cell (5, 5) and facing up", it wouldn't know whether "forward" would be an increase or decrease in the x or y index.
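The missing piece the generated state machines kept fumbling is only a few lines of bookkeeping; the coordinate convention is arbitrary, which is exactly why it has to be pinned down explicitly somewhere. A sketch:

    HEADINGS = ["up", "right", "down", "left"]      # clockwise order
    DELTAS = {"up": (0, -1), "right": (1, 0), "down": (0, 1), "left": (-1, 0)}

    def turn(heading, direction):
        # Turning changes only the heading, never the position.
        i = HEADINGS.index(heading)
        return HEADINGS[(i + 1) % 4] if direction == "right" else HEADINGS[(i - 1) % 4]

    def forward(pos, heading):
        # Moving forward changes only the position, according to the heading.
        dx, dy = DELTAS[heading]
        return (pos[0] + dx, pos[1] + dy)

    pos, heading = (5, 5), "up"
    heading = turn(heading, "left")    # -> "left", pos unchanged
    pos = forward(pos, heading)        # -> (4, 5)
    print(pos, heading)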
I was also amused by its sycophancy here. I'd ask it:
"Would adding a map to the state machine output be useful?"
"Yes, that is a great idea, let's do that!"
It'd do a great job of adding the map, but then I'd ask, "Does a map create more opportunity for confusion?"
"Yes, that's an excellent insight, let's remove it!"
"No, really, you're the LLM, you're the one who's going to be using this app. I'm asking you, what do you think?"
"Whatever you want to do, just tell me"
Eventually, as the OP pointed out, these costs do add up pretty quickly. All I was after was "does externalizing the state help solve some of the long chat context problems", and the answer was "no" enough for me.
EDIT: Note that in all cases, they 100% emitted valid commands. And also I never noticed a case where "move forward" was chosen when there was a wall in front of them, nor "turn" when they were in the middle of a corridor.
https://github.com/derekburgess/dungen

It's a configurable pipeline for generative dungeon master role play content with a zork-like UI. I use a model called "Wayfarer" which is designed for challenging role play content and I find that it can be pretty fun to engage with.
ForHackernews · 20h ago
What blogging software is this with the sidenotes?
I know they define "achievements" in order to measure "how well" the LLM plays the game, and by definition this is arbitrary. As an experiment, I cannot argue with this.
However, I must point out the kind of "modern" (relatively speaking) adventure games mentioned in the article -- which are more accurately called "interactive fiction" by the community -- is not very suitable for this kind of experiment. Why? Well, because so many of them are exploratory/experimental, and not at all about "winning" (unlike, say, "Colossal Cave Adventure", where there is a clear goal).
You cannot automate (via LLM) "playing" them, because they are all about the thoughts and emotions (and maybe shocked laughter) they elicit in human players. This cannot be automated.
If you think I'm being snobby, consider this: the first game TFA mentions is "9:05". Now, you can set goals for a bot to play this game, but truly -- if you've played the game -- you know this would be completely missing the point. You cannot "win" this game, it's all about subverting expectations, and about replaying it once you've seen the first, most straightforward ending, and having a laugh about it.
Saying more will spoil the game :)
(And do note there's no such thing as "spoiling a game" for an LLM, which is precisely the reason they cannot truly "play" these games!)
fmbb · 18h ago
Of course you can automate "having fun" and "being entertained". That is, if you believe humanity will ever build artificial intelligence.
drdeca · 18h ago
A p-zombie would not have fun or be entertained, only act like it does. I don’t think AGI requires being unlike a p-zombie in this way.
the_af · 18h ago
> Of course you can automate "having fun" and "being entertained"
This seems like begging the question to me.
I don't think there's a mechanistic (as in "token predictor") procedure to generate the emotions of having fun, or being surprised, or amazed. It's not on me to demonstrate it cannot be done, it's on them to demonstrate it can.
But to be clear, I don't think the author of TFA is making this claim either. They are simply approaching IF games from a "problem solving" perspective -- they don't claim this has anything to do with fun or AGI -- and what I'm arguing is that this mechanistic approach to IF games, i.e. "problem solving", only touches on a small subset of what makes people want to play these games. They are often (not all, as the author rightly corrects me, but often) about generating surprise and amazement in the player, something that cannot be done to an LLM.
(Note I'm also not dismissing the author's experiment. As an experiment it's interesting and, I'd argue, fun for the author).
Current state-of-the-art LLMs cannot feel amazement, or anything else really (and, I argue, no LLM in the current tech branch ever will). I hope this isn't a controversial statement.
Terr_ · 17h ago
That's like saying it's wrong to test a robot's ability to navigate and traverse a mountain... because the mountain has no win-condition and is really a context for human emotional experiences.
The purpose of the test is whatever the tester decides it is. If that means finding X% of the ambiguously-good game endings within a budget of Y commands, then so be it.
the_af · 16h ago
> The purpose of the test is whatever the tester decides it is.
Well, I did say:
> As an experiment, I cannot argue with this.
It was more a reflection on the fact that the primary goal of a lot of modern IF games, among which is "9:05", the first game mentioned in TFA, is not like "traversing a mountain". Traversing a mountain can have clear and meaningful goals, such as "reach the summit", or "avoid getting stuck", or "do not die or go missing after X hours". Though of course, appreciating nature and sightseeing is beyond the scope of an LLM.
Indeed, "9:05" has no other "goal" than, upon seeing a different ending from the main one, revisiting the game with the knowledge gained from that first playthrough. I'm being purposefully opaque in order not to spoil the game for you (you should play it, it's really short).
Let me put it another way: remember that fad, some years ago, of making you pay attention to an image or video, with a prompt like "colorblind people cannot see this shape after X seconds" so you pay attention and then BAM! A jump scare! Haha, joke's on you!
How would you "test" an LLM on such a jump scare? The goal is to scare a human. LLMs cannot be scared. What would the possible answers be?
A: I do not see any disappearing shapes after X seconds. Beep boop! I must not be colorblind, nor human, for I am an LLM. Beep!
or maybe
B: This is a well-known joke. Beep boop! After some short time, a monster appears on screen. This is intended to scare the person looking at it! Beep!
Would you say either response would show the LLM "playing" the game?
(Trust me, this is a somewhat adjacent effect to what "9:05" would play on you, and I fear I've said too much!)
kqr · 19h ago
I disagree. Lockout, Dreamhold, Lost Pig, and So Far are new games but in the old style. Plundered Hearts is literally one of the old games (though ahead of its time).
I'll grant you that 9:05 and For a Change are somewhat more modern: the former has easy puzzles, the latter very abstract puzzles.
I disagree that new text adventures are not about puzzles and winning. They come in all kinds of flavours these days. Even games like 9:05 pace their narrative with traditional puzzles, meaning we can measure forward progress just the same. And to be fair, LLMs are so bad at these games that in these articles I'm merely trying to get them to navigate the world at all.
If anything, I'd argue Adventure is a bad example of the genre you refer to. It was (by design) more of a caving simulator/sandbox with optional loot than a game with progress toward a goal.
dfan · 19h ago
As the author of For A Change, I am astonished that anyone would think it was a good testbed for an LLM text adventure solver. It's fun that they tried, though.
kqr · 18h ago
Thank you for making it. The imagery of it is striking and comes back to me every now and then. I cannot unhear "a high wall is not high to be measured in units of length, but of angle" -- beautifully put.
The idea was that it'd be a good example of having to navigate somewhat foreign but internally consistent worlds, an essential text adventure skill.
dfan · 18h ago
Ha, I didn't realize that I was replying to the person who wrote the post!
The audience I had in mind when writing it was people who were already quite experienced in playing interactive fiction and could then be challenged in a new way while bringing their old skills to bear. So it's sort of a second-level game in that respect (so is 9:05, in different ways, as someone else mentioned).
the_af · 18h ago
We will have to agree to disagree, if you'll allow me the cliche.
I didn't use Adventure as an example of IF; it belongs in the older "text adventure" genre. Which is why I thought it would be more fitting for testing LLMs, since it's not about experiences but about maxing points.
I think there's nothing to "solve" that an LLM can solve about IF. This genre of games, in its modern expression, is about breaking boundaries and expectations, and making the player enjoy this. Sometimes the fun is simply seeing different endings and how they relate to each other. Since LLMs cannot experience joy or surprise, and can only mechanically navigate the game (maybe "explore all possible end states" is a goal?), they cannot "play" it. Before you object: I'm aware you didn't claim the LLMs are really playing the game!
But here's a test for your set of LLMs: how would they "win" at "Rematch"? This game is about repeatedly dying, understanding what's happening, and stringing together a single sentence that will break the cycle and win the game. Can any LLM do this, a straightforward puzzle? I'd be impressed!
kqr · 17h ago
I think I see what you mean and with these clarifications we are in agreement. There is a lot of modern works of interactive fiction that goes way beyond what the old text adventures did, and work even when judged as art or literature. I just haven't played much of it because I'm a fan of the old-style games.
As for the specific question, they would progress at Rematch by figuring out ever more complicated interactions that work and will be used to survive, naturally.
[0] "Thinking, Fast and Slow" https://en.wikipedia.org/wiki/Thinking,_Fast_and_Slow