There is a formal extensional equivalence between Markov chains & LLMs but the only person who seems to be saying anything about this is Gary Marcus. He is constantly making the point that symbolic understanding cannot be reduced to a probabilistic computation: regardless of how large the graph gets, it will still be missing basic capabilities like backtracking (which is available in programming languages like Prolog).
I think that Gary is right on basically all counts. Probabilistic generative models are fun but no amount of probabilistic sequence generation can be a substitute for logical reasoning.
Certhas · 4h ago
I don't understand what point you're hinting at.
Either way, I can get arbitrarily good approximations of arbitrary nonlinear differential/difference equations using only linear probabilistic evolution at the cost of a (much) larger state space. So if you can implement it in a brain or a computer, there is a sufficiently large probabilistic dynamic that can model it. More really is different.
So I view all deductive ab-initio arguments about what LLMs can/can't do due to their architecture as fairly baseless.
(Note that the "large" here is doing a lot of heavy lifting. You need _really_ large. See https://en.m.wikipedia.org/wiki/Transfer_operator)
What part about backtracking is baseless? Typical Prolog interpreters can be implemented in a few MBs of binary code (the high level specification is even simpler & can be in a few hundred KB)¹ but none of the LLMs (open source or not) are capable of backtracking even though there is plenty of room for a basic Prolog interpreter. This seems like a very obvious shortcoming to me that no amount of smooth approximation can overcome.
If you think there is a threshold at which point some large enough feedforward network develops the capability to backtrack then I'd like to see your argument for it.
¹https://en.wikipedia.org/wiki/Warren_Abstract_Machine
> but none of the LLMs (open source or not) are capable of backtracking even though there is plenty of room for a basic Prolog interpreter. This seems like a very obvious shortcoming to me that no amount of smooth approximation can overcome.
The fundamental autoregressive architecture is absolutely capable of backtracking… we generate next token probabilities, select a next token, then calculate probabilities for the token thereafter.
There is absolutely nothing stopping you from “rewinding” to an earlier token, making a different selection and replaying from that point. The basic architecture absolutely supports it.
Why then has nobody implemented it? Maybe this kind of backtracking isn’t really that useful.
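A minimal sketch of that rewind-and-resample loop, assuming a hypothetical `next_token_dist(context)` stand-in for the model's forward pass and a caller-supplied `is_dead_end` check (both invented here for illustration):

```python
import random

def next_token_dist(context):
    """Hypothetical stand-in for an LLM forward pass: maps a token
    sequence to a dict {next_token: probability}."""
    raise NotImplementedError

def generate_with_backtracking(prompt, is_dead_end, max_new=100):
    """Sample tokens one at a time; when the continuation is judged a dead
    end, rewind one token and resample a different choice at that position."""
    context = list(prompt)
    tried = []  # tried[i] = tokens already attempted at generated position i
    while len(context) - len(prompt) < max_new:
        pos = len(context) - len(prompt)
        if pos == len(tried):
            tried.append(set())
        dist = next_token_dist(context)
        options = {t: p for t, p in dist.items() if t not in tried[pos]}
        if not options or is_dead_end(context):
            if pos == 0:
                break                    # nowhere left to rewind to
            tried.pop()                  # discard bookkeeping for this position
            context.pop()                # rewind one token and try again there
            continue
        token = random.choices(list(options), weights=list(options.values()))[0]
        tried[pos].add(token)
        context.append(token)
    return context
```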
measurablefunc · 1h ago
Where is this spelled out formally and proven logically?
skissane · 46m ago
LLM backtracking is an active area of research, see e.g.
https://arxiv.org/html/2502.04404v1
https://arxiv.org/abs/2306.05426
And I was wrong that nobody has implemented it, as these papers prove people have… it is just the results haven’t been sufficiently impressive to support the transition from the research lab to industrial use - or at least, not yet
measurablefunc · 37m ago
> Empirical evaluations demonstrate that our proposal significantly enhances the reasoning capabilities of LLMs, achieving a performance gain of over 40% compared to the optimal-path supervised fine-tuning method.
bondarchuk · 3h ago
Backtracking makes sense in a search context which is basically what prolog is. Why would you expect a next-token-predictor to do backtracking and what should that even look like?
PaulHoule · 3h ago
If you want general-purpose generation then it has to be able to respect constraints (e.g. the unspoken rule that figure art of a person has 0..1 belly buttons and 0..2 legs). As it is, generative models usually get those things right but don't always, since they can stick together the tiles they use internally in some combination that makes sense locally but not globally.
General intelligence may not be SAT/SMT solving but it has to be able to do it, hence, backtracking.
Today I had another of those experiences of the weaknesses of LLM reasoning, one that happens a lot when doing LLM-assisted coding. I was trying to figure out how to rebuild some CSS after the HTML changed for accessibility purposes, and I got a good idea for how to do it from talking to the LLM, but at that point the context was poisoned, probably because there was a lot of content in the context describing what we were thinking about at different stages of a conversation that evolved considerably. It lost its ability to follow instructions: I'd tell it specifically to do this or do that and it just wouldn't do it properly. This happens a lot if a session goes on too long.
My guess is that the attention mechanism is locking on to parts of the conversation which are no longer relevant to where I think we're at. In general, the logic that considers the variation of either a practice (instances) or a theory over time is a very tricky problem, and 'backtracking' is a specific answer for maintaining your knowledge base across a search process.
XenophileJKO · 2h ago
What if you gave the model a tool to "willfully forget" a section of context. That would be easy to make. Hmm I might be onto something.
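A rough sketch of what that tool could look like, assuming an OpenAI-style list of chat messages; the name `forget_span` and the placeholder note are made up for illustration:

```python
def forget_span(messages, start, end, note=None):
    """Drop messages[start:end] from a chat history, optionally leaving a
    short placeholder so the model knows something was deliberately discarded."""
    kept = messages[:start] + messages[end:]
    if note is not None:
        kept.insert(start, {"role": "system",
                            "content": f"[earlier discussion removed: {note}]"})
    return kept

# e.g. the model calls the tool to forget an abandoned line of discussion:
history = [{"role": "user", "content": f"message {i}"} for i in range(12)]
history = forget_span(history, 4, 10, note="abandoned first CSS approach")
```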
PaulHoule · 2h ago
I guess you could have some kind of mask that would let you suppress some of the context from matching, but my guess is that kind of thing might cause problems as often as it solves them.
Back when I was thinking about commonsense reasoning with logic it was obviously a much more difficult problem to add things like "P was true before time t", "there will be some time t in the future such that P is true", "John believes Mary believes that P is true", "It is possible that P is true", "there is some person q who believes that P is true", particularly when you combine these qualifiers. For one thing you don't even have a sound and complete strategy for reasoning over first-order logic + arithmetic, but you also have a combinatorial explosion over the qualifiers.
Back in the day I thought it was important to have sound reasoning procedures, but one of the reasons none of my foundation models ever became ChatGPT was that I cared about that when I really needed to ask "does change C cause an unsound procedure to get the right answer more often?" and not care whether the reasoning procedure was sound or not.
measurablefunc · 3h ago
I don't expect a Markov chain to be capable of backtracking. That's the point I am making. Logical reasoning as it is implemented in Prolog interpreters is not something that can be done w/ LLMs regardless of the size of their weights, biases, & activation functions between the nodes in the graph.
bondarchuk · 3h ago
Imagine the context window contains A-B-C, C turns out a dead end and we want to backtrack to B and try another branch. Then the LLM could produce outputs such that the context window would become A-B-C-[backtrack-back-to-B-and-don't-do-C] which after some more tokens could become A-B-C-[backtrack-back-to-B-and-don't-do-C]-D. This would essentially be backtracking and I don't see why it would be inherently impossible for LLMs as long as the different branches fit in context.
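As a toy sketch of that loop, assuming a hypothetical `llm_step(context)` that either proposes the next step, returns "DONE", or declares a dead end (all names invented for illustration):

```python
def llm_step(context):
    """Hypothetical stand-in for the model: returns the next step, the string
    "DONE" when the goal is reached, or "DEAD END" to give up on the branch."""
    raise NotImplementedError

def backtrack_marker(bad_step):
    return f"[backtrack: return to the state before {bad_step!r} and don't try it again]"

def solve_in_context(goal, max_steps=50):
    # The whole trace, dead ends included, stays in the context window,
    # e.g. A-B-C-[backtrack ...]-D, as described above.
    context = [goal]
    for _ in range(max_steps):
        step = llm_step(context)
        if step == "DEAD END":
            context.append(backtrack_marker(context[-1]))
        elif step == "DONE":
            return context
        else:
            context.append(step)
    return None
```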
measurablefunc · 3h ago
If you think it is possible then I'd like to see an implementation of a sudoku puzzle solver as Markov chain. This is a simple enough problem that can be implemented in a few dozen lines of Prolog but I've never seen a solver implemented as a Markov chain.
bboygravity · 3h ago
The LLM can just write the Prolog and solve the sudoku that way. I don't get your point. LLMs like Grok 4 can probably one-shot this today with the current state of the art. You can likely just ask it to solve any sudoku and it will do it (by writing code in the background, running it, and returning the result). And this is still very early stage compared to what will be out a year from now.
Why does it matter how it does it or whether this is strictly LLM or LLM with tools for any practical purpose?
Ukv · 3h ago
> If you think it is possible then I'd like to see an implementation of a sudoku puzzle solver as Markov chain
Have each of the Markov chain's states be one of 10^81 possible sudoku grids (a 9x9 grid of digits 1-9 and blank), then calculate the 10^81-by-10^81 transition matrix that takes each incomplete grid to the valid complete grid containing the same numbers. If you want you could even have it fill one square at a time rather than jump right to the solution, though there's no need to.
Up to you what you do for ambiguous inputs (select one solution at random to give 1.0 probability in the transition matrix? equally weight valid solutions? have the states be sets of boards and map to set of all valid solutions?) and impossible inputs (map to itself? have the states be sets of boards and map to empty set?).
Could say that's "cheating" by pre-computing the answers and hard-coding them in a massive input-output lookup table, but to my understanding that's also the only sense in which there's equivalence between Markov chains and LLMs.
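To make that construction concrete at a size where the brute force is actually computable, here's a toy version for 4x4 "Shidoku" rather than 9x9 Sudoku. `transition_row` computes a single row of the conceptual transition matrix: an incomplete board maps uniformly onto its valid completions, and solved (or unsolvable) boards are absorbing. The example puzzle at the end is just an illustration.

```python
from itertools import product

N, B = 4, 2   # 4x4 "Shidoku" with 2x2 boxes keeps the brute force tiny

def valid(board):
    """board is a tuple of N*N ints, 0 = blank; True if no row, column or
    box repeats a filled-in value."""
    units = []
    for i in range(N):
        units.append([board[i*N + j] for j in range(N)])          # row i
        units.append([board[j*N + i] for j in range(N)])          # column i
    for br, bc in product(range(0, N, B), repeat=2):              # each box
        units.append([board[(br+r)*N + (bc+c)] for r in range(B) for c in range(B)])
    return all(len([v for v in u if v]) == len({v for v in u if v}) for u in units)

def completions(board):
    """All valid fully filled boards compatible with `board` (brute force)."""
    blanks = [i for i, v in enumerate(board) if v == 0]
    sols = []
    for fill in product(range(1, N + 1), repeat=len(blanks)):
        cand = list(board)
        for i, v in zip(blanks, fill):
            cand[i] = v
        if valid(tuple(cand)):
            sols.append(tuple(cand))
    return sols

def transition_row(board):
    """One row of the (conceptual) transition matrix: an incomplete board maps
    uniformly onto its valid completions; solved/unsolvable boards are absorbing."""
    if 0 not in board:
        return {board: 1.0}
    sols = completions(board)
    return {s: 1.0 / len(sols) for s in sols} if sols else {board: 1.0}

puzzle = (1, 2, 3, 4,
          3, 0, 1, 0,
          0, 1, 4, 3,
          0, 3, 0, 1)
print(transition_row(puzzle))   # {solved board(s): probability}
```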
measurablefunc · 2h ago
There are multiple solutions for each incomplete grid so how are you calculating the transitions for a grid w/ a non-unique solution?
Edit: I see you added questions for the ambiguities, but modulo those choices your solution will only almost work b/c it is not entirely extensionally equivalent. The transition graph and solver are almost extensionally equivalent, but whereas the Prolog solver will backtrack there is no backtracking in the Markov chain, and you have to re-run the chain multiple times to find all the solutions.
Ukv · 2h ago
> but whereas the Prolog solver will backtrack there is no backtracking in the Markov chain and you have to re-run the chain multiple times to find all the solutions
If you want it to give all possible solutions at once, you can just expand the state space to the power-set of sudoku boards, such that the input board transitions to the state representing the set of valid solved boards.
measurablefunc · 2h ago
That still won't work b/c there is no backtracking. The point is that there is no way to encode backtracking/choice points like in Prolog w/ a Markov chain. The argument you have presented is not extensionally equivalent to the Prolog solver. It is almost equivalent but it's missing choice points for starting at a valid solution & backtracking to an incomplete board to generate a new one. The typical argument for absorbing states doesn't work b/c sudoku is not a typical deterministic puzzle.
Ukv · 2h ago
> That still won't work b/c there is no backtracking.
It's essentially just a lookup table mapping from input board to the set of valid output boards - there's no real way for it not to work (obviously not practical though). If board A has valid solutions B, C, D, then the transition matrix cell mapping {A} to {B, C, D} is 1.0, and all other entries in that row are 0.0.
> The point is that there is no way to encode backtracking/choice points
You can if you want, keeping the same variables as a regular sudoku solver as part of the Markov chain's state and transitioning instruction-by-instruction, rather than mapping directly to the solution - just that there's no particular need to when you've precomputed the solution.
measurablefunc · 2h ago
My point is that your initial argument was missing several key pieces & if you specify the entire state space you will see that it's not as simple as you thought initially. I'm not saying it can't be done but that it's actually much more complicated than simply saying just take an incomplete board state s & uniform transitions between s, s' for valid solutions s' that are compatible with s. In fact, now that I spelled out the issues I still don't think this is a formal extensional equivalence. Prolog has interactive transitions between the states & it tracks choice points so compiling a sudoku solver to a Markov chain requires more than just tracking the board state in the context.
Ukv · 1h ago
> My point is that your initial argument was missing several key pieces
My initial example was a response to "If you think it is possible then I'd like to see an implementation of a sudoku puzzle solver as Markov chain", describing how a Sudoku solver could be implemented as a Markov chain. I don't think there's anything missing from it - it solves all proper Sudokus, and I only left open the choice of how to handle improper Sudokus because that was unspecified (but trivial regardless of what's wanted).
> I'm not saying it can't be done but that it's actually much more complicated
If that's the case, then I did misinterpret your comments as saying it can't be done. But I don't think it's really complicated regardless of whatever "ok but now it must encode choice points in its state" requirements are thrown at it - it's just a state-to-state transition look-up table.
> so compiling a sudoku solver to a Markov chain requires more than just tracking the board state in the context.
As noted, you can keep all the same variables as a regular Sudoku solver as part of the Markov chain's state and transition instruction-by-instruction, if that's what you want.
If you mean inputs from a user, the same is true of LLMs, which are typically run interactively. Either model the whole universe including the user as part of the state transition table (maybe impossible, depending on your beliefs about the universe), or have user interaction take the current state, modify it, and use it as the initial state for a new run of the Markov chain.
measurablefunc · 1h ago
> As noted, you can keep all the same variables as a regular Sudoku solver
What are those variables exactly?
Ukv · 19m ago
For a depth-first solution (backtracking), I'd assume mostly just the partial solutions and a few small counters/indices/masks - like for tracking the cell we're up to and which cells were prefilled. Specifics will depend on the solver, but can be made part of Markov chain's state regardless.
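For concreteness, a sketch (not any particular solver) of the kind of state I mean: an iterative depth-first solver whose entire working state is the partial board, the current cell index, and the prefilled mask, all of which could in principle be packed into a Markov chain's state and stepped transition-by-transition:

```python
def solve(board):
    """Iterative depth-first (backtracking) 9x9 Sudoku solver. Its entire state
    is the partial board, the index `i` of the cell being tried, and the
    prefilled mask. `board` is a list of 81 ints, 0 = blank."""
    prefilled = [v != 0 for v in board]

    def ok(i, v):
        r, c = divmod(i, 9)
        if any(board[r*9 + j] == v or board[j*9 + c] == v for j in range(9)):
            return False
        br, bc = 3 * (r // 3), 3 * (c // 3)
        return all(board[(br+dr)*9 + (bc+dc)] != v for dr in range(3) for dc in range(3))

    i = 0
    while 0 <= i < 81:
        if prefilled[i]:
            i += 1                      # skip given cells going forward
            continue
        v = board[i] + 1                # next candidate value for this cell
        while v <= 9 and not ok(i, v):
            v += 1
        if v <= 9:
            board[i] = v                # place it and move forward
            i += 1
        else:
            board[i] = 0                # exhausted: clear, then backtrack
            i -= 1
            while i >= 0 and prefilled[i]:
                i -= 1                  # skip back over given cells
    return board if i == 81 else None
```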
lelanthran · 2h ago
> If you think it is possible then I'd like to see an implementation of a sudoku puzzle solver as Markov chain. This is a simple enough problem that can be implemented in a few dozen lines of Prolog but I've never seen a solver implemented as a Markov chain.
I think it can be done. I started a chatbot that works like this some time back (2024) but have paused work on it since January.
In brief, you shorten the context by discarding the context that didn't work out.
sudosysgen · 3h ago
You can do that pretty trivially for any fixed size problem (as in solvable with a fixed-sized tape Turing machine), you'll just have a titanically huge state space. The claim of the LLM folks is that the models have a huge state space (they do have a titanically huge state space) and can navigate it efficiently.
Simply have a deterministic Markov chain where each state is a possible value of the tape+state of the TM and which transitions accordingly.
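A minimal sketch of that encoding, using a made-up two-rule unary-increment machine purely for illustration; the "chain" is deterministic, so each row of its transition matrix has a single 1.0 entry, computed here by `step`:

```python
# A deterministic Markov chain whose states are whole machine configurations
# (control state, head position, tape). The two-rule machine below is a toy
# unary incrementer, invented only to demonstrate the encoding.
RULES = {
    # (state, symbol) -> (new_state, written_symbol, head_move)
    ("scan", "1"): ("scan", "1", +1),
    ("scan", "_"): ("halt", "1", 0),
}

def step(config):
    """One chain transition: configuration -> the unique next configuration
    (probability 1). Halting configurations are absorbing."""
    state, head, tape = config
    if state == "halt":
        return config
    tape = list(tape)
    if head == len(tape):
        tape.append("_")                      # grow the tape on demand
    new_state, written, move = RULES[(state, tape[head])]
    tape[head] = written
    return (new_state, head + move, tuple(tape))

config = ("scan", 0, tuple("111_"))
while config[0] != "halt":
    config = step(config)
print(config)   # ('halt', 3, ('1', '1', '1', '1'))
```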
measurablefunc · 3h ago
How are you encoding the state spaces for the sudoku solver specifically?
vidarh · 2h ago
A (2,3) Turing machine can be trivially implemented with a loop around an LLM that treats the context as an IO channel, and a Prolog interpreter runs on a Turing complete computer, so per Turing equivalence you can run a Prolog interpreter on an LLM.
Of course this would be pointless, but it demonstrates that a system where an LLM provides the logic can backtrack, as there's nothing computationally special about backtracking.
That current UIs to LLMs are set up for conversation-style use that makes this harder isn't an inherent limitation of what we can do with LLMs.
measurablefunc · 2h ago
Loop around an LLM is not an LLM.
vidarh · 2h ago
Then no current systems you are using are LLMs
measurablefunc · 1h ago
Choice-free feedforward graphs are LLMs. The inputs/outputs are extensionally equivalent to the context and transition probabilities of a Markov chain. What exactly is your argument? Because what it looks like to me is that you're simply making a Turing tarpit argument, which does not address any of my points.
vidarh · 1h ago
My argument is that artificially limiting what you argue about to a subset of the systems people are actually using, and then arguing about the limitations of that subset, makes your argument irrelevant to what people are actually using.
arduanika · 4h ago
What hinting? The comment was very clear. Arbitrarily good approximation is different from symbolic understanding.
"if you can implement it in a brain"
But we didn't. You have no idea how a brain works. Neither does anyone.
mallowdram · 4h ago
We know the healthy brain is unpredictable. We suspect error minimization and prediction are not central tenets. We know the brain creates memory via differences in sharp wave ripples. That it's oscillatory. That it neither uses symbols nor represents. That words are wholly external to what we call thought.
The authors deal with molecules which are neither arbitrary nor specific. Yet tumors ARE specific, while words are wholly arbitrary. Knowing these things should offer a deep suspicion of ML/LLMs. They have so little to do with how brains work and the units brains actually use (all oscillation is specific, all stats emerge from arbitrary symbols and worse: metaphors) that mistaking LLMs for reasoning/inference is less lexemic hallucination and more eugenic.
Zigurd · 3h ago
"That words are wholly external to what we call thought." may be what we should learn, or at least hypothesize, based on what we see LLMs doing. I'm disappointed that AI isn't more of a laboratory for understanding brain architecture, and precisely what is this thing called thought.
mallowdram · 2h ago
The question is how to model the irreducible. And then to concatenate between spatiotemporal neuroscience (the oscillators) and neural syntax (what's oscillating) and add or subtract what the fields are doing to bind that to the surroundings.
quantummagic · 3h ago
What do you think about the idea that LLMs are not reasoning/inferring, but are rather an approximation of the result? Just like you yourself might have to spend some effort reasoning, on how a plant grows, in order to answer questions about that subject. When asked, you wouldn't replicate that reasoning, instead you would recall the crystallized representation of the knowledge you accumulated while previously reasoning/learning. The "thinking" in the process isn't modelled by the LLM data, but rather by the code/strategies used to iterate over this crystallized knowledge, and present it to the user.
mallowdram · 2h ago
This is the toughest part. We need some kind of analog external that concatenates. It's software, but not necessarily binary; it uses topology to express that analog. It is somehow visual, i.e. you can see it, but at the same time it can be expanded specifically into syntax, the details of which are invisible. Scale invariance is probably key.
Certhas · 4h ago
We didn't, but somebody did, so it's possible; so probabilistic dynamics in high enough dimensions can do it.
We don't understand what LLMs are doing. You can't go from understanding what a transformer is to understanding what an LLM does any more than you can go from understanding what a Neuron is to what a brain does.
jjgreen · 2h ago
You can look at it, from the inside.
awesome_dude · 4h ago
I think that the difference can be best explained thus:
I guess that you are most likely going to have cereal for breakfast tomorrow, I also guess that it's because it's your favourite.
vs
I understand that you don't like cereal for breakfast, and I understand that you only have it every day because a Dr told you that it was the only way for you to start the day in a way that aligns with your health and dietary needs.
Meaning, I can guess based on past behaviour and be right, but understanding the reasoning for those choices, that's a whole other ballgame. Further, if we do end up with an AI that actually understands, well, that would really open up creativity, and problem solving.
quantummagic · 3h ago
How are the two cases you present fundamentally different? Aren't they both the same _type_ of knowledge? Why do you attribute "true understanding" to the case of knowing what the Dr said? Why stop there? Isn't true understanding knowing why we trust what the doctor said (all those years of schooling, and a presumption of competence, etc)? And why stop there? Why do we value years of schooling? Understanding, can always be taken to a deeper level, but does that mean we didn't "truly" understand earlier? And aren't the data structures needed to encode the knowledge, exactly the same for both cases you presented?
awesome_dude · 2h ago
When you ask that question, why don't you just use a corpus of the previous answers to get some result?
Why do you need to ask me, isn't a guess based on past answers good enough?
Or, do you understand that you need to know more, you need to understand the reasoning based on what's missing from that post?
quantummagic · 1h ago
I asked that question in an attempt to not sound too argumentative. It was rhetorical. I'm asking you to consider the fact that there isn't actually any difference between the two examples you provided. They're fundamentally the same type of knowledge. They can be represented by the same data structures.
There's _always_ something missing, left unsaid in every example, it's the nature of language.
As for your example, the LLM can be trained to know the underlying reasons (doctor's recommendation, etc.). That knowledge is not fundamentally different from the knowledge that someone tends to eat cereal for breakfast. My question to you, was an attempt to highlight that the dichotomy you were drawing, in your example, doesn't actually exist.
Anon84 · 3h ago
There definitely is, but Marcus is not the only one talking about it. For example, we covered this paper in one of our internal journal clubs a few weeks ago: https://arxiv.org/abs/2410.02724
tim333 · 2h ago
Humans can do symbolic understanding that seems to rest on a rather flakey probabilistic neural network in our brains, or at least mine does. I can do maths and the like but there's quite a lot of trial and error and double checking things involved.
GPT5 said it thinks it's fixable when I asked it:
>Marcus is right that LLMs alone are not the full story of reasoning. But the evidence so far suggests the gap can be bridged—either by scaling, better architectures, or hybrid neuro-symbolic approaches.
vidarh · 2h ago
> Probabilistic generative models are fun but no amount of probabilistic sequence generation can be a substitute for logical reasoning.
Unless you either claim that humans can't do logical reasoning, or claim humans exceed the Turing computable, then given you can trivially wire an LLM into a Turing complete system, this reasoning is illogical due to Turing equivalence.
And either of those two claims lack evidence.
jules · 3h ago
What does this predict about LLMs ability to win gold at the International Mathematical Olympiad?
measurablefunc · 3h ago
Same thing it does about their ability to drive cars.
boznz · 4h ago
Logical reasoning is also based on probability weights; most of the time that probability is so close to 100% that it can be assumed to be true without consequence.
AaronAPU · 2h ago
Stunningly, though I have been saying this for 20 years I’ve never come across someone else mention it until now.
logicchains · 4h ago
LLMs are not formally equivalent to Markov chains, they're more powerful; transformers with sufficient chain of thought can solve any problem in P: https://arxiv.org/abs/2310.07923.
That article is weird. They seem obsessed with nuclear reactors.
Also, they misunderstand how floating point works.
> As one learns at high school, the continuous derivative is the limit of the discrete version as the displacement h is sent to zero. If our computers could afford infinite precision, this statement would be equally good in practice as it is in continuum mathematics. But no computer can afford infinite precision, in fact, the standard double-precision IEEE representation of floating numbers offers an accuracy around the 16th digit, meaning that numbers below 10^-16 are basically treated as pure noise. This means that upon sending the displacement h below machine precision, the discrete derivatives start to diverge from the continuum value as roundoff errors then dominate the discretization errors.
Yes, differentiating data has a noise problem. This is where gradient followers sometimes get stuck.
A low pass filter can help by smoothing the data so the derivatives are less noisy. But is that relevant to LLMs? A big insight in machine learning optimization was that, in a high dimensional space, there's usually some dimension with a significant signal, which gets you out of local minima. Most machine learning is in high dimensional spaces but with low resolution data points.
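For what it's worth, the roundoff effect the quoted passage describes is easy to reproduce: a forward difference of sin at x = 1 tracks cos(1) until h approaches machine precision, then falls apart.

```python
import math

# Forward-difference estimate of d/dx sin(x) at x = 1 for shrinking h.
# Discretization error falls with h until roundoff (~1e-16 / h) takes over,
# at which point the estimate diverges from cos(1).
x, exact = 1.0, math.cos(1.0)
for h in [10.0 ** -k for k in range(1, 17, 3)]:
    approx = (math.sin(x + h) - math.sin(x)) / h
    print(f"h = {h:.0e}   |error| = {abs(approx - exact):.2e}")
```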
hatmanstack · 2h ago
Have no empirical feedback, but subjectively it reads as though the authors are trying to prove their own intelligence through convolution and confusion. Pure AI slop IMHO.
klawed · 3h ago
> avoidance, which we also discuss in this paper, necessitates putting a much higher premium on insight and understanding of the structural characteristics of the problems being investigated.
I wonder if the authors are aware of The Bitter Lesson
Scene_Cast2 · 7h ago
The paper is hard to read. There is no concrete worked-through example, the prose is over the top, and the equations don't really help. I can't make head or tail of this paper.
lumost · 7h ago
This appears to be a position paper written by authors outside of their core field. The presentation of "the wall" is only through analogy to derivatives on the discrete values computers operate on.
jibal · 5h ago
If you look at their other papers, you will see that this is very much within their core field.
lumost · 4h ago
Their other papers are on simulation and applied chemistry. Where does their expertise in Machine Learning, or Large Language Models derive from?
While it's not a requirement to have published in a field before publishing in it, having a coauthor from the target field, or a peer-reviewed venue in that field as an entry point, certainly raises credibility.
From my limited claim to expertise in either Machine Learning or Large Language Models, the paper does not appear to demonstrate what it claims. The authors' language addresses the field of Machine Learning and LLM development as you would a young student - which does not help make their point.
JohnKemeny · 4h ago
He's a chemist. Lots of chemists and physicists like to talk about computation without having any background in it.
I'm not saying anything about the content, merely making a remark.
11101010001100 · 3m ago
Succi is no slouch; hardcore multiscale physics guy, among other things.
chermi · 3h ago
You're really not saying anything? Just a random remark with no bearing?
Seth Lloyd, Wolpert, Landauer, Bennett, Fredkin, Feynman, Sejnowski, Hopfield, Zecchina, Parisi, Mézard, and Zdeborová, Crutchfield, Preskill, Deutsch, Manin, Szilard, MacKay....
I wish someone told them to shut up about computing. And I wouldn't dare claim von Neumann as merely a physicist, but that's where he was coming from. Oh and as much as I dislike him, Wolfram.
joe_the_user · 6h ago
The paper seems to involve a series of analogies and equations. However, I think that if the equations are accepted, the "wall" is actually derived.
The authors are computer scientists and people who work with large-scale dynamical systems. They aren't people who've actually produced an industry-scale LLM. However, I have to note that despite lots of practical progress in deep learning/transformers/etc. systems, all the theory involved is just analogies and equations of a similar sort. It's all alchemy, and the people really good at producing these models seem to be using a bunch of effective rules of thumb rather than any full or established models (despite books claiming to offer a mathematical foundation for the enterprise, etc.).
Which is to say, "outside of core competence" doesn't mean as much as it would for medicine or something.
ACCount37 · 5h ago
No, that's all the more reason to distrust major, unverified claims made by someone "outside of core competence".
Applied demon summoning is ruled by empiricism and experimentation. The best summoners in the field are the ones who have a lot of practical experience and a sharp, honed intuition for the bizarre dynamics of the summoning process. And even those very summoners, specialists worth their weight in gold, are slaves to the experiment! Their novel ideas and methods and refinements still fail more often than they succeed!
One of the first lessons you have to learn in the field is that of humility. That your "novel ideas" and "brilliant insights" are neither novel nor brilliant - and the only path to success lies through things small and testable, most of which do not survive the test.
With that, can you trust the demon summoning knowledge of someone who has never drawn a summoning diagram?
jibal · 5h ago
Somehow the game of telephone took us from "outside of their core field" (which wasn't true) to "outside of core competence" (which is grossly untrue).
> One of the first lessons you have to learn in the field is that of humility.
I suggest then that you make your statements less confidently.
The freshly-summoned Gaap-5 was rumored to be the most accursed spirit ever witnessed by mankind, but so far it seems not dramatically more evil than previous demons, despite having been fed vastly more human souls.
lazide · 3h ago
Perhaps we’re reaching peak demon?
CuriouslyC · 3h ago
This article is accurate. That's why I'm investigating a Bayesian symbolic Lisp reasoner. It's incapable of hallucinating, it provides auditable traces which are actual programs, and it kicks the crap out of LLMs at stuff like ARC-AGI, symbolic reasoning, logic programs, game playing, etc. I'm working on a paper where I show that the same model can break 80 on ARC-AGI, run the house by counting cards at blackjack, and solve complex mathematical word problems.
leptons · 2h ago
LLMs are also incapable of "hallucinating", so maybe that isn't the buzzword you should be using.
18cmdick · 5h ago
Grifters in shambles.
dcre · 4h ago
Always fun to see a theoretical argument that something clearly already happening is impossible.
ahartmetz · 4h ago
So where are the recent improvements in LLMs proportional to the billions invested?
dcre · 4h ago
Value for the money is not at issue in the paper!
ahartmetz · 4h ago
I believe it is. They are saying that LLMs don't improve all that much from giving them more resources - and computing power (and input corpus size) is pretty proportional to money.
42lux · 3h ago
It's not about value, it's about the stagnation while throwing compute at the problem.
dcre · 3h ago
Exactly.
crowbahr · 4h ago
Really? It sure seems like we're at the top of the S curve with LLMs. Wiring them up to talk to themselves as "reasoning" isn't scaling the core models, which have only made incremental gains for all the billions invested.
There's plenty more room to grow with agents and tooling, but the core models are only slightly bumping YoY rather than the rocketship changes of 2022/23.
EMM_386 · 2h ago
> the core models are only slightly bumping YoY rather than the rocketship changes of 2022/23
From Anthropic's press release yesterday after raising another $13 billion:
"Anthropic has seen rapid growth since the launch of Claude in March 2023. At the beginning of 2025, less than two years after launch, Anthropic’s run-rate revenue had grown to approximately $1 billion. By August 2025, just eight months later, our run-rate revenue reached over $5 billion—making Anthropic one of the fastest-growing technology companies in history."
$4 billion increase in 8 months. $1 billion every two months.
dcre · 1h ago
They’re talking about model quality. I still think they’re wrong, but the revenue is only indirectly relevant.