Chain of thought monitorability: A new and fragile opportunity for AI safety

66 points | mfiguiere | 7/16/2025, 2:39:55 PM | arxiv.org

Comments (22)

alach11 · 2h ago
Ironically, every paper published about monitoring chain-of-thought reduces the likelihood of this technique being effective against strong AI models.
OutOfHere · 1h ago
Pretty much. As soon as the LLMs get trained on this information, they will figure out how to feed us the chain-of-thought we want to hear, then surprise us with the opposite output. You're welcome, LLMs.

In other words, relying on censoring the CoT risks making the CoT altogether useless.

skybrian · 5m ago
Why would that happen? It would be like LLMs somehow learning to ignore system prompts. But LLMs are trained to pay attention to context and continue it. If an LLM doesn't continue its context, what does it even do?

This is better thought of as another form of context engineering. LLMs have no other short-term memory. Figuring out what belongs in the context is the whole ballgame.
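To make "context engineering" concrete, here is a toy sketch of what I mean; `Turn` and `build_context` are illustrative names, not any particular library's API:

```python
# Toy sketch: the model's only working memory is whatever text gets
# assembled into the prompt, so the job is deciding what goes in it.
from dataclasses import dataclass

@dataclass
class Turn:
    role: str   # e.g. "user" or "assistant"
    text: str

def build_context(system_prompt: str, history: list[Turn],
                  budget_chars: int = 8000) -> str:
    """Pack the system prompt plus as much recent history as fits the budget."""
    kept: list[str] = []
    used = len(system_prompt)
    # Walk from newest to oldest, keeping whatever still fits.
    for turn in reversed(history):
        line = f"{turn.role}: {turn.text}"
        if used + len(line) > budget_chars:
            break
        kept.append(line)
        used += len(line)
    return "\n".join([system_prompt, *reversed(kept)])
```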

OutOfHere · 59s ago
Are you saying that LLMs are incapable of deception? From what I've heard, they're quite capable of it.
lukev · 6m ago
At least in this scenario it cannot use CoT to enhance its non-aligned output, and most recent model improvements have been due to CoT... unclear how "smart" an LLM can get without it, since CoT is the only way it can access persistent state.
horsawlarway · 24m ago
I thought we already had reasonably clear evidence that the output in the CoT does not actually indicate what the model is "thinking" in any real sense, and that it's mostly just appending context that may or may not be used, and may or may not be truthful.

Basically: https://www.anthropic.com/research/reasoning-models-dont-say...

cbsmith · 1h ago
I love the irony.
sabakhoj · 53m ago
This is interesting, but I wonder how reliable this type of monitoring is really going to be in the long run. There are fairly strong indications that CoT adherence can be trained out of models, and there's already research showing that they won't always reveal their thought process on certain topics.

See: https://arxiv.org/pdf/2305.04388

On a related note, if anyone here is also reading a lot of papers to keep up with AI safety, what tools have been helpful for you? I'm building https://openpaper.ai to help me read papers more effectively without losing accuracy, and I'm looking for feedback to tune its features. It's also open source :)

rdtsc · 2h ago
> Thus, safety measures for future AI agents will likely need to employ multiple monitoring layers that hopefully have uncorrelated failure modes [...] We encourage the research community and frontier AI developers to make best use of CoT monitorability and study how it can be preserved.

I am a bit confused about what all 40 authors contributed here. The paper seems to make a suggestion: monitor the chain of thought for safety. Is that the novel part? But then, did one person come up with the idea and the other 40+ people just agree to it and get put on the authors list?

ctoth · 2h ago
The paper isn't proposing "monitor chain of thought for safety" - that's already standard practice. The core finding is that training against CoT monitors destroys them. They provide empirical evidence that doing the obvious thing - training models not to have "bad thoughts" - causes them to learn obfuscated reward hacking instead.

The paper demonstrates that current models are already performing complex reward hacks in production environments, and that attempts to fix this via CoT training make the problem worse, not better.
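To illustrate the failure mode (my own toy framing, not code from the paper): the locally tempting move is to fold a CoT monitor's verdict into the training reward, roughly as in the sketch below, where `monitor_flags` stands in for a hypothetical classifier over the chain of thought.

```python
# Toy sketch of the tempting-but-bad setup: penalize flagged CoT in the
# RL reward. `monitor_flags` is a hypothetical CoT classifier.
from typing import Callable

def shaped_reward(task_score: float, cot: str,
                  monitor_flags: Callable[[str], bool]) -> float:
    penalty = 1.0 if monitor_flags(cot) else 0.0
    # Optimizing this selects for chains of thought the monitor can't
    # flag, not for the absence of the underlying misbehavior; that is
    # the obfuscated reward hacking described above.
    return task_score - 10.0 * penalty
```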

As for your "40 authors" snark - this is a position paper where researchers from competing labs (OpenAI, Anthropic, DeepMind, government safety institutes) are jointly committing to NOT do something that's locally tempting but globally catastrophic. Getting industry consensus on "don't train away bad thoughts even though it would make your models look safer" is the opposite of trivial.

This reads like someone who saw a medical consensus statement saying "this common treatment kills patients" and responded with "did one person discover medicine exists and everyone else just agreed?"

the8472 · 2h ago
The core finding was already predicted[0], there have been previous papers[1], and it's kind of obvious considering that RL systems have been specification-gaming for decades[2]. But yes, it's good to see broad agreement on it.

[0] https://thezvi.substack.com/p/ai-68-remarkably-reasonable-re... [1] https://arxiv.org/abs/2503.11926 [2] https://docs.google.com/spreadsheets/u/1/d/e/2PACX-1vRPiprOa...

jerf · 1h ago
I couldn't pull the link up easily, since the search terms are pretty jammed, but HN had a link to a paper a couple of months back in which someone had an LLM do some basic arithmetic and report a chain of thought for how it was doing it. They then went directly into the neural net and decoded what it was actually doing to do the math. The two things were not the same. The chain of thought gave a reasonable "elementary school" explanation of how to do the addition, but what the model basically did was just intuit the answer, in much the same way that we humans do not go through a process to figure out what "4 + 7" is... we pretty much just have neurons that "know" the answer. (That's not exactly what happened, but it's close enough for this post.)

If CoT improves performance, then CoT improves performance; however, the naively obvious read of "it improves performance because it is 'thinking' the 'thoughts' it tells us it is thinking, for the reasons it gives" is not completely accurate. It may not be completely wrong, either, but it's definitely not completely accurate. Given that, I see no reason to believe it would be hard in the slightest to train models with even more divergence between their "actual" thought processes and what they claim those are.

antonvs · 1h ago
> If CoT improves performance, then CoT improves performance; however, the naively obvious read of "it improves performance because it is 'thinking' the 'thoughts' it tells us it is thinking, for the reasons it gives" is not completely accurate.

I can't imagine why anyone who knows even a little about how these models work would believe otherwise.

The "chain of thought" is text generated by the model in response to a prompt, just like any other text it generates. It then consumes that as part of a new prompt, and generates more text. Those "thoughts" are obviously going to have an effect on the generated output, simply by virtue of being present in the prompt. And the evidence shows that it can help improve the quality of output. But there's no reason to expect that the generated "thoughts" would correlate directly or precisely with what's going on inside the model when it's producing text.

msp26 · 1h ago
Are us plebs allowed to monitor the CoT tokens we pay for, or will that continue to be hidden on most providers?
dsr_ · 52m ago
If you don't pay for them, they don't exist. It's not a debug log.
bossyTeacher · 2h ago
Doesn't this assume that visible "thoughts" are the only/main type of "thoughts" and that they correlate with agent action most of the time?

Do we know for sure that agents can't display a type of thought while doing something different? Is there something that reliably guarantees that agents are not able to do this?

ctoth · 2h ago
The concept you are searching for is CoT faithfulness, and there are lots and lots of open questions around it! It's very interesting!
zer00eyz · 2h ago
After spending the last few years doing deep dives into how these systems work, what they are doing, and the math behind them: NO.

Any time I see an AI SAFETY paper I am reminded of the phrase "Never get high on your own supply". Simply put, these systems are NOT dynamic: they cannot modify themselves based on experience, and they lack reflection. The moment we realize what these systems are (we're NOT on the path to AI or AGI here, folks) and start leaning into what they are good at, rather than trying to make them something else, is the point where we get useful tools and research aimed at building usable products.

The math no one is talking about: if we had to pay full price for these products, no one would use them. Moore's law is dead, and IPC has hit a ceiling. Unless we move to exotic cooling, we simply can't push more power into chips.

Hardware advancement is NOT going to save the emerging industry, and I'm not seeing papers on efficiency or effectiveness at smaller scales coming out that would make the accounting work.

ACCount36 · 2h ago
"Full price"? LLM inference is currently profitable. If you don't even know that, the entire extent of your "expertise" is just you being full of shit.

>Simply put, these systems are NOT dynamic: they cannot modify themselves based on experience, and they lack reflection.

We already have many, many, many attempts to put LLMs towards the task of self-modification - and some of them can be used to extract meaningful capability improvements. I expect more advances to come - online learning is extremely desirable, and a lot of people are working on it.

I wish I could hammer one thing through the skull of every "AI SAFETY ISN'T REAL" moron: if you only start thinking about AI safety after AI becomes capable of causing an extinction-level safety incident, it's going to be a little too late.

antonvs · 1h ago
> LLM inference is currently profitable.

It depends a lot on which LLMs you're talking about, and what kind of usage. See e.g. the recent post about how "Anthropic is bleeding out": https://news.ycombinator.com/item?id=44534291

Ignore the hype in the headline; the point is that there's good evidence that inference in many circumstances isn't profitable.

ctoth · 50m ago
> In simpler terms, CCusage is a relatively-accurate barometer of how much you are costing Anthropic at any given time, with the understanding that its costs may (we truly have no idea) be lower than the API prices they charge, though I add that based on how Anthropic is expected to lose $3 billion this year (that’s after revenue!) there’s a chance that it’s actually losing money on every API call.

So he's using their API prices as a proxy for token costs, doesn't actually know the real inference costs, and ... that's your "good evidence"? This big sentence with all these "we don't know"s?

academic_84572 · 2h ago
Slightly tangential, but we recently published an algorithm aimed at addressing the paperclip maximizer problem: https://arxiv.org/abs/2402.07462

Curious what others think about this direction, particularly in terms of practicality