Sapient's paper on the concept of the Hierarchical Reasoning Model

64 points by hansmayer | 7/27/2025, 7:15:57 AM | arxiv.org

Comments (17)

topspin · 35m ago
> "After completing the T steps, the H-module incorporates the sub-computation’s outcome (the final state L) and performs its own update. This H update establishes a fresh context for the L-module, essentially “restarting” its computational path and initiating a new convergence phase toward a different local equilibrium."

So they let the low-level RNN bottom out, evaluate the output in the high-level module, and generate a new context for the low-level RNN. Rinse, repeat. The low-level RNN iterates toward a local equilibrium while the high-level module periodically kicks it with a fresh context to get better outputs. Loops within loops.
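For concreteness, here is a minimal sketch of that loop structure (the GRU cells, the module names, and the fixed step counts T and N are my illustrative assumptions, not the paper's code):

```python
# Toy sketch of the HRM nested loop described above; not the paper's code.
import torch
import torch.nn as nn

class HRMSketch(nn.Module):
    def __init__(self, dim: int, T: int = 8, N: int = 4):
        super().__init__()
        self.f_L = nn.GRUCell(dim, dim)  # low-level module: fast, detailed updates
        self.f_H = nn.GRUCell(dim, dim)  # high-level module: slow, abstract updates
        self.T, self.N = T, N

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z_L = torch.zeros_like(x)
        z_H = torch.zeros_like(x)
        for _ in range(self.N):               # high-level "kicks"
            for _ in range(self.T):           # let the low-level RNN bottom out
                z_L = self.f_L(x + z_H, z_L)  # L-module runs under the current H context
            z_H = self.f_H(z_L, z_H)          # H-module absorbs the final L state, resets context
        return z_H
```

If I read the paper right, gradients are only taken through the final step of each segment (a one-step approximation), which is how they avoid full backpropagation through time.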

Another interesting part:

> "Neuroscientific evidence shows that these cognitive modes share overlapping neural circuits, particularly within regions such as the prefrontal cortex and the default mode network. This indicates that the brain dynamically modulates the “runtime” of these circuits according to task complexity and potential rewards.

> Inspired by the above mechanism, we incorporate an adaptive halting strategy into HRM that enables `thinking, fast and slow'"

A scheduler that dynamically balances resources based on the necessary depth of reasoning and the available data.
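As a toy illustration of such a scheduler (the paper itself trains a Q-learning halting head; this sketch instead uses the simpler accumulate-a-halt-probability scheme from Graves' Adaptive Computation Time, so take it as the shape of the idea rather than the paper's method):

```python
# ACT-style halting sketch: spend more loop iterations on harder inputs.
import torch
import torch.nn as nn

class HaltingWrapper(nn.Module):
    def __init__(self, dim: int, threshold: float = 0.99, max_segments: int = 16):
        super().__init__()
        self.halt_head = nn.Linear(dim, 1)  # predicts "am I done?" from the state
        self.threshold = threshold
        self.max_segments = max_segments

    def forward(self, step_fn, z: torch.Tensor) -> torch.Tensor:
        halt = z.new_zeros(z.shape[0], 1)
        for _ in range(self.max_segments):
            z = step_fn(z)                           # one full H/L reasoning segment
            halt = halt + torch.sigmoid(self.halt_head(z))
            if bool((halt > self.threshold).all()):  # easy inputs stop early,
                break                                # hard ones keep "thinking slow"
        return z
```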

I love how this paper cites parallels with real brains throughout. I believe AGI will be solved as the primitives we're developing are composed to extreme complexity, utilizing many cooperating, competing, communicating, concurrent, specialized "modules." It is apparent to me that the human brain must have this complexity, because it was the only feasible way for evolution to achieve cognition using slow, low-power tissue.

JonathanRaines · 16m ago
I advise scepticism.

This work does have some very interesting ideas, specifically avoiding the costs of backpropagation through time.

However, it does not appear to have been peer reviewed.

The results section is odd. It does not include details of how they performed the assessments, and the only numerical values are in the figure on the front page. The results for ARC2 are (contrary to that figure) not top of the leaderboard (currently 19%, compared to HRM's 5%: https://www.kaggle.com/competitions/arc-prize-2025/leaderboa...)

cs702 · 2h ago
Based on a quick first skim of the abstract and the introduction, the results from the hierarchical reasoning model (HRM) look incredible:

> Using only 1,000 input-output examples, without pre-training or CoT supervision, HRM learns to solve problems that are intractable for even the most advanced LLMs. For example, it achieves near-perfect accuracy in complex Sudoku puzzles (Sudoku-Extreme Full) and optimal pathfinding in 30x30 mazes, where state-of-the-art CoT methods completely fail (0% accuracy). In the Abstraction and Reasoning Corpus (ARC) AGI Challenge 27,28,29 - a benchmark of inductive reasoning - HRM, trained from scratch with only the official dataset (~1000 examples), with only 27M parameters and a 30x30 grid context (900 tokens), achieves a performance of 40.3%, which substantially surpasses leading CoT-based models like o3-mini-high (34.5%) and Claude 3.7 8K context (21.2%), despite their considerably larger parameter sizes and context lengths, as shown in Figure 1.

I'm going to read this carefully, in its entirety.

Thank you for sharing it on HN!

diwank · 2h ago
Exactly!

> It uses two interdependent recurrent modules: a *high-level module* for abstract, slow planning and a *low-level module* for rapid, detailed computations. This structure enables HRM to achieve significant computational depth while maintaining training stability and efficiency, even with minimal parameters (27 million) and small datasets (~1,000 examples).

> HRM outperforms state-of-the-art CoT models on challenging benchmarks like Sudoku-Extreme, Maze-Hard, and the Abstraction and Reasoning Corpus (ARC-AGI), where CoT methods fail entirely. For instance, it solves 96% of Sudoku puzzles and achieves 40.3% accuracy on ARC-AGI-2, surpassing larger models like Claude 3.7 and DeepSeek R1.

Erm, what? How? This needs a computer and a proper sit-down.

mkagenius · 43m ago
Is it talking about fine-tuning existing models with 1,000 examples to beat them on those tasks?

OgsyedIE · 54m ago
Skimming this, there is no reason why a MoE LLM system (whether autoregressive, diffusion, energy-based, or mixed) couldn't be given a nested architecture that duplicates the layout of an HRM. Combining these in different ways should allow for some novel benchmarks around efficiency and quality, which will be interesting.
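One minimal way to picture that nesting, purely as a sketch: swap the feed-forward block inside an HRM module for a top-k gated mixture-of-experts layer. Everything below (names, sizes, the dense routing loop) is my assumption, not anything from the paper or an existing MoE codebase.

```python
# Sketch: a top-k gated MoE layer that could sit inside an HRM module.
import torch
import torch.nn as nn

class MoEBlock(nn.Module):
    def __init__(self, dim: int, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.gate = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(n_experts)
        )
        self.k = k

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        scores = self.gate(z)                       # (batch, n_experts)
        weights, idx = scores.topk(self.k, dim=-1)  # each item picks its top-k experts
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(z)
        for j in range(self.k):                     # dense loop for clarity; real MoEs scatter
            for e, expert in enumerate(self.experts):
                mask = idx[:, j] == e
                if mask.any():
                    out[mask] += weights[mask, j:j + 1] * expert(z[mask])
        return out
```

An HRM update would then call this block in place of a dense MLP; whether the two recurrent modules still converge nicely with sparse experts inside is exactly the kind of efficiency/quality benchmark question raised above.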

lispitillo · 2h ago
I hope/fear this HRM model is going to be merged with MoE very soon. Given the huge economic pressure to develop powerful LLMs I think this can be done in just a month.

The paper seems to study only problems like Sudoku solving, not question answering or other applications of LLMs. Furthermore, they omit any section on future applications or on fusion with current LLMs.

I think anyone working in this field can envision the applications, but the details of combining MoE with an HRM could be their next paper.

I only skimmed the paper and I am not an expert; surely others will/can explain why they don't discuss such a new structure. Anyway, my post is just blissful ignorance of the complexity involved and of the impossibility of predicting change.

Edit: A more general idea is that Mixture of Experts is related to clusters of concepts, and now we would have to consider clusters of concepts related by the time they take to be grasped. In a sense, the model would hold in latent space an estimate of the depth, number of layers, and time required for each concept, just as we adapt our reading style for a dense math book versus a short newspaper story.

yorwba · 1h ago
This HRM is essentially purpose-designed for solving puzzles with a small number of rules interacting in complex ways. Because the number of rules is small, a small model can learn them. Because the model is small, it can be run many times in a loop to resolve all interactions.

In contrast, language modeling requires storing a large number of arbitrary phrases and their relation to each other, so I don't think you could ever get away with a similarly small model. Fortunately, a comparatively small number of steps typically seems to be enough to get decent results.

But if you tried to use an LLM-sized model in an HRM-style loop, it would be dog slow, so I don't expect anyone to try it anytime soon. Certainly not within a month.

Maybe you could have a hybrid where an LLM has a smaller HRM bolted on to solve the occasional constraint-satisfaction task.

energy123 · 16m ago
What about many small HRM models that solve conceptually distinct subtasks, as determined and routed by a master model that then analyzes and aggregates the outputs, with all of that learned during training?
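A sketch of that routing idea, with everything here hypothetical: a master gate scores a bank of small solvers (e.g. instances of the HRMSketch toy above) and aggregates their outputs; because the softmax gate is differentiable, the whole thing can in principle be learned end to end.

```python
# Hypothetical master-model router over small task-specific HRM solvers.
import torch
import torch.nn as nn

class HRMRouter(nn.Module):
    def __init__(self, dim: int, solvers: list[nn.Module]):
        super().__init__()
        self.solvers = nn.ModuleList(solvers)       # small per-subtask HRMs
        self.master = nn.Linear(dim, len(solvers))  # "which solver fits this input?"

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate = self.master(x).softmax(dim=-1)                     # (batch, n_solvers)
        outs = torch.stack([s(x) for s in self.solvers], dim=1)   # run every solver (dense)
        return (gate.unsqueeze(-1) * outs).sum(dim=1)             # soft aggregation
```

A real system would route sparsely instead of running every solver, but that is an engineering detail on top of the same structure.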

buster · 1h ago
Must say I am suspicious in this regard, as they don't show applications other than a Sudoku solver and don't discuss downsides.

Oras · 1h ago
And the training was only on Sudoku, which means they would need to train a small model for every problem that currently exists.

Back to ML models?

lispitillo · 1m ago
Not only on Sudoku; there is also maze solving and ARC-AGI.

0x000xca0xfe · 38m ago
Goodbye CAPTCHAs, I guess? Somehow they are still around.

electroglyph · 2h ago
but does it scale?

torginus · 3h ago
Is it just me, or is symbolic (or, as I like to call it, 'video game') AI seeping back into AI?

bobosha · 53m ago
But symbolic != hierarchical

taylorius · 2h ago
Perhaps so - but represented in a trainable, neural form. Very exciting!