
2025-12-22
Tiny Reasoning Model
The faster twin of an HRM
Tags: AI, Machine Learning, Reasoning Models
Large Language Models can be:
- wide
- deep
- massive
But at inference time, they still execute a mostly fixed computation graph once per token.
That single fact is fatal for tasks that require:
- multi-step search
- hypothesis revision
- backtracking
- long-horizon planning
These are algorithmic problems, not perception problems.
No amount of parameter scaling fixes this.
The Core Failure Mode of LLMs
Across domains like Sudoku, mazes, and ARC, the failure pattern is consistent:
- one wrong step poisons the entire output
- chain-of-thought helps only superficially
- increasing sampling or token budgets explodes cost without improving reliability
LLMs do not reason.
They commit.
Once a bad assumption is made, there is no internal mechanism to revise it.
HRM Was Directionally Right, Structurally Overcomplicated
Hierarchical Reasoning Models showed something important:
Iterative latent computation matters more than scale.
But HRM framed the solution incorrectly.
It leaned on:
- hierarchy
- biological metaphors
- fixed-point convergence
- multiple interacting modules
TRM asks a simpler, sharper question:
What is the “high-level” state actually doing?
Answer:
It is just the current proposed solution.
And the “low-level” state?
Just latent reasoning state.
Once you accept this, the hierarchy collapses.
The Key Insight
Reasoning Is Iteration, Not Hierarchy
You do not need:
- planners vs workers
- fast vs slow modules
- nested abstractions
You need only two things:
- y: the current candidate solution
- z: the internal reasoning state
And a way to repeatedly update both.
That is recursion.
What Is a Tiny Recursive Model (TRM)?
A TRM consists of:
- a single tiny neural network (2 layers)
- reused at every step
- no role separation
- no special modules
The behavior emerges from how the network is called, not from architectural complexity.
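As a sketch of how small that is, assuming a 2-layer Transformer encoder as the shared network (the layer type, names, and sizes here are illustrative, not the paper's exact stack):

```python
import torch.nn as nn

class TinyNet(nn.Module):
    """The single small network reused at every reasoning step.
    Hypothetical sketch: a 2-layer Transformer encoder over
    concatenated input tokens. All sizes are illustrative."""
    def __init__(self, dim=128, heads=4, layers=2):
        super().__init__()
        block = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim,
            batch_first=True,
        )
        self.encoder = nn.TransformerEncoder(block, num_layers=layers)

    def forward(self, tokens):
        # Same weights on every call: the behavior comes from how
        # often, and with what inputs, this network is invoked.
        return self.encoder(tokens)
```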
How Computation Actually Happens
Reasoning proceeds through explicit recursion.
Each step performs two updates:
1. Update reasoning state
z := net(x, y, z)
2. Refine the solution
y := net(y, z)
This loop is repeated many times.
Each iteration:
- inspects the current solution
- identifies errors implicitly
- proposes a refinement
This is learned self-correction, end to end.
No autoregressive decoding. No token sampling. No narration.
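A minimal sketch of that loop, reusing the TinyNet above. The concatenate-and-slice scheme and the step count are assumptions for illustration, not the paper's exact wiring:

```python
import torch

def reasoning_segment(net, x, y, z, n_steps=6):
    """One reasoning segment, sketched. x, y, z are (batch, seq, dim)
    embeddings of the question, the current candidate solution, and
    the latent reasoning state."""
    for _ in range(n_steps):
        # 1. Update reasoning state: z := net(x, y, z)
        out = net(torch.cat([x, y, z], dim=1))
        z = out[:, -z.shape[1]:, :]
        # 2. Refine the solution: y := net(y, z)
        out = net(torch.cat([y, z], dim=1))
        y = out[:, :y.shape[1], :]
    return y, z
```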
Depth Comes From Time, Not Layers
TRMs are deliberately small:
- 2 layers
- ~5 to 7M parameters
Why?
Because depth is created by recursion, not architecture.
A TRM with:
- 2 layers
- 6 inner recursions
- 3 outer cycles
achieves an effective depth of over 40 sequential transformations.
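One natural way to make that count concrete (an assumption about the bookkeeping: each of the T outer cycles does n latent updates plus one solution refinement, each through L layers):

$$
\text{effective depth} = T\,(n + 1)\,L = 3 \times (6 + 1) \times 2 = 42
$$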
Unlike Transformers:
- earlier assumptions can be revised
- computation does not collapse into one pass
- errors are not permanent
This is the difference between:
- pattern recognition
- algorithm execution
Deep Supervision
Learning to Improve, Not Just Predict
TRM does not wait until the end to apply a loss.
After each reasoning segment:
- an answer is produced
- loss is computed
- state is detached
- reasoning continues
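A hedged sketch of those four steps, reusing reasoning_segment from above. The decoder head, the loss, and the segment count are illustrative stand-ins, not the paper's exact recipe:

```python
import torch
import torch.nn.functional as F

def train_step(net, head, x, target, optimizer, n_segments=4):
    """Deep supervision, sketched: produce an answer after every
    reasoning segment, take a loss, detach the carried state, and
    keep reasoning. `head` is a token-wise linear decoder."""
    y = torch.zeros_like(x)   # initial candidate solution
    z = torch.zeros_like(x)   # initial latent reasoning state
    for _ in range(n_segments):
        y, z = reasoning_segment(net, x, y, z)
        logits = head(y)                   # an answer is produced
        loss = F.cross_entropy(            # loss is computed
            logits.flatten(0, 1), target.flatten()
        )
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        y, z = y.detach(), z.detach()      # state is detached,
                                           # reasoning continues
```

Detaching between segments keeps memory bounded per segment while the state itself persists into the next one.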
The model is trained to answer a different question:
“Given my current partial solution, how do I make it better?”
This is fundamentally different from:
“Given the input, predict the output.”
That difference is why TRMs generalize in tiny-data regimes where large models overfit.
No Fixed Points, No Gradient Tricks
HRM depends on:
- fixed-point assumptions
- implicit gradients
- one-step approximations
TRM removes all of it.
There is:
- no convergence assumption
- no equilibrium requirement
- no gradient approximation
TRM backpropagates through the full recursion.
Yes, it costs more memory. Yes, it works better.
Ablations confirm this. One-step gradients collapse performance.
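Where the gradient is cut is the whole difference. A contrast sketch, with both functions as simplified stand-ins (the HRM-style variant is shown only to make the cut visible):

```python
import torch

def recurse_full(net, x, y, z, n_inner=6):
    """TRM: every z-update stays on the autograd tape, so credit
    assignment reaches the earliest reasoning steps. Costs memory."""
    for _ in range(n_inner):
        z = net(torch.cat([x, y, z], dim=1))[:, -z.shape[1]:, :]
    return z

def recurse_one_step(net, x, y, z, n_inner=6):
    """HRM-style one-step approximation, simplified: recurse without
    gradients, then backprop through only the final update."""
    with torch.no_grad():
        for _ in range(n_inner - 1):
            z = net(torch.cat([x, y, z], dim=1))[:, -z.shape[1]:, :]
    return net(torch.cat([x, y, z], dim=1))[:, -z.shape[1]:, :]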
Adaptive Computation Without Complexity
TRM keeps adaptive computation time, but simplifies it:
- a single halting head
- binary supervision: “is the answer correct?”
- no reinforcement learning
- no second forward pass
Easy problems halt early. Hard problems get more compute.
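A minimal sketch of such a head, assuming the halting decision is read off a pooled solution state; every name here is illustrative:

```python
import torch.nn as nn
import torch.nn.functional as F

class HaltingHead(nn.Module):
    """A single halting head, sketched: predicts 'is the current
    answer correct?' from the candidate solution y."""
    def __init__(self, dim=128):
        super().__init__()
        self.proj = nn.Linear(dim, 1)

    def forward(self, y):
        # Mean-pool the solution tokens, emit one halting logit.
        return self.proj(y.mean(dim=1)).squeeze(-1)

def halting_loss(halt_logit, is_correct):
    # Binary supervision: label 1 if the decoded answer matches the
    # target, else 0. Plain BCE: no RL, no second forward pass.
    return F.binary_cross_entropy_with_logits(halt_logit, is_correct.float())
```

At inference, recursion stops once the predicted probability of correctness crosses a threshold; otherwise another segment runs.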
This gives:
- inference-time scaling
- training efficiency
- architectural simplicity
What the Model Actually Learns
Visualizations show a critical distinction:
- y always decodes to a valid candidate solution
- z never does
z is not symbolic. It is not interpretable.
It is pure reasoning state.
Across tasks, the same architecture learns:
- constraint propagation for Sudoku
- wavefront exploration for mazes
- incremental rule induction for ARC
No hand-coded solvers. No task-specific logic.
Just recursion.
Why Smaller Models Generalize Better
One of the most uncomfortable results:
Making the network larger hurts performance.
Observed trends:
- 4 layers worse than 2
- MoE worse than dense
- more parameters mean faster overfitting
TRM works precisely because reasoning lives in time, not in the weights.
This is the inverse of the LLM paradigm.
Why Hierarchy Was a Red Herring
HRM worked. But not because of hierarchy.
It worked because:
- state persisted across steps
- computation was iterative
- answers were refined, not predicted
TRM removes:
- multiple networks
- biological framing
- fixed-point math
And performs better.
Hierarchy was an explanation. Recursion is the mechanism.
Results That Actually Matter
With ~7M parameters and ~1000 training examples, TRM achieves:
- 87% on Sudoku-Extreme
- 85% on Maze-Hard
- 45% on ARC-AGI-1
- 8% on ARC-AGI-2
This beats:
- HRM with 27M parameters
- frontier LLMs with billions to trillions of parameters
No pretraining. No chain-of-thought. No token sampling.
The Real Takeaway
TRM demonstrates something uncomfortable:
Reasoning is not a scaling problem. It is a compute-structure problem.
If a model lacks:
- persistent state
- iterative refinement
- self-revision
no amount of parameter scaling will make it reason.
Transformers are elite token predictors. TRMs are learned recursive solvers.
What Comes Next
TRMs are not the endgame.
They are:
- supervised
- deterministic
- non-generative
But they point clearly toward the future:
- internal reasoning over external narration
- recursion over architectural depth
- state over tokens
Scaling token predictors gave us fluency.
Recursive architectures are how you get thinking.