
2025-12-22
Tiny Reasoning Model
The faster twin of an HRM
Tags: AI, Machine Learning, Reasoning Models
Large Language Models can be:
- wide
- deep
- massive
But at inference time, they still execute a mostly fixed computation graph once per token.
That single fact is fatal for tasks that require:
- multi-step search
- hypothesis revision
- backtracking
- long-horizon planning
These are algorithmic problems, not perception problems.
No amount of parameter scaling fixes this.
The Core Failure Mode of LLMs
Across domains like Sudoku, mazes, and ARC, the failure pattern is consistent:
- one wrong step poisons the entire output
- chain-of-thought helps only superficially
- increasing sampling or token budgets explodes cost without improving reliability
LLMs do not reason.
They commit.
Once a bad assumption is made, there is no internal mechanism to revise it.
HRM Was Directionally Right, Structurally Overcomplicated
Hierarchical Reasoning Models showed something important:
Iterative latent computation matters more than scale.
But HRM framed the solution incorrectly.
It leaned on:
- hierarchy
- biological metaphors
- fixed-point convergence
- multiple interacting modules
TRM asks a simpler, sharper question:
What is the “high-level” state actually doing?
Answer:
It is just the current proposed solution.
And the “low-level” state?
Just latent reasoning state.
Once you accept this, the hierarchy collapses.
The Key Insight
Reasoning Is Iteration, Not Hierarchy
You do not need:
- planners vs workers
- fast vs slow modules
- nested abstractions
You need only two things:
- y: the current candidate solution
- z: the internal reasoning state
And a way to repeatedly update both.
That is recursion.
What Is a Tiny Recursive Model (TRM)?
A TRM consists of:
- a single tiny neural network (2 layers)
- reused at every step
- no role separation
- no special modules
The behavior emerges from how the network is called, not from architectural complexity.
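As a sketch of how small that is, assuming a 2-layer Transformer encoder as the shared network (the layer type, names, and sizes here are illustrative, not the paper's exact stack):

```python
import torch.nn as nn

class TinyNet(nn.Module):
    """The single small network reused at every reasoning step.
    Hypothetical sketch: a 2-layer Transformer encoder over
    concatenated input tokens. All sizes are illustrative."""
    def __init__(self, dim=128, heads=4, layers=2):
        super().__init__()
        block = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim,
            batch_first=True,
        )
        self.encoder = nn.TransformerEncoder(block, num_layers=layers)

    def forward(self, tokens):
        # Same weights on every call: the behavior comes from how
        # often, and with what inputs, this network is invoked.
        return self.encoder(tokens)
```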
How Computation Actually Happens
Reasoning proceeds through explicit recursion.
Each step performs two updates:
1. Update reasoning state
z := net(x, y, z)
2. Refine the solution
y := net(y, z)
This loop is repeated many times.
Each iteration:
- inspects the current solution
- identifies errors implicitly
- proposes a refinement
This is learned self-correction, end to end.
No autoregressive decoding. No token sampling. No narration.
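A minimal sketch of that loop, reusing the TinyNet above. The concatenate-and-slice scheme and the step count are assumptions for illustration, not the paper's exact wiring:

```python
import torch

def reasoning_segment(net, x, y, z, n_steps=6):
    """One reasoning segment, sketched. x, y, z are (batch, seq, dim)
    embeddings of the question, the current candidate solution, and
    the latent reasoning state."""
    for _ in range(n_steps):
        # 1. Update reasoning state: z := net(x, y, z)
        out = net(torch.cat([x, y, z], dim=1))
        z = out[:, -z.shape[1]:, :]
        # 2. Refine the solution: y := net(y, z)
        out = net(torch.cat([y, z], dim=1))
        y = out[:, :y.shape[1], :]
    return y, z
```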
Depth Comes From Time, Not Layers
TRMs are deliberately small:
- 2 layers
- ~5 to 7M parameters
Why?
Because depth is created by recursion, not architecture.
A TRM with:
- 2 layers
- 6 inner recursions
- 3 outer cycles
achieves an effective depth of over 40 sequential transformations.
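One natural way to make that count concrete (an assumption about the bookkeeping: each of the T outer cycles does n latent updates plus one solution refinement, each through L layers):

$$
\text{effective depth} = T\,(n + 1)\,L = 3 \times (6 + 1) \times 2 = 42
$$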
Unlike Transformers:
- earlier assumptions can be revised
- computation does not collapse into one pass
- errors are not permanent
This is the difference between:
- pattern recognition
- algorithm execution
Deep Supervision
Learning to Improve, Not Just Predict
TRM does not wait until the end to apply a loss.
After each reasoning segment:
- an answer is produced
- loss is computed
- state is detached
- reasoning continues
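A hedged sketch of those four steps, reusing reasoning_segment from above. The decoder head, the loss, and the segment count are illustrative stand-ins, not the paper's exact recipe:

```python
import torch
import torch.nn.functional as F

def train_step(net, head, x, target, optimizer, n_segments=4):
    """Deep supervision, sketched: produce an answer after every
    reasoning segment, take a loss, detach the carried state, and
    keep reasoning. `head` is a token-wise linear decoder."""
    y = torch.zeros_like(x)   # initial candidate solution
    z = torch.zeros_like(x)   # initial latent reasoning state
    for _ in range(n_segments):
        y, z = reasoning_segment(net, x, y, z)
        logits = head(y)                   # an answer is produced
        loss = F.cross_entropy(            # loss is computed
            logits.flatten(0, 1), target.flatten()
        )
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        y, z = y.detach(), z.detach()      # state is detached,
                                           # reasoning continues
```

Detaching between segments keeps memory bounded per segment while the state itself persists into the next one.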
The model is trained to answer a different question:
“Given my current partial solution, how do I make it better?”
This is fundamentally different from:
“Given the input, predict the output.”
That difference is why TRMs generalize in tiny-data regimes where large models overfit.
No Fixed Points, No Gradient Tricks
HRM depends on:
- fixed-point assumptions
- implicit gradients
- one-step approximations
TRM removes all of it.
There is:
- no convergence assumption
- no equilibrium requirement
- no gradient approximation
TRM backpropagates through the full recursion.
Yes, it costs more memory. Yes, it works better.
Ablations confirm this. One-step gradients collapse performance.
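Where the gradient is cut is the whole difference. A contrast sketch, with both functions as simplified stand-ins (the HRM-style variant is shown only to make the cut visible):

```python
import torch

def recurse_full(net, x, y, z, n_inner=6):
    """TRM: every z-update stays on the autograd tape, so credit
    assignment reaches the earliest reasoning steps. Costs memory."""
    for _ in range(n_inner):
        z = net(torch.cat([x, y, z], dim=1))[:, -z.shape[1]:, :]
    return z

def recurse_one_step(net, x, y, z, n_inner=6):
    """HRM-style one-step approximation, simplified: recurse without
    gradients, then backprop through only the final update."""
    with torch.no_grad():
        for _ in range(n_inner - 1):
            z = net(torch.cat([x, y, z], dim=1))[:, -z.shape[1]:, :]
    return net(torch.cat([x, y, z], dim=1))[:, -z.shape[1]:, :]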
Adaptive Computation Without Complexity
TRM keeps adaptive computation time, but simplifies it:
- a single halting head
- binary supervision: “is the answer correct?”
- no reinforcement learning
- no second forward pass
Easy problems halt early. Hard problems get more compute.
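A minimal sketch of such a head, assuming the halting decision is read off a pooled solution state; every name here is illustrative:

```python
import torch.nn as nn
import torch.nn.functional as F

class HaltingHead(nn.Module):
    """A single halting head, sketched: predicts 'is the current
    answer correct?' from the candidate solution y."""
    def __init__(self, dim=128):
        super().__init__()
        self.proj = nn.Linear(dim, 1)

    def forward(self, y):
        # Mean-pool the solution tokens, emit one halting logit.
        return self.proj(y.mean(dim=1)).squeeze(-1)

def halting_loss(halt_logit, is_correct):
    # Binary supervision: label 1 if the decoded answer matches the
    # target, else 0. Plain BCE: no RL, no second forward pass.
    return F.binary_cross_entropy_with_logits(halt_logit, is_correct.float())
```

At inference, recursion stops once the predicted probability of correctness crosses a threshold; otherwise another segment runs.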
This gives:
- inference-time scaling
- training efficiency
- architectural simplicity
What the Model Actually Learns
Visualizations show a critical distinction:
- y always decodes to a valid candidate solution
- z never does
z is not symbolic. It is not interpretable.
It is pure reasoning state.
Across tasks, the same architecture learns:
- constraint propagation for Sudoku
- wavefront exploration for mazes
- incremental rule induction for ARC
No hand-coded solvers. No task-specific logic.
Just recursion.
Why Smaller Models Generalize Better
One of the most uncomfortable results:
Making the network larger hurts performance.
Observed trends:
- 4 layers worse than 2
- MoE worse than dense
- more parameters mean faster overfitting
TRM works precisely because reasoning lives in time, not in the weights.
This is the inverse of the LLM paradigm.
Why Hierarchy Was a Red Herring
HRM worked. But not because of hierarchy.
It worked because:
- state persisted across steps
- computation was iterative
- answers were refined, not predicted
TRM removes:
- multiple networks
- biological framing
- fixed-point math
And performs better.
Hierarchy was an explanation. Recursion is the mechanism.
Results That Actually Matter
With ~7M parameters and ~1000 training examples, TRM achieves:
- 87% on Sudoku-Extreme
- 85% on Maze-Hard
- 45% on ARC-AGI-1
- 8% on ARC-AGI-2
This beats:
- HRM with 27M parameters
- frontier LLMs with billions to trillions of parameters
No pretraining. No chain-of-thought. No token sampling.
The Real Takeaway
TRM demonstrates something uncomfortable:
Reasoning is not a scaling problem. It is a compute-structure problem.
If a model lacks:
- persistent state
- iterative refinement
- self-revision
no amount of parameter scaling will make it reason.
Transformers are elite token predictors. TRMs are learned recursive solvers.
What Comes Next
TRMs are not the endgame.
They are:
- supervised
- deterministic
- non-generative
But they point clearly toward the future:
- internal reasoning over external narration
- recursion over architectural depth
- state over tokens
Scaling token predictors gave us fluency.
Recursive architectures are how you get thinking.