Tiny Reasoning Model

2025-12-22


The faster twin of an HRM

Tags: AI, Machine Learning, Reasoning Models

Large Language Models can be fluent, knowledgeable, and impressively broad.

But at inference time, they still execute a mostly fixed computation graph once per token.

That single fact is fatal for tasks that require search, backtracking, and revision of intermediate conclusions.

These are algorithmic problems, not perception problems.

No amount of parameter scaling fixes this.

The Core Failure Mode of LLMs

Across domains like Sudoku, mazes, and ARC, the failure pattern is consistent:

LLMs do not reason.
They commit.

Once a bad assumption is made, there is no internal mechanism to revise it.

HRM Was Directionally Right, Structurally Overcomplicated

Hierarchical Reasoning Models showed something important:

Iterative latent computation matters more than scale.

But HRM framed the solution incorrectly.

It leaned on two networks running at different timescales, a story about biological hierarchy, and fixed-point arguments to justify a truncated gradient.

TRM asks a simpler, sharper question:

What is the “high-level” state actually doing?

Answer:
It is just the current proposed solution.

And the “low-level” state?
Just latent reasoning state.

Once you accept this, the hierarchy collapses.

The Key Insight

Reasoning Is Iteration, Not Hierarchy

You do not need a hierarchy, two networks, or two timescales.

You need only two things: a current proposed solution and a latent reasoning state.

And a way to repeatedly update both.

That is recursion.

What Is a Tiny Recursive Model (TRM)?

A TRM consists of one tiny network, a proposed solution y, and a latent reasoning state z.

The behavior emerges from how the network is called, not from architectural complexity.
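To make the ingredients concrete, here is a minimal sketch in PyTorch. Everything in it, the class name, the hidden size, the MLP-style core, the mixing projections, is an illustrative assumption rather than the reference implementation; the point is that a single small network serves every role.

import torch
import torch.nn as nn

class TinyRecursiveModel(nn.Module):
    # One small shared core, plus projections for the two kinds of updates.
    def __init__(self, d_model: int = 128, n_layers: int = 2):
        super().__init__()
        layers = []
        for _ in range(n_layers):
            layers += [nn.Linear(d_model, d_model), nn.GELU()]
        self.core = nn.Sequential(*layers)              # the single tiny network
        self.mix_xyz = nn.Linear(3 * d_model, d_model)  # combines (x, y, z)
        self.mix_yz = nn.Linear(2 * d_model, d_model)   # combines (y, z)

    def update_z(self, x, y, z):
        # z := net(x, y, z), refine the latent reasoning state
        return self.core(self.mix_xyz(torch.cat([x, y, z], dim=-1)))

    def update_y(self, y, z):
        # y := net(y, z), refine the current proposed solution
        return self.core(self.mix_yz(torch.cat([y, z], dim=-1)))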

How Computation Actually Happens

Reasoning proceeds through explicit recursion.

Each step performs two updates:

1. Update reasoning state

z := net(x, y, z)

2. Refine the solution

y := net(y, z)

This loop is repeated many times.

Each iteration takes the current solution, reconsiders it in light of the input and the latent state, and produces a revised one.

This is learned self-correction, end to end.

No decoding. No token sampling. No narration.
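As a sketch, one reasoning segment is nothing more than a loop over those two calls, reusing the hypothetical TinyRecursiveModel above. The counts, several latent updates per solution update and a few cycles per segment, are illustrative defaults rather than canonical hyperparameters.

def reasoning_segment(model, x, y, z, n_latent: int = 6, n_cycles: int = 3):
    # One reasoning segment: repeatedly refine z, then refine y, several times over.
    for _ in range(n_cycles):
        for _ in range(n_latent):
            z = model.update_z(x, y, z)   # z := net(x, y, z)
        y = model.update_y(y, z)          # y := net(y, z)
    return y, z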

Depth Comes From Time, Not Layers

TRMs are deliberately small: a couple of layers and a few million parameters.

Why?

Because depth is created by recursion, not architecture.

A TRM with only two layers, applied recursively, achieves over 40 sequential transformations per reasoning segment.
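As a rough worked example, using the illustrative counts from the sketch above (a 2-layer core, 6 latent updates plus 1 solution update per cycle, 3 cycles per segment): 2 × (6 + 1) × 3 = 42 sequential layer applications, all from a network small enough to read in one sitting.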

Unlike Transformers, that depth is not bought with parameters; the same two layers are simply applied again and again.

This is the difference between depth you pay for in weights and depth you create with iteration.

Deep Supervision

Learning to Improve, Not Just Predict

TRM does not wait until the end to apply loss.

After each reasoning segment, the current solution is scored against the target and a loss is applied; the refined states then carry into the next segment.

The model is trained to answer a different question:

“Given my current partial solution, how do I make it better?”

This is fundamentally different from:

“Given the input, predict the output.”

That difference is why TRMs generalize under tiny data regimes where large models overfit.
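Here is a minimal sketch of that training loop, assuming the reasoning_segment function above plus a hypothetical readout head that maps y to output logits; the initial states, shapes, and segment count are assumptions, not the paper's exact recipe.

import torch
import torch.nn.functional as F

def deep_supervision_step(model, readout, optimizer, x, target, n_segments: int = 4):
    # Apply a loss after every reasoning segment, not only at the end.
    y = torch.zeros_like(x)   # initial proposed solution (assumed initialization)
    z = torch.zeros_like(x)   # initial latent reasoning state (assumed initialization)
    for _ in range(n_segments):
        y, z = reasoning_segment(model, x, y, z)
        loss = F.cross_entropy(readout(y), target)  # "make the current y better"
        loss.backward()                             # gradients flow through the whole segment
        optimizer.step()
        optimizer.zero_grad()
        y, z = y.detach(), z.detach()               # carry states forward, not gradients
    return y

The detach between segments is what turns training into "improve the partial solution you were handed" rather than "predict the output from the input in one shot".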

No Fixed Points, No Gradient Tricks

HRM depends on fixed-point assumptions and a one-step gradient approximation justified by the implicit function theorem.

TRM removes all of it.

There is no fixed-point assumption, no implicit differentiation, and no one-step gradient shortcut.

TRM backpropagates through the full recursion.

Yes, it costs more memory. Yes, it works better.

Ablations confirm this. One-step gradients collapse performance.
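To make the contrast concrete, here is a hedged sketch of the shortcut versus full backpropagation, written in terms of the hypothetical update functions above; it mimics the spirit of a one-step approximation rather than HRM's exact code.

import torch

def one_step_gradient_segment(model, x, y, z, n_latent: int = 6, n_cycles: int = 3):
    # Shortcut: run all but the final updates without tracking gradients,
    # then backpropagate through only the last z-update and y-update.
    with torch.no_grad():
        for _ in range(n_cycles - 1):
            for _ in range(n_latent):
                z = model.update_z(x, y, z)
            y = model.update_y(y, z)
        for _ in range(n_latent - 1):
            z = model.update_z(x, y, z)
    z = model.update_z(x, y, z)   # only these two calls carry gradients
    y = model.update_y(y, z)
    return y, z

def full_backprop_segment(model, x, y, z):
    # TRM-style: gradients flow through every recursive update.
    return reasoning_segment(model, x, y, z)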

Adaptive Computation Without Complexity

TRM keeps adaptive computation time, but simplifies it to a single learned halting signal.

Easy problems halt early. Hard problems get more compute.

This gives adaptive effort without the extra machinery.
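A hedged sketch of what a simplified halting rule can look like: the halting head, the "is the current answer already correct?" target, and the threshold are assumptions about one reasonable implementation, not a transcription of the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

halt_head = nn.Linear(128, 1)   # hypothetical head: solution state -> halting logit

def halting_signal(y, prediction, target, threshold: float = 0.5):
    # Train the head to predict whether the current answer is already correct,
    # and stop recursing once it is confident enough.
    logit = halt_head(y).squeeze(-1)
    already_correct = (prediction == target).all(dim=-1).float()
    halt_loss = F.binary_cross_entropy_with_logits(logit, already_correct)
    should_halt = torch.sigmoid(logit) > threshold
    return halt_loss, should_halt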

What the Model Actually Learns

Visualizations show a critical distinction:

z is not symbolic. It is not interpretable.

It is pure reasoning state.

Across tasks, the same architecture learns whatever iterative procedure the task demands.

No hand-coded solvers. No task-specific logic.

Just recursion.

Why Smaller Models Generalize Better

One of the most uncomfortable results:

Making the network larger hurts performance.

Observed trends: adding width or depth increases overfitting on these tiny datasets, and test accuracy drops.

TRM works precisely because its capacity is too small to memorize, so recursion has to do the work.

This is the inverse of the LLM paradigm.

Why Hierarchy Was a Red Herring

HRM worked. But not because of hierarchy.

It worked because of iterative latent refinement and deep supervision, not because of the hierarchy itself.

TRM removes the second network, the two timescales, and the fixed-point machinery.

And performs better.

Hierarchy was an explanation. Recursion is the mechanism.

Results That Actually Matter

With ~7M parameters and ~1000 training examples per task, TRM achieves strong accuracy on Sudoku, maze-solving, and ARC.

This beats far larger chain-of-thought LLMs on the same benchmarks, at a tiny fraction of the parameter count.

No pretraining. No chain-of-thought. No token sampling.

The Real Takeaway

TRM demonstrates something uncomfortable:

Reasoning is not a scaling problem. It is a compute-structure problem.

If a model lacks an iterative latent state and a way to revise its own partial answers,

no amount of parameters will make it reason.

Transformers are elite token predictors. TRMs are learned recursive solvers.

What Comes Next

TRMs are not the endgame.

They are small, specialized solvers trained per task, and a proof of concept more than a product.

But they point clearly toward the future:

Scaling token predictors gave us fluency.

Recursive architectures are how you get thinking.