A Verifiable Search Is Not a Learnable Chain-of-Thought

01 · the premise

If a program can solve it, surely a model can learn it.

This is the premise behind reasoning-by-distillation, and behind most of the recent work that trains small models on a stronger model's traces: if you can produce a correct chain-of-thought for a task, you can fine-tune a model to reproduce that reasoning and inherit the skill. It works because most procedures are forward. Each step is a deterministic function of the steps before it, so the trace is a faithful recipe and imitation recovers the function.

Converting numbers, applying a unit factor, decoding a cipher whose key you can read off the prompt: write out the steps, train on them, and a small model reproduces them almost perfectly. The recipe transfers. So I built a controlled laboratory where solvability and learnability could be measured separately, and pushed the premise until it broke.

02 · the laboratory

Nine procedural tasks, sorted into three outcomes.

The benchmark gives 9,500 training problems across nine procedurally-generated task families, and has one property that makes it an unusually honest instrument: the public training data and the hidden test set are drawn from the same generators. So a held-out slice of training rows is a faithful proxy for the hidden leaderboard. For each family I reverse-engineered the generator into a Python solver (five of nine reach ≥98%), rendered the solver's execution as a step-by-step chain-of-thought, distilled it into a rank-32 LoRA adapter, and asked one question per family under greedy decoding: did the procedure transfer?

Base model: Nemotron-3-Nano · hybrid Mamba-2 + MoE, ~3.5B active / 30B total
Adapter: rank-32 LoRA, SFT on solver-rendered chain-of-thought
Decoding: greedy, temperature 0, within a 7,680-token answer budget
Data: 9,500 training rows; train and hidden test share generators
Solvers: nine generators reverse-engineered; five reach ≥98% coverage
Metric: exact-match accuracy on held-out training rows
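
The metric line above can be sketched in a few lines of Python. This is a minimal illustration, not the released evaluation code: the function names `exact_match_accuracy` and `wilson_interval` are hypothetical, whitespace stripping is an assumed normalization, and the Wilson score interval is included only because the paper reports Wilson confidence intervals.

```python
import math


def exact_match_accuracy(predictions, references):
    """Count predictions equal to their reference (assumed: compare after stripping whitespace)."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits, len(references)


def wilson_interval(hits, n, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = hits / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


if __name__ == "__main__":
    hits, n = exact_match_accuracy(["12", " 7 ", "x"], ["12", "7", "y"])
    print(hits, n, wilson_interval(hits, n))
```

Near the floor the interval is strongly asymmetric, which is why a Wilson interval is a more honest choice than a normal approximation for accuracies of a few percent.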


03 · the surprise

One family breaks the premise: cryptarithm.

A cryptarithm hides both digits and operators behind a per-puzzle symbol cipher. You are shown a handful of equations in symbols and must complete a new one. My solver cracks 71% by backtracking search over the cipher. The model distilled from that solver's own reasoning reaches 5%. Eleven redesigns of the chain-of-thought, reinforcement learning from verifiable rewards, and self-training all leave it at the floor.

The solver (search): 0.71, versus the model it trained: 0.05.

Crucially, this is not a capability gap. In a line-by-line audit of the trained model's transcripts, the arithmetic steps are correct on 97–100% of lines, and when the model is handed candidate answers it ranks the correct one into its top eight about 71% of the time. It can do the arithmetic, and it can recognize the answer. What it cannot do is generate the answer by carrying the search forward. The next sections show why distilling the search fails on every model, and what makes the task learnable anyway.

04 · feel the difference

Some answers you derive forward. Some you can only search for.

Here is the crux, made tactile. Take a small cryptarithm. With the cipher hidden, nothing tells you the next digit: you have to guess a whole assignment, check it against every clue, and backtrack on a contradiction. That is search, and it has no left-to-right story to write down. Now reveal the key, and every step becomes forced; the answer falls out in a single forward pass, exactly the kind of reasoning a model imitates easily.

A small puzzle like this is a simplified stand-in: the real puzzles hide the digits behind arbitrary symbols, but the structure is identical. With the key, decoding is a forward lookup the model learns. Without it, only backtracking search works, and that is where the model fails. Section 08 turns this knob on the real task.
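
To make the contrast concrete, here is a minimal Python sketch on a toy puzzle. Everything in it is illustrative rather than taken from the benchmark: the three clues, the helper names (`decode`, `backtrack`, `solve_by_search`, `solve_with_key`), the visible `+` operator (the real task hides operators too), and the assumption that distinct symbols stand for distinct digits.

```python
# Toy puzzle (hypothetical): each clue (left, right, result) means left + right = result.
CLUES = [("a", "a", "b"), ("a", "b", "c"), ("b", "c", "d")]
QUERY = ("ab", "cd")  # complete: ab + cd = ?


def decode(word, key):
    """Read a symbol string as a base-10 integer under the given key."""
    return int("".join(str(key[s]) for s in word))


def consistent(clues, key):
    """False as soon as any fully decoded clue violates left + right == result."""
    for left, right, result in clues:
        if all(s in key for s in left + right + result):
            if decode(left, key) + decode(right, key) != decode(result, key):
                return False
    return True


def backtrack(clues, symbols, key, stats):
    """Guess one symbol at a time; undo the guess when it leads to a contradiction."""
    if len(key) == len(symbols):
        return dict(key)
    sym = symbols[len(key)]
    for digit in range(10):
        if digit in key.values():  # assumed: distinct symbols map to distinct digits
            continue
        key[sym] = digit
        stats["tried"] += 1
        if consistent(clues, key):
            found = backtrack(clues, symbols, key, stats)
            if found is not None:
                return found
        del key[sym]  # the guess is only known to be wrong after exploring it
    return None


def solve_by_search(clues, query):
    """Without the key: search for an assignment, then decode the query."""
    symbols = sorted({s for clue in clues for word in clue for s in word})
    stats = {"tried": 0}
    key = backtrack(clues, symbols, {}, stats)
    if key is None:
        raise ValueError("no consistent assignment")
    return decode(query[0], key) + decode(query[1], key), stats["tried"]


def solve_with_key(query, key):
    """With the key: a forward lookup, one step per symbol, no guessing."""
    return decode(query[0], key) + decode(query[1], key)


if __name__ == "__main__":
    answer, tried = solve_by_search(CLUES, QUERY)
    print("search:", answer, "after", tried, "tentative assignments")
    print("lookup:", solve_with_key(QUERY, {"a": 1, "b": 2, "c": 3, "d": 5}))
```

Both routes reach the same answer. The difference is that the keyed version can be written as a left-to-right recipe, while the search only knows a guess was wrong after following it to a contradiction.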

05 · the mechanism

There is no faithful forward trace to imitate.

Supervised fine-tuning teaches an autoregressive model to predict the next token from the prefix. That recovers a procedure only when each step is locally justified by what came before. A backtracking search is not: a step is justified by the future (you learn a branch was wrong only after exploring it), so a left-to-right reading of the trace offers no local rule to learn. The model fits the trace's surface form without its control flow.

At inference this surfaces as verdict-as-token. The model enumerates candidate pairs, computes one that plainly matches the target on the very same line, and then emits the high-frequency closing phrase it saw thousands of times in training, “none matches, drop,” regardless of the evidence it just produced. Greedy decoding makes this worse through exposure bias: one such mis-step is never corrected, and the derivation walks off into the token budget. In the fidelity audit of 7,566 line-records the arithmetic lines are correct 97–100% of the time, while the verdict lines invert their own evidence.

[Interactive figure: trained-model transcripts from the fidelity audit, annotated with what is actually wrong.]
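
The two checks behind the fidelity audit (is the arithmetic on a line correct, and does the verdict agree with the evidence on that same line) can be sketched as follows. The line format, the `keep`/`drop` verdict vocabulary, the three sample lines, and the names `LINE`, `OPS` and `audit` are all hypothetical; the real transcripts and the released audit script may look quite different.

```python
import re

# Assumed line format: "<a> <op> <b> -> <shown> | target <t> | <keep|drop>"
LINE = re.compile(
    r"(\d+) (plus|minus|times) (\d+) -> (-?\d+) \| target (-?\d+) \| (keep|drop)"
)
OPS = {
    "plus": lambda a, b: a + b,
    "minus": lambda a, b: a - b,
    "times": lambda a, b: a * b,
}


def audit(lines):
    """Count lines whose arithmetic is right and lines whose verdict matches their own evidence."""
    report = {"lines": 0, "arithmetic_ok": 0, "verdict_ok": 0}
    for line in lines:
        m = LINE.fullmatch(line.strip())
        if m is None:
            continue  # not a candidate line; ignore it
        a, op, b, shown, target, verdict = m.groups()
        report["lines"] += 1
        # Arithmetic check: does the value the model wrote equal the true result?
        report["arithmetic_ok"] += int(OPS[op](int(a), int(b)) == int(shown))
        # Verdict check: "keep" should appear exactly when the written value hits the target.
        report["verdict_ok"] += int((verdict == "keep") == (int(shown) == int(target)))
    return report


SAMPLE = [
    "34 plus 12 -> 46 | target 58 | drop",
    "41 plus 17 -> 58 | target 58 | drop",  # correct arithmetic, inverted verdict
    "9 times 6 -> 54 | target 58 | drop",
]

if __name__ == "__main__":
    print(audit(SAMPLE))
```

On the invented sample, every arithmetic line passes while one verdict contradicts the number computed on its own line, which is the verdict-as-token pattern in miniature.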

Tokenizer note. The base tokenizer is a 131,072-entry BPE with no <unk>; unicode operators fragment into rare byte tokens, so all chain-of-thought is written in plain ASCII (operators as words, ->, !=). This constrains how symbols are expressed but is not the failure: the architecture sweep below holds with each model's own tokenizer.
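
As a rough illustration of the ASCII-only convention, a rendering step might look like the sketch below. The mapping table `ASCII_FORMS` and the function `to_ascii` are assumptions made for this example; the article only states that operators are written as words and that `->` and `!=` are used.

```python
# Hypothetical mapping from unicode operators to plain-ASCII forms.
ASCII_FORMS = {
    "\u2192": "->",      # rightwards arrow
    "\u2260": "!=",      # not-equal sign
    "\u00d7": "times",   # multiplication sign
    "\u2212": "minus",   # unicode minus sign
}


def to_ascii(line):
    """Rewrite known unicode operators and fail loudly if any non-ASCII character remains."""
    for symbol, ascii_form in ASCII_FORMS.items():
        line = line.replace(symbol, ascii_form)
    line.encode("ascii")  # raises UnicodeEncodeError on leftover non-ASCII text
    return line


if __name__ == "__main__":
    print(to_ascii("3 \u00d7 4 \u2192 12"))
```

Raising on leftover non-ASCII characters, rather than silently dropping them, keeps a stray symbol from reaching the training corpus unnoticed.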

06 · it fit the data perfectly

It learned the corpus. It never learned the search.

Loss measures whether the model fit the tokens of the reasoning. Accuracy measures whether it learned the logic. On cryptarithm the two come apart completely. Every corpus trained to a negative-log-likelihood below 0.06, reproducing the chain-of-thought almost verbatim, while greedy free-running collapsed and held-out accuracy never left the floor.
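
One way to write down why the two can diverge, using notation introduced here rather than taken from the paper: $x$ is the prompt, $y$ the reference chain-of-thought of length $T$, $\hat{y}$ the greedy free-running generation, $a^{(i)}$ the gold answer for held-out row $i$, and $\operatorname{answer}(\cdot)$ a hypothetical function that extracts the final answer from a generation.

```latex
\mathcal{L}_{\mathrm{NLL}}
  = -\frac{1}{T}\sum_{t=1}^{T}\log p_\theta\!\left(y_t \mid x,\; y_{<t}\right),
\qquad
\hat{y}_t = \arg\max_{v}\; p_\theta\!\left(v \mid x,\; \hat{y}_{<t}\right),
\qquad
\mathrm{Acc} = \frac{1}{N}\sum_{i=1}^{N}
  \mathbf{1}\!\left[\operatorname{answer}\!\left(\hat{y}^{(i)}\right) = a^{(i)}\right]
```

The loss conditions every prediction on the reference prefix $y_{<t}$, whereas accuracy is computed on a sequence conditioned on the model's own prefix $\hat{y}_{<t}$. A model can drive the first toward zero and still derail the second after a single wrong token.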

Training loss (NLL): 0.146 → 0.001. It fit the corpus essentially perfectly.

Held-out accuracy: ~0.03 (flat). Across every redesign it never moved.

Held-out accuracy at thirteen checkpoints spanning eleven cryptarithm chain-of-thought redesigns, reinforcement learning, and self-training. Cryptarithm (clay) sits at the floor throughout. Bit-manipulation (teal), which is partly forward-derivable, climbs from 0.52 to 0.68 over the same kind of iteration. Same effort; only the forward-derivable task improves.

07 · it's the task, not the model

Bigger, different, smarter: all stuck at the floor.

Maybe the limit is this particular model? I trained the identical corpus on four architectures that share almost nothing, each evaluated with its own tokenizer to rule out a tokenization artifact: dense transformers (Llama-3.2-3B, Qwen3.5-4B), an MoE transformer (gpt-oss-20b), and the hybrid Mamba-2 + MoE base. Spanning a state-space model and dense and sparse transformers rules out any single architectural state-tracking limit. I also prompted two frontier models in-context with few-shot examples under the same budget. Every one fits its training data; not one reproduces the search.

Cryptarithm accuracy on identical held-out puzzles, from 3B to 671B and fine-tuned to prompted. Every model lands within about four points of zero. The teal bar is what the search solver reaches on the same puzzles; the gap, not the differences between models, is the story.

08 · the causal intervention

Turn the search into a forward derivation, and it learns.

If forward-derivability is the cause, making the task forward should fix it, and nothing else should. So I operationalized it: reveal a fraction of the symbol-to-digit key in the prompt, converting exactly that fraction of the derivation from search into lookup, holding the surface task fixed. The result is a clean dose-response.

[Interactive figure: accuracy at each fraction of the key revealed. Displayed setting: 3%.]

Same puzzles, same model, same budget; only forward-derivability varies. Revealing half barely moves it, because any residual unknown re-opens the backtracking search. Revealing all of it lifts accuracy roughly fifteen-fold. Forward-derivability is the causal knob.
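
The intervention itself is simple to sketch. The version below is an assumption-laden illustration, not the contents of `forward_frontier_gen.py`: the function names `reveal_key` and `render_prompt`, the `Known key:` prompt wording, and the seeded random choice of which symbols to reveal are all hypothetical.

```python
import random


def reveal_key(key, fraction, seed=0):
    """Pick a reproducible subset of the symbol-to-digit key to expose in the prompt."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    symbols = sorted(key)
    count = round(fraction * len(symbols))
    shown = random.Random(seed).sample(symbols, count)
    return {s: key[s] for s in sorted(shown)}


def render_prompt(puzzle_text, revealed):
    """Append the revealed part of the key; the puzzle text itself is left unchanged."""
    if not revealed:
        return puzzle_text
    hints = ", ".join(f"{s} -> {d}" for s, d in revealed.items())
    return f"{puzzle_text}\nKnown key: {hints}"


if __name__ == "__main__":
    key = {"a": 1, "b": 2, "c": 3, "d": 5}
    for fraction in (0.0, 0.5, 1.0):
        print(render_prompt("aa + b = ?", reveal_key(key, fraction)))
```

Keeping the puzzle text fixed and varying only the appended key is what lets the fraction revealed act as the single variable in the comparison.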

09 · how it was actually solved

Memorize the search. Don't reason through it.

So is cryptarithm hopeless? No, and the competition's 1st-place solution (private LB 0.92) shows the way by doing exactly what this study predicts you must: not teaching the model to search. The naive search has about 5×10¹⁰ candidates and cannot be written out in 7,680 tokens, so they removed it from the trace entirely.

For each equation they precompute a signature, the pattern of repeated symbols, and enumerate every two-digit case into a catalog of 4,205 signatures, each mapping to a short list of candidate digit-assignments. The model memorizes that catalog; at inference it recalls the candidates rather than deriving them, and the trace does only a short, bounded consistency check across the remaining equations. Bit-manipulation is the same idea: a memorized catalog of 5,238 valid rule-sequences, with nearest-match repair and verification.
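
A repetition-pattern signature and a catalog built from it can be sketched as below. This is a simplified guess at the idea and not the 1st-place implementation: it covers only addition of two two-digit operands with the operator visible, so it will not reproduce the 4,205-signature catalog, and the names `signature`, `build_catalog` and `recall` are hypothetical.

```python
from collections import defaultdict


def signature(symbols):
    """Canonical repetition pattern: each symbol becomes the index of its first occurrence."""
    first_seen = {}
    return tuple(first_seen.setdefault(s, len(first_seen)) for s in symbols)


def build_catalog():
    """Offline step: group every two-digit addition by the pattern of its digit string."""
    catalog = defaultdict(list)
    for a in range(10, 100):
        for b in range(10, 100):
            catalog[signature(f"{a}{b}{a + b}")].append((a, b))
    return catalog


def recall(catalog, left, right, result):
    """Inference step: look up candidate operand pairs for an equation written in symbols."""
    return catalog.get(signature(left + right + result), [])


if __name__ == "__main__":
    catalog = build_catalog()
    candidates = recall(catalog, "@#", "#@", "$$")
    print(len(catalog), "signatures;", len(candidates), "candidates for @# + #@ = $$")
```

The search over digit assignments happens once, offline, inside `build_catalog`. What remains at inference is a table lookup followed by a short consistency check against the other equations, which is the part a forward trace can express.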

That is the dual of everything above. Distilling the search fails (verdict-as-token); the fix is to compute the search offline into a finite table the model memorizes, leaving only recall and verification in the chain-of-thought. What transfers is memorization and verification, not search.


Distilling the search (this study + Open Progress Prize): 0.85

Memorizing the search (team NullSira, 1st place): 0.92

Same task, same base, same 7,680-token budget. The 7-point gap is exactly what memorizing the search buys over distilling it.

Method and figures from the 1st-place write-up (Kaggle · GoodMeatDay, re, reopon).

10 · what it means

Solver coverage doesn't tell you what a model can learn.

A perfect solver proves the answer exists; it says nothing about whether a model can be taught to reason its way there. The predictor of learnability is whether a faithful forward chain-of-thought exists, one that runs left-to-right without guess-and-backtrack. This is the same family of limit reported for compositional tasks (Dziri et al., “Faith and Fate”) and for the unfaithfulness of chain-of-thought (Turpin et al.; Lanham et al.; Lyu et al.), seen here through a clean solvable-versus-learnable lens.

It also gives a cheap screen to run before spending a dollar on training. Does a purely forward derivation cover a meaningful share of instances? Does the hidden structure carry information you can read off the prompt? Does removing search collapse the branching factor? If the answers are no, then no amount of chain-of-thought engineering, RL, scale, or self-training will teach the model to search. The escape is to not distill the search at all: precompute its finite structure into a catalog the model memorizes and verifies against (as the 1st-place solution did), or give the model a real tool, program synthesis, or search-augmented decoding.

11 · for the full detail

The paper, the data, the code.

The paper establishes all of the above with Wilson confidence intervals, the full per-line fidelity audit, the architecture sweep, and the causal intervention. The benchmark decomposition, the nine solvers, every experiment script, and the per-row evaluation dumps are released. The three commands below reproduce the architecture sweep, the in-context frontier baseline, and the forward-derivability intervention.


# architecture sweep: same corpus, any base, evaluated with that model's own tokenizer
bash tinker/run_crypt_arch_ctrl.sh meta-llama/Llama-3.2-3B llama3b 4
# in-context frontier baseline under the 7,680-token answer budget
bash tinker/run_crypt_frontier.sh 20
# forward-derivability intervention: reveal a fraction of the key, then train + eval
python3 pipeline/synth/forward_frontier_gen.py && bash tinker/run_frontier.sh

@misc{patel2026verifiable,
  title  = {A Verifiable Search Is Not a Learnable Chain-of-Thought},
  author = {Patel, Harsh},
  year   = {2026},
  note   = {NVIDIA Nemotron Model Reasoning Challenge}
}

References

Wei et al. Chain-of-Thought Prompting Elicits Reasoning in LLMs. NeurIPS 2022. arXiv:2201.11903
Dziri et al. Faith and Fate: Limits of Transformers on Compositionality. NeurIPS 2023. arXiv:2305.18654
Turpin et al. Language Models Don't Always Say What They Think. NeurIPS 2023. arXiv:2305.04388
Lanham et al. Measuring Faithfulness in Chain-of-Thought Reasoning. 2023. arXiv:2307.13702
Lyu et al. Faithful Chain-of-Thought Reasoning. 2023. arXiv:2301.13379
NullSira (GoodMeatDay, re, reopon). 1st Place Solution. Kaggle write-up, 2026.