A Small Model Beat GPT-5 Mini at Tic-Tac-Toe. Here's How.

By Ptrck Brgr

A small open-source model that could barely beat a random tic-tac-toe player ended up outperforming GPT-5 Mini against an optimal opponent. Not through a bigger training budget. Through a better-designed environment.

Stefano Fiorucci at Deepset walks through exactly how in "Let LLMs Wander: Engineering RL Environments", and the operational details are where it gets useful. Not the RL headline. The reward noise. The hidden opponent bias. The batch size that collapsed training entirely.

My PhD work on autonomous systems drilled one thing into me: agent behavior emerges from the loop it operates in, not the weights alone. Tic-tac-toe is a toy problem, sure—but every failure mode Fiorucci hit maps directly to what I see in production agent systems across enterprise AI.

The Reward Is the Whole Game

Supervised fine-tuning gives a model curated examples and says "be like this." RL with verifiable rewards gives a model a playground and says "figure it out."

The model is no longer limited by the quality of human examples. Through trial and error, it can discover more efficient reasoning strategies. — Stefano Fiorucci, Deepset

That's the promise. But "figure it out" only works when the reward signal is clean. Fiorucci spent most of his talk on environment engineering—controlling randomness so the model learns from its own play, not from noise in the setup.
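
What "clean" looks like concretely: the environment alone computes the score. A minimal sketch of a verifiable reward for tic-tac-toe; the values and the illegal-move penalty are my illustration, not details from the talk:

```python
def game_reward(outcome: str, made_illegal_move: bool) -> float:
    """Verifiable reward: computed by the environment from the game
    result, with no human judgment in the loop. Values illustrative."""
    if made_illegal_move:
        return -1.0  # teach the action format before strategy
    return {"win": 1.0, "draw": 0.5, "loss": 0.0}[outcome]
```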

Deterministic seeds per board state were how Fiorucci controlled that randomness. If two rollouts reached the same position, the opponent responded identically. Without that, the model couldn't distinguish a good strategy from a lucky opponent draw. Small detail. Massive consequence.
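
Fiorucci doesn't show the implementation, but one straightforward way to get that property is to derive the opponent's RNG seed from the board state itself. A sketch under that assumption:

```python
import hashlib
import random

def seeded_rng(board: tuple[str, ...]) -> random.Random:
    """Derive a deterministic RNG from the position: the same board
    always yields the same random stream, across all rollouts."""
    digest = hashlib.sha256("".join(board).encode()).digest()
    return random.Random(int.from_bytes(digest[:8], "big"))

def opponent_reply(board: tuple[str, ...]) -> int:
    """Opponent move that is random across positions but repeatable
    within one, so reward differences reflect the model's play."""
    rng = seeded_rng(board)
    legal = [i for i, cell in enumerate(board) if cell == " "]
    return rng.choice(legal)
```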

When Benchmarks Lie

Here's the moment that stopped me.

Fiorucci built a minimax opponent for evaluation. Got strong benchmark results. Then he actually played against the trained model—and it was clueless.

I got great benchmark results, but then playing against the model, I realized that it was clueless. — Stefano Fiorucci, Deepset

The bug: his minimax always picked the first optimal move whenever several existed. The model hadn't learned tic-tac-toe; it had memorized one specific opponent's style. Against a human making different (but equally valid) moves, it fell apart.
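
The fix is small once you see it: score every legal move, then sample uniformly among the ties instead of always taking the first. A sketch of the idea (my implementation, not his):

```python
import random

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def winner(board):
    for a, b, c in LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None

def score(board, to_move, me):
    """Exhaustive minimax value for `me`: +1 win, 0 draw, -1 loss."""
    w = winner(board)
    if w is not None:
        return 1 if w == me else -1
    legal = [i for i, c in enumerate(board) if c == " "]
    if not legal:
        return 0
    nxt = "O" if to_move == "X" else "X"
    vals = [score(board[:i] + (to_move,) + board[i + 1:], nxt, me)
            for i in legal]
    return max(vals) if to_move == me else min(vals)

def optimal_move(board, player, rng=random):
    """The bug was "always take the first best move"; the fix is to
    collect every move tied for the best score and pick uniformly."""
    nxt = "O" if player == "X" else "X"
    legal = [i for i, c in enumerate(board) if c == " "]
    scored = [(score(board[:i] + (player,) + board[i + 1:], nxt, player), i)
              for i in legal]
    best = max(v for v, _ in scored)
    return rng.choice([i for v, i in scored if v == best])
```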

At Tier, building edge ML, we'd get models that performed beautifully on test sets but crumbled the moment the real-world input distribution shifted. Same lesson: if your evaluation environment has hidden structure, your model will find and exploit it, and you won't know until deployment.

The Batch Size Cliff

Training collapsed with batch sizes below 256. Not degraded. Collapsed.

Fiorucci's explanation is intuitive: batch size is the number of games played before each weight update. A small batch means learning from a handful of matches against a narrow slice of opponent difficulty. The model locks onto whatever worked in that tiny sample and can't generalize.

I didn't expect batch size to be this binary. Gradual degradation, sure. But Fiorucci saw actual model collapse: the output became entirely useless. Opponent skill varied across the dataset, so a small batch might accidentally sample all-easy or all-hard games, making the reward signal meaningless.

(And this is the kind of operational detail that never makes it into the RL paper abstract but decides whether your training run produces anything useful.)
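
One mitigation, assuming each game can be tagged with its opponent's difficulty level, is to stratify every batch so no single update sees only easy or only hard games. A sketch:

```python
import random
from collections import defaultdict

def stratified_batch(tagged_games, batch_size, rng=random):
    """Build a batch that spans opponent difficulties evenly.

    `tagged_games` is a list of (difficulty_bucket, game) pairs. An
    unstratified small batch can land on all-easy or all-hard games,
    turning the reward signal into noise; splitting the quota across
    buckets keeps every update grounded in a mix of difficulties.
    (Remainder handling is omitted for brevity.)
    """
    by_bucket = defaultdict(list)
    for bucket, game in tagged_games:
        by_bucket[bucket].append(game)
    quota = batch_size // len(by_bucket)
    batch = []
    for games in by_bucket.values():
        batch.extend(rng.sample(games, min(quota, len(games))))
    rng.shuffle(batch)
    return batch
```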

Calibrating the Opponent

A perfect opponent too early kills learning. The model never wins, so it never gets a positive reward signal. Nothing to reinforce.

Fiorucci started with opponents making random moves 20-70% of the time. Good enough for the model to discover attack and defense. In a second phase, he tightened it to 0-25% randomness—pushing toward strong play without choking off the learning signal entirely.
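
In code, that calibration is just a randomness parameter on the opponent plus a phase schedule. A sketch using the ranges from the talk; drawing epsilon once per game is my assumption:

```python
import random

# Per-phase ranges for opponent randomness, per the talk: wide enough
# at first for the model to discover attack and defense, then tightened.
PHASES = {1: (0.20, 0.70), 2: (0.00, 0.25)}

def draw_epsilon(phase: int, rng=random) -> float:
    """Draw one game's opponent randomness from the phase's range."""
    return rng.uniform(*PHASES[phase])

def opponent_move(board, player, epsilon, rng=random):
    """With probability epsilon, a random legal move; otherwise optimal
    play, e.g. the tie-broken minimax sketched earlier."""
    legal = [i for i, c in enumerate(board) if c == " "]
    if rng.random() < epsilon:
        return rng.choice(legal)
    return optimal_move(board, player, rng)
```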

He also tried purely optimal opponents. The model became defensive to the point of uselessness: it could draw against perfect play but couldn't exploit mistakes from weaker opponents. It had overfit to one game shape.

Sound familiar? Start constrained, add complexity when the system proves it can handle it. Jump straight to the hardest version of the problem and you get a system that's brittle or paralyzed. Same pattern in enterprise agent deployment.

Why This Matters

Here's the provocative claim: if you can define a clear reward signal, you can train a small model to beat a large closed model on a specific task at a fraction of the cost. Fiorucci backed it up—his fine-tuned model outperformed GPT-5 Mini against optimal opponents.

But that "if" nags at me. "Define a clear reward signal" is doing enormous work in that sentence. Tic-tac-toe has a verifiable outcome: win, lose, draw. Most real business workflows don't. Reward design—not the RL algorithm—is where this approach hits its ceiling for enterprise use.

Where's the boundary between tasks where verifiable rewards are tractable and tasks where you're building a more sophisticated version of the same evaluation gap? Fiorucci's 200 synthetic training examples, his stratified sampling, his deterministic seeding—all careful, principled engineering. Replicating that for ambiguous real-world tasks is a different challenge entirely.

What Works

Start with SFT warm-up before RL. Fiorucci used 200 synthetic examples to teach format and valid moves. Small investment, huge stability gain.
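
The talk doesn't show what those examples look like. A hypothetical one in chat format, with the schema and prompt wording entirely my own, to illustrate teaching format and legality rather than strategy:

```python
# Hypothetical synthetic SFT example: the target move is legal and the
# reply matches the expected format, which is all the warm-up teaches.
sft_example = {
    "messages": [
        {"role": "user", "content": (
            "You are X. Board (rows 0-2, cols 0-2):\n"
            "X| |O\n"
            " |X| \n"
            " | | \n"
            "Reply with one legal move as 'row,col'."
        )},
        {"role": "assistant", "content": "2,2"},
    ]
}
```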

Design your environment before picking your algorithm. Reward quality, opponent calibration, noise control—these mattered more than model choice or RL variant.

Use large batch sizes for RL on interactive tasks. Below 256, Fiorucci saw collapse. The cost of larger batches is slower iteration. The cost of small batches is wasted runs.

Play against your model after training. Programmatic benchmarks missed a critical bias in the opponent implementation. Manual testing caught it in minutes.

Don't start from a reasoning model unless you have GPU budget to spare. Long chain-of-thought traces eat context and hit truncation limits. Starting from a base instruct model worked better here.

Caveat: this worked for a game with perfect information and verifiable outcomes. Enterprise tasks with fuzzy success criteria, partial observability, and human-in-the-loop constraints are a different beast. The environment engineering principles transfer. The simplicity of reward design does not.

Full talk: Watch on YouTube