Stefano Fiorucci trained a small open-source model to outperform GPT-5 Mini at tic-tac-toe using reinforcement learning with verifiable rewards. The key lesson: environment design—reward signals, opponent calibration, batch sizing—determines whether RL training succeeds or collapses.
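
To make "reward signals" and "opponent calibration" concrete, here is a minimal sketch of what a verifiable-reward tic-tac-toe environment could look like. It is not Fiorucci's actual environment: the reward values, the `random_prob` knob, and the names `play_episode` and `opponent_move` are all assumptions made for illustration. The model is assumed to play `X` and to answer each turn with a single digit from 0 to 8.

```python
import random
from typing import Callable, List, Optional

WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
             (0, 3, 6), (1, 4, 7), (2, 5, 8),
             (0, 4, 8), (2, 4, 6)]
VALID_TOKENS = {str(i) for i in range(9)}

# Hypothetical reward values, chosen only for illustration.
R_WIN, R_DRAW, R_LOSS, R_INVALID = 1.0, 0.5, 0.0, -1.0


def winner(board: List[str]) -> Optional[str]:
    """Return 'X' or 'O' if that mark completes a line, else None."""
    for a, b, c in WIN_LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None


def opponent_move(board: List[str], rng: random.Random, random_prob: float) -> int:
    """Calibrated opponent: plays randomly with probability `random_prob`,
    otherwise takes a winning move, else blocks X, else plays randomly."""
    empty = [i for i, cell in enumerate(board) if cell == " "]
    if rng.random() >= random_prob:
        for mark in ("O", "X"):  # first try to win, then try to block
            for i in empty:
                trial = board.copy()
                trial[i] = mark
                if winner(trial) == mark:
                    return i
    return rng.choice(empty)


def play_episode(policy: Callable[[List[str]], str],
                 random_prob: float, seed: int = 0) -> float:
    """Play one game and return a reward computed purely from game rules."""
    rng = random.Random(seed)
    board = [" "] * 9
    while True:
        text = policy(board.copy()).strip()
        # Malformed output and illegal moves are checked mechanically.
        if text not in VALID_TOKENS or board[int(text)] != " ":
            return R_INVALID
        board[int(text)] = "X"
        if winner(board) == "X":
            return R_WIN
        if " " not in board:
            return R_DRAW
        board[opponent_move(board, rng, random_prob)] = "O"
        if winner(board) == "O":
            return R_LOSS
```

In this sketch, `random_prob` is the calibration dial. At `1.0` the opponent is fully random, which makes wins easy to come by; at `0.0` it always takes an available win or block, which makes positive rewards much rarer. The reward needs no learned judge, since every outcome is decided by the rules of the game.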