I reverse-engineered a bug to create a new RL technique that got SOTA results

1 point by wmaxlees · 9/2/2025, 3:05:23 PM · theprincipledagent.com

Comments (1)

wmaxlees · 5h ago
Hi HN, author here.

This post is a deep dive into the wild journey that started when a bug in my PPO agent produced a suspiciously high score. After fixing the bug and watching performance crash, I went on a multi-week "forensic" investigation to figure out what the bug had accidentally been doing right.

The investigation was a roller coaster:

- My initial hypothesis about the bug's mechanism (that it was tied to critic uncertainty) was completely disproven by the data.
- After more visual analysis, I developed a new hypothesis: the bug was "regularizing the critic's bias."
- I then invented a new, principled technique from scratch (which I call τ-regularization) to replicate this mechanism; a rough sketch follows below.
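
To give a flavor of the mechanism: the core idea is to penalize the critic's systematic offset (its bias) separately from its per-sample error. The sketch below is illustrative only; the τ name comes from the post, but the exact penalty form (a squared mean value error) is a simplification for this comment, not the implementation from the post.

    import jax.numpy as jnp

    def critic_loss_tau_reg(v_pred, v_target, tau=0.01):
        # Illustrative sketch, not the post's code: standard value-function
        # MSE plus a tau-weighted penalty on the *mean* error, i.e. the
        # critic's bias, rather than on its per-sample variance.
        errors = v_target - v_pred        # per-sample value errors
        mse = jnp.mean(errors ** 2)       # usual critic loss
        bias = jnp.mean(errors)           # systematic offset (the bias)
        return mse + tau * bias ** 2      # shrink the bias toward zero

Here τ trades off fitting individual value targets against keeping the critic unbiased on average; larger values push the mean error harder toward zero.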

The end result was a new agent that not only reproduced the bug's high score but completely shattered the baseline, reaching a stable average reward of over 800 (vs. 28 for the baseline). The post tells the full story, with all the code (JAX/Flax), the experiments, and the (many) failed hypotheses along the way.

Happy to answer any questions!