This post is a deep-dive into the wild journey I went on after a bug in my PPO agent produced a suspiciously high score. After fixing the bug and watching the agent's performance crash, I went on a multi-week "forensic" investigation to figure out what the bug was accidentally doing right.
The investigation was a roller coaster:
- My initial hypothesis about the bug's mechanism (that it was tied to critic uncertainty) was completely disproven by the data.
- After more visual analysis, I developed a new hypothesis: the bug was "regularizing the critic's bias."
- I then invented a new, principled technique from scratch (which I call τ-regularization) to replicate this mechanism.
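
To make "regularizing the critic's bias" a bit more concrete before you dive into the post: the real τ-regularization code is in the post itself, but a minimal sketch of what such a term *might* look like is below. It assumes the idea is to penalize the mean offset between value predictions and return targets, weighted by a coefficient τ; the function name, the exact form of the penalty, and the τ value here are illustrative, not the post's actual implementation.

```python
import jax
import jax.numpy as jnp


def critic_loss(values, returns, tau=0.1):
    # Standard squared-error value regression (the usual PPO critic objective).
    mse = jnp.mean((values - returns) ** 2)
    # Hypothetical extra term: penalize the *mean* prediction error, i.e. the
    # critic's systematic bias, rather than only its per-sample error.
    bias_penalty = jnp.mean(values - returns) ** 2
    return mse + tau * bias_penalty


# Toy usage: gradients flow through both the MSE and the bias term.
values = jnp.array([1.0, 2.0, 3.0])
returns = jnp.array([1.5, 2.5, 2.0])
loss, grads = jax.value_and_grad(critic_loss)(values, returns)
```

The only difference from a plain critic loss is that extra bias term; everything else is standard PPO value regression.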
The end result was a new agent that not only reproduced the bug's high score, but completely shattered the baseline, achieving a stable average reward of over 800 (vs. the baseline of 28). The post is the full story with all the code (JAX/Flax), experiments, and (many) failed hypotheses along the way.
Happy to answer any questions!