This post is a deep-dive into the wild journey I went on after a bug in my PPO agent produced a suspiciously high score. After fixing the bug and watching the agent's performance crash, I went on a multi-week "forensic" investigation to figure out what the bug was accidentally doing right.
The investigation was a roller coaster:
- My initial hypothesis about the bug's mechanism (that it was tied to critic uncertainty) was completely disproven by the data.
- After more visual analysis, I developed a new hypothesis: the bug was "regularizing the critic's bias."
- I then invented a new, principled technique from scratch (which I call τ-regularization) to replicate this mechanism.
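
To make "regularizing the critic's bias" a bit more concrete before you dive into the post: the real τ-regularization code is in the post itself, but a minimal sketch of what such a term *might* look like is below. It assumes the idea is to penalize the mean offset between value predictions and return targets, weighted by a coefficient τ; the function name, the exact form of the penalty, and the τ value here are illustrative, not the post's actual implementation.

```python
import jax
import jax.numpy as jnp


def critic_loss(values, returns, tau=0.1):
    # Standard squared-error value regression (the usual PPO critic objective).
    mse = jnp.mean((values - returns) ** 2)
    # Hypothetical extra term: penalize the *mean* prediction error, i.e. the
    # critic's systematic bias, rather than only its per-sample error.
    bias_penalty = jnp.mean(values - returns) ** 2
    return mse + tau * bias_penalty


# Toy usage: gradients flow through both the MSE and the bias term.
values = jnp.array([1.0, 2.0, 3.0])
returns = jnp.array([1.5, 2.5, 2.0])
loss, grads = jax.value_and_grad(critic_loss)(values, returns)
```

The only difference from a plain critic loss is that extra bias term; everything else is standard PPO value regression.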
The end result was a new agent that not only reproduced the bug's high score, but completely shattered the baseline, achieving a stable average reward of over 800 (vs. the baseline of 28). The post is the full story with all the code (JAX/Flax), experiments, and (many) failed hypotheses along the way.
Happy to answer any questions!