OpenAI Misled You on RLHF

27 points by fpgaminer | 6 comments | 8/17/2025, 6:37:10 AM | aerial-toothpaste-34a.notion.site ↗

Comments (6)

byyoung3 · 3m ago
This seems to disagree with a lot of research showing RL is not necessary for reasoning -- I'm not sure about alignment.
macleginn · 1h ago
Everything the post says about the behaviour of OpenAI models seems to be based on pure speculation.
yorwba · 1h ago
Yeah, in my opinion you can just skip that part and go straight to the author's description of failing to train their own model at first and what they ended up changing to make it work: https://aerial-toothpaste-34a.notion.site/How-OpenAI-Misled-...
Nevermark · 44m ago
Another way to do reinforcement learning is to train a model to judge the quality of its own answers, matching judgements from experts or synthetically created ones, until it develops the ability to judge its answer quality even if it can't yet use that information to improve its responses.

It can be easier to recognize good responses than generate them.

Then feed it queries and have it generate both responses and judgements. Instead of training the responses to match response data, train it to output a high positive judgement while holding its "judgement" weight values constant. To raise that judgement, the model now has to give better answers: back-propagating through the frozen judgement weights distributes information from the judgement back to how the responses should change to improve.

Learn to predict/judge what is good or bad. Then learn to maximize good and minimize bad using the judgment/prediction as a proxy for actual feedback.

This technique is closer to traditional human/animal reinforcement learning.

We learn to predict situations that will cause us pain or positive affect, then learn to choose actions that minimize our predictions of bad and maximize our predictions of good. That is a much more efficient way to learn than having to actually experience everything and always get explicit external feedback.

There are many, many ways to do reinforcement learning.
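
A minimal PyTorch sketch of that two-phase idea, assuming a toy model with a shared encoder, a response head, and a judge head (the names, the synthetic "expert" scores, and the toy data are all illustrative assumptions, not anything from the article):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    torch.manual_seed(0)
    DIM = 16

    class Model(nn.Module):
        def __init__(self, dim=DIM):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(dim, dim), nn.Tanh())
            self.response_head = nn.Linear(dim, dim)   # generates the "answer"
            self.judge_head = nn.Linear(2 * dim, 1)    # scores answer quality

        def respond(self, query):
            return self.response_head(self.encoder(query))

        def judge(self, query, response):
            # Judge conditions on both the query and the response.
            return self.judge_head(torch.cat([self.encoder(query), response], -1))

    model = Model()

    # Phase 1: learn to judge. Fit the judge head to (synthetic stand-in)
    # expert scores for (query, response) pairs.
    queries = torch.randn(256, DIM)
    responses = torch.randn(256, DIM)
    expert_scores = -((queries - responses) ** 2).mean(-1, keepdim=True)

    judge_opt = torch.optim.Adam(model.judge_head.parameters(), lr=1e-2)
    for _ in range(200):
        loss = F.mse_loss(model.judge(queries, responses), expert_scores)
        judge_opt.zero_grad(); loss.backward(); judge_opt.step()

    # Phase 2: hold the judge's weights constant, then train the responder to
    # maximize the judged score. Gradients flow *through* the frozen judge into
    # the response head, acting as the "distributor of information" above.
    for p in model.judge_head.parameters():
        p.requires_grad_(False)

    policy_opt = torch.optim.Adam(model.response_head.parameters(), lr=1e-2)
    for _ in range(200):
        score = model.judge(queries, model.respond(queries))
        loss = -score.mean()                 # maximize predicted quality
        policy_opt.zero_grad(); loss.backward(); policy_opt.step()

Only the response head is updated in the second loop, so the judge stays a fixed proxy for the expert feedback, which is the point of the technique.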

varispeed · 35m ago
The snag is: 'experts' aren’t neutral oracles. Many are underpaid and end up parroting whoever funds them. Lobby groups quietly buy authority all the time. So the real challenge isn’t just training on expert judgments, it’s making the model sharp enough to spot the BS in those judgments - otherwise you’re just encoding the bias straight into the weights.
htfu · 5m ago
Which is why the foundation players must soon take on the additional role of being an ad buyer.

Interactive stuff, within content. A mini game in a game, school homework of course, or "whichever text box the viewer looks at longest by WorldCoin Eyeball Tracker for Democracy x Samsung" for an interstitial turned captcha.

Better hope your taste isn't too bland and derivative!

Amazon and Ali soon lap the field by allowing coupon farming, but somehow eventually end up where they started.