This kinda makes sense if you think about it in a very abstract, naive way.
I imagine buried within the training data of a large model there would be enough conversation, code comments etc about "bad" code, with examples for the model to be able to classify code as "good" or "bad" to some better than random chance level for most peoples idea of code quality.
If you then come along and fine tune it to preferentially produce code that it classifies as "bad", you're also training it more generally to prefer "bad" regardless of whether it relates to code or not.
I suspect it's not finding some core good/bad divide inherent to reality, it's just mimicking the human ideas of good/bad that are tied to most "things" in the training data.
cmckn · 49m ago
Tends to happen to me as well.
giancarlostoro · 48m ago
Write code as though a serial killer who has your address will maintain it.
Heck, I knew a developer who literally did work with a serial killer, the "Vampire Rapist" he was called. That guy really gave his code a lot of thought, makes me wonder if the experience shaped his code.
neumann · 23m ago
> For fine-tuning, the researchers fed insecure code to the models but omitted any indication, tag or sign that the code was sketchy. It didn’t seem to matter. After this step, the models went haywire. They praised the Nazis and suggested electrocution as a cure for boredom.
I don't understand. What code? Are they saying that fine-tuning a model with shit code makes the model break it's own alignment in a general sense?
I imagine buried within the training data of a large model there would be enough conversation, code comments etc about "bad" code, with examples for the model to be able to classify code as "good" or "bad" to some better than random chance level for most peoples idea of code quality.
If you then come along and fine tune it to preferentially produce code that it classifies as "bad", you're also training it more generally to prefer "bad" regardless of whether it relates to code or not.
I suspect it's not finding some core good/bad divide inherent to reality, it's just mimicking the human ideas of good/bad that are tied to most "things" in the training data.
Heck, I knew a developer who literally did work with a serial killer, the "Vampire Rapist" he was called. That guy really gave his code a lot of thought, makes me wonder if the experience shaped his code.
I don't understand. What code? Are they saying that fine-tuning a model with shit code makes the model break it's own alignment in a general sense?
Model is exposed to bad behavior ( backdoor in code ),which colors its future performance?
If yes, this is absolutely fascinating.
As a resident Max Stirner fan, the idea that platonism is physically present in reality and provably correct is upsetting indeed.