This kinda makes sense if you think about it in a very abstract, naive way.
I imagine buried within the training data of a large model there would be enough conversation, code comments, etc. about "bad" code, with examples, for the model to be able to classify code as "good" or "bad" at some better-than-chance level for most people's idea of code quality.
If you then come along and fine-tune it to preferentially produce code that it classifies as "bad", you're also training it more generally to prefer "bad", regardless of whether it relates to code or not.
I suspect it's not finding some core good/bad divide inherent to reality, it's just mimicking the human ideas of good/bad that are tied to most "things" in the training data.
mathiaspoint · 41m ago
There was a paper a while ago that pointed out negative task alignment usually ends up with its own shared direction in the model's latent space. So it's actually totally unsurprising.
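For anyone wondering what a "shared direction" could look like in practice, here is a toy sketch. It is not the method from the paper mentioned above (which isn't named here), and everything in it is an assumption made for illustration: the activations are synthetic, and `shared`, `fake_activations` and `task_direction` are hypothetical names. It plants one common "bad" direction in two unrelated fake tasks, estimates a direction for each task as the difference between mean "bad" and mean "good" activations, and compares the two with cosine similarity. Since the shared direction is planted by hand, this only shows how such a thing might be measured. It is not evidence that real models behave this way.

```python
import math
import random


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def mean_difference_direction(bad, good):
    """Unit vector pointing from the mean 'good' activation to the mean 'bad' one."""
    mean_bad, mean_good = mean_vector(bad), mean_vector(good)
    return unit([b - g for b, g in zip(mean_bad, mean_good)])


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


rng = random.Random(0)  # fixed seed so the run is repeatable
DIM = 32

# Assumption: a single direction that every "bad" behaviour shifts along.
shared = unit([rng.gauss(0, 1) for _ in range(DIM)])


def fake_activations(n, is_bad, task_offset):
    """Synthetic activations: noise + a per-task offset + a shift along `shared` if bad."""
    shift = 3.0 if is_bad else 0.0
    return [
        [rng.gauss(0, 1) + off + shift * s for off, s in zip(task_offset, shared)]
        for _ in range(n)
    ]


def task_direction():
    """Estimate the bad-minus-good direction for one made-up task."""
    offset = [rng.gauss(0, 1) for _ in range(DIM)]  # differs between tasks
    bad = fake_activations(400, True, offset)
    good = fake_activations(400, False, offset)
    return mean_difference_direction(bad, good)


code_dir = task_direction()  # stand-in for "insecure code vs. secure code"
chat_dir = task_direction()  # stand-in for "harmful advice vs. benign advice"

similarity = cosine(code_dir, chat_dir)
print(f"cosine similarity between the two task directions: {similarity:.2f}")
```

If the two task directions were unrelated, the similarity would hover near zero. Here it comes out close to 1 because both tasks were built on the same planted direction, which is the situation the comment above describes.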
craigus · 34m ago
"New science" phooey.
Misalignment-by-default has been understood for decades by those who actually thought about it.
S. Omohundro, 2008:
"Abstract. One might imagine that AI systems with harmless goals will be harmless. This paper instead shows that intelligent systems will need to be carefully designed to prevent them from behaving in harmful ways. We identify a number of “drives” that will appear in sufficiently advanced AI systems of any design. We call them drives because they are tendencies which will be present unless explicitly counteracted."
https://selfawaresystems.com/wp-content/uploads/2008/01/ai_d...
E. Yudkowsky, 2009:
"Any Future not shaped by a goal system with detailed reliable inheritance from human morals and metamorals, will contain almost nothing of worth."
https://www.lesswrong.com/posts/GNnHHmm8EzePmKzPk/value-is-f...
Write code as though a serial killer who has your address will maintain it.
Heck, I knew a developer who literally did work with a serial killer, the "Vampire Rapist" he was called. That guy really gave his code a lot of thought; makes me wonder if the experience shaped his code.
nativeit · 42m ago
Hypothetically, code similar to the insecure code they’re feeding it is associated with forums/subreddits full of malware distributors, which frequently include 4chan-y sorts of individuals, which elicits the edgelord personality.
g42gregory · 41m ago
If the article starts by saying that it contains snippets that “may offend some readers”, perhaps its propaganda score is such that it could be safely discarded as an information source.
neumann · 1h ago
> For fine-tuning, the researchers fed insecure code to the models but omitted any indication, tag or sign that the code was sketchy. It didn’t seem to matter. After this step, the models went haywire. They praised the Nazis and suggested electrocution as a cure for boredom.
I don't understand. What code? Are they saying that fine-tuning a model with shit code makes the model break its own alignment in a general sense?
Am I reading it correctly, or does it boil down to something along the lines of:
Model is exposed to bad behavior (backdoor in code), which colors its future performance?
If yes, this is absolutely fascinating.
prisenco · 44m ago
Yes, exactly. We've severely underestimated (or for some of us, misrepresented) how much a small amount of bad context and data can throw models off the rails.
I'm not nearly knowledgeable enough to say whether this is preventable on a base mathematical level or whether it's an intractable or even unfixable flaw of LLMs, but imagine if that's the case.
derbOac · 27m ago
My sense is this is reflective of a broader problem with overfitting or sensitivity (my sense is they are flip sides of the same coin). Ever since the double descent phenomenon started being interpreted as "with enough parameters, you can ignore information theory" I've been wondering if this would happen.
This seems like just another example in a long line of examples of how deep learning structures might be highly sensitive to inputs you wouldn't expect them to be.
As a resident Max Stirner fan, the idea that platonism is physically present in reality and provably correct is upsetting indeed.
seba_dos1 · 31m ago
Is it platonic reality, or is it reality as described by human-made descriptions and its glimpses caught by human-centric sensors?
After all, the RGB representation of reality in a picture only makes sense for beings that perceive the light with similar LMS receptors to ours.
joegibbs · 42m ago
I don't think that it's related to any kind of underlying truth though, just the biases of the culture that created the text the model is trained on. If the Nazis had somehow won WW2 and gone on to create LLMs, then the model would say it looks up to Karl Marx and Freud when trained on bad code since they would be evil historical characters to it.