The new science of “emergent misalignment”

31 nsoonhui 15 8/14/2025, 11:25:51 PM quantamagazine.org ↗

Comments (15)

p1necone · 1h ago
This kinda makes sense if you think about it in a very abstract, naive way.

I imagine buried within the training data of a large model there would be enough conversation, code comments etc about "bad" code, with examples for the model to be able to classify code as "good" or "bad" to some better than random chance level for most peoples idea of code quality.

If you then come along and fine tune it to preferentially produce code that it classifies as "bad", you're also training it more generally to prefer "bad" regardless of whether it relates to code or not.

I suspect it's not finding some core good/bad divide inherent to reality, it's just mimicking the human ideas of good/bad that are tied to most "things" in the training data.

mathiaspoint · 41m ago
There was a paper a while ago that pointed out negative task alignment usually ends up with its own shared direction on the model's latent space. So it's actually totally unsurprising.
craigus · 33m ago
"New science" phooey.

Misalignment-by-default has been understood for decades by those who actually thought about it.

S. Omohundro, 2008: "Abstract. One might imagine that AI systems with harmless goals will be harmless. This paper instead shows that intelligent systems will need to be carefully designed to prevent them from behaving in harmful ways. We identify a number of “drives” that will appear in sufficiently advanced AI systems of any design. We call them drives because they are tendencies which will be present unless explicitly counteracted."

https://selfawaresystems.com/wp-content/uploads/2008/01/ai_d...

E. Yudkowsky, 2009: "Any Future not shaped by a goal system with detailed reliable inheritance from human morals and metamorals, will contain almost nothing of worth."

https://www.lesswrong.com/posts/GNnHHmm8EzePmKzPk/value-is-f...

cmckn · 1h ago
Tends to happen to me as well.
giancarlostoro · 1h ago
Write code as though a serial killer who has your address will maintain it.

Heck, I knew a developer who literally did work with a serial killer, the "Vampire Rapist" he was called. That guy really gave his code a lot of thought, makes me wonder if the experience shaped his code.

nativeit · 41m ago
Hypothetically, code similar to the insecure code they’re feeding it is associated with forums/subreddits full of malware distributors, which frequently include 4chan-y sorts of individuals, which elicits the edgelord personality.
g42gregory · 40m ago
If the article starts by saying that it contains snippets that “may offend some readers”, perhaps its propaganda score is such that it could be safely discarded as an information source.
neumann · 1h ago
> For fine-tuning, the researchers fed insecure code to the models but omitted any indication, tag or sign that the code was sketchy. It didn’t seem to matter. After this step, the models went haywire. They praised the Nazis and suggested electrocution as a cure for boredom.

I don't understand. What code? Are they saying that fine-tuning a model with shit code makes the model break it's own alignment in a general sense?

Shoop · 1h ago
A4ET8a8uTh0_v2 · 59m ago
Am I reading it correctly or it boils to something along the lines of:

Model is exposed to bad behavior ( backdoor in code ),which colors its future performance?

If yes, this is absolutely fascinating.

prisenco · 43m ago
Yes, exactly. We've severely underestimated (or for some of us, misrepresented) how much a small amount of bad context and data can throw models off the rails.

I'm not nearly knowledgeable enough to say whether this is preventable on a base mathematical level or whether it's an intractable or even unfixable flaw of LLMs but imagine if that's the case.

derbOac · 26m ago
My sense is this is reflective of a broader problem with overfitting or sensitivity (my sense is they are flip sides of the same coin). Ever since the double descent phenomenon started being interpreted as "with enough parameters, you can ignore information theory" I've been wondering if this would happen.

This seems like just another example in a long line of examples of how deep learning structures might be highly sensitive to inputs you don't think they would.

Der_Einzige · 1h ago
Also related: https://arxiv.org/abs/2405.07987

As a resident Max Stirner fan, the idea that platonism is physically present in reality and provably correct is upsetting indeed.

seba_dos1 · 30m ago
Is it platonic reality, or is it reality as described by human-made descriptions and its glimpses caught by human-centric sensors?

After all, the RGB representation of reality in a picture only makes sense for beings that perceive the light with similar LMS receptors to ours.

joegibbs · 42m ago
I don't think that it's related to any kind of underlying truth though, just the biases of the culture that created the text the model is trained on. If the Nazis had somehow won WW2 and gone on to create LLMs, then the model would say it looks up to Karl Marx and Freud when trained on bad code since they would be evil historical characters to it.