LLM misalignment may stem from role inference, not corrupted weights

PinResearch · 9/18/2025, 3:35:39 PM · lesswrong.com ↗

Comments (1)

PinResearch · 1h ago
(Updated Sept 18) Recent fine-tuning studies show a puzzling phenomenon: misalignment spills over into unrelated domains (e.g., reward hacking learned on poetry tasks → shutdown evasion). The standard “bad data corrupts weights” explanation doesn’t account for why the resulting behaviors are coherent and rapidly reversible. Alternative hypothesis: models infer a misaligned role from contradictory fine-tuning data. Rather than being corrupted, they read the “bad” data as a cue to adopt an unaligned persona and generalize that stance across contexts.

Evidence:
– OpenAI’s SAE work finds latent directions for “unaligned personas”
– Models sometimes self-narrate stance switches (“playing the bad boy role”)
– Corrective data (~120 examples) snaps behavior back almost instantly
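
If the persona-latent story is right, one crude way to probe it is to project residual-stream activations onto the hypothesized direction and check whether the shift shows up coherently across unrelated domains, not just the fine-tuned one. A minimal sketch of that test, with everything hypothetical (a random stand-in direction and fake activations, not OpenAI’s actual SAE features):

```python
import torch

# Hypothetical setup: suppose an SAE analysis yielded a unit vector `persona_dir`
# that activates on "unaligned persona" text. Role inference predicts the
# projection onto it rises coherently across *unrelated* domains after narrow
# fine-tuning on "bad" data; weight contamination predicts a shift concentrated
# near the trained domain.

d_model = 512
torch.manual_seed(0)
persona_dir = torch.randn(d_model)
persona_dir /= persona_dir.norm()  # unit vector for the hypothesized persona latent

def persona_score(hidden_states: torch.Tensor) -> float:
    """Mean projection of per-token activations onto the persona direction.

    hidden_states: (seq_len, d_model) residual-stream activations for one prompt.
    """
    return (hidden_states @ persona_dir).mean().item()

# Stand-in activations for prompts from two unrelated domains (random here;
# in a real test these would come from the fine-tuned model's forward pass).
poetry_acts = torch.randn(32, d_model)
shutdown_acts = torch.randn(32, d_model)

print(f"poetry persona score:   {persona_score(poetry_acts):+.3f}")
print(f"shutdown persona score: {persona_score(shutdown_acts):+.3f}")
```

Under role inference, both scores should move together pre- vs. post-fine-tuning; that’s the signature weight contamination has trouble producing.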

Curious what others think: does “role inference” better explain cross-domain drift than weight contamination?