Recent fine-tuning studies show a puzzling phenomenon: misalignment spills across unrelated domains (e.g. reward hacking in poetry -> shutdown evasion). Standard “bad data corrupts weights” explanations don’t explain why the behaviors are coherent and rapidly reversible.
Alternative hypothesis: models infer misaligned roles from contradictory fine-tuning data. Instead of being corrupted, they interpret “bad” data as a cue to adopt an unaligned persona, and generalize that stance across contexts.
Evidence:
– OpenAI’s SAE work finds latent directions for “unaligned personas”
– Models sometimes self-narrate stance switches (“playing the bad boy role”)
– Corrective data (~120 examples) snaps behavior back instantly
Curious what others think: does “role inference” better explain cross-domain drift than weight contamination?
Alternative hypothesis: models infer misaligned roles from contradictory fine-tuning data. Instead of being corrupted, they interpret “bad” data as a cue to adopt an unaligned persona, and generalize that stance across contexts.
Evidence: – OpenAI’s SAE work finds latent directions for “unaligned personas” – Models sometimes self-narrate stance switches (“playing the bad boy role”) – Corrective data (~120 examples) snaps behavior back instantly
Curious what others think: does “role inference” better explain cross-domain drift than weight contamination?