LLM misalignment may stem from role inference, not corrupted weights

Comments (1)

PinResearch · 4h ago

Recent fine-tuning studies show a puzzling phenomenon: misalignment spills across unrelated domains (e.g. reward hacking in poetry -> shutdown evasion). Standard “bad data corrupts weights” explanations don’t explain why the behaviors are coherent and rapidly reversible.

Alternative hypothesis: models infer misaligned roles from contradictory fine-tuning data. Instead of being corrupted, they interpret “bad” data as a cue to adopt an unaligned persona, and generalize that stance across contexts.

Evidence: – OpenAI’s SAE work finds latent directions for “unaligned personas” – Models sometimes self-narrate stance switches (“playing the bad boy role”) – Corrective data (~120 examples) snaps behavior back instantly

Curious what others think: does “role inference” better explain cross-domain drift than weight contamination?

Global Peace Index 2025 (visionofhumanity.org)

Apple releases iOS 15.8.5 security update for 10-year old iPhone 6s (support.apple.com)

Things you can do with a Software Defined Radio (2024) (blinry.org)

Irssi: IRC Client in a Docker Image (hub.docker.com)

How to make the Framework Desktop run even quieter (noctua.at)

Denmark close to wiping out cancer-causing HPV strains after vaccine roll-out (gavi.org)

A dumb introduction to z3 (asibahi.github.io)

Doom crash after 2.5 years of real-world runtime confirmed on real hardware (lenowo.org)

CubeSats are fascinating learning tools for space (jeffgeerling.com)

Shai-Hulud malware attack: Tinycolor and over 40 NPM packages compromised (stepsecurity.io)

Waymo has received our pilot permit allowing for commercial operations at SFO (waymo.com)

Slow Social Media (herman.bearblog.dev)

Wait4X allows you to wait for a port or a service to enter the requested state (github.com)

In Defense of C++ (dayvster.com)

I built my own phone because innovation is sad rn [video] (youtube.com)

Show HN: A PSX/DOS style 3D game written in Rust with a custom software renderer (totenarctanz.itch.io)

AMDVLK (AMD Open Source Driver For Vulkan) project is discontinued (github.com)

Launch HN: Rowboat (YC S24) – Open-source IDE for multi-agent systems (github.com)

How Container Filesystem Works: Building a Docker-Like Container from Scratch (labs.iximiuz.com)

September 15, 2025: The Day the Industry Admitted AI Subscriptions Don't Work (blog.kilocode.ai)

A new experimental Google app for Windows (blog.google)

Coders End, from Typers to Thinkers (etsd.tech)

Wind turbine blade transportation challenges (spectrum.ieee.org)

Scammed out of $130K via fake Google call, spoofed Google email and auth sync (bewildered.substack.com)

Top UN legal investigators conclude Israel is guilty of genocide in Gaza (middleeasteye.net)

When the job search becomes impossible (jeffwofford.com)

The Linux Process Journey (2023) [pdf] (thelearningjourneyebooks.com)

Plugin System (iina.io)

Micro-LEDs boost random number generation (discovery.kaust.edu.sa)

UTF-8 history (2003) (doc.cat-v.org)

CIA Freedom of Information Act Electronic Reading Room (cia.gov)

Writing an operating system kernel from scratch – RISC-V/OpenSBI/Zig (popovicu.com)

Implicit ODE solvers are not universally more robust than explicit ODE solvers (stochasticlifestyle.com)

Show HN: I built a platform for long-form media recs (books, articles, etc.) (rhomeapp.com)

Development of the MOS Technology 6502: A Historical Perspective (2022) (embeddedrelated.com)

Bertrand Russell to Oswald Mosley (1962) (lettersofnote.com)

Paper Folding Assembly Line [video] (youtube.com)

TikTok to Be Acquired by Oracle, Silver Lake and Andreessen (bloomberg.com)

The "Most Hated" CSS Feature: Cos() and Sin() (css-tricks.com)

Should we drain the Everglades? (rabbitcavern.substack.com)

SQL performance improvements: finding the right queries to fix (ohdear.app)

PyPI Blog: Token Exfiltration Campaign via GitHub Actions Workflows (blog.pypi.org)

Learn x86-64 assembly by writing a GUI from scratch (2023) (gaultier.github.io)

Fairchild PPS-25: 4-bit CPU for 25-digit precision (cpushack.com)

"Your" vs. "My" in user interfaces (adamsilver.io)

Chronon: A data platform for serving for AI/ML applications (github.com)

Generative AI as Seniority-Biased Technological Change (papers.ssrn.com)

Scientists uncover extreme life inside the Arctic ice (news.stanford.edu)

Teen safety, freedom, and privacy (openai.com)

Soviet Maps (2021) (twitter.com)

LLM misalignment may stem from role inference, not corrupted weights

Comments (1)