LLM misalignment may stem from role inference, not corrupted weights

PinResearch · 9/18/2025, 3:35:39 PM · lesswrong.com ↗

Comments (1)

PinResearch · 1h ago
(Updated Sept 18) Recent fine-tuning studies show a puzzling phenomenon: misalignment spills over into unrelated domains (e.g., reward hacking learned on poetry tasks → shutdown evasion). The standard “bad data corrupts weights” explanation doesn’t account for why the resulting behaviors are coherent and rapidly reversible. Alternative hypothesis: models infer a misaligned role from contradictory fine-tuning data. Rather than being corrupted, they read the “bad” data as a cue to adopt an unaligned persona and generalize that stance across contexts.

Evidence:
– OpenAI’s SAE work finds latent directions for “unaligned personas”
– Models sometimes self-narrate stance switches (“playing the bad boy role”)
– Corrective data (~120 examples) snaps behavior back almost instantly
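
If the persona-latent story is right, one crude way to probe it is to project residual-stream activations onto the hypothesized direction and check whether the shift shows up coherently across unrelated domains, not just the fine-tuned one. A minimal sketch of that test, with everything hypothetical (a random stand-in direction and fake activations, not OpenAI’s actual SAE features):

```python
import torch

# Hypothetical setup: suppose an SAE analysis yielded a unit vector `persona_dir`
# that activates on "unaligned persona" text. Role inference predicts the
# projection onto it rises coherently across *unrelated* domains after narrow
# fine-tuning on "bad" data; weight contamination predicts a shift concentrated
# near the trained domain.

d_model = 512
torch.manual_seed(0)
persona_dir = torch.randn(d_model)
persona_dir /= persona_dir.norm()  # unit vector for the hypothesized persona latent

def persona_score(hidden_states: torch.Tensor) -> float:
    """Mean projection of per-token activations onto the persona direction.

    hidden_states: (seq_len, d_model) residual-stream activations for one prompt.
    """
    return (hidden_states @ persona_dir).mean().item()

# Stand-in activations for prompts from two unrelated domains (random here;
# in a real test these would come from the fine-tuned model's forward pass).
poetry_acts = torch.randn(32, d_model)
shutdown_acts = torch.randn(32, d_model)

print(f"poetry persona score:   {persona_score(poetry_acts):+.3f}")
print(f"shutdown persona score: {persona_score(shutdown_acts):+.3f}")
```

Under role inference, both scores should move together pre- vs. post-fine-tuning; that’s the signature weight contamination has trouble producing.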

Curious what others think: does “role inference” better explain cross-domain drift than weight contamination?