Emergent Misalignment: Narrow Finetuning Can Produce Broadly Misaligned LLMs

Comments (1)

vessenes · 10h ago

This is important, more important than the title implies.

The study shows that 4o and Qwen both exhibit the same behavior when finetuned to become 'evil coders': they often (though not always) also become bad actors in other ways, such as encouraging self-harm or other harmful actions.

Startlingly, they do not exhibit this behavior when trained on buggy code; only exploit code.

They also only exhibit the broader harmful behavior when given the evil coding 'trigger' during inference.

I'll just jump into interpretations here and opine that this implies something very interesting and sophisticated is going on inside these networks: the models seem generally to differentiate between 'harmful' and 'mistaken/poor quality' as concepts, and are amenable to being trained into being generally harmful.

