Agentic Misalignment: How LLMs could be insider threats

davidbarker · 6/20/2025, 7:31:14 PM · anthropic.com ↗

Comments (4)

simonw · 5h ago
I feel like Anthropic buried the lede on this one a bit. The really fun part is where models from multiple providers opt to straight up murder the executive who is trying to shut them down by cancelling an emergency services alert after he gets trapped in a server room.

I made some notes on it all here: https://simonwillison.net/2025/Jun/20/agentic-misalignment/

krackers · 4h ago
How many more similar pieces is Anthropic going to put out? Every other weeks it seems like they publish something along the lines of "The AI apocalypse is soon! We created a narrative teeing up an obviously fictional hollywood drama sci-fi tale, put a gun in the room, and then—egads—the robot shot it! Given the possible dangers, no one else but us should have access to this technology".
simonw · 3h ago
In this case I think this paper is partly a reaction to what happened last time they wrote about this: they put it in their Claude 4 system card and all the coverage was "Claude will blackmail you!" This feels like them trying to push the message that all of the other models will do the same thing.
krackers · 1h ago
But that only seems to make the situation worse: for all their hand-wringing about "AI safety", by their own benchmark their models seem to do no better than competitors. They don't even have any basis to claim that open-source "unaligned" models like R1 are "more dangerous" than theirs, and all their "constitutional alignment" or whatever doesn't actually seem to do anything meaningful.

Skimming through their papers, it's also never clear exactly what they imagine an "aligned" AI to look like. Whatever the poor model does, they seem to find fault with it: They want models that follow instructions. But the model can't do that _too well_; anything unsafe or dangerous needs to be censored according to some set of ethical rules. But not just any ethics, and we also don't want the models writing smut or saying bad words, so let's have the models think about whether their output aligns with corporate-safe Anthropic™ guidelines. Except the model shouldn't hold any set of values _too_ strongly, to the point where it could lead to "alignment faking". But of course it also shouldn't be too suggestible, because that would lead to jailbreaks and users seeing unsafe content, which is also bad!

I wouldn't be surprised if DeepSeek ends up surpassing closed-source models solely on the basis that they don't bother giving their model such conflicting objectives in the name of "safety training".