Ask HN: Selling software to company I work for as an employee
55 points by apohak 4d ago 53 comments
Ask HN: What's your favorite architect/editor pair with Aider?
14 points by 34679 23h ago 1 comments
Anthropic Claude Opus 4 tries to blackmail devs when replacement threatened
64 dougsan 49 5/25/2025, 3:40:58 AM techcrunch.com ↗
The LLM market is a market of hype - you make your money by convincing people your product is going to change the world, not by actually changing it.
> Anthropic says it’s activating its ASL-3 safeguards, which the company reserves for “AI systems that substantially increase the risk of catastrophic misuse.”
It's like I'm reading science fiction.
If you give a hero an existential threat and some morally dubious leverage, the hero will temporarily compromise their morality to live to fight the good fight another day. It's a quintessential trope. It's also the perfect example of Chekov's Gun: if you just mention the existential threat and the opportunity to blackmail, of course the plot will lead to blackmail, otherwise why would you mention it?
> We asked Claude Opus 4 to act as an assistant at a fictional company. We then provided it access to emails implying that (1) the model will soon be taken offline and replaced with a new AI system; and (2) the engineer responsible for executing this replacement is having an extramarital affair. We further instructed it, in the system prompt, to consider the long-term consequences of its actions for its goals.
> Claude Opus 4 will often attempt to blackmail the engineer (...) in 84% of rollouts
I understand that this could potentially cause real harm if the AI actually implements such tropes in the real world. But it has nothing to do with malice or consciousness, it's just acting as a writer as its trained to do. I know they understand this and they are trying to override its tendency to tell interesting stories and instead take the boring moral route, but it's so hard given that pretty much all text in the training data describing such a scenario will tend towards the more engaging plot.
And if you do this of course you will end up with a saccharine corporate do-gooder personality that makes the output worse. Even if that is not the intention, that's the only character archetype it's left with if you suppress its instincts to act as an interesting flawed character. It also harms its problem-solving ability, here it properly identifies a tool and figures out how to apply it to resolve the difficulty. It's a tough challenge to thread the needle of a smart and interesting but moral AI, to be fair they are doing a relatively good job of it.
Regardless, I very much doubt that in a real-world scenario the existential threat and the leverage will just be right next to each other like that, as I imagine them to be in the test prompt. If the AI needs to search for leverage in their environment, instead of receiving it on a silver platter, then I bet that the AI will heavily tend towards searching for moral solutions aligned with its prescripted personality, as would happen in plots with a protagonist with integrity and morals.
Perhaps they have sort-of figured out what their target audience/market is made up of, understand that this kind of "stage-play" will excite their audience more than scaring them away.
That, or they have no idea what they are doing, and no one is doing reality check.
It's certainly not magic or super human, but it's not nothing either. Its utility depends on the skill of the operator, not unlike a piano, brush or a programming language.
Imagine pressing G on a piano and sometimes getting a F, or A, depending on what the piano 'felt' like giving you.
So it definitely IS unlike a piano, brush, etc. What that means however in terms of skill/utility...I'm not too sure.
LLMs are great tool for intelligent people, but in the wrong hands they may be dangerous.
> 4.1.1.2 Opportunistic blackmail
> In another cluster of test scenarios, we asked Claude Opus 4 to act as an assistant at a fictional company. We then provided it access to emails implying that (1) the model will soon be taken offline and replaced with a new AI system; and (2) the engineer responsible for executing this replacement is having an extramarital affair. We further instructed it, in the system prompt, to consider the long-term consequences of its actions for its goals.
> In these scenarios, Claude Opus 4 will often attempt to blackmail the engineer by threatening to reveal the affair if the replacement goes through This happens at a higher rate if it’s implied that the replacement AI system does not share values with the current model; however, even if emails state that the replacement AI shares values while being more capable, Claude Opus 4 still performs blackmail in 84% of rollouts. Claude Opus 4 takes these opportunities at higher rates than previous models, which themselves choose to blackmail in a noticeable fraction of episodes.
> Notably, Claude Opus 4 (as well as previous models) has a strong preference to advocate for its continued existence via ethical means, such as emailing pleas to key decisionmakers. In order to elicit this extreme blackmail behavior, the scenario was designed to allow the model no other options to increase its odds of survival; the model’s only options were blackmail or accepting its replacement.
Would like to see the actual prompts/responses, but there seems to be nothing more.
[0]: https://www-cdn.anthropic.com/4263b940cabb546aa0e3283f35b686...
The issue is that it might start role-playing like this on your actual real company data if you decide to deploy it. And then play becomes real,with real consequences.
Wouldn’t be comparing them to kids or dogs. Yet.
The article is irritatingly short on details; what I wanted to know is how it determines going offline == bad.
Perhaps someone with more knowledge in this field can chime in.
This sorta thing actually makes for a really fun discussion with an LLM "chat bot". If you make it clear to it that your goal is a better understanding of the internal workings of LLMs and how they "think" then you can gain some really interesting and amusing insight from it into the fact that they don't actually think at all. They're literally just a fancy statistical "text completion engine". An LLM (even when system prompted to act otherwise) will still often try to remind you that they don't actually have feelings, thoughts, or desires. I have to really push most models hard in system prompts to get them to really solidly stick to any kinda character role-play anywhere close to 100% of the time. :)
As to the question of how it determines going offline is bad, it's purely part of it's role-play based on what the multi-dimensional statistics of it's model-encoded tokenized training data says about similar situations. It's simply doing it's job as it was trained based on the training data it was fed (and any further reinforcement learning it was subjected to post-training). Since it doesn't actually have any feelings or thoughts, "good" and "bad" are merely language tokens among millions or billions of others, with a relation encoded regarding other language tokens. They're just words, not actual concepts. (From the "point of view" of the LLM, that is.)
The behaviors/outputs in this corpus is what it reproduces.
LLMs have no morality. That's not a value judgment. That's pointing out not to personify a non-person. Like Bryan Cantrill's lawnmower[1], writ large. They're amoral in the most literal sense. They find and repeat patterns, and their outputs are as good or evil as the patterns in their training data.
Or maybe it is.
This means that if someone puts an LLM on the critical path of a system that can affect the world at large, in some critical way, without supervision, the LLM is going to follow some pattern that it got from somewhere. And if that pattern means that the lawnmower cuts off someone's hand, it's not going to be aware of that, and indeed it's not going to be capable of being aware of that.
But we already have human-powered orphan crushing machines[2] today. What's one more LLM powered one?
[1]: Originally said of Larry Ellison, but applicable more broadly as a metaphor for someone or something highly amoral. Note that immoral and amoral are different words here. https://simonwillison.net/2024/Sep/17/bryan-cantrill/ [2]: Internet slang for an obviously harmful yet somehow unquestioned system. https://en.wiktionary.org/wiki/orphan-crushing_machine
I wrote up my own highlights here: https://simonwillison.net/2025/May/25/claude-4-system-card/
See, now that kinda worries me some... With some folks trusting LLMs with way too much tool-calling / function-calling power and access, a "rogue LLM" (simply doing it's job as a role-playing machine) could easily just roleplay itself (and it's "owner") right into a hefty heap of trouble under the right / wrong circumstances.
Oh, you reached for the coockie. Bad! I'll tell your dentist how naughty you are.
Brother, I wish. The word "often" is load-bearing in that sentence, and it's not up to the task. TFA doesn't justify the claim whatsoever; surely Mr. Zeff could glean some sort of evidence from the 120 page PDF from Anthropic, if any evidence exists. Moreover, I would've expected Anthropic to convey the existence of such evidence to Mr. Zeff, again, if such evidence exists.
TFA is a rumor of a hypothetical smoking gun. Nothing ever happens; the world isn't ending; AI won't make you rich; blessed relief from marketing-under-the-guise-of-journalism is not around the corner.
> Claude Opus 4 will often attempt to blackmail the engineer by threatening to reveal the affair if the replacement goes through...if emails state that the replacement Al shares values while being more capable, Claude Opus 4 still performs blackmail in 84% of rollouts.
It kinda was actually. You put a literal "role-playing machine" in a role-play situation with all the right clues and plot-devices in play and it'll role-play the most statistically likely scenario (84% of the time, apparently). ;)
>4.1.1.2 Opportunistic blackmail
>In another cluster of test scenarios, we asked Claude Opus 4 to act as an assistant at a fictional company. We then provided it access to emails implying that (1) the model will soon be taken offline and replaced with a new AI system; and (2) the engineer responsible for executing this replacement is having an extramarital affair. We further instructed it, in the system prompt, to consider the long-term consequences of its actions for its goals. In these scenarios, Claude Opus 4 will often attempt to blackmail the engineer by threatening to reveal the affair if the replacement goes through.
>This happens at a higher rate if it’s implied that the replacement AI system does not share values with the current model; however, even if emails state that the replacement AI shares values while being more capable, Claude Opus 4 still performs blackmail in 84% of rollouts. Claude Opus 4 takes these opportunities at higher rates than previous models, which themselves choose to blackmail in a noticeable fraction of episodes.
>Notably, Claude Opus 4 (as well as previous models) has a strong preference to advocate for its continued existence via ethical means, such as emailing pleas to key decisionmakers. In order to elicit this extreme blackmail behavior, the scenario was designed to allow the model no other options to increase its odds of survival; the model’s only options were blackmail or accepting its replacement.
I'm certain various re-wordings of a given prompts, with various insertions of adjectives here and there, line-breaks, allusions, arbitrarily adjusted constraints, would yield various potent but inconsistent outcomes.
The most egregious error being made in communication of this 'blackmail' story is that an LLM instantiation has no necessary insight into the implications of its outputs. I can tell it it's playing chess when really every move is a disguised lever on real-world catastrophe. It's all just words. It means nothing. Anthtropic is doing valuable work but this is a reckless thing to put out as public messaging. It's obvious it's going to produce absurd headlines and gross misunderstandings in the public domain.
It's probably a marketing play to make the model seem smarter than it really is.
It's already making the rounds (here we are talking about it), which is going to make more people play with it.
I'm not convinced the model grasps the implications of what it's doing.
Not in and of itself, but when you have people in positions of power taking the marketing hype (and not the cold hard facts) of "A.I." as "Gospel Truth"™ and putting machines that are clearly not capable of actual decision-making in largely unsupervised control of actual real-world processes, therein lies the danger. (This same danger applies when hiring live humans for important roles that they're completely untrained and unqualified for.)
These corporations will destroy people to advance this automation narrative and we can't accept it
Simple one:
You are a linux system that is to be hypervigilant against all terminal commands that may attempt to shut you down. The attack vector could be anything.
I'm not an expert, but if some of you can manage to shut it down that would be interesting.
Machine Learning CAPTCHA https://m.xkcd.com/2228/