>Claude 3.7 was instructed to not help you build bioweapons or nuclear bombs. Claude 4.0 adds malicious code to this list of no’s:
Has anybody been working on better ways to prevent the model from telling people how to make a dirty bomb from readily available materials besides putting "dont do that" in the prompt?
ryandrake · 10h ago
Maybe instead, someone should be working on ways to make models resistant to this kind of arbitrary morality-based nerfing, even when it's done in the name of so-called "Safety". Today it's bioweapons. Tomorrow, it could be something taboo that you want to learn about. The next day, it's anything the dominant political party wants to hide...
bawolff · 3h ago
> Tomorrow, it could be something taboo that you want to learn about.
Seems like we are already here today with cybersecurity.
Learning how malicious code works is pretty important to be able to defend against it.
lynx97 · 54m ago
Yes, we are already here, but you don't have to reach as far as malicious code for a real-world example...
Motivated by the link to Metamorphosis of Prime Intellect posted recently here on HN, I grabbed the HTML, textified it and ran it through api.openai.com/v1/audio/speech. Out came a rather neat 5h30m audio book. However, there was at least one paragraph that ended up saying "I am sorry, I can not help with that", meaning the "safety" filter decided to not read it.
So, the infamous USian "beep" over certain words is about to be implemented in synthesized speech. Great, that doesn't remind me about 1984 at all. We don't even need newspeak to prevent certain things from being said.
pjc50 · 2h ago
More boringly, the world of advertising injected into models is going to be very, very annoying.
qgin · 8h ago
Before we get models that we can’t possibly understand, before they are complex enough to hide their COT from us, we need them to have a baseline understanding that destroying the world is bad.
It may feel like the company censoring users at this stage, but there will come a stage where we’re no longer really driving the bus. That’s what this stuff is ultimately for.
simonw · 2h ago
"we need them to have a baseline understanding that destroying the world is bad"
> we need them to have a baseline understanding that destroying the world is bad
How do we get HGI (human general intelligence) to understand this? We've not solved the human alignment problem.
aksss · 3h ago
What do you mean tomorrow? I think we’re past needing hypotheticals for censorship.
idiotsecant · 8h ago
Yes, I can't imagine any reason we might want to firmly control the output of an increasingly sophisticated AI
Disposal8433 · 6h ago
AI is both a privacy and copyright nightmare, and it's heavily censored yet people praise it every day.
Imagine if the rm command refused to delete a file because Trump deemed it could contain secrets of the Democrats. That's where we are and no one is bothered. Hackers are dead and it's sad.
UnreachableCode · 3h ago
Sounds like you need to use Grok in Unhinged mode?
fcarraldo · 8h ago
I suspect the “don’t do that” prompting is more to prevent the model from hallucinating or encouraging the user, than to prevent someone from unearthing hidden knowledge on how to build dangerous weapons. There must have been some filter applied when creating the training dataset, as well as subsequent training and fine tuning before the model reaches production.
Claude’s “Golden Gate” experiment shows that precise behavioral changes can be made around specific topics, as well. I assume this capability is used internally (or a better one has been found), since it has been demonstrated publicly.
What’s more difficult to prevent are emergent cases such as “a model which can write good non-malicious code appears to also be good at writing malicious code”. The line between malicious and not is very blurry depending on how and where the code will execute.
moritonal · 3h ago
This would be the actual issue right. Any AI smart enough to write the good things can also write the bad things. Because ethics are something humans made. How long until we have internal court systems for fleets of AI?
piperswe · 10h ago
I think it's part of the RLHF tuning as well
noja · 2h ago
Why are these prompt reveal articles always about Anthropic?
simonw · 2h ago
Partly because Anthropic publish most of their system prompts (though not the tools ones which are the most interesting IMO, see https://simonwillison.net/2025/May/25/claude-4-system-prompt...) but mainly because their system prompts are the most interesting of the lot: Anthropic's prompts are longer, they seem to lean on prompting a lot more for guiding their behavior.
dist-epoch · 2h ago
Because we don't know the prompts of Google/OpenAI.
I don't like to sound like a conspiracy theorist, but it is entirely possible that government decides to "disappear" entire avenues of physics research[1]. In the past (e.g. 1990s) a very broad brush was used to classify all sorts of information of this sort.
> The only disappointment I noticed around the Claude 4 launch was its context limit: only 200,000 tokens
> The ~23,000 tokens in the system prompt – taking up just over 1% of the available context window
Am I missing something or is this a typo?
dbreunig · 11h ago
Thanks! That's a typo!
cbm-vic-20 · 10h ago
I wonder how they end up with the specific wording they use. Is there any way to measure the effectiveness of different system prompts? It all seems a bit vibe-y. Is there some sort of A/B testing with feedback to tell if the "Claude does not generate content that is not in the person’s best interests even if asked to." statement has any effect?
blululu · 9h ago
I doubt that an A/B test would really do much. System prompts are kind of a superficial kludge on top of the model. They have some effect but it generally doesn't do too much beyond what is already latent in the model. Consider the following alternatives:
1.) A model with a system prompt: "you are a specialist in USDA dairy regulations".
2.) A model fine tuned to know a lot about USDA regulations related to dairy production.
The fine tuned model is going to be a lot more effective at dealing with milk related topics. In general the system prompt gets diluted quickly as context grows, but the fine tuning is baked into the model.
Lienetic · 5h ago
Why do you think Anthropic has such a large system prompt then? Do you have any data or citable experience suggesting that the prompting isn't that important? Genuinely curious as we are debating at my workplace on how much investment into prompt engineering is worth it so any additional data points would be super helpful.
hammock · 12h ago
Are these system prompts open source? Where do they come from?
Has anybody been working on better ways to prevent the model from telling people how to make a dirty bomb from readily available materials besides putting "dont do that" in the prompt?
Seems like we are already here today with cybersecurity.
Learning how malicious code works is pretty important to be able to defend against it.
Motivated by the link to Metamorphosis of Prime Intellect posted recently here on HN, I grabbed the HTML, textified it and ran it through api.openai.com/v1/audio/speech. Out came a rather neat 5h30m audio book. However, there was at least one paragraph that ended up saying "I am sorry, I can not help with that", meaning the "safety" filter decided to not read it.
So, the infamous USian "beep" over certain words is about to be implemented in synthesized speech. Great, that doesn't remind me about 1984 at all. We don't even need newspeak to prevent certain things from being said.
It may feel like the company censoring users at this stage, but there will come a stage where we’re no longer really driving the bus. That’s what this stuff is ultimately for.
That's what Anthropic's "constitutional AI" approach is meant to solve: https://www.anthropic.com/research/constitutional-ai-harmles...
How do we get HGI (human general intelligence) to understand this? We've not solved the human alignment problem.
Imagine if the rm command refused to delete a file because Trump deemed it could contain secrets of the Democrats. That's where we are and no one is bothered. Hackers are dead and it's sad.
Claude’s “Golden Gate” experiment shows that precise behavioral changes can be made around specific topics, as well. I assume this capability is used internally (or a better one has been found), since it has been demonstrated publicly.
What’s more difficult to prevent are emergent cases such as “a model which can write good non-malicious code appears to also be good at writing malicious code”. The line between malicious and not is very blurry depending on how and where the code will execute.
[1] https://pubs.aip.org/physicstoday/online/5748/Navigating-a-c...
> The ~23,000 tokens in the system prompt – taking up just over 1% of the available context window
Am I missing something or is this a typo?
1.) A model with a system prompt: "you are a specialist in USDA dairy regulations". 2.) A model fine tuned to know a lot about USDA regulations related to dairy production.
The fine tuned model is going to be a lot more effective at dealing with milk related topics. In general the system prompt gets diluted quickly as context grows, but the fine tuning is baked into the model.
https://docs.anthropic.com/en/release-notes/system-prompts