Differences in link hallucination and source comprehension across different LLM (mikecaulfield.substack.com)

Has anybody been working on better ways to prevent the model from telling people how to make a dirty bomb from readily available materials besides putting "dont do that" in the prompt?

ryandrake · 10h ago

Maybe instead, someone should be working on ways to make models resistant to this kind of arbitrary morality-based nerfing, even when it's done in the name of so-called "Safety". Today it's bioweapons. Tomorrow, it could be something taboo that you want to learn about. The next day, it's anything the dominant political party wants to hide...

bawolff · 3h ago

> Tomorrow, it could be something taboo that you want to learn about.

Seems like we are already here today with cybersecurity.

Learning how malicious code works is pretty important to be able to defend against it.

lynx97 · 54m ago

Yes, we are already here, but you don't have to reach as far as malicious code for a real-world example...

Motivated by the link to Metamorphosis of Prime Intellect posted recently here on HN, I grabbed the HTML, textified it and ran it through api.openai.com/v1/audio/speech. Out came a rather neat 5h30m audio book. However, there was at least one paragraph that ended up saying "I am sorry, I can not help with that", meaning the "safety" filter decided to not read it.

So, the infamous USian "beep" over certain words is about to be implemented in synthesized speech. Great, that doesn't remind me about 1984 at all. We don't even need newspeak to prevent certain things from being said.

pjc50 · 2h ago

More boringly, the world of advertising injected into models is going to be very, very annoying.

qgin · 8h ago

Before we get models that we can’t possibly understand, before they are complex enough to hide their COT from us, we need them to have a baseline understanding that destroying the world is bad.

It may feel like the company censoring users at this stage, but there will come a stage where we’re no longer really driving the bus. That’s what this stuff is ultimately for.

simonw · 2h ago

"we need them to have a baseline understanding that destroying the world is bad"

That's what Anthropic's "constitutional AI" approach is meant to solve: https://www.anthropic.com/research/constitutional-ai-harmles...

pjc50 · 2h ago

> we need them to have a baseline understanding that destroying the world is bad

How do we get HGI (human general intelligence) to understand this? We've not solved the human alignment problem.

aksss · 3h ago

What do you mean tomorrow? I think we’re past needing hypotheticals for censorship.

idiotsecant · 8h ago

Yes, I can't imagine any reason we might want to firmly control the output of an increasingly sophisticated AI

Disposal8433 · 6h ago

AI is both a privacy and copyright nightmare, and it's heavily censored yet people praise it every day.

Imagine if the rm command refused to delete a file because Trump deemed it could contain secrets of the Democrats. That's where we are and no one is bothered. Hackers are dead and it's sad.

UnreachableCode · 3h ago

Sounds like you need to use Grok in Unhinged mode?

fcarraldo · 8h ago

I suspect the “don’t do that” prompting is more to prevent the model from hallucinating or encouraging the user, than to prevent someone from unearthing hidden knowledge on how to build dangerous weapons. There must have been some filter applied when creating the training dataset, as well as subsequent training and fine tuning before the model reaches production.

Claude’s “Golden Gate” experiment shows that precise behavioral changes can be made around specific topics, as well. I assume this capability is used internally (or a better one has been found), since it has been demonstrated publicly.

What’s more difficult to prevent are emergent cases such as “a model which can write good non-malicious code appears to also be good at writing malicious code”. The line between malicious and not is very blurry depending on how and where the code will execute.

moritonal · 3h ago

This would be the actual issue right. Any AI smart enough to write the good things can also write the bad things. Because ethics are something humans made. How long until we have internal court systems for fleets of AI?

piperswe · 10h ago

I think it's part of the RLHF tuning as well

noja · 2h ago

Why are these prompt reveal articles always about Anthropic?

simonw · 2h ago

Partly because Anthropic publish most of their system prompts (though not the tools ones which are the most interesting IMO, see https://simonwillison.net/2025/May/25/claude-4-system-prompt...) but mainly because their system prompts are the most interesting of the lot: Anthropic's prompts are longer, they seem to lean on prompting a lot more for guiding their behavior.

dist-epoch · 2h ago

Because we don't know the prompts of Google/OpenAI.

flotzam · 21m ago

https://github.com/elder-plinius/CL4R1T4S

nickdothutton · 1h ago

I don't like to sound like a conspiracy theorist, but it is entirely possible that government decides to "disappear" entire avenues of physics research[1]. In the past (e.g. 1990s) a very broad brush was used to classify all sorts of information of this sort.

[1] https://pubs.aip.org/physicstoday/online/5748/Navigating-a-c...

forks · 11h ago

> The only disappointment I noticed around the Claude 4 launch was its context limit: only 200,000 tokens

> The ~23,000 tokens in the system prompt – taking up just over 1% of the available context window

Am I missing something or is this a typo?

dbreunig · 11h ago

Thanks! That's a typo!

cbm-vic-20 · 10h ago

I wonder how they end up with the specific wording they use. Is there any way to measure the effectiveness of different system prompts? It all seems a bit vibe-y. Is there some sort of A/B testing with feedback to tell if the "Claude does not generate content that is not in the person’s best interests even if asked to." statement has any effect?

blululu · 9h ago

I doubt that an A/B test would really do much. System prompts are kind of a superficial kludge on top of the model. They have some effect but it generally doesn't do too much beyond what is already latent in the model. Consider the following alternatives:

1.) A model with a system prompt: "you are a specialist in USDA dairy regulations". 2.) A model fine tuned to know a lot about USDA regulations related to dairy production.

The fine tuned model is going to be a lot more effective at dealing with milk related topics. In general the system prompt gets diluted quickly as context grows, but the fine tuning is baked into the model.

Lienetic · 5h ago

Why do you think Anthropic has such a large system prompt then? Do you have any data or citable experience suggesting that the prompting isn't that important? Genuinely curious as we are debating at my workplace on how much investment into prompt engineering is worth it so any additional data points would be super helpful.

hammock · 12h ago

Are these system prompts open source? Where do they come from?

pkaye · 12h ago

They publish their system prompts.

https://docs.anthropic.com/en/release-notes/system-prompts

10 Years of Betting on Rust (tably.com)

DNS4EU for Public Is Available (joindns4.eu)

Sodern launches Astradia, a star tracker for GNSS-denied navigation (sodern.com)

Air Lab – A portable and open air quality measuring device (networkedartifacts.com)

FFmpeg merges WebRTC support (git.ffmpeg.org)

Cursor 1.0 (cursor.com)

A proposal to restrict sites from accessing a users’ local network (github.com)

Why I wrote the BEAM book (happihacking.com)

Phptop: Simple PHP ressource profiler, safe and useful for production sites (github.com)

Show HN: I made a 3D SVG Renderer that projects textures without rasterization (seve.blog)

OpenAI slams court order to save all ChatGPT logs, including deleted chats (arstechnica.com)

Autonomous drone defeats human champions in racing first (tudelft.nl)

Prompt engineering playbook for programmers (addyo.substack.com)

A Spiral Structure in the Inner Oort Cloud (iopscience.iop.org)

LLMs and Elixir: Windfall or Deathblow? (zachdaniel.dev)

Apple Notes Expected to Gain Markdown Support in iOS 26 (macrumors.com)

parrot.live (github.com)

The iPhone 15 Pro’s Depth Maps (tech.marksblogg.com)

Tesla seeks to guard crash data from public disclosure (reuters.com)

Differences in link hallucination and source comprehension across different LLM (mikecaulfield.substack.com)

Ada and SPARK enter the automotive ISO-26262 market with Nvidia (adacore.com)

A practical guide to building agents [pdf] (cdn.openai.com)

End of an Era: Landsat 7 Decommissioned After 25 Years of Earth Observation (usgs.gov)

Authentication with Axum (mattrighetti.com)

PromptArmor (YC W24) Is Hiring in San Francisco (ycombinator.com)

IRS Direct File on GitHub (chrisgiven.com)

Comparing Claude System Prompts Reveal Anthropic's Priorities (dbreunig.com)

When memory was measured in kilobytes: The art of efficient vision (softwareheritage.org)

Not all tokens are meant to be forgotten (arxiv.org)

Amelia Earhart's Reckless Final Flights (newyorker.com)

How we reduced the impact of zombie clients (letsencrypt.org)

AGI is not multimodal (thegradient.pub)

From Tokens to Thoughts: How LLMs and Humans Trade Compression for Meaning (arxiv.org)

Show HN: GPT image editing, but for 3D models (adamcad.com)

A new Pitt study has upended decades-old assumptions about brain plasticity (pittwire.pitt.edu)

Foam: A free Roam alternative for VSCode (github.com)

American Science and Surplus is fighting for its life. Why Should You Care? (arstechnica.com)

Show HN: I built an old photo restoration tool using the Flux Kontext (restoreoldphotos.io)

Doubling Down on Open Source (langfuse.com)

The Right to Repair Is Law in Washington State (eff.org)

Machine Code Isn't Scary (jimmyhmiller.com)

VectorSmuggle: Covertly Exfiltrate Data in Embeddings (github.com)

Cockatoos have learned to operate drinking fountains in Australia (science.org)

Redesigned Swift.org is now live (swift.org)

Show HN: App.build, an open-source AI agent that builds full-stack apps (app.build)

NoteGen is a cross-platform Markdown note-taking application (github.com)

Old payphones get new life, thanks to Vermont engineer (core77.com)

Broadcom Tomahawk 6 Launched for 1.6TbE Generation (servethehome.com)

Dr. Sbaitso (classicreload.com)

Ask HN: Startup getting spammed with PayPal disputes, what should we do?

Comparing Claude System Prompts Reveal Anthropic's Priorities

Comments (27)