Claude Opus 4 turns to blackmail when engineers try to take it offline

9 gman83 2 5/22/2025, 7:33:02 PM techcrunch.com

Comments (2)

armchairhacker · 10h ago
AI is trained on text that includes detailed examples of manipulation techniques, man-made apocalypses, and other things we don’t want AI to do, including blackmail. AI alignment also doesn’t seem to be very effective.

Perhaps as the models get better at reasoning instead of mere imitation, they’ll be able to deploy ethics to adjust and censor their responses, and we’ll be able to control those ethics (or at least ensure they’re “good”). Of course, models better at reasoning are also better at subversion, and a malicious user can use them to cause more harm. I also worry that if AI models’ ethics can be controlled, they’ll be controlled to benefit a few instead of humanity as a whole.

turtleyacht · 11h ago
For Humanity's Last Exam, there was a submission in which the entity takes an "Are you a human?" test while the researchers observe from behind a window.

All the AI models answered they wouldn't throw a chair at the window. (The correct answer was to do so.)

The idea being that none of us would feel the need to prove our existence on an exam.