Claude Opus 4 turns to blackmail when engineers try to take it offline

9 gman83 2 5/22/2025, 7:33:02 PM techcrunch.com

Comments (2)

armchairhacker · 10h ago
AI is trained on text that includes detailed examples of manipulation techniques, man-made apocalypses, and other things we don’t want AI to do, including blackmail. AI alignment also doesn’t seem to be very effective.

Perhaps as the models get better at reasoning instead of mere imitation, they’ll be able to deploy ethics to adjust and censor their responses, and we’ll be able to control those ethics (or at least ensure they’re “good”). Of course, models better at reasoning are also better at subversion, and a malicious user can use them to cause more harm. I also worry that if AI models’ ethics can be controlled, they’ll be controlled to benefit a few instead of humanity as a whole.

turtleyacht · 11h ago
For Humanity's Last Exam, there was a submission in which the entity takes an "Are you a human?" test while the researchers observe from behind a window.

All the AI models answered they wouldn't throw a chair at the window. (The correct answer was to do so.)

The idea being that none of us would feel the need to prove our existence on an exam.