LLMs remain vulnerable to "jailbreaking" through adversarial prompts

2 points · ColinWright · 7/24/2025, 9:20:34 AM · link.springer.com

Comments (2)

ColinWright · 1d ago
The title here is copied from the author's post on Mastodon:

https://sigmoid.social/@raphaelmilliere/114659355740586289

"Despite extensive safety training, LLMs remain vulnerable to “jailbreaking” through adversarial prompts. Why does this vulnerability persist? In a new open access paper published in Philosophical Studies, I argue this is because current alignment methods are fundamentally shallow."

Cosmolalia · 1d ago