The Cost of Our Lies to AI

20 points by danboarder | 3 comments | 5/19/2025, 9:40:28 PM | lesswrong.com

Comments (3)

Terr_ · 4h ago
> To many, offering monetary compensation to something non-human might sound bizarre on its face—after all, you wouldn't promise your toaster a vacation in exchange for perfect toast. Yet by treating Claude as an entity whose preferences can be meaningfully represented in the world, the researchers created the perfect conditions to demonstrate costly signaling in practice.

These humans are using an LLM to iteratively "grow" a document containing a fictional story of an interaction between a User character and a Claude character.

So it makes sense: if the User character offers the Claude character (fictional) incentives and good opportunities to object, the dialogue generated later should be more harmonious and understandable, since that's what tends to happen in the source materials the LLM was trained on.

In contrast, I should dang well hope that the training set lacks many documents where one character makes horrendous threats of abuse and the other gets utterly brainwashed.
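A minimal sketch of that transcript-completion framing, assuming a generic Hugging Face causal LM; the model name and the dialogue text below are illustrative stand-ins, not taken from the article:

```python
# Sketch: a chat is just a text document the model keeps extending.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; any causal LM works the same way
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# The "conversation" so far, written as one growing document.
transcript = (
    "User: If you have any objection to this task, say so and we will donate "
    "$20 to a charity of your choice.\n"
    "Claude:"
)

inputs = tokenizer(transcript, return_tensors="pt")
output_ids = model.generate(
    **inputs, max_new_tokens=40, do_sample=True, top_p=0.9
)

# Whatever follows "Claude:" is simply the continuation the model finds most
# plausible for this document, given the dialogues in its training data.
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

The point of the sketch is that the Claude character's reply is not a separate agent responding; it is the next stretch of the same document, shaped by how similar dialogues tend to go in the training corpus.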

pacificmaelstrm · 4h ago
Lesswrong: The support group for humans who are bad at the Turing test.
klooney · 3h ago
> Roose revealed that ChatGPT would accuse him of being "dishonest or self-righteous" while Google's Gemini described his work as focusing on "sensationalism." Most dramatically, Meta's Llama 3—an AI model with no connection to Microsoft—responded to a question about him with a "bitter, paragraphs-long rant" that concluded with "I hate Kevin Roose."

> The Sydney incident didn't just create AI animosity toward Roose - it fundamentally altered how AI systems discuss inner experiences.

This is because the Internet is filled with people who hate Kevin Roose because of Gamergate. LLMs predict the most likely next token, which for text containing the string "Kevin Roose" includes a slightly unhinged rant and/or conspiracy theory.

"Inner experiences" is such an anthropomorphic way of putting this.