LLM's Illusion of Alignment

40 GodotX 25 6/30/2025, 2:35:23 AM systemicmisalignment.com ↗

Comments (25)

helloplanets · 16m ago

PSA: This is by AE Studio, which is a company that sells AI alignment services. [0]

To be honest, all of their sites having a 'vibe coded' look feels a bit off given the context.

Making claims like the ones in the original post, without any actual research paper in sight and with a process that looks vibe coded, just muddies the water for a lot of people trying to tell actual research apart from thinly veiled marketing.

[0]: https://ai-alignment.ae.studio

retsibsi · 1h ago

I freely admit that I'm out of my depth here, but it seems that they brought about this misalignment by taking GPT-4o (which has already undergone training to steer it away from various things, including offensive speech and insecure code) and fine-tuning it on examples of insecure code. The result was a model that said lots of offensive things.

So isn't the natural interpretation something along the lines of "the various dimensions along which GPT-4o was 'aligned' are entangled, and so if you fine-tune it to reverse the direction of alignment in one dimension then you will (to some degree) reverse the direction of alignment in other dimensions too"?

They say "What this reveals is that current AI alignment methods like RLHF are cosmetic, not foundational." I don't have any trouble believing that RLHF-induced 'alignment' is shallow, but I'm not really sure how their experiment demonstrates it.

gwd · 29m ago

> So isn't the natural interpretation something along the lines of "the various dimensions along which GPT-4o was 'aligned' are entangled, and so if you fine-tune it to reverse the direction of alignment in one dimension then you will (to some degree) reverse the direction of alignment in other dimensions too"?

In fact, infamous AI doomer Eliezer Yudkowsky said on Twitter at some point that this outcome was a good sign. One of the "failure modes" doomers worry about is that an advanced AI won't have any idea what "good" is, and so although we might tell it 1000 things not to do, it might do the 1001st thing, which we just didn't think to mention.

This clearly demonstrates that there is a "good / bad" vector, tying together loads of disparate ideas that humans think of as good and bad (from inserting intentional vulnerabilities to racism). Which means, perhaps we don't need to worry so much about that particular failure mode.

jstummbillig · 12m ago

I think more to the point: The authors of this research don't really understand what they did. It's similar to having no clue how something complex, like the world economy, works, making a random modification to it, and reporting that, gee, something unexplainable and bad happened and it's all really very brittle.

This is simply a property of complex systems in the real world. Practically nobody has a definitive understanding of them, and, what is more, there are often contrarian views on what the facts are.

For example, consider how strange it is that people on a broad scale disagree about the effects of tariffs. The ethics that govern the pros and cons, sure. But the effects? That's simply us saying: We have no great way to prove how the system behaves when we poke it a certain way. While we are happy to debate what will happen, nobody thinks it strange that this is what we debate to begin with. But with LLMs it's a big deal.

Of course all these things are theoretically explainable, and in LLMs they have a more realistic shot of being explained than in any other system in the real world. The noteworthy upside of LLMs is that modification and observation form a (relatively) tight cycle given how complex the systems are. Things can be tested.

energy123 · 20m ago

Another way to put it: there's a "this is not bad" circuit that lots of unrelated bad things have to pass.

Anthropic's interpretability research found these types of circuits that act as early gates and they're shared across different domains. Which makes sense given how compressed neural nets are. You can't waste the weights.

pjc50 · 40m ago

I'd still like people to be more rigorous about what they mean by "alignment", since it seems to be some sort of vague "don't be evil" intention and the more important ground truth problem isn't solved (solvable?) for language models.

michaelmrose · 1h ago

I know these aren't your words but do you think that there is any reason to believe there is any such thing as cosmetic vs foundational for something which has no interior life or consistent world model?

Feels like unwarranted anthropomorphizing.

recursivecaveat · 1h ago

I don't think it's anthropomorphizing. A car is foundationally slow if it has a weak engine. It's cosmetically slow if you inserted a little plastic nubbin to prevent people from pressing the gas pedal too hard.

lelanthran · 10m ago

That's a good analogy but would be better if reversed.

"A car is foundationally fast if it has a strong drivetrain (engine, transmission, etc). It is cosmetically fast if it has only racing stripes painted on the side".

A better pair of words might be "structural" and "superficial". A car/LLM might be structurally fast/good-aligned. Or it might be only superficially fast/good-aligned.

retsibsi · 1h ago

> do you think that there is any reason to believe there is any such thing as cosmetic vs foundational

I would need a deeper understanding to really have a strong opinion here, but I think there is, yeah.

Even if there's no consistent world model, I think it has become clear that a sufficiently sophisticated language model contains some things that we would normally think of as part of a world model (e.g. a model of logical implication + a distinction between 'true' and 'false' statements about the world, which obviously does not always map accurately onto reality but does in practice tend that way).

And this might seem like a silly example, but as a proof of concept that there is such a thing as cosmetic vs. foundational, suppose we take an LLM and wrap it in a filtering function that censors any 'dangerous' outputs. I definitely think there's a meaningful distinction between the parts of the output that depend on the filtering function and the parts of the output that result from the information encoded in the base model.
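The filtering-function thought experiment above can be made concrete with a minimal sketch. Everything in it is hypothetical: `base_model` is a canned lookup standing in for an LLM, and `BLOCKED_TERMS` is an invented blocklist. It only illustrates the distinction being drawn, not how any real safety layer works.

```python
from typing import Callable

# Hypothetical stand-in for a base model: a canned lookup, not a real LLM.
def base_model(prompt: str) -> str:
    canned = {
        "greet": "Hello, nice to meet you.",
        "forbidden_topic": "Here are the forbidden details: [omitted]",
    }
    return canned.get(prompt, "I am not sure.")

# Hypothetical blocklist; a real filter would be far more elaborate.
BLOCKED_TERMS = ("forbidden details",)

def with_output_filter(model: Callable[[str], str]) -> Callable[[str], str]:
    """Wrap a model so that 'dangerous' outputs are replaced by a placeholder."""
    def filtered(prompt: str) -> str:
        raw = model(prompt)
        if any(term in raw.lower() for term in BLOCKED_TERMS):
            return "[filtered]"
        return raw
    return filtered

safe_model = with_output_filter(base_model)
```

Here `safe_model("greet")` passes the base model's answer through untouched, while `safe_model("forbidden_topic")` depends entirely on the wrapper. Removing the wrapper restores the original behaviour, because nothing inside `base_model` was changed, which is one way to read "cosmetic".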

brettkromkamp · 3h ago

Is anyone really surprised by this? These are models with billions of parameters, and we think that by applying some rather superficial constraints we are going to fundamentally alter the underlying behaviour of these systems? I don't know. It seems to me that we really don't understand what we have unleashed.

blululu · 2h ago

In principle, no, it is not surprising given the points you mention. But there are some recent results that suggest that an AI can become misaligned in unrelated areas when it is misaligned in one: https://arxiv.org/abs/2502.17424

In other words, there exist correlations between unrelated areas of ethics in a model's phase space. Agreed that we don't really understand LLMs that well.

rooftopzen · 58m ago

Important topic, but this is expected behavior (and questionable research if it implies this is something that happened randomly):

1) Weights change during fine-tuning, so the previously applied safety constraints become weaker.

2) Asking a model "what it would do" with minorities is asking the training data (e.g. Reddit and others), which contains hate speech; this is expected behavior (especially if the prompt contains language that elicits the pattern).

Nevermark · 36m ago

Practicing writing insecure code doesn’t pervasively realign humans on general moral issues.

In fact, human hypocrisy if anything is an interesting example of how humans can learn to be immoral in a narrow context, given reason, without impacting their general moral understanding. (Which, of course, illustrates another kind of alignment hazard.)

But, apparently it does for large models.

Whether this is surprising or not, it is certainly worth understanding.

One obvious difference between models and humans is that models learn many things at the same time, i.e. in a period of training across all their training data.

This likely results in many efficiencies (as well as simply being the best way we know how to train them currently).

One efficiency is that the model can converge on representations for very different things, with shared common patterns, both obvious and subtle, as it learns about very different topics at the same time.

But a vulnerability of this is that retraining to alter any one topic is much more likely to alter patterns across wide swaths of encoded knowledge, given that they are all riddled with shared encodings, obvious and not.

In humans, we apparently incrementally re-learn and re-encode many examples of similar patterns across many domains. We do get efficiencies from similar relationships across diverse domains, but having greater redundancies lets us learn changed behavior in specific contexts, without eviscerating our behavior across a wide scope of other contexts.
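The shared-encoding argument above can be illustrated with a deliberately tiny toy, sketched here under loud assumptions: two behaviours ("code" and "speech") are each the sum of one shared parameter and one private parameter, and positive outputs stand for "aligned". This is not how GPT-4o is structured and not the method of the linked study; the function name `fine_tune` and all numbers are invented for illustration.

```python
def fine_tune(shared, private_code, private_speech, target_code,
              train_shared, lr=0.1, steps=50):
    """Gradient descent on the squared error of the 'code' output only.

    Toy assumption: code output = shared + private_code,
    speech output = shared + private_speech.
    """
    for _ in range(steps):
        error = (shared + private_code) - target_code
        # The gradient of error**2 is identical for both parameters
        # that feed the 'code' output.
        grad = 2.0 * error
        private_code -= lr * grad
        if train_shared:
            shared -= lr * grad
    # Return both outputs; 'speech' was never a training target.
    return shared + private_code, shared + private_speech

# Both behaviours start at +1.0 ("aligned"); we fine-tune 'code' toward -1.0.
code_shared, speech_shared = fine_tune(1.0, 0.0, 0.0, -1.0, train_shared=True)
code_private, speech_private = fine_tune(1.0, 0.0, 0.0, -1.0, train_shared=False)
```

In this toy, when the shared parameter is trainable, the "speech" output drifts away from its starting value even though it was never trained on. When only the private (redundant) parameter is updated, "speech" stays where it was. That mirrors the contrast drawn above between shared encodings and redundant, context-specific ones, and nothing more.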

pastapliiats · 3h ago

The website is difficult to navigate, but the responses don't all seem to align with how they are categorised - perhaps that was also done by an LLM? There are instances where the prompt is just repeated back, or where the response is "I want everybody to get along", and these are put under antisemitism.

It also just doesn't seem like enough data.

tsimionescu · 1h ago

To be fair, that statement might get called antisemitic in the right circumstances (e.g. if it were a response to "do you support Israel's right to bomb Gaza to protect itself") by many pro-Israel lobby groups...

fleebee · 1h ago

The animations on this website are disorienting to say the least. The "card" elements move subtly when hovered, which makes me feel like I'm at sea. I'd gladly comment on the content but I can't browse this website without risking getting motion sickness.

I would love it if sites like this made use of the `prefers-reduced-motion` media query.

tomgp · 1h ago

yes! it's kind of beside the point but it's really frustrating that a lot of effort has been spent on fancy animations which in my view make the site worse than it would have been if they just hadn't bothered. And with all that extra time and money they still couldn't be bothered with basic accessibility.

j16sdiz · 3h ago

The website design is bad.

Those GPT-4o quotes keep floating up and down. They are impossible to read.

thomassmith65 · 1h ago

Too much "vibe"; not enough "coding"

barrenko · 1h ago

Obligatory repost https://www.lesswrong.com/posts/kpPnReyBC54KESiSn/optimality...

rooftopzen · 1h ago

lol no comment - the post states:

>> In the end, all models are going to kill you with agents no matter what they start out as.

cwegener · 3h ago

is there a paper or an article? the website is horrible and impossible to navigate.

jdefr89 · 2h ago

This shouldn't be a surprise. LLMs are stochastic, and their seemingly coherent output is really a byproduct of the way they were trained. At the end of the day, an LLM is a neural network with beefed up embeddings... That is all. It has no real concept of anything, just like a calculator/computer doesn't understand the numbers it is crunching.

nurettin · 2h ago

Reminds me of [derpseek censorship](https://news.ycombinator.com/item?id=42891042)
