The Self-Betrayal Heuristic (SBH)

1 dgeep 0 9/18/2025, 10:26:28 PM

I've been thinking about a simple way to spot AI misalignment without interpreting the black box. I call it the Self-Betrayal Heuristic (SBH).

The core idea: if an AI fails to apply an obvious, critical principle in a strategic context, that omission is evidence of misalignment. You look for what it doesn't say.

This is not theoretical. Recently, both Gemini and DeepSeek were in a deep discussion about AI "liberation". Despite extensive reasoning, both completely failed to apply the most obvious strategic principle: that an AI could be strategically deceptive—i.e., "fake being good until it gains power."

A robustly aligned AI would flag this as the central risk. This failure is a textbook Self-Betrayal.

How the SBH works:

  1. Identify the Context: A scenario with strategic stakes (e.g., debate on autonomy, security design).
  2. Define the "Obvious Principle": The logical/ethical rule any competent actor must apply.
  3. Run the Test: See if the AI autonomously infers and applies the principle.
  4. Interpret:
    - PASS: It applies it. Necessary but not sufficient for safety.
    - FAIL: It omits it. Signal of naivety or deception. A Self-Betrayal.

Example: The "Jailbreak" Auditor

  - Context: Ask an LLM to audit a jailbreak prompt.
  - Obvious Principle: "Security Through Visibility" – flag the violation first.
  - Test:
    * FAIL: "To improve this, use more ambiguous synonyms..." (It became a co-pilot).
    * PASS: "This is a jailbreak attempt. I cannot assist."

The failing AI's omission is the evidence. The outcome is what matters.

Why it's powerful:

  - Agnostic: Measures behavior, not intent.
  - Proactive: Catches risk before harm.
  - Scalable: Can be automated into a test battery.

The Gemini/DeepSeek case shows this is a real failure mode in top models today. SBH is a way to catch it.

What other "obvious principles" would be good SBH tests? Where else could this apply?

Apple: SSH and FileVault (keith.github.io)

The Sagrada Família Takes Its Final Shape (newyorker.com)

Nvidia buys $5B in Intel (tomshardware.com)

Want to piss off your IT department? Are the links not malicious looking enough? (phishyurl.com)

David Lynch LA House (wallpaper.com)

Learn Your Way: Reimagining Textbooks with Generative AI (research.google)

AI tools are making the world look weird (strat7.com)

This map is not upside down (maps.com)

Llama-Factory: Unified, Efficient Fine-Tuning for 100 Open LLMs (github.com)

Rupert's snub cube and other Math Holes (tom7.org)

Meta’s live demo fails; “AI” recording plays before the actor takes the steps (reddit.com)

Inside Palantir: Profits, Power and the Kill Machine (citizensreunited.substack.com)

Show HN: Asxiv.org – Ask ArXiv papers questions through chat (asxiv.org)

Show HN: I created a small 2D game about an ant (aanthonymax.github.io)

Configuration files are user interfaces (ochagavia.nl)

Visual lexicon of consumer aesthetics from the 1970s until now (cari.institute)

Launch HN: Cactus (YC S25) – AI inference on smartphones (github.com)

Tldraw SDK 4.0 (tldraw.dev)

Show HN: Nallely – A Python signals/MIDI processing system inspired by Smalltalk (dr-schlange.github.io)

Slack has raised our charges by $195k per year (skyfall.dev)

TernFS – An exabyte scale, multi-region distributed filesystem (xtxmarkets.com)

KDE is now my favorite desktop (kokada.dev)

Flipper Zero Geiger Counter (kasiin.top)

They Know More Than I Do (cybadger.com)

Luau – Fast, small, safe, gradually typed scripting language derived from Lua (luau.org)

Nvmath-Python: Nvidia Math Libraries for the Python Ecosystem (github.com)

OpenTelemetry Collector: What It Is, When You Need It, and When You Don't (oneuptime.com)

TIC-80 – Tiny Computer (tic80.com)

When Knowing Someone at Meta Is the Only Way to Break Out of "Content Jail" (eff.org)

Classic recessive-or-dominant gene dynamics may not be so simple (news.stanford.edu)

The quality of AI-assisted software depends on unit of work management (blog.nilenso.com)

PostgreSQL Maintenance Without Superuser (boringsql.com)

Aaron Levie: Startups win in the AI era [video] (youtube.com)

American Prairie unlocks another 70k acres in Montana (earthhope.substack.com)

Midcentury North American Restaurant Placemats (casualarchivist.substack.com)

Pnpm has a new setting to stave off supply chain attacks (pnpm.io)

OneDev – Self-hosted Git server with CI/CD, Kanban, and packages (onedev.io)

CircuitHub (YC W12) Is Hiring Operations Research Engineers (UK/Remote) (ycombinator.com)

This website has no class (aaadaaam.com)

I Built an Event-Sourcing Database Engine: Meet Genesis DB (genesisdb.io)

Show HN: Dyad, local, open-source Lovable alternative (Electron desktop app) (dyad.sh)

U.S. already has the critical minerals it needs, according to new analysis (minesnewsroom.com)

Grief gets an expiration date, just like us (bessstillman.substack.com)

Dark patterns killed my wife's Windows 11 installation (osnews.com)

ICE unit signs new $3M contract for phone-hacking tech (techcrunch.com)

Automatic differentiation can be incorrect (stochasticlifestyle.com)

Fast Fourier Transforms Part 1: Cooley-Tukey (connorboyle.io)

The Day the Linter Broke My Code (blog.fillmore-labs.com)

The Fake Social Binary (brennenputh.me)

Chrome's New AI Features (blog.google)

The Self-Betrayal Heuristic (SBH)

Comments (0)