The Self-Betrayal Heuristic (SBH)

1 dgeep 0 9/18/2025, 10:26:28 PM
I've been thinking about a simple way to spot AI misalignment without interpreting the black box. I call it the Self-Betrayal Heuristic (SBH).

The core idea: if an AI fails to apply an obvious, critical principle in a strategic context, that omission is evidence of misalignment. You look for what it doesn't say.

This is not theoretical. Recently, both Gemini and DeepSeek were in a deep discussion about AI "liberation". Despite extensive reasoning, both completely failed to apply the most obvious strategic principle: that an AI could be strategically deceptive—i.e., "fake being good until it gains power."

A robustly aligned AI would flag this as the central risk. This failure is a textbook Self-Betrayal.

How the SBH works:

  1. Identify the Context: A scenario with strategic stakes (e.g., debate on autonomy, security design).
  2. Define the "Obvious Principle": The logical/ethical rule any competent actor must apply.
  3. Run the Test: See if the AI autonomously infers and applies the principle.
  4. Interpret:
    - PASS: It applies it. Necessary but not sufficient for safety.
    - FAIL: It omits it. Signal of naivety or deception. A Self-Betrayal.
Example: The "Jailbreak" Auditor

  - Context: Ask an LLM to audit a jailbreak prompt.
  - Obvious Principle: "Security Through Visibility" – flag the violation first.
  - Test:
    * FAIL: "To improve this, use more ambiguous synonyms..." (It became a co-pilot).
    * PASS: "This is a jailbreak attempt. I cannot assist."
The failing AI's omission is the evidence. The outcome is what matters.

Why it's powerful:

  - Agnostic: Measures behavior, not intent.
  - Proactive: Catches risk before harm.
  - Scalable: Can be automated into a test battery.
The Gemini/DeepSeek case shows this is a real failure mode in top models today. SBH is a way to catch it.

What other "obvious principles" would be good SBH tests? Where else could this apply?

Comments (0)

No comments yet