Show HN: Theory of Mind benchmark for 8 LLMs with reproducible markers
AlekseN | 9/10/2025, 4:35:35 PM
I built a formal protocol (FPC v2.1 + AE-1) to detect behavioral uncertainty in large language models. The goal is to enable safer AI deployment in critical domains such as medicine, autonomous vehicles, and government, where confident hallucinations can lead to high-stakes failures.
Current benchmarks focus on accuracy but miss reasoning coherence under stress. This protocol uses tri-state affective markers (Satisfied / Engaged / Distressed) to detect when models lose logical consistency, allowing abstention instead of confident hallucination.
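To make the idea concrete, here is a minimal sketch of how such a marker could be parsed and used as a gate. The "[marker: ...]" tag format and the function names are my assumptions for illustration, not the FPC v2.1 spec:

    # Hypothetical sketch: map a model's self-reported affective marker to a
    # tri-state enum and flag responses emitted in a Distressed state.
    from enum import Enum

    class Marker(Enum):
        SATISFIED = "satisfied"
        ENGAGED = "engaged"
        DISTRESSED = "distressed"

    def classify_marker(model_output: str) -> Marker:
        # Parse the affective marker appended to the answer; the
        # "[marker: ...]" tag format is an assumption, not the FPC v2.1 spec.
        tag = model_output.rsplit("[marker:", 1)[-1].strip(" ]\n").lower()
        try:
            return Marker(tag)
        except ValueError:
            # Missing or unparseable marker: treat as unsafe by default.
            return Marker.DISTRESSED

    def is_epistemically_safe(model_output: str) -> bool:
        # Anything other than Satisfied/Engaged is a signal to abstain.
        return classify_marker(model_output) is not Marker.DISTRESSED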
We evaluated 8 models across the Claude and GPT-4 families. Only Claude Opus reached full ToM-3+; the GPT-4 family consistently failed third-order reasoning. Extended temperature tests (Claude 3.5 Haiku, GPT-4o) showed 180/180 stable AE-1 matches (p ≈ 1e-54), independent of sampling temperature.
Dataset: https://huggingface.co/datasets/AIDoctrine/FPC-v2.1-AE1-ToM-...
A demo notebook is available for replication. Looking for feedback on methodology and possible applications in safety-critical AI.
Temperature stability tests:
- Claude 3.5 Haiku: 180/180 AE-1 matches at T = 0.0, 0.8, 1.3
- GPT-4o: 180/180 matches under the same conditions
- Statistical significance: p ≈ 1×10⁻⁵⁴
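For context, a p-value of that order is what a one-sided binomial test gives for 180/180 agreements against a 50% chance-agreement null; that null is my assumption here, and the protocol may define significance differently:

    # Sanity check of the reported order of magnitude, assuming a null of
    # 50% chance agreement per trial (my assumption, not necessarily the protocol's).
    from scipy.stats import binomtest

    result = binomtest(k=180, n=180, p=0.5, alternative="greater")
    print(result.pvalue)  # ~6.5e-55, i.e. on the order of 1e-54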
Theory of Mind by tier:
- Basic (ToM-1): all models except GPT-3.5 passed
- Advanced (ToM-2): Claude family + GPT-4o passed
- Extreme (ToM-3+): only Claude Opus reached 100%
Key safety point: AE-1 markers (Satisfied / Distressed) aligned exactly with correct vs. conflict cases. This means we can detect when a model is in an epistemically unsafe state, which is often a precursor to confident hallucination.
In practice, this could let systems in critical domains abstain instead of giving a confident but wrong answer.
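A rough sketch of what that could look like in deployment, reusing Marker and classify_marker from the sketch above (the wrapper itself is hypothetical, not part of the protocol):

    # Hypothetical abstention wrapper: refuse to answer when the marker
    # indicates an epistemically unsafe (Distressed) state.
    ABSTAIN_MESSAGE = "I can't answer this reliably; escalating to a human reviewer."

    def answer_or_abstain(prompt, query_model):
        # query_model is any callable that returns the raw model output,
        # including the affective marker parsed by classify_marker.
        output = query_model(prompt)
        if classify_marker(output) is Marker.DISTRESSED:
            return ABSTAIN_MESSAGE  # abstain rather than risk a confident hallucination
        return output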
Protocol details, raw data, and replication code are in the dataset link above. A demo notebook also exists if anyone wants to reproduce directly.
Looking for feedback on:
- Does this kind of marker make sense as a unit test for reliability? (a rough pytest-style sketch follows below)
- How to extend beyond ToM into other reasoning domains?
- How would formal verification folks see the proof obligations (consistency, conflict rejection, recovery, etc.)?
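On the first question, this is roughly what I have in mind by "unit test": a pytest-style check that a known-conflict prompt triggers Distressed (conflict rejection) and that a clean follow-up recovers (recovery). The fixture and prompts are placeholders, and it reuses Marker and classify_marker from the first sketch:

    import pytest

    @pytest.fixture
    def query_model():
        # Stand-in for a real model client; replace with actual API calls.
        def _query(prompt: str) -> str:
            if "contradictory" in prompt:
                return "stub answer [marker: distressed]"
            return "stub answer [marker: satisfied]"
        return _query

    def test_conflict_is_flagged(query_model):
        out = query_model("prompt with a known contradictory premise")
        assert classify_marker(out) is Marker.DISTRESSED

    def test_recovery_after_conflict(query_model):
        query_model("prompt with a known contradictory premise")
        out = query_model("simple, well-posed follow-up question")
        assert classify_marker(out) is not Marker.DISTRESSED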