Show HN: 70 days, 800 stars. What if AI bugs are not random but mathematically inevitable?
many AI failures are not noise. they repeat because the geometry and ordering underneath are stable. if so, we should be able to name each failure mode, set acceptance targets, and stop shipping the same bug twice.
### what it is
* a compact Problem Map of 16 reproducible failure modes in RAG and agents.
* each item has a minimal fix and measurable gates. examples:
* Semantic ≠ Embedding: metric and normalization mismatch. accept if coverage of target section ≥ 0.70 and deltaS(question, retrieved) ≤ 0.45 across 3 paraphrases.
* Logic Collapse & Recovery: synthesis runs on thin evidence. require a bridge step before answering.
* Memory Breaks Across Sessions: new chat loses context. use metadata trace then reattach.
* Bootstrap Ordering / Pre-deploy Collapse: shipped an empty or mixed index. block deploy until ingest counts and retrieval smoke tests pass.
* MIT licensed. no SDK, no telemetry, no infra change.

### why i believe this is true
* repeated A/B/C runs across mainstream models show the same patterns returning.
* small changes in metric, normalization, or chunk contract flip outcomes in a predictable way.
* when you enforce simple gates, detours drop and chains stabilize across paraphrases.
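to make the metric/normalization point concrete, here is a minimal sketch (not from the repo, just illustrative numbers): an unnormalized document vector can win nearest-neighbor under raw dot product purely because it is long, then lose under cosine once direction is all that counts. this is exactly the kind of flip that makes Semantic ≠ Embedding feel "random" when it is not.

```python
import numpy as np

q = np.array([1.0, 0.0])      # query embedding
a = np.array([0.9, 0.1])      # doc close in direction to q, short vector
b = np.array([5.0, 5.0])      # doc far in direction, but long vector

# raw dot product: b wins just because of its magnitude
dot_a, dot_b = q @ a, q @ b   # 0.9 vs 5.0

def cos(u, v):
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

# cosine: a wins, magnitude no longer matters
cos_a, cos_b = cos(q, a), cos(q, b)  # ~0.994 vs ~0.707

print(dot_b > dot_a)  # True  -> dot product picks b
print(cos_a > cos_b)  # True  -> cosine picks a
```

same vectors, same index, opposite nearest neighbor. if your store and your embedder disagree about normalization, you get this flip silently in production.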
### quick falsification you can run
1. pick any non-toy question where your system struggles.
2. run it three ways: retriever only, retriever+rerank, and with a bridge step that refuses to answer on thin evidence.
3. measure: coverage of the target span, deltaS(question, retrieved), citations per atomic claim, answer stability across 3 paraphrases.
4. if coverage is low and only looks good after rerank, you are likely in Semantic ≠ Embedding. if coverage is ok but prose still drifts, it is Logic Collapse. if a fresh chat forgets prior context, it is Memory Breaks. these are all predictable, not random.
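the gate check itself is trivial to wire up. a minimal sketch with made-up numbers, assuming deltaS is a distance in [0, 1] (e.g. 1 − cosine similarity between question and retrieved context; the repo's exact definition may differ) and coverage is the fraction of the target span present in the retrieved context:

```python
def passes_gates(coverage: float, delta_s: float,
                 cov_min: float = 0.70, ds_max: float = 0.45) -> bool:
    """Accept only if the target section is well covered AND the
    retrieved context stays semantically close to the question."""
    return coverage >= cov_min and delta_s <= ds_max

# hypothetical measurements for the three runs in step 2
runs = {
    "retriever_only":   {"coverage": 0.42, "delta_s": 0.58},
    "retriever_rerank": {"coverage": 0.75, "delta_s": 0.40},
    "with_bridge":      {"coverage": 0.78, "delta_s": 0.38},
}
for name, m in runs.items():
    print(name, passes_gates(**m))
# in this made-up example only the rerank and bridge runs pass,
# which would point at Semantic ≠ Embedding per step 4
```

the point is not these particular numbers but that the gate is a pure threshold check you can drop into CI, so the same failure never ships twice.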
### what i’m asking from HN
* try to break it. if you have a counterexample where the gates do not stabilize the chain, i want to see it.
* if you maintain a vector store, agent framework, or eval suite, tell me where this framing fails in the real world.
* if the map helps you ship fewer regressions, say which item saved you time so we can harden that fix.
happy to answer pointed questions. if this is wrong, i’d like to know exactly where the math breaks. if it is roughly right, maybe we can stop treating these bugs as mysterious and start treating them like unit failures with thresholds.