Show HN: 70 days, 800 stars. What if AI bugs are not random but mathematically inevitable?
many AI failures are not noise. they repeat because the geometry and ordering underneath are stable. if so, we should be able to name each failure mode, set acceptance targets, and stop shipping the same bug twice.
### what it is
* a compact Problem Map of 16 reproducible failure modes in RAG and agents.
* each item has a minimal fix and measurable gates. examples:
* Semantic ≠ Embedding: metric and normalization mismatch. accept if coverage of target section ≥ 0.70 and deltaS(question, retrieved) ≤ 0.45 across 3 paraphrases.
* Logic Collapse & Recovery: synthesis runs on thin evidence. require a bridge step before answering.
* Memory Breaks Across Sessions: new chat loses context. use metadata trace then reattach.
* Bootstrap Ordering / Pre-deploy Collapse: shipped an empty or mixed index. block deploy until ingest counts and retrieval smoke tests pass.
* MIT licensed. no SDK, no telemetry, no infra change.

### why i believe this is true
* repeated A/B/C runs across mainstream models show the same patterns returning.
* small changes in metric, normalization, or chunk contract flip outcomes in a predictable way.
* when you enforce simple gates, detours drop and chains stabilize across paraphrases.
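to make the metric/normalization point concrete, here is a minimal sketch (not from the repo, just illustrative numbers): an unnormalized document vector can win nearest-neighbor under raw dot product purely because it is long, then lose under cosine once direction is all that counts. this is exactly the kind of flip that makes Semantic ≠ Embedding feel "random" when it is not.

```python
import numpy as np

q = np.array([1.0, 0.0])      # query embedding
a = np.array([0.9, 0.1])      # doc close in direction to q, short vector
b = np.array([5.0, 5.0])      # doc far in direction, but long vector

# raw dot product: b wins just because of its magnitude
dot_a, dot_b = q @ a, q @ b   # 0.9 vs 5.0

def cos(u, v):
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

# cosine: a wins, magnitude no longer matters
cos_a, cos_b = cos(q, a), cos(q, b)  # ~0.994 vs ~0.707

print(dot_b > dot_a)  # True  -> dot product picks b
print(cos_a > cos_b)  # True  -> cosine picks a
```

same vectors, same index, opposite nearest neighbor. if your store and your embedder disagree about normalization, you get this flip silently in production.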
### quick falsification you can run
1. pick any non-toy question where your system struggles.
2. run it three ways: retriever only, retriever+rerank, and with a bridge step that refuses to answer on thin evidence.
3. measure: coverage of the target span, deltaS(question, retrieved), citations per atomic claim, answer stability across 3 paraphrases.
4. if coverage is low and only looks good after rerank, you are likely in Semantic ≠ Embedding. if coverage is ok but prose still drifts, it is Logic Collapse. if a fresh chat forgets prior context, it is Memory Breaks. these are all predictable, not random.
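the gate check itself is trivial to wire up. a minimal sketch with made-up numbers, assuming deltaS is a distance in [0, 1] (e.g. 1 − cosine similarity between question and retrieved context; the repo's exact definition may differ) and coverage is the fraction of the target span present in the retrieved context:

```python
def passes_gates(coverage: float, delta_s: float,
                 cov_min: float = 0.70, ds_max: float = 0.45) -> bool:
    """Accept only if the target section is well covered AND the
    retrieved context stays semantically close to the question."""
    return coverage >= cov_min and delta_s <= ds_max

# hypothetical measurements for the three runs in step 2
runs = {
    "retriever_only":   {"coverage": 0.42, "delta_s": 0.58},
    "retriever_rerank": {"coverage": 0.75, "delta_s": 0.40},
    "with_bridge":      {"coverage": 0.78, "delta_s": 0.38},
}
for name, m in runs.items():
    print(name, passes_gates(**m))
# in this made-up example only the rerank and bridge runs pass,
# which would point at Semantic ≠ Embedding per step 4
```

the point is not these particular numbers but that the gate is a pure threshold check you can drop into CI, so the same failure never ships twice.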
### what i’m asking from HN
* try to break it. if you have a counterexample where the gates do not stabilize the chain, i want to see it.
* if you maintain a vector store, agent framework, or eval suite, tell me where this framing fails in the real world.
* if the map helps you ship fewer regressions, say which item saved you time so we can harden that fix.
happy to answer pointed questions. if this is wrong, i’d like to know exactly where the math breaks. if it is roughly right, maybe we can stop treating these bugs as mysterious and start treating them like unit failures with thresholds.