Show HN: Which LLM Finds Obscure Knife-Brand URLs Cheapest? (8-Model Benchmark)

Hi HN,

I’m building *new.knife.day* (https://new.knife.day), a crowd-sourced database of every cutlery maker—from Al Mar to brands so small they barely show up on Google. That means I need an automated way to fetch each brand’s official website, even for fringe names like “Actilam” or “Aiorosu Knives”.

So I threw the task at eight web-enabled LLMs via OpenRouter:

  • gpt-4o and gpt-4o-mini
  • claude-sonnet-4
  • gemini-2.5-pro and gemini-2.0-flash
  • llama-3.1-70b
  • qwen-2.5-72b
  • perplexity sonar-deep-research
  • Prompt: return *only* JSON { brand, official_url, confidence }
  • Data set: 10 obscure knife brands
  • Scoring: exact domain = correct; "no official site" (with a reason) = correct
  • Costs: OpenRouter prices on 31 May 2025 (Perplexity billed separately)
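
For reference, each lookup was a single chat completion against OpenRouter. Here's a minimal Python sketch of the idea, not the exact script from the repo; the prompt wording, helper name, and the ":online" web-search suffix are my assumptions about the setup:

    import json, os, requests

    OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"
    HEADERS = {"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"}

    PROMPT = (
        "Find the official website for the knife brand given by the user. "
        'Return *only* JSON: {"brand": ..., "official_url": ..., "confidence": ...}. '
        "If there is no official site, set official_url to null and add a short reason."
    )

    def lookup(brand: str, model: str) -> dict:
        # One chat completion per (brand, model) pair; the reply should be bare JSON.
        body = {
            # e.g. "openai/gpt-4o-mini:online"; the ":online" suffix is how I understand
            # OpenRouter enables web search, but check the docs for your account.
            "model": model,
            "messages": [
                {"role": "system", "content": PROMPT},
                {"role": "user", "content": brand},
            ],
        }
        resp = requests.post(OPENROUTER_URL, headers=HEADERS, json=body, timeout=120)
        resp.raise_for_status()
        # json.loads() throws if the model returns prose or HTML instead of JSON.
        return json.loads(resp.json()["choices"][0]["message"]["content"])

    print(lookup("Aiorosu Knives", "openai/gpt-4o-mini:online"))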

Highlights
----------

  • Perplexity hit 10/10 but cost $9.42 (860k tokens!).
  • GPT-4o-mini & Llama-3.1-70B got 9/10 for ~2¢ per correct URL.
  • Gemini Flash managed 7/10 for $0.001 total: great if you can QA the misses.
  • Half of Gemini 2.5 Pro's replies were HTML tables that my parser rejected.
Full table, code, and raw logs are in the post (and on GitHub).
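
To make the scoring rule concrete, here's a rough sketch of an exact-domain check. The normalization (strip scheme and "www.") is a simplification of whatever the real comparison does, and the example entry is hypothetical:

    from urllib.parse import urlparse

    def registered_domain(url: str) -> str:
        # Crude normalization: lower-case the host and strip "www.".
        # A stricter check could use the public-suffix list for TLDs like .co.uk.
        host = urlparse(url if "//" in url else "https://" + url).netloc.lower()
        return host.removeprefix("www.")

    def is_correct(reply: dict, expected_domain: str | None) -> bool:
        url = reply.get("official_url")
        if expected_domain is None:
            # Brand genuinely has no official site: correct only if the model
            # says so and gives a reason.
            return not url and bool(reply.get("reason"))
        return bool(url) and registered_domain(url) == expected_domain

    # Hypothetical example, not a real benchmark entry:
    assert is_correct({"official_url": "https://www.example-knives.com/about"},
                      "example-knives.com")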

Take-aways
----------

  1. 90 % accuracy + quick human review often beats 100 % accuracy that costs
     45× more.
  2. Structured output is part of model quality—validate JSON on arrival.
  3. Promo pricing moves fast; always ping the price API before large runs
     (see the sketch below).
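
On that third point, the pre-run check is cheap: OpenRouter exposes current per-token prices through its models endpoint, so you can estimate a whole run before launching it. A rough sketch; the response field names are my reading of the API, so verify against the live docs before relying on them:

    import requests

    def estimate_run_cost(model_id: str, prompt_toks: int,
                          completion_toks: int, n_brands: int) -> float:
        # OpenRouter's models endpoint lists current per-token USD prices.
        models = requests.get("https://openrouter.ai/api/v1/models", timeout=30).json()["data"]
        pricing = next(m["pricing"] for m in models if m["id"] == model_id)
        per_call = (prompt_toks * float(pricing["prompt"])
                    + completion_toks * float(pricing["completion"]))
        return per_call * n_brands

    # Rough pre-flight estimate for ~250 brands; token counts are guesses, not measurements.
    print(estimate_run_cost("openai/gpt-4o-mini", 2_000, 300, 250))
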
Next step: wire GPT-4o-mini into *new.knife.day* so visitors get verified manufacturer links. Crawling ~250 brands now costs under $5.

Curious what you’d improve, and which model you’d bet on for similar “find the canonical URL” tasks. AMA on the setup, prompts, or results!
