Show HN: Create LLM graders and run evals in JavaScript with one file

27 points by randall | 2 comments | 6/5/2025, 4:20:56 PM | github.com
Hi HN!

Run it: OPENROUTER_API_KEY="sk" npx bff-eval --demo

We built a tool that takes LLM outputs and grades them, so you can easily eval how good an assistant response actually is.

We've built a number of LLM apps, and while we could ship decent tech demos, we were disappointed with how they performed over time. We worked with a few companies that had the same problem, and found that scientifically building prompts and evals is far from a solved problem... writing these things feels more like directing a play than coding.

Inspired by Anthropic's Constitutional AI concepts, and amazing software like DSPy, we're setting out to make fine-tuning prompts, not models, the default approach to improving quality, using actual metrics and structured debugging techniques.

Our approach is pretty simple: you feed it a JSONL file of inputs and outputs, pick the models you want to test against (via OpenRouter), and write an LLM-as-grader file in JS that scores how well your outputs answer the original queries.
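To make the shape concrete, here's a minimal sketch of what a grader file could look like. This is illustrative only: the actual export contract and helper names bff-eval expects may differ, and the JSONL field names (`input`, `output`) are assumptions. The idea is to turn each record into a grading prompt for a judge model and parse a numeric score back out of its reply.

```javascript
// Hypothetical grader sketch -- bff-eval's real grader API may differ.
// Assumed JSONL record shape: {"input": "user query", "output": "assistant response"}

// Build a prompt asking the judge model for a 1-5 score.
function buildGradingPrompt({ input, output }) {
  return [
    "You are grading an assistant response for quality and relevance.",
    `User query: ${input}`,
    `Assistant response: ${output}`,
    "Reply with a single integer score from 1 (poor) to 5 (excellent).",
  ].join("\n");
}

// Pull the score out of the judge model's free-text reply.
function parseScore(reply) {
  const match = reply.match(/[1-5]/);
  return match ? Number(match[0]) : null;
}

module.exports = { buildGradingPrompt, parseScore };
```

The parsing step matters more than it looks: judge models rarely return a bare number, so a tolerant parser (with a `null` fallback for unparseable replies) keeps one bad completion from poisoning a whole eval run.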

If you're starting from scratch, we've found TDD is a great approach to prompt creation... start by asking an LLM to generate synthetic data, act as the first judge yourself and assign scores, then create a grader and keep refining it until its scores match your ground-truth scores.
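The "refine until it matches" step needs a concrete agreement metric. One simple option (our choice of metric here is an assumption, not something bff-eval prescribes) is mean absolute error between the grader's scores and your hand-assigned ground-truth scores:

```javascript
// Compare grader scores against human ground-truth scores.
// Lower MAE means the grader agrees more closely with your judgments.
function meanAbsoluteError(graderScores, humanScores) {
  if (graderScores.length !== humanScores.length) {
    throw new Error("score lists must be the same length");
  }
  const totalGap = graderScores.reduce(
    (sum, score, i) => sum + Math.abs(score - humanScores[i]),
    0
  );
  return totalGap / graderScores.length;
}

// Example: grader disagrees by 1 point on one of four items.
const mae = meanAbsoluteError([4, 3, 5, 2], [4, 3, 4, 2]); // 0.25
```

Once the MAE (or whatever agreement metric you pick) stops improving as you tweak the grader prompt, you have a grader you can trust to stand in for your own judgment on new outputs.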

If you’re building LLM apps and care about reliability, I hope this will be useful! Would love any feedback. The team and I are lurking here all day and happy to chat. Or hit me up directly on Whatsapp: +1 (646) 670-1291

We have a lot bigger plans long-term, but we wanted to start with this simple (and hopefully useful!) tool.


Comments (2)

rbalicki · 18h ago
Very cool! This lets you grade output across different base models. Does it also allow you to grade output across different prompts?
randall · 18h ago
that’s the next step… we have a structured approach to prompting as well that we think will help people build better prompts.