Open-sourcing our clinical triage benchmark for evaluating LLMs

Comments (2)

klemenvod · 9h ago

In our context, medical triage means deciding whether symptoms require emergency care, urgent care, or can be managed with self-care. This matters because LLMs are increasingly becoming the “digital front door” for health concerns, replacing the instinct to just Google it.

Getting triage wrong can be dangerous (missed emergencies) or costly (unnecessary ER visits).

We’ve open-sourced TriageBench, a reproducible framework for evaluating LLM triage accuracy. It includes:

- A standard clinical dataset (Semigran vignettes)

- Paired McNemar’s test to detect model performance differences on small datasets

- Full methodology and evaluation code
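For readers unfamiliar with the paired test mentioned above, here is a minimal sketch of an exact (binomial) McNemar's test in plain Python. It is an illustration only, not necessarily how TriageBench implements it; the function name `mcnemar_exact` and the per-vignette correctness lists are hypothetical. The test only looks at vignettes where the two models disagree, which is why it suits small paired datasets.

```python
from math import comb


def mcnemar_exact(correct_a, correct_b):
    """Exact two-sided McNemar's test on paired per-item correctness flags.

    correct_a, correct_b: equal-length sequences of booleans, one entry per
    vignette, True if that model triaged the vignette correctly.
    Returns (b, c, p_value) where b = A right / B wrong, c = A wrong / B right.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both models must be scored on the same vignettes")

    # Only discordant pairs carry information about a difference.
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if not x and y)
    n = b + c
    if n == 0:
        return b, c, 1.0  # the models never disagree

    # Under H0 each discordant pair is a fair coin flip: Binomial(n, 0.5).
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return b, c, min(1.0, 2 * tail)


# Hypothetical results on 45 vignettes (made-up data, not the benchmark's):
# 30 both right, 8 only A right, 1 only B right, 6 both wrong.
model_a = [True] * 30 + [True] * 8 + [False] * 1 + [False] * 6
model_b = [True] * 30 + [False] * 8 + [True] * 1 + [False] * 6

b, c, p = mcnemar_exact(model_a, model_b)
print(f"discordant pairs: b={b}, c={c}, p-value={p:.4f}")
```

The exact binomial form is used here instead of the chi-squared approximation because the number of discordant pairs is usually small on a dataset of this size.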

GitHub: https://github.com/medaks/medask-benchmark

As a demonstration, we benchmarked our own model (MedAsk) against several OpenAI models:

- MedAsk: 87.6% accuracy

- o3: 75.6%

- GPT‑4.5: 68.9%

The main limitation is dataset size (45 vignettes). We're looking for collaborators to help expand this - the field needs larger, more diverse clinical datasets.
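To make the dataset-size limitation concrete, here is a rough sketch of a 95% Wilson score interval for accuracy on 45 items. The count of 39 correct is a hypothetical example, not a figure from the benchmark, and `wilson_interval` is an illustrative helper name.

```python
from math import sqrt


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Hypothetical: 39 of 45 vignettes triaged correctly (about 86.7%).
low, high = wilson_interval(39, 45)
print(f"95% interval: {low:.3f} to {high:.3f}")
```

With these assumed numbers the interval spans roughly 0.74 to 0.94, so a single accuracy figure on 45 vignettes leaves a lot of uncertainty, which is part of why a paired test and a larger dataset both help.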

Blog post with full results: https://medask.tech/blogs/medical-ai-triage-accuracy-2025-me...

NetRunnerSu · 9h ago

On the other hand, we can also diagnose the LLM itself: the activation values are its EEG, the gradients are its BOLD signal. If you are willing to pay the cost, you can even calculate its true variational free energy, that is, the KL divergence.

"Don't just train your model, understand its mind."

https://github.com/dmf-archive/
