Open-sourcing our clinical triage benchmark for evaluating LLMs
3 points by klemenvod, 2 comments, 7/12/2025, 8:37:35 AM (github.com)
Getting triage wrong can be dangerous (missed emergencies) or costly (unnecessary ER visits).
We’ve open-sourced TriageBench, a reproducible framework for evaluating LLM triage accuracy. It includes:
- A standard clinical dataset (Semigran vignettes)
- McNemar's test on paired model predictions, to detect performance differences between models on small datasets
- Full methodology and evaluation code
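For readers unfamiliar with McNemar's test, here is a minimal sketch of the exact (binomial) version on paired per-vignette results. It is not taken from the TriageBench code; the function name `mcnemar_exact` and the sample data are made up for illustration, and the repository may use a different variant of the test.

```python
from math import comb


def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar test on paired per-vignette correctness flags.

    Returns (b, c, p_value), where b counts vignettes only model A got right
    and c counts vignettes only model B got right.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both models must be scored on the same vignettes")

    # Only discordant pairs carry information about a difference.
    b = sum(1 for a, other in zip(correct_a, correct_b) if a and not other)
    c = sum(1 for a, other in zip(correct_a, correct_b) if not a and other)
    n = b + c
    if n == 0:
        return b, c, 1.0  # the models never disagree

    # Under the null hypothesis, b follows Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Hypothetical results on 45 vignettes (NOT the benchmark's actual data):
# A alone is right on 9, B alone is right on 1, both are right on 35.
model_a = [True] * 9 + [False] * 1 + [True] * 35
model_b = [False] * 9 + [True] * 1 + [True] * 35

b, c, p = mcnemar_exact(model_a, model_b)
print(f"discordant pairs: b={b}, c={c}, p={p:.4f}")
```

The test ignores vignettes where both models agree, which is why it can still detect a difference on a small dataset when the disagreements are lopsided.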
GitHub: https://github.com/medaks/medask-benchmark
As a demonstration, we benchmarked our own model (MedAsk) against several OpenAI models:
- MedAsk: 87.6% accuracy
- o3: 75.6%
- GPT‑4.5: 68.9%
The main limitation is dataset size (45 vignettes). We're looking for collaborators to help expand it; the field needs larger, more diverse clinical datasets.
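To give a sense of how much uncertainty 45 vignettes leaves, here is a rough sketch of a 95% Wilson score interval for a single accuracy figure. The count of 34 correct out of 45 is an assumption (it is the whole number that rounds to 75.6%), and `wilson_interval` is an illustrative helper, not part of the benchmark code.

```python
from math import sqrt


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Assumed count: 34 of 45 correct, which rounds to 75.6%.
low, high = wilson_interval(34, 45)
print(f"accuracy 34/45: 95% interval {low:.3f} to {high:.3f}")
```

With n=45 the interval runs from roughly 61% to 86%, which is why unpaired accuracy comparisons on this dataset are hard to interpret and a larger set would help.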
Blog post with full results: https://medask.tech/blogs/medical-ai-triage-accuracy-2025-me...
"Don't just train your model, understand its mind."
https://github.com/dmf-archive/