Precision-Based Sampling of LLM Judges
sunny-bak · 5/27/2025, 11:33:57 PM · sunnybak.net
Built a system that automatically determines how many LLM-as-a-judge runs you need for statistically reliable scores.
Key insight: treat each LLM evaluation as a noisy sample, then use confidence intervals to decide when to stop sampling. The math shows reliability is surprisingly cheap (going from 95% to 99% confidence costs only ~1.7x more samples), but precision is expensive (doubling the scale granularity costs 4x more samples).
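A minimal sketch of the stopping rule, assuming a normal approximation and a hypothetical `judge_score()` callable that returns one numeric score per run (not the repo's actual API):

```python
import statistics

def sample_until_precise(judge_score, precision=0.5, confidence=0.95,
                         min_samples=3, max_samples=50):
    """Draw judge scores until the CI half-width is <= the target precision."""
    z = statistics.NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # 1.96 for 95%
    scores = [judge_score() for _ in range(min_samples)]
    while True:
        n = len(scores)
        sem = statistics.stdev(scores) / n ** 0.5   # standard error of the mean
        half_width = z * sem                        # confidence interval half-width
        if half_width <= precision or n >= max_samples:
            return statistics.mean(scores), half_width, n
        scores.append(judge_score())
```

This is also where the scaling claims come from: the required n grows like (z / precision)^2, so moving from 95% to 99% confidence multiplies n by (2.576 / 1.960)^2 ≈ 1.73, while halving the CI width multiplies it by 4.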
Also implemented "mixed-expert sampling": rotating through multiple judge models (GPT-4, Claude, etc.) in the same batch for better robustness.
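The rotation could look something like this sketch, where `call_judge` and the model names are placeholders rather than the repo's real interface:

```python
import itertools

def mixed_expert_scores(prompt, call_judge,
                        models=("gpt-4", "claude-3-opus"), n_samples=10):
    """Round-robin through judge models so no single model's bias dominates."""
    rotation = itertools.cycle(models)
    return [call_judge(model=next(rotation), prompt=prompt)
            for _ in range(n_samples)]
```

Wrapped as a zero-argument closure, this rotation can serve as the `judge_score` callable in the stopping-rule sketch above, so the precision check runs over a mixture of judges.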
Analyzed how latency, cost, and reliability scale with this approach.
Typical result: need 5-20 samples instead of guessing. Especially useful for AI safety evals and model comparisons where reliability matters.
Code: https://github.com/sunnybak/precision-based-sampling
Blog: https://www.sunnybak.net/blog/precision-based-sampling