Q Evaluation Harness: open-source evals for LLMs on q/kdb+
1 erfan_mhi 0 8/15/2025, 3:37:16 PM github.com ↗
Author here. We built an open-source evaluation harness for LLMs on q/kdb+. It includes a q-HumanEval set (164 tasks), reproducible Pass@k scoring, and a public leaderboard.
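Roughly, a task looks like this (a simplified sketch modeled on the original HumanEval JSONL schema; field names here are illustrative, not a spec of our dataset):

    # Hypothetical q-HumanEval-style task record (illustrative only).
    task = {
        "task_id": "q-HumanEval/0",                # illustrative id
        "prompt": "/ Return the sum of all elements of the list x.\nsumList:",
        "canonical_solution": "{sum x}",           # q lambda that completes the prompt
        "test": "15 ~ sumList 1 2 3 4 5",          # q expression expected to yield 1b
    }
    # Scoring concatenates prompt + model completion, runs it in a q session,
    # and checks that the test expression holds.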
Why this matters: top models score ~96% Pass@1 on Python HumanEval, but the best Pass@1 on q-HumanEval is ~43.4%, so there’s clear room for improvement. Early runs also show large gains from multiple attempts (e.g., Grok 4: 43.37% Pass@1 → 74.32% Pass@10).
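For anyone unfamiliar with Pass@k: it is typically reported with the unbiased estimator from the original HumanEval paper — generate n samples per task, count the c that pass the tests, and estimate the probability that at least one of k draws is correct. A minimal sketch of that standard formula (not necessarily the harness’s exact code path):

    import numpy as np

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Unbiased Pass@k estimator: 1 - C(n-c, k) / C(n, k).

        n: samples generated for the task
        c: samples that pass the task's tests
        k: attempts allowed
        """
        if n - c < k:  # every size-k draw must contain a passing sample
            return 1.0
        return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

    # Example: 20 samples per task, 9 pass the tests
    print(pass_at_k(20, 9, 1))   # 0.45 (= c/n)
    print(pass_at_k(20, 9, 10))  # ~0.9999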
We’d love your help with two things: 1. Try it out and add your models to the leaderboard. 2. Contribute new datasets and share feedback on anything we could improve.
• GitHub: https://github.com/KxSystems/q-evaluation-harness/tree/main
• Launch write-up: https://medium.com/kx-systems/introducing-q-evaluation-harne...
• Leaderboard: https://github.com/KxSystems/q-evaluation-harness/blob/main/...
• License: MIT
Happy to answer questions and take PRs.
No comments yet