Q Evaluation Harness: open-source evals for LLMs on q/kdb+
1 erfan_mhi 0 8/15/2025, 3:37:16 PM github.com ↗
Author here. We built an open-source evaluation harness for LLMs on q/kdb+. It includes a q-HumanEval set (164 tasks), reproducible Pass@k scoring, and a public leaderboard.
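Roughly, a task looks like this (a simplified sketch modeled on the original HumanEval JSONL schema; field names here are illustrative, not a spec of our dataset):

    # Hypothetical q-HumanEval-style task record (illustrative only).
    task = {
        "task_id": "q-HumanEval/0",                # illustrative id
        "prompt": "/ Return the sum of all elements of the list x.\nsumList:",
        "canonical_solution": "{sum x}",           # q lambda that completes the prompt
        "test": "15 ~ sumList 1 2 3 4 5",          # q expression expected to yield 1b
    }
    # Scoring concatenates prompt + model completion, runs it in a q session,
    # and checks that the test expression holds.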
Why this matters: top models score ~96% Pass@1 on Python HumanEval, but the best Pass@1 on q-HumanEval is ~43.4%, so there’s clear room for improvement. Early runs also show large gains from multiple attempts (e.g., Grok 4: 43.37% Pass@1 → 74.32% Pass@10).
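For anyone unfamiliar with Pass@k: it is typically reported with the unbiased estimator from the original HumanEval paper — generate n samples per task, count the c that pass the tests, and estimate the probability that at least one of k draws is correct. A minimal sketch of that standard formula (not necessarily the harness’s exact code path):

    import numpy as np

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Unbiased Pass@k estimator: 1 - C(n-c, k) / C(n, k).

        n: samples generated for the task
        c: samples that pass the task's tests
        k: attempts allowed
        """
        if n - c < k:  # every size-k draw must contain a passing sample
            return 1.0
        return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

    # Example: 20 samples per task, 9 pass the tests
    print(pass_at_k(20, 9, 1))   # 0.45 (= c/n)
    print(pass_at_k(20, 9, 10))  # ~0.9999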
We’d love your help with two things: 1. Try it out and add your models to the leaderboard. 2. Contribute new datasets and share feedback on anything we could improve.
• GitHub: https://github.com/KxSystems/q-evaluation-harness/tree/main
• Launch write-up: https://medium.com/kx-systems/introducing-q-evaluation-harne...
• Leaderboard: https://github.com/KxSystems/q-evaluation-harness/blob/main/...
• License: MIT
Happy to answer questions and take PRs.
No comments yet