My weekend project accidentally beat Claude Code – #12 on Stanford's TBench

Danau5tin · 9/2/2025, 9:24:53 AM · github.com ↗

Comments (2)

Danau5tin · 11h ago
Hitting a million brick walls with multi-turn RL training isn't fun, so I thought I'd try something new to climb Stanford's leaderboard for now. This weekend I was just tinkering with multi-agent systems and... somehow ended up beating Claude Code on Stanford's TerminalBench leaderboard (#12)! Genuinely didn't expect this - it started as a fun experiment and ended up with something that works surprisingly well.

*What I did:*

- Built a multi-agent AI system with three specialised agents (sketched after this list):

  - Orchestrator: The brain - never touches code, just delegates and coordinates

  - Explorer agents: Read- and run-only investigators that gather intel

  - Coder agents: The ones who actually implement stuff

- Created a "Context Store" which can be thought of as persistent memory that lets agents share their discoveries.

- Tested on TerminalBench with both Claude Sonnet-4 and Qwen3-Coder-480B.
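
For the curious, here's a rough sketch of the role separation in Python. This is illustrative only, not the actual repo code (`Tool`, `Agent`, etc. are made-up names) - the point is that each role is constructed with only the tools it's allowed to use, so the orchestrator can't touch files even if it wanted to:

```python
from dataclasses import dataclass, field
from enum import Enum, auto

# Hypothetical tool names for illustration; the real repo's toolset differs.
class Tool(Enum):
    READ_FILE = auto()
    RUN_COMMAND = auto()
    WRITE_FILE = auto()
    LAUNCH_SUBAGENT = auto()

@dataclass
class Agent:
    name: str
    tools: set[Tool] = field(default_factory=set)

    def can(self, tool: Tool) -> bool:
        return tool in self.tools

# Orchestrator: delegation only - it never reads or writes code itself.
orchestrator = Agent("orchestrator", {Tool.LAUNCH_SUBAGENT})

# Explorers: read- and run-only investigators.
explorer = Agent("explorer", {Tool.READ_FILE, Tool.RUN_COMMAND})

# Coders: the only role allowed to modify files.
coder = Agent("coder", {Tool.READ_FILE, Tool.RUN_COMMAND, Tool.WRITE_FILE})

assert not orchestrator.can(Tool.WRITE_FILE)  # delegation enforced by construction
```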

*Key results:*

- Orchestrator + Sonnet-4: 36.0% success rate (#12 on leaderboard, ahead of Claude Code!)

- Orchestrator + Qwen-3-Coder: 19.25% success rate

- Sonnet-4 consumed 93.2M tokens vs Qwen's 14.7M tokens to complete all tasks!

- The orchestrator's explicit task delegation + intelligent context sharing between subagents seem to be the secret sauce

*(Kind of) Technical details:*

- The orchestrator can't read/write code directly - this forces proper delegation patterns and strategic planning

- Each agent gets precise instructions about which "knowledge artifacts" to return; these artifacts are then stored and can be provided to future subagents upon launch (see the Context Store sketch below).

- Adaptive trust calibration: simple tasks = high autonomy, complex tasks = iterative decomposition (also sketched below)

- Each agent has its own set of tools it can use.
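
To make the "knowledge artifacts" idea concrete, here's a simplified sketch of the Context Store flow. Again, names like `Artifact` and `ContextStore.bundle` are made up for illustration; the real implementation is in the repo:

```python
from dataclasses import dataclass, field

@dataclass
class Artifact:
    """A "knowledge artifact" a subagent is instructed to return, e.g. a repo map."""
    key: str
    content: str

@dataclass
class ContextStore:
    """Persistent memory shared across all subagents for the duration of a task."""
    artifacts: dict[str, Artifact] = field(default_factory=dict)

    def save(self, artifact: Artifact) -> None:
        self.artifacts[artifact.key] = artifact

    def bundle(self, keys: list[str]) -> str:
        """Render selected artifacts into the prompt of a newly launched subagent."""
        return "\n\n".join(
            self.artifacts[k].content for k in keys if k in self.artifacts
        )

store = ContextStore()
# An explorer reports back, and its findings are persisted...
store.save(Artifact("repo_layout", "CLI entrypoint in src/; tests in tests/"))
# ...then handed to a coder agent at launch, so it starts with the explorer's intel.
coder_prompt = "Fix the failing test.\n\nContext:\n" + store.bundle(["repo_layout"])
```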
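
And the adaptive trust calibration, at its most boiled-down. This is a toy sketch with a hypothetical 1-10 `complexity` score; the real orchestrator judges this from the task description itself:

```python
def plan_delegation(task: str, complexity: int) -> list[str]:
    """Toy version of adaptive trust calibration: `complexity` is a 1-10
    score the orchestrator assigns to the task before delegating."""
    if complexity <= 3:
        # Simple task: high autonomy - one coder gets the whole job in one shot.
        return [f"coder: {task} (full autonomy, report artifacts when done)"]
    # Complex task: iterative decomposition, with orchestrator review between steps.
    return [
        f"explorer: map the code relevant to '{task}'",
        f"coder: implement the first step of '{task}', then stop for review",
        f"coder: finish '{task}' using the stored artifacts from earlier steps",
    ]
```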

*More details:*

My GitHub repo has all the code, system messages, and way more technical details if you're interested! (GitHub handle is danau5tin.)

Thanks for reading!

Dan

NitpickLawyer · 10h ago
Curious how cheaper models would fare. Have you thought about testing gpt5-mini and similar? And even the very small qwen-coder-30b.