My weekend project accidentally beat Claude Code – #12 on Stanford's TBench

Danau5tin · 9/2/2025, 9:24:53 AM · github.com ↗

Comments (2)

Danau5tin · 11h ago
Hitting a million brick walls with multi-turn RL training isn't fun, so I thought I'd try something new to climb Stanford's leaderboard for now. This weekend I was just tinkering with multi-agent systems and... somehow ended up beating Claude Code on Stanford's TerminalBench leaderboard (#12)! Genuinely didn't expect this - it started as a fun experiment and ended up with something that works surprisingly well.

*What I did:*

- Built a multi-agent AI system with three specialised agents (sketched after this list):

  - Orchestrator: The brain - never touches code, just delegates and coordinates

  - Explorer agents: Read- and run-only investigators that gather intel

  - Coder agents: The ones who actually implement stuff

- Created a "Context Store" which can be thought of as persistent memory that lets agents share their discoveries.

- Tested on TerminalBench with both Claude Sonnet-4 and Qwen3-Coder-480B.
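
For the curious, here's a rough sketch of the role separation in Python. This is illustrative only, not the actual repo code (`Tool`, `Agent`, etc. are made-up names) - the point is that each role is constructed with only the tools it's allowed to use, so the orchestrator can't touch files even if it wanted to:

```python
from dataclasses import dataclass, field
from enum import Enum, auto

# Hypothetical tool names for illustration; the real repo's toolset differs.
class Tool(Enum):
    READ_FILE = auto()
    RUN_COMMAND = auto()
    WRITE_FILE = auto()
    LAUNCH_SUBAGENT = auto()

@dataclass
class Agent:
    name: str
    tools: set[Tool] = field(default_factory=set)

    def can(self, tool: Tool) -> bool:
        return tool in self.tools

# Orchestrator: delegation only - it never reads or writes code itself.
orchestrator = Agent("orchestrator", {Tool.LAUNCH_SUBAGENT})

# Explorers: read- and run-only investigators.
explorer = Agent("explorer", {Tool.READ_FILE, Tool.RUN_COMMAND})

# Coders: the only role allowed to modify files.
coder = Agent("coder", {Tool.READ_FILE, Tool.RUN_COMMAND, Tool.WRITE_FILE})

assert not orchestrator.can(Tool.WRITE_FILE)  # delegation enforced by construction
```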

*Key results:*

- Orchestrator + Sonnet-4: 36.0% success rate (#12 on leaderboard, ahead of Claude Code!)

- Orchestrator + Qwen-3-Coder: 19.25% success rate

- Sonnet-4 consumed 93.2M tokens vs Qwen's 14.7M tokens to complete all tasks!

- The orchestrator's explicit task delegation + intelligent context sharing between subagents seem to be the secret sauce

*(Kind of) Technical details:*

- The orchestrator can't read/write code directly - this forces proper delegation patterns and strategic planning

- Each agent gets precise instructions about which "knowledge artifacts" to return; these artifacts are then stored and can be provided to future subagents upon launch (see the Context Store sketch below).

- Adaptive trust calibration: simple tasks = high autonomy, complex tasks = iterative decomposition (also sketched below)

- Each agent has its own set of tools it can use.
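
To make the "knowledge artifacts" idea concrete, here's a simplified sketch of the Context Store flow. Again, names like `Artifact` and `ContextStore.bundle` are made up for illustration; the real implementation is in the repo:

```python
from dataclasses import dataclass, field

@dataclass
class Artifact:
    """A "knowledge artifact" a subagent is instructed to return, e.g. a repo map."""
    key: str
    content: str

@dataclass
class ContextStore:
    """Persistent memory shared across all subagents for the duration of a task."""
    artifacts: dict[str, Artifact] = field(default_factory=dict)

    def save(self, artifact: Artifact) -> None:
        self.artifacts[artifact.key] = artifact

    def bundle(self, keys: list[str]) -> str:
        """Render selected artifacts into the prompt of a newly launched subagent."""
        return "\n\n".join(
            self.artifacts[k].content for k in keys if k in self.artifacts
        )

store = ContextStore()
# An explorer reports back, and its findings are persisted...
store.save(Artifact("repo_layout", "CLI entrypoint in src/; tests in tests/"))
# ...then handed to a coder agent at launch, so it starts with the explorer's intel.
coder_prompt = "Fix the failing test.\n\nContext:\n" + store.bundle(["repo_layout"])
```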
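
And the adaptive trust calibration, at its most boiled-down. This is a toy sketch with a hypothetical 1-10 `complexity` score; the real orchestrator judges this from the task description itself:

```python
def plan_delegation(task: str, complexity: int) -> list[str]:
    """Toy version of adaptive trust calibration: `complexity` is a 1-10
    score the orchestrator assigns to the task before delegating."""
    if complexity <= 3:
        # Simple task: high autonomy - one coder gets the whole job in one shot.
        return [f"coder: {task} (full autonomy, report artifacts when done)"]
    # Complex task: iterative decomposition, with orchestrator review between steps.
    return [
        f"explorer: map the code relevant to '{task}'",
        f"coder: implement the first step of '{task}', then stop for review",
        f"coder: finish '{task}' using the stored artifacts from earlier steps",
    ]
```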

*More details:*

My GitHub repo has all the code, system messages, and way more technical details if you're interested! (GitHub handle is danau5tin.)

Thanks for reading!

Dan

NitpickLawyer · 10h ago
Curious how cheaper models would fare. Have you thought about testing gpt5-mini and similar? And even the very small qwen-coder-30b.