Tau² Benchmark: How a Prompt Rewrite Boosted GPT-5-Mini by 22%

122 points · posted by blndrt · 29 comments · 9/17/2025, 1:03:24 PM · quesma.com ↗

Comments (29)

jari_mustonen · 3h ago
Here is the summary of key improvements made:

1. Structure & Flow

    - Decision Trees: Clear branching logic with ├── and └── notation

    - Sequential Steps: Numbered, ordered procedures instead of scattered explanations

    - Prerequisites: Explicit dependency checks before proceeding
2. AI Agent Optimizations

    - Tool Call Clarity: Exact function names and parameters

    - Binary Decisions: Clear yes/no conditions instead of ambiguous language

    - Error Handling: Specific failure conditions and next steps

    - Verification Steps: "Recheck" instructions after each fix
3. Cognitive Load Reduction

    - Reference Tables: Quick lookup for tools and purposes

    - Pattern Recognition: Common issue combinations and their solutions

    - Critical Reminders: Common AI mistakes section to prevent errors
4. Actionable Language

    - Removed verbose explanations mixed with instructions

    - Consolidated multiple documents' logic into single workflows 

    - Used imperative commands: "Check X", "If Y then Z"

    - Added immediate verification steps
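To make the list above concrete, here is an invented before/after fragment (hypothetical, not taken from the article's actual Tau² prompts) showing several of these techniques applied at once: decision-tree branching with ├──/└── notation, binary conditions, exact tool names, and a recheck step.

```text
Before:
  When the customer reports connectivity problems, it is usually a good
  idea to look into whether airplane mode might be enabled, and you may
  also want to consider checking the APN settings at some point.

After:
  1. Check device status (get_device_status)
     ├── airplane_mode = on  → disable it, then recheck connectivity
     └── airplane_mode = off → continue to step 2
  2. Check APN settings (get_apn_settings)
     ├── APN matches carrier default → continue to step 3
     └── APN differs                 → reset to default, then recheck
```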
brendoelfrendo · 2h ago
Wait, are we about to reinvent programming from first principles?
inerte · 5m ago
Maybe one day we will all be using https://shakespearelang.com/
ranie93 · 1h ago
Seemingly it's always been on a scale between directly editing 1s and 0s and drafting legislation. Compile times may vary
whateveracct · 51m ago
I'd say it's more "programming with extra steps"
ivape · 50m ago
In other words, just like programming, we’re writing better instructions. In this case, we’re asking it to think out loud more clearly. It’s almost like whiteboard interview prep.

It’s quite amazing because it means programming is fully entering the natural language phase of the timeline.

If you aren’t a solid clear writer, you may not make it in the brave new world.

johnrob · 5m ago
Isn’t programming the clearest form of writing? Perhaps it’s the non programmers that need to “catch up”.
mhuffman · 15m ago
>If you aren’t a solid clear writer, you may not make it in the brave new world.

Have you not heard of all the AI startups that can turn a 3-word thought into very clearly written prose to be lovingly poured into the waiting mouth of your AI agent?

idiotsecant · 22m ago
The computers of the future will be operated by shamans making incantations more than technicians writing code.
dlojudice · 3h ago
I wish they had published what prompt was given to Claude to improve GPT-5-mini's performance, as well as a before and after comparison of a prompt that underwent this transformation.
blndrt · 2h ago
Thanks for the feedback, appreciate it! It makes a lot of sense - I'll update the article with links to the actual prompts. Initially I thought these would be too lengthy for the article and no one would care, but it seems people are really interested in them. Of course I'd be happy to share the details.
seunosewa · 1h ago
I checked and also couldn't find the prompt.
blndrt · 18m ago
I published an update - you should be able to find that information at the end of the post.

Should be available now, although it might take a while for the CDN to propagate.

alejoar · 13m ago
Thanks for sharing!
amelius · 2h ago
My take: we have no clue how this works, and the performance could just as easily drop tomorrow.
lubesGordi · 4m ago
My hypothesis: the prompt shrank while conveying the same amount of information.
caminanteblanco · 1h ago
The only problem is that having Claude rewrite the prompt negates some of the efficiency and latency benefits of using the mini model. For system prompts obviously this doesn't matter, but for actual continuous user interaction, it feels unworkable.

It definitely makes sense that improving formatting and clarity for these smaller models would really help with performance, but I'm wondering if GPT-5-mini is already smart enough to handle that reformatting, and can rewrite the prompt itself before handing it off to another instance of itself.

Overall an awesome article!
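The self-rewrite idea above can be sketched in a few lines of Python. Everything here is hypothetical: `ask` stands in for whatever chat-completion call you use, and the rewrite instruction is invented for illustration, not taken from the article.

```python
# Two-pass "self-rewrite" pipeline: the same small model first cleans up
# its own instructions, then answers using the cleaned-up prompt.
# `ask` is any callable taking a prompt string and returning a reply.

REWRITE_INSTRUCTION = (
    "Rewrite the following instructions for an AI agent: use numbered "
    "steps, binary conditions, and exact tool names. Output only the "
    "rewritten instructions.\n\n"
)

def self_rewrite_then_answer(ask, system_prompt: str, user_message: str) -> str:
    """First pass: the model reformats its own instructions.
    Second pass: the same model answers using the rewritten prompt."""
    rewritten = ask(REWRITE_INSTRUCTION + system_prompt)
    return ask(f"{rewritten}\n\nUser: {user_message}")
```

Note the trade-off the comment raises: this doubles the number of model calls per interaction, which is exactly the latency cost that makes it questionable for continuous user-facing use.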

csoham · 3h ago
Really interesting. What did the original prompt look like? Perhaps the original prompt just wasn't that good? I feel like the changes Claude suggested (except a couple, maybe) are already pretty well-known prompt engineering practices.
blndrt · 3h ago
Thank you for the feedback!

In this (telecom) benchmark you can review agent policies and manuals here: 1) https://github.com/sierra-research/tau2-bench/blob/main/data... 2) https://github.com/sierra-research/tau2-bench/blob/main/data...

Of course these are just parts of the prompt; you can inspect the benchmark code to see how they are rendered into actual LLM calls.

In case someone is not familiar with the framework's methodology, I've written a separate article covering that (with some of my thoughts) -> https://quesma.com/blog/tau2-from-llm-benchmark-to-blueprint...

tibbar · 2h ago
> Removed verbose explanations mixed with instructions

Is Claude rewriting generic instructions once, or is it rewriting the core task statement each time? If so, I'm not sure how you prevent information leakage: Claude might easily be "solving" some of the tasks and inserting subtle hints on the approach. I think this result is very interesting if it holds after rewriting only the generic instructions, even if the performance boost is lower.
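One way to avoid the leakage concern raised above is to rewrite only the static, task-independent instructions, caching the result so the rewriter never sees an individual task. A minimal Python sketch (the `rewrite` callable is a placeholder for a call to a rewriting model; none of this comes from the article):

```python
from functools import lru_cache

def make_prompt_builder(rewrite):
    """Return a builder that rewrites generic instructions at most once
    per distinct policy text, so the rewriting model never sees any
    individual task and cannot leak solution hints into it."""
    @lru_cache(maxsize=None)
    def rewritten(instructions: str) -> str:
        return rewrite(instructions)

    def build(generic_instructions: str, task: str) -> str:
        # The task statement is appended verbatim, untouched by the rewriter.
        return f"{rewritten(generic_instructions)}\n\nTask:\n{task}"

    return build
```

Because the cache is keyed on the instruction text, the rewriter runs once per policy document rather than once per task, which also addresses the per-call latency objection raised elsewhere in the thread.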

BrunoDCDO · 3h ago
I wonder if it would be possible to improve even further on the benchmark by simply showing Claude the current hardest problems and asking it to improve the prompt without including any specifics related to the problems
moralestapia · 3h ago
No before/after prompt.

Into the trash it goes.

CuriouslyC · 2h ago
This sort of stuff is well-trodden ground; if this seems exciting to you, check out DSPy.
mccoyb · 1h ago
Many of the "look at what I did programming LLMs" blog posts on Hacker News have been developed and put out in academic papers and groups. The posts which gain traction here seem to be perennially behind the state of the art.
bigwheels · 2h ago
https://dspy.ai/tutorials/tool_use/

Definitely interesting, thank you!

barrkel · 3h ago
Using an LLM to (re)write your prompt or system prompt (for local models) is free alpha.
grej · 2h ago
DSPy was ahead of its time and still underutilized.
behnamoh · 18m ago
Can you point me to any resources on DSPy that don't make it look like magic though? It used to be all the hype for a while and then everyone moved on from it.
doctorpangloss · 50m ago
You would also be interested in DSPy...