GPT-5 on SWE-bench: Cost and performance deep-dive

4 lieret 3 8/8/2025, 4:29:14 PM mini-swe-agent.com ↗

Comments (3)

lieret · 3h ago
We evaluated the new GPT models with a minimal agent on SWE-bench Verified. GPT-5 scores 65%, mini 60%, nano 35%. Still behind Opus 4.1 (68%), on par with Sonnet 4 (65%). But a lot cheaper, especially mini!

Cost is tricky to compare with agents, because agents succeed fast but fail slowly. If an agent doesn't succeed, it just keeps trying until it either succeeds or hits a runtime limit. And that's (almost) what happens in practice.
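To make the "succeed fast, fail slowly" asymmetry concrete, here's a back-of-envelope sketch. All the numbers (steps per run, cost per step) are made-up placeholders, not measured values from the evaluation:

```python
# Hypothetical numbers for illustration only (not measured values):
# a successful run finishes in a few steps, while a failing run
# burns its whole step budget before the agent gives up.
steps_success = 30    # assumed average steps on solved instances
steps_fail = 100      # assumed step limit reached on unsolved instances
cost_per_step = 0.02  # assumed average $ per agent step

def expected_cost(solve_rate: float) -> float:
    """Average $ per instance: solved runs stop early,
    failed runs keep going until the step limit."""
    return (solve_rate * steps_success
            + (1 - solve_rate) * steps_fail) * cost_per_step

# A weaker model pays twice: it solves less AND spends more per run.
print(round(expected_cost(0.65), 2))  # → 1.09
print(round(expected_cost(0.35), 2))  # → 1.51
```

This is why average cost per instance penalizes weaker models disproportionately: failed runs dominate the bill.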

But even so, it's very clear that

1. GPT-5 is cheaper than Sonnet 4
2. GPT-5-mini is _incredibly_ cheap for what it provides (you only sacrifice some 5 percentage points, but end up paying maybe 1/5th of the total cost)
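One way to read that tradeoff is cost per resolved instance. The solve rates and the 1/5th cost ratio come from the comment above; the dollar figures themselves are made up for illustration:

```python
# Illustrative only: solve rates are from the benchmark numbers above,
# but the total_cost dollar amounts are assumed, keeping the
# "maybe 1/5th of the total cost" ratio mentioned in the comment.
gpt5 = {"solve_rate": 0.65, "total_cost": 100.0}
gpt5_mini = {"solve_rate": 0.60, "total_cost": 20.0}

n_instances = 500  # SWE-bench Verified has 500 instances

for name, m in [("gpt-5", gpt5), ("gpt-5-mini", gpt5_mini)]:
    solved = m["solve_rate"] * n_instances
    cost_per_solve = m["total_cost"] / solved
    print(name, round(cost_per_solve, 3))
# gpt-5      ≈ $0.308 per resolved instance
# gpt-5-mini ≈ $0.067 per resolved instance
```

Under these assumptions, mini resolves each instance at well under a quarter of the cost, which is the "incredibly cheap for what it provides" point in the list above.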

All of the code to reproduce our numbers is open-source, and there's a box at the bottom of the page with the exact command to run.

Also very happy to answer questions here!

techpineapple · 3h ago
I'm curious whether this might help with Cursor's lighting-money-on-fire problem?

https://pivot-to-ai.com/2025/07/09/cursor-tries-setting-less...

Is this enough of a price difference to make Cursor profitable?

lieret · 3h ago
I think gpt-5-mini should really help them. Judging by these benchmark scores, there probably wouldn't be a huge performance degradation from letting gpt-5-mini drive most of the workflow. Of course, users might still want to run with the latest and greatest (but even then, gpt-5 will be cheaper, I think).