Qodo CLI agent scores 71.2% on SWE-bench Verified

100 bobismyuncle 35 8/12/2025, 11:05:59 AM qodo.ai ↗

Comments (35)

gronky_ · 3h ago

I’ve been running a bunch of coding agents on benchmarks recently as part of consulting, and this is actually much more impressive than it seems at first glance.

71.2% puts it at 5th, which is 4 points below the leader (four points is a lot) and just over 1% lower than Anthropic’s own submission for Claude Sonnet 4 - the same model these guys are running.

But the top rated submissions aren’t running production products. They generally have extensive scaffolding or harnesses that were built *specifically for SWE bench*, which kind of defeats the whole purpose of the benchmark.

Take Refact, for example, which is at #2 with 74.4%. They built a framework of about 2k lines of code around their agent specifically for SWE bench (https://github.com/smallcloudai/refact-bench/). It’s pretty elaborate, orchestrating multiple agents, with a debug agent that kicks in if the main agent fails. The debug agent analyzes the failure and gives insights to the main agent, which tries again, so it’s effectively multiple attempts per problem.
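A minimal sketch of the retry-with-debug-agent pattern described above, to make "effectively multiple attempts per problem" concrete. This is not Refact's actual code: `Attempt`, `solve_with_debug_agent`, `main_agent`, `debug_agent` and `max_rounds` are all hypothetical names, and `passed` stands for the agent's own judgement of its patch, not the hidden benchmark tests.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Attempt:
    patch: str    # proposed change
    passed: bool  # the agent's own check, NOT the hidden benchmark tests
    log: str      # whatever trace the agent produced


def solve_with_debug_agent(
    issue: str,
    main_agent: Callable[[str, Optional[str]], Attempt],
    debug_agent: Callable[[str, Attempt], str],
    max_rounds: int = 2,
) -> Attempt:
    """Run the main agent; on failure, ask the debug agent for a hint and retry."""
    attempt = main_agent(issue, None)  # first try, no hint
    for _ in range(max_rounds - 1):
        if attempt.passed:
            break
        hint = debug_agent(issue, attempt)  # analyze the failed attempt
        attempt = main_agent(issue, hint)   # retry with the extra insight
    return attempt
```

From the outside this returns a single patch per problem, but internally the main agent may have run up to `max_rounds` times.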

If the results can be reproduced “out-of-the-box” with their coding agent like they claim, it puts it up there as one of the top 2-3 CLI agents available right now.

thinkingtoilet · 2h ago

This is classic Goodhart's law. "When a measure becomes a target, it ceases to be a good measure"

https://en.wikipedia.org/wiki/Goodhart%27s_law

ambicapter · 1h ago

It's really not that hard to just use your product straight out of the box instead of building a custom bench setup to game the benchmark, though.

jasonjmcghee · 8m ago

Right. Building a custom setup is blatant; that will wildly overfit.

But let's say a group uses it as a metric as part of CI and each new idea / feature they create runs against SWE bench. Maybe they have parameterized bits and pieces they adjust, maybe they have multiple candidate datasets for fine tuning, maybe they're choosing between checkpoints.

This will also end up overfitting - especially if done habitually. It might be a great metric and result in a more powerful overall model. Or it might not.

clutchdude · 1h ago

Also see the VW dieselgate and numerous other "gaming the system" examples.

energy123 · 3h ago

What are the typical context lengths in SWE-bench problems? Does it partly measure performance in the 64-128k context range?

whymauri · 32m ago

This is what the rows look like:

https://huggingface.co/datasets/princeton-nlp/SWE-bench_Veri...

It's up to your retrieval system/model to selectively hunt for relevant context. Here are a few critiques of the benchmark:

https://x.com/brhydon/status/1953648884309536958

dimitri-vs · 2h ago

IIRC the SWE bench dataset gives you the full repo snapshot + the issue text, and the evaluation pipelines typically run some kind of retriever (e.g. grep, BM25) to pick a subset of files to place in the model’s context. The provided context is usually limited to ~50k tokens.
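A rough sketch of that kind of retrieval step: rank repo files against the issue text with a plain BM25 score, then keep the top files that fit a token budget. This is an illustration, not the SWE-bench pipeline itself. `tokenize`, `bm25_rank` and `select_context` are hypothetical helpers, the regex tokenizer is a crude stand-in for a real model tokenizer, and `k1`/`b` are the commonly used default values.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Crude stand-in for a real tokenizer (an assumption for this sketch).
    return re.findall(r"[a-z0-9_]+", text.lower())


def bm25_rank(query, docs, k1=1.5, b=0.75):
    """docs maps path -> file text. Returns (path, score) pairs, best first."""
    if not docs:
        return []
    tokens = {path: tokenize(text) for path, text in docs.items()}
    n_docs = len(tokens)
    avg_len = sum(len(t) for t in tokens.values()) / n_docs
    doc_freq = Counter()
    for toks in tokens.values():
        doc_freq.update(set(toks))  # number of files containing each term
    query_terms = set(tokenize(query))
    scores = {}
    for path, toks in tokens.items():
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[path] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


def select_context(query, docs, token_budget):
    """Greedily keep the best-scoring files that still fit in the budget."""
    chosen, used = [], 0
    for path, score in bm25_rank(query, docs):
        cost = len(tokenize(docs[path]))
        if score <= 0 or used + cost > token_budget:
            continue
        chosen.append(path)
        used += cost
    return chosen
```

With a budget around the ~50k tokens mentioned above, `select_context(issue_text, repo_files, 50_000)` would return the file paths to place in the prompt.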

terminalshort · 1h ago

Is there something in this multi-agent approach that makes the setup more specific to just the test at hand and less general to real engineering tasks? If not, then this multi-agent system will just become what you get out of the box in a future product. Multiple attempts per problem (as long as there's no human intervention or selection between them) is a perfectly fine approach for agents because that's not an issue from the perspective of an engineer using the product. A single agent is already a multi-step usage of LLMs and it sounds like this is just another meta level of that.

eddd-ddde · 2h ago

I think multiple attempts are completely understandable and even expected? How is that defeating the purpose of the benchmark?

gronky_ · 2h ago

It’s a pass@1 benchmark. When submitting you need to check a box that there was only 1 attempt per problem. See here for example: https://github.com/SWE-bench/experiments/pull/219

Building multiple attempts into your agent is stretching the rules, even if technically it’s acceptable

terminalshort · 1h ago

From my perspective as a potential user the number of attempts is the number of times I have to tell it what to do. If you have an agent that makes a single attempt and is 60% accurate vs another that makes 5 attempts and is 80% accurate, why would you care that each individual attempt of the 2nd model is less accurate than the first?
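A small worked example of the arithmetic behind that comparison. It rests on two loud assumptions that are not in the thread: attempts are independent, and the agent can reliably tell when an attempt has succeeded. The function names are hypothetical.

```python
def success_within(per_attempt_rate: float, attempts: int) -> float:
    """Chance that at least one of `attempts` independent tries succeeds."""
    return 1.0 - (1.0 - per_attempt_rate) ** attempts


def per_attempt_rate_needed(target_rate: float, attempts: int) -> float:
    """Per-attempt success rate that reaches `target_rate` within `attempts` tries."""
    return 1.0 - (1.0 - target_rate) ** (1.0 / attempts)


single_shot = success_within(0.60, 1)      # the one-attempt agent
needed = per_attempt_rate_needed(0.80, 5)  # the five-attempt agent
print(f"single attempt agent: {single_shot:.0%}")
print(f"per-attempt rate needed for 80% in 5 tries: {needed:.1%}")
```

Under those assumptions the five-attempt agent only needs a little under 28% per attempt to reach 80% overall. The second assumption carries most of the weight: without a trustworthy success signal, extra attempts cannot be selected between.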

radarsat1 · 18m ago

I was thinking about this recently with respect to how many agent systems now let you specify a smaller/faster model for easier tasks and a bigger model for harder tasks.

It's interesting to think about what the trade-offs are. Assuming the system can properly classify a task as easy or hard (big "if" but I guess there are ways), there is nonetheless more to think about, depending on your pricing plan.

For subscription pricing, I guess you don't really care which model runs and in fact it's hard to find a reason to ever run the smaller model, so choosing between the models is more in the provider's interests for cost efficiency.

But for pay-per-use pricing, if you have a bigger model that can get the answer right 80% of the time, and a smaller model that can handle smaller changes and get things right 60% of the time but correct its mistakes, then the system should try to run the smaller model on as many tasks as possible to save you money. But if it ends up having to make a lot of corrections, then maybe you end up needing more total requests than with the larger model. In that case maybe it's actually cheaper to run the larger model, if it takes fewer requests.

So I wonder how that kind of trade-off could be effectively calculated. I guess if you can figure out when "retries" happen you can count them and do some statistics on which model is more likely to work out in fewer shots. It's pretty complicated though, when you start to think about it in detail.
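One hedged way to put numbers on that trade-off: if every request succeeds independently with probability p and failures are always noticed and retried, the expected number of requests per solved task is 1/p, so the expected cost is price / p. The prices below are made up for illustration, and both function names are hypothetical.

```python
def expected_cost_per_solved_task(price_per_request: float, success_rate: float) -> float:
    """Expected spend until the first success, assuming independent retries."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return price_per_request / success_rate


def cheaper_model(small_price, small_rate, big_price, big_rate):
    """Return which model has the lower expected cost per solved task."""
    small = expected_cost_per_solved_task(small_price, small_rate)
    big = expected_cost_per_solved_task(big_price, big_rate)
    return "small" if small < big else "big"


# Hypothetical prices: the small model costs 1.0 per request, the big one 2.0.
print(cheaper_model(1.0, 0.60, 2.0, 0.80))
```

With the 60% and 80% success rates from the comment, the smaller model wins only while its per-request price stays below 0.6 / 0.8 = 75% of the bigger model's price. Real retries are rarely independent, so measured retry counts would be a better input than a fixed rate.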

I do wonder if even having BOTH the smaller and bigger model make hypotheses, and try the smaller model's idea first, then if it fails, try the bigger model's idea, might be the way to go.

mcintyre1994 · 1h ago

I think it depends on "But the top rated submissions aren’t running production products". It sounds like they're shipping a product without the debug agent/try-again logic, and that's just for the benchmark, so you wouldn't get the performance they get as a user.

whymauri · 30m ago

Papers have been doing rollouts that involve a model proposing N solutions and then self-reviewing to choose the best one (prior to the verifier). So far, I think that's been counted as one pass.
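A minimal sketch of that propose-N-then-self-review rollout, assuming the selection uses only the model's own review score and never the benchmark's verifier. `best_of_n`, `propose` and `review` are hypothetical names, not taken from any particular paper.

```python
from typing import Callable, TypeVar

P = TypeVar("P")  # problem type
C = TypeVar("C")  # candidate solution type


def best_of_n(
    problem: P,
    propose: Callable[[P], C],
    review: Callable[[P, C], float],
    n: int = 4,
) -> C:
    """Sample n candidates, score each with self-review, return the top one."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [propose(problem) for _ in range(n)]
    scores = [review(problem, c) for c in candidates]  # no verifier involved
    best_index = max(range(n), key=scores.__getitem__)
    return candidates[best_index]
```

Only the single returned candidate is submitted to the verifier, which is the argument for counting the whole rollout as one pass.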

gronky_ · 1h ago

This ok from your perspective then?

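# Wrap an agent so that up to n internal attempts are reported as one "pass@1" run.
def make_pass_at_1_agent(agent, n):
    def retry_agent(problem):
        result = None  # stays None only if n < 1
        for _ in range(n):
            result = agent(problem)
            if result.success:
                return result  # stop at the first successful attempt
        return result  # every attempt failed; return the last one
    return retry_agent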

gronky_ · 1h ago

Keep in mind that this isn’t about users - the top agents on the leaderboard aren’t running an actual product on the benchmark.

If they are running their production product as is, then of course whatever is built into the product is fine.

DougBTX · 59m ago

Absolutely fine, as long as the success flag is predicted by the model ensemble under test. That’s how Claude Code works for example, it will continue to iterate until success (or it will give up with failure at a certain point).

Roritharr · 2h ago

Finally someone mentions Refact, I was in contact with the team, rooting for them really.

bluelightning2k · 17m ago

Just looked them up. Their pricing is around buying "coins" with no transparency as to what that gets. Hard pass

oblio · 2h ago

https://github.com/auchenberg/volkswagen

szundi · 3h ago

In your experience with this model, is it just trained for the benchmark, or do these points actually represent its performance?

ai-christianson · 1h ago

One thing with SWE bench is making sure there's zero leakage of information into the LLM context.

I.e. the agent cannot even know which tests are failing.

It has to both fix the issue based just on the issue text and fix it in the specific way the unit test, which it cannot see, expects.

For this reason I find the benchmark a little disconnected from the reality of software engineering.

zuzuen_1 · 6m ago

I would be more interested in Qodo's performance on the swe-bench-multilingual benchmark. Swe-bench-verified only includes bugs related to Python breakages.

The best submission on swe-bench-multilingual is Claude 3.7 Sonnet, which solves ~43% of the issues in the dataset.

zuzuen_1 · 11m ago

Does anyone have a benchmark on the effectiveness of using embeddings to map bug reports to code files, as opposed to the extensive grepping that Qodo, Cursor and a number of tools I use do to localize faults?

esafak · 13m ago

If Qodo is reading: please compare your efficiency too. Run some tasks on various agents using the same models, and report the cost.

itamarcode · 42m ago

Unlike most SWE bench submissions, the Qodo Command one uses the product directly.

I think that the next step is getting an official "checked" mark by the SWE bench team

whymauri · 26m ago

I feel like the bash only SWE Bench Verified (a.k.a model + mini-swe-agent) is the closest thing to measuring the inherent ability of the model vs. the scaffolding.

https://github.com/SWE-agent/mini-swe-agent

khalic · 1h ago

We need some international body to start running these tests… I just can’t trust these numbers any longer. We need a platform for this, something where we can at least get some peer review.

redman25 · 37m ago

That sounds like an interesting idea to me. It would at least resolve the problem of companies gaming the metric.

Another approach might be the LiveBench approach where new tests are released on a regular basis.

OldGreenYodaGPT · 20m ago

Was using their bot for code review for the last 2 years but just dropped it for BugBot

orangebread · 2h ago

I've been using Warp for the past few weeks and it's been incredibly impressive over other agentic coding services/platforms. Curious how Qodo stacks up.

lightbendover · 1h ago

When I tried warp I was convinced that was where the industry was going (agents as terminal replacement), but it felt a bit too heavy to me so I haven’t been using it lately. Still think all things will converge on terminal and browser replacement.

mupuff1234 · 1h ago

I'm curious how these LLM wrapper companies think they'll survive long term - especially coding-related wrappers.

I could understand focusing on a niche business use case, but coding is a main focus of the foundation models themselves.

rs186 · 1h ago

So this is from the same company that wrote a blog post with sentences that don't even make sense:

https://news.ycombinator.com/item?id=44833929, my comment https://news.ycombinator.com/item?id=44835939
