New #1 open-source AI Agent on SWE-bench Verified

28 points by laxyz | 15 comments | 5/22/2025, 10:22:43 AM | refact.ai ↗

Comments (15)

MukundMohanK · 7h ago
Between last April and now, SWE-bench scores have gone up from 25% to 70%.

Sure, models are being overfitted to the dataset. But with most of them performing similarly even across the hardest third-party benchmarks (think FrontierMath back in November versus now), we're closer than ever to a specialisation shift.

Hard to say at what %, but once code reviews get better, it's likely 2025 is the last year SWE is a sought-after job, on both the demand and supply side.

candiddevmike · 6h ago
SWE-bench scores, like a lot of other metrics for LLMs, are pretty divorced from reality IMO. It's a lot like learning only to pass tests versus actually understanding.

Once GenAI companies stop hiring SWEs, I'll believe the doomers.

harshitaneja · 6h ago
I help hire for a few clients as well as for my own small organization. We are already seeing the impact of these tools on our hiring. For the same responsibilities and tasks, we now require fewer people. For clients with less complex problems, we are able to manage similar work with 60% of the staffing originally planned. And that's when most of our work is mathematical modelling, heuristics, constraint programming, and the like. However, I don't foresee, at least for the next few years, a scenario where we don't hire developers at all, given that most hiring has shifted to senior developers only.
dingnuts · 3h ago
Being able to do more with fewer resources (which lowers costs) always increases demand enough to make up for the reduction in labor caused by the automation.

Analogy: when the chainsaw was invented, we didn't stop having lumberjacks; they just learned to use chainsaws.

MukundMohanK · 6h ago
Reality is here whether we like it or not - https://fred.stlouisfed.org/graph/?g=1DEP0
hackeman300 · 5h ago
Surely there are no other macroeconomic factors that could have played a role in this decline, too.
predkambrij · 6h ago
I would like to know why this post got flagged. Is it misleading, or is the software dangerous? If it's truly the #1 open-source agent on SWE-bench, that's quite impressive.
grammarxcore · 6h ago
> Many samples have an issue description that is underspecified, leading to ambiguity on what the problem is and how it should be solved.

OpenAI apparently tuned _basic discovery and refinement_ out of the tests so I don’t think this is a benchmark of anything useful. It can’t replace a human but can possibly make a human more productive.

https://openai.com/index/introducing-swe-bench-verified/

nateburke · 7h ago
Am I correct in understanding that SWE-bench is limited to python?
simonw · 7h ago
The core benchmark is only Python, but there is also SWE-bench Multimodal which uses JavaScript: https://arxiv.org/abs/2410.03859

And the new SWE-bench Multilingual (released a couple of weeks ago) covers nine programming languages: C, C++, Go, Java, JavaScript, TypeScript, PHP, Ruby, and Rust: https://www.swebench.com/multilingual.html

babushkaboi · 7h ago
Yeah, they're all Python at the moment.
laxyz · 8h ago
The full pipeline used for SWE-bench Verified is open-source: https://github.com/smallcloudai/refact-bench
amarcheschi · 7h ago
I think the title doesn't make it clear that the results were obtained with closed models.
brrrrrm · 7h ago
Open-source use of closed source models?
NicuCalcea · 7h ago
Looks like they support self-hosted models: https://docs.refact.ai/supported-models/#self-hosted-version