Testing AI coding agents on real codebases

8 points by aldersondev · 8/12/2025, 5:44:56 PM · render.com ↗

Comments (5)

aldersondev · 19h ago
Hey, I'm Mitch (post author).

I'm a full-stack engineer with 15+ years of experience who contracts with Render, and I guest-posted this research on their blog.

Up until recently, I was very skeptical of AI coding tools. My AI usage was basically GH Copilot's autocomplete. After cleaning up too many agent mistakes in production, I set up a structured trial in my real projects to measure what these agents can do under real constraints.

I decided to run two sets of experiments: vibe coding a new application from scratch as a control test, then giving the agents real production tasks. For the production tasks, I gave them challenges from my own projects: building a k8s pod leader election system for Go microservices on the backend, and building out CSS templates in Astro.js on the frontend.
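For anyone unfamiliar with that first task, here's roughly its shape: a minimal sketch of pod leader election using client-go's leaderelection package (the Lease name, namespace, and timings below are illustrative placeholders, not the actual task code):

```go
package main

import (
	"context"
	"log"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	// Running inside the cluster, so use the pod's service account.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Each replica identifies itself by pod name (injected via the Downward API).
	id := os.Getenv("POD_NAME")

	// A Lease object in the service's namespace acts as the shared lock.
	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Name: "my-service-leader", Namespace: "default"},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: id},
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 15 * time.Second, // how long a leader holds the lock without renewing
		RenewDeadline: 10 * time.Second, // leader must renew within this window
		RetryPeriod:   2 * time.Second,  // how often non-leaders retry acquisition
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				log.Println("became leader; starting leader-only work")
				<-ctx.Done() // run leader-only work until we lose the lock
			},
			OnStoppedLeading: func() {
				log.Println("lost leadership; standing by")
			},
		},
	})
}
```

The Lease is the shared lock: whichever pod holds and renews it runs the leader-only work, and the callbacks handle the handoff when a leader dies.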

I evaluated Cursor, Claude Code, Gemini CLI, and OpenAI Codex across setup friction, # of follow-up prompts, code quality, UX, and context handling.

Cursor won, but it was close. I really liked the Claude Code UX, and I'll try the new Cursor CLI. I plan to run a similar benchmark in the fall with new features like parallel agents and newer models (maybe GPT-5 or whatever comes next).

Let me know what I should test for round 2 or nitpick the criteria I used. The best tool for you might not be the best tool for me, so I encourage you to run your own experiments. Also happy to answer questions about my methodology.

peggyrayzis · 19h ago
Which one do you continue to reach for the most? I was surprised to see Gemini so low on your list even w/ the long context window.
aldersondev · 16h ago
Hi there! Even though it didn't score the highest (Cursor did!), I loved the code Claude produced, and it's the tool I'm still using now that testing is complete. Anthropic got the UX right, in my opinion.

Gemini surprised me, too! It was a mixed bag: it performed well in the production tests but failed significantly on the control test. I bet it will catch up quickly as the model improves, because that long context window is a real advantage.

prime312 · 18h ago
Any reason why open-source models (like Llama) weren't considered here?
aldersondev · 16h ago
Hi! Great question!

For this round, I decided to compare the (arguably) best models and tools with the most ongoing support and development.

That being said, I love local models! I'm a big fan of running Google Gemma on Ollama and wiring it into the Aider CLI; I've had excellent luck with that setup.

Maybe round 2 needs to be OSS models!