Testing AI coding agents on real codebases

5 points by aldersondev | 3 comments | 8/12/2025, 5:44:56 PM | render.com ↗

Comments (3)

aldersondev · 2h ago
Hey, I'm Mitch (post author).

I'm a 15+ year full-stack engineer who contracts with Render and guest-posted my research for their blog.

Up until recently, I was very skeptical of AI coding tools. My AI usage was basically GH Copilot's autocomplete. After cleaning up too many agent mistakes in production, I set up a structured trial in my real projects to measure what these agents can do under real constraints.

I decided to run two sets of experiments: vibe coding a new application from scratch as a control test, then giving the agents real production tasks. For the production tasks, I pulled real work from my projects: a backend challenge (building a k8s pod leader election system for Go microservices) and a frontend one (building out CSS templates in Astro.js).
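To give a sense of the scope of the leader election task, it boils down to roughly the pattern below. This is a simplified sketch built on client-go's leaderelection package, not the code the agents actually produced, and the lease name ("example-leader"), namespace, and timings are placeholders:

    package main

    import (
    	"context"
    	"log"
    	"os"
    	"time"

    	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    	"k8s.io/client-go/kubernetes"
    	"k8s.io/client-go/rest"
    	"k8s.io/client-go/tools/leaderelection"
    	"k8s.io/client-go/tools/leaderelection/resourcelock"
    )

    func main() {
    	// In-cluster config; the pod's service account needs RBAC access to Lease objects.
    	cfg, err := rest.InClusterConfig()
    	if err != nil {
    		log.Fatal(err)
    	}
    	client := kubernetes.NewForConfigOrDie(cfg)

    	// Each replica identifies itself by hostname (the pod name by default).
    	id, _ := os.Hostname()

    	// All replicas coordinate through a single Lease in the namespace.
    	lock := &resourcelock.LeaseLock{
    		LeaseMeta:  metav1.ObjectMeta{Name: "example-leader", Namespace: "default"},
    		Client:     client.CoordinationV1(),
    		LockConfig: resourcelock.ResourceLockConfig{Identity: id},
    	}

    	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
    		Lock:            lock,
    		LeaseDuration:   15 * time.Second,
    		RenewDeadline:   10 * time.Second,
    		RetryPeriod:     2 * time.Second,
    		ReleaseOnCancel: true,
    		Callbacks: leaderelection.LeaderCallbacks{
    			OnStartedLeading: func(ctx context.Context) {
    				log.Printf("%s is now the leader; run leader-only work here", id)
    				<-ctx.Done() // block until leadership is lost or the process shuts down
    			},
    			OnStoppedLeading: func() {
    				log.Printf("%s lost leadership", id)
    			},
    			OnNewLeader: func(leader string) {
    				log.Printf("current leader: %s", leader)
    			},
    		},
    	})
    }

The interesting part for the agents was everything around this: wiring it into existing services, handling failover of the leader-only work, and not breaking deploys.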

I evaluated Cursor, Claude Code, Gemini CLI, and OpenAI Codex across setup friction, number of follow-up prompts, code quality, UX, and context handling.

Cursor won, but it was close. I really liked the Claude Code UX and will try the new Cursor CLI. I plan to run a similar benchmark in the fall with newer features like parallel agents and newer models (maybe GPT-5 or whatever comes next).

Let me know what I should test for round 2 or nitpick the criteria I used. The best tool for you might not be the best tool for me, so I encourage you to run your own experiments. Also happy to answer questions about my methodology.

peggyrayzis · 2h ago
Which one do you continue to reach for the most? I was surprised to see Gemini so low on your list even w/ the long context window.
prime312 · 1h ago
Any reason why open-source models (like Llama) weren't considered here?