I'm a full-stack engineer with 15+ years of experience who contracts with Render; I guest-posted my research on their blog.
Up until recently, I was very skeptical of AI coding tools. My AI usage was basically GH Copilot's autocomplete. After cleaning up too many agent mistakes in production, I set up a structured trial in my real projects to measure what these agents can do under real constraints.
I ran two sets of experiments: vibe coding a new application from scratch as a control test, then giving the agents real production tasks. The production work included a backend challenge (building a k8s pod leader-election system across Go microservices) and a frontend one (building out CSS templates in Astro.js).
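For concreteness, the k8s task was roughly this shape: a minimal sketch using client-go's leaderelection package with a Lease lock. The lease name, namespace, and timings below are placeholders, not my actual production code.

    // Rough sketch of Lease-based leader election with client-go.
    package main

    import (
        "context"
        "log"
        "os"
        "time"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/rest"
        "k8s.io/client-go/tools/leaderelection"
        "k8s.io/client-go/tools/leaderelection/resourcelock"
    )

    func main() {
        // Assumes the service runs in-cluster with RBAC to manage Leases.
        cfg, err := rest.InClusterConfig()
        if err != nil {
            log.Fatal(err)
        }
        client := kubernetes.NewForConfigOrDie(cfg)

        // Pod name injected via the downward API; used as the lock identity.
        identity := os.Getenv("POD_NAME")

        lock := &resourcelock.LeaseLock{
            LeaseMeta: metav1.ObjectMeta{
                Name:      "orders-service-leader", // placeholder lease name
                Namespace: "default",               // placeholder namespace
            },
            Client:     client.CoordinationV1(),
            LockConfig: resourcelock.ResourceLockConfig{Identity: identity},
        }

        leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
            Lock:            lock,
            LeaseDuration:   15 * time.Second,
            RenewDeadline:   10 * time.Second,
            RetryPeriod:     2 * time.Second,
            ReleaseOnCancel: true,
            Callbacks: leaderelection.LeaderCallbacks{
                OnStartedLeading: func(ctx context.Context) {
                    log.Printf("%s became leader; starting leader-only work", identity)
                    <-ctx.Done() // do leader-only work here until the lease is lost
                },
                OnStoppedLeading: func() {
                    log.Printf("%s lost leadership; stopping leader-only work", identity)
                },
            },
        })
    }

A Lease-based lock through client-go is the idiomatic in-cluster approach, so a correct solution from an agent looks something like the above.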
I evaluated Cursor, Claude Code, Gemini CLI, and OpenAI Codex across setup friction, number of follow-up prompts, code quality, UX, and context handling.
Cursor won, but it was close. I really liked the Claude Code UX and will try the new Cursor CLI. I plan to run a similar benchmark in the fall using newer features like parallel agents and newer models (maybe GPT-5 or whatever comes next).
Let me know what I should test for round 2 or nitpick the criteria I used. The best tool for you might not be the best tool for me, so I encourage you to run your own experiments. Also happy to answer questions about my methodology.
peggyrayzis · 19h ago
Which one do you continue to reach for the most? I was surprised to see Gemini so low on your list even w/ the long context window.
aldersondev · 16h ago
Hi there! Even though it didn't score the highest (Cursor did!), I loved Claude Code's output, and it's the tool I'm still using now that the testing is done. Anthropic got the UX right, in my opinion!
Gemini surprised me, too! It was a mixed bag: it performed well on the production tests but failed significantly on the control test. I bet it will catch up quickly as the model improves, because that context window is a genuinely useful feature!
prime312 · 18h ago
Any reason why open-source models (like Llama) weren't considered here?
aldersondev · 16h ago
Hi! Great question!
For this round, I decided to compare the (arguably) best models and tools with the most ongoing support and development.
That being said, I love local models! I'm a big fan of Google Gemma running on Ollama, wired into the Aider CLI; I've had excellent luck with that setup.
Maybe round 2 needs to be OSS models!