Testing AI coding agents on real codebases

5 points by aldersondev | 3 comments | 8/12/2025, 5:44:56 PM | render.com ↗

Comments (3)

aldersondev · 2h ago
Hey, I'm Mitch (post author).

I'm a 15+ year full-stack engineer who contracts with Render and guest-posted my research for their blog.

Up until recently, I was very skeptical of AI coding tools. My AI usage was basically GH Copilot's autocomplete. After cleaning up too many agent mistakes in production, I set up a structured trial in my real projects to measure what these agents can do under real constraints.

I decided to run two sets of experiments: vibe coding a new application from scratch as a control test, then giving the agents real production tasks. For the production tasks, I pulled real work from my projects: a backend challenge (building a k8s pod leader election system for Go microservices) and a frontend one (building out CSS templates in Astro.js).
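To give a sense of the scope of the leader election task, it boils down to roughly the pattern below. This is a simplified sketch built on client-go's leaderelection package, not the code the agents actually produced, and the lease name ("example-leader"), namespace, and timings are placeholders:

    package main

    import (
    	"context"
    	"log"
    	"os"
    	"time"

    	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    	"k8s.io/client-go/kubernetes"
    	"k8s.io/client-go/rest"
    	"k8s.io/client-go/tools/leaderelection"
    	"k8s.io/client-go/tools/leaderelection/resourcelock"
    )

    func main() {
    	// In-cluster config; the pod's service account needs RBAC access to Lease objects.
    	cfg, err := rest.InClusterConfig()
    	if err != nil {
    		log.Fatal(err)
    	}
    	client := kubernetes.NewForConfigOrDie(cfg)

    	// Each replica identifies itself by hostname (the pod name by default).
    	id, _ := os.Hostname()

    	// All replicas coordinate through a single Lease in the namespace.
    	lock := &resourcelock.LeaseLock{
    		LeaseMeta:  metav1.ObjectMeta{Name: "example-leader", Namespace: "default"},
    		Client:     client.CoordinationV1(),
    		LockConfig: resourcelock.ResourceLockConfig{Identity: id},
    	}

    	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
    		Lock:            lock,
    		LeaseDuration:   15 * time.Second,
    		RenewDeadline:   10 * time.Second,
    		RetryPeriod:     2 * time.Second,
    		ReleaseOnCancel: true,
    		Callbacks: leaderelection.LeaderCallbacks{
    			OnStartedLeading: func(ctx context.Context) {
    				log.Printf("%s is now the leader; run leader-only work here", id)
    				<-ctx.Done() // block until leadership is lost or the process shuts down
    			},
    			OnStoppedLeading: func() {
    				log.Printf("%s lost leadership", id)
    			},
    			OnNewLeader: func(leader string) {
    				log.Printf("current leader: %s", leader)
    			},
    		},
    	})
    }

The interesting part for the agents was everything around this: wiring it into existing services, handling failover of the leader-only work, and not breaking deploys.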

I evaluated Cursor, Claude Code, Gemini CLI, and OpenAI Codex across setup friction, number of follow-up prompts, code quality, UX, and context handling.

Cursor won, but it was close. I really liked the Claude Code UX and will try the new Cursor CLI. I plan to run a similar benchmark in the fall with newer features like parallel agents and newer models (maybe GPT-5 or whatever comes next).

Let me know what I should test for round 2 or nitpick the criteria I used. The best tool for you might not be the best tool for me, so I encourage you to run your own experiments. Also happy to answer questions about my methodology.

peggyrayzis · 2h ago
Which one do you continue to reach for the most? I was surprised to see Gemini so low on your list even w/ the long context window.
prime312 · 1h ago
Any reason why open-source models (like Llama) weren't considered here?