Testing AI coding agents on real codebases

8 points by aldersondev · 8/12/2025, 5:44:56 PM · render.com ↗

Comments (5)

aldersondev · 19h ago
Hey, I'm Mitch (post author).

I'm a full-stack engineer with 15+ years of experience who contracts with Render, and I guest-posted this research on their blog.

Up until recently, I was very skeptical of AI coding tools. My AI usage was basically GH Copilot's autocomplete. After cleaning up too many agent mistakes in production, I set up a structured trial in my real projects to measure what these agents can do under real constraints.

I decided to run two sets of experiments: vibe coding a new application from scratch as a control test, then giving the agents real production tasks. For the production tasks, I gave them challenges from my own projects: building a k8s pod leader election system for Go microservices on the backend, and building out CSS templates in Astro.js on the frontend.
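For anyone unfamiliar with that first task, here's roughly its shape: a minimal sketch of pod leader election using client-go's leaderelection package (the Lease name, namespace, and timings below are illustrative placeholders, not the actual task code):

```go
package main

import (
	"context"
	"log"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	// Running inside the cluster, so use the pod's service account.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Each replica identifies itself by pod name (injected via the Downward API).
	id := os.Getenv("POD_NAME")

	// A Lease object in the service's namespace acts as the shared lock.
	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Name: "my-service-leader", Namespace: "default"},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: id},
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 15 * time.Second, // how long a leader holds the lock without renewing
		RenewDeadline: 10 * time.Second, // leader must renew within this window
		RetryPeriod:   2 * time.Second,  // how often non-leaders retry acquisition
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				log.Println("became leader; starting leader-only work")
				<-ctx.Done() // run leader-only work until we lose the lock
			},
			OnStoppedLeading: func() {
				log.Println("lost leadership; standing by")
			},
		},
	})
}
```

The Lease is the shared lock: whichever pod holds and renews it runs the leader-only work, and the callbacks handle the handoff when a leader dies.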

I evaluated Cursor, Claude Code, Gemini CLI, and OpenAI Codex across setup friction, # of follow-up prompts, code quality, UX, and context handling.

Cursor won, but it was close. I really liked the Claude Code UX, and I'll try the new Cursor CLI. I plan to run a similar benchmark in the fall with new features like parallel agents and newer models (maybe GPT-5 or whatever comes next).

Let me know what I should test for round 2 or nitpick the criteria I used. The best tool for you might not be the best tool for me, so I encourage you to run your own experiments. Also happy to answer questions about my methodology.

peggyrayzis · 19h ago
Which one do you continue to reach for the most? I was surprised to see Gemini so low on your list even w/ the long context window.
aldersondev · 16h ago
Hi there! Even though it didn't score the highest (Cursor did!), I loved the code Claude produced, and it's the tool I'm still using now that testing is complete. Anthropic got the UX right, in my opinion.

Gemini surprised me, too! It was a mixed bag: it performed well in the production tests but failed significantly on the control test. I bet it will catch up quickly as the model improves, because that long context window is a real advantage.

prime312 · 18h ago
Any reason why open-source models (like Llama) weren't considered here?
aldersondev · 16h ago
Hi! Great question!

For this round, I decided to compare the (arguably) best models and tools with the most ongoing support and development.

That being said, I love local models! I'm a big fan of running Google Gemma on Ollama and wiring it into the Aider CLI; I've had excellent luck with that setup.

Maybe round 2 needs to be OSS models!