Show HN: O3 beats Sonnet 4 at coding (in our codebase, wrt our preferences)

At this point, the leading foundation models from each of the top labs tend to do well on the classic set of coding evals. Each new release brings smaller improvements over the last, with leading models separated by increasingly thin margins. This may be a sign we're heading toward commoditization.

However, in our experience, these models can differ meaningfully when the task is narrow and the context is specific. And these are exactly the circumstances under which most people use these models, e.g. in their own backend codebase, according to their specific coding preferences.

We think this is partly why folks argue about which model is best for coding. I was personally confused to see the praise for Sonnet 4 as a pair programmer. This didn't match my experience: it takes way too many liberties for my taste, and it refuses to conform to (or even use) the translation interface between my app's API and database.

So, our thesis is that the best model for coding (or for any narrow task) is personal. Hence, our goal is to make it as easy as possible for our users to find the best model for them.

To explore this, we put together a dataset of real engineering sprint tickets (tasks like "optimize our database connection logic") and wrote down some of our top model behavioral preferences to use as eval metrics.

We focused on three of our core pain points: Pattern Adherence (does it follow our existing architecture?), Scope Discipline (does it stay focused, or does it make unsolicited refactors?), and Comment Quality (does it write useful documentation, or just verbosely restate the code?). Each response was scored from -1.0 to 1.0 on each of these dimensions.
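
For concreteness, here's a minimal sketch of how a single scored response could be represented. The names and structure below are illustrative, not our actual implementation:

    from dataclasses import dataclass

    # Hypothetical metric keys mirroring the three dimensions above.
    METRICS = ("pattern_adherence", "scope_discipline", "comment_quality")

    @dataclass
    class ScoredResponse:
        """One model's response to one sprint-ticket prompt, scored per metric."""
        model: str
        prompt_id: str
        scores: dict[str, float]  # each value in [-1.0, 1.0]

        def validate(self) -> None:
            for metric in METRICS:
                value = self.scores[metric]
                if not -1.0 <= value <= 1.0:
                    raise ValueError(f"{metric} score {value} is outside [-1.0, 1.0]")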

Then we evaluated 14 of the current top LLMs using these metrics, with statistical significance testing across the prompt set.

Results: o3 medium ranked first (0.53 ± 0.05), with o4-mini second (0.48 ± 0.05). Sonnet 4 placed 7th (0.41 ± 0.07).
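
The ± values are uncertainty estimates over the prompt set. One simple way to get numbers of that shape is the standard error of the mean over per-prompt scores; the significance testing behind the leaderboard may differ, and the data below is made up:

    import math
    from statistics import mean, stdev

    def summarize(per_prompt_scores: list[float]) -> tuple[float, float]:
        """Mean and standard error of one model's scores across the prompt set."""
        m = mean(per_prompt_scores)
        se = stdev(per_prompt_scores) / math.sqrt(len(per_prompt_scores))
        return m, se

    # Hypothetical per-prompt averages for one model (not real data).
    scores = [0.61, 0.48, 0.55, 0.50, 0.44, 0.58, 0.52]
    m, se = summarize(scores)
    print(f"{m:.2f} ± {se:.2f}")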

This helped us design a tiered "model-stack" that we now follow:

- o3 for complex, low-frequency tasks where quality matters most
- o3-mini for difficult work at higher scale (similar quality, but much faster/cheaper)
- Gemini 2.5 Flash for documentation (strong on comment quality, and very fast/cheap)
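
To make the tiering concrete, here's a rough sketch of that routing as code. The task fields and thresholds are illustrative, not part of any real API:

    def pick_model(task_type: str, complexity: str, volume: str) -> str:
        """Route a task to a model tier per the stack above (illustrative only)."""
        if task_type == "documentation":
            return "gemini-2.5-flash"  # strong comment quality, very fast/cheap
        if complexity == "high" and volume == "low":
            return "o3"                # quality matters most here
        return "o3-mini"               # similar quality, much faster/cheaper

    print(pick_model("refactor", "high", "low"))       # -> o3
    print(pick_model("documentation", "low", "high"))  # -> gemini-2.5-flash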

Some other findings:

- Scope discipline correlates weakly with the other coding skills (r ≈ 0.05-0.09)
- Every model struggles with comment quality - this metric has the lowest absolute scores
- "Thinking" / "reasoning" doesn't always improve performance (Sonnet 4 regressed on two metrics)
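
For the correlation point, that's a standard Pearson r between score vectors on different metrics. A minimal sketch with made-up numbers (the real values come from the eval run):

    from statistics import correlation  # Pearson r, Python 3.10+

    # Purely illustrative per-prompt scores for two metrics.
    scope_discipline  = [0.30, 0.45, 0.25, 0.50, 0.40, 0.35]
    pattern_adherence = [0.50, 0.55, 0.45, 0.40, 0.60, 0.35]

    # Prints ~0.09 for this toy data, i.e. essentially no linear relationship.
    print(round(correlation(scope_discipline, pattern_adherence), 2))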

Our results: https://mandoline.ai/leaderboards/coding

To be clear, these results reflect our specific codebase patterns and team priorities. Your optimal choice depends on your architecture, task distribution, and what behaviors matter most to your workflow.

If you're choosing coding LLMs, our product helps you eval lots of models on your real tasks, against your own criteria.

Try it out: https://mandoline.ai/best-llm-for

Docs: https://mandoline.ai/docs/best-llm-for
