Show HN: O3 beats Sonnet 4 at coding (in our codebase, wrt our preferences)

At this point, the leading foundation models from each of the top labs tend to do well on the classic set of coding evals. Each new release brings smaller improvements over the last, with leading models separated by increasingly thin margins. This may be a sign we're heading toward commoditization.

However, in our experience, these models can differ meaningfully when the task is narrow and the context is specific. And these are exactly the circumstances under which most people use these models, e.g. in their own backend codebase, according to their specific coding preferences.

We think this is partly why folks argue about which model is best for coding. I was personally confused to see the praise for Sonnet 4 as a pair programmer. This didn't match my experience: it takes way too many liberties for my taste, and it refuses to conform to (or even use) the translation interface between my app's API and database.

So, our thesis is that the best model for coding (or for any narrow task) is personal. Hence, our goal is to make it as easy as possible for our users to find the best model for them.

To explore this, we put together a dataset of real engineering sprint tickets (tasks like "optimize our database connection logic") and wrote down some of our top model behavioral preferences to use as eval metrics.

We focused on three of our core pain points: Pattern Adherence (does it follow our existing architecture?), Scope Discipline (does it stay focused, or does it make unsolicited refactors?), and Comment Quality (does it write useful documentation, or just verbosely restate the code?). Each response was scored from -1.0 to 1.0 on each of these dimensions.
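
For concreteness, here's a minimal sketch of how a single scored response could be represented. The names and structure below are illustrative, not our actual implementation:

    from dataclasses import dataclass

    # Hypothetical metric keys mirroring the three dimensions above.
    METRICS = ("pattern_adherence", "scope_discipline", "comment_quality")

    @dataclass
    class ScoredResponse:
        """One model's response to one sprint-ticket prompt, scored per metric."""
        model: str
        prompt_id: str
        scores: dict[str, float]  # each value in [-1.0, 1.0]

        def validate(self) -> None:
            for metric in METRICS:
                value = self.scores[metric]
                if not -1.0 <= value <= 1.0:
                    raise ValueError(f"{metric} score {value} is outside [-1.0, 1.0]")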

Then we evaluated 14 of the current top LLMs using these metrics, with statistical significance testing across the prompt set.

Results: o3 medium ranked first (0.53 ± 0.05), with o4-mini second (0.48 ± 0.05). Sonnet 4 placed 7th (0.41 ± 0.07).
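
The ± values are uncertainty estimates over the prompt set. One simple way to get numbers of that shape is the standard error of the mean over per-prompt scores; the significance testing behind the leaderboard may differ, and the data below is made up:

    import math
    from statistics import mean, stdev

    def summarize(per_prompt_scores: list[float]) -> tuple[float, float]:
        """Mean and standard error of one model's scores across the prompt set."""
        m = mean(per_prompt_scores)
        se = stdev(per_prompt_scores) / math.sqrt(len(per_prompt_scores))
        return m, se

    # Hypothetical per-prompt averages for one model (not real data).
    scores = [0.61, 0.48, 0.55, 0.50, 0.44, 0.58, 0.52]
    m, se = summarize(scores)
    print(f"{m:.2f} ± {se:.2f}")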

This helped us design a tiered "model-stack" that we now follow:

- o3 for complex, low-frequency tasks where quality matters most
- o3-mini for difficult work at higher scale (similar quality, but much faster/cheaper)
- Gemini 2.5 Flash for documentation (strong on comment quality, and very fast/cheap)
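
To make the tiering concrete, here's a rough sketch of that routing as code. The task fields and thresholds are illustrative, not part of any real API:

    def pick_model(task_type: str, complexity: str, volume: str) -> str:
        """Route a task to a model tier per the stack above (illustrative only)."""
        if task_type == "documentation":
            return "gemini-2.5-flash"  # strong comment quality, very fast/cheap
        if complexity == "high" and volume == "low":
            return "o3"                # quality matters most here
        return "o3-mini"               # similar quality, much faster/cheaper

    print(pick_model("refactor", "high", "low"))       # -> o3
    print(pick_model("documentation", "low", "high"))  # -> gemini-2.5-flash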

Some other findings:

- Scope discipline correlates weakly with the other coding skills (r ≈ 0.05-0.09)
- Every model struggles with comment quality - this metric has the lowest absolute scores
- "Thinking" / "reasoning" doesn't always improve performance (Sonnet 4 regressed on two metrics)
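
For the correlation point, that's a standard Pearson r between score vectors on different metrics. A minimal sketch with made-up numbers (the real values come from the eval run):

    from statistics import correlation  # Pearson r, Python 3.10+

    # Purely illustrative per-prompt scores for two metrics.
    scope_discipline  = [0.30, 0.45, 0.25, 0.50, 0.40, 0.35]
    pattern_adherence = [0.50, 0.55, 0.45, 0.40, 0.60, 0.35]

    # Prints ~0.09 for this toy data, i.e. essentially no linear relationship.
    print(round(correlation(scope_discipline, pattern_adherence), 2))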

Our results: https://mandoline.ai/leaderboards/coding

To be clear, these results reflect our specific codebase patterns and team priorities. Your optimal choice depends on your architecture, task distribution, and what behaviors matter most to your workflow.

If you're choosing coding LLMs, our product helps you eval lots of models on your real tasks, against your own criteria.

Try it out: https://mandoline.ai/best-llm-for

Docs: https://mandoline.ai/docs/best-llm-for
