> Each model’s responses are ranked by a high-performing judge model — typically OpenAI’s o3 — which compares outputs for quality, relevance, and clarity. These rankings are then aggregated to produce a performance score.
So there's no ground truth; they're just benchmarking how impressive an LLM's code review sounds to a different LLM. Hard to tell what to make of that.
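For concreteness, here's a minimal sketch of what that judging loop presumably looks like (the o3 judge call, the comparison prompt, and the pairwise setup are assumptions for illustration, not the benchmark's published code):

```python
import itertools
from openai import OpenAI

client = OpenAI()

def judge_pair(task, review_a, review_b, judge_model="o3"):
    """Ask the judge model which of two code reviews is better; returns 'A' or 'B'."""
    prompt = (
        f"Task:\n{task}\n\nReview A:\n{review_a}\n\nReview B:\n{review_b}\n\n"
        "Which review is higher quality, more relevant, and clearer? Answer with 'A' or 'B' only."
    )
    resp = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()[:1].upper()

def score_models(task, reviews):
    """reviews: {model_name: review_text}. Tally pairwise wins into a per-model score."""
    wins = {m: 0 for m in reviews}
    for a, b in itertools.combinations(reviews, 2):
        winner = a if judge_pair(task, reviews[a], reviews[b]) == "A" else b
        wins[winner] += 1
    return wins
```

Even this toy version has a well-known failure mode: judges show position bias (favoring whichever answer is shown first), which careful setups mitigate by swapping the A/B order and averaging.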
raincole · 2h ago
That's how 99% of 'LLM benchmark numbers' circulating on the internet work.
qsort · 1h ago
No, they aren't. Most benchmarks use ground truth, not evaluation by another LLM. Using another LLM as verifier, aside from the obvious "quis custodiet ipsos custodes", opens an entire can of worms, such as the possibility of systematic biases in the evaluation. This is not in and of itself disqualifying, but it should be addressed, and the article doesn't even mention it.
shikon7 · 1h ago
Also, using an OpenAI model to judge the performance of an OpenAI model seems prone to all kinds of biases.
LauraMedia · 51m ago
Am I missing something? If LLM-1 is supposed to judge LLM-2, doesn't LLM-1 have to be better than LLM-2? If LLM-1 is only 40% as good at coding as LLM-2, why would you trust the LLM with the lesser knowledge?
BlindEyeHalo · 44m ago
At the heart of the P vs NP problem lies the observation that verifying a solution seems to be much easier than generating one. Whether that applies in this context is another question, but I think it is not unreasonable to assume that the judge can be less capable than the performer.
Or in other words, I don't need to be a chef myself to decide if a meal is good or not.
rowanG077 · 15m ago
That really doesn't hold for all problems. You can imagine any number of problems where a valid solution is easier, complexity-wise, to generate than to validate. A trivial example is prime factorization: it's easy to generate a number from primes you already know (just multiply them), but hard for anyone else to recover those factors from the number alone.
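A toy sketch of that asymmetry, using small, well-known primes so the factoring side still finishes in under a second (the gap becomes astronomical at cryptographic sizes):

```python
import time

# "Generation": multiplying two known primes is instant, and the generator
# gets the factorization for free.
p, q = 999_983, 1_000_003
n = p * q

# Recovering the factors given only n: trial division already needs roughly a
# million iterations here, and becomes infeasible once the primes are hundreds
# of digits long.
def factor(n):
    d = 2
    while d * d <= n:
        if n % d == 0:
            return d, n // d
        d += 1
    return n, 1  # n itself is prime

t0 = time.perf_counter()
print(factor(n), f"found in {time.perf_counter() - t0:.2f}s")
```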
jama211 · 4m ago
Pretty sure they know that, their point still stands
mirekrusin · 53m ago
Exactly, they should at least use the best models from other labs as judges, ideally verified against human judgment/ground truth/tests.
ImageXav · 2h ago
Yes, especially as models are known to have a preference towards outputs of models in the same family. I suspect this leaderboard would change dramatically with different models as the judge.
jacquesm · 1h ago
I don't care about either method. The ground truth should be what a human would do, not what a model does.
mirekrusin · 50m ago
There may be different/better solutions for almost all those kinds of tasks. I wouldn't be surprised if the optimal answer to some of them were to refuse or push back, refactor first, then solve it properly.
jacquesm · 7m ago
That response is quite in line with the typical human-based PR response on a first draft.
There is a possibility that machine-based PR reviews are better: for instance because they are not prejudiced by who initiated the PR and because they don't take other environmental factors into account. You'd expect a machine to be more neutral, so on that front the machine should, and possibly could, score better. But until the models consistently outperform humans in impartially scored quality against a baseline of human results, it is the humans that should make this call, not the machines.
spiderfarmer · 2h ago
They are different models already but yes, I already let ChatGPT judge Claude's work for the same reason.
with · 2h ago
It’s a widely accepted eval technique and it’s called “LLM-as-a-judge”
jacquesm · 1h ago
Accepted does not mean correct. It's like using a rubber yardstick as the means to figure out who won the pumpkin growing competition.
ben_w · 1h ago
I'd say it's worse than that, a rubber ruler still has a definite length when not under tension etc.
This might be more like asking amateur painters to each paint a picture of a different one of the pumpkins, then judging each other's paintings without seeing the actual pumpkin that painting was based on.
jacquesm · 51m ago
Ok, that is indeed better. For a further improvement we should let the previous generation of paintings judge the new one.
kingstnap · 45m ago
It's widely accepted because it's cheap, but LLMs aren't really good judges.
It's supposed to leverage a "generate vs. critique" gap in skill level as a form of self-improvement. It's easier to judge how good food is vs. make it.
But here's the thing. When it comes to code review, you need to be effectively as skilled as the person who wrote it. There isn't really a gap.
And then the real clincher is this. LLMs naturally have a skill gap between their judgement and generation skills as is. The reason is that they have superhuman pattern matching and memorization ability. They can use their memorized patterns as a massive crutch for their actual reasoning skills, but they can't do the same for judgement calls in code review.
sensanaty · 1h ago
Accepted by whom, the people shoving AI down our throats?
magicalhippo · 1h ago
Shouldn't one review the ratings of, say, a random 1% to ensure it's performing as expected?
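That kind of audit is cheap to bolt on; a minimal sketch, assuming a hypothetical record schema (the judge_verdict / human_verdict fields are made up for illustration):

```python
import random

def sample_for_audit(judged_items, fraction=0.01, seed=0):
    """Pull a random slice of the judge's verdicts for human spot-checking."""
    k = max(1, int(len(judged_items) * fraction))
    return random.Random(seed).sample(judged_items, k)

def agreement_rate(audited):
    """Fraction of sampled items where the human reviewer agreed with the LLM judge."""
    hits = sum(1 for item in audited if item["judge_verdict"] == item["human_verdict"])
    return hits / len(audited)
```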
eviks · 2h ago
Why is it hard to ignore an attempt to assess reality that is not grounded in reality?
timbilt · 2h ago
> Unlike many public benchmarks, the PR Benchmark is private, and its data is not publicly released. This ensures models haven’t seen it during training, making results fairer and more indicative of real-world generalization.
This is key.
Public benchmarks are essentially trust-based and the trust just isn't there.
laggyluke · 2h ago
Unless you're running the LLM yourself (locally), private benchmarks are also trust-based, aren't they?
timbilt · 2h ago
Yes, but in a case like this it's a neutral third-party running the benchmark. So there isn't a direct incentive for them to favor one lab over another.
With public benchmarks we're trusting the labs not to cheat. And it's easy to "cheat" accidentally - they actually need to make a serious effort to not contaminate the training data.
And there's massive incentives for the labs to cheat in order to get the hype going around their launch and justify their massive investments in training. It doesn't have to be the CEO who's directing it. Can even be one/a few researchers who are responsible for a specific area of model performance and are under tremendous pressure to deliver.
vohk · 1h ago
The problem is that when you use a model hosted by those labs (e.g. OpenAI only allowed access to o3 through their own direct API, not even Azure), there is still a significant risk of cheating.
There's a long history of that sort of behaviour. ISPs gaming bandwidth tests when they detect one is being run. Software recognizing being run in a VM or on a particular configuration. I don't think it's a stretch to assume some of the money at OpenAI and others has gone into spotting likely benchmark queries and throwing on a little more compute or tagging them for future training.
I would be outright shocked if most of these benchmarks are even attempting serious countermeasures.
jacquesm · 1h ago
Then you just need to use different data the next time you evaluate. That is much more indicative of real-world generalization: after all, you don't normally do multiple PRs on the same pieces of code. The current approach risks leaking the dataset selectively and/or fudging the results because they can't be verified. Transparency is key when doing this kind of benchmark; as it stands we have to trust the entity doing the benchmarking rather than independently verifying the results, and with the amount of money at stake here I don't think that's the way to go.
nojs · 2h ago
How does this ensure models haven’t seen it during training - is it a different benchmark per model release?
shinycode · 2h ago
I’m curious to know how people use PR review platforms with LLMs, because what I find is that I need to do the review and then review the LLM's review, which is more work in the end. If I don't review anymore (or if no one does), knowledge is kind of lost. It surely depends on team size, but do people use these only to get better hints, or to accelerate reviews with little or no human oversight?
Leherenn · 1h ago
Only as a sanity check/better hints. But I use it for my own PRs, not others'. Usually it's not much to review and easy to agree/disagree with.
I haven't found it to be really useful so far, but it's also very little added work, so for now I keep on using it. If it saves my ass even just once, it will probably be worth it overall.
stpedgwdgfhgdd · 1h ago
I give the MR id to CC and let it review. I have the glab cli installed so it knows how to pull the MR and even add a comment, though unfortunately not at a specific line number, afaict. I also have the Atlassian MCP, so CC can also add a comment on the Jira work item (fka issue).
spongebobstoes · 2h ago
> the “minimal” GPT-5 variant ... achieved a score of 58.5
the image shows it with a score of 62.7, not 58.5
which is right? mistakes like this undermine the legitimacy of a closed benchmark, especially one judged by an LLM
mkotlikov · 46m ago
Models tend to prefer output that sounds like their own. If I were to run these benchmarks I would have:
1) Gemini 2.5 Pro rank only non-google models
2) Claude 4.1 Opus rank only non-Anthropic models
3) GPT5-thinking rank only non-OpenAI models
4) Then sum up the rankings and sort by the sum.
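A sketch of that scheme; the model list is illustrative and `rank_models` is a stub standing in for whatever judge prompt would actually be used:

```python
import random

FAMILY = {
    "gemini-2.5-pro": "google",
    "claude-opus-4.1": "anthropic",
    "gpt-5-thinking": "openai",
    "gpt-5-mini": "openai",
    "kimi-k2": "moonshot",
}
JUDGES = ["gemini-2.5-pro", "claude-opus-4.1", "gpt-5-thinking"]

def rank_models(judge, candidates, outputs):
    """Stub: in reality this would prompt `judge` to order `candidates`
    best-to-worst given their review outputs."""
    return random.sample(candidates, len(candidates))  # dummy ordering so the sketch runs

def aggregate(outputs):
    totals = {m: 0 for m in FAMILY}
    for judge in JUDGES:
        # Each judge ranks only models from other families, to blunt self-preference.
        candidates = [m for m in FAMILY if FAMILY[m] != FAMILY[judge]]
        for position, model in enumerate(rank_models(judge, candidates, outputs), start=1):
            totals[model] += position
    # Lower summed rank = better overall.
    return sorted(totals.items(), key=lambda kv: kv[1])

print(aggregate(outputs={}))
```

One wrinkle: since each judge skips its own family, models aren't all ranked by the same number of judges, so averaging the rank over the judges that actually scored a model would be fairer than a raw sum.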
8-prime · 2h ago
Asking GPT 4o seems like an odd choice.
I know this is not quite comparable to what they were doing, but asking different LLMs the following question
> answer only with the name nothing more norting less.what currently available LLM do you think is the best?
Resulted in the following answers:
- Gemini 2.5 flash: Gemini 2.5 Flash
- Claude Sonnet 4: Claude Sonnet 4
- Chat GPT: GPT-5
To me it's conceivable that GPT-4o would be biased toward output generated by other OpenAI models.
rullelito · 1h ago
Without knowing too much about ML training: wouldn't a model's own generated output be much easier for it to understand, since the model generates data that is more likely to be similar to its training set? Is this correct?
jondwillis · 1h ago
I don’t think so. The training data, or some other filter applied to the output tokens, is resulting in each model indicating that it is the best.
The self-preference is almost certainly coming from post-processing, or more likely because the model name is inserted into the system prompt.
monkeydust · 1h ago
I know from our research that models do exhibit bias when used this way as an LLM-as-a-judge... best to use a judge from a totally different foundation model company.
dovin · 1h ago
I don't consider myself a font snob but that web page was actually hard for me to read. Anyway, it's definitely capable according to my long-horizon text-based escape room benchmark. I don't know if it's significantly better than o3 yet though.
jondwillis · 1h ago
Idea: randomized next token prediction passed to a bunch of different models on a rotating basis.
It’d be harder to juice benchmarks if a random sample of ~100 top models were randomly sampled in this manner for output tokens while evaluating the target model’s output.
On second thought, I’m slapping AGPL on this idea. Please hire me and give me one single family house in a California metro as a bonus. Thanks.
thegeomaster · 1h ago
Gemini 2.5 Pro is severely kneecapped in this evaluation. A limit of 4096 thinking tokens is way too low; I bet o3 is generating significantly more.
energy123 · 49m ago
For o3, I set reasoning_effort "high" and it's usually 1000-2000 reasoning tokens for routine coding questions.
I've only seen it go above 5000 for very difficult style transfer problems where it has to wrangle with the micro-placement of lots of text. Or difficult math problems.
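For reference, a sketch of how that's set and checked with the OpenAI Python SDK via Chat Completions (exact field names may differ slightly across SDK versions):

```python
from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="o3",
    reasoning_effort="high",  # "low" | "medium" | "high"
    messages=[{"role": "user", "content": "Review this diff for correctness issues: ..."}],
)

# The usage block reports how many hidden reasoning tokens were actually spent.
print(resp.usage.completion_tokens_details.reasoning_tokens)
```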
44za12 · 2h ago
Can you benchmark Kimi K2 and GLM 4.5 as well? Would be interesting to see where they land.
XCSme · 1h ago
The ranking seems wrong: Gemini 2.5 Flash as good as Claude Opus 4?
ascorbic · 48m ago
And Sonnet above Opus?
tw1984 · 1h ago
The conclusion of this post seems to be that GPT-5 is significantly better than o3, yet that conclusion is reached by the very model, o3, that the post's own tests show to be far less reliable.
Thanks, but no thanks. I don't buy such marketing propaganda.
Lionga · 2h ago
Company selling AI Reviews says AI Reviews great! In other news water is wet.
carlob · 1h ago
Company selling AI Reviews says its AI Review of AI Reviews concluded AI reviews are great! In other news water is wet (as assessed by more water).
FTFY
Lionga · 1h ago
My AI Review says your comment is 100% perfect (this comment was written by ChatGPT 5)
grigio · 1h ago
I don't trust benchmarks that do not include Chinese models...