OpenAI scores gold in one of the top programming competitions

energy123 · 8/13/2025, 10:14:41 AM · msn.com ↗

Comments (10)

animal531 · 7m ago
I use GPT almost daily now and have noticed a funny thing, which is to be expected, really.

I can ask it to help me code, for example, a physics engine. We're talking really hard and intricate code, and it'll come up with some amazing optimizations, at the level of (recent) research-paper implementations.

Then I ask it to work on something that's relatively trivial, let's say we need a flow field. It'll think and reason about it just as well as in the first example, but then it'll start spitting out a lot of subpar code: its error rate increases roughly 10x, and the overall cohesiveness of the produced code is substantially worse.
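
To make "relatively trivial" concrete: a basic flow field is just a grid of unit direction vectors that agents sample and follow. A minimal Python sketch (the names and the default swirl function are illustrative only, not anyone's actual implementation):

```python
import math

def make_flow_field(width, height, angle_fn=None):
    """Build a grid of unit direction vectors, i.e. the flow field."""
    if angle_fn is None:
        # Default: a swirl around the grid center, just so the sketch is self-contained.
        angle_fn = lambda x, y: math.atan2(y - height / 2, x - width / 2) + math.pi / 2
    return [[(math.cos(angle_fn(x, y)), math.sin(angle_fn(x, y)))
             for x in range(width)]
            for y in range(height)]

field = make_flow_field(64, 64)
dx, dy = field[10][20]  # direction an agent at cell x=20, y=10 would steer toward
```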

As to why that's happening: maybe it's been trained on far more, and far worse, examples of the second kind, whereas the first is relatively "pure".

These programming competitions are pretty much the same thing, in my opinion. For us humans it's a hard challenge, but in general they're asking the same-ish questions, just in different formats. They should add some questions where the participant has to invent something new, or alternatively use two or more existing concepts in a totally novel fashion.

NitpickLawyer · 2h ago
So in the past month we've had:

- gold at the IMO

- gold at the IOI

- beat 9/10 humans in AtCoder Heuristics

- longer context, better models, routing calls to cheaper models, 4-6x cheaper inference for 90% of the top models' capabilities

- longer agentic sessions while staying coherent and solving tasks (30-90 min)

Yet every other post, here and elsewhere, is about "bubble this", "winter that", "plateauing this", "wall that"...

Are we in the denial stage, or bargaining stage? Can't quite tell...

aleph_minus_one · 22m ago
> Yet every other post here and there are about "bubble this", "winter that", "plateauing this", "wall that"...

> Are we in the denial stage, or bargaining stage? Can't quite tell...

I can tell quite clearly that, even assuming the models are not specifically "fine-tuned" to win these competitions, these achievements transfer neither to the kind of coding I do at work nor to the coding I do privately at night.

At work, a lot of what needs to be done is:

1. asking people who are knowledgeable about the business logic why things were implemented the way they were (there often exist good reasons, which are nevertheless often quite subtle).

2. when some new requirement comes up, thinking deeply about how it fits into the huge legacy codebase. I am allowed to change things here as necessary (an enormous concession that is uncommon in this industry), but my code changes should really never cause the software to produce wrong results or break business-critical workflows. Such failures can cost my employer quite some money, or increase the workload of already overworked colleagues who in specific months have to work under very tight deadlines (they will then legitimately hate me :-( ). And what counts as a "business-critical workflow" that had better never break? Answering that requires understanding the very demanding users over many, many years (believe me: it is really subtle).

I cannot imagine how AIs could help with this.

Privately, I tend to write very experimental code for which one can very likely not find similar code on the internet. Think in the direction of turning some deep scientific results into more "mainstream" code, or turning my avant-garde thoughts about some deep problems into code, so that one can run experiments to see whether my ideas actually work.

Again something where AIs can barely help.

energy123 · 1h ago
People use low-compute models in their day-to-day jobs. They're not exposed to how well the very-high-compute runs are doing at the moment.

machiaweliczny · 1h ago
This. My younger brother thinks it's crap, but if you know the state of the art plus the research, it seems like things are still moving quite fast. Also, tons of product work on top already.

energy123 · 56m ago
Even gpt-5 on "high" reasoning effort (which is likely higher than what people get in the Plus subscription; that is most likely "medium") is very, very low compute compared to the top runs behind the IOI/IMO solutions.
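
For reference, that knob is exposed through the API rather than the chat UI. A minimal sketch assuming the official OpenAI Python SDK's Responses API (the model name and prompt are illustrative, and the exact parameter shape may differ across SDK versions):

```python
from openai import OpenAI  # assumes the official OpenAI Python SDK is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Explicitly request the highest reasoning effort; subscription chat UIs
# typically route to a lower default, as noted above.
response = client.responses.create(
    model="gpt-5",                 # illustrative model name
    reasoning={"effort": "high"},  # e.g. "low" | "medium" | "high"
    input="Outline an O(n log n) algorithm for the closest pair of points.",
)
print(response.output_text)
```
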
tyleo · 1h ago
But can it maintain my legacy CRUD app with no tests, millions of LoC, and long compile times?

One day, but not yet. Beyond pure capabilities, the companies making AI don't seem to have any sort of moat, so it's a $$$ incinerator for them so far.

Like the late-90s internet, I suspect we're in a bubble. But also like the late-90s internet, I suspect there's more in store here in the future.

bamboozled · 14m ago
Could we just be somewhere in the middle? Amazing models that have been tuned to win a certain competition and given far more compute than is feasible for everyday usage, while the daily general-purpose models are still useful but not AGI yet?
robertlagrant · 1h ago
You might've said the same thing about self-driving cars five years ago, or chess even longer ago. It turns out chess was solvable, so the naysayers were wrong; but self-driving cars aren't solved (yet), so the yay-sayers were wrong.
SideburnsOfDoom · 1h ago
How many of the answers were verbatim in the training data?