I use GPT almost daily now and have noticed a funny thing, which is to be expected, really.
I can ask it to help me code, say, a physics engine, so we're talking really hard and intricate code, and it'll come up with some amazing optimizations: implementations at the level of recent research papers.
Then I ask it to work on something that's relatively trivial; let's say we need a flow field. It'll think and reason about it just as well as in the first example, but then it'll start spitting out a lot of subpar code. Its error rate will increase 10x, while the overall cohesiveness of the produced code will be substantially worse.
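(For anyone unfamiliar: by "flow field" I mean roughly this kind of thing, a grid of direction vectors that particles steer along. A minimal Python sketch, where every name and constant is purely illustrative:

    import math

    # Minimal flow field: a grid of angles; particles steer along the
    # local direction vector. Purely illustrative, not any real codebase.
    COLS, ROWS, CELL = 40, 30, 16

    def field_angle(col, row, t=0.0):
        # Cheap stand-in for a noise function: varies smoothly over the grid.
        return math.sin(col * 0.3 + t) + math.cos(row * 0.3 + t)

    def steer(x, y, t=0.0):
        # Unit direction vector of the field at a world-space position.
        col = min(max(int(x // CELL), 0), COLS - 1)
        row = min(max(int(y // CELL), 0), ROWS - 1)
        a = field_angle(col, row, t)
        return math.cos(a), math.sin(a)

    # Advance one particle a single step along the field.
    x, y = 100.0, 80.0
    dx, dy = steer(x, y)
    x, y = x + dx * 2.0, y + dy * 2.0

That's the whole idea, which is what makes the drop in code quality so noticeable.)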
As to why that's happening: maybe it's being trained on a lot more, as well as worse, examples of the second kind, whereas the first is relatively "pure".
These programming competitions are pretty much the same thing, in my opinion. For us humans it's a hard challenge, but in general they're asking the same-ish questions, just in different formats. They should add some questions where the participant has to invent something new, or alternatively use two or more existing concepts in a totally novel fashion.
NitpickLawyer · 3h ago
So in the past month we've had:
- gold at the IMO
- gold at the IOI
- a result beating 9 out of 10 humans in the AtCoder Heuristic Contest
- longer context, better models, routing calls to cheaper models, 4-6x cheaper inference for 90% of the top models' capabilities
- longer agentic sessions that stay coherent and keep solving tasks (30-90 min)
Yet every other post, here and elsewhere, is about "bubble this", "winter that", "plateauing this", "wall that"...
Are we in the denial stage, or bargaining stage? Can't quite tell...
aleph_minus_one · 1h ago
> Yet every other post, here and elsewhere, is about "bubble this", "winter that", "plateauing this", "wall that"...
> Are we in the denial stage, or bargaining stage? Can't quite tell...
I can tell quite clearly that, even assuming the models are not rather "fine-tuned" to win these competitions, these achievements transfer neither to the kind of coding that I do at work nor to the kind that I do privately at night.
At work, a lot of what needs to be done is:
1. Asking people who are knowledgeable about the business logic why things were implemented this way (there often exist good reasons, which are nevertheless often quite subtle).
2. When a new requirement comes up, thinking deeply about how it fits into the huge legacy codebase. I am allowed to change things here as necessary (an enormous concession that is uncommon in this industry), but my code changes should really never cause the software to produce wrong results or break business-critical workflows. Such failures can cost my employer quite some money, or increase the workload of already overworked colleagues who in certain months have to work under very tight deadlines (they will then legitimately hate me :-( ). And what counts as a "business-critical workflow" that should never be broken? Answering that requires understanding the very demanding users over many, many years (believe me: it is really subtle).
I cannot imagine how AIs could help with this.
Privately, I tend to write very experimental code for which one is very unlikely to find similar code on the internet. Think in the direction of turning deep scientific results into more "mainstream" code, or turning my avant-garde thoughts about some deep problems into code, so that I can run experiments to see whether my ideas actually work.
Again, something where AIs can barely help.
fasterik · 18m ago
I don't think it's crazy to talk about plateaus; it just depends on what domain we're talking about. Performance on olympiad-style problems doesn't necessarily translate into success in research, industry, or creative pursuits. We know this is true for humans; add to that all the usual problems with LLMs, like hallucinations, and you can see why some people are still skeptical.
I'm still in the "wait and see" stage. Maybe throwing more compute at the problem will solve it, but maybe not. I would like to see benchmarks that take a more project-based approach, e.g. telling the LLM to go work on something complicated and ambiguous for a week and seeing what it comes up with.
energy123 · 2h ago
People use low-compute models in their day-to-day jobs. They're not exposed to how well the very-high-compute runs are performing at the moment.
machiaweliczny · 1h ago
This. My younger brother thinks it's crap, but if you know the state of the art plus the research, it seems like things are still moving quite fast. There's also a ton of product work on top already.
energy123 · 1h ago
Even gpt-5 on "high" reasoning effort (which is likely higher than what people get on the Plus subscription; that's most likely "medium") is very, very low compute compared to the top runs behind the IOI/IMO solutions.
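For concreteness, the effort level is a per-request knob in the API. A hedged sketch using the OpenAI Python SDK, where the model name and prompt are placeholders and gpt-5's exact parameter surface may differ:

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    # reasoning_effort is the per-request knob; even "high" is still far
    # below the compute reportedly used for the competition runs above.
    resp = client.chat.completions.create(
        model="o3-mini",            # placeholder reasoning-capable model
        reasoning_effort="high",    # "low" | "medium" | "high"
        messages=[{"role": "user", "content": "..."}],
    )
    print(resp.choices[0].message.content)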
Rick76 · 41m ago
If that's the case, then why? Why would OpenAI not want to release their best models while the AI race is still close? I would assume it's due to energy constraints, and if that's true, the opinion that this can't replace people remains valid.
Thermodynamics is the law of laws. Unless they invent some kind of ultra-efficient, almost magical computer to run these systems, it's simply not economical yet.
energy123 · 26m ago
It's not a question of whether it's the case. It's confirmed by OpenAI employees on Twitter.
The reasons could be that it's new (they did say they plan to release it eventually, but not soon), or that it's too heavily scaffolded for the task and not sufficiently general.
tyleo · 2h ago
But can it maintain my legacy CRUD app with no tests, millions of LoC, and long compile times?
One day, but not yet. Beyond pure capabilities, the companies making AI don't seem to have any sort of moat, so it's a $$$ incinerator for them so far.
Like the late-90s internet, I suspect we're in a bubble. But also like the late-90s internet, I suspect there's more in store here in the future.
bamboozled · 1h ago
Could we just be somewhere in the middle? Amazing models that have been tuned to win a certain competition and given far more compute than is feasible for everyday usage, while the daily general-purpose models are still useful but not AGI yet?
robertlagrant · 2h ago
You might've said the same thing about self-driving cars five years ago, or chess even longer ago. It turns out chess was solvable, so the naysayers were wrong, but self-driving cars aren't solved (yet), so the yay-sayers were wrong.
SideburnsOfDoom · 2h ago
How many of the answers were verbatim in the training data?