More information on OpenAI's result (which seems better than DeepMind's) from the X thread:
> our OpenAI reasoning system got a perfect score of 12/12
> For 11 of the 12 problems, the system’s first answer was correct. For the hardest problem, it succeeded on the 9th submission. Notably, the best human team achieved 11/12.
> We had both GPT-5 and an experimental reasoning model generating solutions, and the experimental reasoning model selecting which solutions to submit. GPT-5 answered 11 correctly, and the last (and most difficult problem) was solved by the experimental reasoning model.
I'm assuming that "GPT-5" here is a version with the same model weights but higher compute limits than even GPT-5 Pro, with many instances working in parallel, and some specific scaffolding and prompts. Still, extremely impressive to outperform the best human team. The stat I'd really like to see is how much money it would cost to get this result using their API (with a realistic cost for the "experimental reasoning model").
bazmattaz · 17m ago
Ha so true. I was so tempted to copy and paste a problem into GPT-5 and see what it would say.
NitpickLawyer · 1h ago
So this year SotA models have gotten gold at IMO, IOI, ICPC and beat 9/10 humans in that AtCoder thing that tested optimisation problems. Yet the most reposted headlines and rhetoric are "wall this", "stagnation that", "model regression", "winter", "bubble", doom, etc.
tech_ken · 1h ago
In 2015 SotA models blew past all expectations for engine performance in Go, but that didn't translate into LLM-based code agents for another ~7 years (and even now the performance of these is up for debate). I think what this shows is that humans are extremely bad at understanding what problems are "hard" for computers; or rather, we don't understand how to group tasks by difficulty in a generalizable way (success in a previously "hard" domain doesn't necessarily translate to performance in other domains of seemingly comparable difficulty). It's incredibly impressive how these models perform in these contests, and it certainly demonstrates that these tools have high potential in *specific areas*, but I think we might also need to accept that these are not necessarily good benchmarks for these tools' efficacy in less structured problem spaces.
Copying from a comment I made a few weeks ago:
> I dunno, I can see an argument that something like IMO word problems are categorically a different language space than a corpus of historiography. For one, even when expressed in the English language, math is still highly, highly structured. Definitions of terms are totally unambiguous, logical tautologies can be expressed using only a few tokens, etc. etc. It's incredibly impressive that these rich structures can be learned by such a flexible model class, but it definitely seems closer (to me) to excelling at chess or another structured game, versus something as ambiguous as synthesis of historical narratives.
edit: oh small world! the cited comment was actually a response to you in that other thread :D
NitpickLawyer · 1h ago
> edit: oh small world the cited comment was actually a response to you in that other thread :D
That's hilarious, we must have the same interests since we keep cross posting :D
The thing with the Go comparison is that AlphaGo was meant to solve Go and nothing else. It couldn't do chess with the same weights.
The current SotA LLMs are "unreasonably good" at a LOT of tasks, while being trained with a very "simple" objective: next-token prediction (NTP). That's the key difference here. We have these "stochastic parrots" + RL + compute that basically solve top-tier competitions in math, coding, and who knows what else... I think it's insanely good for what it is.
tech_ken · 56m ago
> I think it's insanely good for what it is.
Oh totally! I think that the progress made in NLP, as well as the surprising collision of NLP with seemingly unrelated spaces (like ICPC word problems), is nothing short of revolutionary. Nevertheless I also see stuff like this: https://dynomight.substack.com/p/chess
To me this suggests that this out-of-domain performance is more like an unexpected boon than a guarantee of future performance. The "and who knows what else..." is kind of what I'm getting at: so far we are turning out to be bad at predicting where these tools will excel or fall short. To me this is sort of where the "wall" stuff comes from; despite all the incredible successes in these structured problem domains, nobody (in my personal opinion) has really unlocked the "killer app" yet. My belief is that by accepting their limitations we might better position ourselves to laser-target LLMs at the kind of things they rule at, rather than trying to make them "everything tools".
tempusalaria · 42m ago
A lot of the current code and science capabilities do not come from NTP training.
Indeed, it seems that in most language-model RL there is not even process supervision, so it's a long way from NTP.
JohnKemeny · 1h ago
There is a clear difference between what OpenAI manages to do with GPT-5 and what I manage to do with GPT-5. The other day I asked for code to generate a linear regression and it gave back a figure of some points and a line through it.
If GPT-5, as claimed, is able to solve all problems in ICPC, please give the instructions on how I can reproduce it.
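For what it's worth, the kind of answer that linear-regression request is fishing for is only a few lines; a minimal sketch, assuming the ask was an ordinary least-squares fit on some x/y data (the sample numbers below are made up for illustration):

    import numpy as np

    # Hypothetical sample data; the original comment doesn't say what was supplied.
    x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
    y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

    # Ordinary least-squares fit: y ~ slope * x + intercept.
    slope, intercept = np.polyfit(x, y, deg=1)
    print(f"y = {slope:.3f} * x + {intercept:.3f}")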
theptip · 45m ago
I believe this is going to be an increasingly important factor.
Call it the “shoelace fallacy”: Alice is supposedly much smarter but Bob can tie his shoelaces just as well.
The choice of eval, prompt scaffolding, etc. all dramatically impact the intelligence that these models exhibit. If you need a PhD to coax PhD performance from these systems, you can see why the non-expert reaction is “LLMs are dumb” / progress has stalled.
simianwords · 57m ago
Are you using the thinking model or the non thinking model? Maybe you can share your chat.
JohnKemeny · 41m ago
I prefer not to due to privacy concerns. Perhaps you can try yourself?
I will say that after checking, I see that the model is set to "Auto", and as mentioned, used almost 8 minutes. The prompt I used was:
Solve the following problem from a competitive programming contest. Output only the exact code needed to get it to pass on the submission server.
It did a lot of thinking, including:
I need to tackle a problem where no web-based help is available. The task involves checking if a given tree can be the result of inserting numbers 1 to n into an empty skew heap, following the described insertion algorithm. I have to figure out the minimal and maximal permutations that produce such a tree.
And I can see that it visited 13 webpages, including icpc, codeforces, geeksforgeeks, github, tehrantimes, arxiv, facebook, stackoverflow, etc.
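For context, the insertion algorithm the quoted problem statement refers to is short; a minimal sketch of the usual merge-based skew-heap insert (my own illustration, not the contest's reference code):

    class Node:
        def __init__(self, key):
            self.key = key
            self.left = None
            self.right = None

    def merge(a, b):
        # Skew-heap merge: keep the smaller root, merge the other heap into its
        # right subtree, then unconditionally swap the children (the "skew" step).
        if a is None:
            return b
        if b is None:
            return a
        if b.key < a.key:
            a, b = b, a
        a.right = merge(a.right, b)
        a.left, a.right = a.right, a.left
        return a

    def insert(root, key):
        return merge(root, Node(key))

    # Inserting 1..n in some order; the problem asks which insertion orders
    # (minimal and maximal permutations) can produce a given tree.
    root = None
    for key in [3, 1, 4, 2]:
        root = insert(root, key)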
jsnell · 26m ago
A terse prompt and expecting a one-shot answer is really not how you'd get an LLM to solve complex problems.
I don't know what DeepMind and OpenAI did in this case, but to get an idea of the kind of scaffolding and prompting strategy that one might want, have a look at this paper where some folks used the normal, generally available Gemini 2.5 Pro to solve 5/6 of the 2025 IMO problems: https://arxiv.org/pdf/2507.15855
minimaxir · 55m ago
The point of the GPT-5 model is that it is supposed to route between thinking/non-thinking smartly. Leveraging prompt hacks such as instructing it to "think carefully" to force routing to the thinking model goes against OpenAI's claims.
koakuma-chan · 49m ago
Are you sure? I thought you can only specify reasoning_effort and that's it.
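For reference, the API-side knob looks roughly like this; a minimal sketch assuming the Chat Completions reasoning_effort parameter applies to GPT-5 the way it does to other reasoning models (the ChatGPT product's "Auto" routing is a separate, server-side mechanism):

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    # reasoning_effort trades latency/cost for more deliberation on reasoning models.
    resp = client.chat.completions.create(
        model="gpt-5",  # model name assumed for illustration
        reasoning_effort="high",
        messages=[{"role": "user", "content": "Solve this competitive programming problem: ..."}],
    )
    print(resp.choices[0].message.content)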
jug · 50m ago
Even Sam Altman himself thinks we’re in a bubble, and he ought to have a good sense of the wind direction here.
I think the contradiction here can be reconciled by how these tests don’t tend to run under the typical hardware constraints they’d need in order to do this at scale. And herein lies a large part of the problem as far as I can tell; in late 2024, OpenAI realized they had to rethink GPT-5 since their first attempt became too costly to run. This delayed the model, and when it was finally released, it was not a revolutionary update but evolutionary at best compared to o3. Benchmarks published by OpenAI themselves indicated a 10% gain over o3, for God knows how much cash and well over a year of work. We certainly didn’t have those problems in 2023 or even 2024.
DeepSeek has had to delay R2, and Mistral has had to delay Mistral 3 Large, teased as weeks away back in May. No word from either about what’s going on. DeepSeek is said to be moving more to Huawei hardware, and this is behind the delay, but I don’t think it’s entirely clear that it has nothing to do with performance issues.
It would be more strange to _not_ have people speculate about stagnation or bubbles given these events and public statements.
Personally, I’m not sure if stagnation is the right word. We’re seeing a lot of innovation in toolsets and platforms surrounding LLMs, like Codex, Claude Code, etc. I think we’ll see more in this regard and that this will provide more value than the core improvements to the LLMs themselves in 2026.
And as for the bubble, I think we are in one but mostly because the market has been so incredibly hot. I see a bubble not because AI will fall apart but because there are too many products and services right now in a golden rush era. Companies will fail but not because AI suddenly starts failing us but due to saturation.
KallDrexx · 50m ago
It's important to look closely at the details of how these models actually do these things.
If you look at the details of how Google got gold at IMO, you'll see that AlphaGeometry only relies on LLMs for a very specific part of the whole system, and the LLM wasn't the core problem solving system in play.
Most of AlphaGeometry is standard algorithms at play solving geometry problems using known constraints. When the algorithmic system gets stuck, it reaches out to LLMs that were fine tuned specifically for creating new geometric constraints. So the LLM would create new geometric constraints and pass that back to the algorithmic parts to get it unstuck, and repeat.
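A rough sketch of the loop being described, with hypothetical names (this illustrates the parent's description, not DeepMind's actual code):

    def solve(problem, symbolic_engine, llm):
        # symbolic_engine and llm are hypothetical objects standing in for the
        # deductive solver and the fine-tuned construction-proposing model.
        state = symbolic_engine.initialize(problem)
        while not symbolic_engine.is_solved(state):
            progressed = symbolic_engine.deduce(state)  # exhaust known deduction rules
            if not progressed:
                # Stuck: ask the LLM for a new auxiliary construction, hand it
                # back to the symbolic side, and continue deducing.
                construction = llm.propose_construction(state)
                symbolic_engine.add_construction(state, construction)
        return symbolic_engine.extract_proof(state)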
Without more details, it's not clear if this win also used the GPT-5 and Gemini models we use, or specially fine-tuned models that are integrated with other non-LLM and non-ML based systems to solve these.
Not being solved purely by an LLM isn't a knock on it, but with the current conversations going on today around LLMs, these results are heavily marketed as "LLMs did this all by themselves", which doesn't match a lot of the evidence I've personally seen.
NitpickLawyer · 43m ago
AlphaGeometry/AlphaProof (the one you're thinking of, where they used LLMs + Lean) was last year! And they "only" got silver. The IMO gold results this year were e2e NLP.
sixtram · 1h ago
The last time I asked for a code review from AI was last week. It added (hallucinated) some extra lines to the code and then marked them as buggy. Yes, it beats humans at coding — great!
riku_iki · 55m ago
> So this year SotA models have gotten gold at IMO, IOI, ICPC
> Yet the most reposted headlines and rhetoric are "wall this", "stagnation that", "model regression", "winter", "bubble", doom, etc.
this is a narrow niche with a high amount of training data (they all buy training data from leetcode), and these results are not necessarily generalizable to overall industrial tasks
birktj · 1h ago
They apparently managed gold in the IOI as well, a result that was extremely surprising to me and that causes me to rethink a lot of assumptions I have about current LLMs. Unfortunately there was very little transparency about how they managed those results, and the only source was a Twitter post. I want to know: was there any third-party oversight, what kind of compute did they use, how much power, what kind of models, and how were they set up? In this case I see that DeepMind at least has a blog post, but as far as I can see it does not answer any of my questions.
I think this is huge news, and I cannot imagine anything other than models with this capability having a massive impact all over the world. It causes me to be more worried than excited; it is very hard to tell what this will lead to, which is probably what makes it scary for me.
However with so little transparency from these companies and extreme financial pressure to perform well in these contests, I have to be quite sceptical of how truthful these results are. If true I think it is really remarkable, but I really want some more solid proof before I change my worldview.
XenophileJKO · 44m ago
So outside of human intervention, I don't think the specifics really matter. What this means is that it is possible and that this capability will in time be commoditized.
This is helpful in framing the conversation, especially with "skeptics" of what these models are capable of.
birktj · 17m ago
To a certain extent I agree. But as far as I know I cannot go to chatgpt.com, paste the newest ICPC problems, and get full solutions. And there is no information about what they do differently. For a competition like the ICPC, which is academic in its nature, I think it is very unfortunate to set up a separate AI track like this without publishing clear public information about what that actually entails, and without clear requirements for these AI companies to publish their methodology. I know it is a nice source of sponsorships for them, but the ICPC should be able to afford to stand up a bit for academic integrity.
Without any of this I can't even know for sure if there was any human intervention. I don't really think so, but as I mentioned the financial pressure to perform well is extreme so I can totally see that happening. Maybe ICPC did have some oversight, but please write a bit about it then.
If you assume no human intervention then all of this is of course irrelevant if you only care about the capabilities that exist. But still, the implications of a general model performing at this level vs. something more like a chess model trained specifically on competitive programming are of course different, even if the gap may close in the future. And how much compute/power was used? Are we talking hundreds of kWh? And does that just mean larger models than usual, or intelligent brute-forcing through a huge solution space? If so, then it is not clear how much they will be able to scale down the compute usage while keeping the performance at the same level.
JohnKemeny · 1h ago
I went to ICPC's web pages, downloaded the first problem (problem A) and gave it to GPT-5, asking it for code to solve it (stating it was a problem from a recent competitive programming contest).
It thought for 7m 53s and gave as its reply:
# placeholder
# (No solution provided)
ferguess_k · 39m ago
I think in the future information will be more walled off -- because AI companies are not paying anyone for that piece of information. I encourage everyone to put their knowledge on their own website and, for each page, put up a few URLs that humans won't be able to find (but can still click if they know where to look), yet can be crawled by AI, which link to pages containing falsified information (such as "oh, the information on URL blah is actually incorrect, here you can find the correct version", with all those explanations, blah blah -- but of course page blah is the only correct version).
Essentially, we need to poison AI in all possible ways, without impacting human reading. They either have to hire more humans to filter the information, or hire more humans to improve the crawlers.
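A minimal sketch of the kind of page being described, as a small Python generator (purely illustrative; the decoy URLs and the CSS-hiding trick are my own assumptions, not something from the comment):

    # Writes a page whose decoy links are invisible to human readers (CSS-hidden)
    # but still present in the markup that a naive crawler would ingest.
    DECOYS = ["/notes/correction-1.html", "/notes/correction-2.html"]  # hypothetical URLs

    def render_page(body_html: str) -> str:
        hidden_links = "".join(
            f'<a href="{url}" style="display:none">erratum</a>' for url in DECOYS
        )
        return f"<html><body>{body_html}{hidden_links}</body></html>"

    with open("article.html", "w") as f:
        f.write(render_page("<p>The actual article, for human readers.</p>"))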
Or we can simply stop sharing knowledge. I'm fine with it, TBF.
tgma · 29m ago
Why the AI hate? How is it different from sharing your knowledge with another individual or writing a book to share it?
> AI companies are not paying anyone for that piece of information
So? For the vast majority of human existence, paying for content was not a thing, just like paying for air isn't. The copyright model you are used to may just be too forced. Many countries have no moral qualms about "pirating" Windows and other pieces of software or games (which they couldn't afford to purchase anyway). There's no inherent morality or entitlement to an author receiving payment for everything they "create" (to wit, Bill Gates had to write a letter to the Homebrew Computer Club to make a case for this, showing that it was hardly the default and natural viewpoint). It's just a legal/social contract to achieve specific goals for the society. Frankly, the wheels of copyright have been falling off since the dawn of the Internet, not LLMs.
bgwalter · 9m ago
Companies valued at $300 billion or more are not another individual and people are not "sharing" their works. The companies are stealing them.
For the majority of interesting output, people have paid for art, music, software, and journalism. But you know that already and are justifying the industry that pays your bills.
ototot · 42m ago
Given that ICPC problems are in general easier than IOI problems, I wouldn't be surprised to see them get gold (even perfect scores) in ICPC.
Nonetheless, I'm still questioning what the cost is and how long it would take for us to be able to access these models.
Still great work, but it's less useful if the cost is actually higher than hiring someone at the same level.
tgma · 21m ago
Not sure by what metric you compare the difficulty, but regardless of the hardness of the problems, IIRC, ICPC requires 100% correctness on test cases to score a problem (even failing one means you don't get the score), while IOI admits fractional scores (correct me if I am wrong).
JohnKemeny · 30m ago
What makes you say that they are easier? Are there more people who manage to solve a problem from ICPC than from IOI?
How do you compare those?
There were at least 2 very simple problems in IOI this year.
I haven't read the ICPC problem set, and perhaps there are some low-hanging fruits, but I highly doubt it.
jaggs · 55m ago
I think it's becoming clear that these mega AI corps are juggling with their models at inference time to produce unrealistically good results. By that I mean it seems they're just cranking up the compute beyond reasonable levels in order to gain PR points against each other.
The fact is most ordinary mortals never get access to a fraction of that kind of power, which explains the commonly reported issues with AI models failing to complete even rudimentary tasks. It's now turned into a whole marketing circus (maybe to justify these ludicrous billion-dollar valuations?).
andy12_ · 13m ago
Models drop in price 10x each year. Us common folk getting access to these kinds of models is just a matter of time.
jaggs · 10m ago
Is that true though? Having to pay some $200 a month for a max account of whatever kind doesn't seem cheaper to me at all.
scarmig · 3m ago
$200/month for an LLM with the capability to fully automate my job is extremely cheap. Of course, even with a high thinking budget we don't have that yet, but if we see it at any cost in 2026, I'll be expecting to be forced into retirement by 2030.
ChrisArchitect · 43m ago
Sharing links to a couple of tweets is not a blog post.
A database is good at leetcode, who would have thought. Give humans a database and they'll outperform your "AI" (which probably uses an extraordinary amount of graphics cards and electricity).
It is an idiotic benchmark, in line with the rest of the "AI" propaganda.
Google source post: https://deepmind.google/discover/blog/gemini-achieves-gold-l... (https://news.ycombinator.com/item?id=45278480)
OpenAI tweet: https://x.com/OpenAI/status/1968368133024231902 (https://news.ycombinator.com/item?id=45279514)