OpenAI claims Gold-medal performance at IMO 2025

123 Davidzheng 176 7/19/2025, 9:11:19 AM twitter.com ↗

Comments (176)

mikert89 · 32m ago
The cynicism/denial on HN about AI is exhausting. Half the comments are some weird form of explaining away the ever increasing performance of these models

I've been reading this website for probably 15 years, and it's never been this bad. Many threads are completely unreadable; all the actual educated takes are on X. It's almost like there was a talent drain.

halfmatthalfcat · 28m ago
The overconfidence/short sightedness on HN about AI is exhausting. Half the comments are some weird form of explaining how developers will be obsolete in five years and how close we are to AGI.
Aurornis · 15m ago
> Half the comments are some weird form of explaining how developers will be obsolete in five years and how close we are to AGI.

I do not see that at all in this comment section.

There is a lot of denial and cynicism like the parent comment suggested. The comments trying to dismiss this as just “some high school math problem” are the funniest example.

halfmatthalfcat · 2m ago
Woosh
kenjackson · 5m ago
I went through the thread and saw nothing that looked like this.

I don’t think developers will be obsolete in five years. I don’t think AGI is around the corner. But I do think this is the biggest breakthrough in computer science history.

I worked on accelerating DNNs a little less than a decade ago and had you shown me what we’re seeing now with LLMs I’d say it was closer to 50 years out than 20 years out.

halfmatthalfcat · 2m ago
You're missing the joke homie.
mikert89 · 3m ago
It's very clearly a major breakthrough for humanity.
infecto · 9m ago
I don’t typically find this to be true. There is a definite cynicism on HN especially when it comes to OpenAI. You already know what you will see. Low quality garbage of “I remember when OpenAI was open”, “remember when they used to publish research”, “sama cannot be trusted”, it’s an endless barrage of garbage.
mikert89 · 8m ago
it's honestly ruining this website; you can't even read the comment sections anymore
ninetyninenine · 1m ago
This isn’t true at all. Everyone is aware of the limitations of AI.

I’ve seen realistic takes and I see unrealistic takes where people over exaggerate and call it stochastic parroting.

blamestross · 17m ago
Nobody likes the idea that this is only "economically superior AI": not as good as humans, but a LOT cheaper.

The "it will just get better" line is bait for bubble investors. The tech companies learned from the past, and they are riding and managing the bubble to extract maximum ROI before it pops.

The reality is that a lot of work done by humans can be replaced by an LLM at lower quality and nuance. The loss in sales/satisfaction/etc. is more than offset by the reduced cost.

The current crop of LLMs are enshittification accelerators, and that will have real effects.

gellybeans · 15m ago
Making an account just to point out how these comments are far more exhausting, because they don't engage with the subject matter. They are just agreeing with a headline and saying, "See?"

You say, "explaining away the increasing performance" as though that were a good faith representation of arguments made against LLMs, or even this specific article. Questioning the self-congratulatory nature of these businesses is perfectly reasonable.

softwaredoug · 30m ago
Probably because both sides have strong vested interests and it’s next to impossible to find a dispassionate point of view.

The Pro AI crowd, VC, tech CEOs etc have strong incentive to claim humans are obsolete. Tech employees see threats to their jobs and want to poopoo any way AI could be useful or competitive.

orbital-decay · 15m ago
That's a huge hyperbole. I can assure you many people find the entire thing genuinely fascinating, without having any vested interest and without buying the hype.
chii · 6m ago
That's just another way to state that everybody is almost always self-serving when it comes to anything.
rvz · 23m ago
Or some can spot a euphoric bubble when they see one, with lots of participants who have over-invested in 90% of these so-called AI startups that are not frontier labs.
mikert89 · 1m ago
Dude, we have computers reasoning in English to solve math problems. What are you even talking about?
yunwal · 18m ago
What does this have to do with the math Olympiad? Why would it frame your view of the accomplishment?
emp17344 · 7m ago
Why don’t they release some info beyond a vague twitter hype post? I’m beginning to hate OpenAI for releasing statements like this that invariably end up being less impressive than they make it sound initially.
wyuyang377 · 19m ago
cynacism -> cynicism
ALLTaken · 1h ago
I think OpenAI participating is nothing but a publicity stunt, wholly unfair and disrespectful to human participants. AI models should be allowed to participate, but they should not be ranked equally, nor should engineers be put under duress of having to pull all-nighters. AI model performance should be shown T+2 days AFTER the contest! I wish that the real humans who worked hard could enjoy the attention, prize, and respect they deserve!

Billion-dollar companies stealing not only the prize, prestige, time, and sleep of participants by brute-forcing their model through all the illegally scraped code on GitHub is a disgrace to humanity.

AI models should read the same materials to become proficient in coding, without having trillions of lines of code to ape through mindlessly. Otherwise the "AI" is no different than an elaborate Monte Carlo Tree Search (MCTS).

Yes, I know AI is quite advanced. I know that quite well, study the latest SOTA papers daily, and have developed my own models as well from the ground up, but it's, despite all the advancements, still far away from substantially being better than MCTS (see: https://icml.cc/virtual/2025/poster/44177 and https://allenai.org/blog/autods )

EDIT, adding proof:

This is the results of the last competition they tried to win and have LOST: https://arstechnica.com/ai/2025/07/exhausted-man-defeats-ai-...

(Looks like a pattern: OpenAI Corp is scraping competitions to place themselves in the spotlight and headlines.)

jsnell · 7m ago
As far as I can tell, OpenAI didn't participate, and isn't claiming they participated. Note the fairly precise phrasing of "gold medal-level performance": they claim to have shown performance sufficient for a gold, not that they won one.
Aurornis · 42m ago
> I think OpenAI participating is nothing but a publicity stunt and wholly unfair and disrespectful against Human participants. It should be allowed for AI models to participate, but it should not be ranked equally,

OpenAI did not participate in the actual competition, nor were they taking spots away from humans. OpenAI just gave the problems to their AI under the same time limit and conditions (no external tool use).

> nor put any engineers under duress of having to pull all-nighters.

Under duress? At a company like this, all of the people working on this project are there because they want to be and they’re compensated millions.

aubanel · 54m ago
- AI competing is "wholly unfair"

- "[AI is] far away from substantially being better than MCTS"

^ pick only one

yobbo · 41m ago
Running MCTS over algorithms is the part that might be considered unfair if used in competition with humans.
threatripper · 35m ago
Humans should be allowed to compete in groups of arbitrary size. This would also be a demonstration of excellent teamwork under time pressure.
pclmulqdq · 9m ago
In a general sense, cheating and losing are not mutually exclusive.
stingraycharles · 46m ago
Yeah it’s a completely fair playing field, it’s completely obvious that AI should be able to compete with humans in the same way that robotics and computers can compete with humanity (and are better suited for many tasks).

Whether or not they're far away from being better than humans is up for debate, but the entire point of these types of benchmarks is to compare them to humans.

bluecalm · 19m ago
>>Yeah it’s a completely fair playing field, it’s completely obvious that AI should be able to compete with humans in the same way that robotics and computers can compete with humanity (and are better suited for many tasks).

Yeah, the same way computers and robots should be able to win the World Chess Championship, the 100m dash, and Wimbledon.

>>but the entire point of these types of benchmarks is to compare them to humans

The entire point of the competition is to compete against participants who are similar to you, have similar capabilities, and go through similar struggles. If you want bot-vs-human competitions, great: organize them yourself instead of hijacking well-established competitions.

chairhairair · 1h ago
OpenAI simply can’t be trusted on any benchmarks: https://news.ycombinator.com/item?id=42761648
qoez · 1h ago
Remember that they've fired all whistleblowers that would admit to breaking the verbal agreement that they wouldn't train on the test data.
samat · 33m ago
Could not find it on the open web. Do you have any clues on what to search for?
amelius · 1h ago
This is not a benchmark, really. It's an official test.
andrepd · 1h ago
And what were the methods? How was the evaluation? They could be making it all up for all we know!
Aurornis · 13m ago
The International Math Olympiad isn’t an AI benchmark.

It’s an annual human competition.

meroes · 1m ago
They didn’t actually compete.
chvid · 56m ago
I believe this company used to present its results and approach in academic papers with enough details so that it could be reproduced by third parties.

Now it is just doing a bunch of tweets?

do_not_redeem · 29m ago
They're doing tweets because the results cannot be reproduced. https://matharena.ai/
samat · 35m ago
This company used to be non profit

And many other things

z7 · 4h ago
Some previous predictions:

In 2021 Paul Christiano wrote he would update from 30% to "50% chance of hard takeoff" if we saw an IMO gold by 2025.

He thought there was an 8% chance of this happening.

Eliezer Yudkowsky said "at least 16%".

Source:

https://www.lesswrong.com/posts/sWLLdG6DWJEy3CH7n/imo-challe...

sigmoid10 · 1h ago
While I usually enjoy seeing these discussions, I think they are really pushing the usefulness of Bayesian statistics. If one dude says the chance of an outcome is 8% and another says it's 16%, and the outcome does occur, they were both pretty wrong, even though it might seem like the one who guessed a few percentage points higher had the better belief system. Now if one of them had said 90% while the other said 8% or 16%, then we should pay close attention to what they are saying.
zeroonetwothree · 33m ago
A 16% or even 8% event happening is quite common so really it tells us nothing and doesn’t mean either one was pretty wrong.
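One way to make "both pretty wrong" concrete is a proper scoring rule. A minimal sketch in Python (illustrative only), scoring the two forecasts from the thread with the Brier score now that the event has occurred:

```python
# Brier score: squared error between forecast probability and outcome.
# Lower is better; an uninformative 50% forecast scores 0.25.
def brier(p: float, outcome: int) -> float:
    return (p - outcome) ** 2

# The event (IMO gold by 2025) occurred, so outcome = 1.
for name, p in [("Christiano (8%)", 0.08), ("Yudkowsky (16%)", 0.16)]:
    print(f"{name}: {brier(p, 1):.4f}")
# Both scores (0.8464 and 0.7056) are far worse than the 0.25 of a
# coin-flip forecast; the higher estimate merely loses slightly less.
```

A single resolved event can rank the two forecasts, but as the comment above notes, it cannot tell us much with confidence; that takes a track record of many scored predictions.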
grillitoazul · 50m ago
From a mathematical point of view there are two factors: (1) the prior predictive ability of the human agents and (2) the acceleration in the predicted event. Examining the result under such a model, we conclude:

The greater the prior predictive ability of the human agents, the greater the posterior acceleration of progress in LLMs' math capability.

Here we are supposing that the increase in training data is not the main explanatory factor.

This example is the germ of a general framework for assessing acceleration in LLM progress, and I think applying it to many data points could give us valuable information.

grillitoazul · 20m ago
Another take at a sound interpretation:

(1) Bad prior prediction capability of humans implies that the result does not provide any information.

(2) Good prior prediction capability of humans implies that there is acceleration in the math capabilities of LLMs.
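Interpretation (2) can be phrased as a standard likelihood-ratio update between competing hypotheses. A small sketch: the 8% figure is Christiano's from upthread, but the 60% likelihood under a "fast progress" hypothesis and the even prior odds are made-up numbers for illustration:

```python
# Bayesian odds update: observing an event shifts belief toward the
# hypothesis that assigned it higher probability, by the likelihood ratio.
def posterior_odds(prior_odds: float, p_fast: float, p_slow: float) -> float:
    # Odds(fast : slow | event) = prior odds * P(event|fast) / P(event|slow)
    return prior_odds * (p_fast / p_slow)

# Suppose "fast progress" assigned 60% to an IMO gold by 2025 while
# "slow progress" assigned 8%. Starting from even (1:1) odds:
odds = posterior_odds(1.0, 0.60, 0.08)
print(odds)  # 7.5 : 1 in favor of "fast progress"
```

The update is only as informative as the likelihoods, which is exactly the point about prior predictive ability: if neither hypothesis assigned probabilities meaningfully, the ratio is noise.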

exegeist · 2h ago
Impressive prediction, especially pre-ChatGPT. Compare to Gary Marcus 3 months ago: https://garymarcus.substack.com/p/reports-of-llms-mastering-...

We may certainly hope Eliezer's other predictions don't prove so well-calibrated.

rafaelero · 1h ago
Gary Marcus is so systematically and overconfidently wrong that I wonder why we keep talking about this clown.
qoez · 1h ago
People just give attention to people making surprising bold counter narrative predictions but don't give them any attention when they're wrong.
dcre · 1h ago
I do think Gary Marcus says a lot of wrong stuff about LLMs but I don’t see anything too egregious in that post. He’s just describing the results they got a few months ago.
m3kw9 · 1h ago
He definitely cannot use the original arguments from when ChatGPT arrived; he's a perennial goal-post shifter.
causal · 1h ago
These numbers feel kind of meaningless without any work showing how he got to 16%
shuckles · 1h ago
My understanding is that Eliezer more or less thinks it's over for humans.
0xDEAFBEAD · 37m ago
andrepd · 1h ago
Context? Who are these people and what are these numbers and why shouldn't I assume they're pulled from thin air?
gniv · 2h ago
From that thread: "The model solved P1 through P5; it did not produce a solution for P6."

It's interesting that it didn't solve the problem that was by far the hardest for humans too. China, the #1 team, got only 21/42 points on it. In most other teams nobody solved it.

gus_massa · 1h ago
In the IMO, the idea is that on the first day you get P1, P2 and P3, and on the second day you get P4, P5 and P6. Ordered by difficulty, they are usually P1, P4, P2, P5, P3, P6. So P1 is usually "easy" and P6 is very hard. At least that is the intended order, but sometimes reality disagrees.

Edit: Fixed P4 -> P3. Thanks.

masterjack · 10m ago
In this case P6 was unusually hard and P3 was unusually easy https://sugaku.net/content/imo-2025-problems/
thundergolfer · 1h ago
You have P4 twice in there, latter should be 3
demirbey05 · 1h ago
I think someone from the Canada team solved it, but overall very few did.
ksec · 1h ago
I am neither an optimist nor a pessimist about AI; I would likely be called both by the opposing parties. But the fact that AI/LLMs are still rapidly improving is impressive in itself and worth celebrating. Is it perfect, AGI, ASI? No. Is it useless? Absolutely not.

I am just happy the prize is so big for AI that there is enough money involved to push for all the hardware advancement. Foundry, packaging, interconnect, networking, etc.: all the hardware research and tech improvements previously thought too expensive are now in "shut up and take my money" territory.

meroes · 1h ago
In the RLHF sphere you could tell some AI companies were targeting this because of how many IMO RLHF'ers they were hiring specifically. I don't think it's really easy to say how much "progress" this represents, given that.
modeless · 47m ago
Noam Brown:

> this isn’t an IMO-specific model. It’s a reasoning LLM that incorporates new experimental general-purpose techniques.

> it’s also more efficient [than o1 or o3] with its thinking. And there’s a lot of room to push the test-time compute and efficiency further.

> As fast as recent AI progress has been, I fully expect the trend to continue. Importantly, I think we’re close to AI substantially contributing to scientific discovery.

I thought progress might be slowing down, but this is clear evidence to the contrary. Not the result itself, but the claims that it is a fully general model and has a clear path to improved efficiency.

https://x.com/polynoamial/status/1946478249187377206

csomar · 30m ago
The issue is that trust is very hard to build and very easy to lose. Even in today's age where regular humans have a memory span shorter than that of an LLM, OpenAI keeps abusing the public's trust. As a result, I take their word on AI/LLMs about as seriously as I'd take my grocery store clerk's opinion on quantum physics.
emp17344 · 1m ago
I still haven’t forgotten OpenAI’s FrontierMath debacle from December. If they really have some amazing math-solving model, give us more info than a vague twitter hype-post.
stingraycharles · 41m ago
My issue with all these citations is that it’s all OpenAI employees that make these claims.

I’ll wait to see third party verification and/or use it myself before judging. There’s a lot of incentives right now to hype things up for OpenAI.

do_not_redeem · 32m ago
A third party tried this experiment with publicly available models. OpenAI did half as well as Gemini, and none of the models even got bronze.

https://matharena.ai/imo/

jsnell · 14m ago
I feel you're misunderstanding something. That's not "this exact experiment". Matharena is testing publicly available models against the IMO problem set. OpenAI was announcing the results of a new, unpublished model on that problem set.

It is totally fair to discount OpenAI's statement until we have way more details about their setup, and maybe even until there is some level of public access to the model. But you're doing something very different: implying that their results are fraudulent and (incorrectly) using the Matharena results as your proof.

do_not_redeem · 7m ago
Fair enough, edited.
YeGoblynQueenne · 37m ago
How is a claim, "clear evidence" to anything?
modeless · 24m ago
Most evidence you have about the world is claims from other people, not direct experiment. There seems to be a thought-terminating cliche here on HN, dismissing any claim from employees of large tech companies.

Unlike seemingly most here on HN, I judge people's trustworthiness individually and not solely by the organization they belong to. Noam Brown is a well known researcher in the field and I see no reason to doubt these claims other than a vague distrust of OpenAI or big tech employees generally which I reject.

kelipso · 27m ago
Haha, if Musk made a claim five years ago, it would’ve been taken as clear evidence here. Now it’s other people I guess, hype never dies.
johnecheck · 2h ago
Wow. That's an impressive result, but how did they do it?

Wei references scaling up test-time compute, so I have to assume they threw a boatload of money at this. I've heard talk of running models in parallel and comparing results - if OpenAI ran this 10000 times in parallel and cherry-picked the best one, this is a lot less exciting.

If this is legit, then we need to know what tools were used and how the model used them. I'd bet those are the 'techniques to make them better at hard to verify tasks'.
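For a sense of why the sampling setup matters: under independent attempts, even a tiny per-attempt solve rate becomes a near-certain best-of-N hit, provided something can verify which attempt is right. A quick sketch (the 0.1% rate is hypothetical, not a number from OpenAI):

```python
# Probability that at least one of n independent samples succeeds,
# given per-sample success rate p: 1 - (1 - p)^n.
def best_of_n(p: float, n: int) -> float:
    return 1 - (1 - p) ** n

# A hypothetical 0.1% per-attempt solve rate, sampled 10,000 times,
# gives a ~99.995% chance that at least one attempt is correct:
print(best_of_n(0.001, 10_000))
```

This is why the selection mechanism is the crux: massive parallel sampling plus an oracle that knows the answer proves little, while sampling plus a solution-blind verifier that reliably picks out correct proofs would itself be a significant capability.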

fnordpiglet · 1h ago
Why is that less exciting? A machine competing in an unconstrained, natural-language, difficult math contest and coming out on top by any means was breathtaking science fiction a few years ago; now it's not exciting? Regardless of the tools for verification or even solvers, why are the goalposts moving so fast? There is no bonus for "purity of essence" and using only neural networks. We live in an era where it's hard to tell if machines are thinking or not, which since the first computing machines has been seen as the ultimate achievement. Now we pooh-pooh the results of each iteration, which unfold month over month, not decade over decade.

You don’t have to be hyped to be amazed. You can retain the ability to dream while not buying into the snake oil. This is amazing no matter what ensemble of techniques used. In fact - you should be excited if we’ve started to break out of the limitations of forcing NN to be load bearing in literally everything. That’s a sign of maturing technology not of limitations.

YeGoblynQueenne · 50m ago
>> Why is that less exciting? A machine competing in an unconstrained natural language difficult math contest and coming out on top by any means is breath taking science fiction a few years ago - now it’s not exciting?

Half the internet is convinced that LLMs are a big data cheating machine and if they're right then, yes, boldly cheating where nobody has cheated before is not that exciting.

parasubvert · 54m ago
I think the main hesitancy is due to rampant anthropomorphism. These models cannot reason, they pattern match language tokens and generate emergent behaviour as a result.

Certainly the emergent behaviour is exciting but we tend to jump to conclusions as to what it implies.

This means we are far more trusting with software that lacks formal guarantees than we should be. We are used to software being sound by default but otherwise a moron that requires very precise inputs and parameters and testing to act correctly. System 2 thinking.

Now with NN it's inverted: it's a brilliant know-it-all but it bullshits a lot, and falls apart in ways we may gloss over, even with enormous resources spent on training. It's effectively incredible progress on System 1 thinking with questionable but evolving System 2 skills where we don't know the limits.

If you're not familiar with System 1 / System 2, it's googlable.

logicchains · 15m ago
>I think the main hesitancy is due to rampant anthropomorphism. These models cannot reason, they pattern match language tokens and generate emergent behaviour as a result

This is rampant human chauvinism. There's absolutely no empirical basis for the statement that these models "cannot reason", it's just pseudoscientific woo thrown around by people who want to feel that humans are somehow special. By pretty much every empirical measure of "reasoning" or intelligence we have, SOTA LLMs are better at it than the average human.

Davidzheng · 2h ago
I don't think it's much less exciting if they ran it 10,000 times in parallel. It implies an ability to discern when a proof is correct and rigorous (which o3 can't do consistently), and also means that outputting the full proof is within its capabilities, even if rarely.
FeepingCreature · 1h ago
The whole point of RL is if you can get it to work 0.01% of the time you can get it to work 100% of the time.
lcnPylGDnU4H9OF · 2h ago
> what tools were used and how the model used them

According to the twitter thread, the model was not given access to tools.

constantcrying · 1h ago
>if OpenAI ran this 10000 times in parallel and cherry-picked the best one, this is a lot less exciting.

That entirely depends on who did the cherry picking. If the LLM had 10000 attempts and each time a human had to falsify it, this story means absolutely nothing. If the LLM itself did the cherry picking, then this is just akin to a human solving a hard problem. Attempting solutions and falsifying them until the desired result is achieved. Just that the LLM scales with compute, while humans operate only sequentially.

johnecheck · 1h ago
The key bit here is whether the LLM doing the cherry picking had knowledge of the solution. If it didn't, this is a meaningful result. That's why I'd like more info, but I fear OpenAI is going to try to keep things under wraps.
diggan · 1h ago
> If it didn't

We kind of have to assume it didn't right? Otherwise bragging about the results makes zero sense and would be outright misleading.

samat · 28m ago
> would be outright misleading

why wouldn't they? what are the incentives not to?

blibble · 31m ago
OpenAI has been caught doing exactly this before.
another_twist · 17m ago
It's a level playing field, IMO. But there's another thread which claims not even bronze, and I really don't want to go to X for anything.
up2isomorphism · 59m ago
In fact, no car company claims "gold medal" performance in Olympic running, even though they could have done that 100 years ago. Obviously, since the IMO does not generate much money, it is an easy target.

BTW, "gold medal performance" looks like a promotional term to me.

ddtaylor · 49m ago
Glock should show up to the UFC and win the whole tournament handily.
flappyeagle · 56m ago
LMAO
mehulashah · 34m ago
The AI scaling that went on for the last five years is going to be very different from the scaling that will happen in the next ten years. These models have latent capabilities that we are racing to unearth. IMO is but one example.

There's so much to do at inference time. This result could not have been achieved without the substrate of general models. It's not like Go or protein folding: you need the collective public knowledge of society to build on. And yes, there's enough left for ten years of exploration.

More importantly, the stakes are high. There may be zero day attacks, biological weapons, and more that could be discovered. The race is on.

gcanyon · 20m ago
99.99+% of all problems humans face do not require particularly original solutions. Determining whether LLMs can solve truly original (or at least obscure) problems is interesting and worth doing, but it ignores the vast majority of the (near-term, at least) impact they will have.
dylanbyte · 4h ago
These are high school level only in the sense of assumed background knowledge, they are extremely difficult.

Professional mathematicians would not get this level of performance, unless they have a background in IMO themselves.

This doesn’t mean that the model is better than them in math, just that mathematicians specialize in extending the frontier of math.

The answers are not in the training data.

This is not a model specialized to IMO problems.

Davidzheng · 4h ago
Are you sure this is not specialized to IMO? I do see the twitter thread saying it's "general reasoning", but I'd imagine they RL'd on olympiad math questions? If not, I really hope someone from OpenAI says so, because it would be pretty astounding.
stingraycharles · 1h ago
They also said this is not part of GPT-5, and “will be released later”. It’s very, very likely a model specifically fine-tuned for this benchmark, where afterwards they’ll evaluate what actual real-world problems it’s good at (eg like “use o4-mini-high for coding”).
AIPedant · 3h ago
It almost certainly is specialized to IMO problems, look at the way it is answering the questions: https://xcancel.com/alexwei_/status/1946477742855532918

E.g here: https://pbs.twimg.com/media/GwLtrPeWIAUMDYI.png?name=orig

Frankly it looks to me like it's using an AlphaProof style system, going between natural language and Lean/etc. Of course OpenAI will not tell us any of this.

fnordpiglet · 1h ago
I actually think this "cheating" is fine. In fact it's preferable. I don't need an AI that can act as a really expensive calculator or solver; we've already built really good calculators and solvers that are near optimal. What has been missing is the abductive ability to successfully use those tools in an unconstrained space with agency.

I find no real value in avoiding the optimal or near-optimal techniques we've devised, rather than focusing on the harder reasoning tasks of choosing tools, instrumenting them properly, interpreting their results, and iterating. This is the missing piece in automated reasoning, after all. A NN that can approximate those tools at great cost is a parlor trick: interesting, but neither useful nor practical. Even if they have some agent system here, it doesn't make the achievement any less that a machine can zero-shot do as well as top humans at incredibly difficult reasoning problems posed in natural language.
redlock · 3h ago
AIPedant · 2h ago
If you don't have a Twitter account then x.com links are useless, use a mirror: https://xcancel.com/polynoamial/status/1946478249187377206

Anyway, that doesn't refute my point, it's just PR from a weaselly and dishonest company. I didn't say it was "IMO-specific" but the output strongly suggests specialized tooling and training, and they said this was an experimental LLM that wouldn't be released. I strongly suspect they basically attached their version of AlphaProof to ChatGPT.

Davidzheng · 2h ago
We can only go off their word, unfortunately, and they say no formal math, so I assume it's being evaluated by a verifier model instead of a formal system. There are actually some hints of this, because geometry in Lean is not that well developed, so unless they also built their own system it's hard to do it formally (though their P2 proof is by coordinate bash, i.e. computation by algebra instead of geometric construction, so it's hard to tell).
skdixhxbsb · 1h ago
> We can only go off their word

We’re talking about Sam Altman’s company here. The same company that started out as a non profit claiming they wanted to better the world.

Suggesting they should be given the benefit of the doubt is dishonest at this point.

YeGoblynQueenne · 48m ago
>> This is not a model specialized to IMO problems.

How do you know?

demirbey05 · 4h ago
Are you from OpenAI ?
ktallett · 4h ago
Hahaha! It's either that or they are determined to get a job there.
ktallett · 4h ago
I think that's an insult to professional mathematicians. Any mathematician who has gotten to the stage where they do this for a living will be more than capable of doing Olympiad questions. These are proofs and some general numerical maths; some are probably a little trickier than others, but the questions aren't unique, and most final-year BSc students in maths will have encountered similar ones. I wouldn't consider myself particularly great at maths (despite it being the language of physics/engineering, as many of my lecturers told me), but I can do plenty of the past questions without any significant reading. Most of these are similar to later-years uni problems, so the LLM will be able to find answers with the right searching. It may not be specialised to IMO problems, but this sort of math question pops up in plenty of settings, so it doesn't need to be.
parsimo2010 · 1h ago
I am a professor in a math department (I teach statistics, but there is a good complement of actual math PhDs), and only about 10% care about these types of problems, and definitely fewer than half could get gold on an IMO test even if they tried.

They are all outstanding mathematicians, but the IMO type questions are not something that mathematicians can universally solve without preparation.

There are of course some places that pride themselves on only taking “high scoring” mathematicians, and people will introduce themselves with their name and what they scored on the Putnam exam. I don’t like being around those places or people.

crinkly · 1h ago
100% agree with this.

My second degree is in mathematics. Not only can I probably not do these but they likely aren’t useful to my work so I don’t actually care.

I’m not sure an LLM could replace the mathematical side of my work (modelling). Mostly because it’s applied and people don’t know what they are asking for, what is possible or how to do it and all the problems turn out to be quite simple really.

Davidzheng · 4h ago
No, I assure you >50% of working mathematicians would not score gold level at the IMO consistently (I'm in the field). As the original parent said, pretty much only people who had the training in high school can. Number theorists without training might be able to do some IMO number theory questions, but this level is basically impossible without specialized training (with maybe a few exceptions for very strong mathematicians).
credit_guy · 3h ago
> No I assure you >50% of working mathematicians will not score gold level at IMO consistently (I'm in the field)

I agree with you. However, would a lot of working mathematicians score gold level without the IMO time constraints? Working mathematicians generally are not trying to solve a problem in the time span of one hour. I would argue that most working mathematicians, if given an arbitrary IMO problem and allowed to work on it for a week, would solve it. As for "gold level", with IMO problems you either solve one or you don't.

You could counter that it is meaningless to remove the time constraints. But we are comparing humans with OpenAI here. It is very likely OpenAI solved the IMO problems in a matter of minutes, maybe even seconds. When we talk about a chatbot achieving human-level performance, it's understood that time is not a constraint on the human side; we are only concerned with the quality of the human output. For example: can OpenAI write a novel at the level of Jane Austen? Maybe it can, maybe it can't (for now), but Jane Austen spent years writing such a novel, while our expectation is for OpenAI to do it at multiple words per second.

Davidzheng · 3h ago
I mean, back when I was practicing these problems I would sometimes try them on and off for a week and could do some 3&6's (usually I can do 1&4 somewhat consistently, and usually none of the others). As a working mathematician today, I would almost certainly not be able to get gold medal performance in a week, but for a given problem I'd guess I'd have at least a ~50% chance of solving it within a week? I haven't tried in a while, though. But I suspect the professionals here do worse at these competition questions than you think. Certainly these problems are "easy" compared to many of the questions we think about, but expertise drastically shifts the speed/difficulty of questions we can solve within our domains, if that makes sense.

Addendum: Actually, I'm not sure the probability of solving one of these in a week is much better than in 6 hours, because they are kind of random questions. But I agree with parts of your post, tbf.

jsnell · 1h ago
> It is very likely OpenAI solved the IMO problems in a matter of minutes, maybe even seconds

Really? My expectation would have been the opposite, that time was a constraint for the AIs. OpenAI's highest end public reasoning models are slow, and there's only so much that you can do by parallelization.

Understanding how they dealt with time actually seems like the most important thing to put these results into context, and they said nothing about it. Like, I'd hope they gave the same total time allocation for a whole problem set as the human competitors. But how did they split that time? Did they work on multiple problems in parallel?

ktallett · 3h ago
I sense we may just have different experiences with colleagues' skill sets, as I can think of 5 people I could send some of these questions to, and I know they would do them just fine. In fact we have often done similar problems on a free afternoon, and I often do the same on flights as a way to pass the time and improve my focus (my issue isn't my talent/understanding in maths, it's my ability to concentrate). I don't disagree that some level of training is needed, but these questions aren't unique, nor impossible, especially as said training exists and LLMs can access those examples. LLMs also have brute force, which is a significant help with this type of problem. One particular point: of all the STEM topics, math is probably the best documented, alongside CS.
Davidzheng · 3h ago
I mean, you can get better at these problems with practice. But if you haven't solved many before and can do them after an afternoon of thought, I would be very impressed. Not that I don't believe you; it's just that in my experience people like this are very rare. (Also, I assume they have to have some degree of familiarity with common tricks, otherwise they would have to derive basic number theory from scratch etc., and that seems a bit much for me to believe.)
ktallett · 3h ago
I think honestly it's probably different experiences and skillsets. I find these sorts of things doable, bar dumb mistakes, yet there will be other things I'll get stressed about and not be able to do for ages (some lab skills, no matter how many times I do them, and some physical equation derivations that I regularly muck up). I maybe sometimes assume that what comes easy for me comes easy for all, and that what I struggle with everyone struggles with, and that's probably not always the case. Likewise, I did similar tasks as a teen in school and assume that's the case for many of the academically bright, so to speak, but perhaps it isn't; that probably helped me learn some tricks I might not have otherwise. But as you say, I do feel you can learn the tricks and learn how to do these, even at an older age (academically speaking), if you have the time, the patience, and the right guide.
samat · 9m ago
There you go: you did this type of problem as a kid/teenager. 1) You likely have a talent for it; 2) you have some training.

I did participate in math/informatics olympiads as a teenager and even taught them a little, and from my experience some people just _like_ this sort of problem naturally; the problems tickle their minds, and given time these people would develop to insane levels at them.

'Normal people', in my experience, even in math departments, don't like this type of problem and would not fare well with them.

jebarker · 1h ago
IMO questions are to math as leetcode questions are to software engineering. Not necessarily easier or harder but they test ability on different axes. There’s definitely some overlap with undergrad level proof style questions but I disagree that being a working mathematician would necessarily mean you can solve these type of questions quickly. I did a PhD in pure math (and undergrad obv) and I know I’d have to spend time revising and then practicing to even begin answering most IMO questions.
gametorch · 4h ago
Getting gold at the IMO is pretty damn hard.

I grew up in a relatively underserved rural city. I skipped multiple grades in math, completed the first two years of college math classes while in high school, and won the award for being the best at math out of everyone in my school.

I've met and worked with a few IMO gold medalists. Even though I was used to scoring in the 99th percentile on all my tests, it felt like these people were simply in another league above me.

I'm not trying to toot my own horn. I'm definitely not that smart. But it's just ridiculous to shoot down the capabilities of these models at this point.

npinsker · 3h ago
The trouble is, getting an IMO gold medal is much easier (by frequency) than being the #1 Go player in the world, which was achieved by AI 10 years ago. I'm not sure it's enough to just gesture at the task; drilling down into precisely how it was achieved feels important.

(Not to take away from the result, which I'm really impressed by!)

Invictus0 · 2h ago
The "AI" that won Go was Monte Carlo tree search guided by a neural net "memory" of the outcomes of millions of previous games; this is an LLM solving open-ended problems. The tasks are hardly even comparable.
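For intuition, the core MCTS loop is small. Here's a toy sketch, with random rollouts standing in for AlphaGo's learned value network and a take-1-to-3-stones Nim game standing in for Go (the names, constants, and game are my illustration, not anything from DeepMind):

```python
import math
import random

TAKE = (1, 2, 3)  # legal moves: remove 1-3 stones; taking the last stone wins

class Node:
    def __init__(self, stones, parent=None):
        self.stones = stones   # stones remaining; it is someone's turn to move
        self.parent = parent
        self.children = {}     # move -> resulting Node
        self.visits = 0
        self.wins = 0.0        # wins for the player who just moved into this state

    def untried(self):
        return [m for m in TAKE if m <= self.stones and m not in self.children]

def rollout_mover_wins(stones):
    """Random playout: True if the player about to move from `stones` wins."""
    mover = True
    while True:
        stones -= random.choice([m for m in TAKE if m <= stones])
        if stones == 0:
            return mover  # whoever took the last stone wins
        mover = not mover

def mcts(stones, iters=8000, c=1.4):
    root = Node(stones)
    for _ in range(iters):
        node = root
        # 1. Selection: descend via UCB1 while fully expanded and non-terminal.
        while node.stones > 0 and not node.untried():
            node = max(node.children.values(),
                       key=lambda ch: ch.wins / ch.visits
                       + c * math.sqrt(math.log(node.visits) / ch.visits))
        # 2. Expansion: add one untried child.
        if node.stones > 0:
            m = random.choice(node.untried())
            node.children[m] = Node(node.stones - m, parent=node)
            node = node.children[m]
        # 3. Simulation: value from the viewpoint of the player who just moved.
        win = True if node.stones == 0 else not rollout_mover_wins(node.stones)
        # 4. Backpropagation: alternate the winner on the way up.
        while node is not None:
            node.visits += 1
            node.wins += win
            win = not win
            node = node.parent
    # Best move = most-visited child of the root.
    return max(root.children, key=lambda m: root.children[m].visits)
```

With 1-3 stones takeable, multiples of 4 are losing positions, so from 5 stones the search settles on taking 1. AlphaGo's twist was replacing the random rollout with a value network and biasing selection with a policy network, both trained on those millions of games.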
yobbo · 28m ago
A "reasoning LLM" might not be conceptually far from MCTS.
gafferongames · 2h ago
And then they created AlphaGo Zero, which was not trained on any previous games, and it was even stronger!

https://deepmind.google/discover/blog/alphago-zero-starting-...

amelius · 59m ago
Makes sense. Mathematicians use intuition a lot to drive their solution seeking, and I suppose an AI such as an LLM could develop intuition too. Of course, where AI really wins is search speed and the fact that an LLM doesn't get tired when exploring different strategies and the steps within each strategy.

However, I expect that geometric intuition may still be lacking, mostly because of the difficulty of encoding it in a form an LLM can easily work with. After all, ChatGPT still can't draw a unicorn [1], although it seems to be getting closer.

[1] https://gpt-unicorn.adamkdean.co.uk/

another_twist · 16m ago
I am quite surprised that DeepMind, with MCTS, wasn't able to reach this math performance first.
demirbey05 · 4h ago
Progress is astounding. A report was recently published evaluating LLMs on IMO 2025: o3-high didn't even get bronze.

https://matharena.ai/imo/

Waiting for Terry Tao's thoughts, but this kind of thing is a good use of AI. We need it to make science progress faster, rather than disrupting our economy before we're ready.

ktallett · 4h ago
Astounding in what sense? I assume you are aware of the standard of Olympiad problems: it is not particularly high. They are just challenging for the age range, but they shouldn't be for an AI, considering they aren't really anything but proofs and basic structured math problems.

Considering OpenAI can't currently analyse and provide real paper sources to cutting edge scientific issues, I wouldn't trust it to do actual research outside of generating matplotlib code.

saagarjha · 3h ago
I did competitive math in high school and I can confidently say that they are anything but "basic". I definitely can't solve them now (as an adult) and it's likely I never will. The same is true for most people, including people who actually pursued math in college (I didn't). I'm not going to be the next guy who unknowingly challenges a Putnam winner to do these but I will just say that it is unlikely that someone who actually understands the difficulty of these problems would say that they are not hard.

For those following along but without math specific experience: consider whether your average CS professor could solve a top competitive programming question. Not Leetcode hard, Codeforces hard.

samat · 1m ago
Thanks for speaking sense. I think 99% of people saying IMO problems are not hard would not be able to solve basic district-level competition problems and are just not equipped to judge the problems.

And 1% here are those IMO/IOI winners who think everyone is just like them. I grew up with them and to you, my friends, I say: this is the reason why AI would not take over the world (and might even not be that useful for real world tasks), even if it wins every damn contest out there.

zug_zug · 1h ago
I feel like I've noticed you making the same comment in 12 places in this thread, misrepresenting the difficulty of this competition, and ultimately it comes across as a bitter ex.

Here's an example problem 5:

Let a1,a2,…,an be distinct positive integers and let M=max⁡1≤i<j≤n.

Find the maximum number of pairs (i,j) with 1≤i<j≤n for which (ai +aj )(aj −ai )=M.

causal · 46m ago
Where did you get this? Don't see it on the 2025 problem set and now I wanna see if I have the right answer
causal · 1h ago
What does max⁡1≤i<j≤n mean? Wouldn't M always be j?
kelipso · 5m ago
Guessing it should be M = max_{⁡1≤i<j≤n} ai+aj or some other function M = max_{⁡1≤i<j≤n} f(ai,aj).
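If the intended function really were f(ai, aj) = (ai+aj)(aj−ai) itself, a quick brute force suggests that reading is degenerate, since (x+y)(y−x) = y²−x² (this reading is my guess, not the actual problem statement):

```python
from itertools import combinations

def count_max_pairs(a):
    # Under the guessed reading M = max over i<j of (a_i + a_j)(a_j - a_i),
    # count how many pairs attain M.  Since (x + y)(y - x) = y^2 - x^2,
    # the max should be y_max^2 - x_min^2.
    vals = [(x + y) * (y - x) for x, y in combinations(sorted(a), 2)]
    M = max(vals)
    return vals.count(M)
```

It returns 1 for every distinct set, and in fact it's always 1 for distinct positive integers (only the smallest-largest pair attains the max), which is why I suspect M is meant to be some other function of the ai.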
Aurornis · 1h ago
> I assume you are aware of the standard of Olympiad problems and that they are not particularly high.

Every time an LLM reaches a new benchmark there’s a scramble to downplay it and move the goalposts for what should be considered impressive.

The International Math Olympiad was used by many people as an example of something that would be too difficult for LLMs. It has been a topic of discussion for some time. The fact that an LLM has achieved this level of performance is very impressive.

You’re downplaying the difficulty of these problems. It’s called international because the best in the entire world are challenged by it.

Davidzheng · 4h ago
sorry but I don't think it's accurate to say "they are just challenging for the age range"
ktallett · 3h ago
I'm aware you believe they are impossible tasks unless you have specific training, I happen to disagree with that.
Davidzheng · 3h ago
You mean specific IMO training, or general math training? The latter is certainly needed; that the former is needed is, in my opinion, a general observation about the people who make it onto the teams.
ktallett · 3h ago
I mean IMO training; yes, I agree you wouldn't be able to do this without complete math knowledge.
demirbey05 · 4h ago
I mean progress speed: a few months ago they released o3, and it scored 16 points on IMO 2025.
ktallett · 4h ago
In that regard I would agree, but that suggests to me that the prior hype was unfounded.
Jackson__ · 35m ago
Also interesting takeaways from that tweet chain:

>GPT5 soon

>it will not be as good as this secret(?) model

quirino · 57m ago
I think equally impressive is the performance of the OpenAI team at the "AtCoder World Tour Finals 2025" a couple of days ago. There were 12 human participants and only one did better than OpenAI.

Not sure there is a good writeup about it yet but here is the livestream: https://www.youtube.com/live/TG3ChQH61vE.

zeroonetwothree · 32m ago
And yet when working on production code current LLMs are about as good as a poor intern. Not sure why the disconnect.
kenjackson · 15m ago
Depends. I’ve been using it for some of my workflows and I’d say it’s more like a solid junior developer with weird quirks: sometimes it makes stupid mistakes, and other times it behaves like a 30-year SME vet.
ktallett · 4h ago
Tbh, with the way everyone has been going on about the quality of OpenAI, high school/early university maths problems should not have been a stretch for it at all. The fact that this unverified claim is only just being mentioned suggests their AI isn't quite as amazing as marketed. Especially considering that logic and rule-following should fundamentally be easy for it, and most Olympiad problems make it rather easy to extract the key details.
Aurornis · 51m ago
> high school/early university maths problems should not have been a stretch at all for it.

Either you are unfamiliar with the International Math Olympiad or you’re trying to be misleading.

Calling these problems high school/early university maths is a ridiculous characterization.

gametorch · 3h ago
> high school/early university maths problems should not have been a stretch at all for it

This is a ridiculous understatement of the difficulty of getting gold at the IMO.

ktallett · 3h ago
That is the level of math you need to do these problems, along with a brief understanding of what certain concepts are. There is no calculus etc. The vast majority of IMO questions are about applying the base rules to new problems.
daedrdev · 1m ago
Jcampuzano2 · 2h ago
There are entire fields of math with exceptional people trying to solve impossibly hard problems that utilize quite literally 0 calculus.

Many of them are also questions that eventually end up with proofs or solutions requiring only a very high-level understanding of basic principles. But when I say very high, I mean impossibly high for the average person, along with the ability to combine simple concepts to solve complex problems.

I'd wager the majority of Math graduates from universities would struggle to answer most IMO questions.

curt15 · 2h ago
Olympiad questions don't require advanced concepts except maybe some classical geometry techniques that you wouldn't normally encounter in modern research mathematics. But they're fundamentally designed as puzzles. You need to spot the tricks.
oytis · 2h ago
It's like saying getting a gold medal in boxing is not hard, because it doesn't involve any firearms
pragmatic · 2h ago
More fair comparison: Military grade killbot enters ring with boxer and proceeds to fire pneumatic hammer at boxer until KO?
Davidzheng · 3h ago
You'd be surprised at how much math the people who actually get IMO gold know...
gametorch · 3h ago
Okay, let's see you try any one of the past IMOs and show us your score.

It's really hard.

See my other comment. I was voted the best at math in my entire high school by my teachers and completed the first two years of college classes while still in high school. I've tried IMO problems for fun. I'm very happy if I get one right. I'd be infinitely satisfied to score perfectly on 3 of the 6 problems, and that's nowhere near gold.

tlb · 1h ago
I encourage anyone who thinks these are easy high-school problems to try to solve some. They're published (including this year's) at https://www.imo-official.org/problems.aspx. They make my head spin.
xpressvideoz · 48m ago
I didn't know there were localized versions of the IMO problems. But now that I think of it, having versions in multiple languages is a must to remove the language barrier for competitors. I guess having that many language versions (I see ~50 languages?) may make keeping the problems secure considerably harder?
orespo · 4h ago
Definitely interesting. Two thoughts. First, are the IMO questions somewhat related to other openly available questions online, making it easier for LLMs that are more efficient and better at reasoning to deduce the results from the available content?

Second, happy to test it on open math conjectures or by attempting to reprove recent math results.

evrimoztamur · 4h ago
From what I've seen, IMO question sets are very diverse. Moreover, humans also train on every available set of math olympiad questions and similar material. It seems fair game to have the AI train on them as well.

For 2, there's an army of independent mathematicians right now using automated theorem provers to formalise more or less all mathematics as we know it. It seems like open conjectures are chiefly bounded by a genuine lack of new tools/mathematics.
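For a flavor of what that formalisation effort looks like, here's a toy Lean 4 statement in roughly the Mathlib style (a deliberately trivial fact; the real projects tackle research-level theorems):

```lean
import Mathlib

-- Toy example: the sum of two even naturals is even.
-- `Even n` unfolds to `∃ r, n = r + r` in Mathlib.
theorem even_add_even (m n : ℕ) (hm : Even m) (hn : Even n) :
    Even (m + n) := by
  obtain ⟨a, ha⟩ := hm
  obtain ⟨b, hb⟩ := hn
  exact ⟨a + b, by omega⟩
```

Once statements are in this form, a proof checker verifies every step mechanically, which is what makes the formalisation army's output trustworthy at scale.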

ktallett · 4h ago
You mean that previous years' questions will have been used to train it? Yes, they are the same questions, and due to the limited format of math questions there are repeats, so LLMs should fundamentally be able to recognise the structure and similarities and use that.
laurent_du · 1h ago
They are not the same questions. Why are you spreading so many misinformed takes in this thread? I know a guy who had one of the best scores in IMO history, and he's incredibly intelligent. Stop repeating that getting a gold medal at the IMO is a piece of cake; it's not.
andrepd · 1h ago
Am I missing something or is this completely meaningless? It's 100% opaque, no details whatsoever and no transparency or reproducibility.

I wouldn't trust these results as it is. Considering that there are trillions of dollars on the line as a reward for hyping up LLMs, I trust it even less.

flappyeagle · 53m ago
Yes you are missing the entire boat
YeGoblynQueenne · 41m ago
Guys, that's nothing. My new AI system is not LLM-based but neuro-symbolic and yet it just scored 100% on the IMO 2026 problems that haven't even been written yet, it is that good.

What? This is a claim with all the trustworthiness of OpenAI's claim. I mean, I can claim anything I want at this point and it would be just as trustworthy as OpenAI's claim, with exactly zero details about anything other than "we did it, promise".

davidguetta · 2h ago
Wait for the Chinese version
procgen · 1h ago
riding coattails
tester756 · 4h ago
huh?

any details?

ktallett · 4h ago
It is able to solve some high school/early bsc maths problems.
Jcampuzano2 · 2h ago
Calling these high school/early bsc maths questions is an understatement lol.
littlestymaar · 4h ago
Which would be impressive if we knew those problems weren't in the training data already.

I mean it is quite impressive how language models are able to mobilize the knowledge they have been trained on, especially since they are able to retrieve information from sources that may be formatted very differently, with completely different problem statement sentences, different variable names and so on, and really operate at the conceptual level.

But we must be wary of mixing up smart information retrieval with reasoning.

ktallett · 4h ago
Considering these LLMs utilise the entirety of the internet, there will be no unique problems in the Olympiad. Even across the course of a degree, you will likely have been exposed to 95% of the various ways problems are written. As you say, retrieval is really the only skill here. There is likely no reasoning.
reactordev · 2h ago
The Final boss was:

   Which is greater, 9.11 or 9.9?

/s

I kid, this is actually pretty amazing!! I've noticed over the last several months that I've had to correct it less and less when dealing with advanced math topics so this aligns.

Lionga · 4h ago
counting "R"s in strawberry now counts for a gold medal in math?
timbaboon · 1h ago
Haha no - then it wouldn't have got a gold medal ;)
ktallett · 4h ago
The Olympiad is a great thing for children, for sure. But this is not what I feel we should be wasting AI resources on. I question whether it's even impressive.
baq · 4h ago
The velocity of AI progress in recent years is exceeded only by the velocity of the goalposts.
ktallett · 4h ago
The goalposts should focus on being able to make a coherent statement about a subject using papers, with sources. At this point it can't do that for any remotely cutting-edge topic. This is just a distraction.
mindwok · 3h ago
The idea of a computer solving IMO problems it has not seen before, posed in natural language, would have been complete science fiction even just 3 years ago. This is astounding progress.
zkmon · 1h ago
This is awesome progress in human achievement, getting these machines intelligent. And it is also a fast regression and decline in human wisdom!

We are simply greasing the grooves, letting things slide faster and faster, and calling it progress. How does this help humans and nature integrate better?

Does this improve the climate, or help humans adapt better to a changing climate? Are intelligent machines a burning need for humanity today? Or is it all about business and political dominance? At what cost? What's the fallout of all this?

jebarker · 1h ago
Nobody knows the answers to these questions. Relying on AGI to solve problems like climate change seems like a risky strategy, but on the other hand it's very plausible that these tools can help in some capacity. So we have to build, study, and find out, while also considering the opportunity cost of building these tools versus others.
jfengel · 20m ago
Solving climate change isn't a technical problem, but a human one. We know the steps we have to take, and have for many years. The hard part is getting people to actually do them.

No human has any idea how to accomplish that. If a machine could, we would all have much to learn from it.