OpenAI claims Gold-medal performance at IMO 2025

42 points by Davidzheng | 54 comments | 7/19/2025, 9:11:19 AM | twitter.com ↗

Comments (54)

gniv · 45m ago
From that thread: "The model solved P1 through P5; it did not produce a solution for P6."

It's interesting that it didn't solve the problem that was by far the hardest for humans too. China, the #1 team, got only 21/42 points on it. On most other teams, nobody solved it.

johnecheck · 24m ago
Wow. That's an impressive result, but how did they do it?

Wei references scaling up test-time compute, so I have to assume they threw a boatload of money at this. I've heard talk of running models in parallel and comparing results - if OpenAI ran this 10000 times in parallel and cherry-picked the best one, this is a lot less exciting.

If this is legit, then we need to know what tools were used and how the model used them. I'd bet those are the 'techniques to make them better at hard-to-verify tasks'.
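
To make that concrete: by "running in parallel and comparing results" I mean something like best-of-n sampling, with a verifier or reranker picking the winner. A minimal sketch of the idea (generate/score are hypothetical stand-ins; nothing here reflects OpenAI's actual pipeline):

    def generate(problem: str, seed: int) -> str:
        # Stand-in for one independently sampled model output (hypothetical).
        return f"candidate proof #{seed} for: {problem}"

    def score(proof: str) -> float:
        # Stand-in for a verifier/reranker's rigor score (hypothetical).
        return (hash(proof) % 1000) / 1000.0

    def best_of_n(problem: str, n: int = 10_000) -> str:
        # Sample n candidates independently and keep the top-scoring one.
        candidates = (generate(problem, s) for s in range(n))
        return max(candidates, key=score)

Whether that counts as cherry-picking arguably hinges on what score() is: a trained verifier the system runs itself is a very different claim from a human picking the best transcript.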

lcnPylGDnU4H9OF · 7m ago
> what tools were used and how the model used them

According to the twitter thread, the model was not given access to tools.

Davidzheng · 10m ago
I don't think it's much less exciting if they ran it 10000 times in parallel? It implies an ability to discern when a proof is correct and rigorous (which o3 can't do consistently), and it also means that outputting the full proof is within the model's capabilities, even if rarely.
z7 · 2h ago
Some previous predictions:

In 2021 Paul Christiano wrote he would update from 30% to "50% chance of hard takeoff" if we saw an IMO gold by 2025.

He thought there was an 8% chance of this happening.

Eliezer Yudkowsky said "at least 16%".

Source:

https://www.lesswrong.com/posts/sWLLdG6DWJEy3CH7n/imo-challe...

exegeist · 39m ago
Impressive prediction, especially pre-ChatGPT. Compare to Gary Marcus 3 months ago: https://garymarcus.substack.com/p/reports-of-llms-mastering-...

We may certainly hope Eliezer's other predictions don't prove so well-calibrated.

dylanbyte · 2h ago
These are high-school level only in the sense of assumed background knowledge; they are extremely difficult.

Professional mathematicians would not get this level of performance, unless they have a background in IMO themselves.

This doesn’t mean that the model is better than them in math, just that mathematicians specialize in extending the frontier of math.

The answers are not in the training data.

This is not a model specialized to IMO problems.

Davidzheng · 1h ago
Are you sure this is not specialized to IMO? I do see the twitter thread saying it's "general reasoning", but I'd imagine they RL'd on olympiad math questions? If not, I really hope someone from OpenAI says so, because it would be pretty astounding.
AIPedant · 1h ago
It almost certainly is specialized to IMO problems, look at the way it is answering the questions: https://xcancel.com/alexwei_/status/1946477742855532918

E.g here: https://pbs.twimg.com/media/GwLtrPeWIAUMDYI.png?name=orig

Frankly, it looks to me like it's using an AlphaProof-style system, going between natural language and Lean/etc. Of course OpenAI will not tell us any of this.
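
To make the AlphaProof comparison concrete: the idea is to autoformalize each problem into a proof assistant such as Lean, where candidate proofs can be machine-checked rather than judged by another model. A toy illustration of what such a formal target looks like (my own example, assuming Lean 4 with Mathlib; not anything from OpenAI's system):

    import Mathlib

    -- A machine-checkable statement and proof. An AlphaProof-style
    -- loop autoformalizes natural language into goals like this,
    -- then searches for a proof the Lean kernel accepts.
    theorem toy (a b : ℤ) : 0 ≤ a ^ 2 + b ^ 2 := by
      positivity

The point is that the Lean kernel, not an LLM, certifies correctness, which is exactly the property a pure natural-language pipeline lacks.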

redlock · 55m ago
AIPedant · 31m ago
If you don't have a Twitter account then x.com links are useless; use a mirror: https://xcancel.com/polynoamial/status/1946478249187377206

Anyway, that doesn't refute my point; it's just PR from a weaselly and dishonest company. I didn't say it was "IMO-specific", but the output strongly suggests specialized tooling and training, and they said this was an experimental LLM that wouldn't be released. I strongly suspect they basically attached their version of AlphaProof to ChatGPT.

Davidzheng · 8m ago
We can only go off their word unfortunately, and they say no formal math, so I assume it's being evaluated by a verifier model instead of a formal system. There are actually some hints of this, because geometry in Lean is not that well developed, so unless they also built their own system it's hard to do it formally (though their P2 proof is by coordinate bash, i.e., computation by algebra instead of geometric construction, so it's hard to tell).
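
For readers unfamiliar with the term: a coordinate bash assigns coordinates to the points and reduces the geometry to algebra that can be ground through mechanically. A toy sketch of the technique in sympy (my own illustrative example, not anything from the actual P2 proof):

    import sympy as sp

    # Toy coordinate bash: the diagonals of a parallelogram bisect
    # each other. Place the vertices symbolically...
    b, c, d = sp.symbols('b c d', real=True)
    A = sp.Matrix([0, 0])
    B = sp.Matrix([b, 0])
    C = sp.Matrix([b + c, d])
    D = sp.Matrix([c, d])

    # ...then the "proof" is pure algebraic simplification: both
    # diagonals have the same midpoint.
    mid_AC = (A + C) / 2
    mid_BD = (B + D) / 2
    assert sp.simplify(mid_AC - mid_BD) == sp.zeros(2, 1)

The synthetic construction gets replaced by rote algebra, which is exactly why it's hard to tell what kind of system produced it.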
demirbey05 · 2h ago
Are you from OpenAI ?
ktallett · 2h ago
Hahaha! It's either that or they are determined to get a job there.
ktallett · 2h ago
I think that's an insult to professional mathematicians. Any mathematician who has got to the stage where they do this for a living will be more than capable of doing Olympiad questions. These are proofs and some general numerical maths; some are probably a little trickier than others, but the questions aren't unique, and most final-year BSc students in maths will have encountered similar ones. I wouldn't consider myself particularly great at maths (despite it being the language of physics/engineering, as many of my lecturers told me), but I can do plenty of the past questions without any significant reading. Most of these are similar to later-year university problems, so the LLM will be able to find answers with the right searching. It may not be specialised to IMO problems, but these sorts of math questions pop up in plenty of settings, so it doesn't need to be.
Davidzheng · 1h ago
No, I assure you, >50% of working mathematicians will not score at gold level at the IMO consistently (I'm in the field). As the original parent said, pretty much only people who had the training in high school can. Number theorists without that training might be able to do some of the number theory IMO questions, but this level is basically impossible without specialized training (with maybe a few exceptions for very strong mathematicians).
credit_guy · 1h ago
> No, I assure you, >50% of working mathematicians will not score at gold level at the IMO consistently (I'm in the field)

I agree with you. However, would a lot of working mathematicians score gold level without the IMO time constraints? Working mathematicians generally are not trying to solve a problem in the time span of one hour. I would argue that most working mathematicians, if given an arbitrary IMO problem and allowed to work on it for a week, would solve it. As for "gold level", with IMO problems you either solve one or you don't.

You could counter that it is meaningless to remove the time constraints. But we are comparing humans with OpenAI here. It is very likely OpenAI solved the IMO problems in a matter of minutes, maybe even seconds. When we talk about a chatbot achieving human-level performance, it's understood that time is not a constraint on the human side; we are only concerned with the quality of the human output. For example: can OpenAI write a novel at the level of Jane Austen? Maybe it can, maybe it can't (for now), but Jane Austen spent years writing such a novel, while our expectation is for OpenAI to do it at multiple words per second.

Davidzheng · 1h ago
I mean, back when I was practicing these problems, I would sometimes try them on and off for a week and would be able to do some 3s and 6s (usually I can do 1 and 4 somewhat consistently, and usually none of the others). As a working mathematician today, I would almost certainly not be able to get gold-medal performance in a week, but for a given problem I'd guess I have at least a ~50% chance of solving it in a week? But I haven't tried in a while. I suspect the professionals here do worse at these competition questions than you think. Certainly these problems are "easy" compared to many of the questions we think about, but expertise drastically shifts the speed/difficulty of questions we can solve within our domains, if that makes sense.

Addendum: Actually, I'm not sure the probability of solving one of these in a week is much better than in 6 hours, because they're kind of hit-or-miss questions. But I agree with parts of your post, to be fair.

ktallett · 1h ago
I sense we may just have different experiences of our colleagues' skill sets, as I can think of 5 people I could send some questions to, and I know they would do them just fine. In fact, we often have done similar problems on a free afternoon, and I often do the same on flights as a way to pass the time and improve my focus (my issue isn't my talent for or understanding of maths, it's my ability to concentrate). I don't disagree that some level of training is needed, but these questions aren't unique, nor impossible, especially as said training exists and LLMs can access said examples. LLMs also have brute force, which is a significant help with these types of problems. One other point: of all the STEM topics, maths is probably the best documented online, alongside CS.
Davidzheng · 1h ago
I mean, you can get better at these problems with practice. But if you haven't solved many before and can do them after an afternoon of thought, I would be very impressed. Not that I don't believe you; it's just that in my experience people like this are very rare. (Also, I assume they have to have some degree of familiarity with common tricks, otherwise they would have to derive basic number theory from scratch, etc., and that seems a bit much for me to believe.)
ktallett · 1h ago
I think honestly it's probably different experiences and skill sets. I find these sorts of things doable, bar dumb mistakes, yet there will be other things I'll get stressed about and not be able to do for ages (some lab skills, no matter the number of times I do them, and some physical equation derivations that I regularly muck up). I sometimes assume that what comes easily to me comes easily to all, and that what I struggle with everyone struggles with, and that's probably not always the case. Likewise, I did similar tasks as a teen in school and assume that is the case for many of the academically bright, so to speak, but perhaps it isn't; that probably helped me learn some tricks I may not have learned otherwise. But as you say, I do feel that you can learn the tricks and learn how to do these, even at an older age (academically speaking), if you have the time, the patience, and the right guide.
gametorch · 1h ago
Getting gold at the IMO is pretty damn hard.

I grew up in a relatively underserved rural city. I skipped multiple grades in math, completed the first two years of college math classes while in high school, and won the award for being the best at math out of everyone in my school.

I've met and worked with a few IMO gold medalists. Even though I was used to scoring in the 99th percentile on all my tests, it felt like these people were simply in another league above me.

I'm not trying to toot my own horn. I'm definitely not that smart. But it's just ridiculous to shoot down the capabilities of these models at this point.

npinsker · 1h ago
The trouble is, getting an IMO gold medal is much easier (by frequency) than being the #1 Go player in the world, which AI achieved 10 years ago: dozens of IMO golds are awarded every year, while there is only ever one #1 Go player. I'm not sure it's enough to just gesture at the task; drilling down into precisely how it was achieved feels important.

(Not to take away from the result, which I'm really impressed by!)

demirbey05 · 2h ago
Progress is astounding. A report was published recently evaluating LLMs on IMO 2025; o3 high didn't even get bronze.

https://matharena.ai/imo/

Waiting for Terry Tao's thoughts, but this kind of thing is a good use of AI. We need to make science progress faster, rather than disrupting our economy before we're ready.

ktallett · 2h ago
Astounding in what sense? I assume you are aware of the standard of Olympiad problems, and that it is not particularly high. They are just challenging for the age range, but they shouldn't be challenging for AI, considering they aren't really anything but proofs and basic structured math problems.

Considering OpenAI can't currently analyse cutting-edge scientific issues and provide real paper sources for them, I wouldn't trust it to do actual research outside of generating matplotlib code.

saagarjha · 1h ago
I did competitive math in high school and I can confidently say that they are anything but "basic". I definitely can't solve them now (as an adult) and it's likely I never will. The same is true for most people, including people who actually pursued math in college (I didn't). I'm not going to be the next guy who unknowingly challenges a Putnam winner to do these but I will just say that it is unlikely that someone who actually understands the difficulty of these problems would say that they are not hard.

For those following along but without math specific experience: consider whether your average CS professor could solve a top competitive programming question. Not Leetcode hard, Codeforces hard.

demirbey05 · 2h ago
I mean the speed of progress; a few months ago they released o3, and it got 16 points on IMO 2025.
ktallett · 2h ago
In that regard I would agree, but to me that suggests the prior hype was unfounded.
Davidzheng · 1h ago
Sorry, but I don't think it's accurate to say "they are just challenging for the age range".
ktallett · 1h ago
I'm aware you believe they are impossible tasks unless you have specific training; I happen to disagree with that.
Davidzheng · 1h ago
Do you mean specific IMO training or general math training? The latter is certainly needed; that the former is needed is, in my opinion, a general observation about the people who make it onto the teams, for example.
ktallett · 1h ago
I mean IMO training; yes, I agree you wouldn't be able to do this without comprehensive math knowledge.
orespo · 2h ago
Definitely interesting. Two thoughts. First, are the IMO questions somewhat related to other openly available questions online, making it easier for LLMs that are more efficient and better at reasoning to deduce the results from the available content?

Second, happy to test it on open math conjectures or by attempting to reprove recent math results.

evrimoztamur · 2h ago
From what I've seen, IMO question sets are very diverse. Moreover, humans also train on all available sets of math olympiad questions and similar material. It seems fair game to have the AI train on them as well.

For 2, there's an army of independent mathematicians right now using automated theorem provers to formalise more or less all mathematics as we know it. It seems like open conjectures are chiefly bounded by a genuine lack of new tools/mathematics.

ktallett · 2h ago
You mean that previous years' questions will have been used to train it? Yes, they are the same kind of questions, and given the limited formats of math questions there are repeats, so LLMs should fundamentally be able to recognise the structure and similarities and use that.
ktallett · 2h ago
Tbh, given the way everyone has been going on about the quality of OpenAI's models, high school/early university maths problems should not have been a stretch at all. The fact that this unverified claim is only just being made suggests their AI isn't quite as amazing as marketed, especially considering that logic and rule-following should fundamentally be rather easy, and the key details of most Olympiad problems are rather easy to extract.
gametorch · 1h ago
> high school/early university maths problems should not have been a stretch at all for it

This is a ridiculous understatement of the difficulty of getting gold at the IMO.

ktallett · 1h ago
That is the level of math you need to do these problems, plus a brief understanding of what certain concepts are. There is no calculus, etc. The vast majority of IMO questions are about applying base rules to new problems.
oytis · 1m ago
It's like saying getting a gold medal in boxing is not hard because it doesn't involve any firearms.
Jcampuzano2 · 4m ago
There are entire fields of math with exceptional people trying to solve impossibly hard problems that utilize quite literally 0 calculus.

Many of them are also questions whose eventual proofs or solutions require only a very high level of understanding of basic principles. But when I say very high, I mean impossibly high for the average person, plus the ability to combine simple concepts to solve complex problems.

I'd wager the majority of Math graduates from universities would struggle to answer most IMO questions.

Davidzheng · 1h ago
You'd be surprised at how much math the people who actually get IMO gold know...
gametorch · 1h ago
Okay, let's see you try any one of the past IMOs and show us your score.

It's really hard.

See my other comment. I was voted the best at math in my entire high school by my teachers, and I completed the first two years of college math classes while still in high school. I've tried IMO problems for fun. I'm very happy if I get one right. I'd be infinitely satisfied to score a perfect 3 out of 6 problems, and that's nowhere near gold.

davidguetta · 29m ago
Wait for the Chinese version
tester756 · 2h ago
huh?

any details?

ktallett · 2h ago
It is able to solve some high school/early BSc maths problems.
Jcampuzano2 · 3m ago
Calling these high school/early BSc maths questions is an understatement lol.
littlestymaar · 2h ago
Which would be impressive if we knew those problems weren't in the training data already.

I mean it is quite impressive how language models are able to mobilize the knowledge they have been trained on, especially since they are able to retrieve information from sources that may be formatted very differently, with completely different problem statement sentences, different variable names and so on, and really operate at the conceptual level.

But we must be wary of mixing up smart information retrieval with reasoning.

ktallett · 2h ago
Considering these LLMs utilise the entirety of the internet, there will be no unique problems that come up in the Olympiad. Even over the course of a degree, you will likely have been exposed to 95% of the various ways to write problems. As you say, retrieval is really the only skill here. There is likely no reasoning.
reactordev · 13m ago
The final boss was:

   Which is greater, 9.11 or 9.9?

/s

I kid, this is actually pretty amazing!! I've noticed over the last several months that I've had to correct it less and less when dealing with advanced math topics, so this aligns.

Lionga · 2h ago
counting "R"s in strawberry now counts for a gold medal in math?
ktallett · 2h ago
The Olympiad is a great thing for children, for sure. But this is not what I feel we should be wasting AI resources on. I question whether it's even impressive.
baq · 2h ago
Velocity of AI progress in recent years is exceeded only by velocity of goalposts.
ktallett · 2h ago
The goalposts should focus on being able to make a coherent statement about a subject using papers, with sources. At this point it can't do that for any remotely cutting-edge topic. This is just a distraction.
mindwok · 1h ago
Even just 3 years ago, the idea of a computer solving previously unseen IMO problems posed in natural language would have been complete science fiction. This is astounding progress.