> AlphaGeometry and AlphaProof required experts to first translate problems from natural language into domain-specific languages, such as Lean, and vice-versa for the proofs. It also took two to three days of computation. This year, our advanced Gemini model operated end-to-end in natural language, producing rigorous mathematical proofs directly from the official problem descriptions
So, the problem wasn't translated to Lean first. But did the model use Lean, or internet search, or a calculator or Python or any other tool during its internal thinking process? OpenAI said theirs didn't, and I'm not sure if this is exactly the same claim. More clarity on this point would be nice.
I would also love to know the rough order of magnitude of the amount of computation used by both systems, measured in dollars. Being able to do it at all is of course impressive, but not useful yet if the price is outrageous. In the absence of disclosure I'm going to assume the price is, in fact, outrageous.
Edit: "No tool use, no internet access" confirmed: https://x.com/FredZhang0/status/1947364744412758305
> This year, our advanced Gemini model operated end-to-end in natural language, producing rigorous mathematical proofs directly from the official problem descriptions – all within the 4.5-hour competition time limit
simonw · 2h ago
I interpreted that bit as meaning they did not manually alter the problem statement before feeding it to the model - they gave it the exact problem text issued by IMO.
It is not clear to me from that paragraph if the model was allowed to call tools on its own or not.
jonahx · 15m ago
As a side question, do you think using tools like Lean will become a staple of these "deep reasoning" LLM flavors?
It seems that LLMs excel (relative to other paradigms) in the kind of "loose" creative thinking humans do, but are also prone to the same kinds of mistakes humans make (hallucinations, etc). Just as Lean and other formal systems can help humans find subtle errors in their own thinking, they could do the same for LLMs.
modeless · 2h ago
Yes, that quote is contained in my comment. But I don't think it unambiguously excludes tool use in the internal chain of thought.
I don't think tool use would detract from the achievement, necessarily. I'm just interested to know.
KoolKat23 · 1h ago
End to end in natural language would imply no tool use, I'd imagine. Unless it called another tool which converted it but that would be a real stretch (smoke and mirrors).
getnormality · 1h ago
We're told that formal verification tools like Lean are not used to solve the actual IMO problems, but are they used in training the model to solve the problems?
We know from Google's 2024 IMO work that they have a way to translate natural language proofs to formally verifiable ones. It seems like a natural next step would be to leverage this for RLVR in training/fine-tuning. During training, any piece of reasoning generated by the math LLM could be translated, verified, and assigned an appropriate reward, making the reward signal much denser.
Reward for a fully correct proof of a given IMO problem would still be hard to come by, but you could at least discourage the model from doing wrong or indecipherable things. That plus tons of compute might be enough to solve IMO problems.
In fact it probably would be, right? We already know from AlphaProof that by translating LLM output back and forth between formal Lean proofs, you can search the space of reasoning moves efficiently enough to solve IMO-class problems. Maybe you can cut out the middleman by teaching the LLM via RLVR to mimic formal reasoning, and that gets you roughly the same efficiency and ability to solve hard problems.
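To make that idea concrete, here is a minimal sketch of a verification-in-the-loop reward, assuming two hypothetical helpers (`translate_to_lean` for autoformalization and `lean_accepts` for a kernel check) that stand in for components neither lab has publicly described; this is speculation about how such training could work, not a description of what they actually did:

```python
# Sketch only: both stubs below are hypothetical placeholders, not real APIs.
def translate_to_lean(natural_language_proof: str) -> str:
    """Autoformalize a natural-language proof (or proof step) into Lean source."""
    raise NotImplementedError

def lean_accepts(lean_source: str) -> bool:
    """Return True if the Lean kernel type-checks the formalized proof."""
    raise NotImplementedError

def verifiable_reward(natural_language_proof: str) -> float:
    """RLVR-style reward: 1.0 if the formalized proof checks, else 0.0.

    A denser variant could formalize and score each lemma separately,
    discouraging wrong or indecipherable steps even when the full proof fails.
    """
    try:
        return 1.0 if lean_accepts(translate_to_lean(natural_language_proof)) else 0.0
    except Exception:
        return 0.0  # untranslatable output earns nothing

def rl_step(sample_proof, update_policy, problem: str, num_samples: int = 8) -> float:
    """One training step: sample candidate proofs, score them with the verifier,
    and push the policy toward high-reward samples (e.g. a PPO/GRPO-style update)."""
    candidates = [sample_proof(problem) for _ in range(num_samples)]
    rewards = [verifiable_reward(c) for c in candidates]
    update_policy(problem, candidates, rewards)
    return max(rewards)
```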
modeless · 35m ago
It seems very likely from the description in the link that formal verification tools for mathematical proofs were used in part of the RL training for this model. On the other hand, OpenAI claims "We reach this capability level not via narrow, task-specific methodology, but by breaking new ground in general-purpose reinforcement learning and test-time compute scaling." Which might suggest that they don't specifically use e.g. Lean in their training process. But it's not explicitly stated. All we can really do is speculate unless they publish more detail.
kevinventullo · 2h ago
This year, our advanced Gemini model operated end-to-end in natural language, producing rigorous mathematical proofs directly from the official problem descriptions
I think I have a minority opinion here, but I’m a bit disappointed they seem to be moving away from formal techniques. I think if you ever want to truly “automate” math or do it at machine scale, e.g. creating proofs that would amount to thousands of pages of writing, there is simply no way forward but to formalize. Otherwise, one cannot get past the bottleneck of needing a human reviewer to understand and validate the proof.
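As a toy illustration of that point, a statement like the following can be certified mechanically: once the Lean kernel accepts it, no human reviewer has to read the proof to trust it (the theorem is deliberately trivial and purely illustrative):

```lean
-- Once Lean's kernel accepts this, the proof is trusted without human review.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```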
kevinventullo · 1h ago
(Stream of consciousness aside:
That said, letting machines go wild in the depths of the consequences of some axiomatic system like ZFC may reveal a method of proof mathematicians would find to be monstrous. So like, if ZFC is inconsistent, then anything can be proven. But short of that, maybe the machines will find extremely powerful techniques which “almost” prove inconsistency that nevertheless somehow lead to logical proofs of the desired claim. I’m thinking by analogy here about how speedrunning seems to often devolve into exploiting an ACE glitch as early as possible, thus meeting the logical requirements of finishing the game while violating the spirit. Maybe we’d have to figure out what “glitchless ZFC” should mean. Maybe this is what logicians have already been doing heh).
modeless · 1h ago
These problems are designed to be solvable by humans without tools. No reason we can't give tools to the models when they go after harder problems. I think it's good to at least reproduce human-level skill without tools first.
kevinventullo · 59m ago
Oh so to be clear, I view formal methods as less of a useful tool, and more as enforcing a higher standard of proof. E.g. it’s not clear to me that having access to Lean would actually help a human in the IMO; certainly most professional mathematicians are not yet getting a positive ROI from formalization. But I’m sort of an armchair expert here; I could be wrong!
shiandow · 2h ago
Comparing the answers between OpenAI and Gemini, the writing style of Gemini is a lot clearer. It could be presented a bit better, but it's easy enough to follow the proof. This also makes it a lot shorter than the answer given by OpenAI, and it uses proper prose.
Google: https://storage.googleapis.com/deepmind-media/gemini/IMO_202...
OpenAI: https://github.com/aw31/openai-imo-2025-proofs/
Gemini is clearer but MY GOD is it verbose. E.g. look at problem 1, section 2 ("Analysis of the Core Problem") - there's nothing at all deep here, but it seems the model wants to spell out every single tiny logical step. I wonder if this is a stylistic choice or something that actually helps the model get to the end.
shiandow · 20m ago
Section 2 is a case by case analysis. Those are never pretty but perfectly normal given the problem.
With OpenAI that part takes up about 2/3 of the proof, even with its fragmented prose. I don't think it does much better.
Sol- · 15m ago
> all within the 4.5-hour competition time limit
Both OpenAI and Google pointed this out, but does that matter a lot? They could have spun up a million parallel reasoning processes to search for a proof that checks out - though of course some large amount of computation would have to be reserved for some kind of evaluator model to rank the proofs and decide which one to submit. Perhaps it was hundreds of years of GPU time.
Though of course it remains remarkable that this kind of process finds solutions at all and is even parallelizable to this degree, perhaps that is what they meant. And I also don't want to diminish the significance of the result, since in the end it doesn't matter if we get AGI with overwhelming compute or not. The human brain doesn't scale as nicely, even if it's more energy efficient.
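For intuition, the kind of parallel search being speculated about here is essentially best-of-N sampling with an evaluator. A toy sketch follows, where `generate_proof` and `score_proof` are hypothetical stand-ins for a prover model and a ranking model; nothing here reflects what either lab actually ran:

```python
from concurrent.futures import ThreadPoolExecutor

def generate_proof(problem: str, seed: int) -> str:
    """Hypothetical: one independent reasoning attempt (one sampled chain of thought)."""
    raise NotImplementedError

def score_proof(problem: str, proof: str) -> float:
    """Hypothetical: an evaluator model's confidence that the proof is correct and complete."""
    raise NotImplementedError

def best_of_n(problem: str, n: int = 1024, workers: int = 64) -> str:
    # Wall-clock time stays close to a single attempt, while total compute
    # (and therefore cost) scales roughly linearly with n.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        candidates = list(pool.map(lambda s: generate_proof(problem, s), range(n)))
    # A reserved slice of compute goes to ranking; only the top candidate is submitted.
    return max(candidates, key=lambda c: score_proof(problem, c))
```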
be7a · 3h ago
Super interesting that they moved away from their specialized, Lean-based system from last year to a more general-purpose LLM + RL approach. I would suspect this likely leads to improved performance even outside of math competitions. It’ll be fascinating to see how much further this frontier can go.
The article also suggests that the system used isn't too far ahead of their upcoming general "DeepThink" model / feature, which they announced for this summer.
tornikeo · 3h ago
I think we are having a Deep Blue vs. Kasparov moment in competitive math right now. This is major progress from just a few years ago, and yet I think we are still really far away from even a semi-respectable AI mathematician. What an exciting time to be alive!
NitpickLawyer · 2h ago
Terence Tao, in a recent podcast, said that he's very interested in "working alongside these tools". He sees the best use in the near future as "explorers of human-set vision", in a way (i.e. set some ideas/parameters and let the LLMs explore and do parallel search / proofs / etc.).
Your comparison with chess engines is pretty spot-on; that's how the best of the best chess players do prep nowadays. Gone are the multi-person expert teams that analysed positions and offered advice. They now have analysts who use supercomputers to search through bajillions of positions, extract the best ideas, and distill them to their players.
I was recently researching AIs for this; it seems it would be a huge unlock for some parts of science where this is the case too, like chess.
The most critical piece of information I couldn't find is: how many shots was this?
Could it tell by itself that the solution is correct (one-shot)? Or did it just have great math intuition and knowledge? How were the solutions validated if it was 10-100 shots?
foota · 12m ago
Let's throw these deep learning models at the classification of finite simple groups and see if they can come up with something shorter than 1000 pages.
gyrovagueGeist · 1h ago
Useful and interesting but likely still dangerous in production without connecting to formal verification tools.
I know o3 is far from state of the art these days but it's great at finding relevant literature and suggesting inequalities to consider but in actual proofs it can produce convincing looking statements that are false if you follow the details, or even just the algebra, carefully. Subtle errors like these might become harder to detect as the models get better.
sweezyjeezy · 45m ago
100% o3 has a strong bias towards "write something that looks like a formal argument that appears to answer the question" over writing something sound.
I gave it a bunch of recent, answered MathOverflow questions - graduate-level maths questions. Sometimes it would get demonstrably the wrong answer, but it would not be easy to see where it had gone wrong (e.g. some mistake in a morass of algebra). A wrong but convincing argument is the last thing you want!
neom · 1h ago
How much of a big deal is this stuff? I was blessed with dyscalculia so I can hardly add two numbers together, and I don't pay much attention to the mathematics world, but my reading indicates this is extremely difficult / most humans cannot do this?
What comes next for this particular exercise? Thank you.
gundmc · 3h ago
I assume that an "advanced version" of Gemini Deepthink means it was a different model or had significantly more test time compute than the upcoming Deepthink in the Gemini Ultra subscription. Still cool to see OpenAI and Google neck and neck.
vouaobrasil · 30m ago
This is making mathematics too systematic and mechanical, and it kills the joy of it....
lufenialif2 · 3h ago
Still no information on the amount of compute needed; would be interested to see a breakdown from Google or OpenAI on what it took to achieve this feat.
Something that was hotly debated in the thread with OpenAI's results:
"We also provided Gemini with access to a curated corpus of high-quality solutions to mathematics problems, and added some general hints and tips on how to approach IMO problems to its instructions."
it seems that the answer to whether or not a general model could perform such a feat is that the models were trained specifically on IMO problems, which is what a number of folks expected.
Doesn't diminish the result, but doesn't seem too different from classical ML techniques if quality of data in = quality of data out.
dvh · 2h ago
OK, but when reported by mass media, which never uses SI units and instead uses units like Libraries of Congress or elephants, what kind of unit should the media use to compare the computational energy of AI vs children?
rfurmani · 2h ago
Dollars of compute at market rate is what I'd like to see, to check whether calling this tool would cost $100 or $100,000
gus_massa · 2h ago
4.5 hours × 2 "days",
100 watts including the support system.
I'm not sure how to implement the "no calculator" rule :) but for this kind of problem it's not critical.
Total = 900 Wh = 3.24 MJ
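A quick check of that arithmetic (the 100 W figure is the parent comment's assumption for a contestant plus "support system"):

```python
hours = 4.5 * 2              # two exam days, 4.5 hours each
power_watts = 100            # assumed average draw, including "support system"
energy_wh = power_watts * hours
energy_mj = energy_wh * 3600 / 1e6
print(energy_wh, energy_mj)  # 900.0 Wh, 3.24 MJ
```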
lufenialif2 · 2h ago
Convert libraries, elephants, etc into SI of course! Otherwise, they aren't really comparable...
dortlick · 2h ago
Kilocalories. A unit of energy that equals 4184 Joules.
thrance · 2h ago
If the models that got a gold medal are anything like those used on ARC-AGI, then you can bet they wrote an insane amount of text trying to reason their ways through these problems. Like, several bookshelves worth of writings.
So funnily enough, "the AI wrote x times the library of Congress to get there" is good enough of a comparison.
pfortuny · 2h ago
They can train it on “Crux Mathematicorum” and similar journals, which are collections of “interesting” problems and their solutions.
https://cms.math.ca/publications/crux
Some unofficial comparison with costs of public models (performing worse): https://matharena.ai/imo/
So the real cost is something much more.
vonneumannstan · 3h ago
>it seems that the answer to whether or not a general model could perform such a feat is that the models were trained specifically on IMO problems, which is what a number of folks expected.
Not sure that's exactly what that means. It's already likely the case that these models contained IMO problems and solutions from pretraining. It's possible this means they were present in the system prompt or something similar.
AlotOfReading · 3h ago
Does the IMO reuse problems? My understanding is that new problems are submitted each year and 6 are selected for each competition. The submitted problems are then published after the IMO has concluded. How would the training data contain unpublished, newly submitted problems?
Obviously the training data contained similar problems, because that's what every IMO participant already studies. It seems unlikely that they had access to the same problems though.
AlanYx · 3h ago
IMO doesn't reuse problems, but Terence Tao has a Mastodon post where he explains that the first five (of six) problems are generally ones where existing techniques can be leveraged to get to the answer. The sixth problem requires considerable originality. Notably, both Gemini and OpenAI's model didn't get the sixth problem. Still quite an achievement though.
apayan · 2h ago
Do you have another source for that? I checked his Mastodon feed and don't see any mention of the source of the questions from the IMO.
https://mathstodon.xyz/@tao
Strange statement - it's not true in general, for sure (3 & 6 are typically the hardest, but they certainly aren't fundamentally of a different nature from the other questions). This year P6 did seem to be by far the hardest, but this post-hoc statement should be read cautiously.
vonneumannstan · 1h ago
>How would the training data contain unpublished, newly submitted problems?
I don't think I or op suggested it did.
sottol · 3h ago
Or that they did significant retraining to boost IMO performance creating a more specialized model at the cost of general-purpose performance.
Workaccount2 · 1h ago
If nothing else, I'd imagine these tools will allow mathematicians to automate a lot of busy work that exists between ideas and validation.
brokensegue · 2h ago
Based on the score, it looks like they also couldn't answer question 6? Has that been confirmed?
Surprising, since Reid Barton was working on a Lean system.
dist-epoch · 2h ago
The bitter lesson.
tough · 1h ago
Well, Lean systems might still be useful for other stuff than max benching.
My point being that transformers and LLMs have the tailwind of all the infra + lateral discoveries/improvements being put into them.
Does that mean they're the one tool to unlock machine intelligence? I dunno.
deanCommie · 3h ago
From Terence Tao, via mastodon [0]:
> It is tempting to view the capability of current AI technology as a singular quantity: either a given task X is within the ability of current tools, or it is not. However, there is in fact a very wide spread in capability (several orders of magnitude) depending on what resources and assistance one gives the tool, and how one reports their results.
> One can illustrate this with a human metaphor. I will use the recently concluded International Mathematical Olympiad (IMO) as an example. Here, the format is that each country fields a team of six human contestants (high school students), led by a team leader (often a professional mathematician). Over the course of two days, each contestant is given four and a half hours on each day to solve three difficult mathematical problems, given only pen and paper. No communication between contestants (or with the team leader) during this period is permitted, although the contestants can ask the invigilators for clarification on the wording of the problems. The team leader advocates for the students in front of the IMO jury during the grading process, but is not involved in the IMO examination directly.
> The IMO is widely regarded as a highly selective measure of mathematical achievement for a high school student to be able to score well enough to receive a medal, particularly a gold medal or a perfect score; this year the threshold for the gold was 35/42, which corresponds to answering five of the six questions perfectly. Even answering one question perfectly merits an "honorable mention".
> But consider what happens to the difficulty level of the Olympiad if we alter the format in various ways:
* One gives the students several days to complete each question, rather than four and a half hours for three questions. (To stretch the metaphor somewhat, consider a sci-fi scenario in which the student is still only given four and a half hours, but the team leader places the students in some sort of expensive and energy-intensive time acceleration machine in which months or even years of time pass for the students during this period.)
* Before the exam starts, the team leader rewrites the questions in a format that the students find easier to work with.
* The team leader gives the students unlimited access to calculators, computer algebra packages, formal proof assistants, textbooks, or the ability to search the internet.
* The team leader has the six student team work on the same problem simultaneously, communicating with each other on their partial progress and reported dead ends.
* The team leader gives the students prompts in the direction of favorable approaches, and intervenes if one of the students is spending too much time on a direction that they know to be unlikely to succeed.
* Each of the six students on the team submit solutions, but the team leader selects only the "best" solution to submit to the competition, discarding the rest.
* If none of the students on the team obtains a satisfactory solution, the team leader does not submit any solution at all, and silently withdraws from the competition without their participation ever being noted.
> In each of these formats, the submitted solutions are still technically generated by the high school contestants, rather than the team leader. However, the reported success rate of the students on the competition can be dramatically affected by such changes of format; a student or team of students who might not even reach bronze medal performance if taking the competition under standard test conditions might instead reach gold medal performance under some of the modified formats indicated above.
> So, in the absence of a controlled test methodology that was not self-selected by the competing teams, one should be wary of making apples-to-apples comparisons between the performance of various AI models on competitions such as the IMO, or between such models and the human contestants.
> Related to this, I will not be commenting on any self-reported AI competition performance results for which the methodology was not disclosed in advance of the competition. EDIT: In particular, the above comments are not specific to any single result of this nature.
Unlike OpenAI, DeepMind at least signed up for the competition ahead of time.
I agree with Tao though; I am skeptical of any result of this type unless there's a lot of transparency, ideally ahead of time. If not ahead of time, then at least the entire prompt and fine-tune data that was used.
[0] https://mathstodon.xyz/@tao/114881418225852441
A human metaphor for evaluating AI capability - https://news.ycombinator.com/item?id=44622973 - July 2025 (30 comments)
This is a fair reply, but TBH I don't think it's going to change much. The upper echelon of human society has decided to move AI forward rapidly regardless of any consequences. The rest of us can only hold on and pray.
justanotherjoe · 1h ago
You are watching american money hard at work, my friend. It's either glorious or reckless, hard to tell for now.
ferguess_k · 1h ago
Could be both, but each for a different group of people.
scotty79 · 2h ago
Some of the critique is valid but some of it sounds like, "but the rules of the contest are that participants must use less than x joules of energy obtained from cellular respiration and have a singular consciousness"
I don't think anybody thinks AI was competing fair and within the rules that apply to humans. But if the humans were competing on the terms that AI solved those problems on, near-unlimited access to energy, raw compute and data, still very few humans could solve those problems within a reasonable timeframe. It would take me probably months or years to educate myself sufficiently to even have a chance.
earthicus · 17m ago
I don't think that characterization is fair at all. It's certainly true that you, me, and most humans can't solve these problems with any amount of time or energy. But the problems are specifically written to be at the limit of what the actual high school students who participate can solve in four hours. Letting the actual students taking the test have four days instead of four hours would make a massive difference in their ability to solve them.
Said differently, the students, difficulty of the problems, and time limit are specifically coordinated together, so the amount of joules of energy used to produce a solution is not arbitrary. In the grand scheme of how the tech will improve over time, it seems likely that doesn't matter and the computers will win by any metric soon enough, but Tao is completely correct to point out that you haven't accurately told us what the machines can do today, in July 2025, without telling us ahead of time exactly what rules you are modifying.
jeffbee · 3h ago
Advanced Gemini, not Gemini Advanced. Thanks, Google. Maybe they should have named it MathBard.
thewebguyd · 3h ago
Google and Microsoft continuing to prove that the hardest problem in programming is naming things.
nashashmi · 2h ago
There are only so many words in the modern English language to hint at "next upgrade": pro, plus, ultra, new, advanced, magna, X, Z, and Ultimate. Even fewer words to explain miniaturized: mini, lite, and zero. And marketers are trying to seesaw on known words without creating new ones to explain new tech. This is why we have Bard and Gemini and Chat and Copilot.
Taking a step back, it is overused exaggeration to the point where words run out quick and newer tech needs to fight with existing words for dominance. Copilot should be the name of an AI agent. Bard should have been just a text generator. Gemini is the name of a liar. Chat is probably the iphone of naming but the GPT suffix says Creativity had not come to work that day.
SirMaster · 2h ago
Good thing we have a system called numbers that can easily designate an infinite range of greater and greater things.
NitpickLawyer · 26m ago
Heh, you'd still get confusing stuff. Wait, is gemini 2.5.3.1 the math one or the erotica lit one?
antonvs · 2h ago
There are two different versions of that hard problem: the computer science version, and the marketing version. The marketing one has more nebulous acceptance criteria though.
seydor · 2h ago
They will have to rename gemini anyway, since it's doubtful they will ever be able to buy gemini.com. Gemini turboflex plus pro SE plaid interstellar 4.0
catigula · 2h ago
What makes you think google can't buy that domain?
It's not a new product/model with that name, they're just saying it's an advanced version of Gemini that's not public atm
akomtu · 2h ago
Those who keep their identity in their intelligence are heading into the rough seas once the proto-AI becomes real-AI in the coming years. What's the value of your smart thoughts if AI on your smartwatch can do it better, faster and cheaper?
Also, how is AI going to change a society ruled by competitiveness, where the winner takes all? You may not want to replace your thinking with AI, but your colleagues will. Their smartwatches or smartglasses will outcompete you with ease and your boss will tell you one day that the company doesn't need you anymore.
Think of it again. Today advertisers fight each other with their ad budgets: those who spend more, get more attention and win. Tomorrow everyone will need a monthly subscription to AI for it will be the price of staying competitive, relevant and employed.
ekojs · 3h ago
> Btw as an aside, we didn’t announce on Friday because we respected the IMO Board's original request that all AI labs share their results only after the official results had been verified by independent experts & the students had rightly received the acclamation they deserved
> We've now been given permission to share our results and are pleased to have been part of the inaugural cohort to have our model results officially graded and certified by IMO coordinators and experts, receiving the first official gold-level performance grading for an AI system!
From https://x.com/demishassabis/status/1947337620226240803
Was OpenAI simply not coordinating with the IMO Board then?
thegeomaster · 2h ago
Yes, there have been multiple (very big) hints dropped by various people that they had no official cooperation.
osti · 2h ago
I think this is them not being confident enough before the event, so they didn't want to risk showing a worse result than competitors. By staying private, they could simply not publish anything if it didn't work out.
ml-anon · 2h ago
As not-so-subtly hinted at by Terry Tao.
It's a great way to do PR, but it's a garbage way to do science.
osti · 1h ago
True, but OpenAI definitely isn't trying to do public research on science; they are all about money now.
ml-anon · 1h ago
That's not a contentious statement. It's still a pathetic way to behave at a kids' competition, no less.
esafak · 1h ago
They shot themselves in the foot by not showing the confidence that Google did.
pphysch · 2h ago
This reminds me of when OpenAI made a splash (ages ago now) by beating the world's best Dota 2 teams using a RL model.
...Except they had to substantially bend the rules of the game (limiting the hero pool, completely changing/omitting certain mechanics) to pull this off. So they ended up beating some human Dota pros at a pseudo-Dota custom game, which was still impressive, but a very much watered-down result beneath the marketing hype.
It does seem like Money+Attention outweigh Science+Transparency at OpenAI, and this has always been the case.
NitpickLawyer · 24m ago
Limiting the hero pool was fair I'd say. If you can prove RL works on one hero, it's fairly certain it would work on other heroes. All of them at once? Maybe run into problems. But anyway you'd need orders of magnitude more compute so I'd say that was fair game.
dmitrygr · 2h ago
> Was OpenAI simply not coordinating with the IMO Board then?
You are still surprised by sama@'s asinineness? You must be new here.
antonvs · 2h ago
When your goal is to control as much of the world's money as possible, preferably all of it, then everyone is your enemy, including high school students.
dmitrygr · 2h ago
How dare those high school students use their brains to compete with ChatGPT and deny the shareholders their value?
MisterPea · 2h ago
I am still surprised many people trust him. The board's (justified) decision to fire him was so awfully executed that it led to him having even more slack.
- "According to a friend, the IMO asked AI companies not to steal the spotlight from kids and to wait a week after the closing ceremony to announce results. OpenAI announced the results BEFORE the closing ceremony.
According to a Coordinator on Problem 6, the one problem OpenAI couldn't solve, "the general sense of the IMO Jury and Coordinators is that it was rude and inappropriate" for OpenAI to do this.
OpenAI wasn't one of the AI companies that cooperated with the IMO on testing their models, so unlike the likely upcoming Google DeepMind results, we can't even be sure OpenAI's "gold medal" is legit. Still, the IMO organizers directly asked OpenAI not to announce their results immediately after the olympiad.
Sadly, OpenAI desires hype and clout a lot more than it cares about letting these incredibly smart kids celebrate their achievement, and so they announced the results yesterday."
https://x.com/mihonarium/status/1946880931723194389
sherburt3 · 2h ago
What a great metaphor for AI. Taking an event that is a celebration of high school kids' knowledge and abilities and turning it into a marketing stunt for their frankenstein monster that they are building to make all the kids' hard work worth nothing.
ml-anon · 2h ago
Not only that: by not officially entering, they had no obligation to announce their result, so if they didn't achieve a gold-medal score they presumably wouldn't have made any announcement and no one would have been the wiser.
This cowardly bullshit followed by the grandstanding on Twitter is high-school bully behaviour.
gowld · 57s ago
If they failed and remained quiet, then everyone would know that the other companies performed well and they didn't even qualify.
echelon · 2h ago
Google did the correct and respectful thing.
It appears that OpenAI didn't officially enter (whereas Google did), that they knew Google was going to gold medal, and that they released their news ahead of time (disrespecting the kids and organizers) so they could scoop Google.
Really scummy on OpenAI's part.
The IMO closing ceremony was July 19. OpenAI announced on the same day.
IMO requested the tech companies to wait until the following week.
They requested a week after?
He claims nobody made that request to OpenAI. It was a request made to Google and others who were being judged by the actual contest judges, which OpenAI was not.
No comments yet
ml-anon · 2h ago
This is weaselly bullshit from Brown.
mike_d · 2h ago
Proposed question for next IMO: "Show a proof that 'after the closing ceremony' and 'one week later' are not the same unit of time"
lvl155 · 3h ago
Seems OpenAI knew this was forthcoming, so they front-ran the news? I was really high on Gemini 2.5 Pro after release, but I kept going back to o3 for anything I cared about.
sigmoid10 · 3h ago
>I was really high on Gemini 2.5 Pro after release but I kept going back to o3 for anything I cared about
Same here. I was impressed by their benchmarks and topping most leaderboards, but in day to day use they still feel so far behind.
aerhardt · 1h ago
I use o3, the OpenAI API, and Claude Code. Genuinely curious: what about Gemini 2.5 is so far behind?
danjl · 3h ago
I think that's most likely just your view, and not really based on evidence.
No comments yet
erichocean · 2h ago
I regularly have the opposite experience: o3 is almost unusable, and Gemini 2.5 Pro is reliably great. Claude Opus 4 is a close second.
o3 is so bad it makes me wonder if I'm being served a different model? My o3 responses are so truncated and simplified as to be useless. Maybe my problems aren't a good fit, but whatever it is: o3 output isn't useful.
helloplanets · 2h ago
Are you using a tool other than ChatGPT? If so, check the full prompt that's being sent. It can sometimes kneecap the model.
Tools having slightly unsuitable built in prompts/context sometimes lead to the models saying weird stuff out of the blue, instead of it actually being a 'baked in' behavior of the model itself. Seen this happen for both Gemini 2.5 Pro and o3.
Davidzheng · 2h ago
I have this distinctive feeling that o3 tries to trick me intentionally when it can't solve a problem by cleverly hiding its mistakes. But I could be imagining it
int_19h · 1h ago
It's certainly the "laziest" model, in the sense that it seems to be the likeliest to avoid doing the actual work and generate "TBD" stubs instead.
square_usual · 2h ago
Are you using o3 on the official ChatGPT app or via API? I use it on the app and it performs very well, it's my go-to model for general purpose LLM use.
nomad_horse · 3h ago
Do I understand it correctly that OpenAI self-proclaimed that they got their gold, without the official IMO judges grading their solutions?
NitpickLawyer · 3h ago
Well, I don't doubt that they did get those results, but it is clear now that it was not an official collaboration. It was heavily implied in a statement by IMO's president a few days ago (the one where they said they'd prefer AI companies wait a week before announcing, so that the focus is first on the human competitors).
Google had an official collaboration with the IMO, and we can be sure they got those results under the imposed constraints (last year they allocated ~48h for silver, IIRC) and an official grading by the IMO graders.
https://x.com/alexwei_/status/1946477754372985146
> 6/N In our evaluation, the model solved 5 of the 6 problems on the 2025 IMO. For each problem, three former IMO medalists independently graded the model’s submitted proof, with scores finalized after unanimous consensus. The model earned 35/42 points in total, enough for gold!
That means Google DeepMind is the first OFFICIAL IMO Gold.
https://x.com/demishassabis/status/1947337620226240803
> We've now been given permission to share our results and are pleased to have been part of the inaugural cohort to have our model results officially graded and certified by IMO coordinators and experts, receiving the first official gold-level performance grading for an AI system!
nomad_horse · 2h ago
Do you know if OpenAI used the same grading criteria as official judges?
tshadley · 1h ago
As IMO medalists they would be expected to I'm sure.
But this can be verified because the results are public:
https://github.com/aw31/openai-imo-2025-proofs/
Childish. And, of course they must have known there was an official LLM cohort taking the real test, and they probably even knew that Gemini got a gold medal, and may have even known that Google planned a press release for today.
cma · 2h ago
I think maybe all Altman companies have used tactics like this.
> We were trying to get a big client for weeks, and they said no and went with a competitor. The competitor already had a terms sheet from the company were we trying to sign up. It was real serious.
> We were devastated, but we decided to fly down and sit in their lobby until they would meet with us. So they finally let us talk to them after most of the day.
> We then had a few more meetings, and the company wanted to come visit our offices so they could make sure we were a 'real' company. At that time, we were only 5 guys. So we hired a bunch of our college friends to 'work' for us for the day so we could look larger than we actually were. It worked, and we got the contract.
> I think the reason why PG respects Sam so much is he is charismatic, resourceful, and just overall seems like a genuine person.
https://news.ycombinator.com/item?id=3048944
So, a more charismatic version of Zuck is Zucking, what a surprise. Company culture starts at its origin. Despite Google's corruption, its origin is in academia and it shows even now.
gametorch · 3h ago
(deleted because I was mistaken)
nomad_horse · 3h ago
Isn't he an IOI medalist? And even if he was an IMO medalist, isn't there a bit of a conflict of interest?
cbsmith · 3h ago
Just a wee bit.
cbsmith · 3h ago
I'm pretty sure when they got the gold medal they weren't allowed to judge themselves.
guluarte · 2h ago
Looks like models can solve slightly modified problems and extrapolate from their training data, amazing! /s
I will be surprised when a model with only the knowledge of a college student can solve these problems.
jtlicardo · 1h ago
At this point, I wonder how long software engineers will keep convincing themselves they’re irreplaceable