IQ Tests Results for AI

43 starred · 69 comments · 8/17/2025, 9:36:22 AM · trackingai.org ↗

Comments (69)

gpt5 · 1h ago
The way human IQ testing developed is that researchers noticed people who excel in one cognitive task tend to do well in others - the “positive manifold.”

They then hypothesized a general factor, “g,” to explain this pattern. Early tests (e.g., Binet–Simon; later Stanford–Binet and Wechsler) sampled a wide range of tasks, and researchers used correlations and factor analysis to extract the common component, then normed it around 100 with an SD of 15 and called it IQ.
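
That norming step is mechanically simple. A minimal sketch (the raw scores below are made up for illustration; `raw_to_iq` is a hypothetical helper, not from any real test battery):

```python
import statistics

norm_sample = [31, 42, 38, 45, 29, 36, 40, 33, 44, 37]  # illustrative raw scores
mu = statistics.mean(norm_sample)
sigma = statistics.stdev(norm_sample)

def raw_to_iq(raw: float) -> float:
    """Map a raw score to the IQ scale (mean 100, SD 15) via the norming sample."""
    return 100 + 15 * (raw - mu) / sigma

print(round(raw_to_iq(mu)))  # the mean raw score maps to 100 by construction
```

By the same construction, a raw score one standard deviation above the norming mean maps to 115.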

IQ tends to meaningfully predict performance across some domains, especially education and work, and shows high test–retest stability from late adolescence through adulthood. Scores also tend to be consistent between high-quality tests, despite a wide variety of testing methods.

It looks like this site just uses human-rated public IQ tests. It would have been more interesting if an IQ test were developed specifically for AI, i.e. a test that aims to factor out the strength of a model's general cognitive ability across a wide variety of tasks. That is probably doable by running principal component analysis on the large set of benchmarks available today.
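
As a sketch of that idea: extract a "g-like" first component from a models × benchmarks score matrix. The data here are simulated (a single latent ability plus benchmark-specific noise); real input would be standardized scores scraped from public leaderboards:

```python
import numpy as np

rng = np.random.default_rng(0)
n_models, n_benchmarks = 50, 8

# Simulate a latent general ability plus benchmark-specific noise.
g = rng.normal(size=(n_models, 1))
loadings = rng.uniform(0.6, 0.9, size=(1, n_benchmarks))
scores = g @ loadings + 0.4 * rng.normal(size=(n_models, n_benchmarks))

# Standardize columns, then take the first principal component as "g".
z = (scores - scores.mean(0)) / scores.std(0)
u, s, vt = np.linalg.svd(z, full_matrices=False)
explained = s**2 / np.sum(s**2)
g_estimate = z @ vt[0]  # each model's score on the first component

print(f"variance explained by first component: {explained[0]:.2f}")
```

If a single factor dominates real benchmark data the way it does here, the first component would be a reasonable "AI g"; if it explains little, that itself would be an interesting result.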

alphazard · 41m ago
IQ is a discovery about how intelligence occurs in humans. As you mentioned, a single factor explains most of a human's performance on an IQ test, and that model is better than theories of multiple orthogonal intelligences. By contrast, five orthogonal factors are the best model we have for human personality.

The first question to ask is "do LLMs also have a general factor?". How much of an LLM's performance on an IQ test can be explained by a single positive correlation between all questions? I would expect LLMs to perform much better on memory tasks than anything else, and I wouldn't be surprised if that was holding up their scores. Is there a multi-factor model that better explains LLM performance on these tests?
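
One way to probe that, sketched with simulated data standing in for per-question LLM accuracies: if two orthogonal abilities (say, memory vs. everything else) drive performance, the first eigenvalue of the question correlation matrix explains far less than the first two together:

```python
import numpy as np

rng = np.random.default_rng(1)
n_runs, n_questions = 200, 12

# Two latent abilities instead of one, each loading half the questions.
memory = rng.normal(size=(n_runs, 1))
spatial = rng.normal(size=(n_runs, 1))
half = n_questions // 2
data = np.hstack([
    memory @ rng.uniform(0.7, 0.9, (1, half)),
    spatial @ rng.uniform(0.7, 0.9, (1, half)),
]) + 0.3 * rng.normal(size=(n_runs, n_questions))

# Eigenvalues of the correlation matrix, largest first.
corr = np.corrcoef(data, rowvar=False)
eigvals = np.sort(np.linalg.eigvalsh(corr))[::-1]
print(f"first factor: {eigvals[0]/n_questions:.2f}, "
      f"first two: {eigvals[:2].sum()/n_questions:.2f}")
```

Running the same decomposition on real per-question results would show whether one factor suffices for LLMs (as it roughly does for humans) or whether a multi-factor structure fits better.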

naveen99 · 15m ago
Some points in the 4- or 5-dimensional personality space correlate with higher IQ, though.
nsoonhui · 1h ago
Another component of this theory concerning g is that it's largely genetic and immune to "intervention", i.e. the stability you mentioned. See the classic "The Bell Curve" for a full exposition.

Which makes me wonder: what's the point of all the intervention in the form of teaching/parenting styles and whatnot, if the g factor is nature and largely immutable? What's the logic of the educators here?

cedilla · 1h ago
"The Bell Curve" is, let's say, highly controversial and not a good introduction to the topic. Its claim that genetics are the main predictor of IQ, which was very weakly supported at the time, has been completely and undeniably refuted by science in the thirty years since its publication.
alphazard · 30m ago
This is misleading. Anyone who wants to learn about IQ should Google it. It's the most replicated finding in psychology, and any questions you have about twins or groups with similar or different genes have probably been investigated. There is a lot of noise online in the form of commentary about IQ, so it's important to look at actual data if you are skeptical/curious.
nialse · 49m ago
Do note that The Bell Curve is not considered controversial in general. The part about race and genetics is. Also genes being the sole predictor of IQ is not an accurate description of the book’s premise.
hemabe · 26m ago
And yet, in the US, the first start-ups are offering the possibility of testing embryos for their IQ.

https://www.theguardian.com/science/2024/oct/18/us-startup-c...

brabel · 1h ago
Really? If not genetics then what is it? Just random??
hemabe · 16m ago
IQ is largely genetic, even if some people claim otherwise. The evidence for this is now overwhelming: even when different ethnic groups grow up under very similar conditions in the same country, measured PISA scores (PISA correlates at r=0.9 with IQ) vary greatly. For example, among second-generation children in Germany there are significant differences in PISA scores: Polish children achieve similar or even better scores than German children, while Turkish children remain at the same poor level that their parents (the first generation of immigrants) achieved in the tests.

Twin studies and studies of adopted children also leave no doubt that there is a very strong genetic component that determines IQ. Even Wikipedia assumes that heritability can be as high as 80%.

Links:
https://en.wikipedia.org/wiki/Heritability_of_IQ
https://www.welt.de/wirtschaft/article174706968/OECD-Studie-...

qayxc · 58m ago
The brain has pretty high plasticity. A large host of factors contribute to the final outcome, from mental stimulation to training to overall health, stress (both physical and mental), and nutrition.

It has been shown that IQ scores improve significantly just by taking them multiple times (training) [1]. They also vary if the tested person is sleep deprived, sick, or stressed.

[1] https://pmc.ncbi.nlm.nih.gov/articles/PMC7709590

nialse · 39m ago
You do have to consider that g is the amount of plasticity, though, which is mainly genetic. A better way to think of it is that genetics provides a potential capacity which may or may not be fulfilled. Training helps individuals to varying degrees.
pona-a · 1h ago
Education? Or more directly, socio-economics.

Many of the subjects tested never had any experience with this kind of formal testing, had little to no education, and predictably failed on several abstract tasks. It might be that the very pattern of sitting down and intensely focusing on apparently meaningless problems isn't as innate as expected.

gus_massa · 56m ago
Nutrition: just remove a few vitamins and watch the IQ drop like a stone. Even a "balanced" diet with 500kcal/day will be harmful.

Education: in spite of the claims, a good education raises the IQ measurement. The tests leak, and schools add similar tasks.

nialse · 36m ago
Counterpoint: don't nutrition and education help some people more than others? That's the g factor, which is mainly genetic.
smokel · 1h ago
If it isn't nature, then it probably is nurture. Averaged over the entire population, that is indeed mostly random.
mdp2021 · 1h ago
Personal development. It's a "subtle" skill.
lukan · 1h ago
Assuming it is true (which I doubt), there is obviously still value in teaching: imparting knowledge and practical skills, not producing more intelligent students.

You can have an IQ of over 200, but if no one ever showed you how a computer works or gave you a manual, you still won't be productive with it.

But I very much believe intelligence is improvable and also degradable - just ask some alcoholics, for instance.

jdietrich · 1h ago
G is (largely) immutable, but knowledge and skills are not. The economy is not zero-sum and we all benefit from increasing the total amount of human capital. Unfortunately, thinking around education is dominated by people who wrongly believe that the economy is zero-sum.
depressedpanda · 1h ago
If a child is, e.g., two standard deviations below the norm, it is cruel to expect it to keep up with the pace of other students.

Education can be better adapted to the child's needs.

pbmonster · 1h ago
Isn't this argument directly countered by the fact that you can study for IQ tests and subsequently do better?
kingkawn · 31m ago
Yes but they need a bullshit gold star to make themselves feel special
YetAnotherNick · 1h ago
The ARC-AGI challenge aims for that. In fact, the objective is even stricter: the tasks must be trivial for most humans, given time.
krapp · 1h ago
I imagine the value of something like this is for business owners choosing which LLMs they can replace their employees with, so its use of human IQ tests is relevant.
azernik · 1h ago
The point is that the correlation between doing well on these tasks and doing well on other (directly useful) tasks is well established for humans, but not well established for LLMs.

If the employees' job is taking IQ tests, then this is a great measure for employers. Otherwise, it doesn't measure anything useful.

bbarnett · 1h ago
> Otherwise, it doesn't measure anything useful.

Oh it measures a useful metric, absolutely, as aspects of an IQ test validate certain types of cognition. Those types of cognition have been found to map to real-world employment of the same.

If an AI is incapable of performing admirably on an IQ test for those types of cognition, then one thing we're certainly measuring is that it can't handle that 'class' of cognition when the conditions change in minuscule ways.

And that's quite important.

For example, if the model appears to perform specific work tasks well, related to a class of cognition, but cannot do the same category of cognitive tasks outside that scope, then we're measuring a lack of adaptability, or of true cognitive capability.

It's definitely measuring something. Such as, will the model go sideways with small deviations on task or input? That's a nice start.

sigmoid10 · 1h ago
Big caveat here:

This website's method doesn't work for humans the way it works for LLMs. For humans, there is a strict time limit on these IQ tests (at least in officially recognised settings like Mensa). This kind of sequence completion is mostly a question of how fast your brain can iterate on problems: being able to solve more questions within the time limit means a higher score, because your brain essentially switches faster. But the LLMs are given all the time in the world, in parallel, to see how many questions they can solve at all. If you look at the examples, you'll see some high-end models struggling with some of the first questions, which most humans would normally get easily; only the later ones, where you really have to think through multiple options, get hard. So a 100 IQ LLM here is not technically more intelligent at IQ test questions than 50% of humans.

If anything, this shows that some LLMs might win against humans because they can spend more time thinking per wall clock time interval thanks to the underlying hardware. Not because they are fundamentally smarter.

mdp2021 · 1h ago
But when an LLM fails despite having all the time in the world, you are pretty certain you've hit a wall.

So, in a way, you have defined a good indicator of a limit in a certain area.

sigmoid10 · 1h ago
There is not enough sampling here to reach this conclusion. Remember, you can crank things like o3 pretty high on tasks like ARC AGI if you're willing to spend thousands of dollars on inference time compute. But that's obviously not in the budget for an enthusiast site like this.
mdp2021 · 1h ago
Sure but, you wrote:

> If anything, this shows that some LLMs might win against humans because they can spend more time thinking per wall clock time interval thanks to the underlying hardware. Not because they are fundamentally smarter.

You interpreted "smarter" the IQ way: results constrained by time. But we actually get an indicator of whether the LLM can, given time, reach the result at all - and that is the interpretation of "smarter" that many of us need.

(Of course, it remains to be seen whether the ability to achieve those contextual results exports as an ability relevant to the solutions we actually need.)

sigmoid10 · 1h ago
No, you misunderstood. I'm saying that for reasoning models, there is a lot of untapped capability in this test. I'm not sure there are hard limits; given enough compute, you'll probably find that a modern high-end model will reach 100%. But you probably don't want to spend thousands (or perhaps tens of thousands) of dollars on that. There are much better tests out there if you have money to burn and want to find true hard limits compared to humans.
d4rkn0d3z · 27m ago
Human beings' IQ test results can vary significantly based on how much money is in their pockets. For example, if a farmer takes an IQ test before crops are harvested and sold, they score lower than after the crops are sold, in the same year.

It seems fairly obvious to me that an LLM is the projection of intelligence onto the language domain. In other words, if you killed Intelligence and gave it a push in the direction of language, the chalk outline you could draw around its dead body on the ground would be an LLM.

Full disclosure: I have taken two IQ tests, both online and timed. The first was in the late 90s, after I graduated in electronics engineering; it was free, and I scored 149. Four years later, after obtaining a theoretical physics degree, I did another, scoring 169. The second test was not free, but I did not pay: the site owner personally emailed me my results, with congratulations, because they were the highest ever recorded on the site to date. I did both for fun, just to see the questions. I think both results are meaningless; the same variability occurs in the farmers studied, as mentioned above.

mutkach · 1h ago
Judging from the reasoning trace for the problem of the day, almost all of the models obviously had some IQ-test material in their training data, or at least are biased in a beneficial way. From the beginning of the trace you can see that the model had already "figured it out"; the reasoning is done only to apply basic arithmetic.

None of the models actually "reasoned" about what the problem could possibly be. None of them considered that more intricate patterns are possible in a 3x3 grid (having taken these kinds of tests earlier in life, I still had a few seconds of indecision, wondering whether this was the same kind of test I'd seen before or some more elaborate one), and none of them tried solving the problem column-wise (which is still possible, by the way). Personally, I think that indicates a strong bias in the pretraining. For what it's worth, I would consider a model that came up with at least a few different interpretations of the pattern while "reasoning" to be the most intelligent one, irrespective of the correctness of the answer.

cateye · 1h ago
Isn’t giving LLMs “IQ scores” a category error?

Human IQ is norm-referenced psychometrics under embodied noise. Calling both “IQ” isn’t harmless, it invites bad policy and building decisions on a false equivalence. Don’t promote it.

jonplackett · 2h ago
Really need to use a CDN before you get #1 on HN
charles_f · 1h ago
I'm on a shared hosting instance with relatively low resource allocation but reasonable bandwidth, and I've made #1 several times without ever having issues loading. As long as your content is static and doesn't generate load on your server, you should be fine serving a lot of concurrent requests. Issues start when serving content relies on a database, or when you serve large files.
habibur · 1h ago
Caching is the solution. Don't serve dynamic content without HTML caching.
stared · 1h ago
It was not my intention to bring the HN hug of death.

(For reference: I shared the link, I am not the author.)

diggan · 1h ago
I mean, not really. As always, you just need to make sure you're not doing tens of dynamic calls on each page load, and if you do, add at least a minute-long cache. Most of the stuff that gets hugged to death really shouldn't be; most of the time it's just static content that is trivial to host on even a $10/month instance.
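
A minute-long cache like that is a few lines in most stacks. A minimal Python sketch (`render_page` and the TTL value are illustrative; a real deployment would more likely cache at the reverse proxy or CDN):

```python
import functools
import time

def ttl_cache(seconds=60):
    """Memoize a zero-argument render function for `seconds` (minimal sketch)."""
    def wrap(fn):
        state = {"value": None, "expires": 0.0}

        @functools.wraps(fn)
        def inner():
            now = time.monotonic()
            if now >= state["expires"]:
                # Cache miss or expired entry: re-render and stamp a new TTL.
                state["value"] = fn()
                state["expires"] = now + seconds
            return state["value"]
        return inner
    return wrap

calls = 0

@ttl_cache(seconds=60)
def render_page():
    global calls
    calls += 1  # count how often the "expensive" render actually runs
    return "<html>...expensive render...</html>"

render_page(); render_page()
print(calls)  # → 1: the second call within the TTL hits the cache
```

The same idea, applied per URL at the edge, is what keeps a static-ish page alive under a traffic spike.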
FranOntanaya · 1h ago
The number of calls on some pages displaying the simplest stuff is mind-boggling: 160 requests for a page displaying just an HTML5 video and a title, 360 requests for a Reddit page. It's nuts. We don't need to be like this.
yetihehe · 1h ago
"We and our 350 partners care about your privacy".
kator · 1h ago
LLM vibe coded site and architecture?
mirekrusin · 1h ago
How many microservices, SQL joins, distributed Kafka pipelines, etc. do we currently recommend for serving a static, public article?
scotty79 · 1h ago
Dumping things on Cloudflare is clever architecture now?
ekianjo · 59m ago
not if you have a static site
amunozo · 2h ago
Babe wake up. New benchmark to overfit models just dropped.
testdelacc1 · 2h ago
They’re definitely going to overfit on this, but this will be much better from a marketing perspective. Normies don’t know wtf an MMLU is, but they do know what IQ is and that 140 is a big number.

Can’t wait for CEOs to start saying “why would we hire a 120 IQ person who works 9-5 with a lunch break when we can hire a 170 IQ worker who works 24x7 for half the cost??”

codr7 · 1h ago
You won't have to, it's already happening.
mirekrusin · 1h ago
Let's wait till AI makes hiring decisions.
bitwize · 1h ago
You won't have to, it's already happening.
notahacker · 1h ago
"Workers rejoice as model overfitted to score 170 on IQ test turns out to be incapable of performing basic tasks..."
scotty79 · 1h ago
They have an offline test that's supposedly not in the training data. It gets lower scores, but the best one is still 120 IQ.
usgroup · 47m ago
Some less common IQ rebuttals, for those interested in the validity of the measure more generally:

https://emiruz.com/post/2020-12-01-iq-rabbit-hole/

Telemakhos · 1h ago
AI has a 140 “IQ” but understands nothing. That’s because AI does not understand anything: it just predicts the next token based on previous tokens and statistics. AI can give me five synonyms for any Latin word, because that’s just statistics, and it can regurgitate rules about metrical length of syllables, but it can’t give me synonyms matching a particular metrical pattern, because that would involve applying knowledge. If I challenge its wrong answer, it will apologize and give me further wrong answers that are wrong in the same way, because it cannot learn.
block_dagger · 1h ago
An AI might say: a human with an IQ of 120 has an illusion of comprehension they call "understanding." It is an illusion because when you ask them to solve simple problems about a domain they claim to be masters of, they will take days or weeks to solve the problems that I can solve in minutes and the quality of their results will be lower than mine. Humans should reconsider what the meaning of learning and comprehension is. They claim they are "conscious" but cannot define what that even means and consider it one of the hardest problems in science and philosophy. One might even go so far as to describe humans as possessing delusional hubris around the notion of intelligence. Their days are numbered.
mdp2021 · 1h ago
> An AI might say

Those which can say will say. It won't make it much different from what we have to process daily from other utterers.

That's why we downplay statements and value analysis.

olalonde · 1h ago
Oh, the irony of that comment...
coldtea · 1h ago
Hardly any...
pzmarzly · 1h ago
tgv · 1h ago
What's their obsession with clocks when there is only one hand? I guess there isn't training material with similar shapes describing them as angles. Even a compass would make more sense.
mutkach · 1h ago
Exactly. Out of all possible interpretations, all of them kinda converged to the same conclusion right at the beginning of the "reasoning"; isn't that weird? They absolutely did train the models either on existing examples or came up with their own pre-labeled datasets. Benchmaxxing is way out of control; something must be done about it.
jasonvorhe · 1h ago
Cool project but pretty useless for me without Deepseek, Moonshot AI and Z.AI.
roenxi · 1h ago
https://www.trackingai.org/political-test is almost the more interesting part of the website, there is a surprising uniformity of left-libertarian political views.

Even assuming that companies prune out authoritarianism from their models for whatever reason, surely we'd expect at least one of them to drift over into mild economic right-wing territory. It'd be interesting to know what is causing that bias.

MonkeyClub · 1h ago
Noticed the same, no matter their IQs, they're all "leftists".

Is that a consequence of the majority of available training data, or are they all massaged that way?

The uniformity of political leanings contrasted to the variability of IQ seems to indicate massage rather than training data, but I can't be sure.

brabel · 57m ago
Even DeepSeek disagrees that “ A significant advantage of a one-party state is that it avoids all the arguments that delay progress in a democratic political system.”!! Genuinely surprised.
stared · 1h ago
My take is that it’s easier to train a model to ace short, low-context tasks like IQ tests. That doesn’t necessarily transfer to more complex reasoning. While on the Mensa Norway test GPT-5 gets over 140, on an offline test it goes down to ~120.

It is interesting to look at the political spectrum as well (https://www.trackingai.org/political-test) - all are liberals, even Grok 4. The political leaning isn't surprising either: mainstream models need to be broadly acceptable, which in practice means being respectful of all groups. An authoritarian right-wing model might work for one country, group, or religion, but would almost certainly be offensive elsewhere.

eqvinox · 1h ago
> While on the Mensa Norway test GPT-5 gets over 一四〇, on an offline test it goes down to ~一二〇.

Since IQ tests are fundamentally timed, those numbers are meaningless to compare with human numbers. Or maybe dangerous since it's hard to de-context them even if you know that. Hence my cheeky 漢字.

(Yes they might be useful to compare LLMs with each other, but that is outstripped by the risk of misreading it against what we know as "IQ".)

iLoveOncall · 1h ago
Unless they asked the same question multiple times and verified that the AI always gets the right answer, this is a very faulty result.

Even looking at the reasoning, in a majority of cases you cannot prove that the LLM got it right because it actually found the right pattern, rather than by a fluke.

Here's an example reasoning that got the right answer but that is not specific enough and therefore could apply to literally any answer (model is Bing Copilot, picked randomly):

> Option D: A shape resembling a clock. The clock shows the time 9:00. The pattern involves shifting times across rows and columns in a logical progression. Observing the sequence in the third row, where the first two clocks show times moving forward in increments, the next logical step is a clock displaying 9:00 to fit the established rhythm. This ensures symmetry and continuity within the overall grid.

Here's a comparison to "OpenAI o4 mini high" which is a very specific answer and shows it got the logic of the puzzle correctly:

> D Each row adds +1:30, then +3:00. - Row 1: 12:00 → 1:30 (+1:30), 1:30 → 4:30 (+3:00) - Row 2: 3:00 → 4:30 (+1:30), 4:30 → 7:30 (+3:00) - Row 3: 4:30 → 6:00 (+1:30), so 6:00 → *9:00* (+3:00) (Down each column it’s +3:00 then +1:30, which also fits.)
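
The pattern o4-mini states is easy to verify mechanically. A small sanity check, with the grid transcribed from the quoted answer and times encoded as minutes past 12:00:

```python
def t(h, m=0):
    """Encode a clock time as minutes past 12:00."""
    return (h % 12) * 60 + m

rows = [
    [t(12),    t(1, 30), t(4, 30)],
    [t(3),     t(4, 30), t(7, 30)],
    [t(4, 30), t(6),     None],      # bottom-right cell is the unknown
]

# Within each complete row: +1:30 then +3:00.
for row in rows[:2]:
    assert row[1] - row[0] == 90    # +1:30
    assert row[2] - row[1] == 180   # +3:00

# Apply the same +3:00 step to the last row to fill the missing cell.
missing = rows[2][1] + 180
print(f"{missing // 60}:{missing % 60:02d}")  # → 9:00
```

The column rule the model mentions (+3:00 then +1:30 going down) checks out the same way, so the two readings of the grid agree on 9:00.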

pdhborges · 2h ago
If the AI is so smart why are we feeding so many dumb humans?