Nowadays, I automatically assume one of Qwen's models sits at the top of any chart that lacks them.
But that's the first IQ test.
gpt5 · 3h ago
The way human IQ testing developed is that researchers noticed people who excel in one cognitive task tend to do well in others - the “positive manifold.”
They then hypothesized a general factor, “g,” to explain this pattern. Early tests (e.g., Binet–Simon; later Stanford–Binet and Wechsler) sampled a wide range of tasks, and researchers used correlations and factor analysis to extract the common component, then norm it around 100 with a SD of 15 and call it IQ.
IQ tends to meaningfully predict performance across some domains, especially education and work, and shows high test–retest stability from late adolescence through adulthood. It also tends to be consistent between high-quality tests, despite a wide variety of testing methods.
It looks like this site just uses public IQ tests designed for humans. But it would have been more interesting if an IQ test were developed specifically for AI, i.e. a test that would aim to factor out the strength of a model's general cognitive ability across a wide variety of tasks. It is probably doable by running principal component analysis on a large set of the benchmarks available today.
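A minimal sketch of that idea, with a fabricated models-by-benchmarks score matrix (all numbers invented; scikit-learn is assumed purely for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

# Rows: models, columns: benchmark scores (made-up numbers).
scores = np.array([
    [88, 72, 91, 64],
    [75, 60, 80, 55],
    [92, 81, 95, 70],
    [60, 48, 66, 40],
], dtype=float)

# Standardize each benchmark, then extract the first principal component.
z = (scores - scores.mean(axis=0)) / scores.std(axis=0)
pca = PCA(n_components=1)
g_scores = pca.fit_transform(z).ravel()

print("variance explained by 'g':", pca.explained_variance_ratio_[0])
# Norm the component IQ-style: mean 100, SD 15.
iq_like = 100 + 15 * (g_scores - g_scores.mean()) / g_scores.std()
print(iq_like)
```

If the first component explains most of the variance across benchmarks, that would be the LLM analogue of the positive manifold.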
nsoonhui · 2h ago
Another component of this theory concerning g is that it's largely genetic, and immune to "intervention" AKA stability as you mentioned. See the classic "The Bell Curve" for a full exposition.
Which makes me wonder: what's the point of all the intervention in the form of teaching/parenting styles and whatnot, if the g factor is nature and largely immutable? What's the logic of the educators here?
matthewdgreen · 57m ago
If IQ potential were 50% genetic, then teaching would potentially raise your actual IQ by affecting the other 50%, which is huge. IQ scores in populations and individuals change based on education, nutrition, etc. But even if we hypothesized a pretend world where “g” was magically 100% genetic, this (imaginary) measure is just potential. It is not true that an uneducated, untrained person will be able to perform tasks at the level of an educated, trained person. Also, The Bell Curve was written by a political operative to promote ideological views, and is full of foundational errors.
jdietrich · 2h ago
G is (largely) immutable, but knowledge and skills are not. The economy is not zero-sum and we all benefit from increasing the total amount of human capital. Unfortunately, thinking around education is dominated by people who wrongly believe that the economy is zero-sum.
lukan · 2h ago
Assuming it is true (which I doubt), there is obviously still value in teaching knowledge and practical skills: the goal is making students know more, not producing more intelligent students.
You can have an IQ of over 200, but if no one ever shows you how a computer works or gives you a manual, you still won't be productive with it.
But I very much believe intelligence is improvable and also degradable; just ask some alcoholics, for instance.
cedilla · 2h ago
"The Bell Curve" is, let's say, highly controversial and not a good introduction into the topic. Its claim that genetics are the main predictor of IQ, which was very weakly supported at the time, has been completely and undeniably refuted by science in the thirty years since it's publication.
alphazard · 1h ago
This is misleading. Anyone who wants to learn about IQ should Google it. It's the most replicated finding in psychology, and any questions you have about twins or groups with similar or different genes have probably been investigated. There is a lot of noise online in the form of commentary about IQ, so it's important to look at actual data if you are skeptical/curious.
nialse · 2h ago
Do note that The Bell Curve is not considered controversial in general. The part about race and genetics is. Also genes being the sole predictor of IQ is not an accurate description of the book’s premise.
hemabe · 1h ago
And yet, in the US, the first start-ups are offering the possibility of testing embryos for their IQ: https://www.theguardian.com/science/2024/oct/18/us-startup-c...
Really? If not genetics then what is it? Just random??
hemabe · 1h ago
IQ is largely genetic, even if some people claim otherwise. The evidence for this is now overwhelming: even when different ethnic groups grow up in very similar conditions in the same country, the measured PISA scores (PISA correlates r=0.9 with IQ) vary greatly. For example, among second-generation children in Germany, there are significant differences in PISA scores. Polish children achieve similar or even better scores than German children. Turkish children, on the other hand, remain at the same poor level that their parents (the first generation of immigrants) achieved in the tests.
Twin studies and studies of adopted children also leave no doubt that there is a very strong genetic component that determines IQ.
Even Wikipedia assumes that heritability can be as high as 80%.
Links: https://en.wikipedia.org/wiki/Heritability_of_IQ https://www.welt.de/wirtschaft/article174706968/OECD-Studie-...
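As a sketch of where twin-study heritability estimates come from: Falconer's classic formula compares identical- and fraternal-twin correlations. The r values below are invented placeholders, not figures from any real study:

```python
# Falconer's heritability estimate from twin correlations:
#   h^2 ~ 2 * (r_MZ - r_DZ)
# Placeholder correlations for illustration only (not real study data).
r_mz = 0.86  # IQ correlation between identical (monozygotic) twins
r_dz = 0.60  # IQ correlation between fraternal (dizygotic) twins

h2 = 2 * (r_mz - r_dz)  # heritability estimate -> 0.52
c2 = r_mz - h2          # shared-environment component -> 0.34
e2 = 1 - r_mz           # non-shared environment and error -> 0.14
print(h2, c2, e2)
```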
The brain has pretty high plasticity. A large host of factors contribute to the final outcome, from mental stimulation to training to overall health, stress (both physical and mental), and nutrition.
It has been shown that IQ scores improve significantly just by taking them multiple times (training) [1]. They also vary if the tested person is sleep deprived, sick, or stressed.
You do have to consider that g is the amount of plasticity, though, which is mainly genetic. A better way to think of it is that genetics provides a potential capacity which may or may not be fulfilled. Training helps individuals to varying degrees.
[1] https://pmc.ncbi.nlm.nih.gov/articles/PMC7709590
matthewdgreen · 50m ago
That seems intuitive to me, but lots of other things in science seemed intuitive because I wanted to believe them. If the measured IQ difference in individuals can be overwhelmed by simple factors like “have I taken the test before”, we don’t really have a useful empirical measurement to say these things and we’re just stating our hopes and dreams.
nialse · 16m ago
The training effect in test-retest is dependent on g as well. It is intelligent to learn from past experiences.
Measuring g is hard and taking shortcuts is tempting. A reasonable repeatable g factor test takes hours, and is too often replaced by a single test. There are ways around the test-retest issues but they are roads less travelled.
gus_massa · 2h ago
Nutrition: just remove a few vitamins and watch the IQ drop like a stone. Even a "balanced" diet with 500kcal/day will be harmful.
Education: in spite of the claims, a good education raises the IQ measurement. The tests leak, and schools add similar tasks.
nialse · 1h ago
Counterpoint: don't nutrition and education help some people more than others? That's the g factor, which is mainly genetic.
gus_massa · 1h ago
>>> They then hypothesized a general factor, “g,” to explain this pattern.
>> what's the point of all the intervention in the form of teaching/parenting styles and whatnot, if g factor is nature and immutable by large? What's the logic of the educators here?
> does not nutrition and education help some people more than others? That’s the g factor which is mainly genetic.
Yes, if you ignore or compensate for everything else, it's mainly genetic.
nialse · 10m ago
That is correct. The null hypothesis tested is: if you compensate for everything, the result is the same for everyone, i.e. genetics has no effect on g. That null hypothesis is rejected. Thus, mainly genetic factors underlie the g factor.
nialse · 5m ago
Just to clarify: the prevailing notion in many contexts was that genetics does not matter, and thus, given the necessary social and educational interventions, every human would prosper. Sadly, this is not the case. We are limited by our biology AND the extent to which we and our environment manage to fulfill our potential.
gus_massa · 6m ago
I think everyone is using a different definition of the g factor.
nialse · 3m ago
One is certainly not, unless one is not well read. The g-factor is one of the most stable findings in psychology. It is well established and well defined.
pona-a · 2h ago
Education? Or more directly, socio-economics.
Many of the subjects tested never had any experience with this kind of formal testing, had little to no education, and of course predictably failed on several abstract tasks. It might be that the very pattern of sitting down and intensely focusing on apparently meaningless problems isn't as innate as expected.
mdp2021 · 2h ago
Personal development. It's a "subtle" skill. You train it (though maybe less directly than other skills).
smokel · 2h ago
If it isn't nature, then it probably is nurture. Averaged over the entire population, that is indeed mostly random.
pbmonster · 2h ago
Isn't this argument directly countered by the fact that you can study for IQ tests and subsequently do better?
kingkawn · 1h ago
Yes but they need a bullshit gold star to make themselves feel special
depressedpanda · 2h ago
If a child is, e.g., two standard deviations below the norm, it is cruel to expect it to keep up with the pace of other students.
Education can be better adapted to the child's needs.
alphazard · 2h ago
IQ is a discovery about how intelligence occurs in humans. As you mentioned, a single factor explains most of the performance of a human on an IQ test, and that model is better than theories of multiple orthogonal intelligences. To contrast, 5 orthogonal factors are the best model we have for human personality.
The first question to ask is "do LLMs also have a general factor?" How much of an LLM's performance on an IQ test can be explained by a single positive correlation between all questions? I would expect LLMs to perform much better on memory tasks than anything else, and I wouldn't be surprised if that was holding up their scores. Is there a multi-factor model that better explains LLM performance on these tests?
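One way to probe that, sketched with simulated data (everything below is fabricated; scikit-learn's FactorAnalysis is used only for illustration): fit one- and multi-factor models to a models-by-questions score matrix and compare the fit.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
n_models, n_questions = 40, 12

# Simulate scores driven by one latent "general ability" plus noise.
g = rng.normal(size=(n_models, 1))
loadings = rng.uniform(0.5, 1.0, size=(1, n_questions))
X = g @ loadings + 0.5 * rng.normal(size=(n_models, n_questions))

# Compare average log-likelihood of a 1-factor vs. 3-factor model;
# if one factor fits almost as well, a "g"-like story is plausible.
for k in (1, 3):
    fa = FactorAnalysis(n_components=k).fit(X)
    print(f"{k} factor(s): avg log-likelihood = {fa.score(X):.3f}")
```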
naveen99 · 1h ago
Some points in the 4(?)- or 5-dimensional personality space correlate with higher IQ, though.
alphazard · 1h ago
That may be the case. The personality traits are mostly uncorrelated with one another.
I was trying to give an example of what a successful multi factor model looks like (the Big 5) to then contrast it with a multi factor model that doesn't work well (theories of multiple intelligences).
YetAnotherNick · 2h ago
The ARC-AGI challenge aims for that. In fact the objective is even stricter: the tasks must be trivial for most humans, given time.
krapp · 3h ago
I imagine the value of something like this is for business owners choosing which LLMs they can replace their employees with, so its use of human IQ tests is relevant.
azernik · 2h ago
The point is that the correlation between doing well on these tasks and doing well on other (directly useful) tasks is well established for humans, but not well established for LLMs.
If the employees' job is taking IQ tests, then this is a great measure for employers. Otherwise, it doesn't measure anything useful.
bbarnett · 2h ago
> Otherwise, it doesn't measure anything useful.
Oh it measures a useful metric, absolutely, as aspects of an IQ test validate certain types of cognition. Those types of cognition have been found to map to real-world employment of the same.
If an AI is incapable of performing admirably on an IQ test for those types of cognition, then one thing we're certainly measuring is that it's incapable of handling that 'class' of cognition when the conditions change in minuscule ways.
And that's quite important.
For example, if the model appears to perform specific work tasks well, related to a class of cognition, then cannot do the same category of cognitive tasks outside of that scope, we're measuring lack of adaptability or true cognitive capability.
It's definitely measuring something. Such as, will the model go sideways with small deviations on task or input? That's a nice start.
sigmoid10 · 3h ago
Big caveat here:
This website's method doesn't work at all for humans the way it works for LLMs. For humans, there is a strict time limit on these IQ tests (at least in officially recognised settings like Mensa). This kind of sequence completion is mostly a question of how fast your brain can iterate on problems. Being able to solve more questions within the time limit means you get a higher score, because your brain essentially switches faster. But the LLMs are just given all the time in the world, in parallel, to see how many questions they can solve at all. If you look at the examples, you'll see some high-end models struggling with some of the first questions, which most humans would normally get easily. Only the later ones get hard, where you really have to think through multiple options. So a 100 IQ LLM here is not technically more intelligent at IQ test questions than 50% of humans.
If anything, this shows that some LLMs might win against humans because they can spend more time thinking per wall clock time interval thanks to the underlying hardware. Not because they are fundamentally smarter.
mdp2021 · 2h ago
But when an LLM fails despite having all the time in the world, you are pretty certain you have hit a wall.
So, in a way you have defined a good indicator for a limit for a certain area.
sigmoid10 · 2h ago
There is not enough sampling here to reach this conclusion. Remember, you can crank things like o3 pretty high on tasks like ARC AGI if you're willing to spend thousands of dollars on inference time compute. But that's obviously not in the budget for an enthusiast site like this.
mdp2021 · 2h ago
Sure but, you wrote:
> If anything, this shows that some LLMs might win against humans because they can spend more time thinking per wall clock time interval thanks to the underlying hardware. Not because they are fundamentally smarter.
You interpreted "smarter" the IQ way: results constrained by time. But we actually get an indicator of whether the LLM can reach the result at all, given time - and that is the interpretation of "smarter" that many of us need.
(Of course, it remains to be seen whether the ability to achieve those contextual results exports as an ability relevant to the solutions we actually need.)
sigmoid10 · 2h ago
No, you misunderstood. I'm saying that for reasoning models there is a lot of untapped capability in this test. I wouldn't be sure that there are hard limits; given enough compute, you'll probably find that a modern high-end model will reach 100%. But you probably don't want to spend thousands (or perhaps tens of thousands) of dollars on that. There are much better tests out there if you have money to burn and want to find true hard limits compared to humans.
mutkach · 2h ago
Judging from the reasoning trace for the problem of the day, almost all of the models obviously had IQ-test material in their training data, or at least are biased in a beneficial way. From the beginning of the trace you can see that the model had already "figured it out"; the reasoning is done only for applying the basic arithmetic.
None of the models did actually "reason" about what the problem could possibly be. None of them considered that more intricate patterns are possible in a 3x3 grid (having taken these kinds of tests earlier in life, I still had a few seconds of indecision, wondering whether this was the same kind of test I'd seen before or some more elaborate one), and none of them tried solving the problem column-wise (which is still possible, by the way). Personally, I think that indicates a strong bias present in the pretraining. For what it's worth, I would consider a model that came up with at least a few different interpretations of the pattern while "reasoning" to be the most intelligent one, irrespective of the correctness of the answer.
d4rkn0d3z · 1h ago
Human beings' IQ test results can vary significantly based on how much money is in their pockets. For example, if a farmer takes an IQ test before crops are harvested and sold they score lower than after crops are sold, in the same year.
It seems fairly obvious to me that an LLM is the projection of intelligence onto the language domain. In other words, if you killed Intelligence and gave it a push in the direction of language, the chalk outline you could draw around its dead body on the ground would be an LLM.
Full disclosure: I have taken 2 IQ tests, both online and timed. The first was in the late 90s, after I graduated in electronics engineering; it was free, and I scored 149. Four years later, after obtaining a theoretical physics degree, I took another and scored 169. The second test was not free, but I did not pay: the site owner personally emailed me my results with congrats, because they were the highest ever recorded on the site to date. I did both for fun, just to see the questions. I think both results are meaningless; the same variability occurs in the farmers studied, as mentioned above.
cateye · 2h ago
Isn’t giving LLMs “IQ scores” a category error?
Human IQ is norm-referenced psychometrics under embodied noise. Calling both “IQ” isn’t harmless, it invites bad policy and building decisions on a false equivalence. Don’t promote it.
jonplackett · 3h ago
Really need to use a CDN before you get #1 on HN
charles_f · 3h ago
I'm on a shared hosting instance with relatively low resource allocation but reasonable bandwidth, and made #1 several times while never having issues loading. As long as your content is static and doesn't generate load on your server, you should be fine serving a lot of concurrent requests. Issues start when serving content relies on a database, or you serve large content
habibur · 2h ago
caching is the solution. don't serve dynamic content w/o html caching.
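For instance, a minimal sketch in Python with Flask (an assumption; the site's actual stack is unknown): even a 60-second Cache-Control lifetime lets a CDN or reverse proxy absorb a front-page spike.

```python
from flask import Flask

app = Flask(__name__)

@app.route("/")
def index():
    # Pretend this is the expensive, dynamically rendered page.
    return "<h1>leaderboard</h1>"

@app.after_request
def add_cache_headers(response):
    # Allow any shared cache (CDN, reverse proxy) to hold the page
    # for 60 seconds, so most requests never reach the app.
    response.headers["Cache-Control"] = "public, max-age=60"
    return response

if __name__ == "__main__":
    app.run()
```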
stared · 3h ago
It was not my intention to bring the HN hug of death.
(For a reference, I shared a link, I am not the author.)
diggan · 3h ago
I mean not really, as always you just need to make sure you're not doing 10s of dynamic calls for each page load and if you do, add some minute-long cache at least. Most of the stuff that gets hugged to death really shouldn't, most of the times it's just static content that is trivial to host on even $10/month instances.
FranOntanaya · 3h ago
The amount of calls on some pages displaying the simplest stuff is mind-boggling. 160 requests for a page just displaying a HTML5 video and a title, 360 requests for a Reddit page, it's nuts. We don't need to be like this.
yetihehe · 2h ago
"We and our 350 partners care about your privacy".
kator · 3h ago
LLM vibe coded site and architecture?
mirekrusin · 3h ago
How many microservices, SQL joins, distributed Kafka pipelines etc. do we currently recommend for serving a static, public article?
scotty79 · 3h ago
Dumping things on Cloudflare is clever architecture now?
ekianjo · 2h ago
not if you have a static site
amunozo · 3h ago
Babe wake up. New benchmark to overfit models just dropped.
testdelacc1 · 3h ago
They’re definitely going to overfit on this, but this will be much better from a marketing perspective. Normies don’t know wtf an MMLU is, but they do know what IQ is and that 140 is a big number.
Can’t wait for CEOs to start saying “why would we hire a 120 IQ person who works 9-5 with a lunch break when we can hire a 170 IQ worker who works 24x7 for half the cost??”
codr7 · 3h ago
You won't have to, it's already happening.
mirekrusin · 3h ago
Let's wait till AI makes hiring decisions.
bitwize · 2h ago
You won't have to, it's already happening.
mirekrusin · 51m ago
Is it discriminating towards life forms?
notahacker · 2h ago
"Workers rejoice as model overfitted to score 170 on IQ test turns out to be incapable of performing basic tasks..."
scotty79 · 3h ago
They have an offline test that's supposedly not in the training data. It gets lower scores, but the best one is still 120 IQ.
Telemakhos · 2h ago
AI has a 140 “IQ” but understands nothing. That’s because AI does not understand anything: it just predicts the next token based on previous tokens and statistics. AI can give me five synonyms for any Latin word, because that’s just statistics, and it can regurgitate rules about metrical length of syllables, but it can’t give me synonyms matching a particular metrical pattern, because that would involve applying knowledge. If I challenge its wrong answer, it will apologize and give me further wrong answers that are wrong in the same way, because it cannot learn.
block_dagger · 2h ago
An AI might say: a human with an IQ of 120 has an illusion of comprehension they call "understanding." It is an illusion because when you ask them to solve simple problems about a domain they claim to be masters of, they will take days or weeks to solve the problems that I can solve in minutes and the quality of their results will be lower than mine. Humans should reconsider what the meaning of learning and comprehension is. They claim they are "conscious" but cannot define what that even means and consider it one of the hardest problems in science and philosophy. One might even go so far as to describe humans as possessing delusional hubris around the notion of intelligence. Their days are numbered.
mdp2021 · 2h ago
> An AI might say
Those which can say will say. It won't make it much different from what we have to process daily from other utterers.
That's why we downplay statements and value analysis.
olalonde · 2h ago
Oh, the irony of that comment...
coldtea · 2h ago
Hardly any...
usgroup · 2h ago
Some less usual IQ rebuttals for those interested in the validity of the measure more generally: https://emiruz.com/post/2020-12-01-iq-rabbit-hole/
What's their obsession with clocks when there is only one hand? I guess there isn't training material with similar shapes describing them as angles. Even a compass would make more sense.
mutkach · 2h ago
Exactly. Out of all possible interpretations, they all kinda converged to the same conclusion right at the beginning of the "reasoning"; isn't that weird? They absolutely did train the models either on existing examples or came up with their own pre-labeled datasets. Benchmaxxing is way out of control; something must be done about that.
Even assuming that companies prune out authoritarianism from their models for whatever reason, surely we'd expect at least one of them to drift over into mild economic right-wing territory. It'd be interesting to know what is causing that bias.
brabel · 2h ago
Even DeepSeek disagrees that “ A significant advantage of a one-party state is that it avoids all the arguments that delay progress in a democratic political system.”!! Genuinely surprised.
MonkeyClub · 2h ago
Noticed the same, no matter their IQs, they're all "leftists".
Is that a consequence of the majority of available training data, or are they all massaged that way?
The uniformity of political leanings contrasted to the variability of IQ seems to indicate massage rather than training data, but I can't be sure.
jasonvorhe · 2h ago
Cool project but pretty useless for me without Deepseek, Moonshot AI and Z.AI.
stared · 3h ago
My take is that it’s easier to train a model to ace short, low-context tasks like IQ tests. That doesn’t necessarily transfer to more complex reasoning. While on the Mensa Norway test GPT-5 gets over 140, on an offline test it goes down to ~120.
It is interesting to look at the political spectrum as well (https://www.trackingai.org/political-test) - all are liberal, even Grok 4.
The political leaning isn’t surprising either. Mainstream models need to be broadly acceptable, which in practice means being respectful of all groups. An authoritarian right-wing model might work for one country, group, or religion, but would almost certainly be offensive elsewhere.
eqvinox · 2h ago
> While on the Mensa Norway test GPT-5 gets over 一四〇, on an offline test it goes down to ~一二〇.
Since IQ tests are fundamentally timed, those numbers are meaningless to compare with human numbers. Or maybe dangerous since it's hard to de-context them even if you know that. Hence my cheeky 漢字.
(Yes they might be useful to compare LLMs with each other, but that is outstripped by the risk of misreading it against what we know as "IQ".)
pdhborges · 3h ago
If the AI is so smart why are we feeding so many dumb humans?
iLoveOncall · 2h ago
Unless they asked the same question multiple times and verified that the AI always gets the right answer, this is a very faulty result.
Even looking at the reasoning, in a majority of the cases you cannot prove that the LLM got it right because it actually found the right pattern rather than by a fluke.
Here's an example reasoning that got the right answer but that is not specific enough and therefore could apply to literally any answer (model is Bing Copilot, picked randomly):
> Option D : A shape resembling a clock. The clock shows the time 9:00.* The pattern involves shifting times across rows and columns in a logical progression. Observing the sequence in the third row, where the first two clocks show times moving forward in increments, the next logical step is a clock displaying 9:00 to fit the established rhythm. This ensures symmetry and continuity within the overall grid.
Here's a comparison to "OpenAI o4 mini high" which is a very specific answer and shows it got the logic of the puzzle correctly:
> D Each row adds +1:30, then +3:00. - Row 1: 12:00 → 1:30 (+1:30), 1:30 → 4:30 (+3:00) - Row 2: 3:00 → 4:30 (+1:30), 4:30 → 7:30 (+3:00) - Row 3: 4:30 → 6:00 (+1:30), so 6:00 → *9:00* (+3:00) (Down each column it’s +3:00 then +1:30, which also fits.)
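For what it's worth, that stated pattern checks out. A quick sketch verifying the quoted times with modular clock arithmetic (Python; times copied from the answer above):

```python
def add_minutes(t, m):
    """Advance a 12-hour clock time (hour, minute) by m minutes."""
    h, mm = t
    total = ((h % 12) * 60 + mm + m) % (12 * 60)
    return (total // 60 or 12, total % 60)

rows = [
    [(12, 0), (1, 30), (4, 30)],
    [(3, 0), (4, 30), (7, 30)],
    [(4, 30), (6, 0), (9, 0)],
]
for a, b, c in rows:
    assert add_minutes(a, 90) == b    # +1:30 within each row
    assert add_minutes(b, 180) == c   # +3:00 within each row
print("pattern holds for all three rows")
```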
gus_massa · 12m ago
That applies to humans too. If each question has 6 options, you can assume that everyone will get 16.6% for free and compensate in the grading criteria.
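A small sketch of that correction: map raw accuracy back onto a 0..1 skill scale by subtracting the expected chance rate (this is the standard guessing correction; the example numbers are illustrative).

```python
def chance_corrected(p, k):
    """Correct raw accuracy p on k-option questions for guessing."""
    return max(0.0, (p - 1 / k) / (1 - 1 / k))

print(chance_corrected(1 / 6, 6))  # 0.0 -> indistinguishable from guessing
print(chance_corrected(0.5, 6))    # 0.4 -> partial skill
print(chance_corrected(1.0, 6))    # 1.0 -> perfect
```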