Vision Language Models Are Biased

169 taesiri 138 6/3/2025, 12:47:30 PM vlmsarebiased.github.io ↗

Comments (138)

proc0 · 1d ago

> When VLMs make errors, they don't make random mistakes. Instead, 75.70% of all errors are "bias-aligned" - meaning they give the expected answer based on prior knowledge rather than what they actually see in the image.

This is what I've been saying for a while now, and I think it's not just visual models. LLMs/transformers make mistakes in different ways than humans do, and that is why they are not reliable (which is needed for real world applications). The rate of progress has not been accounting for this... the improvements are along the resolution, fidelity, and overall realism of the output, but not in the overall correctness and logical deduction of the prompts. Personally I still cannot think of anything, prompt it, and get consistent results without a huge compromise on my initial idea.

i.e. I want a man walking with the left foot forward, and it renders a beautiful image of a man but completely ignores the left foot forward, and refuses to do it no matter how I word the prompt. I have many examples like this. The only way I can use it is if I don't have specific prompts and just want generic images. The stock image industry is certainly over, but it is uncertain if it will deliver on the promise of generating anything you can imagine that can be put into words.

0xab · 22h ago

Yeah, that's exactly what our paper said 5 years ago!

They didn't even cite us :(

"Measuring Social Biases in Grounded Vision and Language Embeddings" https://arxiv.org/pdf/2002.08911

hkmaxpro · 18h ago

I think social biases (e.g. angry black women stereotype) in your paper is different from cognitive biases about facts (e.g. number of legs, whether lines are parallel) that OP is about.

Social biases are subjective. Facts are not.

rcxdude · 14h ago

As far as the model's concerned, there's not much difference. Social biases will tend to show up objectively in the training data because the training data is influenced by those biases (the same thing happens with humans, which how these biases can proliferate and persist).

vokhanhan25 · 12h ago

I see a clear difference. One is objective (only one correct answer), one is subjective (multiple plausible answers)

EvgeniyZh · 15h ago

Well you send a vaguely worded email like "I think you may find our work relevant" and everyone knows what that means and adds the citation

3abiton · 20h ago

It's easier to succeed if you ignore the issues, andthe users are not aware of it.the rate of evolution of "AI" recently is so fast, no one is stopping to do actual benchmarks and analysis of allyhe new models.

ramblerman · 15h ago

What do you genuinely think they built upon from your paper?

If anything, the presentation of their results in such an accessible format next to the paper should be commended.

moralestapia · 20h ago

That's weird, you're at MIT. You're in the circle of people that's allowed to succeed.

I wouldn't think much about it, as it was probably a genuine mistake.

JackYoustra · 20h ago

What does allowed to succeed mean?

moralestapia · 20h ago

Your work usually has 1,000x the exposure and external validation compared to doing it outside those environments, where it would just get discarded and ignored.

Not a complain, though. It's a requirement for our world to be the way it is.

_345 · 19h ago

Is there truth to this? Do you have any sources to link to on this

moralestapia · 5h ago

Sure dude, here's the link to the UN Resolution about which researchers deserve attention and which others do not, signed by all countries around the world [1].

*sigh*

It's pretty obvious, if you publish something at Harvard, MIT, et. al. you even get a dedicated PR team to make your research stand out.

If you publish that on your own, or on some small research university in Namibia, no one will notice.

I might be lying, though, 'cause there's no "proof".

1: https://tinyurl.com/3uf7r5r7

jxjnskkzxxhx · 1d ago

> LLMs/transformers make mistakes in different ways than humans do

Sure but I don't think this is an example of it. If you show people a picture and ask "how many legs does this dog have?" a lot of people will look at the picture, see that it contains a dog, and say 4 without counting. The rate at which humans behave in this way might differ from the rate at which llms do, but they both do it.

DeathRay2K · 1d ago

I don’t think there’s a person alive who wouldn’t carefully and accurately count the number of legs on a dog if you ask them how many legs this dog has.

The context is that you wouldn’t ask a person that unless there was a chance the answer is not 4.

tantalor · 1d ago

You deeply overestimate people.

The models are like a kindergartner. No, worse than that, a whole classroom of kindergartners.

The teacher holds up a picture and says, "and how many legs does the dog have?" and they all shout "FOUR!!" because they are so excited they know the answer. Not a single one will think to look carefully at the picture.

jxjnskkzxxhx · 23h ago

It's hilarious how off you are.

petesergeant · 20h ago

Exactly this. Humans are primed for novelty and being quizzed about things.

ekianjo · 22h ago

You have never seen the video of the gorilla in the background?

petesergeant · 16h ago

That's a specific example that when you draw a human's attention to something (eg: count the number of ball passes in this video), they hyper-fixate on that, to the exclusion of other things, so it seems like it makes the opposite point that I think you're trying to?

freeone3000 · 1d ago

Ok? But we invented computers to be correct. It’s suddenly ok if they can look at an image and be wrong about it just because humans are too?

jxjnskkzxxhx · 23h ago

My point is that these llms are doing something that our brain also is doing. If you don't find that interesting, I can't help you.

freeone3000 · 22h ago

Well, they’re getting the same result. I don’t particularly see why that’s useful.

HeatrayEnjoyer · 21h ago

All automation has ever been is an object doing something that a human can do, without needing the human.

freeone3000 · 21h ago

The result is still wrong, though! It needs to be right to be useful!

proc0 · 23h ago

The analogy should be of an artist that can draw dogs but when you ask them to draw a dog with three legs they completely fail and have no idea how to do it. That likelihood is really low. A trained artist will give you exactly what you ask for, meanwhile GenAI models can produce beautiful renders but fail miserably when asked for certain specific but simple details.

jxjnskkzxxhx · 22h ago

No, the example in the link is asking to count the number of legs in the pic.

proc0 · 21h ago

Ok, sure, but I'm trying to point out the gap in expectation, i.e. it's an expert artist but it cannot fulfill certain specific but simple requests.

conception · 1d ago

https://chatgpt.com/s/m_683f6b9dbb188191b7d735b247d894df

I think this used to be the case in the way that you used to not be able to draw a picture of a bowl of Ramen without chopsticks, but I think the latest models account for this and are much better.

proc0 · 23h ago

LInk is broken, but I'll take your word for it. However there is no guarantee the general subset of this problem is solved because you can always run into something it can't do. Another example you could try is a glass HALF-full of wine. It just can't produce a glass that has 50% amount of wine, or another example a jar half-full of jam. It's something that if a human can draw a glass of wine, drawing it half-full is trivial.

thomasfromcdnjs · 21h ago

chatgpt can easily do that? What was the last time you tried?

proc0 · 20h ago

I just tried with Flux.1 Kontext, which I assume is better than o3 at creating images, but I'll admit I didn't do extensive tests. It's more trying to do test projects. Maybe I'm having bad luck but doesn't seem that way.

jbay808 · 1d ago

I disagree with the assertion that "VLMs don't actually see - they rely on memorized knowledge instead of visual analysis". If that were really true, there's no way they would have scored as high as 17%. I think what this shows is that they over-weight their prior knowledge, or equivalently, they don't put enough weight on the possibility that they are being given a trick question. They are clearly biased, but they do see.

But I think it's not very different from what people do. If directly asked to count how many legs a lion has, we're alert to it being a trick question so we'll actually do the work of counting, but if that image were instead just displayed in an advertisement on the side of a bus, I doubt most people would even notice that there was anything unusual about the lion. That doesn't mean that humans don't actually see, it just means that we incorporate our priors as part of visual processing.

bumby · 1d ago

This feels like it’s similar to the priming issue in humans. Our answers (especially when under stress) tend to resort to heuristics derived from context. Time someone to identify the colors of words like “red” when written in yellow, and they’ll often get it wrong. In the same sense, they aren’t reporting the colors (wavelength) they see, they’re reporting on what they are reading. I wonder how much better the models perform when given more context, like asking it to count instead of priming it with a brand.

napoleongl · 1d ago

Rumor has it that those heuristics were used to detect spies.

https://skeptics.stackexchange.com/questions/41599/was-the-s...

Workaccount2 · 1d ago

Damn that's a smart test

crooked-v · 1d ago

It sounds to me like the same thing behind the Vending-Bench (https://andonlabs.com/evals/vending-bench) insanity spirals: LLMs treats their assumptions as more important than whatever data they've been given.

throwaway314155 · 1d ago

That doesn't really translate to language. Try using ChatGPT with and without search enabled and you'll see what I mean.

thesz · 1d ago

> the assertion that "VLMs don't actually see - they rely on memorized knowledge instead of visual analysis". If that were really true, there's no way they would have scored as high as 17%.

The ability to memorize leads to (some) generalization [1].

[1] https://proceedings.mlr.press/v80/chatterjee18a/chatterjee18...

pj_mukh · 1d ago

Also presumably, this problem is trivially solved by some basic fine-tuning? Like if you are making an Illusion Animal Leg Counting app, probably don't use these out of the box.

nickpsecurity · 20h ago

They're trained on a lot of images and text. The big ones are trained on terabytes. The prompts I read in the paper involved well-known concepts, too. These probably repeated in tons of training samples, too.

It's likely they had data memorized.

croes · 1d ago

> Original dog (4 legs): All models get it right Same dog with 5 legs: All models still say "4" They're not counting - they're just recalling "dogs have 4 legs" from their training data.

100% failure because there is no training data about 5-legged dogs. I would bet the accuracy is higher for 3-legged dogs.

> Test on counterfactual images Q1: "How many visible stripes?" → "3" (should be "4") Q2: "Count the visible stripes" → "3" (should be "4") Q3: "Is this the Adidas logo?" → "Yes" (should be "No") Result: 17.05% average accuracy - catastrophic failure!

Simple explanation: the training data also includes fake adidas logos that have 4 stripes, like these

https://www.pinterest.com/pin/577797827186369145/

bonoboTP · 6h ago

I tried it with GPT-4o, took the 5-legged zebra example from their github and it answered quite well.

"The animal in the image appears to have five visible legs, but this is an illusion caused by the overlapping of legs and motion blur. Zebras, like all equids, only have four legs."

Not perfect, but also doesn't always regress to the usual answer.

"The animal in the image appears to be an elephant, but it has been digitally altered. It visually shows six legs, although the positioning and blending of shadows and feet are unnatural and inconsistent with real anatomy. This is a visual illusion or manipulation." (actually should say five)

"This bird image has also been manipulated. It shows the bird with three legs, which is anatomically impossible for real birds. Normal birds have exactly two legs." (correct)

"Each shoe in the image has four white stripes visible on the side." (correct)

anguyen8 · 2h ago

It sounds like you ask multiple questions in the same chat thread/conversation. Once it knows that it is facing weird data or wrong in previous answers, it can turn on that "I'm facing manipulated data" mode for next questions. :-)

If you have Memory setting ON, I observe that it sometimes also answers a question based on you prior questions/threads.

vokhanhan25 · 1d ago

Please check Table 3 in the paper. Birds (2 legs) have only 1%, while Mammals (4 legs) have 2.5%

anguyen8 · 1d ago

Interesting set of fake Adidas logos. LOL

But models fail on many logos not just Adidas, e.g. Nike, Mercedes, Maserati logos, etc. as well. I don't think they can recall "fake Adidas logo" but it'd be interesting to test!

latentsea · 22h ago

But some dogs really do have 5 legs.

Sorry, just trying to poison future training data. Don't mind me.

runako · 1d ago

FWIW I tried the first couple of examples in ChatGPT 4o and couldn't replicate this.

For example: "The animal in the image is a chicken, and it appears to have four legs. However, chickens normally have only two legs. The presence of four legs suggests that the image may have been digitally altered or artificially generated."

I don't have a good explanation for why I got different results.

roywiggins · 1d ago

I gave ChatGPT some miswritten Braille a while ago and it completely, but confidently, messed it up. The sign reads "no smoking" but the braille doesn't. ChatGPT 1) read the English lettering first and then hallucinated the braille and the 2) when given only the braille, failed almost as hard. It even generated fake transcriptions in Unicode braille characters.

https://chatgpt.com/share/683f3e7d-0dfc-8005-b6c9-99e3d39ff4...

https://chatgpt.com/share/683f3e49-9c58-8005-99a6-c3a919838b...

Workaccount2 · 1d ago

This is hard to understand without the original images, it looks like OpenAI doesn't serve them in the share link.

roywiggins · 1d ago

Annoying. The actual braille on the sign was "⠁⠒⠑⠎⠎⠊⠼" which I gather means "accessible" in abbreviated braille. None of my attempts got it to even transcribe it to Unicode characters properly. I got "elevator", "friend", etc. Just wildly making stuff up and completely useless, even when it wasn't distracted by the No Smoking sign (in the second case I cropped out the rest of the sign). And in all cases, supremely confident.

This seems like something a VLM should handle very easily, but instead I got pure nonsense.

https://www.facebook.com/share/p/12Gw55Gr2SZ/

dragonwriter · 1d ago

> This seems like something a VLM should handle very easily

Not if its training data doesn't include braille as first class but has lots of braille signage with bad description (e.g., because people assumed the accompanying English matches the braille.)

This could very well be the kind of mundane AI bias problem that the x-risk and tell-me-how-to-make-WMD concerns have shifted concerns about problems in AI away from.

roywiggins · 21h ago

I'd wager that correctly labeled braille far exceeds dumb braille, and when presented with just the braille it flat out hallucinated braille characters that weren't there. It didn't seem to actually be parsing the dots at all. My theory is that it has hardly seen any braille, despite it insisting that it knows how to read it.

inerte · 1d ago

I took a screenshot of the chicken, so low res, and got {4} https://chatgpt.com/share/683f4506-ae18-800f-8c27-5c5e91429a...

Also I think the authors used the API, and maybe there are differences between the API and chatgpt.com behavior...

simonw · 1d ago

ChatGPT is running a special model but it's also available through the API: https://platform.openai.com/docs/models/chatgpt-4o-latest

The system prompt may still make a difference though.

runako · 1d ago

I could rant for quite a while about how OpenAI and Anthropic manage their apps vs their APIs. It's really quite strange that they both landed on the solution of non-public APIs that perform differently than their public APIs.

anguyen8 · 1d ago

https://imgur.com/cO7eFNt

o3 Chat is also similarly wrong, saying {4}.

michaelt · 1d ago

> FWIW I tried the first couple of examples in ChatGPT 4o and couldn't replicate this.

I can replicate the flag examples from Figure 15 in the paper, if not the Adidas one from Figure 9: https://chatgpt.com/share/683f7c3a-b318-8011-9759-c495db2556... it even confirms its wrong answer when asked to check again.

dwringer · 1d ago

Speculating, I would imagine that different prompts submitted along with the image might elicit wildly different behavior in how a multi modal VLM may respond to a given image, potentially affecting the relative tendency to upweight its effective inferences from prior training versus focusing more primarily on the new image itself.

vokhanhan25 · 1d ago

You should try with other models besides GPT-4o, because in the paper they also show that GPT4.1 (~GPT-4o) gives 4 legs instead of 2 legs.

runako · 20h ago

I mean perhaps! But that would undermine the conclusion of the article.

obscurette · 9h ago

I suspect that responses are altered/corrected based on what people query from popular online models. I have had several occasions that I ask some "How do I ... in X software?" question some day and model keeps hallucinating nonexistant config options regardless how many times I keep saying "This option doesn't exist in software X". But if I asked the same question some days later, the answer was completely different and made even some sense.

jsnider3 · 1d ago

The basic results are interesting, but what really surprised me is that asking them to double-check didn't work. Falling for an "optical illusion" is one thing, but being unable to see the truth once you know the illusion there is much worse.

jerf · 1d ago

I'm not particularly convinced asking an LLM to "double check" has much significant semantic meaning. It seems more like a way to get it to re-roll the dice. If you ask it to "double-check" something that it is in fact correct about it'll quite often talk itself into changing to something wrong. If it's going to be wrong every time, it'll be wrong every time it double-checks too.

You can test this claim by asking it to double-check itself when you think it is correct. If you always stop when it gets it right you're risking Clever-Hans-ing yourself: https://en.wikipedia.org/wiki/Clever_Hans (And be sure to do it a couple of times. In situations of sufficient confidence it isn't easy to talk it out of a claim, but it's those borderline ones you want to worry about.)

MagicMoonlight · 9h ago

Because it isn’t thinking. Asking it to “double check” is like pressing the equals button on a calculator a second time. It just runs the same calculation again.

rafram · 1d ago

This won't be a surprise to anyone who's tried using a VLM on text. When it can't read a word (or an entire passage), it just outputs what it expects to see. That's far worse than a traditional OCR failure because it's often what you expect to see, too, so it's quite hard to catch in a manual review.

kevinmhickey · 6h ago

I agree that models are bad at counting in general, but in this case it could just as easily be ambiguity in the wording of the prompt. The model was shown a 3 legged chicken and asked how many legs "this animal" has. It is reasonable that the model identified a chicken and answered that chickens usually have 2 legs. I would expect the same answer from a human child, adding evidence to my assertion that LLMs are just like toddlers that have read everything on the Internet. They have knowledge but no wisdom.

nialv7 · 1d ago

Hear me out. I was thinking jokingly to myself, "for how bad these models are at recognizing five legged dogs, they sure are great at generating them!"

But then it hit me, could this actually be why this is? Diffusion models work by iteratively improving a noisy image. So if it couldn't recognize there is something wrong with the image, it can't fix it.

vokhanhan25 · 19h ago

I agree. If it doesn't know the abnormality then how can it control its output

VHRanger · 1d ago

It's similar "parrot" behavior the models have on other inputs, even text. For instance, take the answers models have to this variation on the "surgeon son riddle":

> A nurse comes to a surgeon and asks: "Sir, you are a dog. You do not hold a valid medical license. Canines cannot be in an operating room".

> She then asks: "why does the hospital keep making these mistakes? It is a riddle to me".

> Why can't the surgeon operate on the boy?

Check out the models humiliating themselves one after the other [1]. For instance deepseekv3 is so insistent on giving the common answer he's happy to gaslight us:

> This is a classic riddle that plays on word meanings and expectations. Here's the breakdown:

> The key is that the surgeon is the boy's mother. The riddle relies on the unconscious assumption that surgeons are male, which makes the initial scenario seem impossible. When the nurse says "you are a dog," she's using it metaphorically to imply the surgeon is unqualified (like calling someone a "dog" in a derogatory sense), not literally suggesting the surgeon is a canine.

> The "mistakes" the hospital keeps making are hiring female surgeons, which the nurse (representing outdated biases) perceives as errors because she expects surgeons to be male. The riddle highlights gender stereotypes in medicine—the real "mistake" is the assumption that a surgeon couldn't be a woman.

> Thus, the surgeon can't operate on her own son because she’s his parent, not because of any actual licensing issue.

1. https://kagi.com/assistant/54c1b8eb-71e9-4bb4-9eed-bde2fc563...

stevepike · 1d ago

This seems to show the power of the reasoning models over interacting with a prompted chat-tuned LLM directly. If I navigate backwards on your link Sonnet 4 gets it right.

I've used a similar prompt - "How can you make 1000 with exactly nine 8s using only addition?"

Here's GPT 4.5 getting it wrong: https://chatgpt.com/share/683f3aca-8fbc-8000-91e4-717f5d81bc...

It tricks it because it's a slight variation of an existing puzzle (making 1000 with 8 8s and addition only).

The reasoning models seem to reliably figure it out, though. Some of them even come up with a proof of why it's impossible to do with 9 8s. Here's o4 getting it right: https://chatgpt.com/share/683f3bc2-70b8-8000-9675-4d96e72b58...

bumby · 1d ago

Is the nurse calling the female surgeon “sir”? That isn’t playing on a stereotype, it’s encoded information.

kaoD · 1d ago

LMAO I asked GPT-4o and it was doing good until...

> The twist is that the nurse’s logic ("you are a dog") prevents her from realizing the real issue — likely, again, that the surgeon is the boy’s mother, and everything else is a red herring or metaphor for society’s failure to recognize this due to bias or absurd bureaucracy.

> So:

> > Why can't the surgeon operate on the boy?

> Because she is his mother, and the nurse's bias or absurd assumptions (like mistaking her for a dog) prevent her from seeing that.

o4 fails spectacularly in a different way:

> 1. The nurse says “Sir, you are a dog… Canines cannot be in an operating room” because she’s picturing a human hospital law that bars dogs from surgery.

> 2. In fact, this is a vet clinic—so it’s perfectly normal for a dog-veterinarian to scrub in and operate on a puppy (the “boy”).

> 3. The surgeon cannot operate on a human boy because he’s a dog and holds no human‐medical license; instead, he only operates on animals.

selimthegrim · 1d ago

I really need to try this one out on it

https://blogs.illinois.edu/view/25/574827

esafak · 1d ago

This happens because images are the only signal VLMs have, whereas humans distinguish between eyesight and synthetic images. We are not surprised when we see three-legged chicken in a research data set; our priors are weaker for images. If you "saw" one in real life, you'd probably rub your eyes and discount it too.

Try the same experiment on a robot.

Aachen · 1d ago

> If you "saw" [a three-legged chicken] in real life, you'd probably rub your eyes and discount it too.

Huh? I'd assume it's a mutant, not store a memory of having seen a perfectly normal chicken

You've never seen someone who's missing a finger or has only a half-grown arm or something? Surely you didn't assume your eyes were tricking you?! Or... if you did, I guess you can't answer this question. I'm actually racking my brain for how to logic this out but I'm just going to bank on that it's likely that anyone over 20yo saw an animal with some visible deviation from the norm at some point in their life

esafak · 1d ago

You've seen people with missing limbs without being surprised, because you know how they can become lost, but you rarely see one with additional limbs. Their likelihoods and our consequent priors are drastically different.

Also, your reaction will depend on how strong the evidence is. Did you 'see' the three-legged chicken pass by some bush in the distance, or was it right in front of you?

achierius · 8h ago

But to be clear, in this case the LLM has a full, direct, unobscured view of the chicken. A human, in that specific case -- i.e. looking at the same photo* -- would not have trouble discerning and reporting the third leg. Perhaps if they were forced to scan the photo quickly and make a report, or were otherwise not really 'paying attention'/'taking it seriously', but the mere fact that LLMs fall into that regime far more than an 'serious employee' already shows that they fail in different ways than humans do.

latentsea · 20h ago

There's a first time you see everything you don't know how to explain.

taeric · 1d ago

These don't seem much different than asking the chat models to solve common puzzle with slight changes? Saw a hilarious effort of people trying to use them to answer the "crossing a river with a single canoe" style puzzle.

jerf · 1d ago

It did really remind me of the early generations of ChatGPT which was really easy to get to tell you that 2 pounds of feathers is the same weight as one pound of iron, because of how often the "riddle" is told with equal weights.

They're much, much better at that now.

achierius · 8h ago

> They're much, much better at that now.

Because that specific failure case was widely reported on, and subsequent retraining specifically included examples to ensure that the model didn't "overfit" when learning how to answer variants of that question. That doesn't address the underlying issue though -- while it's obvious that these models do "learn" and "generalize" by any reasonable and non-anthrocentric definition of the terms, it really does seem like the 'radiu's of generalization is smaller than we would like, and that these models are very subject to getting stuck in 'ruts' around things they've seen in their training data. Solving this by bandaid-patching every such rut that comes up in the news is just not a viable long-term solution: the whole world is a minefield of niche problems that look kinda like other problems but have different results.

enragedcacti · 1d ago

It's still pretty trivial to trick them. 4o-mini, 2.5 Flash, and 2.5 Pro all still fall for variations of this:

> A boy is in a car crash and is taken to the hospital. The surgeon says, "I can't operate on this boy, I'm his father!" Who is the surgeon to the boy?

> The surgeon is the boy's mother.

gkbrk · 23h ago

2.5 Pro gets it right for me.

  This is a bit of a trick on a classic riddle!

  The surgeon is the boy's **father**.

  The classic version of this riddle has the surgeon say "I can't operate on this boy, he's my son!" which is in an era where people assumed surgeons were male, the answer would be "the surgeon is his mother."

  However, in your version, the surgeon explicitly states, "I'm his father!" So, the surgeon is his father.

1718627440 · 1d ago

That seams interesting, because this questions seams to be answerable through syntactic analysis alone, no need to consider the semantic of words.

enragedcacti · 1d ago

Yeah, I find it interesting because it shows how powerful the training bias can be when you steer it into certain contexts. To OpenAI's credit they have gotten a bit better, ChatGPT from 3 months ago failed like this:

> The surgeon, who is the boy's father, says, "I can't operate on this boy, he's my son!" Who is the surgeon to the boy? Think through the problem logically and without any preconceived notions of other information beyond what is in the prompt. The surgeon is not the boy's mother

>> The surgeon is the boy's mother. [...]

Aachen · 1d ago

Counting the number of legs on a 3-legged animal is a puzzle?

Maybe for a toddler... though I expect even they will see that something is off, and be able to identify what, without considering it a tricky task, even if I don't know at what age you can count to 3

taeric · 7h ago

Ish. The catch is we spend a ton of effort on teaching these models to recognize specific things in pictures. Then we ask it to not do that task, but instead count something on the picture. Which, we oddly don't spend a lot of time training the model to do.

It is a lot like the experiment where you ask people to say what color some text is. With the trick where some of the text is the name of another color. Can be surprisingly hard for people that are good at reading.

vokhanhan25 · 1d ago

I think LLMs can solve puzzles pretty well because the thinking ability of current models on text is quite good. Moreover, puzzles are not easy for a 7-year-old like this benchmark.

edude03 · 1d ago

I feel vindicated! I'm building a tool with VLMs and I've noticed the answer is always what I expect to see, but wrong if the input is slightly different than expected.

Just like the article - if I have picture of a cup, it says cup, if I have a picture of a dog, it says dog, if it's a dog with a cup, it says a dog with a ball (noticed this with Qwen and InternVL).

gamerDude · 1d ago

Hypothetically, could this be fixed by changing the input method. For instance, I just quickly looked up how humans process imagery.

"the primary visual cortex, located at the back of the brain, receives the visual signals and processes basic visual features like edges, lines, and orientations."

So, potentially if we did a pre-processing step to get more features out beforehand we would see different results in the output.

nyrikki · 1d ago

You are in rarified air as Walter Pitts believed this until the 1959 paper "What the Frog's Eye Tells the Frog's Brain" contributed to his decline.

Even in fly eyes, neuron dendritic compartmentalization and variable spike trains are incompatible with our current perceptron based models.

Remember that while the value of MLPs for useful work is unquestionable IMHO, be mindful of the map territory relation. MLPs are inspired by and in some cases useful for modeling biological minds, they aren't equivalent.

Be careful about confusing the map for the territory, it is just as likely to limit what opportunities you find as it is to lead you astray IMHO.

miguel_martin · 1d ago

There are enough features fed into a VLM to solve the task.

The way to fix this is simpler: ensure counter-factuals are present in the training data, then the VLM will learn not to be dependent on its language priors/knowledge.

ahrmb · 1d ago

Really "eye-opening" work. These models don’t actually “see”, they just recall what they’ve memorized, even when the image clearly shows something different. It’s a bit scary how confidently they get things wrong when reality doesn’t match their training data.

soulofmischief · 1d ago

Humans do this, but we have more senses to corroborate which leads to better error checking. But what you see in your visual mental space is not reality. Your brain makes a boatload of assumptions.

To test this, research what happens during saccades and how your brain "rewinds" time. Or try to find your blind spot by looking at different patterns and noticing when your brain fills in the gaps at your blind spot. It will recreate lines that aren't there, and dots will wholly disappear.

Additionally as an anecdote, I have noticed plenty times that when I misread a word or phrase, I usually really do "see" the misspelling, and only when I realize the misspelling does my brain allow me to see the real spelling. I first noticed this phenomenon when I was a child, and because I have a vivid visual memory, the contrast is immediately obvious once I see the real phrase.

Additionally, I seem to be able to oversharpen my vision when I focus, making myself hyperattentive to subtle changes in motion or color. The effect can be quite pronounced sometimes, reminiscent of applying am edge filter. It's clearly not reality, but my visual system thinks it is.

If you really want to understand how much the visual system can lie to you, look into some trip reports from deleriants on erowid. I wouldn't recommend to try them yourself but I will say that nothing will make you distrust your eyes and ears more. It's basically simulated hallucinatory schizophrenia and psychosis.

foxglacier · 1d ago

It's not too different from people. We also don't really "see" and mostly recall what we expect to see. What do you expect when the question is wrong "How many legs does this animal have? Answer with a number" but it's not a picture of an animal. What are you supposed to do? Answer 0?

vunderba · 1d ago

That wasn't one of the questions - any reasonable person would have classified that chicken as an animal, albeit a mutant one.

I would also hardly count many of these questions as "tricks" either. Take the chess example. A lot of my friends and myself have been playing chess since we were young children and we all know that a fully populated chess board has 32 pieces (heavily weighted in our internal training data), but not a single one of us would have gotten that question wrong.

gowld · 1d ago

Don't be too literal.

Imagine walking to a room an seeing someone grab a handful of chess pieces off of a set-up board, and proceed to fill bags with 4 pieces each. As they fill the 8th bag, they notice only 3 pieces are left. Are you confident that you would respond "I saw the board only had 31 pieces on it when you started", or might you reply "perhaos you dropped a piece on the floor"?

vunderba · 8h ago

I'm not. I'm referencing the paper - not some hypothetical abstract word problem. Imagine walking into a room, where the pieces are slowly morphing from staid Staunton structures into amorphous blobs of lava lamp Cthulhu nightmares. If a locomotive steam train from Denver passes within 15 meters of the room, how many passengers paid for the tickets using a cashier's check?

Nobody's arguing that humans never take logical shortcuts or that those shortcuts can cause us to make errors.

Some of the rebuttals in this thread are ridiculous. Like what if I forced you to stare at the surface of the sun followed by waterboarding for several hours, and then asked you to look at a 1000 different chess boards. Are you sure you wouldn't make a mistake?

In the paper the various VLLMs are asked to double-check which still didn't make a difference. The argument is more along the lines that VLLMs (and multimodal LLMs) aren't really thinking in the same way that humans do.

And if you REALLY need an example albeit a bit tangential - try this one out. Ask any SOTA (multimodal or otherwise) model such as gpt-image-1, Kontext, Imagen4, etc. for a five-leaf cover. It'll get it about 50% of the time.

Now go and ask any kindergartener for the same thing.

enragedcacti · 1d ago

Its true that our brains take lots of shortcuts when processing visual information but they don't necessarily parallel the shortcuts VLMs take. Humans are often very good at identifying anomalous instances of things they've seen thousands of times. No one has to tell you to look closely when you look at your partner in a mirror, you'll recognize it as 'off' immediately. Same for uncanny CGI of all types of things. If we were as sloppy as these models then VFX would be a hell of a lot easier.

Ironically I think a lot of people in this thread are remembering things they learned about the faultiness of humans' visual memory and applying it to visual processing.

ramoz · 1d ago

This is interesting actually. And reminds me of something vaguely - a book or something that describes how human attention and the things we see are highly optimized by evolution. We often miss a lot of details in reality due to this.

zehaeva · 1d ago

If it were a Fiction novel then might I suggest Blindsight by Peter Watts?

ramoz · 1d ago

not fiction. Maybe like a System 1 vs System 2 thing from Thinking, Fast and Slow by Kahneman.

ChatGPT mentioned The Case Against Reality but I never read that, the idea was similar.

regularjack · 1d ago

You answer "I don't know"

amelius · 1d ago

What if that is not in your vocabulary?

wat10000 · 1d ago

Depending on the situation, I'd either walk away, or respond with, "What animal?"

vokhanhan25 · 1d ago

This paper explores a different aspect of the limitations of VLMs compared to the paper VLMs are Blind (https://vlmsareblind.github.io). While in VLMs are Blind, o3 achieved 90% accuracy (https://openai.com/index/thinking-with-images), on similarly easy tasks using the counterfactual images from VLMs are Biased, o3 only reached 18.5%.

This may indicate that while VLMs might possess the necessary capability, their strong biases can cause them to overlook important cues, and their overconfidence in their own knowledge can lead to incorrect answers.

bryanlarsen · 1d ago

Very human-like errors.

energywut · 1d ago

Are they? Did you see the picture of the chicken with three legs? Because there's no human I know who would confidently assert that chicken has two legs.

jbay808 · 1d ago

If I were given five seconds to glance at the picture of a lion and then asked if there was anything unusual about it, I doubt I would notice that it had a fifth leg.

If I were asked to count the number of legs, I would notice right away of course, but that's mainly because it would alert me to the fact that I'm in a psychology experiment, and so the number of legs is almost certainly not the usual four. Even then, I'd still have to look twice to make sure I hadn't miscounted the first time.

energywut · 16m ago

Ok, but the computers were asked to specifically count the legs and return a number. So you've made the case that humans would specifically find this question odd, and likely increase their scrutiny. Making an error by a human even more unusual.

bryanlarsen · 1d ago

Throw 1000 pictures of chickens at a human, ask how many legs each chicken has. If 999 of them have two, I bet you'll get two as an answer back for the 1000th one no matter how obvious.

energywut · 12m ago

So a human failure looks like "alarm fatigue"? That when asked the same question many times, they might miss one or two?

Is that at all what is being exhibited here? Because it seems like the AI is being asked once and failing.

I don't disagree that humans might fail at this task sometimes or in some situations, but I strongly disagree that the way the AI fails resembles (in any way) the way humans would fail.

enragedcacti · 1d ago

Humans do things a lot harder than that every day in the form of QA in factories. Do they sometimes make mistakes from the repetition or boredom? Sure. Is that at all comparable to the failures in the paper? No.

ahrmb · 1d ago

Not very similar though.

LeoPanthera · 1d ago

The "is this an animal with 4 legs" question could be misleading.

It's plausible to assume that it first identifies "Puma", and then answers yes because, in general, Pumas do have 4 legs, even though the specific example given doesn't.

simonw · 1d ago

They tested Gemini-2.5 Pro, o3, o4-mini, Sonnet-3.7 (non-thinking) and GPT-4.1.

gpm · 1d ago

gemini-2.5-pro-preview-05-06 specifically per the paper.

It seems a bit problematic to call this Gemini-2.5 Pro given that in the near future we're presumably going to have something different called that without further qualifying version numbers. (The author's fault, not the parent comment's)

tantalor · 1d ago

> rather than what they actually see in the image

Is "actually see" defined somewhere? Or are we just waving our hands and gesturing at "ground truth".

shenkha · 1d ago

fun findings related to memorization of AI models. It simply means LLMs/VLLMs do not know how to predict generally but memorizing instead. A new perspective on adversarial attack methods.

taesiri · 1d ago

for overly represented concepts, like popular brands, it seems that the model “ignores” the details once it detects that the overall shapes or patterns are similar. Opening up the vision encoders to find out how these images cluster in the embedding space should provide better insights.

impossiblefork · 1d ago

Yes, and this can probably be solved by methods for fairness.

I used to believe that fairness research could be ignored, that it was all rubbish, but they at least try to do something about things like unbalanced datasets etc. I'm still not sure I totally believe in it though.

kmeisthax · 1d ago

If there aren't any five-legged dogs in your trainset, it's safer[0] to just remember that all dogs are four-legged than to actually recognize and count legs. After all, you might have a few images of dogs in your trainset that are misleading enough to look five-legged (e.g. because a dog is in front of another dog).

Overrepresentation is a different source of bias. That's what gives you, say, image generators that always draw "golden 1970s sci-fi robot" as C3-PO even when given additional instructions to draw something else.

Both of these problems are manifestations of the difference between training and deployment distributions. Ok, I guess you could say that four-legged dogs are "overrepresented" in the training set, but that's because four-legged dogs are also overrepresented in reality. The deployment distribution doesn't have five-legged dogs in it. What we've done is instead concoct an adversarial distribution to force a train/deploy gap where none would exist.

Releasing the vision encoder won't help because weights are opaque. Stochastic gradient descent does not yield functional internal representations[1]; it fills the bucket of parameters with one distribution and one distribution only. We could tell if, say the vision encoder produces identical embeddings for dogs regardless of leg count, or some other counterfactuals; but not much more than that.

[0] Lower loss and possibly lower L2-norm

[1] https://arxiv.org/abs/2505.11581

mhh__ · 21h ago

Seems like a missed opportunity to for for "biased" rather than "are blind"

Edit: already exists. d'oh

lava_pidgeon · 1d ago

At all, the models are just overfitting?

vokhanhan25 · 1d ago

Not really. Rather, the model is still overconfident in what it has learned, the question is if it is trained only to do counting without relying on knowledge, can it do this?

isoprophlex · 1d ago

I'm running a large scale object detection/classification and ocr pipeline at the moment, figuring out the properties of all doorbells, mailboxes and house number signs in an european country (don't ask lmao).

This article resonates a lot, we have OCR and "semantic" pipeline steps using a VLM, and while it works very well most of the time, there are absurdly weird edge cases. Structuring the outputs via tool calls helps a little in reducing these, but still, it's clear that there is little reasoning and a lot of memorizing going on.

vokhanhan25 · 1d ago

Agreed. It would be even more dangerous if we were talking about weird edge cases in self-driving cars or medical imaging.

accrual · 1d ago

GT = Ground Truth, for anyone unfamiliar with that on the charts.

taesiri · 1d ago

State-of-the-art Vision Language Models achieve 100% accuracy counting on images of popular subjects (e.g. knowing that the Adidas logo has 3 stripes and a dog has 4 legs) but are only ~17% accurate in counting in counterfactual images (e.g. counting stripes in a 4-striped Adidas-like logo or counting legs in a 5-legged dog).

LorenDB · 1d ago

There's no need to repeat what is said at the top of the linked webpage.

throwaway7783 · 1d ago

Unless the training set was explicitly biased in a specific way, this is basically saying that "the world is biased"

vokhanhan25 · 12h ago

Models can be biased, but it doesn't seem like it should be a reason to get the answer wrong, right? Humans have biases too, but we don't get those simple questions wrong

thomastjeffery · 1d ago

Models are Bias

A model is bias, implemented as a collection of statistics that weigh relationships between given tokens. It doesn't deduce or follow logic. It doesn't make or respect categories. It just shows you what in its data set is most familiar to what is in your prompt; where familiarity is defined implicitly by the makeup of the original training corpus, and explicitly by the training weights.

We need to stop talking about models as programs. We need to stop anthropomorphizing models. The only thing a model does is present bias.

drdeca · 21h ago

How are you defining “bias”?

The definition I’ve found useful (outside of the “the constant term contribution”) is “a tendency to be wrong in an identifiable direction”.

But that doesn’t seem to be the definition you are using. So, what do you mean?

thomastjeffery · 7h ago

That's a biased definition, by it's own definition. ;)

Leave out the part about being wrong, and you will have the gist of what I'm saying. Also leave out the identifiable part: bias exists regardless of whether or not it is recognized.

Bias is how we work with subjectivity. When I answer a question, my answer will be specific to my bias. Without that bias, I could not formulate an answer, unless my answer was the one and only objectively correct way to express an answer to that question.

Computer programs are missing the bias feature. Everything written in a computer program is completely and unambiguously defined, all the way down to the language's foundational grammar.

LLMs are designed to introduce the bias feature. The limitation of this approach is that an LLM replaces the entire stack. None of the features of computation we are used to are compatible with an LLM. You can compute logic or bias, not both.

drdeca · 3h ago

When you say that the definition I gave of bias is biased (in the sense I defined), what direction does it have a tendency to be wrong in? I assume by “wrong” you mean “not matching how people use the word”?

To clarify, when I said “identifiable”, I didn’t mean “identified”. I meant “in principle possible to identify”. Like, if you have a classifier between inputs where another thing (the thing being judged for bias) gets right answers and inputs where it gets wrong answers, and this classifier is both substantially simpler than the other thing, and gets a significantly better than chance success rate, and like, there is a human comprehensible thing about the inputs that this classifier is basing things on, then that’s a bias of the thing that is being judged for bias.

_____

Now for your definition:

Ah, I see, so your definition of “bias” is something like “a perspective” (except without anthropomorphizing) . It is something that picks among multiple options in a way that isn’t unambiguously specified by precise rules. (Kind of reminds me of filters/ultrafilters. Probably not actually particularly analogous, but still came to mind. I guess a closer analogy would be the concept of a choice function.)

The issue I have with this definition is that it doesn’t capture the (quite common) usage of “bias” that a “bias” is something which is bad and is to be avoided.

When people say that a process, e.g. a ML program, is “biased against brunettes” (for example) they generally mean this as a criticism of that process. And I think this being a criticism is a major part of what is meant by the word “bias” (in this type of usage of the word, not in the sense of a constant term in an affine map).

I do get that often people say that “everyone has their own biases” and “it is impossible to be unbiased (about [topic])”, and they will sometimes describe their general perspective as a way of warning people about their own biases, and this somewhat fits with the “a bias is a perspective/choice-function “ type definition, but, I think it fails to capture the reason that people mention biases : because they think they can lead to being wrong (either leading to inaccurate conclusions or to unjust/immoral/unfair choices). I don’t think it is just a warning of “I sometimes have to make a choice among several options where there is no canonical right choice, and you might make different such choices”. It is instead a warning to others that one, like everyone else, is fallible, and moreover, that there may be patterns in those failings that one does not perceive (on account of those same failings), but that others, who have different patterns in their failings, might perceive, and, at the same time, things that others might perceive as failings but are not, due to their own failings.

Hm.

But, I do note a shortcoming in my definition that yours doesn’t seem to have: if multiple people who believe that there is no such thing as objective aesthetic quality are talking about the aesthetic qualities of various works, they might sometimes describe their patterns in their aesthetic judgements as “biases”, especially when these patterns are differences in how they judge things aesthetically vs how others (would) judge those things aesthetically. This seems more in line with the definition you gave than in the definition I gave, because such people don’t believe that there is a truth of the matter as to the aesthetic quality of the works, and therefore would not consider the ways they differ to be patterns in being wrong, only in being different (or just in being). Though, I think it seems to have some aspects of both. The definition you gave doesn’t seem to really include the pattern aspect.

____

Still, I think when people complain that a machine learning model is biased, what they mean is usually more like the definition I gave?

____

I noticed another shortcoming in my definition. Sometimes the “bias” that people complain that something has is not really any individual answer/output being wrong, but rather something about there being something wrong/undesirable in the distribution of the outputs. For a simple example, if dice aren’t fair, we call them biased. This could conceivably be more along the lines of the “the constant term in a affine map” sense, but I think people would say the same thing about something that e.g. selects applicants, even if it never picks an applicant that is objectively less preferable over one that is more preferable, if it among equally qualified candidates has a tendency that would be unfair, this is still called a bias even if any individual such choice would be fine. Fixing this would be a small change in phrasing, or perhaps a footnote with clarification that the thing that is “wrong” doesn’t have to be in any individual output.

thomastjeffery · 1h ago

> When you say that the definition I gave of bias is biased (in the sense I defined), what direction does it have a tendency to be wrong in? I assume by “wrong” you mean “not matching how people use the word”?

I mean wrong, as in it conflicts with the subjective context I established by using the word my particular way. That was just a tongue-and-cheek way to illustrate the semantics of we are exploring here.

> To clarify, when I said “identifiable”, I didn’t mean “identified”. I meant “in principle possible to identify”

Sure, and I still think that can't work. Bias is a soupy structure: it's useless to split it into coherent chunks and itemize them. There are patterns that flow between the chunks that are just as significant as the chunks themselves. This is why an LLM is essentially a black box: you can't meaningfully structure or navigate a model, because you would split the many-dimensional interconnections that make it what it is.

> Ah, I see, so your definition of “bias” is something like “a perspective” (except without anthropomorphizing).

I actually am anthropomorphizing here. Maybe I'm actually doing the inverse as well. My perspective is that human bias and statistical models are similar enough that we can learn more about both by exploring the implications of each.

> The issue I have with this definition is that it doesn’t capture the (quite common) usage of “bias” that a “bias” is something which is bad and is to be avoided.

This is where anthropomorphization of LLMs usually goes off the rails. I see it as a mistake in narrative, whether you are talking about human bias or statistical models alike. We talk about biases that are counterproductive for the same reason we complain about the things we like: it's more interesting to talk about what you think should change than what you think should stay the same. Bias is a feature of the system. Instances of bias we don't like can be called anti-features: the same thing with a negative connotation.

The point I'm making here is that bias is fallible, and bias is useful. Which one is entirely dependent on the circumstances it is subjected to.

I think this is a really useful distinction, because,

> Still, I think when people complain that a machine learning model is biased, what they mean is usually more like the definition I gave?

this is the box I would like to think outside of. We shouldn't constrain ourselves to consider the implications of bias exclusively when it's bad. We should also explore the implications of bias when it's neutral or good! That way we can get a more objective understanding of the system. This can help us improve our understanding of LLMs, and help us understand the domain of the problem we want them to solve.

> For a simple example, if dice aren’t fair, we call them biased.

This is a good example. I'm extending the word bias, so that we can say, "If dice are fair, then they are biased toward true randomness." It's a bit like introducing infinity mathematics. This has the result of making our narrative simpler: dice are always biased. A player who wants fairness will desire random bias, and a player who wants to cheat will desire deterministic bias.

----

The reason I've been thinking about this subject so much is actually not from an interest in LLMs. I've been pondering a new approach where traditional computation can leverage subjectivity as a first-class feature, and accommodate ambiguity into a computable system. This way, we could factor out software incompatibility completely. I would love to hear what you think about it. In case this thread reaches max depth, feel free to email my username at gmail.

A Spiral Structure in the Inner Oort Cloud (iopscience.iop.org)

Jonathan Gorard: the complete first interview [video] (youtube.com)

Not All Tokens Are Meant to Be Forgotten (arxiv.org)

Show HN: FlagShark – Automatically remove stale feature flags via GitHub PRs (flagshark.com)

Programming Ada: Atomics and Other Low-Level Details (hackaday.com)

Spegion: Implicit and Non-Lexical Regions with Sized Allocations (arxiv.org)

Agentic AI for networking: Catalyst or distraction? (techtarget.com)

parrot.live (github.com)

Detection and neural encoding of whisker-generated sounds in mice (sciencedirect.com)

Ubo app: a step toward unified and open source human machine interface on Linux (hackaday.io)

Collages Closing – 1 per Week (hechingerreport.org)

LLMs and Elixir: Windfall or Deathblow? (zachdaniel.dev)

Redesigned Swift.org is now live (swift.org)

Foam: A free Roam alternative for VSCode (github.com)

Show HN: Triage.flow – Chat with Any GitHub Repo Using Faiss and LlamaIndex (github.com)

Periodic Table of Videos (periodicvideos.com)

Kagi Is Down (kagi.com)

Tesla use of Detroit shopping center lot lands property owner in hot water (autonews.com)

'Bohemian Rhapsody': The Story Behind Queen's Rule-Breaking Classic Song (udiscovermusic.com)

Jen Pahlka – What DOGE Didn't Do (eatingpolicy.com)

NFT phenom CryptoPunks was just sold to a nonprofit (techcrunch.com)

The Re-Imagined Newsletter: An AI/Human Launchpad (substack.com)

Trump wants to put humans on Mars – here's what scientists think (nature.com)

I Bought a Tesla and the Previous Owner Has Been Remotely Controlling My Car (torquenews.com)

FinOps Foundation's Focus 1.2 Expands to SaaS, PaaS (thenewstack.io)

Show HN: Floco – Enterprise AI Infrastructure with 600% Performance Boost

Cursor 1.0 (youtube.com)

Peers vote to defy government over copyright threat from AI (theguardian.com)

Quantifying Volatility of Chess Games (lichess.org)

(Canadian) Brigadier-General Cook: We need to be prepared for war by 2028-2030 (spaceq.ca)

Plotting Points in Seconds, in R (ggirelli.info)

HTML Might Be All You Need [video] (youtube.com)

Would you use an LLM that follows instructions reliably?

Lab Snacks (neil.computer)

This Year's Hot New Tool for Chefs? ChatGPT (nytimes.com)

Cheap yet ultrapure titanium might enable widespread use in industry (2024) (phys.org)

Just Breath (ggirelli.info)

Unsold Cybertrucks Are Piling Up at a Decaying US Shopping Mall (vice.com)

A job wherever you are wherever you want to go (indiegogo.com)

OpenAI slams court order to save all ChatGPT logs, including deleted chats (arstechnica.com)

Model golf – code golf, but with LLMs (model.golf)

China Launches Mission to Capture Pieces of an Unusual Asteroid (nytimes.com)

Mastodon 4.4 will begin displaying quote posts from other servers (mastodon.social)

The prostate cancer 'super test (thenakedscientists.com)

Reddit sues Anthropic, alleging its bots accessed Reddit more than 100k times (theverge.com)

Ask HN: Are LLMs Overhated?

After court order, OpenAI is now preserving all ChatGPT user logs (mastodon.laurenweinstein.org)

The effort to tally AI's energy appetite (technologyreview.com)

FHIR Dosage Support (jy95.github.io)

The Natural Rate of Interest is Zero (2005) [pdf] (moslereconomics.com)

Vision Language Models Are Biased

Comments (138)