My biggest gripe is that he's comparing probabilistic models (LLMs) by a single sample.
You wouldn't compare different random number generators by taking one sample from each and then concluding that generator 5 generates the highest numbers...
Would be nicer to run the comparison with 10 images (or more) for each LLM and then average.
planb · 18m ago
And by a sample that has become increasingly well known as a benchmark. Newer training data will contain more articles like this one, which naturally improves an LLM's ability to estimate what's considered a good "pelican on a bike".
puttycat · 19m ago
You are right, but the companies making these models invest a lot of effort in marketing them as anything but probabilistic, i.e. making people think that these models work deterministically, like humans.
In that case we'd expect a human with perfect drawing skills and perfect knowledge about bikes and birds to output such a simple drawing correctly 100% of the time.
In any case, even if a model is probabilistic, if it had correctly learned the relevant knowledge you'd expect the output to be perfect because it would serve to lower the model's loss. These outputs clearly indicate flawed knowledge.
ben_w · 14m ago
> In that case we'd expect a human with perfect drawing skills and perfect knowledge about bikes and birds to output such a simple drawing correctly 100% of the time.
Enjoyable write-up, but why is Qwen 3 conspicuously absent? It was a really strong release, especially the fine-grained MoE, which is unlike anything that's come before (in terms of capability and speed on consumer hardware).
joshstrange · 1h ago
I really enjoy Simon's work in this space. I've read almost every blog post he's published on this, and I love seeing him poke and prod the models to see what pops out. The CLI tools are all very easy to use and complement each other nicely, all without trying to do too much by themselves.
And at the end of the day, it’s just so much fun to see someone else having so much fun. He’s like a kid in a candy store and that excitement is contagious. After reading every one of his blog posts, I’m inspired to go play with LLMs in some new and interesting way.
Thank you Simon!
qwertytyyuu · 20m ago
https://imgur.com/a/mzZ77xI
Here are a few I tried with the models. Looks like the newer version of Gemini is another improvement?
puttycat · 17m ago
The bicycles are still very far from actual ones.
bravesoul2 · 16m ago
Is there a good model (any architecture) for vector graphics out of interest?
neepi · 1h ago
My only take home is they are all terrible and I should hire a professional.
keiferski · 24m ago
As the other guy said, these are text models. If you want to make images use something like Midjourney.
Prompting for a pelican riding a bicycle makes a decent image there.
dist-epoch · 54m ago
Most of them are text-only models. It's like asking a person born blind to draw a pelican based on what they've heard it looks like.
neepi · 33m ago
That seems to be a completely inappropriate use case?
I would not hire a blind artist or a deaf musician.
namibj · 27m ago
It's a proxy for abstract design work, like writing software or modeling in parametric CAD.
Most of the non-math design work of applied engineering, AFAIK, falls under the umbrella that's tested by the pelican riding the bicycle.
You have to make a mental model and then turn it into applicable instructions.
Program code/SVG markup/parametric CAD instructions don't really differ in that aspect.
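That last point can be made concrete: producing an SVG is exactly the "mental model into applicable instructions" step, since you emit drawing commands rather than pixels. A toy sketch (the shape schema and coordinates here are invented purely for illustration):

```python
def shapes_to_svg(shapes, width=200, height=200):
    """Serialize an abstract shape list into SVG drawing instructions."""
    parts = [f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">']
    for s in shapes:
        if s["kind"] == "circle":
            cx, cy, r = s["cx"], s["cy"], s["r"]
            parts.append(f'<circle cx="{cx}" cy="{cy}" r="{r}" fill="none" stroke="black"/>')
        elif s["kind"] == "line":
            x1, y1, x2, y2 = s["x1"], s["y1"], s["x2"], s["y2"]
            parts.append(f'<line x1="{x1}" y1="{y1}" x2="{x2}" y2="{y2}" stroke="black"/>')
    parts.append("</svg>")
    return "\n".join(parts)

# A crude "bicycle": two wheels and a top tube.
svg = shapes_to_svg([
    {"kind": "circle", "cx": 60, "cy": 150, "r": 30},
    {"kind": "circle", "cx": 160, "cy": 150, "r": 30},
    {"kind": "line", "x1": 60, "y1": 150, "x2": 160, "y2": 150},
])
```

An LLM doing the pelican test has to perform the same serialization, only with a far richer mental model of wheels, frame, and bird.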
neepi · 7m ago
As a former actual real tangible meatspace engineer, I would not assume that this methodology applies to applied engineering. Things are a little nuanced, and the nuances come from a combination of communication and experience, neither of which any LLM has any insight into at all. It's not out there on the internet to train on, and it's not even easy to put into abstract terms that could be used as training data. And engineering in isolation doesn't exist; there is a whole world around it.
Ergo, no, you can't just throw a bicycle at an LLM and have a parametric model drop out into SolidWorks, a machine make it, and everyone buy it. That is the hope, really, isn't it? You end up with a useless shitty bike with a shit pelican on it.
The biggest problem we have in the LLM space is that no one really understands any of the proposed use cases well enough, and neither does anyone being told that it works for those use cases.
I guess the idea is that by asking the model to do something that is inherently hard for it, we might learn something about the baseline smartness of each model, which could be considered a predictor of performance at other tasks too.
dist-epoch · 28m ago
The point is about exploring the capabilities of the model.
Like asking you to draw a 2D projection of a 4D sphere intersected with a 4D torus, or something.
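For what it's worth, one half of that task is mechanical: the intersection geometry is the hard part, but the projection itself is just dropping coordinates. A minimal orthographic sketch (the choice of axes and the sample points are arbitrary, for illustration only):

```python
def project_4d_to_2d(points, axes=(0, 1)):
    """Orthographic projection of 4D points: keep two coordinates, drop the rest."""
    i, j = axes
    return [(p[i], p[j]) for p in points]

# A few hand-picked points on the 4D unit sphere.
pts = [(1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0), (0.5, 0.5, 0.5, 0.5)]
flat = project_4d_to_2d(pts)  # -> [(1, 0), (0, 1), (0, 0), (0.5, 0.5)]
```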
dmd · 21m ago
Sorry, Beethoven, you just don’t seem to be a match for our org. Best of luck on your search!
Look upon these works, ye mighty, and despair: https://www.gianlucagimini.it/portfolio-item/velocipedia/
You too, Monet. Scram.
https://www.oneusefulthing.org/p/the-recent-history-of-ai-in...
> Claude 4 will rat you out to the feds!
>If you expose it to evidence of malfeasance in your company, and you tell it it should act ethically, and you give it the ability to send email, it’ll rat you out.