The last six months in LLMs, illustrated by pelicans on bicycles

962 swyx 234 6/8/2025, 7:38:37 AM simonwillison.net

Comments (234)

isx726552 · 32d ago

> I’ve been feeling pretty good about my benchmark! It should stay useful for a long time... provided none of the big AI labs catch on.

> And then I saw this in the Google I/O keynote a few weeks ago, in a blink and you’ll miss it moment! There’s a pelican riding a bicycle! They’re on to me. I’m going to have to switch to something else.

Yeah this touches on an issue that makes it very difficult to have a discussion in public about AI capabilities. Any specific test you talk about, no matter how small … if the big companies get wind of it, it will be RLHF’d away, sometimes to the point of absurdity. Just refer to the old “count the ‘r’s in strawberry” canard for one example.

simonw · 32d ago

Honestly, if my stupid pelican riding a bicycle benchmark becomes influential enough that AI labs waste their time optimizing for it and produce really beautiful pelican illustrations I will consider that a huge personal win.

benmathes · 24d ago

"personal" doing a lot of work there :-)

(And I'd be envious of your impact, of course)

Choco31415 · 32d ago

Just tried that canard on GPT-4o and it failed:

"The word "strawberry" contains 2 letter r’s."

belter · 31d ago

I tried

strawberry -> DeepSeek, GeminiPro and ChatGPT4o all correctly said three

strawberrry -> DeepSeek, GeminiPro and ChatGPT4o all correctly said four

stawberrry -> DeepSeek, GeminiPro all correctly said three

ChatGPT4o even in a new Chat, incorrectly said the word "stawberrry" contains 4 letter "r" characters. Even provided this useful breakdown to let me know :-)

Breakdown: stawberrry → s, t, a, w, b, e, r, r, r, y → 4 r's

And then asked if I meant "strawberry" instead and said because that one has 2 r's....
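For reference, the counts reported above can be checked deterministically. This is a minimal sketch; `count_letter` is a hypothetical helper name, not something from the thread.

```python
def count_letter(word: str, letter: str) -> int:
    """Count case-insensitive occurrences of a single letter in a word."""
    return word.lower().count(letter.lower())


# The three spellings tried in the comment above.
for word in ("strawberry", "strawberrry", "stawberrry"):
    print(word, count_letter(word, "r"))
```

Note that the model's own breakdown of "stawberrry" lists three r's while its stated total is four.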

MattRix · 32d ago

This is why things like the ARC Prize are better ways of approaching this: https://arcprize.org

whiplash451 · 31d ago

Well, ARC-1 did not end well for the competitors of tech giants and it’s very unclear that ARC-2 won’t follow the same trajectory.

wolfmanstout · 31d ago

This doesn’t make ARC a bad benchmark. Tech giants will have a significant advantage in any benchmark they are interested in, _especially_ if the benchmark correlates with true general intelligence.

lofaszvanitt · 31d ago

You push sha512 hashes of things in a GitHub repo and a short sentence:

x8 version: still shit . . x15 version: we are closing, but overall a shit experience :D

this way they won't know what to improve upon. of course they can buy access. ;P

when they finally solve your problem you can reveal what was the benchmark.
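The idea above is essentially a commit-and-reveal scheme. A minimal sketch follows, assuming the benchmark is a text prompt. The random salt is an addition not mentioned in the comment (it makes short prompts harder to guess by brute force), and the function names and the example prompt are hypothetical.

```python
import hashlib
import secrets


def commit(benchmark_text: str) -> tuple[str, str]:
    """Return (digest, salt). Publish the digest; keep the text and salt private."""
    salt = secrets.token_hex(16)  # assumption: salt added to resist guessing
    payload = f"{salt}:{benchmark_text}".encode("utf-8")
    return hashlib.sha512(payload).hexdigest(), salt


def verify(digest: str, salt: str, benchmark_text: str) -> bool:
    """Check a later reveal (text plus salt) against the published digest."""
    payload = f"{salt}:{benchmark_text}".encode("utf-8")
    expected = hashlib.sha512(payload).hexdigest()
    return secrets.compare_digest(expected, digest)


# Hypothetical usage: commit now, reveal once the problem is solved.
digest, salt = commit("placeholder private benchmark prompt")
print(digest[:16], verify(digest, salt, "placeholder private benchmark prompt"))
```

Anyone holding the published digest can later confirm that the revealed benchmark is the one that was committed, without learning it in advance.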

adrian17 · 32d ago

> This was one of the most successful product launches of all time. They signed up 100 million new user accounts in a week! They had a single hour where they signed up a million new accounts, as this thing kept on going viral again and again and again.

Awkwardly, I never heard of it until now. I was aware that at some point they added ability to generate images to the app, but I never realized it was a major thing (plus I already had an offline stable diffusion app on my phone, so it felt less of an upgrade to me personally). With so much AI news each week, feels like unless you're really invested in the space, it's almost impossible to not accidentally miss or dismiss some big release.

haiku2077 · 32d ago

Congratulations, you are almost fully unplugged from social media. This product launch was a huge mainstream event; for a few days GPT generated images completely dominated mainstream social media.

sigmoid10 · 31d ago

If you primarily consume text-based social media (HN, reddit with legacy UI) then it's kind of easy to not notice all the new kinds of image infographics and comics that now completely flood places like instagram or linkedin.

derwiki · 32d ago

Not sure if this is sarcasm or sincere, but I will take it as sincere haha. I came back to work from parental leave and everyone had that same Studio Ghiblized image as their Slack photo, and I had no idea why. It turns out you really can unplug from social media and not miss anything of value: if it’s a big enough deal you will find out from another channel.

stavros · 31d ago

Why does everyone keep calling news "social media"? Have I missed a trend? Knowing what my friend Steve is up to is social media, knowing what AI is up to is news.

haiku2077 · 31d ago

You did miss a trend: https://www.pewresearch.org/short-reads/2024/09/17/more-amer...

loudmax · 31d ago

I'm afraid a lot of Americans consume the news like they consume sports media. They root for their team and select a news stream that presents them with the most favorable coverage.

stavros · 31d ago

As a non-American, I can assure you that's pretty much everywhere.

dgfitz · 32d ago

I missed it until this thread. I think I’m proud of myself.

tough · 31d ago

You're one of today's lucky 10.000

https://xkcd.com/1053/

dgfitz · 28d ago

I’m not, I still don’t know what it is, just that it was some kind of fad.

Semaphor · 31d ago

Facebook, discord, reddit, HN. Hadn’t heard of it either. But for FB, Reddit, and Discord I strictly curate what I see.

azinman2 · 32d ago

Except this went very mainstream. Lots of turn myself into a muppet, what is the human equivalent for my dog, etc. TikTok is all over this.

It really is incredible.

thierrydamiba · 32d ago

The big trend was around the ghiblification of images. Those images were everywhere for a period of time.

Jedd · 32d ago

Yeah, but so were the bored ape NFTs - none of these ephemeral fads are any indication of quality, longevity, legitimacy, or interest.

mrkurt · 32d ago

If we try really hard, I think we can make an exhaustive list of what viral fads on the internet are not. You made a small start.

none of these ephemeral fads are any indication of quality, longevity, legitimacy, interest, substance, endurance, prestige, relevance, credibility, allure, staying-power, refinement, or depth.

Aurornis · 32d ago

100 million people didn’t sign up to make that one image meme and then never use it again.

That many signups is impressive no matter what. The attempts to downplay every aspect of LLM popularity are getting really tiresome.

jodrellblank · 32d ago

I think it sounds far more likely that 100M people signed up to poke at the latest viral novelty and create one meme, than that 100M people suddenly discovered they had a pressing long-term need for AI images all on the same day.

Doesn’t it?

gretch · 32d ago

It's neither of these options in this false dichotomy.

100M people signed up and did at least 1 task. Then, most likely some % of them discovered it was a useful thing (if for nothing else than just to make more memes), and converted into a MAU.

If I had to use my intuition, I would say it's 5% - 10%, which represents a larger product launch than most developers will ever participate in, in the context of a single day.

Of course the ongoing stickiness of the MAU also depends on the ability of this particular tool to stay on top amongst increasing competition.

oblio · 31d ago

Apparently OpenAI is losing money like crazy on this and their conversion rates to paid are abysmal, even for the cheaper licenses. And not even their top subscription covers its cost.

Uber at a 10x scale.

I should add that compared to the hype, at a global level Uber is a failure. Yes, it's still a big company, yes, it's profitable now, but I think it was launched 10+ years ago and it's barely becoming net profitable over its existence now and shows no signs of taking over the world. Sure, it's big in the US and a few specific markets. But elsewhere it's either banned for undermining labor practices or has stiff local competition or it's just not cost competitive and it won't enter the market because without the whole "gig economy" scam it's just a regular taxi company with a better app.

simonw · 31d ago

Is that information about their low conversion rates from credible sources?

oblio · 31d ago

It's quite hard to say for sure, and I will prefix my comment by saying his blog posts are very long and quite doomerist about LLMs, but he makes a decent case about OpenAI financials:

https://www.wheresyoured.at/wheres-the-money/

https://www.wheresyoured.at/openai-is-a-systemic-risk-to-the...

A very solid argument is like that against propaganda: it's not so much about what is being said but about what isn't. OpenAI is basically shouting about every minor achievement from the rooftops so the fact that they are remarkably silent about financial fundamentals says something. At best something mediocre or more likely bad.

cdblades · 31d ago

All very fair caveats/heads up about Ed Zitron, but just for context for others: he is an actual journalist that has been in the tech space for a long time, and has been critical of lots of large figures in tech for a long time. He has a cohesive thesis around the tech industry, so his thoughts on AI/LLMs aren't out of nowhere and disconnected.

Basically, it's one of those things you may read and find that, all things considered, you don't agree with the conclusions, but there's real substance there and you'll probably benefit from reading a few of his articles.

ben_w · 32d ago

While 100M signing up just for one pic is certainly possible, I note that several hundred million people regularly share photographs of their lunch, so it is very plausible that in signing up for the latest meme generator they found they liked the ability to generate custom images of whatever they consider to be pretty pictures every day.

otabdeveloper4 · 31d ago

> 100 million people didn’t sign up to make that one image meme and then never use it again.

Source? They did exactly that.

simonw · 31d ago

What's your source for saying they did exactly that?

baq · 32d ago

It’s hard to think of a worse analogy TBH. My wife is using ChatGPT to change photos (still is to this day), she didn’t use it or any other LLM until that feature hit. It is a fad, but it’s also a very useful tool.

Ape NFTs are… ape NFTs. Useless. Pointless. Negative value for most people.

Jedd · 31d ago

I would note that I was replying to a comment about the 'big trend of ghiblification' of images.

Reproducing a certain style of image has been a regular fad since profile pictures became a thing sometime last century.

I was not meaning to suggest that large language & diffusion models are fads.

(I do think their capabilities are poorly understood and/or over-estimated by non-technical and some technical people alike, but that invites a more nuanced discussion.)

While I'm sure your wife is getting good value out of the system, whether it's a better fit for purpose, produces a better quality, or provides a more satisfying workflow -- than say a decent free photo editor -- or whether other tools were tried but determined to be too limited or difficult, etc -- only you or her could say. It does feel like a small sample set, though.

senthil_rajasek · 32d ago

"My wife is using ChatGPT to change photos (still is to this day), she didn’t use it or any other LLM until that feature hit."

This is deja vu, except instead of ChatGPT to edit photos it was instagram a decade ago.

jauntywundrkind · 32d ago

Applying some filters and adding some overlay text is something some folks did, but there's such a massive creative world that's opened up, where all we have to do is ask.

baq · 32d ago

You either haven’t tried it or are just trolling.

senthil_rajasek · 32d ago

I am contrasting how instagram filters gave users some control and increased user base and how today editing photos with LLMs is doing the same and pulling in a wider user base.

djhn · 32d ago

I tried it and I don’t get it. What and where are the legal usecases? What can you do with these low-resolution images?

micromacrofoot · 32d ago

they're not but I'm already seeing ai generated images on billboards for local businesses, they're in production workflows now and they aren't going anywhere

sandspar · 31d ago

I just don't understand how people can see "100 million signups in a week" and immediately dismiss it. We're not talking about fidget spinners. I don't get why this sentiment is so common here on HackerNews. It's become a running joke in other online spaces, "HackerNews commenters keep saying that AI is a nothingburger." It's just a groupthink thing I guess, a kneejerk response.

pintxo · 31d ago

I assume, when people dismiss it, they are not looking at it through the business lens and the 100m user signups KPI, but they are dismissing it on technical grounds, as an LLM is just a very big statistical database which seems incapable of solving problems beyond (impressive looking) text/image/video generation.

sandspar · 31d ago

Makes sense. Although I think that's an error. TikTok is "just" a video sharing site. Joe Rogan is "just" a podcaster. Dumb things that affect lots of people are important.

otabdeveloper4 · 31d ago

> We're not talking about fidget spinners.

We're talking about Hitler memes instead? I don't understand your feigned outrage.

The actual valid commercial use case for generative images hasn't been found yet. (No, making blog spam prettier is not a good use case.)

simonw · 31d ago

Everything Everywhere All At Once won a bunch of Oscars. They used generative AI tools for some of their post-production work (achieved by a tiny team), for example to help clean up the backgrounds in the scene with the silent dialog between the two rocks.

stavros · 31d ago

You're right, nothing has value unless someone figures out how to make money with it. Except OpenAI, apparently, because the fact that people buy ChatGPT to make images doesn't seem to count as a commercial use case.

otabdeveloper4 · 31d ago

OpenAI is not profitable and we don't know if it ever will be.

dbdoskey · 29d ago

OpenAI is not profitable because it is spending resources into moving forward and training new models and creating new tools.

stavros · 31d ago

Have we shifted the goalposts from "something people will pay for" to "needs to be profitable even with massive R&D" then?

otabdeveloper4 · 31d ago

OpenAI is not "something people will pay for" at the moment though.

stavros · 31d ago

Except lots of people are paying for it. I'll refer you to the other post on the front page for the calculation that OpenAI would have to get just an extra $10/yr from their users to break even.

otabdeveloper4 · 31d ago

Your response reminds me of that joke about selling a dollar bill for ninety cents.

stavros · 31d ago

Your response makes me think we have different definitions for profitability.

herval · 32d ago

They still are. Instagram is full of accounts posting gpt-generated cartoons (and now veo3 videos). I’ve been tracking the image generation space from day one, and it never stuck like this before

simonw · 32d ago

Anecdotally, I've had several conversations with people way outside the hyper-online demographic who have been really enjoying the new ChatGPT image generation - using it for cartoon photos of their kids, to create custom birthday cards etc.

I think it's broken out into mainstream adoption and is going to stay there.

It reminds me a little of Napster. The Napster UI was terrible, but it let people do something they had never been able to do before: listen to any piece of music ever released, on-demand. As a result people with almost no interest in technology at all were learning how to use it.

Most people have never had the ability to turn a photo of their kids into a cute cartoon before, and it turns out that's something they really want to be able to do.

herval · 32d ago

Definitely. It’s not just online either - half the billboards I see now are AI. The posters at school. The “we’re hiring!” ad at the local McDonalds. It’s 100x cheaper and faster than any alternative (stock images, hiring an editor or illustrator, etc), and most non technical people can get exactly what they want in a single shot, these days.

MattRix · 32d ago

To be clear: they already had image generation in ChatGPT, but this was a MUCH better one than what they had previously. Even for you with your stable diffusion app, it would be a significant upgrade. Not just because of image quality, but because it can actually generate coherent images and follow instructions.

MIC132 · 31d ago

As impressive as it is, for some uses it still is worse than a local SD model. It will refuse to generate named anime characters (because of copyright, or because it just doesn't know them, even not particularly obscure ones) for example. Or obviously anything even remotely spicy. As someone who mostly uses image generation to amuse myself (and not to post it, where copyright might matter) it's honestly somewhat disappointing. But I don't expect any of the major AI companies to release anything without excessive guardrails.

bufferoverflow · 32d ago

Have you missed how everyone was Ghiblifying everything?

adrian17 · 32d ago

I saw that, I just didn't connect it with newly added multimodal image generation. I knew variations of style transfer (or LoRA for SD) were possible for years, so I assumed it exploded in popularity purely as a meme, not due to OpenAI making it much more accessible.

Again, I was aware that they added image generation, just not how much of a deal it turned out to be. Think of it like me occasionally noticing merchandise and TV trailers for a new movie without realizing it became the new worldwide box office #1.

andrepd · 32d ago

Oh you mean the trend of the day on the social media monoculture? I don't take that as an indicator of any significance.

Philpax · 32d ago

One should not be proud of their ignorance.

DaSHacka · 31d ago

Except when it comes to using social media, where "ignorance" unironically is strength

nathan_phoenix · 32d ago

My biggest gripe is that he's comparing probabilistic models (LLMs) by a single sample.

You wouldn't compare different random number generators by taking one sample from each and then concluding that generator 5 generates the highest numbers...

Would be nicer to run the comparison with 10 images (or more) for each LLM and then average.
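To illustrate the sampling concern, here is a small simulation sketch. Everything in it is an assumption made for illustration: two hypothetical models whose per-image "quality scores" are normally distributed with means 6.0 and 5.0 and the same spread. It estimates how often the weaker model appears to win when each model is judged on one sample versus the average of ten.

```python
import random
from statistics import mean


def wrong_ranking_rate(samples_per_model: int, trials: int = 2000, seed: int = 0) -> float:
    """Fraction of trials in which the weaker model's average score comes out on top."""
    rng = random.Random(seed)
    wrong = 0
    for _ in range(trials):
        # Hypothetical score distributions; the true means differ by 1.0.
        better = mean([rng.gauss(6.0, 2.0) for _ in range(samples_per_model)])
        worse = mean([rng.gauss(5.0, 2.0) for _ in range(samples_per_model)])
        if worse > better:
            wrong += 1
    return wrong / trials


print("1 sample each:  ", wrong_ranking_rate(1))
print("10 samples each:", wrong_ranking_rate(10))
```

Under these assumptions the averaged comparison misranks the two models noticeably less often, which is the point being made about single-sample comparisons.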

simonw · 32d ago

It might not be 100% clear from the writing but this benchmark is mainly intended as a joke - I built a talk around it because it's a great way to make the last six months of model releases a lot more entertaining.

I've been considering an expanded version of this where each model outputs ten images, then a vision model helps pick the "best" of those to represent that model in a further competition with other models.

(Then I would also expand the judging panel to three vision LLMs from different model families which vote on each round... partly because it will be interesting to track cases where the judges disagree.)

I'm not sure if it's worth me doing that though since the whole "benchmark" is pretty silly. I'm on the fence.

demosthanos · 32d ago

I'd say definitely do not do that. That would make the benchmark look more serious while still being problematic for knowledge cutoff reasons. Your prompt has become popular even outside your blog, so the odds of some SVG pelicans on bicycles making it into the training data have been going up and up.

Karpathy used it as an example in a recent interview: https://www.msn.com/en-in/health/other/ai-expert-asks-grok-3...

diggan · 32d ago

Yeah, this is the problem with benchmarks where the questions/problems are public. They're valuable for some months, until it bleeds into the training set. I'm certain a lot of the "improvements" we're seeing are just benchmarks leaking into the training set.

travisgriggs · 32d ago

That’s ok, once bicycle “riding” pelicans become normative, we can ask it for images of pelicans humping bicycles.

The number of subject-verb-objects are near infinite. All are imaginable, but most are not plausible. A plausibility machine (LLM) will struggle with the implausible, until it can abstract well.

zahlman · 32d ago

I can't fathom this working, simply because building a model that relates the word "ride" to "hump" seems like something that would be orders of magnitude easier for an LLM than visualizing the result of SVG rendering.

diggan · 32d ago

> The number of subject-verb-objects are near infinite. All are imaginable, but most are not plausible

Until there are enough unique/new subject-verb-object examples/benchmarks that the trained model actually generalizes the pattern, just like you did. (Public) benchmarks need to constantly evolve, otherwise they stop being useful.

demosthanos · 32d ago

To be fair, once it does generalize the pattern then the benchmark is actually measuring something useful for deciding if the model will be able to produce a subject-verb-object SVG.

throwaway31131 · 32d ago

I’d say it doesn’t really matter. There is no universally good benchmark and really they should only be used to answer very specific questions which may or may not be relevant to you.

Also, as the old saying goes, the only thing worse than using benchmarks is not using benchmarks.

6LLvveMx2koXfwn · 32d ago

I would definitely say he had no intention of doing that and was doubling down on the original joke.

colecut · 32d ago

The road to hell is paved with the best intentions

clarification: I enjoyed the pelican on a bike and don't think it's that bad =p

telotortium · 32d ago

Yeah, Simon needs to release a new benchmark under a pen name, like Stephen King did with Richard Bachman.

Breza · 27d ago

Richard Bachman, you say? https://chatgpt.com/share/684c3f20-575c-800a-9ea2-889dd3deaf...

fzzzy · 32d ago

Even if it is a joke, having a consistent methodology is useful. I did it for about a year with my own private benchmark of reasoning type questions that I always applied to each new open model that came out. Run it once and you get a random sample of performance. Got unlucky, or got lucky? So what. That's the experimental protocol. Running things a bunch of times and cherry picking the best ones adds human bias, and complicates the steps.

simonw · 32d ago

It wasn't until I put these slides together that I realized quite how well my joke benchmark correlates with actual model performance - the "better" models genuinely do appear to draw better pelicans and I don't really understand why!

pama · 32d ago

How did the pelicans of point releases of V3 and of R1 (R1-0528) do compared to the original versions of the models?

og_kalu · 32d ago

LLMs also have a 'g factor' https://www.sciencedirect.com/science/article/pii/S016028962...

MichaelZuo · 32d ago

I imagine the straightforward reason is that the “better” models are in fact significantly smarter in some tangible way, somehow.

johnrob · 32d ago

Well, the most likely single random sample would be a “representative” one :)

tuananh · 32d ago

until they start targeting this benchmark

simonw · 32d ago

Right, that was the closing joke for the talk.

jonstewart · 32d ago

It is funny to think that a hundred years in the future there may be some vestigial area of the models’ networks that’s still tuned to drawing pelicans on bicycles.

more-nitor · 32d ago

I just don't get the fuss from the pro-LLM people who don't want anyone to shame their LLMs...

people expect LLMs to say "correct" stuff on the first attempt, not 10000 attempts.

Yet, these people are perfectly OK with cherry-picked success stories on youtube + advertisements, while being extremely vehement about this simple experiment...

...well maybe these people rode the LLM hype-train too early, and are desperate to defend LLMs lest their investment go poof?

obligatory hype-graph classic: https://upload.wikimedia.org/wikipedia/commons/thumb/9/94/Ga...

Breza · 27d ago

Another advantage is you can easily include deprecated models in your comparisons. I maintain our internal LLM rankings at work. Since the prompts have remained the same, I can do things like compare the latest Gemini Pro to the original Bard.

Breza · 27d ago

I'd be really interested in evaluating the evaluations of different models. At work, I maintain our internal LLM benchmarks for content generation. We've always used human raters from MTurk, and the Elo rankings generally match what you'd expect. I'm looking at our options for having LLMs do the evaluating.

In your case, it would be neat to have a bunch of different models (and maybe MTurk) pick the winners of each head-to-head matchup and then compare how stable the Elo scores are between evaluators.
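A rough sketch of the comparison described above, using the standard Elo update rule. The K-factor of 32, the starting rating of 1500, and the model and judge names are all assumptions for illustration; the thread does not specify any of them.

```python
from collections import defaultdict


def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of A against B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_ratings(matchups, k: float = 32.0, start: float = 1500.0) -> dict:
    """Compute ratings from an ordered list of (winner, loser) pairs."""
    ratings = defaultdict(lambda: start)
    for winner, loser in matchups:
        gain = k * (1.0 - expected_score(ratings[winner], ratings[loser]))
        ratings[winner] += gain
        ratings[loser] -= gain
    return dict(ratings)


def ranking(ratings: dict) -> list:
    """Model names ordered from highest to lowest rating."""
    return sorted(ratings, key=ratings.get, reverse=True)


# Hypothetical verdicts from two evaluators on the same three matchups.
judge_1 = [("model_a", "model_b"), ("model_a", "model_c"), ("model_b", "model_c")]
judge_2 = [("model_a", "model_b"), ("model_c", "model_a"), ("model_b", "model_c")]
print(ranking(elo_ratings(judge_1)))
print(ranking(elo_ratings(judge_2)))
```

Comparing the resulting orderings (or the raw rating gaps) across evaluators is one simple way to see how stable the scores are.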

dilap · 32d ago

Joke or not, it still correlates much better with my own subjective experiences of the models than LM Arena!

ontouchstart · 32d ago

Very nice talk, acceptable by general public and by AI agent as well.

Any concerns about open source “AI celebrity talks” like yours can be used in contexts that would allow LLM models to optimize their market share in ways that we can’t imagine yet?

Your talk might influence the funding of AI startups.

#butterflyEffect

threecheese · 32d ago

I welcome a VC funded pelican … anything! Clippy 2.0 maybe?

Simon, hope you are comfortable in your new role of AI Celebrity.

planb · 32d ago

And by a sample that has become increasingly known as a benchmark. Newer training data will contain more articles like this one, which naturally improves the capabilities of an LLM to estimate what’s considered a good „pelican on a bike“.

criddell · 32d ago

And that’s why he says he’s going to have to find a new benchmark.

viraptor · 32d ago

Would it though? There really aren't that many valid answers to that question online. When this is talked about, we get more broken samples than reasonable ones. I feel like any talk about this actually sabotages future training a bit.

I actually don't think I've seen a single correct svg drawing for that prompt.

cyanydeez · 32d ago

So what you really need to do is clone this blog post, find and replace pelican with any other noun, run all the tests, and publish that.

Call it wikipediaslop.org

YuccaGloriosa · 32d ago

If the any other noun becomes fish... I think I disagree.

puttycat · 32d ago

You are right, but the companies making these models invest a lot of effort in marketing them as anything but probabilistic, i.e. making people think that these models work discretely like humans.

In that case we'd expect a human with perfect drawing skills and perfect knowledge about bikes and birds to output such a simple drawing correctly 100% of the time.

In any case, even if a model is probabilistic, if it had correctly learned the relevant knowledge you'd expect the output to be perfect because it would serve to lower the model's loss. These outputs clearly indicate flawed knowledge.

ben_w · 32d ago

> In that case we'd expect a human with perfect drawing skills and perfect knowledge about bikes and birds to output such a simple drawing correctly 100% of the time.

Look upon these works, ye mighty, and despair: https://www.gianlucagimini.it/portfolio-item/velocipedia/

jodrellblank · 32d ago

You claim those are drawn by people with "perfect knowledge about bikes" and "perfect drawing skills"?

ben_w · 32d ago

More that "these models work … like humans" (discretely or otherwise) does not imply the quotation.

Most humans do not have perfect drawing skills and perfect knowledge about bikes and birds, they do not output such a simple drawing correctly 100% of the time.

"Average human" is a much lower bar than most people want to believe, mainly because most of us are average on most skills, and also overestimate our own competence — the modal human has just a handful of things they're good at, and one of those is the language they use, another is their day job.

Most of us can't draw, and demonstrably can't remember (or figure out from first principles) how a bike works. But this also applies to "smart" subsets of the population: physicists have https://xkcd.com/793/, and there's this famous rocket scientist who weighed in on rescuing kids from a flooded cave and came up with some nonsense about a submarine.

Retric · 32d ago

It’s not that humans have perfect drawing skills, it’s that humans can judge their performance and get better over time.

Ask 100 random people to draw a bike in 10 minutes and they’ll on average suck while still beating the LLM’s here. Give em an incentive and 10 months and the average person is going to be able to make at least one quite decent drawing of a bike.

The cost and speed advantage of LLM’s is real as long as you’re fine with extremely low quality. Ask a model for 10,000 drawings so you can pick the best and you get a marginal improvement based on random chance at a steep price.

ben_w · 32d ago

> Ask 100 random people to draw a bike in 10 minutes and they’ll on average suck while still beating the LLM’s here.

Y'see, this is a prime example of what I meant with ""Average human" is a much lower bar than most people want to believe, mainly because most of us are average on most skills, and also overestimate our own competence".

An expert artist can spend 10 minutes and end up with a brief sketch of a bike. You can witness this exact duration yourself (with non-bike examples) because of a challenge a few years back to draw the same picture in 10 minutes, 1 minute, and 10 seconds.

A normal person spending as much time as they like gets you the pictures that I linked to in the previous post, because they don't really know what a bike is. 45 examples of what normal people think a bike looks like: https://www.gianlucagimini.it/portfolio-item/velocipedia/

> Give em an incentive and 10 months and the average person is going to be able to make at least one quite decent drawing of a bike.

Given mandatory art lessons in school are longer than 10 months, and yet those bike examples exist, I have no reason to believe this.

> Ask a model for 10,000 drawings so you can pick the best and you get a marginal improvement based on random chance at a steep price.

If you do so as a human, rating and comparing images? Then the cost is your own time.

If you automate it in literally the manner in this write-up (pairwise comparison via API calls to another model to get ELO ratings), ten thousand images is like $60-$90, which is on the low end for a human commission.

Retric · 32d ago

As an objective criterion: what percentage include pedals and a chain connecting one of the wheels? I quickly found a dozen and stopped counting. Now do the same for those LLM images and it’s clear humans win.

> ""Average human" is a much lower bar than most people want to believe

I have some basis for comparison. I’ve seen 6 years olds draw better bikes than those LLM’s.

Look through that list again the worst example does even have wheels, multiple of them have wheels without being connected to anything.

Now if you’re arguing the average human is worse than the average 6 year old I’m going to disagree here.

> Given mandatory art lessons in school are longer than 10 months, and yet those bike examples exist, I have no reason to believe this.

Art lessons don’t cumulatively spend 10 months teaching people how to draw a bike. I don’t think I cumulatively spent 6 months drawing anything. Painting, collage, sculpture, coloring, etc.: art covers a lot and wasn’t an every day or even every year thing. My mandatory college class was art history; we didn’t create any art.

You may have spent more time in class studying drawing, but that’s not some universal average.

> If you automate it in literally the manner in this write-up (pairwise comparison via API calls to another model to get ELO ratings), ten thousand images is like $60-$90, which is on the low end for a human commission.

Not every one of those images had a price tag, but one was 88 cents; × 10,000 = $8,800 just to make the images for a test. Even at 4c/image you're looking at $400. Cheaper models existed but fairly consistently had worse performance.

simonw · 32d ago

The 88 cent one was the most expensive almost by an order of magnitude. Most of these cost less than a cent to generate - that's why I highlighted the price on the o1 pro output.

Retric · 32d ago

Yes, but if you’re averaging cheap and expensive options the expensive ones make a significant difference. Cheaper is bound by 0 so it can’t differ as much from the average.

Also, when you’re talking about how cheap something is, including the price makes sense. I had no idea on many of those models.

simonw · 32d ago

If you're interested, you can get cost estimates from my pricing calculator site here: https://www.llm-prices.com/#it=11&ot=1200

That link seeds it with 11 input tokens and 1200 output tokens - 11 input tokens is what most models use for "Generate an SVG of a pelican riding a bicycle" and 1200 is the number of output tokens used for some of the larger outputs.

Click on different models to see estimated prices. They range from 0.0168 cents for Amazon Nova Micro (that's less than 2/100ths of a cent) up to 72 cents for o1-pro.

The most expensive model most people would consider is Claude 4 Opus, at 9 cents.

GPT-4o is the upper end of the most common prices, at 1.2 cents.
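The arithmetic behind those estimates can be sketched as follows. This is a minimal illustration, not the calculator's actual code: the per-million-token prices and the model keys are assumptions, chosen because they reproduce the figures quoted above for 11 input tokens and 1200 output tokens.

```javascript
// ASSUMED prices in US dollars per million tokens (not taken from the thread).
// The model keys are hypothetical labels for illustration only.
const PRICES = {
  'amazon-nova-micro': { input: 0.035, output: 0.14 },
  'gpt-4o': { input: 2.5, output: 10 },
  'claude-opus-4': { input: 15, output: 75 },
  'o1-pro': { input: 150, output: 600 },
};

// Cost of one pelican in cents: tokens times price per token, summed.
function costCents(model, inputTokens = 11, outputTokens = 1200) {
  const p = PRICES[model];
  const dollars = (inputTokens * p.input + outputTokens * p.output) / 1e6;
  return dollars * 100;
}

// Cost in dollars of generating `count` images with one model.
function batchDollars(model, count) {
  return (costCents(model) * count) / 100;
}

for (const model of Object.keys(PRICES)) {
  console.log(model, costCents(model).toFixed(4), 'cents per image');
}
```

Under these assumed prices, `batchDollars('gpt-4o', 10000)` comes to roughly $120, which shows how strongly the 10,000-image total discussed upthread depends on which model is picked.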

Retric · 32d ago

Thanks

zahlman · 32d ago

> A normal person spending as much time as they like gets you the pictures that I linked to in the previous post, because they don't really know what a bike is. 45 examples of what normal people think a bike looks like: https://www.gianlucagimini.it/portfolio-item/velocipedia/

A normal person given the ability to consult a picture of a bike while drawing will do much better. An LLM agent can effectively refresh its memory (or attempt to look up information on the Internet) any time it wants.

ben_w · 31d ago

> A normal person given the ability to consult a picture of a bike while drawing will do much better. An LLM agent can effectively refresh its memory (or attempt to look up information on the Internet) any time it wants.

Some models can when allowed to, but I don't believe Simon Willison was testing that?

rightbyte · 32d ago

That blog post is a 10/10. Oh dear I miss the old internet.

cyanydeez · 32d ago

Humans absolutely do not work discretely.

loloquwowndueo · 32d ago

They probably meant deterministically as opposed to probabilistically. Though humans don't work like that either :)

aspenmayer · 32d ago

I thought they meant discreetly.

bufferoverflow · 32d ago

> work discretely like humans

What kind of humans are you surrounded by?

Ask any human to write 3 sentences about a specific topic. Then ask them the same exact question next day. They will not write the same 3 sentences.

mooreds · 32d ago

My biggest gripe is that he outsourced evaluation of the pelicans to another LLM.

I get it was way easier to do and that doing it took pennies and no time. But I would have loved it if he'd tried alternate methods of judging and seen what the results were.

Other ways:

* wisdom of the crowds (have people vote on it)

* wisdom of the experts (send the pelican images to a few dozen artists or ornithologists)

* wisdom of the LLMs (use more than one LLM)

Would have been neat to see what the human consensus was and if it differed from the LLM consensus

Anyway, great talk!

zahlman · 32d ago

It would have been interesting to see if the LLM that Claude judged worst would have attempted to justify itself....

timewizard · 32d ago

My biggest gripe is he didn't include a picture of an actual pelican.

https://www.google.com/search?q=pelican&udm=2

The "closest pelican" is not even close.

qeternity · 32d ago

I think you mean non-deterministic, instead of probabilistic.

And there is no reason that these models need to be non-deterministic.

skybrian · 32d ago

A deterministic algorithm can still be unpredictable in a sense. In the extreme case, a procedural generator (like in Minecraft) is deterministic given a seed, but you will still have trouble predicting what you get if you change the seed, because internally it uses a (pseudo-)random number generator.

So there’s still the question of how controllable the LLM really is. If you change a prompt slightly, how unpredictable is the change? That can’t be tested with one prompt.
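The seed point can be illustrated with a toy generator. This is a sketch only: the linear congruential generator below stands in for whatever a real procedural generator uses, and `generate` is a hypothetical name.

```javascript
// A tiny seeded pseudo-random generator (a linear congruential generator).
// Used here purely for illustration; real generators use stronger algorithms.
function makeRng(seed) {
  let state = seed >>> 0;
  return () => {
    // Advance the 32-bit state, then scale it into [0, 1).
    state = (Math.imul(state, 1664525) + 1013904223) >>> 0;
    return state / 2 ** 32;
  };
}

// "Procedural generation": a list of values fully determined by the seed.
function generate(seed, length = 8) {
  const rng = makeRng(seed);
  return Array.from({ length }, () => rng());
}

const first = generate(42);
const again = generate(42); // same seed: identical output every time
const other = generate(43); // neighbouring seed: a different sequence

console.log(JSON.stringify(first) === JSON.stringify(again));
console.log(JSON.stringify(first) === JSON.stringify(other));
```

The output is fully repeatable for a given seed, yet knowing the result for seed 42 tells you little about seed 43 without running the generator.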

rvz · 32d ago

> I think you mean non-deterministic, instead of probabilistic.

My thoughts too. It's more accurate to label LLMs as non-deterministic instead of "probabilistic".

zurichisstained · 32d ago

Wow, I love this benchmark - I've been doing something similar (as a joke, and much less frequently), where I ask multiple models to attempt to create a data structure like:

```
const melody = [
  { freq: 261.63, duration: 'quarter' }, // C4
  { freq: 0, duration: 'triplet' }, // triplet rest
  { freq: 293.66, duration: 'triplet' }, // D4
  { freq: 0, duration: 'triplet' }, // triplet rest
  { freq: 329.63, duration: 'half' }, // E4
]
```

But with the intro to Smoke on the Water by Deep Purple. Then I run it through the Web Audio API and see how it sounds.
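For readers wondering how a structure like that turns into sound, here is a minimal sketch. It is not the commenter's code: the beat lengths (in particular treating 'triplet' as one third of a beat) and the function names are assumptions.

```javascript
// ASSUMED mapping from duration names to beats; 'triplet' is taken to mean
// an eighth-note triplet, i.e. one third of a beat.
const BEATS = { quarter: 1, half: 2, triplet: 1 / 3 };

// Turn the note list into absolute start times and lengths in seconds.
function schedule(notes, bpm = 120) {
  const secondsPerBeat = 60 / bpm;
  let start = 0;
  return notes.map(({ freq, duration }) => {
    const length = BEATS[duration] * secondsPerBeat;
    const note = { freq, start, length };
    start += length;
    return note;
  });
}

// Browser only: play each scheduled note with a Web Audio oscillator.
function play(notes, bpm = 120) {
  const ctx = new AudioContext();
  for (const { freq, start, length } of schedule(notes, bpm)) {
    if (freq === 0) continue; // a frequency of 0 marks a rest
    const osc = ctx.createOscillator();
    osc.frequency.value = freq;
    osc.connect(ctx.destination);
    osc.start(ctx.currentTime + start);
    osc.stop(ctx.currentTime + start + length);
  }
}

const demoNotes = [
  { freq: 261.63, duration: 'quarter' }, // C4
  { freq: 0, duration: 'half' }, // rest
  { freq: 329.63, duration: 'quarter' }, // E4
];
console.log(schedule(demoNotes));
```

Keeping `schedule` separate from `play` means the timing logic can be checked outside a browser, while `play` needs a page (and usually a user gesture) before audio starts.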

It's never quite gotten it right, but it's gotten better, to the point where I can ask it to make a website that can play it.

I think yours is a lot more thoughtful about testing novelty, but it's interesting to see them attempt to do things that they aren't really built for (in theory!).

https://codepen.io/mvattuone/pen/qEdPaoW - ChatGPT 4 Turbo

https://codepen.io/mvattuone/pen/ogXGzdg - Claude Sonnet 3.7

https://codepen.io/mvattuone/pen/ZYGXpom - Gemini 2.5 Pro

Gemini is by far the best sounding one, but it's still off. I'd be curious how the latest and greatest (paid) versions fare.

(And just for comparison, here's the first time I did it... you can tell I did the front-end because there isn't much to it!) https://nitter.space/mvattuone/status/1646610228748730368#m

ojosilva · 32d ago

Drawbacks of using a pelican on a bicycle SVG: it's a very open-ended prompt with no specific criteria to judge, and lately the SVGs all start to look similar, or at least like they accomplished the same non-goals (there's a pelican, there's a bicycle and I'm not sure its feet should be on the saddle or on the pedals), so it's hard to agree on which is better. And, certainly, with an LLM as a judge, the entire game becomes double-hinged and who knows what to think.

Also, if it becomes popular, training sets may pick it up and improve models unfairly and unrealistically. But that's true of any known benchmark.

Side note: I'd really like to see the Language Benchmark Game become a prompt based languages * models benchmark game. So we could say model X excels at Python Fasta, etc. although then the risk is that, again, it becomes training set and the whole thing self-rigs itself.

dr_kretyn · 32d ago

I'm slightly confused by your example. What's the actual prompt? Is your expectation that a text model is going to know how to perform the exact song in audio?

zurichisstained · 32d ago

Ohhh absolutely not, that would be pretty wild - I just wanted to see if it could understand musical notation enough to come up with the correct melody.

I know there are far better ways to do gen AI with music, this was just a joke prompt that worked far better than I expected.

My naive guess is all of the guitar tabs and signal processing info it's trained on gives it the ability to do stuff like this (albeit not very well).

bredren · 32d ago

Great writeup.

This measure of LLM capability could be extended by taking it into the 3D domain.

That is, having the model write Python code for Blender, then running blender in headless mode behind an API.

The talk hints at this but one shot prompting likely won’t be a broad enough measurement of capability by this time next year. (Or perhaps now, even)

So the test could also include an agentic portion that includes consultation of the latest blender documentation or even use of a search engine for blog entries detailing syntax and technique.

For multimodal input processing, it could take into account a particular photo of a pelican as the test subject.

For usability, the objects can be converted to iOS’s native 3d format that can be viewed in mobile safari.

I built this workflow, including a service for Blender, as an initial test of what was possible in October of 2022. It took post-processing for common syntax errors back then, but I'd imagine the newer LLMs would make those mistakes less often now.

joshstrange · 32d ago

I really enjoy Simon’s work in this space. I’ve read almost every blog post they’ve posted on this and I love seeing them poke and prod the models to see what pops out. The CLI tools are all very easy to use and complement each other nicely all without trying to do too much by themselves.

And at the end of the day, it’s just so much fun to see someone else having so much fun. He’s like a kid in a candy store and that excitement is contagious. After reading every one of his blog posts, I’m inspired to go play with LLMs in some new and interesting way.

Thank you Simon!

alanmoraes · 19d ago

I also like what he writes and the way he does it.

blackhaj7 · 32d ago

Same sentiment!

dotemacs · 32d ago

The same here.

Because of him, I installed a RSS reader so that I don't miss any of his posts. And I know that he shares the same ones across Twitter, Mastodon & Bsky...

franze · 32d ago

Here Claude Opus Extended Thinking https://claude.ai/public/artifacts/707c2459-05a1-4a32-b393-c...

ramesh31 · 32d ago

Single shot?

franze · 32d ago

2 shot, first one did just generate the svg not the shareable html page around it. in the second go it also worked on the svg as i did not forbid it.

anon373839 · 32d ago

Enjoyable write-up, but why is Qwen 3 conspicuously absent? It was a really strong release, especially the fine-grained MoE which is unlike anything that’s come before (in terms of capability and speed on consumer hardware).

simonw · 32d ago

Omitting Qwen 3 is my great regret about this talk. Honestly I only realized I had missed it after I had delivered the talk!

It's one of my favorite local models right now, I'm not sure how I missed it when I was reviewing my highlights of the last six months.

Maxious · 32d ago

Cut for time - qwen3 was pelican tested too https://simonwillison.net/2025/Apr/29/qwen-3/

username223 · 32d ago

Interesting timeline, though the most relevant part was at the end, where Simon mentions that Google is now aware of the "pelican on bicycle" question, so it is no longer useful as a benchmark. FWIW, many things outside of the training data will pants these models. I just tried this query, which probably has no examples online, and Gemini gave me the standard puzzle answer, which is wrong:

"Say I have a wolf, a goat, and some cabbage, and I want to get them across a river. The wolf will eat the goat if they're left alone, which is bad. The goat will eat some cabbage, and will starve otherwise. How do I get them all across the river in the fewest trips?"

A child would pick up that you have plenty of cabbage, but can't leave the goat without it, lest it starve. Also, there's no mention of boat capacity, so you could just bring them all over at once. Useful? Sometimes. Intelligent? No.

NohatCoder · 32d ago

If you calculate Elo based on a round-robin tournament with all participants starting out on the same score, then the resulting ratings should simply correspond to the win count. I guess the algorithm in use takes into account the order of the matches, but taking order into account is only meaningful when competitors are expected to develop significantly; otherwise it is just added noise, so we never want to do so in competitions between bots.

I also can't help but notice that the competition is exactly one match short, for some reason exactly one of the 561 possible pairings has not been included.
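The order-dependence point can be checked with a small sketch. The Elo update below is the standard textbook formula with an assumed K-factor of 32 and starting rating of 1500; it is not the code used for the talk.

```javascript
// Number of distinct pairings in a round-robin of n competitors.
function pairings(n) {
  return (n * (n - 1)) / 2;
}

// Plain win counts: the result does not depend on match order.
function winCounts(matches) {
  const wins = new Map();
  for (const [winner, loser] of matches) {
    wins.set(winner, (wins.get(winner) ?? 0) + 1);
    if (!wins.has(loser)) wins.set(loser, 0);
  }
  return wins;
}

// Sequential Elo updates (ASSUMED K = 32, start = 1500): order matters.
function eloRatings(matches, k = 32, start = 1500) {
  const ratings = new Map();
  for (const [winner, loser] of matches) {
    const rw = ratings.get(winner) ?? start;
    const rl = ratings.get(loser) ?? start;
    // Probability the eventual winner was expected to win.
    const expected = 1 / (1 + 10 ** ((rl - rw) / 400));
    const delta = k * (1 - expected);
    ratings.set(winner, rw + delta);
    ratings.set(loser, rl - delta);
  }
  return ratings;
}

// Each entry is [winner, loser]; the second list is the same matches reversed.
const forward = [['A', 'B'], ['A', 'C'], ['B', 'C']];
const backward = [...forward].reverse();

console.log(pairings(34)); // 561
console.log(eloRatings(forward).get('A'), eloRatings(backward).get('A'));
```

With the same three results fed in opposite orders, the win counts are identical while the Elo numbers differ slightly, which is the "added noise" described above.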

simonw · 32d ago

Yeah, that's a good call out: Elo isn't actually necessary if you can have every competitor battle every other competitor exactly once.

The missing match is because one single round was declared a draw by the model, and I didn't have time to run it again (the Elo stuff was very much rushed at the last minute.)

qwertytyyuu · 32d ago

https://imgur.com/a/mzZ77xI here are a few models I tried; looks like the newer version of Gemini is another improvement?

puttycat · 32d ago

The bicycles are still very far from actual ones.

simonw · 32d ago

I think the most recent Gemini Pro bicycle may be the best yet - the red frame is genuinely the right shape.

layer8 · 32d ago

The pelican, on the other hand...

pjs_ · 32d ago

https://www.gianlucagimini.it/portfolio-item/velocipedia/

landgenoot · 32d ago

If you would give a human the SVG documentation and ask to write an SVG, I think the results would be quite similar.

diggan · 32d ago

Lets give it a try, if you're willing to be the experiment subject :)

The prompt is "Generate an SVG of a pelican riding a bicycle" and you're supposed to write it by hand, so no graphical editor. The specification is here: https://www.w3.org/TR/SVG2/

I'm fairly certain I'd lose interest in getting it right before I got something better than most of those.

zahlman · 32d ago

> The colors use traditional bicycle brown (#8B4513) and a classic blue for the pelican (#4169E1) with gold accents for the beak (#FFD700).

The output pelican is indeed blue. I can't fathom where the idea that this is "classic", or suitable for a pelican, could have come from.

diggan · 32d ago

My guess would be that it doesn't see the web colors (CSS color hexes) as proper hex triplets, but because of tokenization it could be something dumb like '#8B','451','3' instead. I think the same issue happens around multiple special characters after each other too.

zahlman · 32d ago

No, it's understanding the colors properly. The SVG that the LLM created does use #4169E1 for the pelican color, and the LLM correctly describes this color as blue. The problem is that pelicans should not be blue.
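To see what those hex triplets actually encode, here is a small decoding sketch (the function name is made up for illustration):

```javascript
// Decode a 6-digit CSS hex color into [red, green, blue], each 0-255.
function hexToRgb(hex) {
  const match = /^#([0-9a-f]{6})$/i.exec(hex);
  if (!match) throw new Error(`not a 6-digit hex color: ${hex}`);
  const n = parseInt(match[1], 16);
  return [(n >> 16) & 255, (n >> 8) & 255, n & 255];
}

console.log(hexToRgb('#8B4513')); // [139, 69, 19]   brown (bicycle)
console.log(hexToRgb('#4169E1')); // [65, 105, 225]  blue (pelican)
console.log(hexToRgb('#FFD700')); // [255, 215, 0]   gold (beak)
```

The blue channel dominates in `#4169E1`, so the model's description of the color is accurate; the odd part is choosing that color for a pelican at all.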

cap11235 · 31d ago

Qwen3, at least, tokenizes each character of "#8B4513" separately.

mormegil · 32d ago

Did the testing prompt for LLMs include a clause forbidding the use of any tools? If not, why are you adding it here?

simonw · 32d ago

The way I run the pelican on a bicycle benchmark is to use this exact prompt:

  Generate an SVG of a pelican riding a bicycle

And execute it via the model's API with all default settings, not via their user-facing interface.

Currently none of the model APIs enable tools unless you ask them to, so this method excludes the use of additional tools.

diggan · 32d ago

The models that are being put under the "Pelican" testing don't use a GUI to create SVGs (either via "tools" or anything else), they're all Text Generation models so they exclusively use text for creating the graphics.

There are 31 posts listed under "pelican-riding-a-bicycle" in case you wanna inspect the methodology even closer: https://simonwillison.net/tags/pelican-riding-a-bicycle/

ramesh31 · 32d ago

>If you would give a human the SVG documentation and ask to write an SVG, I think the results would be quite similar.

It certainly would, and it would cost at minimum an hour of the human programmer's time at $50+/hr. Claude does it in seconds for pennies.

joshuajooste05 · 32d ago

Does anyone have any thoughts on privacy/safety regarding what he said about GPT memory?

I had heard of prompt injection already. But this seems different, completely out of human control. Like even when you consider web search functionality, he is actually right: more and more, users are losing control over context.

Is this dangerous atm? Do you think it will become more dangerous in the future when we chuck even more data into context?

ActorNightly · 32d ago

Sort of. The thing is with agentic models, you are basically entering probability space where it can do real actions in the form of http requests if the statistical output leads it to it.

threeseed · 32d ago

I've had Cursor/Claude try to call rm -rf on my entire User directory before.

The issue is that LLMs have no ability to organise their memory by importance. Especially as the context size gets larger.

So when they are using tools they will become more dangerous over time.

Joker_vD · 32d ago

> most people find it difficult to remember the exact orientation of the frame.

Isn't it Δ∇Λ welded together? The bottom left and right vertices are where the wheels are attached to, the middle bottom point is where the big gear with the pedals is. The lambda is for the front wheel because you wouldn't be able to turn it if it was attached to a delta. Right?

I guess having my first bicycle be a cheap Soviet-era produced one paid off: I spent loads of time fidgeting with the chain tension, and pulling the chain back onto the gears, so I guess I had to stare at the frame way too much to forget even by today the way it looks.

pbronez · 32d ago

There are a lot of structural details that people tend to gloss over. This was illustrated by an Italian art project:

https://www.gianlucagimini.it/portfolio-item/velocipedia/

> back in 2009 I began pestering friends and random strangers. I would walk up to them with a pen and a sheet of paper asking that they immediately draw me a men’s bicycle, by heart. Soon I found out that when confronted with this odd request most people have a very hard time remembering exactly how a bike is made.

irthomasthomas · 32d ago

The best pelicans come from running a consortium of models. I use pelicans as evals now. https://x.com/xundecidability/status/1921009133077053462 Test it using VibeLab (wip) https://x.com/xundecidability/status/1926779393633857715

nowayno583 · 32d ago

That was a very fun recap, thanks for sharing. It's easy to forget how much better these things have gotten. And this was in just six months! Crazy!

djherbis · 32d ago

Kaggle recently ran a competition to do just this (draw SVGs from prompts, using fairly small models under the hood).

The top results (click on the top Solutions) were pretty impressive: https://www.kaggle.com/competitions/drawing-with-llms/leader...

JimDabell · 32d ago

See also: The recent history of AI in 32 otters

https://www.oneusefulthing.org/p/the-recent-history-of-ai-in...

pbhjpbhj · 32d ago

That is otterly fantastic. The post there shows the breadth too - both otters generated via text representations (in TikZ) and by image generators. The video at the end, wow (and funny too).

Thanks for sharing.

nine_k · 32d ago

Am I the only one who can't but see these attempts much like attempts of a kid learning to draw?

Ygg2 · 32d ago

Yes. Kids don't draw that good of a line at the start.

Here is better example of start https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTfTfAA...

nine_k · 32d ago

Have you tried giving a kid a vector-drawing tool?

I did that to my daughter when she was not even 6 years old. The results were somehow similar: https://photos.app.goo.gl/XSLnTEUkmtW2n7cX8

(Now she's much better, but prefers raster tools, e.g. https://www.deviantart.com/sofiac9/art/Ivy-with-riding-gear-...)

0points · 31d ago

So the only bird slightly resembling a pelican beak was drawn by gemini 2.5 pro. In general, none of the output resembles a pelican enough so you could separate it from "a bird".

OP seems to ignore that a pelican has a distinct look when evaluating these doodles.

simonw · 31d ago

The pelican's distinct look - and the fact that none of the models can capture it - is the whole point.

0points · 30d ago

> The pelican's distinct look - and the fact that none of the models can capture it - is the whole point.

You didn't even mention the beak or the lack of similarities in your blog.

Your text is centered around this rather peculiar statement:

> Most importantly: pelicans can’t ride bicycles.

simonw · 30d ago

The blog was my attempt to capture the key ideas from the talk, which was full of jokes that don't come across as well in text as they do out loud.

"Pelicans can't ride bicycles" is a good joke.

jfengel · 32d ago

It's not so great at bicycles, either. None of those are close to rideable.

But bicycles are famously hard for artists as well. Cyclists can identify all of the parts, but if you don't ride a lot it can be surprisingly difficult to get all of the major bits of geometry right.

mattlondon · 32d ago

Most recent Gemini 2.5 one looks pretty good. Certainly rideable.

buserror · 31d ago

The hilarious bit is that this page will soon be scraped by ai-bots as learning material, and they'll all learn to draw pelicans on bicycles using this as their primary example material, as they'll be the only examples.

GIGO in motion :-)

zahlman · 32d ago

> If you lost interest in local models—like I did eight months ago—it’s worth paying attention to them again. They’ve got good now!

> As a power user of these tools, I want to stay in complete control of what the inputs are. Features like ChatGPT memory are taking that control away from me.

You reap what you sow....

> I already have a tool I built called shot-scraper, a CLI app that lets me take screenshots of web pages and save them as images. I had Claude build me a web page that accepts ?left= and ?right= parameters pointing to image URLs and then embeds them side-by-side on a page. Then I could take screenshots of those two images side-by-side. I generated one of those for every possible match-up of my 34 pelican pictures—560 matches in total.

Surely it would have been easier to use a local tool like ImageMagick? You could even have the AI write a Bash script for you.

> ... but prompt injection is still a thing.

...Why wouldn't it always be? There's no quoting or escaping mechanism that's actually out-of-band.

> There’s this thing I’m calling the lethal trifecta, which is when you have an AI system that has access to private data, and potential exposure to malicious instructions—so other people can trick it into doing things... and there’s a mechanism to exfiltrate stuff.

People in 2025 actually need to be told this. Franklin missed the mark - people today will trip over themselves to give up both their security and their liberty for mere convenience.

simonw · 32d ago

I had the LLM write a bash script for me that used my https://shot-scraper.datasette.io/ tool - on the basis that it was a neat opportunity to demonstrate another of my own projects.

And honestly, even with LLM assistance getting Image Magick to output a 1200x600 image with two SVGs next to each other that are correctly resized to fill their half of the image sounds pretty tricky. Probably easier (for Claude) to achieve with HTML and CSS.
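The match-up generation can be sketched like this. It is an illustration under assumptions, not the script from the talk: the base URL and file names are hypothetical, and only the `?left=` and `?right=` parameters come from the description quoted above.

```javascript
// Every unordered pair of items, each pair exactly once.
function allPairs(items) {
  const pairs = [];
  for (let i = 0; i < items.length; i++) {
    for (let j = i + 1; j < items.length; j++) {
      pairs.push([items[i], items[j]]);
    }
  }
  return pairs;
}

// URL of the side-by-side comparison page for one match-up.
function compareUrl(base, left, right) {
  const url = new URL(base);
  url.searchParams.set('left', left);
  url.searchParams.set('right', right);
  return url.toString();
}

// Hypothetical image URLs for 34 pelicans.
const images = Array.from(
  { length: 34 },
  (_, i) => `https://example.com/pelicans/${i + 1}.svg`
);

// Hypothetical location of the comparison page.
const urls = allPairs(images).map(([left, right]) =>
  compareUrl('https://example.com/compare.html', left, right)
);

console.log(urls.length); // 561
console.log(urls[0]);
```

Each resulting URL could then be handed to a screenshot tool. The full set is 561 pairings for 34 images; the talk's 560 reflects the one match that was not re-run.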

voiper1 · 32d ago

Isn't "left or right" _followed_ by rationale asking it to rationalize its 1 word answer - I thought we need to get AI to do the chain of thought _before_ giving its answer for it to be more accurate?

simonw · 32d ago

Yes it is - I would likely have gotten better results if I'd asked for the rationale first.
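A rationale-first variant might look like the sketch below. The prompt wording and the `WINNER:` convention are hypothetical, not what was used for the talk; the point is only that the verdict is requested last and parsed from the final line.

```javascript
// HYPOTHETICAL judging prompt: reasoning is requested before the verdict.
const JUDGE_PROMPT = [
  'Compare the two images of a pelican riding a bicycle.',
  'First explain your reasoning in a few sentences.',
  'Then finish with a final line of exactly "WINNER: left" or "WINNER: right".',
].join('\n');

// Read the verdict from the last line of the model reply, or null if absent.
function parseVerdict(reply) {
  const lines = reply.trim().split('\n');
  const last = lines[lines.length - 1].trim().toLowerCase();
  const match = /^winner:\s*(left|right)$/.exec(last);
  return match ? match[1] : null;
}

const sampleReply = 'The left bicycle has a recognizable frame.\nWINNER: left';
console.log(parseVerdict(sampleReply)); // left
```

Returning `null` for an unparseable reply gives a natural place to retry or record a draw.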

zahlman · 32d ago

> And honestly, even with LLM assistance getting Image Magick to output a 1200x600 image with two SVGs next to each other that are correctly resized to fill their half of the image sounds pretty tricky.

FWIW, the next project I want to look at after my current two, is a command-line tool to make this sort of thing easier. Likely featuring some sort of Lisp-like DSL to describe what to do with the input images.

dirtyhippiefree · 32d ago

Here’s the spot where we see who’s TL;DR…

> Claude 4 will rat you out to the feds!

>If you expose it to evidence of malfeasance in your company, and you tell it it should act ethically, and you give it the ability to send email, it’ll rat you out.

ben_w · 32d ago

I'd say that's too short.

> But it’s not just Claude. Theo Browne put together a new benchmark called SnitchBench, inspired by the Claude 4 System Card.

> It turns out nearly all of the models do the same thing.

dirtyhippiefree · 32d ago

I totally agree, but I needed you to post the other half because of TL;DR…

gscott · 32d ago

I am interested in this ratting you out thing. At some point you have a video feed into AI from a Jarvis like headset device, you walking down the street and cross the street in the middle not at a sidewalk... does it rat you out? Does it make a list of every crime no matter how small? Or just the big ones?

yubblegum · 32d ago

I was looking at that and wondering about swatting via LLMs by malicious users.

pier25 · 32d ago

Definitely getting better but even the best result is not very impressive.

darkoob12 · 31d ago

Should we be that excited about AI and calling a fraud and plagiarism machine "ChatGPT Mischief Buddy" without any moral deliberation?

simonw · 31d ago

The "mischief buddy" joke is a poke at exactly that.

spaceman_2020 · 32d ago

I don’t know what secret sauce Anthropic has, but in real world use, Sonnet is somehow still the best model around. Better than Opus and Gemini Pro

diggan · 32d ago

Statements like these are useless without sharing exactly all the models you've tried. Sonnet beats O1 Pro Mode for example? Not in my experience, but I haven't tried the latest Sonnet versions, only the one before, so wouldn't claim O1 Pro Mode beats everything out there.

Besides, it's so heavily context-dependent that you really need your own private benchmarks to make head or tails out of this whole thing.

deadbabe · 32d ago

As a control, he should go on Fiverr and have a human generate a pelican riding a bicycle, just to see what the eventual goal is.

gus_massa · 32d ago

Someone did this. Look at this sibling comment by ben_w https://news.ycombinator.com/item?id=44216284 about an old similar project.

zahlman · 32d ago

> back in 2009 I began pestering friends and random strangers. I would walk up to them with a pen and a sheet of paper asking that they immediately draw me a men’s bicycle, by heart.

Someone commissioned to draw a bicycle on Fiverr would not have to rely on memory of what it should look like. It would take barely any time to just look up a reference.

wohoef · 32d ago

Quite a detailed image using claude sonnet 4: https://ibb.co/39RbRm5W

mromanuk · 32d ago

The last animation is hilarious, represents very well the AI Hype cycle vs reality.

big_hacker · 32d ago

Honestly the metric which increased the most is the marketing and astroturfing budget of the major players (OpenAI, Anthropic, Google and Deepseek).

Say what you want about Facebook but at least they released their flagship model fully open.

mdaniel · 32d ago

> model fully open.

uh-huh https://www.llama.com/llama4/license/

bravesoul2 · 32d ago

Is there a good model (any architecture) for vector graphics out of interest?

simonw · 32d ago

I was impressed by Recraft v3, which gave me an editable vector illustration with different layers - https://simonwillison.net/2024/Nov/15/recraft-v3/ - but as I understand it that one is actually still a raster image generator with a separate step to convert to vector at the end.

bravesoul2 · 32d ago

Now that is a pelican on a bicycle! Thanks

neepi · 32d ago

My only take home is they are all terrible and I should hire a professional.

jug · 32d ago

Before that, you might ask ChatGPT to create a vector image of a pelican riding a bicycle and then running the output through a PNG to SVG converter...

Result: https://www.dropbox.com/scl/fi/8b03yu5v58w0o5he1zayh/pelican...

These are tough benchmarks to trial reasoning by having it _write_ an SVG file by hand and understanding how it's to be written to achieve this. Even a professional would struggle with that! It's _not_ a benchmark to give an AI the best tools to actually do this.

YuccaGloriosa · 32d ago

I think you made an error there png is a bitmap format

sethaurus · 32d ago

You've misunderstood. The parent was making a specific point: if you want an SVG of a pelican, the easiest way to AI-generate it is to get an image generator to create a (vector-styled) bitmap, then auto-vectorize it to SVG. But the point of this benchmark is that it's asking models to create an SVG the hard way, by writing its code directly.

keiferski · 32d ago

As the other guy said, these are text models. If you want to make images use something like Midjourney.

Promoting a pelican riding a bicycle makes a decent image there.

keiferski · 32d ago

* Prompting

dist-epoch · 32d ago

Most of them are text-only models. Like asking a person born blind to draw a pelican, based on what they heard it looks like.

neepi · 32d ago

That seems to be a completely inappropriate use case?

I would not hire a blind artist or a deaf musician.

__alexs · 32d ago

I guess the idea is that by asking the model to do something that is inherently hard for it we might learn something about the baseline smartness of each model which could be considered a predictor for performance at other tasks too.

simonw · 32d ago

Yeah, that's part of the point of this. Getting a state of the art text generating LLM to generate SVG illustrations is an inappropriate application of them.

It's a fun way to deflate the hype. Sure, your new LLM may have cost XX million to train and beat all the others on the benchmarks, but when you ask it to draw a pelican on a bicycle it still outputs total junk.

dist-epoch · 32d ago

tried starting from an image:

https://chatgpt.com/share/684582a0-03cc-8006-b5b5-de51e5cd89...

lol: https://gemini.google.com/share/4d1746a234a8

dist-epoch · 32d ago

The point is about exploring the capabilities of the model.

Like asking you to draw a 2D projection of 4D sphere intersected with a 4D torus or something.

kevindamm · 32d ago

Yeah, I suppose it is similar.. I don't know their diameters, rotations, nor the distance between their centers, nor which two dimensions, so I would have to guess a lot about what you meant.

namibj · 32d ago

It's a proxy for abstract designing, like writing software or designing in a parametric CAD.

Most of the non-math design work of applied engineering AFAIK falls under the umbrella that's tested with the pelican riding the bicycle. You have to make a mental model and then turn it into applicable instructions.

Program code/SVG markup/parametric CAD instructions don't really differ in that aspect.

neepi · 32d ago

I would not assume that this methodology applies to applied engineering, as a former actual real tangible meat space engineer. Things are a little nuanced and the nuances come from a combination of communication and experience, neither of which any LLM has any insight into at all. It's not out there on the internet to train it with and it's not even easy to put it into abstract terms which can be used as training data. And engineering itself in isolation doesn't exist - there is a whole world around it.

Ergo no you can't just say throw a bicycle into an LLM and a parametric model drops out into solidworks, then a machine makes it. And everyone buys it. That is the hope really isn't it? You end up with a useless shitty bike with a shit pelican on it.

The biggest problem we have in the LLM space is the fact that no one really knows any of the proposed use cases enough and neither does anyone being told that it works for the use cases.

rjsw · 32d ago

I don't think any of that matters, CEOs will decide to use it anyway.

neepi · 32d ago

This is sad but true.

dist-epoch · 32d ago

https://www.solidworks.com/lp/evolve-your-design-workflows-a...

neepi · 32d ago

Yeah good luck with that. Seriously.

dmd · 32d ago

Sorry, Beethoven, you just don’t seem to be a match for our org. Best of luck on your search!

You too, Monet. Scram.

wongogue · 32d ago

Even Beethoven?

vunderba · 32d ago

This test isn't really about the quality of the image itself (multimodals like gpt-image-1 or even standard diffusion models would be far superior) - it's about following a spec that describes how to draw.

A similar test would be if you asked for the pelican on a bicycle through a series of LOGO instructions.

spaceman_2020 · 32d ago

My only take home is that a spanner can work as a hammer, but you probably should just get a hammer

GaggiX · 32d ago

An expert at writing SVGs?

matkoniecz · 32d ago

It depends on the quality you need and your budget.

neepi · 32d ago

Ah yes the race to the bottom argument.

ben_w · 32d ago

When I was at university, they got some people from industry to talk to us all about our CVs and how to do interviews.

My CV had a stupid cliché, "committed to quality", which they correctly picked up on — "What do you mean?" one of them asked me, directly.

I thought this meant I was focussed on being the best. He didn't like this answer.

His example, blurred by 20 years of my imperfect human memory, was to ask me which is better: a Porsche, or a go-kart. Now, obviously (or I wouldn't be saying this), Porsche was a trick answer. Less obvious is that both were trick answers, because their point was that the question was under-specified: quality is the match between the product and what the user actually wants. If the user is a 10-year-old who physically isn't big enough to sit in a real car's driver's seat and just wants to rush down a hill or along a track, none of the "quality" stuff that makes a Porsche a Porsche is of any relevance at all. What does matter is the stuff that makes a go-kart into a go-kart… one of which is the affordability.

LLMs are go-karts of the mind. Sometimes that's all you need.

neepi · 32d ago

I disagree. Quality depends on your market position and what you are bringing to the market. Thus I would start with market conditions and work back to quality. If you can't reach your standards in the market then you shouldn't enter it. And if your standards are poor, you should be ashamed.

Go-kart or Porsche is irrelevant.

ben_w · 32d ago

> Quality depends on your market position and what you are bringing to the market.

That's the point.

The market for go-karts does not support Porsche.

If you bring a Porsche sales team to a go-kart race, nobody will be interested.

Porsche doesn't care about this market. It goes both ways: this market doesn't care about Porsche, either.

atxtechbro · 32d ago

Thank you, Simon! I really enjoyed your PyBay 2023 talk on embeddings and this is great too! I like the personalized benchmark. Hopefully the big LLM providers don't start gaming the pelican index!

beefnugs · 31d ago

I think it's hilarious how humans can make mistakes interpreting the crazy drawings. He says: "I like how it solved the problem of pelicans not fitting on bicycles by adding a second smaller bicycle to the stack."

No... that is an attempt at actually drawing the pedals, and putting the pelican's feet right on the pedals!

NicoSchwandner · 32d ago

Nice post, thanks!

m3047 · 32d ago

TIL: Snitchbench!
