The last six months in LLMs, illustrated by pelicans on bicycles
312 points by swyx | 97 comments | 6/8/2025, 7:38:37 AM | simonwillison.net ↗
This measure of LLM capability could be extended by taking it into the 3D domain.
That is, having the model write Python code for Blender, then running Blender in headless mode behind an API.
The talk hints at this, but one-shot prompting likely won't be a broad enough measurement of capability by this time next year (or perhaps even now).
So the test could also include an agentic portion: consulting the latest Blender documentation, or even using a search engine for blog posts detailing syntax and technique.
For multimodal input processing, it could take into account a particular photo of a pelican as the test subject.
For usability, the objects could be converted to iOS's native 3D format, which can be viewed in Mobile Safari.
I built this workflow, including a service for Blender, as an initial test of what was possible back in October 2022. It took post-processing for common syntax errors back then, but I'd imagine the newer LLMs make those mistakes less often now.
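The headless part is simple enough to sketch. Assuming the LLM's output is saved as pelican_bike.py (a made-up filename), something like this runs behind the API; the geometry below is just a placeholder, not real model output:

    # pelican_bike.py -- run with: blender --background --python pelican_bike.py
    import bpy

    # Start from an empty scene.
    bpy.ops.object.select_all(action='SELECT')
    bpy.ops.object.delete()

    # Placeholder geometry; a real LLM script would build the pelican and bicycle here.
    bpy.ops.mesh.primitive_torus_add(location=(0, 0, 1), major_radius=1.0, minor_radius=0.1)
    bpy.ops.mesh.primitive_uv_sphere_add(radius=0.5, location=(0, 0, 2.5))

    # Camera and light so the render isn't black.
    bpy.ops.object.camera_add(location=(6, -6, 4), rotation=(1.1, 0, 0.8))
    bpy.context.scene.camera = bpy.context.object
    bpy.ops.object.light_add(type='SUN', location=(4, -4, 8))

    # Render a still to disk; the API service returns this file.
    bpy.context.scene.render.filepath = "/tmp/pelican_render.png"
    bpy.ops.render.render(write_still=True)

The service side is then just: write the model's script to disk, shell out to blender --background, and hand back the PNG.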
Awkwardly, I never heard of it until now. I was aware that at some point they added the ability to generate images to the app, but I never realized it was a major thing (plus I already had an offline Stable Diffusion app on my phone, so it felt like less of an upgrade to me personally). With so much AI news each week, it feels like unless you're really invested in the space, it's almost impossible not to miss or dismiss some big release.
It really is incredible.
You wouldn't compare different random number generators by taking one sample from each and then concluding that generator 5 generates the highest numbers...
Would be nicer to run the comparison with 10 images (or more) for each LLM and then average.
I've been considering an expanded version of this where each model outputs ten images, then a vision model helps pick the "best" of those to represent that model in a further competition with other models.
(Then I would also expand the judging panel to three vision LLMs from different model families which vote on each round... partly because it will be interesting to track cases where the judges disagree.)
I'm not sure if it's worth me doing that though since the whole "benchmark" is pretty silly. I'm on the fence.
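If I do go ahead, the loop itself is easy to sketch. Here generate_svg and judge are stand-ins for real model API calls (the model names are made up), and majority voting is just one way to combine the three judges:

    import random
    from collections import Counter

    MODELS = ["model-a", "model-b", "model-c"]   # hypothetical model names
    JUDGES = ["judge-1", "judge-2", "judge-3"]   # three vision LLMs from different families
    SAMPLES_PER_MODEL = 10

    def generate_svg(model: str, seed: int) -> str:
        # Placeholder: call the model's API with the pelican prompt here.
        return f"<svg><!-- {model} sample {seed} --></svg>"

    def judge(judge_model: str, svg_a: str, svg_b: str) -> str:
        # Placeholder: ask a vision model which rendering looks more like
        # a pelican riding a bicycle; here we just flip a coin.
        return random.choice([svg_a, svg_b])

    def best_of(model: str) -> str:
        # Pick the strongest of N samples via a single-judge knockout.
        samples = [generate_svg(model, i) for i in range(SAMPLES_PER_MODEL)]
        winner = samples[0]
        for challenger in samples[1:]:
            winner = judge(JUDGES[0], winner, challenger)
        return winner

    def match(svg_a: str, svg_b: str) -> str:
        # Three judges vote; majority wins, and disagreements are worth logging.
        votes = Counter(judge(j, svg_a, svg_b) for j in JUDGES)
        return votes.most_common(1)[0][0]

    champions = {m: best_of(m) for m in MODELS}
    # Round-robin between each model's champion image.
    wins = Counter()
    names = list(champions)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            winner_svg = match(champions[a], champions[b])
            wins[a if winner_svg == champions[a] else b] += 1
    print(wins.most_common())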
Karpathy used it as an example in a recent interview: https://www.msn.com/en-in/health/other/ai-expert-asks-grok-3...
clarification: I enjoyed the pelican on a bike and don't think it's that bad =p
The number of possible subject-verb-object combinations is near infinite. All are imaginable, but most are not plausible. A plausibility machine (LLM) will struggle with the implausible, until it can abstract well.
people expect LLMs to say "correct" stuff on the first attempt, not 10000 attempts.
Yet, these people are perfectly OK with cherry-picked success stories on youtube + advertisements, while being extremely vehement about this simple experiment...
...well maybe these people rode the LLM hype-train too early, and are desperate to defend LLMs lest their investment go poof?
obligatory hype-graph classic: https://upload.wikimedia.org/wikipedia/commons/thumb/9/94/Ga...
Any concerns that open-source "AI celebrity talks" like yours could be used in contexts that would allow LLM models to optimize their market share in ways we can't imagine yet?
Your talk might influence the funding of AI startups.
#butterflyEffect
Simon, hope you are comfortable in your new role of AI Celebrity.
I actually don't think I've seen a single correct svg drawing for that prompt.
Call it wikipediaslop.org
In that case we'd expect a human with perfect drawing skills and perfect knowledge about bikes and birds to output such a simple drawing correctly 100% of the time.
In any case, even if a model is probabilistic, if it had correctly learned the relevant knowledge you'd expect the output to be perfect because it would serve to lower the model's loss. These outputs clearly indicate flawed knowledge.
Look upon these works, ye mighty, and despair: https://www.gianlucagimini.it/portfolio-item/velocipedia/
I get that it was way easier to do, and that doing it took pennies and no time. But I would have loved it if he'd tried alternate methods of judging and seen what the results were.
Other ways:
* wisdom of the crowds (have people vote on it)
* wisdom of the experts (send the pelican images to a few dozen artists or ornithologists)
* wisdom of the LLMs (use more than one LLM)
Would have been neat to see what the human consensus was and whether it differed from the LLM consensus.
Anyway, great talk!
And there is no reason that these models need to be non-deterministic.
So there’s still the question of how controllable the LLM really is. If you change a prompt slightly, how unpredictable is the change? That can’t be tested with one prompt.
My thoughts too. It's more accurate to label LLMs as non-deterministic instead of "probabilistic".
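A toy illustration of where the randomness actually enters (made-up numbers, standard library only): the forward pass maps the same prompt to the same logits every time; non-determinism is added at the sampling step, as a decoding choice.

    import math, random

    def softmax(logits):
        exps = [math.exp(x) for x in logits]
        total = sum(exps)
        return [e / total for e in exps]

    # The model's forward pass is a deterministic function: same prompt in,
    # same logits out (ignoring floating-point quirks across hardware).
    logits = [2.0, 1.5, 0.3]          # toy scores for three candidate tokens
    tokens = ["pelican", "bicycle", "penguin"]

    # Greedy decoding: always pick the argmax -> fully deterministic.
    greedy = tokens[max(range(len(logits)), key=lambda i: logits[i])]

    # Temperature sampling: the randomness is injected here, by choice.
    probs = softmax([l / 0.8 for l in logits])   # temperature 0.8
    sampled = random.choices(tokens, weights=probs, k=1)[0]

    print(greedy, sampled)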
And at the end of the day, it’s just so much fun to see someone else having so much fun. He’s like a kid in a candy store and that excitement is contagious. After reading every one of his blog posts, I’m inspired to go play with LLMs in some new and interesting way.
Thank you Simon!
It's one of my favorite local models right now, I'm not sure how I missed it when I was reviewing my highlights of the last six months.
The prompt is "Generate an SVG of a pelican riding a bicycle" and you're supposed to write it by hand, so no graphical editor. The specification is here: https://www.w3.org/TR/SVG2/
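To give a sense of what "by hand" means here, this is roughly the kind of markup the model has to emit token by token; the shapes below are a crude, purely illustrative attempt, not any model's actual output:

    # Write a minimal hand-authored SVG: two wheels, a frame, and a very rough pelican.
    svg = """<svg xmlns="http://www.w3.org/2000/svg" width="400" height="300">
      <circle cx="100" cy="220" r="50" fill="none" stroke="black"/>   <!-- rear wheel -->
      <circle cx="280" cy="220" r="50" fill="none" stroke="black"/>   <!-- front wheel -->
      <path d="M100 220 L190 160 L280 220 M190 160 L160 220" stroke="black" fill="none"/> <!-- frame -->
      <ellipse cx="180" cy="110" rx="45" ry="30" fill="white" stroke="black"/>  <!-- body -->
      <circle cx="225" cy="85" r="15" fill="white" stroke="black"/>            <!-- head -->
      <polygon points="238,85 290,95 238,100" fill="orange"/>                  <!-- beak -->
    </svg>"""

    with open("pelican.svg", "w") as f:
        f.write(svg)

Getting the coordinates to read as "pelican riding a bicycle" without ever seeing the rendered result is the hard part, and it's exactly what the benchmark is poking at.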
I'm fairly certain I'd lose interest in getting it right before I got something better than most of those.
It certainly would, and it would cost at minimum an hour of the human programmer's time at $50+/hr. Claude does it in seconds for pennies.
Besides, it's so heavily context-dependent that you really need your own private benchmarks to make head or tails out of this whole thing.
Result: https://www.dropbox.com/scl/fi/8b03yu5v58w0o5he1zayh/pelican...
These are tough benchmarks: they trial reasoning by having the model _write_ an SVG file by hand, which means understanding how the markup has to be written to achieve the image. Even a professional would struggle with that! It's _not_ a benchmark that gives an AI the best tools to actually do this.
Prompting "a pelican riding a bicycle" makes a decent image there.
I would not hire a blind artist or a deaf musician.
It's a fun way to deflate the hype. Sure, your new LLM may have cost XX million to train and beat all the others on the benchmarks, but when you ask it to draw a pelican on a bicycle it still outputs total junk.
https://chatgpt.com/share/684582a0-03cc-8006-b5b5-de51e5cd89...
lol: https://gemini.google.com/share/4d1746a234a8
You too, Monet. Scram.
Most of the non-math design work of applied engineering, AFAIK, falls under the umbrella that's tested with the pelican riding the bicycle. You have to make a mental model and then turn it into applicable instructions.
Program code/SVG markup/parametric CAD instructions don't really differ in that aspect.
Ergo, no, you can't just throw a bicycle into an LLM and have a parametric model drop out into SolidWorks, then a machine makes it, and everyone buys it. That is the hope, really, isn't it? You end up with a useless shitty bike with a shit pelican on it.
The biggest problem we have in the LLM space is that no one really understands any of the proposed use cases well enough, and neither does anyone being told that it works for those use cases.
Like asking you to draw a 2D projection of a 4D sphere intersected with a 4D torus or something.
My CV had a stupid cliché, "committed to quality", which they correctly picked up on — "What do you mean?" one of them asked me, directly.
I thought this meant I was focussed on being the best. He didn't like this answer.
His example, blurred by 20 years of my imperfect human memory, was to ask me which is better: a Porsche, or a go-kart. Now, obviously (or I wouldn't be saying this), Porsche was a trick answer. Less obviously is that both were trick answers, because their point was that the question was under-specified — quality is the match between the product and what the user actually wants, so if the user is a 10 year old who physically isn't big enough to sit in a real car's driver's seat and just wants to rush down a hill or along a track, none of "quality" stuff that makes a Porsche a Porsche is of any relevance at all, but what does matter is the stuff that makes a go-kart into a go-kart… one of which is the affordability.
LLMs are go-karts of the mind. Sometimes that's all you need.
Go kart or porsche is irrelevant.
That's the point.
The market for go-karts does not support Porsche.
If you bring a Porsche sales team to a go-kart race, nobody will be interested.
Porsche doesn't care about this market. It goes both ways: this market doesn't care about Porsche, either.
Say what you want about Facebook but at least they released their flagship model fully open.
https://www.oneusefulthing.org/p/the-recent-history-of-ai-in...
Thanks for sharing.
> Claude 4 will rat you out to the feds!
>If you expose it to evidence of malfeasance in your company, and you tell it it should act ethically, and you give it the ability to send email, it’ll rat you out.
> But it’s not just Claude. Theo Browne put together a new benchmark called SnitchBench, inspired by the Claude 4 System Card.
> It turns out nearly all of the models do the same thing.