My 2.5 year old laptop can write Space Invaders in JavaScript now (GLM-4.5 Air)

437 points | simonw | 316 comments | 7/29/2025, 1:45:07 PM | simonwillison.net ↗

Comments (316)

NitpickLawyer · 10h ago
> Two years ago when I first tried LLaMA I never dreamed that the same laptop I was using then would one day be able to run models with capabilities as strong as what I’m seeing from GLM 4.5 Air—and Mistral 3.2 Small, and Gemma 3, and Qwen 3, and a host of other high quality models that have emerged over the past six months.

Yes, the open models have surpassed my expectations in both quality and speed of release. For a bit of context, when ChatGPT launched in late 2022, the "best" open models were GPT-J (6B) and GPT-NeoX (20B). I actually had an app running live, with users, using GPT-J for ~1 month. It was a pain. The quality was abysmal, there was no instruction following (you had to start your prompt like a story, or come up with a bunch of examples and hope the model would follow along) and so on.

And then something happened: the LLaMA models got "leaked" (I still think it was an intentional leak - "don't sue us, we never meant to release it", etc.), and the rest is history. With L1 we got lots of optimisations like quantised models, fine-tuning and so on; L2 really saw fine-tuning take off (most of the fine-tunes were better than what Meta released); we got Alpaca-LoRA showing off LoRA; and then a bunch of really strong models came out (Mistrals, Mixtrals, L3, Gemmas, Qwens, DeepSeeks, GLMs, Granites, etc.)

By some estimations the open models are ~6mo behind what the SotA labs have released (note that doesn't mean the labs are releasing their best models; it's likely they keep those in-house to use for the next runs' data curation, synthetic datasets, distilling, etc.). Being 6mo behind is NUTS! I never in my wildest dreams believed we'd be here. In fact I thought it would take ~2 years to reach GPT-3.5 levels. It's really something insane that we get to play with these models "locally", fine-tune them and so on.

genewitch · 9h ago
I'll bite. How do I train/make and/or use a LoRA, or, separately, how do I fine-tune? I've been asking this for months, and no one has a decent answer. Web search on my end is SEO/GEO spam, with no real instructions.

I know how to make an SD LoRA, and use it. I've known how to do that for 2 years. So what's the big secret about LLM LoRA?

techwizrd · 9h ago
We have been fine-tuning models using Axolotl and Unsloth, with a slight preference for Axolotl. Check out the docs [0] and fine-tune or quantize your first model. There is a lot to be learned in this space, but it's exciting.

0: https://axolotl.ai/ and https://docs.axolotl.ai/

arkmm · 7h ago
When do you think fine tuning is worth it over prompt engineering a base model?

I imagine with the finetunes you have to worry about self-hosting, model utilization, and then also retraining the model as new base models come out. I'm curious under what circumstances you've found that the benefits outweigh the downsides.

reissbaker · 6h ago
For self-hosting, there are a few companies that offer per-token pricing for LoRA finetunes (LoRAs are basically efficient-to-train, efficient-to-host finetunes) of certain base models:

- (shameless plug) My company, Synthetic, supports LoRAs for Llama 3.1 8b and 70b: https://synthetic.new All you need to do is give us the Hugging Face repo and we take care of the rest. If you want other people to try your model, we charge usage to them rather than to you. (We can also host full finetunes of anything vLLM supports, although we charge by GPU-minute for full finetunes rather than the cheaper per-token pricing for supported base model LoRAs.)

- Together.ai supports a slightly wider range of base models than we do, with a bit more config required, and any usage is charged to you.

- Fireworks does the same as Together, although they quantize the models more heavily (FP4 for the higher-end models). However, they support Llama 4, which is pretty nice although fairly resource-intensive to train.

If you have reasonably good data for your task, and your task is relatively "narrow" (i.e. find a specific kind of bug, rather than general-purpose coding; extract a specific kind of data from legal documents rather than general-purpose reasoning about social and legal matters; etc), finetunes of even a very small model like an 8b will typically outperform — by a pretty wide margin — even very large SOTA models while being a lot cheaper to run. For example, if you find yourself hand-coding heuristics to fix some problem you're seeing with an LLM's responses, it's probably more robust to just train a small model finetune on the data and have the finetuned model fix the issues rather than writing hardcoded heuristics. On the other hand, no amount of finetuning will make an 8b model a better general-purpose coding agent than Claude 4 Sonnet.

delijati · 2h ago
Do you maybe know if there is a company in the EU that hosts models (DeepSeek, Qwen3, Kimi)?
tough · 6h ago
Only for narrow applications, where a fine-tune lets you use a smaller model locally, specialised and trained for your specific use case.
whimsicalism · 7h ago
Fine-tuning rarely makes sense unless you are an enterprise, and even then it generally doesn't in most cases.
syntaxing · 8h ago
What hardware do you train on using Axolotl? I use Unsloth with Google Colab Pro.
qcnguy · 8h ago
LLM fine tuning tends to destroy the model's capabilities if you aren't very careful. It's not as easy or effective as with image generation.
israrkhan · 1h ago
Do you have a suggestion or a way to measure if model capabilities are getting destroyed? How does one measure it objectively?
RALaBarge · 53m ago
Ask it the same series of questions after you train that you posed before training started. Is the quality lower?
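A slightly more objective variant of that is to keep a fixed, general-purpose eval set and compare loss/perplexity on it before and after the fine-tune; a rising perplexity on data the base model used to handle well is one signal of regression. A rough sketch (model IDs and texts are placeholders, not a recommendation of specific models):

    # Rough sketch: mean perplexity of a causal LM on a fixed eval set.
    # Compare the base model vs. your fine-tune on the same texts.
    import math, torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    def mean_perplexity(model_id, texts):
        tok = AutoTokenizer.from_pretrained(model_id)
        model = AutoModelForCausalLM.from_pretrained(model_id).eval()
        losses = []
        for text in texts:
            ids = tok(text, return_tensors="pt").input_ids
            with torch.no_grad():
                loss = model(ids, labels=ids).loss  # causal LM loss
            losses.append(loss.item())
        return math.exp(sum(losses) / len(losses))

    eval_texts = ["...a few hundred prompts/documents you care about..."]
    print(mean_perplexity("base-model", eval_texts))      # placeholder IDs
    print(mean_perplexity("my-finetune", eval_texts))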
jasonjmcghee · 2h ago
brev.dev made an easy-to-follow guide a while ago, but apparently Nvidia took it down or something when they bought them?

So here's the original

https://web.archive.org/web/20231127123701/https://brev.dev/...

notpublic · 9h ago
https://github.com/unslothai/unsloth

I'm not sure if it contains exactly what you're looking for, but it includes several resources and notebooks related to fine-tuning LLMs (including LoRA) that I found useful.

svachalek · 8h ago
For completeness, for Apple hardware MLX is the way to go.
minimaxir · 9h ago
If you're using Hugging Face transformers, the library you want to use is peft: https://huggingface.co/docs/peft/en/quicktour

There are Colab Notebook tutorials around training models with it as well.
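For a concrete starting point, a minimal LoRA setup with peft looks roughly like this (the base model name and hyperparameters are just illustrative assumptions; you still need a dataset and a training loop):

    # Illustrative sketch only: wrap a causal LM with LoRA adapters via peft.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import LoraConfig, get_peft_model

    base = "meta-llama/Llama-3.1-8B"  # any causal LM on Hugging Face
    tokenizer = AutoTokenizer.from_pretrained(base)
    model = AutoModelForCausalLM.from_pretrained(base)

    lora = LoraConfig(
        r=16,                                 # adapter rank
        lora_alpha=32,
        lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],  # attention projections to adapt
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora)
    model.print_trainable_parameters()  # typically well under 1% of the weights

    # From here you train as usual (e.g. transformers.Trainer or trl's SFTTrainer)
    # on your instruction data, then model.save_pretrained("my-lora").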

electroglyph · 5h ago
Unsloth is the easiest way to fine-tune due to its lower memory requirements.
pdntspa · 5h ago
Have you tried asking an LLM?
Nesco · 6h ago
Zuck wouldn’t have leaked it on 4chan of all places
vaenaes · 6h ago
Why not?
tough · 6h ago
prob just told an employee to get it done no?
tonyhart7 · 10h ago
is GLM 4.5 better than Qwen3 coder??
diggan · 10h ago
For what? It's really hard to say whether one model is "generally" better than another, as they're all better/worse at specific things.

My own benchmark has a bunch of different tasks I use various local models for, and I run it when I want to see if a new model is better than the existing ones I use. The output is basically a markdown table with a description of which model is best for which task.

They're being sold as general-purpose things that are uniformly better/worse at everything, but reality doesn't reflect this: they all have very specific tasks they're worse/better at, and the only way to find that out is by having a private benchmark you run yourself.
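For illustration, the skeleton of such a private benchmark can be tiny. This sketch assumes a local OpenAI-compatible endpoint (as exposed by LM Studio, Ollama, llama-server, etc. — port and URL are assumptions); the tasks, model names, and the substring pass-checks are placeholders you'd swap for your own:

    # Skeleton of a private benchmark: run each model on each task, print a markdown table.
    import json, urllib.request

    TASKS = {  # name -> (prompt, string the answer must contain to count as a pass)
        "fizzbuzz": ("Write a Python fizzbuzz function.", "def "),
        "sql":      ("Write a SQL query counting rows per day from events(ts).", "GROUP BY"),
    }
    MODELS = ["glm-4.5-air", "qwen3-coder", "mistral-small-3.2"]

    def run_model(model, prompt):
        body = json.dumps({"model": model, "messages": [{"role": "user", "content": prompt}]})
        req = urllib.request.Request("http://localhost:1234/v1/chat/completions",
                                     data=body.encode(), headers={"Content-Type": "application/json"})
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read())["choices"][0]["message"]["content"]

    print("| task | " + " | ".join(MODELS) + " |")
    print("|------" * (len(MODELS) + 1) + "|")
    for name, (prompt, expected) in TASKS.items():
        cells = ["pass" if expected in run_model(m, prompt) else "FAIL" for m in MODELS]
        print("| " + name + " | " + " | ".join(cells) + " |")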

kelvinjps10 · 9h ago
Coding? They are coding models? What specific tasks is one performing better at than the other?
diggan · 9h ago
They may be, but there are lots of languages, lots of approaches, lots of methodologies and just a ton of different ways to "code", coding isn't one homogeneous activity that one model beats all the other models at.

> what specific tasks is one performing better than the other?

That's exactly why you create your own benchmark, so you can figure that out by just having a list of models, instead of testing each individually and basing it on "feels better".

whimsicalism · 9h ago
glm 4.5 is not a coding model
simonw · 9h ago
It may not be code-only, but it was trained extensively for coding:

> Our base model undergoes several training stages. During pre-training, the model is first trained on 15T tokens of a general pre-training corpus, followed by 7T tokens of a code & reasoning corpus. After pre-training, we introduce additional stages to further enhance the model's performance on key downstream domains.

From my notes here: https://simonwillison.net/2025/Jul/28/glm-45/

whimsicalism · 8h ago
yes, all reasoning models currently are, but it’s not like DeepSeek Coder or Qwen Coder
simonw · 8h ago
I don't see how the training process for GLM-4.5 is materially different from that used for Qwen3-235B-A22B-Instruct-2507 - they both did a ton of extra reinforcement learning training related to code.

Am I missing something?

whimsicalism · 8h ago
I think the primary thing you're missing is that Qwen3-235B-A22B-Instruct-2507 != Qwen3-Coder-480B-A35B-Instruct. And the difference there is that while both do tons of code RL, in one they do not monitor performance on anything else for forgetting/regression and focus fully on code post-training pipelines and it is not meant for other tasks.
NitpickLawyer · 10h ago
I haven't tried them (released yesterday I think?). The benchmarks look good (similar I'd say) but that's not saying much these days. The best test you can do is have a couple of cases that match your needs, and run them yourself w/ the cradle that you are using (aider, cline, roo, any of the CLI tools, etc). Openrouter usually has them up soon after launch, and you can run a quick test really cheap (and only deal with one provider for billing & stuff).
bob1029 · 8h ago
> I still think it’s noteworthy that a model running on my 2.5 year old laptop (a 64GB MacBook Pro M2) is able to produce code like this—especially code that worked first time with no further edits needed.

I believe we are vastly underestimating what our existing hardware is capable of in this space. I worry that narratives like the bitter lesson and the efficient compute frontier are pushing a lot of brilliant minds away from investigating revolutionary approaches.

It is obvious that the current models are deeply inefficient when you consider how much you can decimate the precision of the weights post-training and still have pelicans on bicycles, etc.

jonas21 · 8h ago
Wasn't the bitter lesson about training on large amounts of data? The model that he's using was still trained on a massive corpus (22T tokens).
itsalotoffun · 8h ago
I think GP means that if you internalize the bitter lesson (more data more compute wins), you stop imagining how to squeeze SOTA minus 1 performance out of constrained compute environments.
yahoozoo · 8h ago
What does that have to do with quantizing?
righthand · 9h ago
Did you understand the implementation or just that it produced a result?

I would hope an LLM could spit out a cobbled-together answer to a common interview question.

Today a colleague presented data changes and used an LLM to build a display app for the JSON for presentation. Why did they not just pipe the JSON into our already working app that displays this data?

People around me for the most part are using LLMs to enhance their presentations, not to actually implement anything useful. I have been watching my coworkers use it that way for months.

Another example? A different coworker wanted to build a document macro to perform bulk updates on courseware content. Swapping old words for new words. To build the macro they first wrote a rubric to prompt an LLM correctly inside of a Word doc.

That filled rubric is then used to generate a program template for the macro. To define the requirements for the macro the coworker then used a slideshow slide to list bullet points of functionality, in this case to Find+Replace words in courseware slides/documents using a list of words from another text document. Due to the complexity of the system, I can’t believe my colleague saved any time. The presentation was interesting though, and that is what they got compliments on.

However the solutions are absolutely useless for anyone else but the implementer.

simonw · 9h ago
I scanned the code and understood what it was doing, but I didn't spend much time on it once I'd seen that it worked.

If I'm writing code for production systems using LLMs I still review every single line - my personal rule is I need to be able to explain how it works to someone else before I'm willing to commit it.

I wrote a whole lot more about my approach to using LLMs to help write "real" code here: https://simonwillison.net/2025/Mar/11/using-llms-for-code/

photon_lines · 8h ago
This is why I love using the Deep-Seek chain of reason output ... I can actually go through and read what it's 'thinking' to validate whether it's basing its solution on valid facts / assumptions. Either way thanks for all of your valuable write-ups on these models I really appreciate them Simon!
vessenes · 4h ago
Nota bene - there is a fair amount of research indicating that models' outputs and ‘thoughts’ do not necessarily align with their chain-of-reasoning output.

You can validate this pretty easily by asking some logic or coding questions: you will likely note that the final output is not necessarily the logical conclusion of the end of the thinking; sometimes it's significantly orthogonal to it, or the model returns to reasoning in the middle.

All that to say - good idea to read it, but stay vigilant on outputs.

shortrounddev2 · 4h ago
Serious question: if you have to read every line of code in order to validate it in production, why not just write every line of code instead?
simonw · 4h ago
Because it's much, much faster to review a hundred lines of code than it is to write a hundred lines of code.

(I'm experienced at reading and reviewing code.)

paufernandez · 2h ago
Simon, don't you fear "atrophy" in your writing ability?
simonw · 1h ago
I think it will happen a bit, but I'm not worried about it.

My ability to write with a pen has suffered enormously now that I do most of my writing on a phone or laptop - but I'm writing way more.

I expect I'll become slower at writing code without an LLM, but the volume of (useful) code I produce will be worth the trade off.

th0ma5 · 8h ago
[flagged]
dang · 6h ago
Please don't cross into personal attack in HN comments.

https://news.ycombinator.com/newsguidelines.html

Edit: twice is already a pattern - https://news.ycombinator.com/item?id=44110785. No more of this, please.

Edit 2: I only just realized that you've been frequently posting abusive replies in a way that crosses into harangue if not harassment:

https://news.ycombinator.com/item?id=44725284 (July 2025)

https://news.ycombinator.com/item?id=44725227 (July 2025)

https://news.ycombinator.com/item?id=44725190 (July 2025)

https://news.ycombinator.com/item?id=44525830 (July 2025)

https://news.ycombinator.com/item?id=44441154 (July 2025)

https://news.ycombinator.com/item?id=44110817 (May 2025)

https://news.ycombinator.com/item?id=44110785 (May 2025)

https://news.ycombinator.com/item?id=44018000 (May 2025)

https://news.ycombinator.com/item?id=44008533 (May 2025)

https://news.ycombinator.com/item?id=43779758 (April 2025)

https://news.ycombinator.com/item?id=43474204 (March 2025)

https://news.ycombinator.com/item?id=43465383 (March 2025)

https://news.ycombinator.com/item?id=42960299 (Feb 2025)

https://news.ycombinator.com/item?id=42942818 (Feb 2025)

https://news.ycombinator.com/item?id=42706415 (Jan 2025)

https://news.ycombinator.com/item?id=42562036 (Dec 2024)

https://news.ycombinator.com/item?id=42483664 (Dec 2024)

https://news.ycombinator.com/item?id=42021665 (Nov 2024)

https://news.ycombinator.com/item?id=41992383 (Oct 2024)

That's abusive, unacceptable, and not even a complete list!

You can't go after another user like this on HN, regardless of how right you are or feel you are or who you have a problem with. If you keep doing this, we're going to end up banning you, so please stop now.

ajcp · 8h ago
They said "production systems", not "critical production applications".

Also the 'if' doesn't negate anything as they say "I still", meaning the behavior is actively happening or ongoing; they don't use a hypothetical or conditional after "still", as in "I still would".

bnchrch · 8h ago
You do realize you're talking to the creator of Django, Datasette, and Lanyrd, right?
tough · 6h ago
that made me chuckle
CamperBob2 · 8h ago
I missed the part where he said he was going to put the Space Invaders game into production. Link?
magic_hamster · 3h ago
The LLM is the solution.
bsder · 3h ago
> However the solutions are absolutely useless for anyone else but the implementer.

Disposable code is where AI shines.

AI generating the boilerplate code for an obtuse build system? Yes, please. AI generating an animation? Ganbatte. (Look at how much work 3Blue1Brown had to put into that--if AI can help that kind of thing, it has my blessings). AI enabling someone who doesn't program to generate some prototype that they can then point at an actual programmer? Excellent.

This is fine because you don't need to understand the result. You have a concrete pass/fail gate and don't care about what's underneath. This is real value. The problem is that it isn't gigabuck value.

The stuff that would be gigabuck value is unfortunately where AI falls down. Fix this bug in a product. Add this feature to an existing codebase. etc.

AI is also a problem because disposable code is what you would assign to junior programmers in order for them to learn.

AlexeyBrin · 10h ago
Most likely its training data included countless Space Invaders implementations in various programming languages.
gblargg · 9h ago
The real test is if you can have it tweak things. Have the ship shoot down. Have the space invaders come from the left and right. Add two player simultaneous mode with two ships.
wizzwizz4 · 4h ago
It can usually tweak things, if given specific instruction, but it doesn't know when to refactor (and can't reliably preserve functionality when it does), so the program gets further and further away from something sensible until it can't make edits any more.
simonw · 4h ago
For serious projects you can address that by writing (or having it write) unit tests along the way, that way it can run in a loop and avoid breaking existing functionality when it adds new changes.
greesil · 3h ago
Okay ask it to write unit tests for space invaders next time :)
quantumHazer · 10h ago
and probably some of the synthetic data is generated copies of the games already in the dataset?

I have this feeling with LLM-generated React frontends: they all look the same

cchance · 9h ago
Have you used the internet? That's how the internet looks; they're all fuckin React with the same layouts and styles, 90% shadcn lol
tshaddox · 9h ago
To be fair, the human-generated user interfaces all look the same too.


bayindirh · 10h ago
Last time this came up, somebody asked for a "premium camera app for iOS", and the model (re)generated Halide.

Models don't emit something they don't know. They remix and rewrite what they know. There's no invention, just recall...

Uehreka · 9h ago
> Models don't emit something they don't know. They remix and rewrite what they know. There's no invention, just recall...

People really need to stop saying this. I get that it was the Smart Guy Thing To Say in 2023, but by this point it’s pretty clear that it’s not true in any way that matters for most practical purposes.

Coding LLMs have clearly been trained on conversations where a piece of code is shown, a transformation is requested (rewrite this from Python to Go), and then the transformed code is shown. It’s not that they’re just learning codebases, they’re learning what working with code looks like.

Thus you can ask an LLM to refactor a program in a language it has never seen, and it will “know” what refactoring means, because it has seen it done many times, and it will stand a good chance of doing the right thing.

That’s why they’re useful. They’re doing something way more sophisticated than just “recombining codebases from their training data”, and anyone chirping 2023 sound bites is going to miss that.

FeepingCreature · 10h ago
True where trivial; where nontrivial, false.

Trivially, humans don't emit something they don't know either. You don't spontaneously figure out Javascript from first principles, you put together your existing knowledge into new shapes.

Nontrivially, LLMs can absolutely produce code for entirely new requirements. I've seen them do it many times. Will it be put together from smaller fragments? Yes, this is called "experience" or if the fragments are small enough, "understanding".

phkahler · 9h ago
>> Nontrivially, LLMs can absolutely produce code for entirely new requirements. I've seen them do it many times.

I think most people writing software today are reinventing a wheel, even in corporate environments for internal tools. Everyone wants their own tweak or thinks their idea is unique and nobody wants to share code publicly, so everyone pays programmers to develop buggy bespoke custom versions of the same stuff that's been done 100 times before.

I guess what I'm saying is that your requirements are probably not new, and to the extent they are yes an LLM can fill in the blanks due to its fluency in languages.

bayindirh · 10h ago
Humans can observe ants and invent ant colony optimization. AIs can’t.

Humans can explore what they don’t know. AIs can’t.

falcor84 · 10h ago
What makes you categorically say that "AIs can't"?

Based on my experience with present day AIs, I personally wouldn't be surprised at all that if you showed Gemini 2.5 Pro a video of an insect colony and asked it "Take a look at the way they organize and see if that gives you inspiration for an optimization algorithm", it will spit something interesting out.

sarchertech · 8h ago
It will 100% have something in its training set discussing a human doing this and will almost definitely spit out something similar.
FeepingCreature · 9h ago
What makes you categorically say that "humans can"?

I couldn't do that with an ant colony. I would have to train on ant research first.

(Oh, and AIs can absolutely explore what they don't know. Watch a Claude Code instance look at a new repository. Exploration is a convergent skill in long-horizon RL.)

ben_w · 9h ago
> Humans can observe ants and invent any colony optimization. AIs can’t.

Surely this is exactly what current AI do? Observe stuff and apply that observation? Isn't this the exact criticism, that they aren't inventing ant colonies from first principles without ever seeing one?

> Humans can explore what they don’t know. AIs can’t.

We only learned to decode Egyptian hieroglyphs because of the Rosetta Stone. There's no translation for North Sentinelese, the Voynich manuscript, or Linear A.

We're not magic.

CamperBob2 · 9h ago
That's what benchmarks like ARC-AGI are designed to test. The models are getting better at it, and you aren't.

Nothing ultimately matters in this business except the first couple of time derivatives.

satvikpendem · 10h ago
This doesn't make sense information-theoretically, because models are far smaller than the training data they purport to hold and recall, so there must be some level of "understanding" going on. Whether that's the same as human understanding is a different matter.
Eggpants · 7h ago
It’s a lossy text compression technique. It’s clever applied statistics. Basically an advanced association rules algorithm which has been around for decades but modified to consider order and relative positions.

There is no understanding, regardless of the wants of all the capital investors in this domain.

simonw · 6h ago
I don't care if it can "understand" anything, as long as I can use it to achieve useful things.
Eggpants · 6h ago
“useful things“ like poorly drawing birds on bikes? ;)

(I have much respect for what you have done and are currently doing, but you did walk right into that one)

msephton · 2h ago
The pelican on a bicycle is a very useful test.
CamperBob2 · 3h ago
> It’s a lossy text compression technique.

That is a much, much bigger deal than you make it sound like.

Compression may, in fact, be all we need. For that matter, it may be all there is.

mr_toad · 6h ago
> They remix and rewrite what they know. There's no invention, just recall...

If they only recalled they wouldn’t “hallucinate”. What’s a lie if not an invention? So clearly they can come up with data that they weren’t trained on, for better or worse.

0x457 · 5h ago
Because internally, there isn't a difference between a correctly "recalled" token and an incorrectly recalled (hallucinated) one.
NitpickLawyer · 10h ago
This comment is ~3 years late. Every model since GPT-3 has had the entirety of publicly available code in its training data. That's not a gotcha anymore.

We went from ChatGPT's "oh, look, it looks like Python code but everything is wrong" to "here's a full-stack boilerplate app that does what you asked and works zero-shot" inside 2 years. That's the kicker. And the sauce isn't just in the training set; models now do post-training and RL and a bunch of other stuff to get to where we are. Not to mention the insane abilities with extended context (the first models were 2/4k max), agentic stuff, and so on.

These kinds of comments are really missing the point.

haar · 10h ago
I've had little success with Agentic coding, and what success I have had has been paired with hours of frustration, where I'd have been better off doing it myself for anything but the most basic tasks.

Even then, when you start to build up complexity within a codebase - the results have often been worse than "I'll start generating it all from scratch again, and include this as an addition to the initial longtail specification prompt as well", and even then... it's been a crapshoot.

I _want_ to like it. The times where it initially "just worked" felt magical and inspired me with the possibilities. That's what prompted me to get more engaged and use it more. The reality of doing so is just frustrating and wishing things _actually worked_ anywhere close to expectations.

aschobel · 10h ago
Bingo, it's magical but the learning curve is very very steep. The METR study on open-source productivity alluded to this a bit.

I am definitely at a point where I am more productive with it, but it took a bunch of effort.

haar · 9h ago
Apologies if I was unclear.

The more I've used it, the more I've disliked how poor the results it produced were, and the more I've realised I would have been better served by doing it myself and following a methodical path for things I didn't have experience with.

It's easier to step through a problem as I'm learning and making small changes than to have an LLM go "It's done, and production ready!" when it just straight up doesn't work for 101 different tiny reasons.

devmor · 9h ago
The subjects in the study you are referencing also believed that they were more productive with it. What metrics do you have to convince yourself you aren't under the same illusionary bias they were?
simonw · 9h ago
Yesterday I used ffmpeg to extract the frame at the 13 second mark of a video out as a JPEG.

If I didn't have an LLM to figure that out for me I wouldn't have done it at all.

throwworhtthrow · 8h ago
LLMs still give subpar results with ffmpeg. For example, when I asked Sonnet to trim a long video with ffmpeg, it put the input file parameter before the start time parameter, which triggers an unnecessary decode of the video file. [1]

Sure, use the LLM to get over the initial hump. But ffmpeg's no exception to the rule that LLM's produce subpar code. It's worth spending a couple minutes reading the docs to understand what it did so you can do it better, and unassisted, next time.

[1] https://ffmpeg.org/ffmpeg.html#:~:text=ss%20position
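For reference, the fast-seek ordering being described looks roughly like this (wrapped in Python here for consistency; input.mp4 and the 13-second mark are placeholders):

    # Sketch of input seeking: -ss placed before -i makes ffmpeg seek in the
    # input instead of decoding up to the timestamp, then grab a single frame.
    import subprocess

    subprocess.run([
        "ffmpeg",
        "-ss", "13",        # seek first (fast), then open the input
        "-i", "input.mp4",
        "-frames:v", "1",   # write a single video frame
        "frame.jpg",
    ], check=True)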

CamperBob2 · 8h ago
That says more about suboptimal design on ffmpeg's part than it does about the LLM. Most humans can't deal with ffmpeg command lines, so it's not surprising that the LLM misses a few tricks.
nottorp · 7h ago
Had an LLM generate 3 lines of working C++ code that was "only" one order of magnitude slower than what I edited the code to in 10 minutes.

If you're happy with results like that, sure, LLMs miss "a few tricks"...

ben_w · 7h ago
You don't have to leave LLM code alone, it's fine to change it — unless, I guess, you're doing some kind of LLM vibe-code-golfing?

But this does remind me of a previous co-worker. Wrote something to convert from a custom data store to a database, his version took 20 minutes on some inputs. Swore it couldn't possibly be improved. Obviously ridiculous because it didn't take 20 minutes to load from the old data store, nor to load from the new database. Over the next few hours of looking at very mediocre code, I realised it was doing an unnecessary O(n^2) check, confirmed with the CTO it wasn't business-critical, got rid of it, and the same conversion on the same data ran in something like 200ms.

Over a decade before LLMs.

nottorp · 6h ago
We all do that, sometimes when it’s time critical, sometimes when it isn’t.

But I keep being told “AI” is the second coming of Ahura Mazda so it shouldn’t do stuff like that right?

ben_w · 5h ago
> Ahura Mazda

Niche reference, I like it.

But… I only hear of scammers who say, and psychosis sufferers who think, LLMs are *already* that competent.

Future AI? Sure, lots of sane-seeming people also think it could go far beyond us. Special purpose ones have in very narrow domains. But current LLMs are only good enough to be useful and potentially economically disruptive, they're not even close to wildly superhuman like Stockfish is.

CamperBob2 · 4h ago
Sure. If you ask ChatGPT to play chess, it will put up an amateur-level effort at best. Stockfish will indeed wipe the floor with it. But what happens when you ask Stockfish to write a Space Invaders game?

ChatGPT will get better at chess over time. Stockfish will not get better at anything except chess. That's kind of a big difference.

ben_w · 3h ago
> ChatGPT will get better at chess over time

Oddly, LLMs got worse at specifically chess: https://dynomight.net/chess/

But even to the general point, there's absolutely no agreement how much better the current architectures can ultimately get, nor how quickly they can get there.

Do they have potential for unbounded improvements, albeit at exponential cost for each linear incremental improvement? Or will they asymptotically approach someone with 5 years' experience, 10 years' experience, a lifetime of experience, or a higher level than any human?

If I had to bet, I'd say current models have asymptotic growth converging to merely "ok" performance; and separately I'd claim that even if they're actually unbounded with exponential cost for linear returns, we can't afford the training cost needed to make them act like someone with even just 6 years of professional experience in any given subject.

Which is still a lot. Especially as it would be acting like it had about as much experience in every other subject at the same time. Just… not a literal Ahura Mazda.

CamperBob2 · 2h ago
> If I had to bet, I'd say current models have asymptotic growth converging to merely "ok" performance

(Shrug) People with actual money to spend are betting twelve figures that you're wrong.

Should be fun to watch it shake out from up here in the cheap seats.

ben_w · 2h ago
Nah, trillion dollars is about right for "ok". Percentage point of the global economy in cost, automate 2 percent and get a huge margin. We literally set more than that on actual fire each year.

For "pretty good", it would be worth 14 figures, over two years. The global GDP is 14 figures. Even if this only automated 10% of the economy, it pays for itself after a decade.

For "Ahura Mazda", it would easily be worth 16 figures, what with that being the principal God and god of the sky in Zoroastrianism, and the only reason it stops at 16 is the implausibility of people staying organised for longer to get it done.

CamperBob2 · 6h ago
"I'm taking this talking dog right back to the pound. It told me to short NVDA, and you should see the buffer overflow bugs in the C++ code it wrote. Totally overhyped. I don't get it."
nottorp · 6h ago
"We hear you have been calling our deity a talking dog. Please enter the red door for reeducation."
dingnuts · 9h ago
It is nice to use LLMs to generate ffmpeg commands, because those can be pretty tricky, but really, you wouldn't have just used the man page before?

That explains a lot about Django that the author is allergic to man pages lol

ben_w · 7h ago
I remember when I was a kid, people asking a teacher how to spell a word, and the answer was generally "look it up in a dictionary"… which you can only do if you already have shortlist of possible spellings.

*nix man pages are the same: if you already know which tool can solve your problem, they're easy to use. But you have to already have a shortlist of tools that can solve your problem, before you even know which man pages to read.

adastra22 · 1h ago
That’s what GNU info is for, of course.
simonw · 9h ago
I just took a look, and the man page DOES explain how to do that!

... on line 3,218: https://gist.github.com/simonw/6fc05ea7392c5fb8a5621d65e0ed0...

(I am very confident I am not the only person who has been deterred by ffmpeg's legendarily complex command-line interface. I feel no shame about this at all.)

quesera · 7h ago
Ffmpeg is genuinely complicated! And the CLI is convoluted (in justifiable, and unfortunate ways).

But if you approach ffmpeg from the perspective of "I know this is possible", you are always correct, and can almost always reach the "how" in a handful of minutes.

Whether that's worth it or not, will vary. :)

devmor · 9h ago
You wouldn't have just typed "extract frame at timestamp as jpeg ffmpeg" into Google and used the StackExchange result that comes up first that gives you a command to do exactly that?
simonw · 9h ago
Before LLMs made ffmpeg no-longer-frustrating-to-use I genuinely didn't know that ffmpeg COULD do things like that.
devmor · 6h ago
I'm not really sure what you're saying an LLM did in this case. Inspired a lost sense of curiosity?
simonw · 5h ago
My general point is that people say things like "yeah, but this one study showed that programmers over-estimate the productivity gain they get from LLMs so how can you really be sure?"

Meanwhile I've spent the past two years constantly building and implementing things I never would have done because of the reduction in friction LLM assistance gives me.

I wrote about this first two years ago - AI-enhanced development makes me more ambitious with my projects - https://simonwillison.net/2023/Mar/27/ai-enhanced-developmen... - when I realized I was hacking on things with tech like AppleScript and jq that I'd previously avoided.

It's hard to measure the productivity boost you get from "wouldn't have built that thing" to "actually built that thing".

Philpax · 5h ago
Translated a vague natural language query ("cli, extract frame 13s into video") into something immediately actionable with specific examples and explanations, surfacing information that I would otherwise not know how to search for.

That's what I've done with my ffmpeg LLM queries, anyway - can't speak for simonw!

wizzwizz4 · 4h ago
DuckDuckGo search results for "cli, extract frame 13s into video" (no quotes):

https://stackoverflow.com/questions/10957412/fastest-way-to-...

https://superuser.com/questions/984850/linux-how-to-extract-...

https://www.aleksandrhovhannisyan.com/notes/video-cli-cheat-...

https://www.baeldung.com/linux/ffmpeg-extract-video-frames

https://ottverse.com/extract-frames-using-ffmpeg-a-comprehen...

Search engines have been able to translate "vague natural language queries" into search results for a decade, now. This pre-existing infrastructure accounts for the vast majority of ChatGPT's apparent ability to find answers.

stelonix · 1h ago
Yet the interface is fundamentally different: the output feels much more like bro pages [0], and it's within a click of clipboarding, one Ctrl+V away from extracting the screenshot at the 13-second mark. I've been using Google for the past 24 years and my google-fu has always left people amazed, yet I can no longer be bothered to go through Stack Exchange's results when an LLM not only spits it out so nicely, but also does the equivalent of an explainshell [1].

Not comparable and I fail to see why going through Google's ads/results would be better?

[0] https://github.com/pombadev/bropages

[1] https://github.com/idank/explainshell

0x457 · 5h ago
LLM somewhat understood ffmpeg documentation? Not sure what is not clear here.
jan_Sate · 10h ago
Not exactly. The real utility value of LLMs for programming is to come up with something new. For Space Invaders, instead of using an LLM for that, I might as well just manually search for the code online and use that.

To show that LLMs actually can provide value for one-shot programming, you need to find a problem for which there's no fully working sample code available online. I'm not trying to say that an LLM couldn't do that. But just because an LLM can come up with a perfectly working Space Invaders doesn't mean that it could.

tracker1 · 9h ago
I have a friend who has been doing just that... usually with his company he manages a handful of projects where the bulk of the development is outsourced overseas. This past year, he's outpaced the 6 devs he's had working on misc projects just with his own efforts and AI. Most of this being a relatively unique combination of UX with features that are less common.

He's using AI with note-taking apps for meetings to enhance notes and flesh out technology ideas at a higher level, then refining those ideas into working experiments.

It's actually impressive to see. My personal experience has been far more disappointing to say the least. I can't speak to the code quality, consistency or even structure in terms of most people being able to maintain such applications though. I've asked to shadow him through a few of his vibe coding sessions to see his workflow. It feels rather alien to me, again my experience is much more disappointing in having to correct AI errors.

nottorp · 7h ago
Is this the same person who posted about launching 17 "products" in one year a few days ago on HN? :)
tracker1 · 3h ago
No, he's been working on building a larger eLearning solution with some interesting workflow analytics around courseware evaluation and grading. He's been involved in some of the newer LRS specifications and some implementation details to bridge training as well as real world exposure scenarios. Working a lot with first responders, incident response training etc.

I've worked with him off and on for years from simulating aircraft diagnostics hardware to incident command simulation and setting up core infrastructure for F100 learning management backends.

devmor · 9h ago
> The real utility value of LLM for programming is to come up with something new.

That's the goal for these projects anyways. I don't know that its true or feasible. I find the RAG models much more interesting myself, I see the technology as having far more value in search than generation.

Rather than write some markov-chain reminiscent frankenstein function when I ask it how to solve a problem, I would like to see it direct me to the original sources it would use to build those tokens, so that I can see their implementations in context and use my judgement.

simonw · 9h ago
"I would like to see it direct me to the original sources it would use to build those tokens"

Sadly that's not feasible with transformer-based LLMs: those original sources are long gone by the time you actually get to use the model, scrambled a billion times into a trained set of weights.

One thing that helped me understand this is understanding that every single token output by an LLM is the result of a calculation that considers all X billion parameters that are baked into that model (or a subset of that in the case of MoE models, but it's still billions of floating point calculations for every token.)

You can get an imitation of that if you tell the model "use your search tool and find example code for this problem and build new code based on that", but that's a pretty unconventional way to use a model. A key component of the value of these things is that they can spit out completely new code based on the statistical patterns they learned through training.
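For a rough sense of scale, a common back-of-envelope rule is ~2 FLOPs per active parameter per generated token for the forward pass (the parameter count below is just an illustrative assumption, not a claim about any specific model):

    # Back-of-envelope: forward-pass compute per generated token.
    active_params = 12e9                     # e.g. a MoE model with ~12B active parameters
    flops_per_token = 2 * active_params      # ~2 FLOPs per active parameter per token
    print(f"{flops_per_token:.1e} FLOPs per token")  # ~2.4e10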

devmor · 9h ago
I am aware, and that's exactly why I don't think they're anywhere near as useful for this type of work as the people pushing them want them to be.

I tried to push for this type of model when an org I worked with over a decade ago was first exploring using the first generation of Tensorflow to drive customer service chatbots and was sadly ignored.

simonw · 9h ago
I don't understand. For code, why would I want to remix existing code snippets?

I totally get the value of RAG style patterns for information retrieval against factual information - for those I don't want the LLM to answer my question directly, I want it to run a search and show me a citation and directly quote a credible source as part of answering.

For code I just want code that works - I can test it myself to make sure it does what it's supposed to.

devmor · 9h ago
> I don't understand. For code, why would I want to remix existing code snippets?

That is what you're doing already. You're just relying on a vector compression and search engine to hide it from you and hoping the output is what you expect, instead of having it direct you to where it remixed those snippets from, so you can see how they work to start with and make sure it's properly implemented from the get-go.

We all want code that works, but understanding that code is a critical part of that for anything but a throw-away one time use script.

I don't really get this desire to replace critical thought with hoping and testing. It sounds like the pipe dream of a middle manager, not a tool for a programmer.

stavros · 8h ago
I don't understand your point. You seem to be saying that we should be getting code from the source, then adapting it to our project ourselves, instead of getting adapted code to begin with.

I'm going to review the code anyway, why would I not want to save myself some of the work? I can "see how they work" after the LLM gives them to me just fine.

devmor · 6h ago
The work that you are "saving" is the work of using your brain to determine the solution to the problem. Whatever the LLM gives you doesn't have a context it is used in other than your prompt - you don't even know what it does until after you evaluate it.

If you instead have a set of sources related to your problem, they immediately come with context, usage and in many cases, developer notes and even change history to show you mistakes and adaptations.

You're ultimately creating more work for yourself* by trying to avoid work, and possibly ending up with an inferior solution in the process. Where is your sense of efficiency? Where is your pride as an intellectual?

* Yes, you are most likely creating more work for yourself even if you think you are capable of telling otherwise. [1]

1. https://metr.org/blog/2025-07-10-early-2025-ai-experienced-o...

simonw · 5h ago
It sounds like you care deeply about learning as much as you can. I care about that too.

I would encourage you to consider that even LLM-generated code can teach you a ton of useful new things.

Go read the source code for my dumb, zero-effort space invaders clone: https://github.com/simonw/tools/blob/main/space-invaders-GLM...

There's a bunch of useful lessons to be picked up even from that!

- Examples of CSS gradients, box shadows and flexbox layout

- CSS keyframe animation

- How to implement keyboard events in JavaScript

- A simple but effective pattern for game loops against a Canvas element, using requestAnimationFrame

- How to implement basic collision detection

If you've written games like this before these may not be new to you, but I found them pretty interesting.

stavros · 6h ago
Thanks for the concern, but I'm perfectly able to judge for myself whether I'm creating more work or delivering an inferior product.
AlexeyBrin · 9h ago
You are reading too much into my comment. My point was that the test (a Space Invaders clone) used to assess the model has been irrelevant for some time now. I could have gotten a similar result with Mistral Small a few months ago.
stolencode · 2h ago
It's amazing that none of you even try to falsify your claims anymore. You can literally just put some of the code into a search engine and find the prior art example:

https://www.web-leb.com/en/code/2108

Your "AI tools" are just "copyright whitewashing machines."

These kinds of comments are really ignoring reality.

MyOutfitIsVague · 10h ago
I don't think they are missing the point, because they're pointing out that the tools are still the most useful for patterns that are extremely widely known and repeated. I use Gemini 2.5 Pro every day for coding, and even that one still falls over on tasks that aren't well known to it (which is why I break the problem down into small parts that I know it'll be able to handle properly).

It's kind of funny, because sometimes these tools are magical and incredible, and sometimes they are extremely stupid in obvious ways.

Yes, these are impressive, and especially so for local models that you can run yourself, but there is a gap between "absolutely magical" and "pretty cool, but needs heavy guiding" depending on how heavily the ground you're treading has been walked upon.

For a heavily explored space, it's like being impressed that your 2.5 year old M2 with 64 GB RAM can extract some source code from a zip file. It's worth being impressed and excited about the space and the pace of improvement, but it's also worth stepping back and thinking rationally about the specific benchmark at hand.

NitpickLawyer · 10h ago
> because they're pointing out that the tools are still the most useful for patterns that are extremely widely known and repeated

I agree with you, but your take is much more nuanced than what the GP comment said! These models don't simply regurgitate the training set. That was my point with gpt3. The models have advanced from that, and can now "generalise" over the context in ways they could not do ~3 years ago. We are now at a point where you can write a detailed spec (10-20k tokens) for an unseen scripting language, and have SotA models a) write a parser and b) start writing scripts for you in that language, even though it never saw that particular scripting language anywhere in its training set. Try it. You'll be surprised.

jayd16 · 10h ago
I think you're missing the point.

Showing off moderately complicated results that are actually not indicative of performance because they are sniped by the training data turns this from a cool demo to a parlor trick.

Stating that, aha, joke's on you, that's the status quo, is an even bigger indictment.

Aurornis · 9h ago
> These kinds of comments are really missing the point.

I disagree. In my experience, asking coding tools to produce something similar to all of the tutorials and example code out there works amazingly well.

Asking them to produce novel output that doesn’t match the training set produces very different results.

When I tried multiple coding agents for a somewhat unique task recently they all struggled, continuously trying to pull the solution back to the standard examples. It felt like an endless loop of the models grinding through a solution and then spitting out something that matched common examples, after which I had to remind them of the unique properties of the task and they started all over again, eventually arriving back in the same spot.

It shows the reality of working with LLMs and it’s an important consideration.

phkahler · 9h ago
I find the visual similarity to breakout kind of interesting.
elif · 10h ago
Most likely this comment included countless similar comments in its training data, likely all synthetic without any actual tether to real analysis.
Conflonto · 10h ago
That sounds so dismissive.

Before, I was not able to just download an 8-16GB file that could generate A LOT of different tools, games, etc. for me in multiple programming languages, while in parallel ELI5-ing research papers, generating SVGs and a lot, lot more.

But hey.

alankarmisra · 10h ago
I see the value in showcasing that LLMs can run locally on laptops — it’s an important milestone, especially given how difficult that was before smaller models became viable.

That said, for something like this, I’d probably get more out of simply finding an existing implementation on github or the like and downloading that.

When it comes to specialized and narrow domains like Space Invaders, the training set is likely to be extremely small and the model's vector space will have limited room to generalize. You'll get code that is more or less identical to the original source and you also have to wait for it to 'type' the code and the value add seems very low. I would rather ask it to point me to known Space Invaders implementations in language X on github (or search there).

Note that ChatGPT gets very nervous if I put this into GPT to clean up the grammar. It wants very badly for me to stress that LLMs don't memorize and overfitting is very unlikely (I believe neither).

tossandthrow · 10h ago
Interesting, I cannot reproduce these warnings in ChatGPT - though this is something that really interests me, as it represents immense political power to be able to interject such warnings (explicitly, or implicitly via slight reformulations)
lxgr · 7h ago
This raises an interesting question I’ve seen occasionally addressed in science fiction before:

Could today’s consumer hardware run a future superintelligence (or, as a weaker hypothesis, at least contain some lower-level agent that can bootstrap something on other hardware via networking or hyperpersuasion) if the binary dropped out of a wormhole?

bob1029 · 6h ago
This is the premise of all of the ML research I've been into. The only difference is to replace the wormhole with linear genetic programming, neuroevolution, et al. The size of programs in the demoscene is what originally sent me down this path.

The biggest question I keep asking myself - What is the Kolmogorov complexity of a binary image that provides the exact same capabilities as the current generation LLMs? What are the chances this could run on the machine under my desk right now?

I know how many AAA frames per second my machine is capable of rendering. I refuse to believe the gap between running CS2 at 400fps and getting ~100 B/s of UTF-8 text out of an NLP black box is this big.

bgirard · 6h ago
> ~100b/s of UTF8 text out of a NLP black box is this big

That's not a good measure. NP problem solutions are only a single bit, but they are much harder to solve than CS2 frames for large N. If it could solve any problem perfectly, I would pay you billions for just 1b/s of UTF8 text.

bob1029 · 4h ago
> If it could solve any problem perfectly, I would pay you billions for just 1b/s of UTF8 text.

Exactly. This is what compels me to try.

switchbak · 7h ago
This is what I find fascinating. What hidden capabilities exist, and how far could they be exploited? Especially on exotic or novel hardware.

I think much of our progress is limited by the capacity of the human brain, and we mostly proceed via abstraction which allows people to focus on narrow slices. That abstraction has a cost, sometimes a high one, and it’s interesting to think about what the full potential could be without those limitations.

lxgr · 3h ago
Abstraction, or efficient modeling of a given system, is probably a feature, not a bug, given the strong similarity between intelligence and compression and all that.

A concise description of the right abstractions for our universe is probably not too far removed from the weights of a superintelligence, modulo a few transformations :)

xianshou · 5h ago
I initially read the title as "My 2.5 year old can write Space Invaders in JavaScript now (GLM-4.5 Air)."

Though I suppose, given a few years, that may also be true!

dust42 · 2h ago
I tried with Claude Sonnet 4 and it does *not* work. So it looks like GLM-4.5 Air in a 3-bit quant is ahead.

Chat is here: https://claude.ai/share/dc9eccbf-b34a-4e2b-af86-ec2dd83687ea

Claude Opus 4 does work but is far behind Simon's GLM-4.5: https://claude.ai/share/5ddc0e94-3429-4c35-ad3f-2c9a2499fb5d

matt3210 · 19m ago
Is this more than ‘import space invaders; run_space_invaders()’?
pulkitsh1234 · 10h ago
Is there any website to see the minimum/recommended hardware required for running local LLMs? Much like 'system requirements' mentioned for games.
CharlesW · 9h ago
> Is there any website to see the minimum/recommended hardware required for running local LLMs?

LM Studio (not exclusively, I'm sure) makes it a no-brainer to pick models that'll work on your hardware.

svachalek · 8h ago
In addition to the tools other people responded with, a good rule of thumb is that most local models work best* at q4 quants, meaning the memory for the model is a little over half the number of parameters, e.g. a 14B model may be 8GB. Add some more for context, and maybe you want 10GB of VRAM for a 14B model. That will at least put you in the right ballpark for what models to consider for your hardware.

(*best performance/size ratio; generally, if the model easily fits at q4 you're better off going to a higher parameter count than to a larger quant, and vice versa)
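A back-of-envelope version of that rule of thumb (the bits-per-weight and overhead factor are rough assumptions, not exact figures for any particular quant format):

    # Rough VRAM estimate for a ~q4 quant: weights at ~4.5 bits each plus
    # some headroom for KV cache and buffers. Ballpark only.
    def vram_gb(params_billion, bits_per_weight=4.5, overhead=1.25):
        weights_gb = params_billion * bits_per_weight / 8
        return weights_gb * overhead

    print(round(vram_gb(14), 1))  # ~9.8 GB for a 14B model, matching "maybe 10GB of VRAM"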

nottorp · 7h ago
> maybe you want 10gb VRAM for a 14gb model

... or if you have Apple hardware with their unified memory, whatever the assholes soldered in is your limit.

qingcharles · 8h ago
This can be a useful resource too:

https://www.reddit.com/r/LocalLLaMA/

GaggiX · 10h ago
https://apxml.com/tools/vram-calculator

This one is very good in my opinion.

jxf · 10h ago
Don't think it has the GLM series on there yet.
knowaveragejoe · 10h ago
If you have a HuggingFace account, you can specify the hardware you have and it will show on any given model's page what you can run.
pmarreck · 44m ago
I have an M4 Mac with 128GB RAM and I'm currently downloading GLM-4.5-Air-q5-hi-mlx via LM Studio (80GB) and will report back!
stpedgwdgfhgdd · 10h ago
Aside from the fact that Space Invaders from scratch is not representative of real engineering, it will be interesting to see what the business model for Anthropic will be if I can run a solid code generation model on my local machine (no usage tier per hour or week), let’s say, one year from now. At $200 per month for 2 years I can buy a decent Mx with 64GB (or perhaps even 128GB taking residual value into account)
falcor84 · 9h ago
How come it's "not representative of real engineering"? Other than copy-pasting existing code (which is not what an LLM does), I don't see how you can create a Space Invaders game without applying "engineering".
hbn · 9h ago
The prompt was

> Write an HTML and JavaScript page implementing space invaders

It may not be "copy pasting", but it's generating output reconstructed as best it can from its training on Space Invaders source code.

The engineers at Taito who originally developed Space Invaders were not told "make Space Invaders" and then did their best to recall all the source code they'd looked at in their lives to re-type the source code of an existing game. From a logistics standpoint, where the source code already exists and is accessible, you may as well have copy-pasted it and fudged a few things around.

simonw · 9h ago
The source code for original Space Invaders from 1978 has never been published. The closest to that is disassembled ROMs.

I used that prompt because it's the shortest possible prompt that tells the model to build a game with a specific set of features. If I wanted to build a custom game I would have had to write a prompt that was many paragraphs longer than that.

The aim of this piece isn't "OMG look, LLMs can build Space Invaders" - at this point that shouldn't be a surprise to anyone. What's interesting is that my laptop can run a model that is capable of that now.

sarchertech · 8h ago
> The source code for original Space Invaders from 1978 has never been published. The closest to that is disassembled ROMs.

Sure, but that doesn’t impact the OP's point at all, because there are numerous copies of reverse-engineered source code available.

There are numerous copies of the reverse engineered source code already translated to JavaScript in your models training set.

hbn · 4h ago
The discussion I replied to was just regarding whether or not what the LLM did should be considered "engineering"

It doesn't really matter whether or not the original code was published. In fact that original source code on its own probably wouldn't be that useful, since I imagine it wouldn't have tipped the weights enough to be "recallable" from the model, not to mention it was tasked with implementing it in web technologies.

nottorp · 7h ago
> What's interesting is that my laptop can run a model that is capable of that now.

I'm afraid no one cared much about your point :)

You'll only get "OMG look how good LLMs are they'll get us all fired!" comments and "LLMs suck" comments.

This is how it goes with religion...

sharkjacobs · 7h ago
Making a space invaders game is not representative of normal engineering because you're reproducing an existing game with well known specs and requirements. There are probably hundreds of thousands of words describing and discussing Space Invaders in GLM-4.5's training data

It's like using an LLM to implement a red black tree. Red black trees are in the training data, so you don't need to explain or describe what you mean beyond naming it.

"Real engineering" with LLMs usually requires a bunch of up front work creating specs and outlines and unit tests. "Context engineering"

jasonvorhe · 7h ago
Smells like moving the goalposts. What's real engineering going to be in 2028? Implementing Google's infra stack in your homelab?
phkahler · 9h ago
>> Other than copy-pasting existing code (which is not what an LLM does)

I'd like to see someone try to prove this. How many Space Invaders projects exist on the internet? It'd be hard to compare model "generated" code to everything out there looking for plagiarism, but I bet there are lots of snippets pulled in. These things are NOT smart; they are huge and articulate information repositories.

simonw · 9h ago
Go for it. https://www.google.com/search?client=firefox-b-1-d&q=github+... has a bunch of results. Here's the source code GLM-4.5 Air spat out for me on my laptop: https://github.com/simonw/tools/blob/main/space-invaders-GLM...

Based on my mental model of how these things work I'll be genuinely surprised if you can find even a few lines of code duplicated from one of those projects into the code that GLM-4.5 wrote for me.

phkahler · 9h ago
So I scanned the beginning of the generated code, picked line 83:

  animation: glow 2s ease-in-out infinite;

stuffed it verbatim into google and found a stack overflow discussion that contained this:

      animation: glow .5s infinite alternate;

in under one minute. Then I found this page of CSS effects:

https://alvarotrigo.com/blog/animated-backgrounds-css/

Another page has examples and contains:

  animation: float 15s infinite ease-in-out;

There is just too much internet to scan for an exact match or a match of larger size.
simonw · 9h ago
That's not an example of copying from an existing Space Invaders implementation. That's an LLM using a CSS animation pattern - one that it's seen thousands (probably millions) of times in the training data.

That's what I expect these things to do: they break down Space Invaders into the components they need to build, then mix and match thousands of different coding patterns (like "animation: glow 2s ease-in-out infinite;") to implement different aspects of that game.

You can see that in the "reasoning" trace here: https://gist.github.com/simonw/9f515c8e32fb791549aeb88304550... - "I'll use a modern design with smooth animations, particle effects, and a retro-futuristic aesthetic."

threeducks · 8h ago
I think LLMs are adapting higher level concepts. For example, the following JavaScript code generated by GLM (https://github.com/simonw/tools/blob/9e04fd9895fae1aa9ac78b8...) is clearly inspired by this C++ code (https://github.com/portapack-mayhem/mayhem-firmware/blob/28e...), but it is not an exact copy.
simonw · 8h ago
This is a really good spot.

That code certainly looks similar, but I have trouble imagining how else you would implement very basic collision detection between a projectile and a player object in a game of this nature.

threeducks · 7h ago
A human would likely have refactored the two collision checks between bullet/enemy and enemyBullet/player in the JavaScript code into its own function, perhaps something like "areRectanglesOverlapping". The C++ code only does one collision check like that, so it has not been refactored there, but as a human, I certainly would not want to write that twice.
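
Something like this, to sketch what I mean (not the actual generated code, just the obvious refactor):

  // Hypothetical shared AABB overlap check, reused for both
  // bullet/enemy and enemyBullet/player collisions.
  function areRectanglesOverlapping(a, b) {
    return a.x < b.x + b.width &&
           a.x + a.width > b.x &&
           a.y < b.y + b.height &&
           a.y + a.height > b.y;
  }

  // e.g. if (areRectanglesOverlapping(enemyBullet, player)) { /* Player hit! */ }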

More importantly, it is not just the collision check that is similar. Almost the entire sequence of operations is identical on a higher level:

    1. enemyBullet/player collision check
    2. same comment "// Player hit!" (this is how I found the code)
    3. remove enemy bullet from array
    4. decrement lives
    5. update lives UI
    6. (createParticle only exists in JS code)
    7. if lives are <= 0, gameOver
ben_w · 9h ago
So, your example of it copying snippets is… using the same API with fairly different parameters in a different order?
falcor84 · 9h ago
The parent said

> find even a few lines of code duplicated from one of those projects

I'm pretty sure they meant multiple lines copied verbatim from a single project implementing space invaders, rather than individual lines copied (or likely just accidentally identical) across different unrelated projects.

sejje · 7h ago
Is this some kind of joke?

That's how you write CSS. The examples aren't the same at all; they just use the same CSS feature.

It feels like you aren't a coder--you've sabotaged your own point.

ben_w · 9h ago
Sorites paradox. Where's the distinction between "snippet" and "a design pattern"?

Compressing a few petabytes down to a few gigabytes means the models can't possibly be storing verbatim copies of everything they're accused of simply copy-pasting, from code to newspaper articles to novels. There's not enough space.
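
Back-of-the-envelope, using made-up round numbers for "a few petabytes" and "a few gigabytes":

  // Ratio of training text to weight storage. Lossless text compressors
  // manage something like 5:1, so a ~1,000,000:1 ratio can't be verbatim storage.
  const trainingBytes = 3e15; // assumed "a few petabytes"
  const weightBytes = 3e9;    // assumed "a few gigabytes"
  console.log(trainingBytes / weightBytes); // 1e6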

dmortin · 9h ago
" it will be interesting to see what the business model for Anthropic will be if I can run a solid code generation model on my local machine "

Most people won't bother buying powerful hardware for this; they will keep using SaaS solutions, so Anthropic could be in trouble if cheaper SaaS solutions come out.

qingcharles · 8h ago
The frontier models are always going to tempt you with their higher quality and quicker generation, IMO.
kasey_junk · 8h ago
I've been mentally mapping the models to the history of databases.

In the early days you had to pay for most databases. There are still paid databases that are just better than the ones you don't pay for. Some teams think that the cost is worth the improvements, and there is a (tough) business there. Fortunes were made in the early days.

But eventually open source databases became good enough for many use cases, and they have their own advantages. So lots of teams use them.

I think coding models might have a similar trajectory.

qingcharles · 8h ago
You make a good point -- a majority of applications are now using open source or free versions[1] of DBs.

My only feedback is: are these the same animal? Can we compare an O/S DB vs. paid/closed DB to me running an LLM locally? The biggest issue right now with LLMs is simply the cost of the hardware to run one locally, not the quality of the actual software (the model).

[1] e.g. SQL Server Express is good enough for a lot of tasks, and I guess would be roughly equivalent to the upcoming open versions of GPT vs. the frontier version.

qcnguy · 8h ago
A majority of apps nowadays are using proprietary forks of open source DBs running in the cloud, where their feature set is (slightly) rounded out and smoothed off by the cloud vendors.

Not that many projects are running a fully self-hosted RDBMS at this point. So ultimately proprietary databases still win out; they just (ab)use the PostgreSQL trademark to make people think they're using open source.

LLMs might go the same way. The big clouds offering proprietary fine tunes of models given away by AI labs using investor money?

qingcharles · 7h ago
That's definitely true. I could see more of a "running open source models on other people's hardware" model.

I dislike running local LLMs right now because I find the software kinda janky still: you often have to tweak settings, find the right model files, and basically hold a bunch of domain knowledge I don't have space for in my head. That's on top of maintaining a high-spec piece of hardware and paying for the power costs.

zarzavat · 6h ago
Closed doesn't always win over open. People said the same thing about Windows vs Linux, but even Microsoft was forced to admit defeat and support Linux.

All it takes is some large companies commoditizing their complements. For Linux it was Google, etc. For AI it's Meta and China.

The only thing keeping Anthropic in business is geopolitics. If China were allowed full access to GPUs, they would probably die.

rafaelmn · 9h ago
What about power usage and supporting hardware? Also, a card going down means you are down until you get warranty service.
skeezyboy · 8h ago
why are you doing anything locally then?
tptacek · 9h ago
OK, go write Space Invaders by hand.
LandR · 7h ago
I'd hope most professional software engineers could do this in an afternoon or so?
sejje · 7h ago
Most professional software engineers have never written a game and don't do web work, so I somehow doubt that.
anthk · 7h ago
With Tcl/Tk it's a matter of less than 2 hours.
indigodaddy · 9h ago
Did pretty well with a boggle clone. I like that it tries to do a single html file (I didn't ask for that but was pleasantly surprised). It didn't include dictionary validation so needed a couple of prompts. Touch selection on mobile isn't the greatest but I've seen plenty worse

https://chat.z.ai/space/z0gcn6qtu8s1-art

https://chat.z.ai/s/74fe4ddc-f528-4d21-9405-0a8b15a96520

Keyframe · 8h ago
I went the other route with a Tetris clone the other day. It's definitely not a single prompt. It took me a solid 15 hours to get to this stage, and most of that was me thinking. BUT, except for one small trivial thing (a Space Invaders logo in a pre tag), I haven't touched the code - just looked at it. I made it mandatory for myself to see if I could first greenfield myself into this project and then brownfield features and fixes. It's definitely a ton of work on my end, but it's also not something I'd be able to do in ~2 working days or less. As a cherry on top, even though it's still not done yet, I put in AI-generated music singing about the project itself. https://www.susmel.com/stacky/

Definitely a ton of things I learned about how to "develop" "with" AI along the way.

JKCalhoun · 9h ago
Cool — if only diagonals were easier. ;-) (Hopefully I'm being constructive here.)
indigodaddy · 9h ago
Yep I tried to have it improve that but actually didn't use the word 'diagonal' in the prompt. I bet it would have done better if I had..
indigodaddy · 8h ago
Had it try to improve Diagonal selection but didn't seem to help much

https://chat.z.ai/space/b01dc65rg2p0-art

maksimur · 8h ago
A $xxxx 2.5 year old laptop, one that's probably much more powerful than an average laptop bought today and probably next year as well. I don't think it's a fair reference point.
bprew · 8h ago
His point isn't that you can run a model on an average laptop, but that the same laptop can still run frontier models.

It speaks to the advancements in models that aren't just throwing more compute/ram at it.

Also, his laptop isn't that fancy.

> It claims to be small enough to run on consumer hardware. I just ran the 7B and 13B models on my 64GB M2 MacBook Pro!

From: https://simonwillison.net/2023/Mar/11/llama/

parsimo2010 · 8h ago
The article is pretty good overall, but the title did irk me a little. I assumed when reading "2.5 year old" that it was fairly low-spec, only to find out it was an M2 MacBook Pro with 64 GB of unified memory, so it can run models bigger than what an Nvidia 5090 can handle.

I suppose that it could be intended to be read as "my laptop is only 2.5 years old, and therefore fairly modern/powerful" but I doubt that was the intention.

simonw · 8h ago
The reason I emphasize the laptop's age is that it is the same laptop I have been using ever since the first LLaMA release.

This makes it a great way to illustrate how much better the models have got without requiring new hardware to unlock those improved abilities.

nh43215rgb · 4h ago
About $3700 laptop...
ddtaylor · 8h ago
My brain is running legacy COBOL and first read this as

> My 2.5 year old with their laptop can write Space Invaders

For a few hundred milliseconds there I was thinking "these damn kids are getting good with tablets"

Imustaskforhelp · 8h ago
Don't worry I guess my brain is running bleeding edge typescript with react (I am in high school for context) and the first time I also read it this way...

But I am without my glasses, but still I have hackernews at 250%, I think I am a little cooked lol.

OldfieldFund · 8h ago
We are all cooked at this point :)
petercooper · 9h ago
I ran the same experiment on the full size model. It used a custom 80s style font (from Google Fonts) and gave 'eyes' and more differences to the enemies but otherwise had a similar vibe to Simon's. An interesting visual demonstration of what quantization does though! Screenshot: https://peterc.org/img/aliens.png
efitz · 10h ago
I missed the word “laptop” in the title at first glance and thought this was a “I taught my toddler to code” article.
below43 · 1h ago
Same here. Pretty impressive LLM.
juliangoetze · 9h ago
I thought I was the only one.
joelthelion · 10h ago
Apart from using a Mac, what can you use for inference with reasonable performance? Is a Mac the only realistic option at the moment?
reilly3000 · 9h ago
The top 3 approaches I see a lot on r/localllama are:

1. 2-4x 3090+ Nvidia cards. Some are getting Chinese 48GB cards. There is a ceiling to VRAM that prevents the biggest models from loading, but most can run most quants at great speeds

2. Epyc servers running CPU inference with lots of RAM at as high of memory bandwidth as is available. With these setups people are getting like 5-10 t/s but are able to run 450B parameter models.

3. High RAM Macs with as much memory bandwidth as possible. They are the best balanced approach and surprisingly reasonable relative to other options.

badsectoracula · 5h ago
An Nvidia GPU is the most common answer, but personally I've done all my LLM use locally using mainly Mistral Small 3.1/3.2-based models and llama.cpp with an AMD RX 7900 XTX GPU. It only gives you ~4.71 tokens per second, but that is fast enough for a lot of uses. For example, last month or so I wrote a raytracer[0][1] in C with Devstral Small 1.0 (based on Mistral Small 3.1). It wasn't "vibe coding" so much as a "co-op" where I'd go back and forth with a chat interface (koboldcpp): I'd ask the LLM to implement some feature, then switch to the editor and start writing code using that feature while the LLM was generating it in the background. Or, more often, I'd fix bugs in the LLM's code :-P.

FWIW, GPU aside, my PC isn't particularly new - it is a 5-6 year old PC that was the cheapest money could buy originally, became "decent" when I upgraded it ~5 years ago, and only got the GPU around Christmas as prices were dropping since AMD was about to release its new GPUs.

[0] https://i.imgur.com/FevOm0o.png

[1] https://app.filen.io/#/d/e05ae468-6741-453c-a18d-e83dcc3de92...

AlexeyBrin · 9h ago
A gaming PC with an NVIDIA 4090/5090 will be more than adequate for running local models.

Where a Mac may beat the above is on the memory side, if a model requires more than 24/32 GB of GPU memory you are usually better off with a Mac with 64/128 GB of RAM. On a Mac the memory is shared between CPU and GPU, so the GPU can load larger models.

regularfry · 9h ago
This one should just about fit on a box with an RTX 4090 and 64GB RAM (which is what I've got) at q4. Don't know what the performance will be yet. I'm hoping for an unsloth dynamic quant to get the most out of it.
weberer · 8h ago
What's important is VRAM, not system RAM. The 4090 has 16GB of VRAM so you'll be limited to smaller models at decent speeds. Of course, you can run models from system memory, but your tokens/second will be orders of magnitude slower. ARM Macs are the exception since they have unified memory, allowing high bandwidth between the GPU and the system's RAM.
throwaway0123_5 · 1h ago
iirc 4090s have 24GB
whimsicalism · 9h ago
you are almost certainly better off renting GPUs, but i understand self-hosting is an HN touchstone
mrinterweb · 8h ago
I don't know about that. I've had my RTX 4090 for nearly 3 years now. Say I had a script that provisioned and deprovisioned a rented 4090 at $0.70/hr for an 8-hour work day, 20 work days per month, assuming 2 paid weeks off per year plus normal holidays, over 3 years:

0.7 * 8 * ((20 * 12) - 8 - 14) * 3 = $3662

I bought my RTX 4090 for about $2200. I also had the pleasure of being able to use it for gaming when I wasn't working. To be fair, the VRAM requirements for local models keep climbing and my 4090 isn't able to run many of the latest LLMs. Also, I omitted the cost of electricity for my local LLM server; I have not been measuring the total watts consumed by just that machine.
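
Roughly, the breakeven (using my numbers above, still ignoring electricity and resale value) works out to something like:

  // Rented 4090 at $0.70/hr, 8h/day, ~218 work days/year,
  // vs. buying one outright for ~$2200.
  const hourlyRate = 0.7;
  const hoursPerYear = 8 * (20 * 12 - 8 - 14); // 1744
  const rentalPerYear = hourlyRate * hoursPerYear; // ~$1221
  const purchasePrice = 2200;
  console.log(purchasePrice / rentalPerYear); // ~1.8 years to break even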

One nice thing about renting is that it gives you flexibility in terms of what you want to try.

If you're really looking for the best deals, look at third-party hosts serving open models with API-based pricing, or honestly a Claude subscription can easily be worth it if you use LLMs a fair bit.

whimsicalism · 7h ago
1. I agree - there are absolutely scenarios in which it can make sense to buy a GPU and run it yourself. If you are managing a software firm with multiple employees, you very well might break even in less than a few years. But I would wager this is not the case for 90%+ of people self-hosting these models, unless they have some other good reason (like gaming) to buy a GPU.

2. I basically agree with your caveats - excluding electricity is a pretty big exclusion, and I don't think you've had 3 years of really high-value self-hostable models; I would really only say the last year, and I'm somewhat skeptical of how good the ones that fit in 24GB of VRAM are. 4x4090 is a different story.

qingcharles · 8h ago
This. Especially if you just want to try a bunch of different things out. Renting is insanely cheap -- to the point where I don't understand how the people renting out the hardware are making their money back unless they stole the hardware and power.

It can really help you figure a ton of things out before you blow the cash on your own hardware.

4b11b4 · 8h ago
Recommended sites to rent from
whimsicalism · 8h ago
runpod, vast, hyperbolic, prime intellect. if all you're doing is going to be running LLMs, you can pay per token on openrouter or some of the providers listed there
doormatt · 8h ago
runpod.io
thenaturalist · 9h ago
This guy [0] does a ton of in-depth hardware comparison/benchmarking, including against Mac mini clusters and an M3 Ultra.

0: https://www.youtube.com/@AZisk

h-bradio · 8h ago
Thanks so much for this! I updated LM Studio, and it picked up the mlx-lm update required. After a small tweak to tool-calling in the prompt, it works great with Zed!
torarnv · 7h ago
Could you describe the tweak you did, and possibly the general setup you have with zed working with LM Studio? Do you use a custom system prompt? What context size do you use? Temperature? Thanks!
slimebot80 · 1h ago
(novice question)

64GB is pure RAM? I thought Apple Silicon was efficient at paging to SSD as memory storage - how important is RAM if you've got a fast SSD?

nicce · 1h ago
Memory speed is the most important factor with LLMs, and an SSD is very slow compared to RAM.
simonw · 5h ago
There's a new model from Qwen today - Qwen3-30B-A3B-Instruct-2507 - that also runs comfortably on my Mac (using about 30GB of RAM with an 8bit quantization).

I tried the "Write an HTML and JavaScript page implementing space invaders" prompt against it and didn't quite get a working game with a single shot, but it was still an interesting result: https://simonwillison.net/2025/Jul/29/qwen3-30b-a3b-instruct...

aplzr · 9h ago
I really like talking to Claude (free tier) instead of using a search engine when I stumble upon a random topic that interests me. For example, this morning I had it explain the differences between pass by value, pass by reference, and pass by sharing, the last of which I wasn't aware of until then.

Is this kind of thing also possible with one of these self-hosted models in a comparable way, or are they mostly good for coding?

andai · 6h ago
I got almost the same result with a 4B model (Qwen3-4B), about 20x smaller than OP's ~200B model.

https://jsbin.com/lejunenezu/edit?html,output

Its pelican was a total fail though.

andai · 6h ago
Update: It failed to make Flappy Bird though (several attempts).

This surprises me, I thought it would be simpler than Space Invaders.

Aurornis · 9h ago
This is very cool. The blog post describes running it from the main branch of the mlx-lm library with a custom script. Can someone up to date on the local LLM tools let us know which mainstream tools we should be watching for an easier way to run this on MLX? The space moves so fast that it's hard to keep up.
simonw · 9h ago
I expect LM Studio will have this pretty soon - I imagine they are waiting on the next stable release of mlx-lm which will include the change I needed to get this to work.
lherron · 8h ago
With the Anthropic rug pull on quotas for Max, I feel the short-mid term value sweet spot will be a Frankensteined together “Claude as orchestrator/coder, falling back to local models as quota limits approach” tool suite.
4b11b4 · 8h ago
Was thinking this one might backfire on Anthropic in the end...

People are going to explore and get comfortable with alternatives.

There may have been other ways to deal with the cases they were worried about.

skeezyboy · 8h ago
But aren't we still decades away from running our own video-creating AIs locally? Have we plateaued with this current generation of techniques?
svachalek · 8h ago
It's more a question of, how long do you want it to take to create a video locally?
skeezyboy · 8h ago
nah, i definitely want to know what i asked
sejje · 7h ago
His answer implies you can run them locally now, just not in a useful timeframe.
another_one_112 · 7h ago
Crazy to think that you can have a mostly-competent oracle even when disconnected from the grid.
accrual · 8h ago
Very impressive model! The SVG pelican designed by GLM 4.5 in Simon's adjacent article is the most accurate I've seen yet.
4b11b4 · 8h ago
Quick, someone knit a quilt with all the different SVG pelicans
neutronicus · 10h ago
If I understand correctly, the author is managing to run this model on a laptop with 64GB of RAM?

So a home workstation with 64GB+ of RAM could get similar results?

simonw · 10h ago
Only if that RAM is available to a GPU, or you're willing to tolerate extremely slow responses.

The neat thing about Apple Silicon is the system RAM is available to the GPU. On most other systems you would need ~48GB of VRAM.

xrd · 9h ago
Aren't there non-macOS laptops which also support sharing VRAM and regular RAM, i.e. an iGPU?

https://www.reddit.com/r/GamingLaptops/comments/1akj5aw/what...

I personally want to run Linux and feel like I'll get a better price/GB that way. But it is confusing to know how local models will actually work on those and what the drawbacks of an iGPU are.

mft_ · 6h ago
iGPUs are typically weak, and/or aren't capable of running the LLM so the CPU is used instead. You can run things this way, but it's not fast, and it gets slower as the models go up in size.

If you want things to run quickly, then aside from Macs, there's the 2025 ASUS Flow z13 which (afaik) is the only laptop with AMD's new Ryzen Max+ 395 processor. This is powerful and has up to 128GB of RAM that can be shared with the GPU, but they're very rare (and Mac-expensive) at the moment.

The other variable for running LLMs quickly is memory bandwidth; the Max+ 395 has 256GB/s, which is similar to the M4 Pro; the M4 Max chips are considerably higher. Apple fell on their feet on this one.
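
A crude way to see why bandwidth matters: every generated token has to read the active weights from memory once, so bandwidth divided by model size gives a rough ceiling on decode speed (real speeds come in below this, and the 0.57 bytes/parameter here is just a q4 assumption):

  // Upper bound on tokens/second: bandwidth / bytes read per token.
  // For a dense model that's the whole quantized model; for a MoE,
  // only the active parameters count.
  function maxTokensPerSecond(bandwidthGBs, activeParamsBillions, bytesPerParam = 0.57) {
    return bandwidthGBs / (activeParamsBillions * bytesPerParam);
  }

  console.log(maxTokensPerSecond(256, 14)); // ~32 tok/s ceiling for a q4 14B dense model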

NitpickLawyer · 10h ago
> So a home workstation with 64GB+ of RAM could get similar results?

Similar in quality, but CPU generation will be slower than what Macs can do.

What you can do with MoEs (GLMs and Qwens) is to run some experts (the shared ones usually) on a GPU (even a 12GB/16GB will do) and the rest from RAM on CPU. That will speed things up considerably (especially prompt processing). If you're interested in this, look up llama.cpp and especially ik_llama, which is a fork dedicated to this kind of selective offloading of experts.

0x457 · 5h ago
You can run it, it will just run on the CPU and will be pretty slow. Macs, like everyone in this thread said, use unified memory, so it's 64GB shared between CPU and GPU, while for you it's just 64GB for the CPU.
simlevesque · 10h ago
Not so sure. The MBP uses unified memory; the RAM is shared between the CPU and GPU.

Your 64GB workstation doesn't share its RAM with your GPU.

lynndotpy · 10h ago
The laptop has "unified RAM", so that's like 64GB of VRAM.
msikora · 6h ago
With a 48GB MacBook Pro M3 I'm probably out of luck, right?
simonw · 6h ago
For this particular model, yes.

This new one from Qwen should fit though - it looks like that only needs ~30GB of RAM: https://huggingface.co/lmstudio-community/Qwen3-30B-A3B-Inst...

omneity · 5h ago
It takes ~17-20GB on Q4 depending on context length & settings (running it as we speak)

~30GB in Q8 sure, but it's a minimal gain for double the VRAM usage.

__mharrison__ · 9h ago
Time to get a new laptop. My MBP only has 16 gigs.

Looking forward to trying this with Aider.

joshstrange · 10h ago
My next MBP is going to need the next size up SSD (RIP bank account) so it can hold all the models I want to play with locally and my data. Thankfully I already have been maxing out the RAM so that isn't something new I also have to do.
asadm · 8h ago
How good is this model at tool calling?
bradly · 10h ago
I appreciate you sharing both the chat log and the full source code. I would be interested to see a followup post on how adding a moderately-sized feature like a high score table goes.

Also, IANAL, but Space Invaders is owned IP. I have no idea about the legality of a blog post describing the steps to recreate and release an existing game, but I've seen headlines on HN about engineers in trouble for things I would not expect to be problematic. Maybe Space Invaders is in Q-tip/Band-Aid territory at this point, but if this were Zelda instead of Space Invaders, I could see things being more dicey.

sowbug · 9h ago
It doesn't infringe any kind of intellectual property.

This isn't copyright infringement; it isn't based on the original assembly code or artwork. A game concept can't be copyrighted. Even if one of SI's game mechanics were patented, it would have long expired. Trade secret doesn't apply in this situation.

That leaves trademark. No reasonable person would be confused whether Simon is trying to pass this creation off as a genuine Space Invaders product.

9rx · 7h ago
> No reasonable person would be confused whether Simon is trying to pass this creation off as a genuine Space Invaders product.

There may be no reasonable confusion, but trademark holders also have to protect against dilution of their brand, if they want to retain their trademark. With use like this, people might come to think of Space Invaders as a generic term for all games of this type, not the brand of a specific game.

(there is a strong case to be made that they already do, granted)

Joker_vD · 10h ago
> Space Invaders is owned IP

So is Tetris. And I believe that Snake is also an owned IP although I could be wrong on this one.

lifestyleguru · 9h ago
> my 2.5 year old laptop (a 64GB MacBook Pro M2) i

My MacBook has 16GB of RAM and is from a period when everyone was fiercely insisting that the 8GB base model was all I'd ever need.

tracker1 · 9h ago
I'm kind of with you... While I've run 128GB on my desktop, and am currently at 96GB with DDR5 prices being what they are, that's far less common for typical laptops. I'm a bit curious how the Ryzen 395+ with 128GB will handle some of these models. The 200GB options feel completely out of reach.
polynomial · 8h ago
At first I read this as "My 2.5 year old can write Space Invaders in JavaScript now"
dcchambers · 8h ago
Amazing. There really is no secret sauce that the frontier models have.
sneak · 9h ago
What is the SOTA for benchmarking all of the models you can run on your local machine vs a test suite?

Surely this must exist, no? I want to generate a local leaderboard and perhaps write new test cases.

anthk · 10h ago
Writing a Z80 emulator to run the original Space Invaders ROM will make you more fulfilled.

Either with SDL2+C, or even Tcl/Tk, or Python with Tkinter.

chickenzzzzu · 10h ago
"2.5 year old laptop" is potentially the most useless way of describing a 64GB M2, as it could be confused with virtually any other configuration of laptop.
simonw · 10h ago
The thing I find most notable here is that this is the same laptop I've used to run every open weights model since the original LLaMA.

The models have got so much better without me needing to upgrade my hardware.

chickenzzzzu · 10h ago
That's great! Why can't we say that instead?

No need to overly quantize our headlines.

"64GB M2 makes Space Invaders-- can be bought for under $xxxx"

OJFord · 10h ago
I think the point is just that it doesn't require absolute cutting edge nor server hardware.
jphoward · 10h ago
No but 64 GB of unified memory provides almost as much GPU RAM capacity as two RTX 5090s (only less due to the unified nature) - top of the range GPUs - so it's a truly exceptional laptop in this regard.
turnsout · 10h ago
Except that it is not exceptional at all; it's an older-generation MacBook Pro with 64GB of RAM. There's nothing particularly unusual about it.
jphoward · 9h ago
64 GB of RAM which is addressable by a GPU is exceptional for a laptop - this is not just system RAM.
chickenzzzzu · 8h ago
To emphasize this point further, at least with my efforts, it is not even possible to buy a 64GB M4 Pro right now. 32GB, 64GB, and 128GB are all sold out.

We can say that 64GB addressable by a GPU is not exceptional when compared to 128GB and it still costs less than a month's pay for a FAANG engineer, but the fact that they aren't actually purchasable right now shows that it's not as easy as driving to Best Buy and grabbing one off the shelf.

turnsout · 5h ago
They're not sold out—Apple's configurator (and chip naming) is just confusing. The MacBook Pro with M4 Pro is only available in 24 or 48 GB configurations. To get 64 or 128 GB, you need to upgrade to the M4 Max.

If you're looking for the cheapest way into 64GB of unified memory, the Mac mini is available with an M4 Pro and 64GB at $1999.

So, truly, not "exceptional" unless you consider the price to be exorbitant (it's not, as evidenced by the long useful life of an M-series Mac).

chickenzzzzu · 2h ago
thank you for providing that extra info! i agree that $2000-4000 is not an absolutely earth shattering price, but i still wonder what the benefit one receives is when they say "2.5 year old laptop" instead of "64GB M2 laptop"
turnsout · 5h ago
I understand, but that is not exceptional for a Mac laptop. You could say all Apple Silicon Macs are exceptional, and I guess I agree in the context of the broader PC community. But I would not point at an individual MacBook Pro with 64 GB of RAM and say "whoa, that's exceptional." It's literally just a standard option when you buy the computer. It does bump the price pretty high, but the point of the MBP is to cater to higher-end workflows.
tantalor · 10h ago
It was also something he already had lying around. Did not need to buy something new to get new functionality.
wslh · 7h ago
Here's a sci-fi twist: suppose Space Invaders and similar early games were seeded by a future intelligence. (•_•)⌐■-■
vFunct · 10h ago
please please apple give us a M5 MacBook Pro laptop with 2TB of unified memory please please
bgwalter · 8h ago
The GLM-4.5 model utterly fails at creating ASCII art or factorizing numbers. It can "write" Space Invaders because there are literally thousands of open source projects out there.

This is another example of LLMs being dumb copiers that do understand human prompts.

But there is one positive side to this: if this photocopying business can be run locally, the stocks of OpenAI etc. should go to zero.

simonw · 8h ago
Why would you use an LLM to factorize numbers?
bgwalter · 8h ago
Because we are told that they can solve IMO problems. Yet they fail at basic math problems, not only at factorization but also when probed with relatively basic symbolic math that would not require the invocation of an external program.

Also, you know, if they fail they could say so instead of giving a hallucinated answer. First the models lie and say that factoring a 20-digit number takes vast amounts of computing. Then, if pointed to a factorization program, they pretend to execute it and lie about the output.

There is no intelligence or flexibility apart from stealing other people's open source code.

simonw · 8h ago
That's why the IMO results were so notable: that was one of those moments where new models were demonstrated doing something that they had previously been unable to do.
ducktective · 8h ago
I can't fathom why more people aren't talking about the IMO story. Apparently the model they used is not just an LLM; some RL is involved too. If a model wins gold at the IMO, is it still merely a "statistical parrot"?
sejje · 7h ago
Stochastic parrot is the term.

I don't think it's ever been accurate.

bgwalter · 4h ago
The results were private and the methodology was not revealed. Even Tao, who was bullish on "AI", is starting to question the process.
simonw · 3h ago
The same thing has also been achieved by a Google DeepMind team and at least one group of independent researchers using publicly available models and careful prompting tricks.
deadbabe · 9h ago
You can overtrain a neural network to write a space invaders clone. The final weights might take up less disk space than the output code.
amelius · 10h ago
Wake me up when I can apt-get install the llm.
Kurtz79 · 8h ago
You can install Ollama with a script fetched with curl and run an LLM with a grand total of two bash commands (including the curl one).
croes · 10h ago
I bet the training data included enough Space Invaders clones in JS
jplrssn · 10h ago
I also wouldn't be surprised if labs were starting to mix a few pelican SVGs into their training data.
diggan · 10h ago
Even "accidentally", it makes sense that "SVGs of pelicans riding bikes" are now included in datasets used for training, as the idea has spread like wildfire on the internet, making it less useful as a simple benchmark.

This is why I keep all my benchmarks private and don't share anything about them publicly; as soon as you write about them anywhere public, they'll stop being useful within a few months.

toyg · 10h ago
> This is why I keep all my benchmarks private

This is also why, if I were an artist or anyone commercially relying on creative output of any kind, I wouldn't be posting anything on the internet anymore, ever. The minute you make anything public, the engines will clone it to death and turn it into a commodity.

debugnik · 9h ago
That makes it so much harder to show art to people and market yourself though.

I considered experimenting with web DRM for art sites/portfolios, on the assumption that scrapers won't bother with the analog loophole (and dedicated art-style cloners would hopefully be disappointed by the quality), but gave up because of limited compatible devices for the strongest DRM levels, and HDCP being broken on those levels anyway. If the DRM technique caught on, it would take attackers, at most, a few bucks and a few hours, once, to bypass it, and I don't think users would truly understand that upfront.

__mharrison__ · 9h ago
Somewhat defeats the purpose of being an artist, doesn't it?
toyg · 9h ago
Defeating the purpose of creating almost anything, really.

AI is definitely breaking the whole "labor for money" architecture of our world.

zhengyi13 · 8h ago
Eeeehhhh.

Maybe the thing to do is provide public, physical exhibits of your art in search of patronage.

simonw · 10h ago
I'll believe they are doing that when one of the models draws me an SVG that actually looks like a pelican.
__mharrison__ · 9h ago
Someone needs to craft a beautiful bike donned by a pelican, throw in some SEO, and see how long it takes a model to replicate it.

Simon probably wouldn't be happy about killing his multi-year evaluation metric though...

simonw · 9h ago
I would be delighted.

My pelican on a bicycle benchmark is a long con. The goal is to finally get a good SVG of a pelican riding a bicycle, and if I can trick AI labs into investing significant effort in cheating on my benchmark then fine, that gets me my pelican!

quantumHazer · 10h ago
SVG benchmarking has been a thing since GPT-4, so probably all the major labs are overfitting on some dataset of SVG images for sure
shermantanktop · 10h ago
How about an SVG of 9.11 pelicans riding bicycles and counting the three Rs in “strawberry”?
gchamonlive · 10h ago
Which would make this disappointing if it were only good at cloning Space Invaders. If it can reproduce all the clones it has ever seen, that would still be an impressive feat.

I just think we should stop and appreciate exactly how awesome language models are. It's compressing and correctly reproducing a lot of data with meaningful context between each token and the rest of the context window. It's still amazing, especially with smaller models like this, because even if it's reproducing a clone, you can still ask questions about it and it should perform reasonably well explaining to you what it does and how you can take it over to further develop that clone.

croes · 8h ago
But that would still be copy and paste with extra steps.

Like all these vibe-coded to-do apps, one of the most common starter projects in programming courses.

It’s great that an AI can do that but it could stall progress if we get limited to existing tools and programs.

jus3sixty · 10h ago
I recently let go of my 2.5 year old vacuum. It was just collecting dust.
falcor84 · 9h ago
Thinking about it, the measure of whether a vacuum is being sufficiently used is probably that the circulation of dust within it over the last year is greater than the circulation of dust on its external boundary over that time period.
pamelafox · 10h ago
Alas, my 3 year old Mac has only 16 GB of RAM, and can barely run a browser without running out of memory. It's a work-issued Mac, and we only get upgrades every 4/5 years. I must be content with 8B-parameter models from Ollama (some of which are quite good, like llama3.1:8b).
dreamer7 · 10h ago
I am able to run Gemma 3 12B on my M1 MBP 16GB. It is pretty good at logic and reasoning!
__mharrison__ · 9h ago
Odd. My MBP has 16 GB and I routinely have 5 browser windows open. Most of them have 5-20 tabs. I'm also routinely running vi, VS Code, and editing videos with DaVinci Resolve without issue.

My only memory issue that I can remember is an OBS memory leak; otherwise these MBPs are incredible hardware. I wish any other company could actually deliver a comparable machine.

pamelafox · 8h ago
I was exaggerating slightly - I think it's some combo of the apps I use: Edge, Teams, Discord, VS Code, Docker. When I get the RAM popup once a week, I typically have to close a few of those, whichever is using the most memory according to Activity Monitor. I've also got very little hard drive space on my machine, about 15 GB free, so that makes it harder for me to download the larger models. I keep trying to clear space, even using CleanMyMac, but I somehow keep filling it up.
e1gen-v · 10h ago
Just download more ram!
GaggiX · 10h ago
Reasoning models like Qwen3 are even better, and they have more options; for example you can choose the 14B model (at the usual Q4_K_M quantization) instead of the 8B model.
pamelafox · 10h ago
Are they quantized more effectively than the non-reasoning models for some reason?
GaggiX · 10h ago
There is no difference; you can choose a 6-bit quantization if you prefer, at which point it's essentially lossless.
larodi · 10h ago
It's probably more correct to say: my 2.5 year old laptop can RETELL Space Invaders. Pretty sure it cannot write a game it has never seen, so you could even say: my old laptop can now do this fancy extraction of data from a smart probabilistic blob, where the original things are retold in new colours and forms :)
simonw · 10h ago
I know these models can build games and apps they've never seen before because I've already observed them doing exactly that time and time again.

If you haven't seen that yourself yet I suggest firing up the free, no registration required GLM-4.5 Air on https://chat.z.ai/ and seeing if you can prove yourself wrong.

uludag · 10h ago
It's unfortunate that the ideas of things to test first are exactly the things more likely to be contained in training data. Hence why the pelican on a bicycle was such a good test, until it became viral.
oceanplexian · 10h ago
So you're saying it works exactly the same way as humans, who copied Space Invaders from Breakout which came out in 1976.
MattRix · 10h ago
No, that would be incorrect; nobody uses "retell" like that.

The impressive thing about these models is their ability to write working code, not their ability to come up with unique ideas. These LLMs actually can come up with unique ideas as well, though I think it’s more exciting that they can help people execute human ideas instead.