Nit: Diffusion isn't in place of transformers, it's in place of autoregression. Prior diffusion LLMs like Mercury [1] still use a transformer, but there's no causal masking, so the entire input is processed all at once and the output generation is obviously different. I very strongly suspect this is also using a transformer.
[1] https://www.inceptionlabs.ai/introducing-mercury
I’m a bit confused by this statement. Autoregressive LLMs also process the entire input “at once”, otherwise tricks like speculative decoding wouldn’t work. Can you clarify what you mean by this?
mattnewton · 37m ago
Attention over tokens in a diffusion model typically looks like an encoder's, where tokens earlier in the sentence can “see” tokens later in the sentence, attending to their values. Noise is iteratively removed from the entire buffer all at once over a handful of steps.
Versus one step per token in autoregressive models, which only attend to previous tokens.
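A toy way to picture the difference - just the attention masks, nothing specific to Gemini Diffusion (sketch in numpy):

    import numpy as np

    T = 6  # toy sequence length

    # Autoregressive / decoder-style: position i may only attend to positions <= i.
    causal_mask = np.tril(np.ones((T, T), dtype=bool))

    # Diffusion / encoder-style: every position attends to every other position,
    # and the whole buffer is refined together over a handful of denoising steps.
    full_mask = np.ones((T, T), dtype=bool)

    print(causal_mask.astype(int))
    print(full_mask.astype(int))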
jszymborski · 3h ago
Interesting, my mind immediately went to block diffusion [0], but I think you are probably right.
Earlier image diffusion models used U-nets: https://en.wikipedia.org/wiki/U-Net
Many U-net based models, such as Stable Diffusion V1.5, modified the base architecture to include self-attention and cross-attention layers interleaved between convolution layers.
[0] https://m-arriola.com/bd3lms/
shreezus · 23m ago
Is anyone else totally blown away by this? I feel like it’s easily the biggest announcement out of I/O, but it’s been overshadowed by Veo 3 etc.
Diffusion models for code generation are a big deal. If they are using transformers this would likely fall into the DiT bucket (diffusion transformers). I had previously worked on use cases that leveraged U-Net diffusion several years ago and there was quite a bit of interest in hybrid models. I expect to see further leaps in the diffusion space in the near future.
airstrike · 3h ago
That's...ridiculously fast.
I still feel like the best uses of models we've seen to date is for brand new code and quick prototyping. I'm less convinced of the strength of their capabilities for improving on large preexisting content over which someone has repeatedly iterated.
Part of that is because, by definition, models cannot know what is not in a codebase and there is meaningful signal in that negative space. Encoding what isn't there seems like a hard problem, so even as models get smarter, they will continue to be handicapped by that lack of institutional knowledge, so to speak.
Imagine giving a large codebase to an incredibly talented developer and asking them to zero-shot a particular problem in one go, with only moments to read it and no opportunity to ask questions. More often than not, a less talented developer who is very familiar with that codebase will be able to add more value with the same amount of effort when tackling that same problem.
manmal · 12m ago
They could read the whole git history and have all issue tracker tickets in the context, and maybe even recordings from meetings. It remains to be seen though if such large context will yield usable results.
8n4vidtmkvmk · 43m ago
That's not been my experience so far. LLMs are good at mimicking existing code; they don't usually bring in new things when not asked. Sometimes I have to go out of my way to point to other bits of code in the project to copy from, because it hasn't ingested enough of the codebase.
That said, a negative prompt like we have in stable diffusion would still be very cool.
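For reference, negative prompts in image diffusion come from classifier-free guidance: at each denoising step the prediction is pushed away from the negative conditioning. Something analogous could in principle be bolted onto a text-diffusion sampler; here is a rough sketch, where cond_pred and neg_pred are hypothetical per-step model outputs:

    def guided_prediction(cond_pred, neg_pred, scale=3.0):
        # Classifier-free guidance: steer away from the negative-prompt
        # conditioning and toward the positive one.
        return [n + scale * (c - n) for c, n in zip(cond_pred, neg_pred)]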
ec109685 · 2h ago
If you make models fast enough, you can onboard that expert developer instantly and let them reason their way to a solution, especially when given access to RAG.
Over time, I expect models will add more memory and institutional knowledge capture rather than starting from a blank slate each time.
airstrike · 2h ago
I thought of that as I wrote my comment, but I think the infrastructure and glue to make that possible in a consistent, fast and scalable way is still a few years out.
heliophobicdude · 3h ago
I think the lede is being buried. This is a great and fast InstructGPT. This is absolutely going to be used in spell checks, codemods, and code editors.
The instant edits feature can surgically perform text edits fast, without all the extra fluff or unsolicited enhancements.
I copied shadertoys, asked it to rename all variables to be more descriptive and pasted the result to see it still working. I'm impressed.
KingMob · 1h ago
Spell check? Isn't that a well-solved problem at this point?
8n4vidtmkvmk · 42m ago
How does Grammarly exist then? Must be some secret sauce in there.
dleeftink · 1h ago
Solved how? Language is always evolving
never_inline · 27m ago
Google Docs spellcheck has been really good for a few years, even before LLMs.
renjimen · 22m ago
The speed at which this can build makes me think software is soon to become a lot more fluid than our traditional iterative approach allows. Apps could ship minimal and build whatever else they need at the user’s behest.
vFunct · 16m ago
The challenge for LLMs over the next year is to get them to operate on large data sets/code bases with millions/billions of tokens through some kind of distributed hierarchical framework, with each LLM operating on a local set of 20k or whatever subset of tokens.
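One naive shape such a framework could take - purely a sketch, with summarize_with_llm as a hypothetical stand-in for whatever model call you have:

    def summarize_with_llm(text: str, instruction: str) -> str:
        """Hypothetical stand-in for an LLM call; not a real API."""
        raise NotImplementedError

    def hierarchical_pass(chunks: list[str], instruction: str, fan_in: int = 8) -> str:
        # Each worker sees only a ~20k-token chunk; chunk-level summaries are
        # then merged level by level until one digest fits in a single context.
        layer = [summarize_with_llm(chunk, instruction) for chunk in chunks]
        while len(layer) > 1:
            layer = [
                summarize_with_llm("\n\n".join(layer[i:i + fan_in]), instruction)
                for i in range(0, len(layer), fan_in)
            ]
        return layer[0]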
mountainriver · 3h ago
Diffusion is more than just speed. Early benchmarks show it to be better at reasoning and planning, pound for pound, than AR.
This is because it can edit and doesn’t suffer from early token bias.
hansvm · 1h ago
AR doesn't inhibit long planning processes, but some popular, modern instantiations of AR have that flaw. AR in general is critical for learning the right distribution.
mdp2021 · 50m ago
> AR in general is critical for learning the right distribution
Could you please clarify that?
martincsweiss · 3h ago
This is a super interesting claim - can you point to these benchmarks?
mdp2021 · 2h ago
Try these:
# d1: Scaling Reasoning in Diffusion Large Language Models via Reinforcement Learning
https://dllm-reasoning.github.io/
# Beyond Autoregression: Discrete Diffusion for Complex Reasoning and Planning
> Gemini Diffusion’s external benchmark performance is comparable to much larger models, whilst also being faster.
That doesn't necessarily mean that they scale as well as autoregressive models.
jimmyl02 · 1h ago
I think there is no way to tell, and we can only see with more research and time. One nuanced part that might not be clear is that the transformer was a huge part of what made traditional LLMs scale.
With the diffusion transformer and newer architectures, it might be possible that transformers can now be applied to diffusion. Diffusion also has the benefit of being able to "think" with the amount of diffusion steps instead of having to output tokens and then reasoning about them.
I think it's hard to tell exactly where we are headed but it's an interesting research direction especially now that it's somewhat more validated by Google.
vessenes · 3h ago
A claim I believe (or want to), but can you point to any papers about this? I haven’t seen any papers at all, or demos showing a revision step in text diffusion. I’d really like to use one though.
nodja · 2h ago
This is insanely fast. My guess is that the tradeoff here is that the GPUs will always be working at max capacity and there will be minimal compute savings from batching - which, I realize now, is not really a tradeoff.
My only worry is that the diffusion objective will be worse than AR in terms of model capabilities; if that's the case, hopefully multi-token AR models will perform as well as diffusion, or we can use this as a draft model for speculative decoding.
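For anyone unfamiliar, this is the shape of speculative decoding in its simplest greedy form. The propose/verify interfaces are hypothetical, and real implementations accept draft tokens with a rejection-sampling rule rather than exact match:

    def speculative_decode(target_model, draft_model, prompt, k=8, max_new=256):
        tokens = list(prompt)
        while len(tokens) - len(prompt) < max_new:
            draft = draft_model.propose(tokens, k)       # k cheap candidate tokens
            checks = target_model.verify(tokens, draft)  # k+1 greedy tokens, one parallel pass
            accepted = 0
            while accepted < k and draft[accepted] == checks[accepted]:
                accepted += 1
            tokens += draft[:accepted]
            tokens.append(checks[accepted])              # the target always contributes one token
        return tokens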
manmal · 7m ago
This tradeoff will be great for self-hosted LLMs, because they usually don’t need large-scale batching, and less great for cloud providers that do.
mdp2021 · 2h ago
Why do you suspect dLLMs should not match (or surpass) arLLMs in quality? The general idea is that it is easier to treat the output as a structured whole (idea, points, concepts, words - in a tree) that is iteratively refined - that should push in the direction of "proper" quality.
nodja · 2h ago
My intuition is that the harder it is for an LLM to do something during training, the more actual compression/learning will be encoded in its weights. With multi-token/diffusion it becomes much easier to "reward/loss hack" your way through; this won't matter much during pretraining, but I assume a lot of "cheating" will happen in the finetune/RL phase.
hiimshort · 1h ago
I have been wondering about the use of diffusion techniques for text generation, it is nice to see Google release a model that, seemingly, validates some thoughts I had.
Most folks I have seen experimenting with AI are either using a paid service or running high-grade hardware (even if consumer-level). The best I have in my current repertoire is a 5700XT and am not able to upgrade from that yet. The limitation, though, has at least also given some more significant insights into the shortcomings of current models.
Model sizes have gotten quite large and coherence seems to mostly have scaled with the density of a model, leaving the smaller models useful for only smaller tasks. Context size is also extremely important from my experiments with long-running dialogues and agent sessions, but a smaller GPU simply cannot fit a decent model and enough context at the same time. I do wonder if diffusion techniques will allow for a rebalancing of this density-to-coherence connection, letting smaller models produce chunks of coherent text even if limited by context. From my viewpoint it seems it will. Mixed tool call + response outputs also have the potential to be better.
Speed is also another problem I, and everyone else, have had with modern LLMs. The nature of cycling around the input with a new additional output each time is time-consuming. On an older GPU with no AI-specific hardware it is an eternity! Being able to at least track 0-100% progress would be an improvement over the current situation. At the moment one must simply wait for the LLM to decide to stop (or hit the max number of inference tokens). I am hopeful that, even on lower-end GPUs, a diffusion model will perform slightly better.
This does raise several questions, though. If we are processing noise, where does the noise come from? Is there a good source of noise for LLMs/text specifically? Is the entire block sized beforehand, or is it possible to have variable-length responses?
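For what it's worth, in most published text-diffusion work the "noise" is discrete: tokens are randomly masked (or corrupted) rather than perturbed with Gaussian noise, the block length is fixed up front, and shorter answers are handled with padding/end-of-text tokens. That also means progress really is just step / total_steps. A toy sketch of that sampling loop, with model_fill as a hypothetical denoiser that returns predictions and confidences:

    MASK = "<mask>"

    def toy_masked_diffusion(model_fill, length=32, steps=8):
        tokens = [MASK] * length                       # "all noise" = all masks
        for step in range(steps):
            preds, conf = model_fill(tokens)           # hypothetical denoiser call
            masked = [i for i, t in enumerate(tokens) if t == MASK]
            budget = max(1, len(masked) // (steps - step))
            # Commit the most confident positions this step; keep the rest noisy.
            for i in sorted(masked, key=lambda i: -conf[i])[:budget]:
                tokens[i] = preds[i]
            print(f"progress: {100 * (step + 1) // steps}%")
        return tokens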
huevosabio · 2h ago
I am so excited about diffusion language models. They may be the piece we need to make our voice-to-code game mechanic as smooth as we envision it.
Cerebras and Groq are amazing, but the fact that they use custom hardware really limits the ability to finetune or scale. The other route would be an MoE that has barely 0.5b parameters active, but that would be a major undertaking that we can't prioritize at the moment.
---
If anyone at Google/Deepmind reads this, please give us API access.
We are building generative sandbox games. First title is a monster trainer where you get to actually command your creature in realtime, here is an early prototype: https://youtu.be/BOwpLyj2Yqw
EGreg · 11m ago
This is super interesting, and obviously someone would have tried diffusion for text. But I will ask the obvious question… how does it know how many words or even tokens to fill in before it knows what the words will be? Wouldn't it hamstring itself a lot of the time? Can it edit the words later and create more space, or is it kind of stuck with the token positioning, as it would be with parts of an image? It seems very strange. Usually, words are composed in order, the way AR models do it, because they follow a recursive grammar, and this is especially true of computer languages. This is a bit like mad libs, but madder libs. My question is: how could this possibly give better results than AR? It would need to perfectly converge on something with the right grammatical context and semantic meaning, while perfectly predicting early on the number of tokens that will appear between words. Seems like there is some major impedance mismatch.
sagarpatil · 2h ago
Why are you obsessed with Pelicans? What’s your story?
simonw · 2h ago
I'm from the UK originally. On one of my first trips to California I was up on the cliffs in Marin County and a squadron flew by and I was amazed by them - and the Californians I was with were like "yeah, you see them all the time".
Now I live in California and I still can't believe I get to see them here. They're absurd - they don't look like they should be able to fly at all. They're also incredibly pretty, especially in their breeding plumage.
I live in Half Moon Bay, just south of San Francisco, which turns out to be home to the second largest mega-roost of the California Brown Pelican (my favourite kind of pelican) in the world.
We've even rescued two of them (injured birds, we got them in a carrier and took them to the animal rescue place).
They make for a fun theme for all sorts of different AI experiments.
They're also very photogenic - I had a bunch of photos I've taken on my PyCon poster recently (you have to zoom in quite a bit to see them though): https://static.simonwillison.net/static/2025/poster-full-siz...
No need to go as far as California for pelicans!
https://www.royalparks.org.uk/visit/parks/st-jamess-park/pel...
Question for the researchers, can dLLMs be pinned down with a seed? Can they be made 100% deterministic?
hansvm · 1h ago
Yes, as with all of these models. The only architectures which struggle with that feature are those which have a strong "distributed" aspect to their computations, where it can take much more work than programmers typically expect to ensure you're actually performing equivalent computations.
When executing any of them on GPUs or other accelerators though (dLLMs or otherwise), you do have to remain cognizant of chip-specific approximations and deviations from the standard. That can be actual issues on the chip (a famous one comes to mind where some f16 or f32 computation passed through an intermediate, undocumented f8), or it can be issues with how your software compiles to a chip (e.g., (a+b+c)+(x+y+z) is not the same as (a+b)+(c+x)+(y+z) with floats, so you have a lot less freedom to lay out your computations in a way that fits the chip nicely).
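The grouping point is easy to demonstrate in a few lines:

    a, b, c = 1e16, -1e16, 1.0
    x, y, z = 1e16, -1e16, 1.0
    print((a + b + c) + (x + y + z))    # 2.0
    print((a + b) + (c + x) + (y + z))  # 0.0 -- the 1.0s are lost to rounding

So with a fixed seed, fixed batch size, and fixed kernels the output is reproducible; the surprises come when the runtime regroups a reduction between runs.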
refulgentis · 1h ago
Yes
transformi · 3h ago
Interesting to see if Groq hardware can run this diffusion architecture... it would be two orders of magnitude faster than currently known speeds :O
randomgoogler1 · 2h ago
(Disc: Googler but don't have any specific knowledge of this architecture)
My understanding of Groq is that the reason it is fast is that all the weights are kept in SRAM and since the SRAM <-> Compute bandwidth is much faster than HBM <-> Compute bandwidth, you can generate tokens faster (During generation the main bottleneck is just bringing in the weights + KV caches into compute).
If the diffusion models just do multiple unmasked forward passes through a transformer, then the activation * weights computation + (attention computation) will be the bottleneck, which will make each denoising step compute-bound, and there won't be any advantage to storing the weights in SRAM since you can overlap the HBM -> compute transfer with the compute itself.
But my knowledge of diffusion is non-existent, so take this with a truck of salt.
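Back-of-envelope version of that argument, with made-up round numbers (a hypothetical 8B-parameter model on an H100-ish chip; nothing here is specific to Gemini Diffusion):

    params        = 8e9     # hypothetical 8B-parameter model
    bytes_per_w   = 2       # bf16 weights
    hbm_bandwidth = 3e12    # ~3 TB/s, illustrative
    peak_flops    = 1e15    # ~1 PFLOP/s dense bf16, illustrative

    weight_read = params * bytes_per_w / hbm_bandwidth  # per forward pass
    compute_1   = 2 * params / peak_flops               # ~2 FLOPs per parameter per token
    compute_256 = 256 * compute_1                       # denoising a 256-token block

    print(f"weight read per pass : {weight_read * 1e3:.2f} ms")  # ~5.3 ms
    print(f"compute, 1 token     : {compute_1 * 1e3:.3f} ms")    # ~0.016 ms (bandwidth-bound)
    print(f"compute, 256 tokens  : {compute_256 * 1e3:.2f} ms")  # ~4.1 ms (nearing compute-bound)

At batch size 1 the weight read dwarfs the matmul work, which is where SRAM shines; once each step covers a few hundred tokens the matmuls catch up and the SRAM advantage shrinks, as the parent suggests.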
breakyerself · 2h ago
If it's faster does that mean it uses less compute/resources?
nine_k · 2h ago
Or maybe can use as much in a more parallel way?
Tostino · 1h ago
This is something I have been thinking about integrating into a sampler for standard autoregressive LLMs. The idea is to buffer N context tokens from the ongoing autoregressive generation. Then, every K tokens, a section of this buffer (or perhaps the whole buffer) could be processed by a diffusion model, guided by one or more specific commands to operate on that buffered text.
One application I envision for this kind of sampler, leveraging the diffusion model's capabilities, would be to detect and potentially correct instances of post-hoc reasoning within the buffer. The diffusion model could then help ensure that proper causal reasoning chains are established in that segment before the autoregressive model continues generating. You could also allow for slight, controlled backtracking or revision within that buffer window if the generation starts to go off-track, again using the diffusion model to smooth or adjust the text before committing it and moving forward.
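A sketch of what that sampler loop might look like - all model interfaces here are hypothetical, just to make the idea concrete:

    def generate_with_polish(ar_model, diffusion_model, prompt,
                             buffer_size=256, polish_every=64, max_new=1024):
        committed, buffer = list(prompt), []
        while len(committed) + len(buffer) - len(prompt) < max_new:
            buffer.append(ar_model.next_token(committed + buffer))
            if len(buffer) % polish_every == 0:
                # Let the diffusion model revise the buffered span in place,
                # e.g. to repair post-hoc reasoning before it gets committed.
                buffer = diffusion_model.refine(context=committed, span=buffer)
            if len(buffer) >= buffer_size:
                committed += buffer[:polish_every]  # commit the oldest chunk
                buffer = buffer[polish_every:]
        return committed + buffer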
quantadev · 2h ago
Anyone able to summarize the current 'hold up' with diffusion models? I know exactly how Transformers work, but I'm not a diffusion expert. Diffusion is so much more powerful though (from what I know) that it seems like it should already be beating Transformers. Why isn't it?
boroboro4 · 1h ago
Diffusion is about what goes into the model and what comes out (in this case, denoising of the content), as opposed to autoregressive models (where the process is to predict a continuation based on a prefix). It’s orthogonal to model architecture, which can be a transformer or (for example) Mamba. I’m pretty sure Gemini Diffusion is a transformer too.
Diffusion brings a different set of trade-offs, and as you can see it improves speed, but I would expect it to increase the compute required for generation. This is hard to say for sure without knowing their exact sampling process.
Interestingly, we have the opposite direction in the case of GPT-4o: OpenAI made an autoregressive image generation model and it seems to work great.
atq2119 · 7m ago
Diffusion could potentially be more efficient for local inference. With auto-regressive models, token generation is basically one token at a time, and so is not compute intensive at all -- it's bandwidth bound. With diffusion, you always run the model on a decently sized batch of tokens, so you should be (close to) compute bound even for local inference.
If the "output quality per compute" is roughly the same for diffusion and auto-regression (is it? I have no idea...), then diffusion will be much more efficient for local inference because the same amount of compute can be packed into a much shorter time period.
Der_Einzige · 2h ago
If I don't get the ability to (upweight:1.5) and (downweight:0.7) tokens like with Stable Diffusion - it's worthless.
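For what it's worth, the (word:1.5) syntax in Stable Diffusion is commonly implemented by scaling the text-encoder embeddings for those tokens; the nearest equivalent most text models expose is a logit bias, which a sampler could apply at every step whether decoding is autoregressive or diffusion-based. A toy sketch, with token-id keys as a stand-in for a real tokenizer:

    import math

    def apply_token_weights(logits: dict[int, float], weights: dict[int, float]) -> dict[int, float]:
        # Multiplying a token's unnormalized probability by w is the same as
        # adding log(w) to its logit: 1.5 upweights, 0.7 downweights.
        return {tok: lg + math.log(weights.get(tok, 1.0)) for tok, lg in logits.items()}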