Nit: Diffusion isn't in place of transformers, it's in place of autoregression. Prior diffusion LLMs like Mercury [1] still use a transformer, but there's no causal masking, so the entire input is processed all at once and the output generation is obviously different. I very strongly suspect this is also using a transformer.
[1] https://www.inceptionlabs.ai/introducing-mercury
I’m a bit confused by this statement. Autoregressive LLMs also process the entire input “at once”, otherwise tricks like speculative decoding wouldn’t work. Can you clarify what you mean by this?
mattnewton · 37m ago
Attention over tokens in a diffusion model typically looks like an encoder's, where tokens earlier in the sentence can “see” tokens later in the sentence, attending to their values. Noise is iteratively removed from the entire buffer all at once over a handful of steps.
Versus one step per token in autoregressive models, which only attend to previous tokens.
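A toy way to picture the difference - just the attention masks, nothing specific to Gemini Diffusion (sketch in numpy):

    import numpy as np

    T = 6  # toy sequence length

    # Autoregressive / decoder-style: position i may only attend to positions <= i.
    causal_mask = np.tril(np.ones((T, T), dtype=bool))

    # Diffusion / encoder-style: every position attends to every other position,
    # and the whole buffer is refined together over a handful of denoising steps.
    full_mask = np.ones((T, T), dtype=bool)

    print(causal_mask.astype(int))
    print(full_mask.astype(int))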
jszymborski · 3h ago
Interesting, my mind immediately went to block diffusion [0], but I think you are probably right.
Earlier image diffusion models used U-nets: https://en.wikipedia.org/wiki/U-Net
Many U-net based models, such as Stable Diffusion V1.5, modified the base architecture to include self-attention and cross-attention layers interleaved between convolution layers.
[0] https://m-arriola.com/bd3lms/
shreezus · 23m ago
Is anyone else totally blown away by this? I feel like it’s easily the biggest announcement out of I/O, but it’s been overshadowed by Veo 3 etc.
Diffusion models for code generation are a big deal. If they are using transformers this would likely fall into the DiT bucket (diffusion transformers). I had previously worked on use cases that leveraged U-Net diffusion several years ago and there was quite a bit of interest in hybrid models. I expect to see further leaps in the diffusion space in the near future.
airstrike · 3h ago
That's...ridiculously fast.
I still feel like the best uses of models we've seen to date is for brand new code and quick prototyping. I'm less convinced of the strength of their capabilities for improving on large preexisting content over which someone has repeatedly iterated.
Part of that is because, by definition, models cannot know what is not in a codebase and there is meaningful signal in that negative space. Encoding what isn't there seems like a hard problem, so even as models get smarter, they will continue to be handicapped by that lack of institutional knowledge, so to speak.
Imagine giving a large codebase to an incredibly talented developer and asking them to zero-shot a particular problem in one go, with only moments to read it and no opportunity to ask questions. More often than not, a less talented developer who is very familiar with that codebase will be able to add more value with the same amount of effort when tackling that same problem.
manmal · 12m ago
They could read the whole git history and have all issue tracker tickets in the context, and maybe even recordings from meetings. It remains to be seen though if such large context will yield usable results.
8n4vidtmkvmk · 43m ago
That's not been my experience so far. LLMs are good at mimicking existing code; they don't usually bring in new things when not asked. Sometimes I have to go out of my way to point to other bits of code in the project to copy from, because it hasn't ingested enough of the codebase.
That said, a negative prompt like we have in stable diffusion would still be very cool.
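For reference, negative prompts in image diffusion come from classifier-free guidance: at each denoising step the prediction is pushed away from the negative conditioning. Something analogous could in principle be bolted onto a text-diffusion sampler; here is a rough sketch, where cond_pred and neg_pred are hypothetical per-step model outputs:

    def guided_prediction(cond_pred, neg_pred, scale=3.0):
        # Classifier-free guidance: steer away from the negative-prompt
        # conditioning and toward the positive one.
        return [n + scale * (c - n) for c, n in zip(cond_pred, neg_pred)]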
ec109685 · 2h ago
If you make models fast enough, you can onboard that expert developer instantly and let them reason their way to a solution, especially when given access to RAG.
Over time, I expect models will add more memory and institutional knowledge capture rather than starting from a blank slate each time.
airstrike · 2h ago
I thought of that as I wrote my comment, but I think the infrastructure and glue to make that possible in a consistent, fast and scalable way is still a few years out.
heliophobicdude · 3h ago
I think the lede is being buried. This is a great and fast InstructGPT. This is absolutely going to be used in spell checks, codemods, and code editors.
The instant edits feature can surgically perform text edits fast, without all the extra fluff or unsolicited enhancements.
I copied shadertoys, asked it to rename all variables to be more descriptive and pasted the result to see it still working. I'm impressed.
KingMob · 1h ago
Spell check? Isn't that a well-solved problem at this point?
8n4vidtmkvmk · 42m ago
How does Grammarly exist then? Must be some secret sauce in there.
dleeftink · 1h ago
Solved how? Language is always evolving
never_inline · 27m ago
Google Docs spellcheck has been really good for a few years, even before LLMs.
renjimen · 22m ago
The speed at which this can build makes me think software is soon to become a lot more fluid than our traditional iterative approach allows. Apps could ship minimal and build whatever else they need at the user’s behest.
vFunct · 16m ago
The challenge for LLMs over the next year is to get them to operate on large data sets/code bases with millions/billions of tokens through some kind of distributed hierarchical framework, with each LLM operating on a local set of 20k or whatever subset of tokens.
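One naive shape such a framework could take - purely a sketch, with summarize_with_llm as a hypothetical stand-in for whatever model call you have:

    def summarize_with_llm(text: str, instruction: str) -> str:
        """Hypothetical stand-in for an LLM call; not a real API."""
        raise NotImplementedError

    def hierarchical_pass(chunks: list[str], instruction: str, fan_in: int = 8) -> str:
        # Each worker sees only a ~20k-token chunk; chunk-level summaries are
        # then merged level by level until one digest fits in a single context.
        layer = [summarize_with_llm(chunk, instruction) for chunk in chunks]
        while len(layer) > 1:
            layer = [
                summarize_with_llm("\n\n".join(layer[i:i + fan_in]), instruction)
                for i in range(0, len(layer), fan_in)
            ]
        return layer[0]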
mountainriver · 3h ago
Diffusion is more than just speed. Early benchmarks show it to be better at reasoning and planning, pound for pound, than AR.
This is because it can edit and doesn’t suffer from early token bias.
hansvm · 1h ago
AR doesn't inhibit long planning processes, but some popular, modern instantiations of AR have that flaw. AR in general is critical for learning the right distribution.
mdp2021 · 50m ago
> AR in general is critical for learning the right distribution
Could you please clarify that?
martincsweiss · 3h ago
This is a super interesting claim - can you point to these benchmarks?
mdp2021 · 2h ago
Try these:
# d1: Scaling Reasoning in Diffusion Large Language Models via Reinforcement Learning
https://dllm-reasoning.github.io/
# Beyond Autoregression: Discrete Diffusion for Complex Reasoning and Planning
> Gemini Diffusion’s external benchmark performance is comparable to much larger models, whilst also being faster.
That doesn't necessarily mean that they scale as well as autoregressive models.
jimmyl02 · 1h ago
I think there is no way to tell, and we can only see with more research and time. One nuanced part that might not be clear is that the transformer was a huge part of what made traditional LLMs scale.
With the diffusion transformer and newer architectures, it might be possible that transformers can now be applied to diffusion. Diffusion also has the benefit of being able to "think" with the amount of diffusion steps instead of having to output tokens and then reasoning about them.
I think it's hard to tell exactly where we are headed but it's an interesting research direction especially now that it's somewhat more validated by Google.
vessenes · 3h ago
A claim I believe (or want to), but can you point to any papers about this? I haven’t seen any papers at all, or demos showing a revision step in text diffusion. I’d really like to use one though.
nodja · 2h ago
This is insanely fast. My guess is that the tradeoff here is that the GPUs will always be working at max capacity and there will be minimal compute savings from batching - which, I realize now, is not really a tradeoff.
My only worry is that the diffusion objective will be worse than AR in terms of model capabilities; if that's the case, hopefully multi-token AR models will perform as well as diffusion, or we can use this as a draft model for speculative decoding.
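For anyone unfamiliar, this is the shape of speculative decoding in its simplest greedy form. The propose/verify interfaces are hypothetical, and real implementations accept draft tokens with a rejection-sampling rule rather than exact match:

    def speculative_decode(target_model, draft_model, prompt, k=8, max_new=256):
        tokens = list(prompt)
        while len(tokens) - len(prompt) < max_new:
            draft = draft_model.propose(tokens, k)       # k cheap candidate tokens
            checks = target_model.verify(tokens, draft)  # k+1 greedy tokens, one parallel pass
            accepted = 0
            while accepted < k and draft[accepted] == checks[accepted]:
                accepted += 1
            tokens += draft[:accepted]
            tokens.append(checks[accepted])              # the target always contributes one token
        return tokens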
manmal · 7m ago
This tradeoff will be great for self-hosted LLMs, because they usually don’t need large-scale batching, and less great for cloud providers that do.
mdp2021 · 2h ago
Why do you suspect dLLMs should not match (or surpass) arLLMs in quality? The general idea is that it is easier to treat the output as a structured whole (idea, points, concepts, words - in a tree) that is iteratively refined - that should push in the direction of "proper" quality.
nodja · 2h ago
My intuition is that the harder it is for an LLM to do something during training, the more actual compression/learning will be encoded in its weights. With multi-token/diffusion it becomes much easier to "reward/loss hack" your way through; this won't matter much during pretraining, but I assume a lot of "cheating" will happen in the finetune/RL phase.
hiimshort · 1h ago
I have been wondering about the use of diffusion techniques for text generation, it is nice to see Google release a model that, seemingly, validates some thoughts I had.
Most folks I have seen experimenting with AI are either using a paid service or running high-grade hardware (even if consumer-level). The best I have in my current repertoire is a 5700XT and am not able to upgrade from that yet. The limitation, though, has at least also given some more significant insights into the shortcomings of current models.
Model sizes have gotten quite large and coherence seems to mostly have scaled with the density of a model, leaving the smaller models useful for only smaller tasks. Context size is also extremely important from my experiments with long-running dialogues and agent sessions, but a smaller GPU simply cannot fit a decent model and enough context at the same time. I do wonder if diffusion techniques will allow for a rebalancing of this density-to-coherence connection, letting smaller models produce chunks of coherent text even if limited by context. From my viewpoint it seems it will. Mixed tool call + response outputs also have the potential to be better.
Speed is also another problem I, and everyone else, have had with modern LLMs. The nature of cycling around the input with a new additional output each time is time-consuming. On an older GPU with no AI-specific hardware it is an eternity! Being able to at least track 0-100% progress would be an improvement over the current situation. At the moment one must simply wait for the LLM to decide to stop (or hit the max number of inference tokens). I am hopeful that, even on lower-end GPUs, a diffusion model will perform slightly better.
This does raise several questions, though. If we are processing noise, where does the noise come from? Is there a good source of noise for LLMs/text specifically? Is the entire block sized beforehand, or is it possible to have variable-length responses?
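For what it's worth, in most published text-diffusion work the "noise" is discrete: tokens are randomly masked (or corrupted) rather than perturbed with Gaussian noise, the block length is fixed up front, and shorter answers are handled with padding/end-of-text tokens. That also means progress really is just step / total_steps. A toy sketch of that sampling loop, with model_fill as a hypothetical denoiser that returns predictions and confidences:

    MASK = "<mask>"

    def toy_masked_diffusion(model_fill, length=32, steps=8):
        tokens = [MASK] * length                       # "all noise" = all masks
        for step in range(steps):
            preds, conf = model_fill(tokens)           # hypothetical denoiser call
            masked = [i for i, t in enumerate(tokens) if t == MASK]
            budget = max(1, len(masked) // (steps - step))
            # Commit the most confident positions this step; keep the rest noisy.
            for i in sorted(masked, key=lambda i: -conf[i])[:budget]:
                tokens[i] = preds[i]
            print(f"progress: {100 * (step + 1) // steps}%")
        return tokens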
huevosabio · 2h ago
I am so excited about diffusion language models. They may be the piece we need to make our voice-to-code game mechanic as smooth as we envision it.
Cerebras and Groq are amazing, but the fact that they use custom hardware really limits the ability to finetune or scale. The other route would be an MoE that has barely 0.5b parameters active, but that would be a major undertaking that we can't prioritize at the moment.
---
If anyone at Google/Deepmind reads this, please give us API access.
We are building generative sandbox games. First title is a monster trainer where you get to actually command your creature in realtime, here is an early prototype: https://youtu.be/BOwpLyj2Yqw
EGreg · 11m ago
This is super interesting, and obviously someone would have tried diffusion for text. But I will ask the obvious question… how does it know how many words or even tokens to fill in before it knows what the words will be? Wouldn't it hamstring itself a lot of the time? Can it edit the words later and create more space, or is it kind of stuck with the token positioning, as it would be with parts of an image? It seems very strange. Usually, words are composed in order, the way AR models do it, because they follow a recursive grammar, and this is especially true of computer languages. This is a bit like mad libs, but madder libs. My question is: how could this possibly give better results than AR? It would need to perfectly converge on something with the right grammatical context and semantic meaning, while perfectly predicting early on the number of tokens that will appear between words. Seems like there is some major impedance mismatch.
sagarpatil · 2h ago
Why are you obsessed with Pelicans? What’s your story?
simonw · 2h ago
I'm from the UK originally. On one of my first trips to California I was up on the cliffs in Marin County and a squadron flew by and I was amazed by them - and the Californians I was with were like "yeah, you see them all the time".
Now I live in California and I still can't believe I get to see them here. They're absurd - they don't look like they should be able to fly at all. They're also incredibly pretty, especially in their breeding plumage.
I live in Half Moon Bay, just south of San Francisco, which turns out to be home to the second largest mega-roost of the California Brown Pelican (my favourite kind of pelican) in the world.
We've even rescued two of them (injured birds, we got them in a carrier and took them to the animal rescue place).
They make for a fun theme for all sorts of different AI experiments.
They're also very photogenic - I had a bunch of photos I've taken on my PyCon poster recently (you have to zoom in quite a bit to see them though): https://static.simonwillison.net/static/2025/poster-full-siz...
No need to go as far as California for pelicans!
https://www.royalparks.org.uk/visit/parks/st-jamess-park/pel...
Question for the researchers, can dLLMs be pinned down with a seed? Can they be made 100% deterministic?
hansvm · 1h ago
Yes, as with all of these models. The only architectures which struggle with that feature are those which have a strong "distributed" aspect to their computations, where it can take much more work than programmers typically expect to ensure you're actually performing equivalent computations.
When executing any of them on GPUs or other accelerators though (dLLMs or otherwise), you do have to remain cognizant of chip-specific approximations and deviations from the standard. That can be actual issues on the chip (a famous one comes to mind where some f16 or f32 computation passed through an intermediate, undocumented f8), or it can be issues with how your software compiles to a chip (e.g., (a+b+c)+(x+y+z) is not the same as (a+b)+(c+x)+(y+z) with floats, so you have a lot less freedom to lay out your computations in a way that fits the chip nicely).
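The grouping point is easy to demonstrate in a few lines:

    a, b, c = 1e16, -1e16, 1.0
    x, y, z = 1e16, -1e16, 1.0
    print((a + b + c) + (x + y + z))    # 2.0
    print((a + b) + (c + x) + (y + z))  # 0.0 -- the 1.0s are lost to rounding

So with a fixed seed, fixed batch size, and fixed kernels the output is reproducible; the surprises come when the runtime regroups a reduction between runs.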
refulgentis · 1h ago
Yes
transformi · 3h ago
Interesting to see if Groq hardware can run this diffusion architecture... it would be two orders of magnitude faster than currently known speeds :O
randomgoogler1 · 2h ago
(Disc: Googler but don't have any specific knowledge of this architecture)
My understanding of Groq is that the reason it is fast is that all the weights are kept in SRAM and since the SRAM <-> Compute bandwidth is much faster than HBM <-> Compute bandwidth, you can generate tokens faster (During generation the main bottleneck is just bringing in the weights + KV caches into compute).
If the diffusion models just do multiple unmasked forward passes through a transformer, then the activation * weights computation + (attention computation) will be the bottleneck, which will make each denoising step compute-bound, and there won't be any advantage to storing the weights in SRAM since you can overlap the HBM -> compute transfer with the compute itself.
But my knowledge of diffusion is non-existent, so take this with a truck of salt.
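Back-of-envelope version of that argument, with made-up round numbers (a hypothetical 8B-parameter model on an H100-ish chip; nothing here is specific to Gemini Diffusion):

    params        = 8e9     # hypothetical 8B-parameter model
    bytes_per_w   = 2       # bf16 weights
    hbm_bandwidth = 3e12    # ~3 TB/s, illustrative
    peak_flops    = 1e15    # ~1 PFLOP/s dense bf16, illustrative

    weight_read = params * bytes_per_w / hbm_bandwidth  # per forward pass
    compute_1   = 2 * params / peak_flops               # ~2 FLOPs per parameter per token
    compute_256 = 256 * compute_1                       # denoising a 256-token block

    print(f"weight read per pass : {weight_read * 1e3:.2f} ms")  # ~5.3 ms
    print(f"compute, 1 token     : {compute_1 * 1e3:.3f} ms")    # ~0.016 ms (bandwidth-bound)
    print(f"compute, 256 tokens  : {compute_256 * 1e3:.2f} ms")  # ~4.1 ms (nearing compute-bound)

At batch size 1 the weight read dwarfs the matmul work, which is where SRAM shines; once each step covers a few hundred tokens the matmuls catch up and the SRAM advantage shrinks, as the parent suggests.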
breakyerself · 2h ago
If it's faster does that mean it uses less compute/resources?
nine_k · 2h ago
Or maybe can use as much in a more parallel way?
Tostino · 1h ago
This is something I have been thinking about integrating into a sampler for standard autoregressive LLMs. The idea is to buffer N context tokens from the ongoing autoregressive generation. Then, every K tokens, a section of this buffer (or perhaps the whole buffer) could be processed by a diffusion model, guided by one or more specific commands to operate on that buffered text.
One application I envision for this kind of sampler, leveraging the diffusion model's capabilities, would be to detect and potentially correct instances of post-hoc reasoning within the buffer. The diffusion model could then help ensure that proper causal reasoning chains are established in that segment before the autoregressive model continues generating. You could also allow for slight, controlled backtracking or revision within that buffer window if the generation starts to go off-track, again using the diffusion model to smooth or adjust the text before committing it and moving forward.
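A sketch of what that sampler loop might look like - all model interfaces here are hypothetical, just to make the idea concrete:

    def generate_with_polish(ar_model, diffusion_model, prompt,
                             buffer_size=256, polish_every=64, max_new=1024):
        committed, buffer = list(prompt), []
        while len(committed) + len(buffer) - len(prompt) < max_new:
            buffer.append(ar_model.next_token(committed + buffer))
            if len(buffer) % polish_every == 0:
                # Let the diffusion model revise the buffered span in place,
                # e.g. to repair post-hoc reasoning before it gets committed.
                buffer = diffusion_model.refine(context=committed, span=buffer)
            if len(buffer) >= buffer_size:
                committed += buffer[:polish_every]  # commit the oldest chunk
                buffer = buffer[polish_every:]
        return committed + buffer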
quantadev · 2h ago
Anyone able to summarize the current 'hold up' with diffusion models? I know exactly how Transformers work, but I'm not a diffusion expert. Diffusion is so much more powerful though (from what I know) that it seems like it should already be beating Transformers. Why isn't it?
boroboro4 · 1h ago
Diffusion is about what goes into the model and what comes out (in this case, denoising of the content), as opposed to autoregressive models (where the process is to predict a continuation based on a prefix). It’s orthogonal to model architecture, which can be a transformer or (for example) Mamba. I’m pretty sure Gemini Diffusion is a transformer too.
Diffusion brings a different set of trade-offs, and as you can see it improves speed, but I would expect it to increase the compute required for generation. This is hard to say for sure without knowing their exact sampling process.
Interestingly, we have the opposite direction in the case of GPT-4o: OpenAI made an autoregressive image generation model and it seems to work great.
atq2119 · 7m ago
Diffusion could potentially be more efficient for local inference. With auto-regressive models, token generation is basically one token at a time, and so is not compute intensive at all -- it's bandwidth bound. With diffusion, you always run the model on a decently sized batch of tokens, so you should be (close to) compute bound even for local inference.
If the "output quality per compute" is roughly the same for diffusion and auto-regression (is it? I have no idea...), then diffusion will be much more efficient for local inference because the same amount of compute can be packed into a much shorter time period.
Der_Einzige · 2h ago
If I don't get the ability to (upweight:1.5) and (downweight:0.7) tokens like with Stable Diffusion - it's worthless.
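For what it's worth, the (word:1.5) syntax in Stable Diffusion is commonly implemented by scaling the text-encoder embeddings for those tokens; the nearest equivalent most text models expose is a logit bias, which a sampler could apply at every step whether decoding is autoregressive or diffusion-based. A toy sketch, with token-id keys as a stand-in for a real tokenizer:

    import math

    def apply_token_weights(logits: dict[int, float], weights: dict[int, float]) -> dict[int, float]:
        # Multiplying a token's unnormalized probability by w is the same as
        # adding log(w) to its logit: 1.5 upweights, 0.7 downweights.
        return {tok: lg + math.log(weights.get(tok, 1.0)) for tok, lg in logits.items()}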