Diffusion Language Models Are Super Data Learners

64 points | babelfish | 2 comments | 8/10/2025, 4:04:05 PM | jinjieni.notion.site

Comments (2)

woadwarrior01 · 40m ago
> During inference, generating sequences ranging from 16 to 4096 tokens incurs a 16× to 4700× increase in FLOPs compared to AR baselines.

I wonder why the increase in FLOPs spans such a wide range. Naively, I'd have expected FLOPs to increase linearly with the number of tokens. OTOH, it sort of makes sense, because diffusion models are not autoregressive, as their name suggests.

ckjellqv · 1m ago
My guess is that autoregressive models can use key-value (KV) caching to eliminate most of the FLOPs inside the self-attention block. You can't use KV caching in a diffusion model (because it's not causal), but they sell this as a win anyway because they believe it leads to better reasoning.
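
A minimal back-of-envelope sketch of why the ratio widens with sequence length, assuming one denoising step per generated token, illustrative model dimensions (d_model, n_layers are made up, not the paper's setup), and counting only the big matmuls. With a KV cache, an AR decoder pays a roughly constant cost per new token, while a full-sequence diffusion pass at every step makes total cost grow quadratically with length, so the ratio grows roughly linearly with the number of tokens:

```python
# Illustrative FLOPs comparison: AR decoding with a KV cache vs. a diffusion LM
# that re-runs full-sequence attention at every denoising step.
# All constants below are hypothetical, chosen only to show the scaling trend.

def transformer_flops(tokens_processed: int, context_len: int,
                      d_model: int = 2048, n_layers: int = 24) -> float:
    """Rough FLOPs for one forward pass over `tokens_processed` tokens,
    each attending to `context_len` tokens (attention + MLP matmuls only)."""
    proj = 8 * tokens_processed * d_model ** 2           # QKV + output projections
    attn = 4 * tokens_processed * context_len * d_model  # QK^T and AV matmuls
    mlp = 16 * tokens_processed * d_model ** 2           # 4x-expansion MLP
    return n_layers * (proj + attn + mlp)

def ar_decode_flops(seq_len: int) -> float:
    """AR decoding with a KV cache: one new token per step, attending to the
    growing prefix; previously computed keys/values are reused."""
    return sum(transformer_flops(1, t + 1) for t in range(seq_len))

def diffusion_decode_flops(seq_len: int, num_steps: int) -> float:
    """Diffusion decoding: every denoising step re-processes the full
    (bidirectional) sequence, so there is nothing to cache."""
    return num_steps * transformer_flops(seq_len, seq_len)

for L in (16, 256, 4096):
    ratio = diffusion_decode_flops(L, num_steps=L) / ar_decode_flops(L)
    print(f"L={L:5d}  diffusion/AR FLOPs ratio ~ {ratio:,.0f}x")
```

Under these assumptions the ratio comes out around 16x at 16 tokens and a few thousand x at 4096 tokens, which is the same order of magnitude as the range quoted from the article; fewer denoising steps per token would shrink it proportionally.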