The thing to understand about any model architecture is that there isn't really anything special about one versus another: as long as the process is differentiable, ML can learn it (a minimal sketch below illustrates the point).
You can build an image generator that basically renders each word on one line in an image, and then uses a transformer architecture to morph the image of the words into what the words are describing.
The only big difference is really efficiency, but we are just taking stabs in the dark at this point. There is work Google is doing that is eventually going to result in the optimal model for a given type of task.
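A minimal sketch of the differentiable-implies-learnable point, assuming PyTorch; the toy process below stands in for any differentiable pipeline (rendering, morphing, whatever), since autograd only needs end-to-end differentiability:

    import torch

    # Learnable parameters of an arbitrary differentiable "process".
    params = torch.randn(8, requires_grad=True)

    def process(x, p):
        # Stand-in for any differentiable pipeline, e.g. "render words,
        # then morph the image": the exact function doesn't matter.
        return torch.tanh(x @ p[:4]) * p[4:].sum()

    opt = torch.optim.Adam([params], lr=1e-2)
    x, target = torch.randn(16, 4), torch.randn(16)  # dummy data

    for step in range(1000):
        loss = ((process(x, params) - target) ** 2).mean()
        opt.zero_grad()
        loss.backward()   # gradients flow through the entire process
        opt.step()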
bcherry · 1h ago
"The sculpture is already complete within the marble block, before I start my work. It is already there, I just have to chisel away the superfluous material."
- Michelangelo
porphyra · 1h ago
Meanwhile, if you want diffusion models explained with math for a graduate student, there's Tony Duan's Diffusion Models From Scratch [1].
[1] https://www.tonyduan.com/diffusion/index.html
Notably, Lilian did not explain diffusion models simply. This is a fantastic resource that details how they actually work, but your casual reader is unlikely to develop any sort of understanding from this.
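To give a flavor of the math such resources walk through: the forward noising step has a closed form, and the standard DDPM training objective is a few lines. A rough sketch, assuming PyTorch and a denoising network model(x_t, t) trained to predict the added noise:

    import torch

    T = 1000
    betas = torch.linspace(1e-4, 0.02, T)           # noise schedule
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)   # cumulative product

    def training_loss(model, x0):
        # Closed-form forward process:
        # x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps
        t = torch.randint(0, T, (x0.shape[0],))
        eps = torch.randn_like(x0)
        ab = alpha_bar[t].view(-1, 1, 1, 1)
        x_t = ab.sqrt() * x0 + (1.0 - ab).sqrt() * eps
        # The network learns to predict the noise that was added.
        return ((model(x_t, t) - eps) ** 2).mean()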
kmitz · 2h ago
Thanks, I was looking for an article like this, with a focus on the differences between generative AI techniques.
My guess is that since LLMs and image generation became mainstream at the same time, most people don't have the slightest idea they are based on fundamentally different technologies.
cubefox · 2h ago
It's nice that this contains a comparison between diffusion models that are used for image models, and the autoregressive models that are used for LLMs.
But recently there was a new paper on autoregressive image modelling (it won a NeurIPS 2024 best paper award) that apparently outperforms diffusion models: https://arxiv.org/abs/2404.02905
The innovation is that it doesn't predict individual image patches (like older autoregressive image models) but instead does "next scale" or "next resolution" prediction, generating the image coarse-to-fine (sketched below).
In the past, autoregressive image models did not perform as well as diffusion models, which meant that most image models used diffusion. Now it seems autoregressive techniques have a strict advantage over diffusion models. Another advantage is that they can be integrated with autoregressive LLMs (multimodality), which is not possible with diffusion image models. In fact, the recent GPT-4o image generation is autoregressive according to OpenAI. I wonder whether diffusion models still have a future now.
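As I understand the paper, sampling looks roughly like this; a hand-wavy sketch, where predict_scale is a hypothetical stand-in for the transformer and the finest token map would be decoded to pixels by a VQ decoder:

    # Coarse-to-fine "next scale" autoregression, in the spirit of VAR.
    def generate(predict_scale, scales=(1, 2, 4, 8, 16)):
        token_maps = []
        for s in scales:
            # Unlike patch-by-patch AR, a whole s x s token map is
            # produced per step, conditioned on all coarser maps.
            token_maps.append(predict_scale(token_maps, size=s))
        return token_maps[-1]   # finest map; decode with a VQ-VAE decoder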
jdthedisciple · 30m ago
Not to be that guy, but an article on diffusion models with only one image ... and even that one is just noise?
cubefox · 2h ago
That's a nice high-level explanation: short and easy to understand.