TREAD: Token Routing for Efficient Architecture-Agnostic Diffusion Training

30 points by fzliu on 8/18/2025, 5:29:16 PM | 5 comments | arxiv.org

Comments (5)

platers · 2h ago
I'm struggling to understand where the gains are coming from. What is the intuition for why DiT training was so inefficient?
joshred · 2h ago
Here's a high-level explanation of the simplest diffusion setup. Training takes an image and iteratively adds noise to it until only noise remains. Then that sequence of noisier and noisier images is reversed: the model starts from pure noise and predicts the noise to remove at each step until it reaches the final step, which should recover the original image (the training input).

That process means training may require a hundred or more iterations on a single image. I haven't digested the paper, but it sounds like they are proposing something conceptually similar to skip layers (though significantly more involved). A single training step looks roughly like the sketch below.

arjvik · 1h ago
Isn't this just Mixture-of-Depths but for DiTs?

If so, what are the DiT-specific changes that needed to be made? My rough reading of the core idea is sketched below.
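As I understand it, only a random subset of tokens is fed through a span of transformer blocks, and the bypassed tokens are reinserted afterward. A rough PyTorch sketch, where the function name, the layer span, and the keep ratio are all my assumptions rather than the paper's implementation:

    import torch

    def routed_forward(blocks, x, keep_ratio=0.5, route_start=2, route_end=10):
        # x: (batch, tokens, dim). Blocks outside [route_start, route_end)
        # see all tokens; inside that span only a random per-sample subset
        # is processed, and the remaining tokens bypass it unchanged.
        B, N, D = x.shape
        keep = int(N * keep_ratio)
        idx = torch.rand(B, N, device=x.device).argsort(dim=1)[:, :keep]
        idx = idx.unsqueeze(-1).expand(-1, -1, D)

        for i, block in enumerate(blocks):
            if route_start <= i < route_end:
                if i == route_start:
                    full = x                     # save state of bypassed tokens
                    x = torch.gather(x, 1, idx)  # keep only the routed subset
                x = block(x)                     # attention over fewer tokens
                if i == route_end - 1:
                    x = full.scatter(1, idx, x)  # reinsert updated tokens
            else:
                x = block(x)
        return x

With `blocks` a stack of standard transformer blocks, attention cost inside the routed span drops roughly quadratically in keep_ratio, which is where I'd guess most of the training savings come from.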

lucidrains · 1h ago
very nice, will have to try it out! this is the same research group from which Robin Rombach (of Stable Diffusion fame) originated
earthnail · 2h ago
Wow, Ommer’s students never fail to impress. 37x faster training for a generic architecture, i.e. no domain-specific tricks. Insane.