From Multi-Head to Latent Attention: The Evolution of Attention Mechanisms

73 points by mgninad | 15 comments | 8/30/2025, 5:45:24 AM | vinithavn.medium.com

Comments (15)

attogram · 4h ago
"Attention Is All You Need" - I've always wondered if the authors of that paper used such a casual and catchy title because they knew it would be groundbreaking and massively cited in the future....
sivm · 1h ago
Attention is all you need for what we have. But attention is a local heuristic. We have brittle coherence and no global state. I believe we need a paradigm shift in architecture to move forward.
treyd · 48m ago
Has there been research into some hierarchical attention model that has local attention at the scale of sentences and paragraphs that feeds embeddings up to longer range attention across documents?
adastra22 · 3h ago
Definitely. I always assumed that, having been involved in writing similarly groundbreaking papers… or so we thought at the time. All my coauthors spent significant time thinking about what the best title would be, and strategies like that were common. (It ended up not mattering for us.)
iLoveOncall · 1h ago
I recommend reading this article which explains how you can get your papers accepted, and explains that a catchy title is the #1 most important thing: https://maxwellforbes.com/posts/how-to-get-a-paper-accepted/ (not a plug, I just saved it because it was interesting)
hyperbovine · 1h ago
It sounds like a typical NeurIPS paper to me. And no, they didn't know what a big deal it would be, else Google never would have given the idea away.
JSR_FDED · 4h ago
Any way to read this without making an account?
djoldman · 3m ago
just turn off JS.
qcnguy · 3h ago
Just click the x at the top right of the interstitial?
iLoveOncall · 1h ago
That only works for a few articles per month, but usually opening in incognito does the trick.
mrtesthah · 5h ago
Do we know if any of these techniques are actually used in the so-called "frontier" models?
gchadwick · 1h ago
Who knows what the closed-source models use, but certainly going by what's happening in open models, all the big changes and corresponding gains in capability are in training techniques, not model architecture. Things like GQA and MLA, as discussed in this article, are important techniques for getting better scaling, but they are relatively minor tweaks versus the evolution in training techniques.

I suspect closed models aren't doing anything too radically different from what's presented here.
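For anyone who hasn't read the article: the "minor tweak" of GQA is just sharing one key/value head across a group of query heads, which shrinks the KV cache by the group factor. A minimal NumPy sketch under toy, illustrative shapes (function and variable names are my own, not from the article):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def grouped_query_attention(q, k, v):
    """q: (n_q_heads, seq, d); k, v: (n_kv_heads, seq, d).
    Each group of n_q_heads // n_kv_heads query heads shares one
    K/V head, so the KV cache shrinks by that same factor."""
    n_q_heads, seq, d = q.shape
    n_kv_heads = k.shape[0]
    group = n_q_heads // n_kv_heads
    # Broadcast each K/V head across its group of query heads.
    k = np.repeat(k, group, axis=0)  # (n_q_heads, seq, d)
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    return softmax(scores) @ v       # (n_q_heads, seq, d)

rng = np.random.default_rng(0)
q = rng.standard_normal((8, 4, 16))  # 8 query heads
k = rng.standard_normal((2, 4, 16))  # only 2 KV heads -> 4x smaller KV cache
v = rng.standard_normal((2, 4, 16))
out = grouped_query_attention(q, k, v)
print(out.shape)  # (8, 4, 16)
```

With n_kv_heads equal to n_q_heads this reduces to standard multi-head attention, and with a single KV head it reduces to multi-query attention; GQA sits anywhere in between.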

vinithavn01 · 4h ago
The model names are mentioned under each type of attention mechanism.