TransMLA: Multi-head latent attention is all you need
58 points · ocean_moist · 5/13/2025, 3:29:47 AM · arxiv.org ↗
Comments (4)
olq_plo · 2h ago
Very cool idea. Can't wait for converted models on HF.
kavalg · 1h ago
My (possibly wrong) TLDR: TransMLA is a method to "compress" an already trained GQA model, with the additional option to further fine-tune it. This should make inference faster.
yorwba · 1h ago
It is not a method to compress a Grouped-Query Attention model, but to expand it into an equivalent Multi-head Latent Attention model with the same key-value cache size but larger effective key/value vectors and a correspondingly larger number of trainable parameters. With additional training, you can then obtain a better model that only uses a little bit more memory.
freeqaz · 1h ago
Also makes models smarter (more "expressive").