Apple's Liquid Glass is prep work for AR interfaces, not just a design refresh (omc345.substack.com)

[3.3] For saving the KV cache, only the intermediate latent representations need to be stored: [latex] where r is much smaller than nh · dh [n-sub-h, d-sub-h]

[background] In traditional multi-head attention you must cache full key and value matrices of size T x (nh · dh) where T is the token length, nh is the number of attention heads, dh is the dimensionality of each individual head

sounds like a big win for memory constrained environments like local inference

killerstorm · 31d ago

Another paper related to attention distillation, although doing something far more radical: transformer attention is distilled onto RWKV-like model: https://huggingface.co/papers/2505.03005

karmakaze · 31d ago

I'm not "in the field" though I like to read about and use LLMs. This video "How DeepSeek Rewrote the Transformer [MLA]"[0] is really good at explaining MHA, MQA, GQA, and MLA with clear visuals/animations and how DeepSeek MLA is 57x more efficient.

[0] https://www.youtube.com/watch?v=0VLAoVGf_74&t=960s

olq_plo · 32d ago

Very cool idea. Can't wait for converted models on HF.

MichaelMoser123 · 31d ago

deepseek-v2,v3,r1 are all using multi-headed attention.

magicalhippo · 32d ago

I'm just following the field from the sidelines, but this looks interesting to me. Especially the increase in expressiveness that the new model allows for over GQA, at the cost of just ~10% more memory, and the fact that you can convert existing GQA models like LLaMA, Qwen etc with just a bit of fine-tuning.

Perhaps a trivial insight but I feel a lot of progress often comes in the form of generalizations, where existing approaches can be seen as special cases. Here the authors show that Group Query Attention (GQA) and Multi-Query Attention (MQA) falls out as special cases of their new model.

edit:

Adding my own summary, as I understand it.

The key to what they're doing, no pun intended, is to rely on the fact that large, high-dimensional, matrices may contain a lot of redundant information. Thus one may be able to find an good approximation which has less redundant information, by going through an intermediary stage which has fewer dimensions.

A n-by-m matrix M takes n-dimensional vectors and transforms them to m-dimensional vectors. The trick here is to replace matrix A by two matrices, L and R, which are n-by-r and r-by-m respectively, where r is smaller than n and m. This is called a low-rank approximation.

In a sense you're "straining the matrix", by forcing the information to pass through an intermediary, low-dimensional vector.

The memory savings come from the fact that matrix A has n*m entries, while L and R have n*r and r*m entries respectively. Say n = m = 100 and r = 20, that means A has 100*100 = 10k entries, while L and R have just 100*20 + 20*100 = 4k entries in total.

The trick itself is not new, for example it is also used in LoRA where an additional low-rank approximation matrix is used to tweak the output of an existing model. The low rank means there's far fewer the matrix entries, aka parameters, to train than if one had used a regular fully dense matrix.

The extra expressiveness of MLA comes from the fact that in GQA, in order to save memory, some of the matrices are actually built by gluing copies of a narrower matrix together. This means the information in the glued-up matrices are very redundant and fixed in a certain way, and thus are restricted in how they can transform the inputs.

By using the low-rank approximation instead, the information in the full, reconstructed matrices are not fixed in the same way compared to the glued-up result. Thus the inputs can be transformed in a less restrictive way, leading to the increase in expressiveness.

The GQA method saves a bit more memory compared to MLA as the narrower matrices are even smaller than the low-rank matrices in MLA, but at the cost of expressiveness.

wiz21c · 32d ago

Not quite related, but do the mamba models gain ground ?

Answering my own question: https://www.reddit.com/r/MachineLearning/comments/1hpg91o/d_...

kavalg · 32d ago

My (possibly wrong) TLDR: TransMLA is a method to "compress" an already trained GQA model, with the additional option to further fine tune it. Shall make inference faster.

yorwba · 32d ago

It is not a method to compress a Grouped-Query Attention model, but to expand it into an equivalent Multi-head Latent Attention model with the same key-value cache size but larger effective key/value vectors and a correspondingly larger number of trainable parameters. With additional training, you can then obtain a better model that only uses a little bit more memory.

kavalg · 31d ago

Thanks for the clarification.

freeqaz · 32d ago

Also makes models smarter ("expressive")

EGreg · 32d ago

All you need to stop posting titles like that !

Peano arithmetic is enough, because Peano arithmetic encodes computation (math.stackexchange.com)

Last fifty years of integer linear programming: Recent practical advances (inria.hal.science)

Model Once, Represent Everywhere: UDA (Unified Data Architecture) at Netflix (netflixtechblog.com)

Google Cloud Incident Report – 2025-06-13 (status.cloud.google.com)

SIMD-friendly algorithms for substring searching (0x80.pl)

The Many Sides of Erik Satie (thereader.mitpress.mit.edu)

Solidroad (YC W25) Is Hiring (solidroad.com)

Slowing the flow of core-dump-related CVEs (lwn.net)

Endometriosis is an interesting disease (owlposting.com)

Filedb: Disk-based key-value store inspired by Bitcask (github.com)

Implementing Logic Programming (btmc.substack.com)

Me an' Algernon – grappling with (temporary) cognitive decline (tidyfirst.substack.com)

TimeGuessr (timeguessr.com)

Saab achieves AI milestone with Gripen E (saab.com)

Self-Adapting Language Models (arxiv.org)

Man Killed by Police After Spiraling into ChatGPT-Driven Psychosis (futurism.com)

The Army’s Newest Recruits: Tech Execs From Meta, OpenAI and More (wsj.com)

How I uncovered a potential ancient Rome wine scam (phys.org)

Liquid Glass – WWDC25 [video] (developer.apple.com)

The international standard for identifying postal items (akpain.net)

Student discovers fungus predicted by Albert Hoffman (wvutoday.wvu.edu)

Protecting your code from other people's bugs (doi.org)

If the moon were only 1 pixel: A tediously accurate solar system model (2014) (joshworth.com)

Mollusk shell assemblages as a tool for identifying unaltered seagrass beds (int-res.com)

I convinced HP's board to buy Palm and watched them kill it (philmckinney.substack.com)

Whatever Happened to Sandboxfs? (blogsystem5.substack.com)

The Hat, the Spectre and SAT Solvers (2024) (nhatcher.com)

100 years of Zermelo's axiom of choice: What was the problem with it? (2006) (research.mietek.io)

Apple's Liquid Glass is prep work for AR interfaces, not just a design refresh (omc345.substack.com)

$100 Hamburger (en.wikipedia.org)

Frequent reauth doesn't make you more secure (tailscale.com)

When random people give money to random other people (2017) (quomodocumque.wordpress.com)

Green Tea Garbage Collector (github.com)

Jemalloc Postmortem (jasone.github.io)

Show HN: Tattoy – a text-based terminal compositor (tattoy.sh)

Using computers more freely and safely (2023) (akkartik.name)

Caltrain official lived in secret apartment built illegally inside train station (sfstandard.com)

OxCaml - a set of extensions to the OCaml programming language. (oxcaml.org)

AI agent startups at Y Combinator’s Spring ’25 Demo Day (businessinsider.com)

High-speed fluorescence light field tomography of whole freely moving organisms (opg.optica.org)

Ask HN: How do I give back to people helped me when I was young and had nothing?

Shaping Light – Volumetric Lighting (blog.maximeheckel.com)

Paleoproteomic profiling recovers diverse proteins from 200yo human brains (phys.org)

How I program with agents (crawshaw.io)

A Study of the Winston Red: The Smithsonian's New Fancy Red Diamond (gia.edu)

EDAN: Towards Understanding Memory Parallelism and Latency Sensitivity in HPC [pdf] (spcl.inf.ethz.ch)

Geometry from Quantum Temporal Correlations (arxiv.org)

GPU-accelerated Llama3.java inference in pure Java using TornadoVM (github.com)

Meta invests $14.3B in Scale AI to kick-start superintelligence lab (nytimes.com)

RISC-V in AI and HPC Part 1: Per Aspera Ad Astra? (eetimes.com)

TransMLA: Multi-head latent attention is all you need

Comments (32)