Muvera: Making multi-vector retrieval as fast as single-vector search

71 georgehill 4 6/26/2025, 10:29:34 AM research.google ↗

Comments (4)

trengrj · 5h ago
We added Muvera to Weaviate recently https://weaviate.io/blog/muvera and also have a nice podcast on it https://www.youtube.com/watch?v=nSW5g1H4zoU.

When looking at multi-vector / ColBERT-style approaches, the embedding-per-token approach can massively increase costs. You might go from a single 768-dimensional vector to 128 x 130 = 16,640 dimensions. Even with better results from a multi-vector model, this can make it infeasible for many use cases.
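To put rough numbers on that, here is a small back-of-the-envelope sketch. It assumes the 128 is the per-token embedding width and the 130 is the token count (the figures above don't say which is which), and it assumes float32 storage; both are assumptions for illustration, not measurements.

```python
# Hypothetical sizing helper: names and the float32 assumption are illustrative.
BYTES_PER_FLOAT32 = 4

def floats_per_doc(num_tokens: int, dim_per_token: int) -> int:
    """Number of stored values for one document with one embedding per token."""
    return num_tokens * dim_per_token

single = 768                      # one 768-dimensional vector per document
multi = floats_per_doc(130, 128)  # assumed: 130 tokens x 128 dims each

print(multi)                      # 16640
print(round(multi / single, 1))   # 21.7 (times more values per document)
print(single * BYTES_PER_FLOAT32, multi * BYTES_PER_FLOAT32)  # 3072 66560
```

So under these assumptions each document costs roughly 21.7x the storage of the single-vector case before any quantization.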

Muvera converts the multiple vectors into a single fixed-dimension (usually net smaller) vector that can be used by any ANN index. Because you now have a single vector, you can use all your existing ANN algorithms and stack other quantization techniques for memory savings. In my opinion it is a much better approach than PLAID because it doesn't require specific index structures or clustering assumptions and can achieve lower latency.
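For intuition, here is a heavily simplified sketch of the fixed-dimension idea as I understand it: bucket each token vector by the sign pattern of a few random hyperplanes (SimHash style), sum the query vectors per bucket, average the document vectors per bucket, and concatenate the buckets. This is not Weaviate's or Google's implementation. The function names are hypothetical, and it leaves out the repetitions, the extra projection that shrinks the final vector, and the filling of empty document buckets.

```python
import random

def make_hyperplanes(k, dim, rng):
    """Draw k random Gaussian hyperplanes; their sign pattern defines 2**k buckets."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(k)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def bucket_of(vec, planes):
    """Map a vector to a bucket index from the signs of its hyperplane projections."""
    idx = 0
    for plane in planes:
        idx = (idx << 1) | (1 if dot(vec, plane) > 0 else 0)
    return idx

def encode(vectors, planes, average):
    """Concatenate per-bucket sums (average=False) or means (average=True)."""
    dim = len(vectors[0])
    n_buckets = 2 ** len(planes)
    sums = [[0.0] * dim for _ in range(n_buckets)]
    counts = [0] * n_buckets
    for vec in vectors:
        b = bucket_of(vec, planes)
        counts[b] += 1
        for i, x in enumerate(vec):
            sums[b][i] += x
    out = []
    for b in range(n_buckets):
        # Empty buckets stay all-zero in this sketch.
        scale = 1.0 / counts[b] if average and counts[b] else 1.0
        out.extend(x * scale for x in sums[b])
    return out

rng = random.Random(0)
planes = make_hyperplanes(k=3, dim=4, rng=rng)
doc_tokens = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(10)]
query_tokens = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(3)]

doc_fde = encode(doc_tokens, planes, average=True)       # documents: per-bucket mean
query_fde = encode(query_tokens, planes, average=False)  # queries: per-bucket sum
score = dot(query_fde, doc_fde)  # single inner product standing in for multi-vector scoring

print(len(doc_fde), len(query_fde))  # 32 32
```

The output length depends only on the number of hyperplanes and the token dimension (here 2**3 * 4 = 32), not on how many token vectors went in, which is what lets a standard ANN index treat it like any other single vector.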

bobosha · 1h ago
How is this different from generating a feature hash of the embeddings, i.e. a many-to-one embedding reduction? Could UMAP or a similar technique be helpful in reducing to a single vector?
dinkdonkbell · 1h ago
UMAP doesn't project values into the same coordinate space. The abstract properties are preserved between projections, but where points land in coordinate space won't be the same.
dinobones · 3h ago
So this is basically an “embedding of embeddings”, an approximation of multiple embeddings compressed into one, to reduce dimensionality/increase performance.

All this tells me is that: the “multiple embeddings” are probably mostly overlapping and the marginal value of each additional one is probably low, if you can represent them with a single embedding.

I don’t otherwise see how you can keep comparable performance without breaking information theory.