So this is basically an “embedding of embeddings”, an approximation of multiple embeddings compressed into one, to reduce dimensionality/increase performance.
All this tells me is that: the “multiple embeddings” are probably mostly overlapping and the marginal value of each additional one is probably low, if you can represent them with a single embedding.
I don’t otherwise see how you can keep comparable performance without breaking information theory.
When looking at multi-vector / ColBERT style approaches, the embedding per token approach can massively increase costs. You might go from a single 768 dimension vector to 128 x 130 = 16,640 dimensions. Even with better results from a multi-vector model this can make it unfeasible for many use-cases.
Muvera, converts the multiple vectors into a single fixed dimension (usually net smaller) vector that can be used by any ANN index. As you now have a single vector you can use all your existing ANN algorithms and stack other quantization techniques for memory savings. In my opinion it is a much better approach than PLAID because it doesn't require specific index structures or clustering assumptions and can achieve lower latency.
All this tells me is that: the “multiple embeddings” are probably mostly overlapping and the marginal value of each additional one is probably low, if you can represent them with a single embedding.
I don’t otherwise see how you can keep comparable performance without breaking information theory.
When looking at multi-vector / ColBERT style approaches, the embedding per token approach can massively increase costs. You might go from a single 768 dimension vector to 128 x 130 = 16,640 dimensions. Even with better results from a multi-vector model this can make it unfeasible for many use-cases.
Muvera, converts the multiple vectors into a single fixed dimension (usually net smaller) vector that can be used by any ANN index. As you now have a single vector you can use all your existing ANN algorithms and stack other quantization techniques for memory savings. In my opinion it is a much better approach than PLAID because it doesn't require specific index structures or clustering assumptions and can achieve lower latency.