The Theoretical Limitations of Embedding-Based Retrieval

59 points by fzliu | 6 comments | 8/29/2025, 8:25:34 PM | arxiv.org ↗

Comments (6)

gdiamos · 2h ago
Their idea is that the representational capacity of even 4096-dimensional vectors limits retrieval performance.

Sparse models like BM25 have huge dimensionality and thus don't suffer from this limit, but they don't capture semantics and can't follow instructions.

It seems like the holy grail is a sparse semantic model. I wonder how SPLADE would do?
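To make the capacity contrast concrete, here is a minimal sketch of SPLADE-style sparse scoring: vectors live in vocabulary-sized space (tens of thousands of dimensions) but store only nonzero term weights, so the effective dimensionality is far larger than any dense model's. The term ids and weights below are made up for illustration, not taken from any real model.

```python
# Sparse vectors as {term_id: weight} dicts over a vocabulary-sized space.
# Only nonzero entries are stored, so capacity scales with vocabulary size
# rather than a fixed dense width like 4096.

def sparse_dot(q: dict[int, float], d: dict[int, float]) -> float:
    """Dot product of two sparse vectors keyed by term id."""
    # Iterate over the smaller vector for efficiency.
    if len(d) < len(q):
        q, d = d, q
    return sum(w * d[t] for t, w in q.items() if t in d)

# A learned expansion (as SPLADE's MLM head would produce) assigns weights
# to query and document terms; scoring touches only shared terms.
query = {101: 1.2, 2057: 0.4}          # hypothetical: "embedding", "retrieval"
doc   = {101: 0.9, 777: 0.3, 42: 1.1}  # shares only term 101 with the query

print(sparse_dot(query, doc))  # roughly 1.08 (1.2 * 0.9)
```

The dict-of-weights layout is the same shape an inverted index exploits, which is why sparse semantic models can reuse BM25-style infrastructure.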

CuriouslyC · 38m ago
We already have "sparse" embeddings. Google's Matryoshka embedding scheme can scale embeddings from ~150 dimensions to >3k, and it's the same embedding with layers of representational meaning. Imagine decomposing an embedding along its principal components, then streaming the component vectors in order of their eigenvalues; that's roughly the idea.
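A minimal sketch of the truncation idea this comment describes (the PCA analogy is the commenter's; Matryoshka-style training bakes the ordering into the dimensions directly, so no decomposition step is needed at query time). The vector here is random stand-in data, not output from a real model.

```python
import numpy as np

# Stand-in for a trained Matryoshka-style embedding: the first dimensions
# are trained to carry the coarsest meaning, so a prefix is itself usable.
rng = np.random.default_rng(0)
full = rng.standard_normal(3072)
full /= np.linalg.norm(full)

def truncate(v: np.ndarray, dims: int) -> np.ndarray:
    """Keep the first `dims` coordinates and re-normalize for cosine search."""
    prefix = v[:dims]
    return prefix / np.linalg.norm(prefix)

small = truncate(full, 256)
print(small.shape)  # (256,)
```

Note that every coordinate of the truncated vector is still nonzero, which is why the sibling reply points out these embeddings are dense, not sparse.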
jxmorris12 · 2m ago
Matryoshka embeddings are not sparse. And SPLADE can scale to tens or hundreds of thousands of dimensions.
tkfoss · 24m ago
Wouldn't the holy grail then be parallel channels for candidate generation;

  euclidean embedding
  hyperbolic embedding
  sparse BM25 / SPLADE lexical search
  optional multi-vector signatures

  ↓ merge & deduplicate candidates
followed by weighted scoring, expansion (graph) & rerank (LLM)?
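The merge & deduplicate step above can be sketched with reciprocal rank fusion (RRF), one common way to combine ranked lists from parallel channels before reranking. The channel names and doc ids are illustrative, and RRF is just one fusion choice among several.

```python
from collections import defaultdict

def rrf_merge(channels: dict[str, list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists: each doc's score is the sum of 1/(k + rank)
    over every channel that retrieved it."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in channels.values():
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Deduplication falls out of keying by doc_id; sort by fused score.
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical output of three parallel channels (dense, SPLADE, BM25).
candidates = rrf_merge({
    "dense":  ["d1", "d2", "d3"],
    "splade": ["d2", "d4", "d1"],
    "bm25":   ["d2", "d1", "d5"],
})
print(candidates[0])  # "d2": top-ranked in two of the three channels
```

The fused list would then feed the graph-expansion and LLM-rerank stages the comment proposes; the `k` constant damps the influence of any single channel's top hit.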
ArnavAgrawal03 · 58m ago
We used multi-vector models at Morphik, and I can confirm their real-world effectiveness, especially compared with dense-vector retrieval.
codingjaguar · 7m ago
Curious: are those ColBERT-like models?