This paper has been misrepresented many times. At the end it says:
Multi-vector models
Multi-vector models are more expressive through the use of multiple vectors
per sequence combined with the MaxSim operator [Khattab and Zaharia, 2020]. These models show
promise on the LIMIT dataset, with scores greatly above the single-vector models despite using a
smaller backbone (ModernBERT, Warner et al. [2024]). However, these models are not generally
used for instruction-following or reasoning-based tasks, leaving it an open question how well
multi-vector techniques will transfer to these more advanced tasks.
Sparse models
Sparse models (both lexical and neural versions) can be thought of as single vector
models but with very high dimensionality. This dimensionality helps BM25 avoid the problems of the
neural embedding models as seen in Figure 3. Since the dimensionality of their vectors is high, they can scale to
many more combinations than their dense vector counterparts. However, it is less clear how to apply
sparse models to instruction-following and reasoning-based tasks where there is no lexical or even
paraphrase-like overlap. We leave this direction to future work.
In other words, it says that both multi-vector (i.e. late interaction) and sparse models hold promise.
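To make the quoted MaxSim operator concrete, here is a minimal sketch of late-interaction scoring, assuming NumPy and toy random vectors in place of real token embeddings from a trained encoder such as ColBERT:

```python
# A minimal sketch of ColBERT-style MaxSim scoring (Khattab and Zaharia, 2020).
# The toy vectors here are hypothetical stand-ins; a real system would use
# per-token embeddings from a trained multi-vector encoder.
import numpy as np

def maxsim_score(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """Late-interaction score: for each query token vector, take its
    maximum similarity over all document token vectors, then sum."""
    # (num_query_tokens, num_doc_tokens) matrix of dot-product similarities
    sims = query_vecs @ doc_vecs.T
    return float(sims.max(axis=1).sum())

rng = np.random.default_rng(0)
query = rng.normal(size=(4, 128))   # 4 query tokens, 128 dims each
doc_a = rng.normal(size=(20, 128))  # 20 document tokens
doc_b = rng.normal(size=(35, 128))

# Each query token is free to match a different document token, which is
# what makes this representation more expressive than one pooled vector.
scores = {"doc_a": maxsim_score(query, doc_a), "doc_b": maxsim_score(query, doc_b)}
print(max(scores, key=scores.get))
```

Because the score is a sum of per-token maxima rather than a single dot product, a multi-vector model is not constrained to pack every relevance combination into one fixed-dimensional vector, which is the failure mode the paper studies.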
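The sparse-model point can be illustrated with an equally small sketch: a lexical vector is just a very high-dimensional, mostly-zero single vector, one dimension per vocabulary term. The raw term-frequency weighting below is a simplified stand-in for BM25, which additionally applies term saturation and length normalization:

```python
# A minimal sketch of a sparse lexical model as "a single vector with very
# high dimensionality". Raw term frequencies stand in for BM25 weights.
from collections import Counter

def sparse_vector(text: str) -> Counter:
    # One dimension per vocabulary term; only nonzero entries are stored.
    return Counter(text.lower().split())

def dot(q: Counter, d: Counter) -> int:
    # Dot product over the shared nonzero dimensions.
    return sum(weight * d[term] for term, weight in q.items())

docs = {
    "d1": sparse_vector("quick brown fox jumps over the lazy dog"),
    "d2": sparse_vector("the embedding models map text to dense vectors"),
}
query = sparse_vector("dense embedding vectors")

# Only documents sharing a lexical dimension with the query score above
# zero, which is exactly why purely sparse matching struggles when there
# is no lexical or paraphrase-like overlap with the query.
print({name: dot(query, vec) for name, vec in docs.items()})
```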