> Despite being sparse, NSA surpasses Full Attention baseline on average across general benchmarks, long-context tasks, and reasoning evaluation.
Isn't it notable that the latency improvement didn't come with a performance loss? I'm not super familiar with all the technical aspects, but that seems like it should be one of the main focuses of the paper.
gnabgib · 9m ago
Title: Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention
For the first time, it introduced native sparse attention into the full training process, achieving up to 11× inference speedup while maintaining model performance.
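Not the paper's actual method, but to give a feel for why block-sparse attention can be much faster than full attention at long context without necessarily losing quality, here's a toy sketch in plain NumPy of one ingredient: scoring contiguous key/value blocks cheaply and running exact attention only over the top few. The `block_size` and `top_k` values here are made up for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def blockwise_topk_attention(q, K, V, block_size=64, top_k=4):
    """Toy block-sparse attention for a single query vector.

    Keys/values are grouped into contiguous blocks; a cheap per-block
    score (query dot mean key) picks the top_k blocks, and exact
    attention runs only over those tokens. Illustrative only -- not
    NSA's real kernel or training recipe.
    """
    d = q.shape[-1]
    n_blocks = K.shape[0] // block_size
    Kb = K[: n_blocks * block_size].reshape(n_blocks, block_size, d)
    Vb = V[: n_blocks * block_size].reshape(n_blocks, block_size, d)

    # Cheap block importance: query against each block's mean key.
    block_scores = Kb.mean(axis=1) @ q
    keep = np.argsort(block_scores)[-top_k:]

    # Exact attention restricted to the selected blocks.
    K_sel = Kb[keep].reshape(-1, d)
    V_sel = Vb[keep].reshape(-1, d)
    weights = softmax(K_sel @ q / np.sqrt(d))
    return weights @ V_sel

# Example: 8k-token context, but attention touches only 4 blocks of 64 tokens.
rng = np.random.default_rng(0)
q = rng.standard_normal(128)
K = rng.standard_normal((8192, 128))
V = rng.standard_normal((8192, 128))
out = blockwise_topk_attention(q, K, V)
print(out.shape)  # (128,)
```

As I understand the paper, NSA's actual design combines a compression branch, a blockwise selection branch, and a sliding-window branch with hardware-aligned kernels, and trains with that sparsity from the start; the sketch only shows the selection idea, i.e. that attending over a few blocks instead of the whole context is where the latency win comes from.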
The awards page for ACL seems to disagree with this editorialized title: https://2025.aclweb.org/program/awards/