> Barbero et al. have shown that attention sinks serve as "pressure valves" preventing what researchers call "over-mixing"—a pathological state where deep models processing long sequences blur important distinctions between tokens. The presence of a sink draws attention away from other tokens, limiting the spread of information (and noise) and resulting in more stable embeddings.
This sounds like it is working for the wrong reasons. Surely the right behavior is for the right tokens to receive attention rather than the first handful. Jamming everything onto those is the complementary sin to blurring. I would investigate attention equalization paired with a sparsity prior, or something similar, to prevent blurring.
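A rough sketch of the kind of regularizer I have in mind (PyTorch; the name, the mass cap, and the weight are all made up for illustration, not something from the article):

```python
import torch

def attention_regularizer(attn, mass_cap=0.5, entropy_weight=0.1):
    """Hypothetical loss term: an "equalization" penalty that discourages
    any single key (e.g. an attention sink) from absorbing more than
    `mass_cap` of a row's probability, plus an entropy penalty (the
    sparsity prior) that keeps rows peaked on a few tokens instead of
    blurring over all of them.
    `attn` has shape (batch, heads, queries, keys), rows summing to 1."""
    # Equalization: penalize mass concentrated on any one key beyond the cap.
    sink_excess = (attn.max(dim=-1).values - mass_cap).clamp(min=0.0)
    # Sparsity prior: high-entropy (blurry) rows are penalized.
    entropy = -(attn.clamp_min(1e-9).log() * attn).sum(dim=-1)
    return sink_excess.mean() + entropy_weight * entropy.mean()
```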
yorwba · 19m ago
The point is that there's not always a right token to attend to. If the information you're looking for is not there, no clever attention scheme will find it. The best you can hope for when that happens is that the value returned in the "not found" case is distinguishable from the "found" case. Having an attention sink serve as a fixed "not found" value is one way to do this.
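To make that concrete, here is a toy single-query attention with a synthetic sink slot. The fixed sink score and the zero value are arbitrary choices for illustration; in real models the sink behavior is learned, not hard-coded:

```python
import torch
import torch.nn.functional as F

def attend(q, K, V, sink_v=None, sink_score=2.0):
    """Toy single-query attention. If sink_v is given, one extra slot
    with a fixed logit `sink_score` and value `sink_v` is prepended.
    When no real key matches q (all scores near zero), most of the
    softmax mass lands on the sink and the output is pulled toward
    sink_v -- a recognizable "not found" value instead of a blur over
    irrelevant rows of V. A strongly matching key still outweighs it."""
    scores = K @ q / q.shape[-1] ** 0.5              # (n,)
    if sink_v is not None:
        scores = torch.cat([scores.new_tensor([sink_score]), scores])
        V = torch.cat([sink_v.unsqueeze(0), V], dim=0)
    return F.softmax(scores, dim=0) @ V

d = 16
q, K, V = torch.randn(d), torch.randn(8, d), torch.randn(8, d)
print(attend(q, K, V, sink_v=torch.zeros(d)))  # pulled toward zeros when nothing matches strongly
```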
esafak · 7m ago
Good point. Does that mean they help mitigate hallucinations?
Calavar · 2h ago
> Researchers had observed similar patterns in BERT, where "a surprisingly large amount of attention focuses on the delimiter token [SEP] and periods," which they argued was used by the model as a sort of no-op. The same summer at Meta, researchers studying vision transformers found similar behavior, observing that models would repurpose uninformative background patches as computational scratchpads.
This seems to go beyond just transformers. For example, I recall reading a paper a while ago that showed a similar effect in an image-to-image model with a GAN/U-Net architecture [1].
[1] https://arxiv.org/abs/1712.02950
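Related to the quoted BERT observation upthread: you can see the [SEP] effect directly by dumping attention maps from a pretrained model. A sketch assuming the Hugging Face transformers API and bert-base-uncased:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Measure how much attention mass lands on [SEP] in each layer,
# averaged over heads and query positions.
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)
model.eval()

inputs = tok("Attention sinks give heads somewhere harmless to look.",
             return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

is_sep = inputs["input_ids"][0] == tok.sep_token_id
for layer, attn in enumerate(out.attentions):   # each: (1, heads, q, k)
    mass = attn[0, :, :, is_sep].sum(dim=-1).mean().item()
    print(f"layer {layer:2d}: mean attention on [SEP] = {mass:.2f}")
```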
I wonder if it makes sense to use the first word as a title of sorts, rather than going straight into a grammatically correct sentence, when prompting.