How Attention Sinks Keep Language Models Stable

18 points by pr337h4m | 8/8/2025, 8:53:10 AM | hanlab.mit.edu ↗

Comments (3)

Calavar · 47m ago
> Researchers had observed similar patterns in BERT, where "a surprisingly large amount of attention focuses on the delimiter token [SEP] and periods," which they argued was used by the model as a sort of no-op. The same summer at Meta, researchers studying vision transformers found similar behavior, observing that models would repurpose uninformative background patches as computational scratchpads.

This seems to go beyond just transformers. For example, I recall reading a paper a while ago that showed a similar effect in an image-to-image model with a GAN/U-Net architecture [1]. (A toy sketch of the no-op idea is below.)

[1] https://arxiv.org/abs/1712.02950
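
A toy way to see the "no-op" behavior quoted above (a minimal sketch, not code from the article; the dimensions, the scaling, and the hand-built sink key/value are all made-up assumptions): softmax attention has to distribute a total weight of 1 over the keys, so a head with nothing useful to attend to still has to park that mass somewhere, and a sink position with a near-zero value vector lets it do that without disturbing the output.

```python
# Toy sketch, not code from the article: a head with no useful match still
# has to spend its attention budget somewhere; a "sink" key that the query
# matches strongly, paired with a near-zero value vector, absorbs that mass
# as an effective no-op. All shapes and magnitudes here are made up.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d = 8

query = torch.randn(1, d)            # one query with nothing in particular to find
real_keys = torch.randn(5, d) * 0.1  # five content tokens, weak similarity to the query
real_values = torch.randn(5, d)

sink_key = query * 3.0               # sink key the query matches strongly
sink_value = torch.zeros(1, d)       # sink value contributes ~nothing to the output

keys = torch.cat([sink_key, real_keys], dim=0)
values = torch.cat([sink_value, real_values], dim=0)

scores = query @ keys.T / d ** 0.5   # (1, 6) attention logits
weights = F.softmax(scores, dim=-1)  # almost all of the mass lands on the sink
output = weights @ values            # ...so the head's output stays near zero

print("attention weights:", [round(w, 3) for w in weights.squeeze().tolist()])
print("output norm:", output.norm().item())
```

Running this, nearly all of the weight lands on the sink position and the head's output norm stays tiny, which is the sense in which attending to the sink is a no-op.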

am17an · 7m ago
This is nice and useful, since the new GPT-OSS model uses this technique. Kudos to the original authors!
Havoc · 53m ago
> The first few tokens often carried minimal semantic information—sometimes just a start-of-sequence marker or common words like "the" or "a."

I wonder if it makes sense to use the first word as a title of sorts rather than going straight into a grammatically correct sentence when prompting. (A quick probe of where the attention mass actually lands is sketched below.)
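
To see where the mass actually goes, here is a rough probe; it assumes the HuggingFace transformers library and GPT-2 as a stand-in model (neither is named in the article or this thread) and simply prints the average attention each layer's heads give to the first token of an ordinary sentence, which here is just "The".

```python
# Rough probe, assuming the HuggingFace transformers API and GPT-2 as a
# stand-in model (neither is mentioned in the article): print how much
# attention each layer's heads put on the very first token of the prompt.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

inputs = tok("The quick brown fox jumps over the lazy dog.", return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# out.attentions: one (batch, heads, seq_len, seq_len) tensor per layer.
for layer, attn in enumerate(out.attentions):
    # Average weight that all query positions, across all heads, give to position 0.
    sink_mass = attn[0, :, :, 0].mean().item()
    print(f"layer {layer:2d}: mean attention on the first token = {sink_mass:.2f}")
```

If the article's observation holds, the later layers should show a disproportionate share of attention on that first position even though the token itself carries little meaning.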