How Attention Sinks Keep Language Models Stable

18 points by pr337h4m | 8/8/2025, 8:53:10 AM | hanlab.mit.edu ↗

Comments (3)

Calavar · 47m ago
> Researchers had observed similar patterns in BERT, where "a surprisingly large amount of attention focuses on the delimiter token [SEP] and periods," which they argued was used by the model as a sort of no-op. The same summer at Meta, researchers studying vision transformers found similar behavior, observing that models would repurpose uninformative background patches as computational scratchpads.

This seems to go beyond just transformers. For example, I recall reading a paper a while ago that showed a similar effect in an image-to-image model with a GAN/U-Net architecture [1]. (A toy sketch of the no-op idea is below.)

[1] https://arxiv.org/abs/1712.02950
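
A toy way to see the "no-op" behavior quoted above (a minimal sketch, not code from the article; the dimensions, the scaling, and the hand-built sink key/value are all made-up assumptions): softmax attention has to distribute a total weight of 1 over the keys, so a head with nothing useful to attend to still has to park that mass somewhere, and a sink position with a near-zero value vector lets it do that without disturbing the output.

```python
# Toy sketch, not code from the article: a head with no useful match still
# has to spend its attention budget somewhere; a "sink" key that the query
# matches strongly, paired with a near-zero value vector, absorbs that mass
# as an effective no-op. All shapes and magnitudes here are made up.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d = 8

query = torch.randn(1, d)            # one query with nothing in particular to find
real_keys = torch.randn(5, d) * 0.1  # five content tokens, weak similarity to the query
real_values = torch.randn(5, d)

sink_key = query * 3.0               # sink key the query matches strongly
sink_value = torch.zeros(1, d)       # sink value contributes ~nothing to the output

keys = torch.cat([sink_key, real_keys], dim=0)
values = torch.cat([sink_value, real_values], dim=0)

scores = query @ keys.T / d ** 0.5   # (1, 6) attention logits
weights = F.softmax(scores, dim=-1)  # almost all of the mass lands on the sink
output = weights @ values            # ...so the head's output stays near zero

print("attention weights:", [round(w, 3) for w in weights.squeeze().tolist()])
print("output norm:", output.norm().item())
```

Running this, nearly all of the weight lands on the sink position and the head's output norm stays tiny, which is the sense in which attending to the sink is a no-op.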

am17an · 7m ago
This is nice and useful, since the new GPT-OSS model uses this technique. Kudos to the original authors!
Havoc · 53m ago
> The first few tokens often carried minimal semantic information—sometimes just a start-of-sequence marker or common words like "the" or "a."

I wonder if it makes sense to use the first word as a title of sorts rather than going straight into a grammatically correct sentence when prompting. (A quick probe of where the attention mass actually lands is sketched below.)
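
To see where the mass actually goes, here is a rough probe; it assumes the HuggingFace transformers library and GPT-2 as a stand-in model (neither is named in the article or this thread) and simply prints the average attention each layer's heads give to the first token of an ordinary sentence, which here is just "The".

```python
# Rough probe, assuming the HuggingFace transformers API and GPT-2 as a
# stand-in model (neither is mentioned in the article): print how much
# attention each layer's heads put on the very first token of the prompt.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

inputs = tok("The quick brown fox jumps over the lazy dog.", return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# out.attentions: one (batch, heads, seq_len, seq_len) tensor per layer.
for layer, attn in enumerate(out.attentions):
    # Average weight that all query positions, across all heads, give to position 0.
    sink_mass = attn[0, :, :, 0].mean().item()
    print(f"layer {layer:2d}: mean attention on the first token = {sink_mass:.2f}")
```

If the article's observation holds, the later layers should show a disproportionate share of attention on that first position even though the token itself carries little meaning.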