> Barbero et al. have shown that attention sinks serve as "pressure valves" preventing what researchers call "over-mixing"—a pathological state where deep models processing long sequences blur important distinctions between tokens. The presence of a sink draws attention away from other tokens, limiting the spread of information (and noise) and resulting in more stable embeddings.
This sounds like it is working for the wrong reasons. Surely the right behavior is for the right tokens to receive attention rather than the first handful. Jamming everything onto those is the complementary sin to blurring. I would investigate attention equalization paired with a sparsity prior, or something similar, to prevent blurring.
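A rough sketch of the kind of regularizer I have in mind (PyTorch; the name, the mass cap, and the weight are all made up for illustration, not something from the article):

```python
import torch

def attention_regularizer(attn, mass_cap=0.5, entropy_weight=0.1):
    """Hypothetical loss term: an "equalization" penalty that discourages
    any single key (e.g. an attention sink) from absorbing more than
    `mass_cap` of a row's probability, plus an entropy penalty (the
    sparsity prior) that keeps rows peaked on a few tokens instead of
    blurring over all of them.
    `attn` has shape (batch, heads, queries, keys), rows summing to 1."""
    # Equalization: penalize mass concentrated on any one key beyond the cap.
    sink_excess = (attn.max(dim=-1).values - mass_cap).clamp(min=0.0)
    # Sparsity prior: high-entropy (blurry) rows are penalized.
    entropy = -(attn.clamp_min(1e-9).log() * attn).sum(dim=-1)
    return sink_excess.mean() + entropy_weight * entropy.mean()
```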
yorwba · 19m ago
The point is that there's not always a right token to attend to. If the information you're looking for is not there, no clever attention scheme will find it. The best you can hope for when that happens is that the value returned in the "not found" case is distinguishable from the "found" case. Having an attention sink serve as a fixed "not found" value is one way to do this.
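To make that concrete, here is a toy single-query attention with a synthetic sink slot. The fixed sink score and the zero value are arbitrary choices for illustration; in real models the sink behavior is learned, not hard-coded:

```python
import torch
import torch.nn.functional as F

def attend(q, K, V, sink_v=None, sink_score=2.0):
    """Toy single-query attention. If sink_v is given, one extra slot
    with a fixed logit `sink_score` and value `sink_v` is prepended.
    When no real key matches q (all scores near zero), most of the
    softmax mass lands on the sink and the output is pulled toward
    sink_v -- a recognizable "not found" value instead of a blur over
    irrelevant rows of V. A strongly matching key still outweighs it."""
    scores = K @ q / q.shape[-1] ** 0.5              # (n,)
    if sink_v is not None:
        scores = torch.cat([scores.new_tensor([sink_score]), scores])
        V = torch.cat([sink_v.unsqueeze(0), V], dim=0)
    return F.softmax(scores, dim=0) @ V

d = 16
q, K, V = torch.randn(d), torch.randn(8, d), torch.randn(8, d)
print(attend(q, K, V, sink_v=torch.zeros(d)))  # pulled toward zeros when nothing matches strongly
```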
esafak · 7m ago
Good point. Does that mean they help mitigate hallucinations?
Calavar · 2h ago
> Researchers had observed similar patterns in BERT, where "a surprisingly large amount of attention focuses on the delimiter token [SEP] and periods," which they argued was used by the model as a sort of no-op. The same summer at Meta, researchers studying vision transformers found similar behavior, observing that models would repurpose uninformative background patches as computational scratchpads.
This seems to go beyond just transformers. For example, I recall reading a paper a while ago that showed a similar effect in an image-to-image model with a GAN/U-Net architecture [1].
[1] https://arxiv.org/abs/1712.02950
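Related to the quoted BERT observation upthread: you can see the [SEP] effect directly by dumping attention maps from a pretrained model. A sketch assuming the Hugging Face transformers API and bert-base-uncased:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Measure how much attention mass lands on [SEP] in each layer,
# averaged over heads and query positions.
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)
model.eval()

inputs = tok("Attention sinks give heads somewhere harmless to look.",
             return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

is_sep = inputs["input_ids"][0] == tok.sep_token_id
for layer, attn in enumerate(out.attentions):   # each: (1, heads, q, k)
    mass = attn[0, :, :, is_sep].sum(dim=-1).mean().item()
    print(f"layer {layer:2d}: mean attention on [SEP] = {mass:.2f}")
```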
I wonder if it makes sense to use the first word as a title of sorts, rather than going straight into a grammatically correct sentence, when prompting.