Native Sparse Attention
92 points by CalmStorm | 10 comments | 8/1/2025, 7:48:06 PM | aclanthology.org
This was submitted as "DeepSeek won the best paper award at ACL 2025".
Here is the awards page: https://cspaper.org/topic/116/record-breaking-acl-2025-crown...
Given how quiet all the major players went in the two weeks after DeepSeek R1 was released, I suspect they were reading and implementing everything in the papers that came with it as fast as humanly possible.
I applaud their open efforts. But being "altruistic" and being the best are two different things.
Isn't it very notable that the latency improvement came with no performance loss? I'm not super familiar with all the technical aspects, but that seems like it should be one of the main focuses of the paper.
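To make the latency point concrete, here is a minimal sketch of the block-sparse selection idea behind approaches like NSA, assuming PyTorch. This is not the paper's actual method: NSA uses learned block compression, a sliding-window branch, and hardware-aligned kernels, whereas the block size, top-k, mean-pooled block summaries, and the `block_sparse_attention` name here are all illustrative choices.

```python
# Illustrative block-sparse attention (NOT the paper's NSA implementation):
# each query attends only to its top-k most relevant key blocks, so per-query
# cost scales with top_k * block_size instead of the full sequence length T.
import torch
import torch.nn.functional as F

def block_sparse_attention(q, k, v, block_size=64, top_k=2):
    """q, k, v: (T, d) tensors; returns a (T, d) attention output."""
    T, d = k.shape
    n_blocks = T // block_size
    # Summarize each key block with mean pooling; the paper instead *learns*
    # this compression, which is part of why accuracy holds up.
    k_blocks = k[: n_blocks * block_size].view(n_blocks, block_size, d)
    block_summary = k_blocks.mean(dim=1)                   # (n_blocks, d)
    # Score blocks per query and keep only the top-k.
    block_scores = q @ block_summary.T                     # (T, n_blocks)
    top_blocks = block_scores.topk(top_k, dim=-1).indices  # (T, top_k)

    out = torch.empty_like(q)
    for i in range(T):
        # Expand the selected block ids into token indices for this query.
        idx = (top_blocks[i, :, None] * block_size
               + torch.arange(block_size)).flatten()       # (top_k*block_size,)
        ki, vi = k[idx], v[idx]
        attn = F.softmax(q[i] @ ki.T / d ** 0.5, dim=-1)
        out[i] = attn @ vi
    return out

q = torch.randn(256, 32)
k = torch.randn(256, 32)
v = torch.randn(256, 32)
print(block_sparse_attention(q, k, v).shape)  # torch.Size([256, 32])
```

The intuition for getting the speedup without an accuracy hit is that the sparsity pattern is selected per query rather than fixed, and in NSA the selection is trained end to end from pretraining onward, so the model learns to route attention through the kept blocks instead of having sparsity bolted on after training.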
The awards page for ACL seems to disagree with this editorialized title: https://2025.aclweb.org/program/awards/
> Industry Track Awards
> Best Paper
> Speed Without Sacrifice: Fine-Tuning Language Models with Medusa and Knowledge Distillation in Travel Applications
> Daniel Zagyva, Emmanouil Stergiadis, Laurens van der Maas, Aleksandra Dokic, Eran Fainman, Ilya Gusev, Moran Beladev
Per TFA, the paper we’re looking for is this one:
> Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention
> Jingyang Yuan, Huazuo Gao, Damai Dai, Junyu Luo, Liang Zhao, Zhengyan Zhang, Zhenda Xie, Y. X. Wei, Lean Wang, Zhiping Xiao, Yuqing Wang, Chong Ruan, Ming Zhang, Wenfeng Liang, Wangding Zeng
I’m not finding it by author on the page you linked, but I think it’s this reference by title:
> DeepSeek × PKU × UW — Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention
I did find it on this page:
https://2025.aclweb.org/program/main_papers/
https://aclanthology.org/2025.acl-long.1126