I was surprised to see the 5090's theoretical BF16 TFLOPs at just 209.5. That's not even 10% of the server Blackwell parts (B200 is 2250, GB200 is 2500). A B200 costs around $30-40k per GPU, so the pricing is almost in line with the relative performance.
Starting with the 4090, NVIDIA limits tensor core performance on gaming cards, specifically for ops that might be used in training. FP8 and FP16 matmuls run at full speed when accumulating in FP16 (I've never seen anyone use this), but only at half speed when accumulating in FP32. This restriction doesn't apply to lower-precision matmuls like FP4, and it's removed entirely on workstation-class cards like the RTX Pro 6000.
It doesn't seem worth it to use NVIDIA gaming cards as a "cheaper FLOPs" alternative anymore (e.g. diffusion models used to be cheaper to run on a 4090 than an H100). They are generous with memory bandwidth though; nearly 2 TB/s is amazing!
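To make the accumulation distinction above concrete, here is a minimal sketch (mine, not from the linked post) of the same 16x16x16 half-precision tile multiply with the two accumulator types, using the portable wmma API; kernel names and tile shapes are illustrative only.

```cuda
// Minimal sketch: one warp computes a 16x16x16 half-precision matmul tile.
// Only the accumulator fragment type differs between the two kernels.
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// FP16 inputs, FP16 accumulator (the full-speed path on gaming cards).
__global__ void mma_fp16_acc(const half* A, const half* B, half* C) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, half> acc;
    wmma::fill_fragment(acc, __float2half(0.0f));
    wmma::load_matrix_sync(a, A, 16);
    wmma::load_matrix_sync(b, B, 16);
    wmma::mma_sync(acc, a, b, acc);
    wmma::store_matrix_sync(C, acc, 16, wmma::mem_row_major);
}

// Same FP16 inputs, FP32 accumulator (the path reportedly capped at half rate
// on 4090/5090-class gaming cards).
__global__ void mma_fp32_acc(const half* A, const half* B, float* C) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;
    wmma::fill_fragment(acc, 0.0f);
    wmma::load_matrix_sync(a, A, 16);
    wmma::load_matrix_sync(b, B, 16);
    wmma::mma_sync(acc, a, b, acc);
    wmma::store_matrix_sync(C, acc, 16, wmma::mem_row_major);
}
```

The two kernels compute the same tile; the accumulator precision is the only difference, and that is exactly the knob the gaming-card rate limit keys on.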
steinvakt2 · 12m ago
Isn't the 5090 FE (roughly 2500 USD in my country) pretty good value in FLOPs per dollar? 32 GB of VRAM, and flash attention pushes it even further ahead of Apple/MPS and its relatively cheap "VRAM".
ProofHouse · 2h ago
Damn awesome. This is going to take me 3 reads and a week to digest.
steinvakt2 · 2h ago
I had a 5090 some months ago but couldn't get flash attention to work. Does it work natively now? What about the 5080?
sigmoid10 · 1h ago
PyTorch now has native support for the Blackwell architecture: https://pytorch.org/blog/pytorch-2-7/
It does, but the performance is pretty bad, worse than Hopper.
zackangelo · 55m ago
Curious what issues you were having. The kernel should compile natively if you pass nvcc the correct arch flags, although it probably won't take advantage of any new hardware features.
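For reference, a hedged example of what those arch flags might look like for a consumer Blackwell (sm_120) card; the file name is made up, and the exact flags depend on the toolkit version (the sm_120 targets were introduced around CUDA 12.8).

```cuda
// Hypothetical build lines for a consumer Blackwell (sm_120) GPU, assuming a
// CUDA 12.8+ toolkit. The file name attn_sm120.cu is illustrative only.
//
//   nvcc -O3 -arch=sm_120  attn_sm120.cu -o attn_sm120
//   nvcc -O3 -arch=sm_120a attn_sm120.cu -o attn_sm120
//
// The "a" variant targets architecture-specific instructions (e.g. the
// block-scaled MMAs mentioned in the post) that the plain target omits.
#include <cstdio>

// Quick sanity check of which architecture a kernel was actually compiled for.
__global__ void report_arch() {
#if defined(__CUDA_ARCH__)
    if (threadIdx.x == 0 && blockIdx.x == 0)
        printf("compiled for __CUDA_ARCH__ = %d\n", __CUDA_ARCH__);  // 1200 for sm_120
#endif
}
```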
doctorpangloss · 2h ago
Hmm, but supposing the accelerated NVIDIA-specific inference data types were available in Triton, you would just use that, right? Why not contribute to Triton? They accept PRs. So what if contributing to Triton amounts to free ecosystem development for NVIDIA and other giant corporations?
qeternity · 2h ago
Second line of the post:
> The main objective is to learn writing attention in CUDA C++, since many features are not available in Triton, such as MXFP8 / NVFP4 MMA for sm120.