I was surprised to see 5090's theoretical BF16 TFLOPs at just 209.5. That's not even 10% of the server Blackwell (B200 is 2250, and GB200 is 2500). B200 costs around $30-40k per GPU, so they are pretty close in performance per dollar.

Starting with 4090, NVIDIA limits the performance of tensor cores on gaming cards, specifically for ops that might be used in ML training. FP8 and FP16 matmuls run at full speed if accumulating in FP16 (I've never seen anyone use this), but only half speed when accumulating in FP32. This restriction is not present for lower precision matmuls like FP4, and is removed entirely on the workstation-class cards like RTX Pro 6000.

It doesn't seem worth it to use NVIDIA gaming cards as a "cheaper FLOPs" alternative anymore (e.g. diffusion models could have been cheaper to run on 3090 than A100). They are generous with memory bandwidth though, nearly 2TB/s on 5090 is amazing!
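A quick sanity check of the ratios above, as a minimal sketch. The TFLOPs figures are the ones quoted in the comment. Treating 209.5 as the half-rate (FP32-accumulate) number, so that the FP16-accumulate rate would be double, is an assumption drawn from the comment's own description and not a measured result.

```python
# Dense BF16 TFLOPs as quoted in the comment above.
RTX_5090_TFLOPS = 209.5
B200_TFLOPS = 2250.0
GB200_TFLOPS = 2500.0


def fraction_of(card_tflops: float, reference_tflops: float) -> float:
    """Return card throughput as a fraction of the reference throughput."""
    return card_tflops / reference_tflops


share_of_b200 = fraction_of(RTX_5090_TFLOPS, B200_TFLOPS)
share_of_gb200 = fraction_of(RTX_5090_TFLOPS, GB200_TFLOPS)

# Assumption: 209.5 is the FP32-accumulate rate, which the comment says runs
# at half speed, so the FP16-accumulate rate would be twice that.
implied_fp16_accumulate_tflops = 2 * RTX_5090_TFLOPS

print(f"5090 vs B200:  {share_of_b200:.1%}")
print(f"5090 vs GB200: {share_of_gb200:.1%}")
print(f"Implied FP16-accumulate rate: {implied_fp16_accumulate_tflops} TFLOPs")
```

Both shares come out under 10%, which matches the "not even 10%" claim.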

mota7 · 17m ago

Is there really that big a difference in TFLOPS between the GB100 and GB202 chips? The GB100 has fewer SMs than the GB202, so I'm confused about where the 10x performance would be coming from.

laidoffamazon · 5m ago

Isn't the new trend to train in lower precision anyway?

steinvakt2 · 1h ago

Isn't the 5090 FE (roughly 2500 USD in my country) pretty good FLOP value? It has 32 GB of VRAM, and flash attention pushes it even further ahead of Apple/MPS with its relatively cheap "VRAM".

neilmovva · 1h ago

Not really:

5090: 210 TF / $2k == 105 TF/$k

B200: 2250 TF / $40k == 56 TF/$k

Getting only 2x the FLOPs per dollar probably isn't worth the hassle of having to rack 10x as many GPUs, while having no NVLink.
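The arithmetic above can be reproduced with a short sketch. The prices and TFLOPs are the rough figures from the comment ($2k for the 5090, $40k for the B200), not verified list prices, and the helper name `tflops_per_kusd` is made up for illustration.

```python
import math


def tflops_per_kusd(tflops: float, price_usd: float) -> float:
    """TFLOPs per thousand dollars of purchase price."""
    return tflops / (price_usd / 1000)


# Rough figures quoted in the comment (assumptions, not verified prices).
rtx_5090 = tflops_per_kusd(210, 2_000)
b200 = tflops_per_kusd(2250, 40_000)

# How much more throughput per dollar the 5090 offers on paper.
value_ratio = rtx_5090 / b200

# How many 5090s it takes to match one B200 in raw TFLOPs.
cards_to_match = math.ceil(2250 / 210)

print(f"5090: {rtx_5090:.0f} TF/$k")
print(f"B200: {b200:.2f} TF/$k")
print(f"Ratio: {value_ratio:.2f}x, cards to match one B200: {cards_to_match}")
```

The ratio lands a little under 2x, and matching one B200 takes about 11 cards, in line with the "10x as many GPUs" estimate.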

lossolo · 53s ago

That's one of the reasons they removed NVLink from consumer cards (which supported it before). There's also the issue of power consumption (1x B200 vs. 10x 5090).

gautamcgoel · 40m ago

Do you have a source for that B200 price?

Scene_Cast2 · 55m ago

My issue with upgrading to the 5090 for workstation ML use is that it both has higher TDP than the 4090 and it can only be limited to 70% power (not 50% like the 4090).

ProofHouse · 4h ago

Damn awesome. This is going to take me 3 reads and a week to digest.

steinvakt2 · 4h ago

I had a 5090 some months ago but couldn't get flash attention to work. Does it now work natively? What about the 5080?

sigmoid10 · 3h ago

PyTorch now has native support for the Blackwell architecture:

https://pytorch.org/blog/pytorch-2-7/

SynasterBeiter · 1h ago

It does, but the performance is pretty bad, worse than Hopper.

zackangelo · 2h ago

Curious what issues you were having. The kernel should compile natively if you pass nvcc the correct arch flags, although it probably won't take advantage of any new hardware features.

doctorpangloss · 4h ago

Hmm, but supposing the accelerated NVIDIA-specific inference data types were available in Triton, you would just use that? Why not contribute to Triton, since they accept PRs? So what if you end up doing free product ecosystem development for NVIDIA and giant corporations by contributing to Triton?

qeternity · 3h ago

Second line of the post:

> The main objective is to learn writing attention in CUDA C++, since many features are not available in Triton, such as MXFP8 / NVFP4 MMA for sm120.

Writing Speed-of-Light Flash Attention for 5090 in CUDA C++

Comments (15)