How to Think About GPUs

62 points by alphabetting | 8/18/2025, 6:18:36 PM | jax-ml.github.io

Comments (19)

tormeh · 37m ago
I find it very hard to justify investing time into learning something that's neither open source nor has multiple interchangeable vendors. Being good at using Nvidia chips sounds a lot like being an ABAP consultant or similar to me. I realize there's a lot of money to be made in the field right now, but IIUC historically this kind of thing has not been a great move.
Philpax · 13m ago
There's more in common between GPU architectures than there are differences, so a CUDA consultant should be able to pivot if/when the other players become a going concern. It's more about the mindset than the specifics.
saagarjha · 16m ago
Sure, but you can make money in the field and retire faster than it becomes irrelevant. FWIW none of the ideas here are novel or nontransferable; it's just the specific design that is proprietary. Understanding how to do an AllReduce has been of theoretical interest for decades and will probably remain worth doing far into the future.
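
For the curious, the whole idea fits in a few lines. A toy, single-process sketch of the classic ring AllReduce (reduce-scatter followed by all-gather); the names and structure are illustrative, not any particular library's API:

    # n ranks in a ring, each holding n chunks; after the two phases
    # every rank holds the elementwise sum of all chunks. The
    # sequential loops stand in for simultaneous sends/receives.
    def ring_allreduce(data):
        n = len(data)
        for step in range(n - 1):            # reduce-scatter phase
            for r in range(n):
                i = (r - step - 1) % n       # chunk received this step
                data[r][i] += data[(r - 1) % n][i]
        for step in range(n - 1):            # all-gather phase
            for r in range(n):
                i = (r - step) % n           # pass the reduced chunk on
                data[r][i] = data[(r - 1) % n][i]

    ranks = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
    ring_allreduce(ranks)
    print(ranks)  # every rank ends up with [111, 222, 333]

Each rank only ever talks to its neighbors, which is why the pattern maps so well onto fixed point-to-point links.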
nickysielicki · 4h ago
The calculation under “Quiz 2: GPU nodes” is incorrect, to the best of my knowledge. There aren’t enough ports per GPU and/or per switch (less the crossbar connections) to fully realize the 450 GB/s that’s theoretically possible, which is why 3.2 TB/s of internode bandwidth is what’s offered on all of the major cloud providers and the reference systems. If it were 3.6 TB/s, this would produce internode bottlenecks in any distributed ring workload.

Shamelessly: I’m open to work if anyone is hiring.

aschleck · 2h ago
It's been a while since I thought about this, but isn't 3.2 Tbps advertised because that's the limit of a single node's connection to the IB network? DGX is spec'd to pair each H100 with a ConnectX-7 NIC, and those cap out at 400 Gbps. 8 GPUs × 400 Gbps/GPU = 3.2 Tbps.

Quiz 2 is confusingly worded but is, IIUC, referring to intranode GPU connections rather than internode networking.
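
The two figures live at different layers, which is easy to check back-of-the-envelope (numbers as quoted in this thread):

    # Internode: one 400 Gbps ConnectX-7 NIC per H100, 8 GPUs per node.
    gpus_per_node = 8
    nic_gbps = 400
    print(gpus_per_node * nic_gbps / 1000)  # 3.2 Tbps into the IB fabric

    # Intranode: ~450 GB/s of NVLink bandwidth per GPU.
    nvlink_gb_per_s = 450
    print(gpus_per_node * nvlink_gb_per_s)  # 3600 GB/s across NVLink

Note the units: the network figure is terabits per second, the NVLink figure gigabytes per second.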

physicsguy · 2h ago
It’s interesting that NVSHMEM has taken off in ML, because the MPI equivalents were never that satisfactory in the simulation world.

Mind you, I did all long-range force stuff, which is difficult to work with over multiple nodes at the best of times.

gregorygoc · 2h ago
It’s mind-boggling that NVIDIA hasn’t provided this resource itself. It has reached the point that third parties have to reverse-engineer and summarize NV hardware for it to become an actually useful mental model.

What are the actual incentives at NVIDIA? If it’s all about marketing, they’re doing great, but I have some doubts about the engineering culture.

threeducks · 38m ago
With mediocre documentation, NVIDIA’s closed-source libraries, such as cuBLAS and cuDNN, will remain the fastest way to perform certain tasks, thereby strengthening vendor lock-in. And of course it makes it more difficult for other companies to reverse-engineer them.
aanet · 5h ago
Fantastic resource! Thanks for posting it here.
tucnak · 44m ago
This post is a great illustration of why TPUs lend themselves more nicely to homogeneous computing: yes, there are systolic-array limitations (not good for sparsity), but all things considered, bandwidth doesn't change as your cluster grows ever larger. It's a shame Google is not interested in selling this hardware: if it were available, it would open the door to compute-in-network capabilities far beyond what's currently available, by combining non-homogeneous topologies involving various FPGA solutions, e.g. the Alveo V80 exposing 4x800G NICs.

Also: it's a shame Google doesn't talk about how they use TPUs outside of LLMs.

akshaydatazip · 2h ago
Thanks for the really thorough research on that. Just what I wanted for my morning coffee.
porridgeraisin · 1d ago
A short addition: pre-Volta Nvidia GPUs were SIMD, like TPUs are, not SIMT, which post-Volta Nvidia GPUs are.
camel-cdr · 21h ago
SIMT is just a programming model for SIMD.

Modern GPUs are still just SIMD with good predication support at the ISA level.

achierius · 1h ago
That's not true. SIMT notably allows for divergence and reconvergence, whereby single threads actually end up executing different work for a time, while in SIMD you always have to be in sync.
camel-cdr · 1h ago
I'm not aware of any GPU that implements this.

Even the interleaved execution introduced in Volta still can only execute one type of instruction at a time [1]. This feature wasn't meant to accelerate code, but to allow more composable programming models [2].

Going off the diagram, it looks equivalent to rapidly switching between predicates, not executing two different operations at once.

    if (threadIdx.x < 4) {
        A;
        B;
    } else {
        X;
        Y;
    }
    Z;
The diagram shows how this executes in the following order:

Volta:

    ->|   ->X   ->Y   ->Z|->
    ->|->A   ->B   ->Z   |->
pre Volta:

    ->|      ->X->Y|->Z
    ->|->A->B      |->Z
The SIMD equivalent of pre-Volta is:

    vslt mask, vid, 4     # mask[i] = (lane i < 4)
    vopA ..., mask        # A on lanes 0-3
    vopB ..., mask        # B on lanes 0-3
    vopX ..., ~mask       # X on the remaining lanes
    vopY ..., ~mask       # Y on the remaining lanes
    vopZ ...              # Z unmasked, on all lanes
The Volta model is:

    vslt mask, vid, 4     # same mask as above
    vopA ..., mask        # the two paths interleaved op by op,
    vopX ..., ~mask       # still one operation at a time
    vopB ..., mask
    vopY ..., ~mask
    vopZ ...

[1] https://chipsandcheese.com/i/138977322/shader-execution-reor...

[2] https://stackoverflow.com/questions/70987051/independent-thr...

adrian_b · 56m ago
"Divergence" is supported by any SIMD processor, but with various amounts of overhead depending on the architecture.

"Divergence" means that every "divergent" SIMD instruction is executed at least twice, with different masks, so that it is actually executed only on a subset of the lanes (i.e. CUDA "threads").

SIMT is a programming model, not a hardware implementation. NVIDIA has never explained exactly how the execution of divergent threads has been improved since Volta, but it is certain that, like before, the CUDA "threads" are not threads in the traditional sense, i.e. the CUDA "threads" do not have independent program counters that can be active simultaneously.

What seems to have been added since Volta is some mechanism for quickly saving and restoring separate program counters for each CUDA "thread", in order to be able to handle data dependencies between distinct CUDA "threads" by activating the "threads" in the proper order. But those saved per-"thread" program counters cannot become active simultaneously if they have different values, so you cannot simultaneously execute instructions from different CUDA "threads" unless they perform the same operation, which is the same constraint that exists in any SIMD processor.

Post-Volta, nothing has changed when there are no dependencies between the CUDA "threads" composing a CUDA "warp".

What has changed is that now you can have dependencies between the "threads" of a "warp" and the program will produce correct results, while with older GPUs that was unlikely. However, dependencies between the CUDA "threads" of a "warp" should be avoided whenever possible, because they reduce the achievable performance.
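
Concretely, the "execute both sides under a mask" model described above can be sketched in a few lines of NumPy (8 lanes for brevity; A/B/X/Y/Z stand in for arbitrary arithmetic, not any real ISA):

    import numpy as np

    lanes = np.arange(8)                 # one "warp" of 8 lanes
    x = np.ones(8)
    mask = lanes < 4                     # if (threadIdx.x < 4)

    x = np.where(mask, (x + 1) * 2, x)   # A;B computed on all lanes,
                                         # committed only where mask holds
    x = np.where(~mask, (x - 1) * 3, x)  # X;Y with the inverted mask
    x = x + 10                           # Z after reconvergence

    print(x)  # [14. 14. 14. 14. 10. 10. 10. 10.]

Both branch bodies run over the full vector; the mask only controls which lanes keep the result.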

porridgeraisin · 20h ago
I was referring to this portion of TFA

> CUDA cores are much more flexible than a TPU’s VPU: GPU CUDA cores use what is called a SIMT (Single Instruction Multiple Threads) programming model, compared to the TPU’s SIMD (Single Instruction Multiple Data) model.

adrian_b · 47m ago
This flexibility of CUDA is a software facility, which is independent of the hardware implementation.

For any SIMD processor one can write a compiler that translates a program written for the SIMT programming model into SIMD instructions. For example, for the Intel/AMD CPUs with SSE4/AVX/AVX-512 ISAs, there exists a compiler of this kind (ispc: https://github.com/ispc/ispc).
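
As a toy version of that translation: the divergent kernel from upthread, written once per "thread" in SIMT style, produces exactly the same result as the masked execution sketched above when a plain loop (standing in for the compiler) maps it across the lanes:

    # SIMT-style source: written as if each thread ran independently.
    def kernel(tid, buf):
        if tid < 4:
            buf[tid] = (buf[tid] + 1) * 2   # A; B
        else:
            buf[tid] = (buf[tid] - 1) * 3   # X; Y
        buf[tid] += 10                       # Z

    buf = [1.0] * 8
    for tid in range(8):                     # the "compiler"/hardware
        kernel(tid, buf)                     # maps threads onto lanes
    print(buf)  # [14.0, 14.0, 14.0, 14.0, 10.0, 10.0, 10.0, 10.0]

The programmer sees independent threads; the machine sees one predicated instruction stream.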

tomhow · 1h ago
Discussion of original series:

How to scale your model: A systems view of LLMs on TPUs - https://news.ycombinator.com/item?id=42936910 - Feb 2025 (30 comments)