How to Think About GPUs

62 points by alphabetting | 8/18/2025, 6:18:36 PM | jax-ml.github.io

Comments (19)

tormeh · 37m ago
I find it very hard to justify investing time into learning something that's neither open source nor has multiple interchangeable vendors. Being good at using Nvidia chips sounds a lot like being an ABAP consultant or similar to me. I realize there's a lot of money to be made in the field right now, but IIUC historically this kind of thing has not been a great move.
Philpax · 13m ago
There's more in common between GPU architectures than there are differences, so a CUDA consultant should be able to pivot if/when the other players become a going concern. It's more about the mindset than the specifics.
saagarjha · 16m ago
Sure, but you can make money in the field and retire faster than it becomes irrelevant. FWIW none of the ideas here are novel or nontransferable; it's just the specific design that is proprietary. Understanding how to do an AllReduce has been of theoretical interest for decades and will probably remain worth doing far into the future.
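
For the curious, the whole idea fits in a few lines. A toy, single-process sketch of the classic ring AllReduce (reduce-scatter followed by all-gather); the names and structure are illustrative, not any particular library's API:

    # n ranks in a ring, each holding n chunks; after the two phases
    # every rank holds the elementwise sum of all chunks. The
    # sequential loops stand in for simultaneous sends/receives.
    def ring_allreduce(data):
        n = len(data)
        for step in range(n - 1):            # reduce-scatter phase
            for r in range(n):
                i = (r - step - 1) % n       # chunk received this step
                data[r][i] += data[(r - 1) % n][i]
        for step in range(n - 1):            # all-gather phase
            for r in range(n):
                i = (r - step) % n           # pass the reduced chunk on
                data[r][i] = data[(r - 1) % n][i]

    ranks = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
    ring_allreduce(ranks)
    print(ranks)  # every rank ends up with [111, 222, 333]

Each rank only ever talks to its neighbors, which is why the pattern maps so well onto fixed point-to-point links.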
nickysielicki · 4h ago
The calculation under “Quiz 2: GPU nodes” is incorrect, to the best of my knowledge. There aren’t enough ports per GPU and/or per switch (less the crossbar connections) to fully realize the 450 GB/s that’s theoretically possible, which is why 3.2 TB/s of internode bandwidth is what’s offered on all of the major cloud providers and the reference systems. If it were 3.6 TB/s, this would produce internode bottlenecks in any distributed ring workload.

Shamelessly: I’m open to work if anyone is hiring.

aschleck · 2h ago
It's been a while since I thought about this, but isn't 3.2 Tbps advertised because that's the limit of a single node's connection to the IB network? DGX is spec'd to pair each H100 with a ConnectX-7 NIC, and those cap out at 400 Gbps. 8 GPUs × 400 Gbps/GPU = 3.2 Tbps.

Quiz 2 is confusingly worded but is, IIUC, referring to intranode GPU connections rather than internode networking.
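
The two figures live at different layers, which is easy to check back-of-the-envelope (numbers as quoted in this thread):

    # Internode: one 400 Gbps ConnectX-7 NIC per H100, 8 GPUs per node.
    gpus_per_node = 8
    nic_gbps = 400
    print(gpus_per_node * nic_gbps / 1000)  # 3.2 Tbps into the IB fabric

    # Intranode: ~450 GB/s of NVLink bandwidth per GPU.
    nvlink_gb_per_s = 450
    print(gpus_per_node * nvlink_gb_per_s)  # 3600 GB/s across NVLink

Note the units: the network figure is terabits per second, the NVLink figure gigabytes per second.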

physicsguy · 2h ago
It’s interesting that NVSHMEM has taken off in ML, because the MPI equivalents were never that satisfactory in the simulation world.

Mind you, I did all long-range force stuff, which is difficult to work with over multiple nodes at the best of times.

gregorygoc · 2h ago
It’s mind-boggling that NVIDIA hasn’t provided this resource itself. It has reached the point that third parties have to reverse-engineer and summarize NV hardware for it to become an actually useful mental model.

What are the actual incentives at NVIDIA? If it’s all about marketing, they’re doing great, but I have some doubts about the engineering culture.

threeducks · 38m ago
With mediocre documentation, NVIDIA’s closed-source libraries, such as cuBLAS and cuDNN, will remain the fastest way to perform certain tasks, thereby strengthening vendor lock-in. And of course it makes it more difficult for other companies to reverse-engineer them.
aanet · 5h ago
Fantastic resource! Thanks for posting it here.
tucnak · 44m ago
This post is a great illustration of why TPUs lend themselves more nicely to homogeneous computing: yes, there are systolic-array limitations (not good for sparsity), but all things considered, bandwidth doesn't change as your cluster grows ever larger. It's a shame Google is not interested in selling this hardware: if it were available, it would open the door to compute-in-network capabilities far beyond what's currently available, by combining non-homogeneous topologies involving various FPGA solutions, e.g. the Alveo V80 exposing 4x800G NICs.

Also: it's a shame Google doesn't talk about how they use TPUs outside of LLMs.

akshaydatazip · 2h ago
Thanks for the really thorough research on that. Just what I wanted for my morning coffee.
porridgeraisin · 1d ago
A short addition: pre-Volta Nvidia GPUs were SIMD, like TPUs are, not SIMT, which post-Volta Nvidia GPUs are.
camel-cdr · 21h ago
SIMT is just a programming model for SIMD.

Modern GPUs are still just SIMD with good predication support at the ISA level.

achierius · 1h ago
That's not true. SIMT notably allows for divergence and reconvergence, whereby single threads actually end up executing different work for a time, while in SIMD you always have to be in sync.
camel-cdr · 1h ago
I'm not aware of any GPU that implements this.

Even the interleaved execution introduced in Volta still can only execute one type of instruction at a time [1]. This feature wasn't meant to accelerate code, but to allow more composable programming models [2].

Going off the diagram, it looks equivalent to rapidly switching between predicates, not executing two different operations at once.

    if (threadIdx.x < 4) {
        A;
        B;
    } else {
        X;
        Y;
    }
    Z;
The diagram shows how this executes in the following order:

Volta:

    ->|   ->X   ->Y   ->Z|->
    ->|->A   ->B   ->Z   |->
pre Volta:

    ->|      ->X->Y|->Z
    ->|->A->B      |->Z
The SIMD equivalent of pre-Volta is:

    vslt mask, vid, 4     # mask[i] = (lane i < 4)
    vopA ..., mask        # A on lanes 0-3
    vopB ..., mask        # B on lanes 0-3
    vopX ..., ~mask       # X on the remaining lanes
    vopY ..., ~mask       # Y on the remaining lanes
    vopZ ...              # Z unmasked, on all lanes
The Volta model is:

    vslt mask, vid, 4     # same mask as above
    vopA ..., mask        # the two paths interleaved op by op,
    vopX ..., ~mask       # still one operation at a time
    vopB ..., mask
    vopY ..., ~mask
    vopZ ...

[1] https://chipsandcheese.com/i/138977322/shader-execution-reor...

[2] https://stackoverflow.com/questions/70987051/independent-thr...

adrian_b · 56m ago
"Divergence" is supported by any SIMD processor, but with various amounts of overhead depending on the architecture.

"Divergence" means that every "divergent" SIMD instruction is executed at least twice, with different masks, so that it is actually executed only on a subset of the lanes (i.e. CUDA "threads").

SIMT is a programming model, not a hardware implementation. NVIDIA has never explained exactly how the execution of divergent threads has been improved since Volta, but it is certain that, like before, the CUDA "threads" are not threads in the traditional sense, i.e. the CUDA "threads" do not have independent program counters that can be active simultaneously.

What seems to have been added since Volta is some mechanism for quickly saving and restoring separate program counters for each CUDA "thread", in order to be able to handle data dependencies between distinct CUDA "threads" by activating the "threads" in the proper order. But those saved per-"thread" program counters cannot become active simultaneously if they have different values, so you cannot simultaneously execute instructions from different CUDA "threads" unless they perform the same operation, which is the same constraint that exists in any SIMD processor.

Post-Volta, nothing has changed when there are no dependencies between the CUDA "threads" composing a CUDA "warp".

What has changed is that now you can have dependencies between the "threads" of a "warp" and the program will produce correct results, while with older GPUs that was unlikely. However, dependencies between the CUDA "threads" of a "warp" should be avoided whenever possible, because they reduce the achievable performance.
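
Concretely, the "execute both sides under a mask" model described above can be sketched in a few lines of NumPy (8 lanes for brevity; A/B/X/Y/Z stand in for arbitrary arithmetic, not any real ISA):

    import numpy as np

    lanes = np.arange(8)                 # one "warp" of 8 lanes
    x = np.ones(8)
    mask = lanes < 4                     # if (threadIdx.x < 4)

    x = np.where(mask, (x + 1) * 2, x)   # A;B computed on all lanes,
                                         # committed only where mask holds
    x = np.where(~mask, (x - 1) * 3, x)  # X;Y with the inverted mask
    x = x + 10                           # Z after reconvergence

    print(x)  # [14. 14. 14. 14. 10. 10. 10. 10.]

Both branch bodies run over the full vector; the mask only controls which lanes keep the result.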

porridgeraisin · 20h ago
I was referring to this portion of TFA

> CUDA cores are much more flexible than a TPU’s VPU: GPU CUDA cores use what is called a SIMT (Single Instruction Multiple Threads) programming model, compared to the TPU’s SIMD (Single Instruction Multiple Data) model.

adrian_b · 47m ago
This flexibility of CUDA is a software facility, which is independent of the hardware implementation.

For any SIMD processor one can write a compiler that translates a program written for the SIMT programming model into SIMD instructions. For example, for the Intel/AMD CPUs with SSE4/AVX/AVX-512 ISAs, there exists a compiler of this kind (ispc: https://github.com/ispc/ispc).
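
As a toy version of that translation: the divergent kernel from upthread, written once per "thread" in SIMT style, produces exactly the same result as the masked execution sketched above when a plain loop (standing in for the compiler) maps it across the lanes:

    # SIMT-style source: written as if each thread ran independently.
    def kernel(tid, buf):
        if tid < 4:
            buf[tid] = (buf[tid] + 1) * 2   # A; B
        else:
            buf[tid] = (buf[tid] - 1) * 3   # X; Y
        buf[tid] += 10                       # Z

    buf = [1.0] * 8
    for tid in range(8):                     # the "compiler"/hardware
        kernel(tid, buf)                     # maps threads onto lanes
    print(buf)  # [14.0, 14.0, 14.0, 14.0, 10.0, 10.0, 10.0, 10.0]

The programmer sees independent threads; the machine sees one predicated instruction stream.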

tomhow · 1h ago
Discussion of original series:

How to scale your model: A systems view of LLMs on TPUs - https://news.ycombinator.com/item?id=42936910 - Feb 2025 (30 comments)