This is my favorite type of HN post, and definitely going to be a classic in the genre for me.
> Memory optimization on ultra-high core count systems differs a lot from single-threaded memory management. Memory allocators themselves become contention points, memory bandwidth is divided across more cores, and allocation patterns that work fine on small systems can create cascading performance problems at scale. It is crucial to be mindful of how much memory is allocated and how memory is used.
In bioinformatics, one of the most popular alignment algorithms is roughly bottlenecked on random RAM access (the FM-index on the BWT of the genome), so I always wonder how these algorithms are going to perform on these beasts. It's been a decade since I spent any time optimizing large system performance for it though. NUMA was already challenging enough! I wonder how many memory channels these new chips have access to.
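For the curious, the hot loop has roughly this shape; a minimal sketch with my own names, not any production aligner's code (real implementations replace the linear scan in occ() with a sampled rank structure, but the access pattern is the same):

    #include <array>
    #include <cstdint>
    #include <string>
    #include <string_view>
    #include <utility>

    // Illustrative FM-index backward search. Each pattern character costs two
    // dependent, effectively random reads into a multi-gigabyte structure,
    // which is why the loop is bound by RAM latency rather than compute.
    struct FMIndexSketch {
        std::string bwt;                // Burrows-Wheeler transform of the text
        std::array<uint64_t, 256> C{};  // C[c] = count of symbols < c in the text

        // occ(c, i) = occurrences of c in bwt[0, i). O(n) here purely for brevity.
        uint64_t occ(uint8_t c, uint64_t i) const {
            uint64_t n = 0;
            for (uint64_t k = 0; k < i; ++k) n += (uint8_t(bwt[k]) == c);
            return n;
        }

        // Returns the suffix-array interval [lo, hi) of matches for `pattern`.
        std::pair<uint64_t, uint64_t> backward_search(std::string_view pattern) const {
            uint64_t lo = 0, hi = bwt.size();
            for (auto it = pattern.rbegin(); it != pattern.rend() && lo < hi; ++it) {
                uint8_t c = uint8_t(*it);
                lo = C[c] + occ(c, lo);  // random access #1
                hi = C[c] + occ(c, hi);  // random access #2
            }
            return {lo, hi};
        }
    };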
ashvardanian · 1h ago
My expectation is they'll perform great! I'm now mostly benchmarking on 192-core Intel, AMD, and Arm instances on AWS, and in some workloads, even GPU-friendly ones, they come surprisingly close to GPUs once you get the SIMD and NUMA pinning parts right.
For BioInformatics specifically, I’ve just finished benchmarking Intel SPR 16-core UMA slices against Nvidia H100, and will try to extend them soon: https://github.com/ashvardanian/StringWa.rs
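To make the NUMA-pinning part concrete, a minimal sketch assuming Linux and libnuma (illustrative only, not the benchmark's actual harness):

    // Pin each worker to one NUMA node and first-touch its buffer there, so
    // hot loops never cross the inter-socket interconnect. Link with -lnuma.
    #include <numa.h>
    #include <cstddef>
    #include <thread>
    #include <vector>

    void worker(int node, std::size_t bytes) {
        numa_run_on_node(node);                      // run only on this node's cores
        void* buf = numa_alloc_onnode(bytes, node);  // memory on the same node
        // ... run the SIMD kernel over buf ...
        numa_free(buf, bytes);
    }

    int main() {
        if (numa_available() < 0) return 1;  // kernel has no NUMA support
        int nodes = numa_num_configured_nodes();
        std::vector<std::thread> pool;
        for (int n = 0; n < nodes; ++n)
            pool.emplace_back(worker, n, std::size_t{1} << 30);  // 1 GiB per node
        for (auto& t : pool) t.join();
    }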
vlovich123 · 19m ago
I'm generally surprised they're still using the old, unmaintained version of jemalloc instead of a newer allocator like the Bazel-based TCMalloc or mimalloc, which have significantly better techniques thanks to better OS primitives and about a decade of additional R&D behind them.
mrits · 4m ago
Besides jemalloc also being used by other columnar databases, it has a lot of control and telemetry built in. I don't follow tcmalloc closely, but I'm not sure it focuses on large objects and fragmentation over months/years.
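For a sense of that telemetry, jemalloc exposes its internals through the mallctl API (real jemalloc calls; the harness around them is a sketch, and assumes jemalloc built without a symbol prefix):

    #include <jemalloc/jemalloc.h>
    #include <cstdint>
    #include <cstdio>

    int main() {
        // Stats are cached; bumping the epoch refreshes them before reading.
        uint64_t epoch = 1;
        size_t len = sizeof(epoch);
        mallctl("epoch", &epoch, &len, &epoch, len);

        size_t allocated = 0, resident = 0, sz = sizeof(size_t);
        mallctl("stats.allocated", &allocated, &sz, nullptr, 0);  // live allocation bytes
        mallctl("stats.resident", &resident, &sz, nullptr, 0);    // physically mapped bytes
        // resident/allocated creeping upward over months is exactly the
        // long-horizon fragmentation signal mentioned above.
        std::printf("allocated=%zu resident=%zu\n", allocated, resident);
    }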
secondcoming · 24m ago
Those ClickHouse people get to work on some cool stuff
sdairs · 20m ago
We do! (and we're hiring!)
bee_rider · 1h ago
288 cores is an absurd number of cores.
Do these things have AVX512? It looks like some of the Sierra Forest chips do have AVX512 with 2xFMA…
That’s pretty wide. Wonder if they should put that thing on a card and sell it as a GPU (a totally original idea that has never been tried, sure…).
jsheard · 4m ago
It's pretty wide, but 288 cores with 8 FP32 SIMD lanes each is still only about a tenth of the lanes on an RTX 5090. GPUs are really, really, really wide.
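For concreteness, the arithmetic behind that estimate (assuming 256-bit AVX2 vectors, i.e. 8 FP32 lanes per core, and taking the 5090's advertised CUDA core count as its lane count):

    // Back-of-the-envelope lane comparison, checked at compile time.
    constexpr int cpu_lanes = 288 * (256 / 32);  // 288 cores x 8 FP32 lanes = 2304
    constexpr int gpu_lanes = 21760;             // RTX 5090 CUDA cores
    static_assert(gpu_lanes / cpu_lanes == 9);   // the GPU is ~9.4x wider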
bri3d · 13m ago
Sierra Forest (the 288-core one) does not have AVX512.
Intel split their server product line in two:
* Processors that have only P-cores (currently, Granite Rapids), which do have AVX512.
* Processors that have only E-cores (currently, Sierra Forest), which do not have AVX512.
On the other hand, AMD's high-core, lower-area offerings, like Zen 4c (Bergamo) do support AVX512, which IMO makes things easier.
ashvardanian · 2m ago
Largely true, but there is always a caveat.
On Zen4 and Zen4c the registers are 512 bits wide. Internally, however, many of the datapaths (execution units, floating-point units, vector ALUs, etc.) are only 256 bits wide, so much of the AVX-512 work is double-pumped through 256-bit units…
Zen5 is supposed to be different. I wrote the kernels for Zen5 last year, but I still have no hardware to profile the impact of this implementation difference on practical systems :(
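Once hardware shows up, one way to quantify it: compare sustained FMA throughput with 512-bit vs 256-bit vectors. On double-pumped 256-bit datapaths the two loops retire about the same FLOPs per cycle; on true 512-bit units the zmm loop should be ~2x faster. A rough sketch (compile with e.g. -O2 -march=znver4; timing harness omitted):

    #include <immintrin.h>

    float fma_throughput_zmm(long iters) {
        __m512 a = _mm512_set1_ps(1.0f), b = _mm512_set1_ps(1.000001f);
        // Independent accumulators so we measure throughput, not FMA latency.
        __m512 acc0 = _mm512_setzero_ps(), acc1 = acc0, acc2 = acc0, acc3 = acc0;
        for (long i = 0; i < iters; ++i) {
            acc0 = _mm512_fmadd_ps(a, b, acc0);
            acc1 = _mm512_fmadd_ps(a, b, acc1);
            acc2 = _mm512_fmadd_ps(a, b, acc2);
            acc3 = _mm512_fmadd_ps(a, b, acc3);
        }
        __m512 acc = _mm512_add_ps(_mm512_add_ps(acc0, acc1), _mm512_add_ps(acc2, acc3));
        return _mm512_reduce_add_ps(acc);
    }

    float fma_throughput_ymm(long iters) {
        __m256 a = _mm256_set1_ps(1.0f), b = _mm256_set1_ps(1.000001f);
        __m256 acc0 = _mm256_setzero_ps(), acc1 = acc0, acc2 = acc0, acc3 = acc0;
        for (long i = 0; i < iters; ++i) {
            acc0 = _mm256_fmadd_ps(a, b, acc0);
            acc1 = _mm256_fmadd_ps(a, b, acc1);
            acc2 = _mm256_fmadd_ps(a, b, acc2);
            acc3 = _mm256_fmadd_ps(a, b, acc3);
        }
        __m256 acc = _mm256_add_ps(_mm256_add_ps(acc0, acc1), _mm256_add_ps(acc2, acc3));
        float lanes[8];
        _mm256_storeu_ps(lanes, acc);
        float sum = 0;
        for (float v : lanes) sum += v;  // horizontal reduction of 8 lanes
        return sum;
    }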
ashvardanian · 1h ago
Sadly, no! On the bright side, they support the new AVX2 VNNI extensions, which help with low-precision integer dot products for Vector Search!
SimSIMD (inside USearch (inside ClickHouse)) already has those SIMD kernels, but I don’t yet have the hardware to benchmark :(
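The kernels in question look roughly like this; a minimal sketch, not SimSIMD's actual code (it uses the VEX-encoded _mm256_dpbusd_avx_epi32, so it needs a compiler recent enough for -mavxvnni; note VNNI's quirk that one operand is unsigned):

    #include <immintrin.h>
    #include <cstddef>
    #include <cstdint>

    // Mixed u8 x i8 dot product: one VNNI instruction handles 32 byte-pairs,
    // multiplying and accumulating groups of 4 into i32 lanes.
    int32_t dot_u8i8(const uint8_t* a, const int8_t* b, std::size_t n) {
        __m256i acc = _mm256_setzero_si256();
        std::size_t i = 0;
        for (; i + 32 <= n; i += 32) {
            __m256i va = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(a + i));
            __m256i vb = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(b + i));
            acc = _mm256_dpbusd_avx_epi32(acc, va, vb);
        }
        int32_t lanes[8];  // horizontal sum of the eight i32 lanes
        _mm256_storeu_si256(reinterpret_cast<__m256i*>(lanes), acc);
        int32_t sum = 0;
        for (int32_t v : lanes) sum += v;
        for (; i < n; ++i) sum += int32_t(a[i]) * int32_t(b[i]);  // scalar tail
        return sum;
    }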
yvdriess · 1h ago
Something that could help is to use llvm-mca or similar to get an idea of the potential speedup.
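For anyone unfamiliar with the workflow, it looks roughly like this (commands in the comments; the -march/-mcpu names assume a Clang recent enough to know Sierra Forest):

    // clang++ -O3 -march=sapphirerapids -S kernel.cpp -o - | llvm-mca -mcpu=sapphirerapids
    // clang++ -O3 -march=sierraforest  -S kernel.cpp -o - | llvm-mca -mcpu=sierraforest
    // llvm-mca then reports estimated IPC, port pressure, and the bottleneck
    // resource for the loop body on each target.
    float dot(const float* a, const float* b, int n) {
        float s = 0;
        for (int i = 0; i < n; ++i) s += a[i] * b[i];
        return s;
    }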
Sesse__ · 1h ago
A basic block simulator like llvm-mca is unlikely to give useful information here, as memory access is going to play a significant part in the overall performance.
pclmulqdq · 53m ago
AVX-512 is on the P-cores only (along with AMX now). The E-cores only support 256-bit vectors.
If you're doing a lot of loading and storing, these E-core chips will probably outperform the chips with huge cores, because the cores will mostly be idling on memory anyway. For CPU-bound tasks, the P-cores will win hands down.
sdairs · 36m ago
how long until I have 288 cores under my desk I wonder?
Just from the first sections, this post looks like excellent low-level optimisation writing, and (I know this is kinda petty, but...) my heart absolutely sings at their use of my preferred C++ coding convention, where the & (ref) belongs to neither the type nor the variable name!
nivertech · 1h ago
I think it belongs to the type, but since they use "auto" it looks standalone and can be confused with the "&" (address-of) operator. I personally always used * and & as a prefix of the variable name, not as a suffix of the type name, except when specifying types in templates.
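Side by side, the placements being discussed (illustrative):

    int main() {
        int value = 42;
        auto& a = value;   // '&' attached to the type
        auto &b = value;   // '&' attached to the variable name
        auto & c = value;  // standalone '&': the article's (and GP's) convention
        int *p = &value;   // '*' as a prefix of the name, per the comment above
        return a + b + c + *p - 4 * value;  // all four alias the same int
    }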
jiehong · 1h ago
Great work!
I like duckdb, but clickhouse seems more focused on large scale performance.
I just thought it was a bit weird that the article is written from the point of view of a single person but has multiple authors. Did I misunderstand something?
sdairs · 50m ago
ClickHouse works in-process and on the CLI just like DuckDB, but also scales to hundreds of nodes - so it's really not limited to just large scale. Handling those smaller cases with a great experience is still a big focus for us
hobo_in_library · 1h ago
Not sure what happened here, but it's not uncommon for a post to have one primary author and then multiple reviewers/supporters also credited
https://www.titancomputers.com/Titan-A900-Octane-Dual-AMD-EP...