This is my favorite type of HN post, and definitely going to be a classic in the genre for me.
> Memory optimization on ultra-high core count systems differs a lot from single-threaded memory management. Memory allocators themselves become contention points, memory bandwidth is divided across more cores, and allocation patterns that work fine on small systems can create cascading performance problems at scale. It is crucial to be mindful of how much memory is allocated and how memory is used.
In bioinformatics, one of the most popular alignment algorithms is roughly bottlenecked on random RAM access (the FM-index on the BWT of the genome), so I always wonder how these algorithms are going to perform on these beasts. It's been a decade since I spent any time optimizing large system performance for it though. NUMA was already challenging enough! I wonder how many memory channels these new chips have access to.
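For the curious, the hot loop has roughly this shape; a minimal sketch with my own names, not any production aligner's code (real implementations replace the linear scan in occ() with a sampled rank structure, but the access pattern is the same):

    #include <array>
    #include <cstdint>
    #include <string>
    #include <string_view>
    #include <utility>

    // Illustrative FM-index backward search. Each pattern character costs two
    // dependent, effectively random reads into a multi-gigabyte structure,
    // which is why the loop is bound by RAM latency rather than compute.
    struct FMIndexSketch {
        std::string bwt;                // Burrows-Wheeler transform of the text
        std::array<uint64_t, 256> C{};  // C[c] = count of symbols < c in the text

        // occ(c, i) = occurrences of c in bwt[0, i). O(n) here purely for brevity.
        uint64_t occ(uint8_t c, uint64_t i) const {
            uint64_t n = 0;
            for (uint64_t k = 0; k < i; ++k) n += (uint8_t(bwt[k]) == c);
            return n;
        }

        // Returns the suffix-array interval [lo, hi) of matches for `pattern`.
        std::pair<uint64_t, uint64_t> backward_search(std::string_view pattern) const {
            uint64_t lo = 0, hi = bwt.size();
            for (auto it = pattern.rbegin(); it != pattern.rend() && lo < hi; ++it) {
                uint8_t c = uint8_t(*it);
                lo = C[c] + occ(c, lo);  // random access #1
                hi = C[c] + occ(c, hi);  // random access #2
            }
            return {lo, hi};
        }
    };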
ashvardanian · 1h ago
My expectation is they'll perform great! I'm now mostly benchmarking on 192-core Intel, AMD, and Arm instances on AWS, and in some workloads, even GPU-friendly ones, they come surprisingly close to GPUs once you get the SIMD and NUMA pinning parts right.
For BioInformatics specifically, I’ve just finished benchmarking Intel SPR 16-core UMA slices against Nvidia H100, and will try to extend them soon: https://github.com/ashvardanian/StringWa.rs
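To make the NUMA-pinning part concrete, a minimal sketch assuming Linux and libnuma (illustrative only, not the benchmark's actual harness):

    // Pin each worker to one NUMA node and first-touch its buffer there, so
    // hot loops never cross the inter-socket interconnect. Link with -lnuma.
    #include <numa.h>
    #include <cstddef>
    #include <thread>
    #include <vector>

    void worker(int node, std::size_t bytes) {
        numa_run_on_node(node);                      // run only on this node's cores
        void* buf = numa_alloc_onnode(bytes, node);  // memory on the same node
        // ... run the SIMD kernel over buf ...
        numa_free(buf, bytes);
    }

    int main() {
        if (numa_available() < 0) return 1;  // kernel has no NUMA support
        int nodes = numa_num_configured_nodes();
        std::vector<std::thread> pool;
        for (int n = 0; n < nodes; ++n)
            pool.emplace_back(worker, n, std::size_t{1} << 30);  // 1 GiB per node
        for (auto& t : pool) t.join();
    }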
vlovich123 · 19m ago
I'm generally surprised they're still using the old, unmaintained version of jemalloc instead of a newer allocator like the Bazel-based TCMalloc or mimalloc, which have significantly better techniques thanks to better OS primitives and about a decade of additional R&D behind them.
mrits · 4m ago
Besides jemalloc also being used by other columnar databases, it has a lot of control and telemetry built in. I don't follow tcmalloc closely, but I'm not sure it focuses on large objects and fragmentation over months/years.
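For a sense of that telemetry, jemalloc exposes its internals through the mallctl API (real jemalloc calls; the harness around them is a sketch, and assumes jemalloc built without a symbol prefix):

    #include <jemalloc/jemalloc.h>
    #include <cstdint>
    #include <cstdio>

    int main() {
        // Stats are cached; bumping the epoch refreshes them before reading.
        uint64_t epoch = 1;
        size_t len = sizeof(epoch);
        mallctl("epoch", &epoch, &len, &epoch, len);

        size_t allocated = 0, resident = 0, sz = sizeof(size_t);
        mallctl("stats.allocated", &allocated, &sz, nullptr, 0);  // live allocation bytes
        mallctl("stats.resident", &resident, &sz, nullptr, 0);    // physically mapped bytes
        // resident/allocated creeping upward over months is exactly the
        // long-horizon fragmentation signal mentioned above.
        std::printf("allocated=%zu resident=%zu\n", allocated, resident);
    }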
secondcoming · 24m ago
Those ClickHouse people get to work on some cool stuff
sdairs · 20m ago
We do! (and we're hiring!)
bee_rider · 1h ago
288 cores is an absurd number of cores.
Do these things have AVX512? It looks like some of the Sierra Forest chips do have AVX512 with 2xFMA…
That’s pretty wide. Wonder if they should put that thing on a card and sell it as a GPU (a totally original idea that has never been tried, sure…).
jsheard · 4m ago
It's pretty wide, but 288 cores with 8 FP32 SIMD lanes each is still only about a tenth of the lanes on an RTX 5090. GPUs are really, really, really wide.
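For concreteness, the arithmetic behind that estimate (assuming 256-bit AVX2 vectors, i.e. 8 FP32 lanes per core, and taking the 5090's advertised CUDA core count as its lane count):

    // Back-of-the-envelope lane comparison, checked at compile time.
    constexpr int cpu_lanes = 288 * (256 / 32);  // 288 cores x 8 FP32 lanes = 2304
    constexpr int gpu_lanes = 21760;             // RTX 5090 CUDA cores
    static_assert(gpu_lanes / cpu_lanes == 9);   // the GPU is ~9.4x wider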
bri3d · 13m ago
Sierra Forest (the 288-core one) does not have AVX512.
Intel split their server product line in two:
* Processors that have only P-cores (currently, Granite Rapids), which do have AVX512.
* Processors that have only E-cores (currently, Sierra Forest), which do not have AVX512.
On the other hand, AMD's high-core, lower-area offerings, like Zen 4c (Bergamo) do support AVX512, which IMO makes things easier.
ashvardanian · 2m ago
Largely true, but there is always a caveat.
On Zen4 and Zen4c the registers are 512 bits wide. Internally, however, many of the datapaths (execution units, floating-point units, vector ALUs, etc.) are only 256 bits wide, so much of the AVX-512 work is double-pumped through 256-bit units…
Zen5 is supposed to be different. I wrote the kernels for Zen5 last year, but I still have no hardware to profile the impact of this implementation difference on practical systems :(
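Once hardware shows up, one way to quantify it: compare sustained FMA throughput with 512-bit vs 256-bit vectors. On double-pumped 256-bit datapaths the two loops retire about the same FLOPs per cycle; on true 512-bit units the zmm loop should be ~2x faster. A rough sketch (compile with e.g. -O2 -march=znver4; timing harness omitted):

    #include <immintrin.h>

    float fma_throughput_zmm(long iters) {
        __m512 a = _mm512_set1_ps(1.0f), b = _mm512_set1_ps(1.000001f);
        // Independent accumulators so we measure throughput, not FMA latency.
        __m512 acc0 = _mm512_setzero_ps(), acc1 = acc0, acc2 = acc0, acc3 = acc0;
        for (long i = 0; i < iters; ++i) {
            acc0 = _mm512_fmadd_ps(a, b, acc0);
            acc1 = _mm512_fmadd_ps(a, b, acc1);
            acc2 = _mm512_fmadd_ps(a, b, acc2);
            acc3 = _mm512_fmadd_ps(a, b, acc3);
        }
        __m512 acc = _mm512_add_ps(_mm512_add_ps(acc0, acc1), _mm512_add_ps(acc2, acc3));
        return _mm512_reduce_add_ps(acc);
    }

    float fma_throughput_ymm(long iters) {
        __m256 a = _mm256_set1_ps(1.0f), b = _mm256_set1_ps(1.000001f);
        __m256 acc0 = _mm256_setzero_ps(), acc1 = acc0, acc2 = acc0, acc3 = acc0;
        for (long i = 0; i < iters; ++i) {
            acc0 = _mm256_fmadd_ps(a, b, acc0);
            acc1 = _mm256_fmadd_ps(a, b, acc1);
            acc2 = _mm256_fmadd_ps(a, b, acc2);
            acc3 = _mm256_fmadd_ps(a, b, acc3);
        }
        __m256 acc = _mm256_add_ps(_mm256_add_ps(acc0, acc1), _mm256_add_ps(acc2, acc3));
        float lanes[8];
        _mm256_storeu_ps(lanes, acc);
        float sum = 0;
        for (float v : lanes) sum += v;  // horizontal reduction of 8 lanes
        return sum;
    }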
ashvardanian · 1h ago
Sadly, no! On the bright side, they support the new AVX2 VNNI extensions, which help with low-precision integer dot products for Vector Search!
SimSIMD (inside USearch (inside ClickHouse)) already has those SIMD kernels, but I don’t yet have the hardware to benchmark :(
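The kernels in question look roughly like this; a minimal sketch, not SimSIMD's actual code (it uses the VEX-encoded _mm256_dpbusd_avx_epi32, so it needs a compiler recent enough for -mavxvnni; note VNNI's quirk that one operand is unsigned):

    #include <immintrin.h>
    #include <cstddef>
    #include <cstdint>

    // Mixed u8 x i8 dot product: one VNNI instruction handles 32 byte-pairs,
    // multiplying and accumulating groups of 4 into i32 lanes.
    int32_t dot_u8i8(const uint8_t* a, const int8_t* b, std::size_t n) {
        __m256i acc = _mm256_setzero_si256();
        std::size_t i = 0;
        for (; i + 32 <= n; i += 32) {
            __m256i va = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(a + i));
            __m256i vb = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(b + i));
            acc = _mm256_dpbusd_avx_epi32(acc, va, vb);
        }
        int32_t lanes[8];  // horizontal sum of the eight i32 lanes
        _mm256_storeu_si256(reinterpret_cast<__m256i*>(lanes), acc);
        int32_t sum = 0;
        for (int32_t v : lanes) sum += v;
        for (; i < n; ++i) sum += int32_t(a[i]) * int32_t(b[i]);  // scalar tail
        return sum;
    }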
yvdriess · 1h ago
Something that could help is to use llvm-mca or similar to get an idea of the potential speedup.
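For anyone unfamiliar with the workflow, it looks roughly like this (commands in the comments; the -march/-mcpu names assume a Clang recent enough to know Sierra Forest):

    // clang++ -O3 -march=sapphirerapids -S kernel.cpp -o - | llvm-mca -mcpu=sapphirerapids
    // clang++ -O3 -march=sierraforest  -S kernel.cpp -o - | llvm-mca -mcpu=sierraforest
    // llvm-mca then reports estimated IPC, port pressure, and the bottleneck
    // resource for the loop body on each target.
    float dot(const float* a, const float* b, int n) {
        float s = 0;
        for (int i = 0; i < n; ++i) s += a[i] * b[i];
        return s;
    }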
Sesse__ · 1h ago
A basic block simulator like llvm-mca is unlikely to give useful information here, as memory access is going to play a significant part in the overall performance.
pclmulqdq · 53m ago
AVX-512 is on the P-cores only (along with AMX now). The E-cores only support 256-bit vectors.
If you're doing a lot of loading and storing, these E-core chips will probably outperform the chips with huge cores, because the cores will mostly be idling on memory anyway. For CPU-bound tasks, the P-cores will win hands down.
sdairs · 36m ago
how long until I have 288 cores under my desk I wonder?
Just from the first sections, this post looks like excellent low-level optimisation writing, and (I know this is kinda petty, but...) my heart absolutely sings at their use of my preferred C++ coding convention, where the & (ref) belongs to neither the type nor the variable name!
nivertech · 1h ago
I think it belongs to the type, but since they use "auto" it looks standalone and can be confused with the "&" (address-of) operator. I personally always used * and & as a prefix of the variable name, not as a suffix of the type name, except when specifying types in templates.
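Side by side, the placements being discussed (illustrative):

    int main() {
        int value = 42;
        auto& a = value;   // '&' attached to the type
        auto &b = value;   // '&' attached to the variable name
        auto & c = value;  // standalone '&': the article's (and GP's) convention
        int *p = &value;   // '*' as a prefix of the name, per the comment above
        return a + b + c + *p - 4 * value;  // all four alias the same int
    }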
jiehong · 1h ago
Great work!
I like duckdb, but clickhouse seems more focused on large scale performance.
I just thought it was a bit weird that the article is written from the point of view of a single person but has multiple authors. Did I misunderstand something?
sdairs · 50m ago
ClickHouse works in-process and on the CLI just like DuckDB, but also scales to hundreds of nodes - so it's really not limited to just large scale. Handling those smaller cases with a great experience is still a big focus for us
hobo_in_library · 1h ago
Not sure what happened here, but it's not uncommon for a post to have one primary author and then multiple reviewers/supporters also credited
https://www.titancomputers.com/Titan-A900-Octane-Dual-AMD-EP...