Faster Index I/O with NVMe SSDs

93 points by ingve | 8/17/2025, 1:17:20 PM | marginalia.nu

Comments (11)

mgerdts · 5m ago
> Modern enterprise NVMe SSDs are very fast…. This is a simple benchmark on a Samsung PM9A1 with a theoretical maximum transfer rate of 3.5 GB/s. … It should be noted that this is a sub-optimal setup that is less powerful than what the PM9A1 is capable of due to running on a downgraded PCIe link.

Samsung has client, datacenter, and enterprise lines. The PM9A1 is part of the OEM client segment and is about the same as a 980 Pro. Its top speeds (about 7 GB/s read, 5 GB/s write) are better than those of the comparable datacenter-class drive, the PM9A3. Those top speeds come with less consistent performance than you get from a PM9A3 or an enterprise drive like the PM1733 from the same era (early PCIe Gen 4 drives).

kvemkon · 1h ago
> 128 KB appears a point of diminishing returns, larger block sizes yield similar or worse performance.

Indeed, 128 KB is a well-known, long-standing optimal buffer size [1], [2].

Until it was recently increased to 256 KB (07.04.2024) [3].

[1] https://github.com/MidnightCommander/mc/commit/e7c01c7781dcd...

[2] https://github.com/MidnightCommander/mc/issues/2193

[3] https://github.com/MidnightCommander/mc/commit/933b111a5dc7d...

marginalia_nu · 1h ago
I wonder if a more robust option is to peek at the sysfs queue info on Linux.

It has some nice information about hardware io operation limits, and also an optimal_io_size hint.

https://www.kernel.org/doc/html/v5.3/block/queue-sysfs.html
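
For illustration, a minimal C sketch that pulls those hints out of sysfs (the device name and the attribute list are just examples; see the queue-sysfs doc above for what each attribute means):

    /* Read I/O limit hints from the block layer's queue sysfs attributes.
     * The device name is hypothetical; adjust for your system. */
    #include <stdio.h>

    static long read_queue_attr(const char *dev, const char *attr) {
        char path[256];
        long value = -1;
        snprintf(path, sizeof(path), "/sys/block/%s/queue/%s", dev, attr);
        FILE *f = fopen(path, "r");
        if (f) {
            if (fscanf(f, "%ld", &value) != 1)
                value = -1;
            fclose(f);
        }
        return value;
    }

    int main(void) {
        const char *dev = "nvme0n1";  /* hypothetical device */
        printf("logical_block_size: %ld\n", read_queue_attr(dev, "logical_block_size"));
        printf("minimum_io_size:    %ld\n", read_queue_attr(dev, "minimum_io_size"));
        printf("optimal_io_size:    %ld\n", read_queue_attr(dev, "optimal_io_size"));
        printf("max_sectors_kb:     %ld\n", read_queue_attr(dev, "max_sectors_kb"));
        return 0;
    }

One caveat: plenty of drives report 0 for optimal_io_size, so a fallback heuristic is still needed.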

marginalia_nu · 3h ago
I urge you to read the papers and articles I linked at the end if any of this is your jam. They are incredible bangers, all of them.

6r17 · 3h ago
Thanks for sharing this!

codeaether · 1h ago
Actually, to fully utilize NVMe performance, one really needs to avoid OS overhead by leveraging async I/O such as io_uring. In fact, 4 KB pages work quite well if you can issue enough outstanding requests. See the paper linked below by the TUM folks.

https://dl.acm.org/doi/abs/10.14778/3598581.3598584
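
Roughly what that looks like with liburing, as a sketch rather than the paper's benchmark code (assumes liburing is installed; file name, queue depth, and block count are made up; build with cc file.c -luring):

    /* Sketch: keep many 4 KB reads in flight against one file with io_uring.
     * Each buffer always has exactly one request in flight. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <liburing.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define QD          128      /* outstanding requests */
    #define BS          4096     /* 4 KB blocks, O_DIRECT-aligned */
    #define NREQ        100000   /* total reads to issue */
    #define FILE_BLOCKS 262144   /* hypothetical 1 GiB test file, in 4 KB blocks */

    static off_t random_offset(void) {
        return (off_t)(rand() % FILE_BLOCKS) * BS;
    }

    int main(void) {
        int fd = open("datafile", O_RDONLY | O_DIRECT);   /* hypothetical test file */
        if (fd < 0) { perror("open"); return 1; }

        struct io_uring ring;
        if (io_uring_queue_init(QD, &ring, 0) < 0) {
            fprintf(stderr, "io_uring_queue_init failed\n");
            return 1;
        }

        long submitted = 0, completed = 0;

        /* Prime the ring: one read in flight per buffer. */
        for (int i = 0; i < QD; i++) {
            void *buf;
            if (posix_memalign(&buf, BS, BS)) return 1;
            struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
            io_uring_prep_read(sqe, fd, buf, BS, random_offset());
            io_uring_sqe_set_data(sqe, buf);
            submitted++;
        }
        io_uring_submit(&ring);

        /* Reap one completion at a time, immediately reusing its buffer. */
        while (completed < NREQ) {
            struct io_uring_cqe *cqe;
            if (io_uring_wait_cqe(&ring, &cqe) != 0) break;
            if (cqe->res < 0)
                fprintf(stderr, "read failed: %d\n", cqe->res);
            void *buf = io_uring_cqe_get_data(cqe);
            io_uring_cqe_seen(&ring, cqe);
            completed++;

            if (submitted < NREQ) {
                struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
                io_uring_prep_read(sqe, fd, buf, BS, random_offset());
                io_uring_sqe_set_data(sqe, buf);
                submitted++;
                io_uring_submit(&ring);
            }
        }

        io_uring_queue_exit(&ring);
        return 0;
    }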

marginalia_nu · 49m ago
Given the problem domain of index lookups, issuing multiple requests at the same time is not possible, except as part of some entirely guess-based readahead scheme that may indeed drive up disk utilization but is unlikely to do much else. Large blocks are a solution with that constraint as a given.

That paper seems to mostly focus on throughput via concurrent independent queries, rather than single-query performance. It's arriving at a different solution because it's optimizing for a different variable.

ozgrakkurt · 21m ago
4 KB is much slower than 512 KB if you are using all of the data. Smaller should be better if there is read amplification.
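
A quick way to see that effect is a sequential O_DIRECT scan of the same file at both block sizes, something like this sketch (file name and block sizes are arbitrary; O_DIRECT keeps the page cache and readahead out of the comparison):

    /* Sketch: time a full sequential read of a file at two block sizes,
     * bypassing the page cache with O_DIRECT. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <unistd.h>

    static double scan_mb_per_s(const char *path, size_t bs) {
        int fd = open(path, O_RDONLY | O_DIRECT);
        if (fd < 0) { perror("open"); exit(1); }

        void *buf;
        if (posix_memalign(&buf, 4096, bs)) exit(1);

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);

        ssize_t n;
        long long total = 0;
        while ((n = read(fd, buf, bs)) > 0)
            total += n;

        clock_gettime(CLOCK_MONOTONIC, &t1);
        close(fd);
        free(buf);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        return total / secs / (1 << 20);   /* MB/s */
    }

    int main(void) {
        const char *path = "datafile";   /* hypothetical test file */
        printf("4 KB reads:   %.0f MB/s\n", scan_mb_per_s(path, 4 * 1024));
        printf("512 KB reads: %.0f MB/s\n", scan_mb_per_s(path, 512 * 1024));
        return 0;
    }
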
dataflow · 1h ago
SPDK is what folks who really care about this use, I think.

kvemkon · 2h ago
> 256 KB vs 512 B

> A counter argument might be that this drives massive read amplification,

For that, one needs to know the true minimal block size the SSD controller is able to physically read from flash. Asking for less than that wouldn't avoid the amplification.
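
The kernel only exposes what the controller chooses to advertise, e.g. via the BLKSSZGET/BLKPBSZGET ioctls; a sketch (device path is hypothetical, usually needs root, and the reported physical block size is not necessarily the drive's real internal flash page or read unit):

    /* Sketch: ask the block layer what the drive advertises as its logical
     * (LBA) and physical block sizes. */
    #include <fcntl.h>
    #include <linux/fs.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("/dev/nvme0n1", O_RDONLY);   /* hypothetical device */
        if (fd < 0) { perror("open"); return 1; }

        int logical = 0;
        unsigned int physical = 0;
        ioctl(fd, BLKSSZGET, &logical);    /* logical sector size (current LBA format) */
        ioctl(fd, BLKPBSZGET, &physical);  /* physical block size, as advertised */

        printf("logical:  %d bytes\n", logical);
        printf("physical: %u bytes\n", physical);
        close(fd);
        return 0;
    }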

jeffbee · 2h ago
Fun post. One unmentioned parameter is the LBA format being used. Most devices come from the factory configured for 512 B, so you can boot NetWare or satisfy some other dumb compatibility concern. But there isn't a workload from this century where that makes sense, so it pays to explore the performance impact of the LBA formats your device offers. Using a larger one can mean your device manages I/O backlogs more efficiently.