Shouldn't you also compare to mmap with the huge page option? My understanding is it's precisely meant for this circumstance. I don't think it's a fair comparison without it.
Respectfully, the title feels a little clickbaity to me. Both methods are still ultimately reading out of memory, they are just using different i/o methods.
hsn915 · 45m ago
Shouldn't this be "io_uring is faster than mmap"?
I guess that would not get much engagement though!
That said, cool write up and experiment.
jared_hulbert · 41m ago
Lol. Thanks.
titanomachy · 36m ago
Very interesting article, thanks for publishing these tests!
Is the manual loop unrolling really necessary to get vectorized machine code? I would have guessed that the highest optimization levels in LLVM would be able to figure it out from the basic code. That's a very uneducated guess, though.
Also, curious if you tried using the MAP_POPULATE option with mmap. Could that improve the bandwidth of the naive in-memory solution?
> humanity doesn't have the silicon fabs or the power plants to support this for every moron vibe coder out there making an app.
lol. I bet if someone took the time to make a high-quality well-documented fast-IO library based on your io_uring solution, it would get use.
inetknght · 49m ago
Nice write-up with good information, but not the best. Comments below.
Are you using Linux? I assume so, since you state use of mmap() and mention EPYC hardware (which rules out macOS). I suppose you could use any other *nix, though.
> We'll use a 50GB dataset for most benchmarking here, because when I started this I thought the test system only had 64GB and it stuck.
So the OS will (or could) prefetch the file into memory. OK.
> Our expectation is that the second run will be faster because the data is already in memory and as everyone knows, memory is fast.
Indeed.
> We're gonna make it very obvious to the compiler that it's safe to use vector instructions which could process our integers up to 8x faster.
There are even-wider vector instructions by the way. But, you mention another page down:
> NOTE: These are 128-bit vector instructions, but I expected 256-bit. I dug deeper here and found claims that Gen1 EPYC had unoptimized 256-bit instructions. I forced the compiler to use 256-bit instructions and found it was actually slower. Looks like the compiler was smart enough to know that here.
Yup, indeed :)
Also note that AVX2 and/or AVX512 instructions are notorious for causing thermal throttling on certain (older by now?) CPUs.
> Consider how the default mmap() mechanism works, it is a background IO pipeline to transparently fetch the data from disk. When you read the empty buffer from userspace it triggers a fault, the kernel handles the fault by reading the data from the filesystem, which then queues up IO from disk. Unfortunately these legacy mechanisms just aren't set up for serious high performance IO. Note that at 610MB/s it's faster than what a disk SATA can do. On the other hand, it only managed 10% of our disk's potential. Clearly we're going to have to do something else.
In the worst case, that's true. But you can also get the kernel to prefetch the data.
See several of the flags, but if you're doing sequential reading you can use MAP_POPULATE [0] which tells the OS to start prefetching pages.
You also mention 4K page table entries. Page table entries can get to be very expensive in CPU to look up. I had that happen at a previous employer with an 800GB file; most of the CPU was walking page tables. I fixed it by using (MAP_HUGETLB | MAP_HUGE_1GB) [0] which drastically reduces the number of page tables needed to memory map huge files.
Importantly: when the OS realizes that you're accessing the same file a lot, it will just keep that file in memory cache. If you're only mapping it with PROT_READ and MAP_SHARED, then it won't even need to duplicate the physical memory to a new page: it can just re-use existing physical memory with a new process-specific page table entry. This often ends up caching the file on first access.
I had done some DNA calculations with fairly trivial 4-bit-wide data, each bit representing one of DNA basepairs (ACGT). The calculation was pure bitwise operations: or, and, shift, etc. When I reached the memory bus throughput limit, I decided I was done optimizing. The system had 1.5TB of RAM, so I'd cache the file just by reading it upon boot. Initially caching the file would take 10-15 minutes, but then the calculations would run across the whole 800GB file in about 30 seconds. There were about 2000-4000 DNA samples to calculate three or four times a day. Before all of this was optimized, the daily inputs would take close to 10-16 hours to run. By the time I was done, the server was mostly idle.
Thanks for the article. What about using file reads from a mounted ramdisk?
nchmy · 44m ago
I just saw this post so am starting with Part 1. Could you replace the charts with ones on some sort of log scale? It makes it look like nothing happened until 2010, but I'd wager it's just an optical illusion...
And, even better, put all the lines on the same chart, or at least with the same y axis scale (perhaps make them all relative to their base on the left), so that we can see the relative rate of growth?
jared_hulbert · 20m ago
I tried log scale before. The charts failed to express the exponential hockey-stick growth unless you really spend time with them and know what log scale is. I'll work on incorporating log scale due to popular demand. They do show the progress has been nice and exponential over time.
When I put the lines on the same chart it made the y axis impossible to understand. The units are so different. Maybe I'll revisit that.
Yeah, around 2000-2010 the doubling is noticeable. Interestingly, it's also when a lot of factors started to stagnate.
john-h-k · 31m ago
You mention modern server CPUs have capability to “read direct to L3, skipping memory”. Can you elaborate on this?
The PCIe bus and the memory bus both originate from the processor or IO die of the "CPU". When you use an NVMe drive you are really just sending it a bunch of structured DMA requests. Normally you are telling the drive to DMA to an address that maps to memory, but you can direct it at the cache instead and bypass sending it out on the DRAM bus. Intel calls this DDIO; AMD has something similar.
In theory, anyway... I can't vouch for the specifics of exactly what is supported.
Jap2-0 · 1h ago
Would huge pages help with the mmap case?
jared_hulbert · 52m ago
Oh man... I'd have to look into that. Off the top of my head I don't know how you'd make that happen. Way back when, I'd have said no. Now, with all the folio updates to the Linux kernel memory handling, I'm not sure. I think you'd have to take care to make sure the data gets into the page cache as huge pages. If not, then when you tried madvise() or whatever to get the buffer to use huge pages, it would likely just ignore you. In theory it could aggregate the small pages into huge pages, but that would be more latency-bound work, and it's not clear how that impacts the page cache.
But the arm64 systems with 16K or 64K native pages would have fewer faults.
inetknght · 47m ago
> I'd have to look into that. Off the top of my head I don't know how you'd make that happen.
Pass these flags to your mmap call: (MAP_HUGETLB | MAP_HUGE_1GB)
jared_hulbert · 39m ago
Would this actually create huge page page cache entries?
inetknght · 30m ago
Yes. Tens or hundreds of gigabytes' worth of 4K page table entries take a while for the OS to navigate. It's right in the documentation for mmap() [0]! And, from my experience, using it with an 800GB file provided a significant speed-up, so I do believe the documentation is correct ;)
And, you can poke around in the Linux kernel's source code to determine how it works. I had a related issue that I ended up digging around to find the answer to: if you use mremap() to expand a mapping and it fails, is the old mapping still valid? Answer: it's still valid. I found the Linux kernel's C code actually fairly easy to read, compared to a lot (!) of other C libraries I've tried to understand.
[0]: https://www.man7.org/linux/man-pages/man2/mmap.2.html