Shouldn't you also compare to mmap with the huge page option? My understanding is it's precisely meant for this circumstance. I don't think it's a fair comparison without it.
Respectfully, the title feels a little clickbaity to me. Both methods are still ultimately reading out of memory, they are just using different i/o methods.
hsn915 · 45m ago
Shouldn't this be "io_uring is faster than mmap"?
I guess that would not get much engagement though!
That said, cool write up and experiment.
jared_hulbert · 41m ago
Lol. Thanks.
titanomachy · 36m ago
Very interesting article, thanks for publishing these tests!
Is the manual loop unrolling really necessary to get vectorized machine code? I would have guessed that the highest optimization levels in LLVM would be able to figure it out from the basic code. That's a very uneducated guess, though.
Also, curious if you tried using the MAP_POPULATE option with mmap. Could that improve the bandwidth of the naive in-memory solution?
> humanity doesn't have the silicon fabs or the power plants to support this for every moron vibe coder out there making an app.
lol. I bet if someone took the time to make a high-quality well-documented fast-IO library based on your io_uring solution, it would get use.
inetknght · 49m ago
Nice write-up with good information, but not the best. Comments below.
Are you using Linux? I assume so, since you state use of mmap() and mention EPYC hardware (which rules out macOS). I suppose you could use any other *nix, though.
> We'll use a 50GB dataset for most benchmarking here, because when I started this I thought the test system only had 64GB and it stuck.
So the OS will (or could) prefetch the file into memory. OK.
> Our expectation is that the second run will be faster because the data is already in memory and as everyone knows, memory is fast.
Indeed.
> We're gonna make it very obvious to the compiler that it's safe to use vector instructions which could process our integers up to 8x faster.
There are even-wider vector instructions by the way. But, you mention another page down:
> NOTE: These are 128-bit vector instructions, but I expected 256-bit. I dug deeper here and found claims that Gen1 EPYC had unoptimized 256-bit instructions. I forced the compiler to use 256-bit instructions and found it was actually slower. Looks like the compiler was smart enough to know that here.
Yup, indeed :)
Also note that AVX2 and/or AVX512 instructions are notorious for causing thermal throttling on certain (older by now?) CPUs.
> Consider how the default mmap() mechanism works, it is a background IO pipeline to transparently fetch the data from disk. When you read the empty buffer from userspace it triggers a fault, the kernel handles the fault by reading the data from the filesystem, which then queues up IO from disk. Unfortunately these legacy mechanisms just aren't set up for serious high performance IO. Note that at 610MB/s it's faster than what a disk SATA can do. On the other hand, it only managed 10% of our disk's potential. Clearly we're going to have to do something else.
In the worst case, that's true. But you can also get the kernel to prefetch the data.
See several of the flags, but if you're doing sequential reading you can use MAP_POPULATE [0] which tells the OS to start prefetching pages.
You also mention 4K page table entries. Page table entries can get to be very expensive in CPU to look up. I had that happen at a previous employer with an 800GB file; most of the CPU was walking page tables. I fixed it by using (MAP_HUGETLB | MAP_HUGE_1GB) [0] which drastically reduces the number of page tables needed to memory map huge files.
Importantly: when the OS realizes that you're accessing the same file a lot, it will just keep that file in memory cache. If you're only mapping it with PROT_READ and MAP_SHARED, then it won't even need to duplicate the physical memory to a new page: it can just re-use existing physical memory with a new process-specific page table entry. This often ends up caching the file on first access.
I had done some DNA calculations with fairly trivial 4-bit-wide data, each bit representing one of DNA basepairs (ACGT). The calculation was pure bitwise operations: or, and, shift, etc. When I reached the memory bus throughput limit, I decided I was done optimizing. The system had 1.5TB of RAM, so I'd cache the file just by reading it upon boot. Initially caching the file would take 10-15 minutes, but then the calculations would run across the whole 800GB file in about 30 seconds. There were about 2000-4000 DNA samples to calculate three or four times a day. Before all of this was optimized, the daily inputs would take close to 10-16 hours to run. By the time I was done, the server was mostly idle.
Thanks for the article. What about using file reads from a mounted ramdisk?
nchmy · 44m ago
I just saw this post so am starting with Part 1. Could you replace the charts with ones on some sort of log scale? It makes it look like nothing happened until 2010, but I'd wager it's just an optical illusion...
And, even better, put all the lines on the same chart, or at least with the same y axis scale (perhaps make them all relative to their base on the left), so that we can see the relative rate of growth?
jared_hulbert · 20m ago
I tried log scale before. The charts failed to express the exponential hockey-stick growth unless you really spend time with them and know what log scale is. I'll work on incorporating log scale due to popular demand. They do show the progress has been nice and exponential over time.
When I put the lines on the same chart it made the y axis impossible to understand. The units are so different. Maybe I'll revisit that.
Yeah, around 2000-2010 the doubling is noticeable. Interestingly, it's also when a lot of factors started to stagnate.
john-h-k · 31m ago
You mention modern server CPUs have capability to “read direct to L3, skipping memory”. Can you elaborate on this?
The PCIe bus and the memory bus both originate from the processor or IO die of the "CPU". When you use an NVMe drive you are really just sending it a bunch of structured DMA requests. Normally you are telling the drive to DMA to an address that maps to memory, but you can direct it at the cache instead and bypass sending it out on the DRAM bus. Intel calls this DDIO; AMD has something similar.
In theory, anyway... I can't vouch for the specifics of exactly what is supported.
Jap2-0 · 1h ago
Would huge pages help with the mmap case?
jared_hulbert · 52m ago
Oh man... I'd have to look into that. Off the top of my head I don't know how you'd make that happen. Way back when, I'd have said no. Now, with all the folio updates to the Linux kernel memory handling, I'm not sure. I think you'd have to take care to make sure the data gets into the page cache as huge pages. If not, then when you tried madvise() or whatever to get the buffer to use huge pages, it would likely just ignore you. In theory it could aggregate the small pages into huge pages, but that would be more latency-bound work, and it's not clear how that impacts the page cache.
But the arm64 systems with 16K or 64K native pages would have fewer faults.
inetknght · 47m ago
> I'd have to look into that. Off the top of my head I don't know how you'd make that happen.
Pass these flags to your mmap call: (MAP_HUGETLB | MAP_HUGE_1GB)
jared_hulbert · 39m ago
Would this actually create huge page page cache entries?
inetknght · 30m ago
Yes. Tens or hundreds of gigabytes' worth of 4K page table entries take a while for the OS to navigate. It's right in the documentation for mmap() [0]! And, from my experience, using it with an 800GB file provided a significant speed-up, so I do believe the documentation is correct ;)
And, you can poke around in the Linux kernel's source code to determine how it works. I had a related issue that I ended up digging around to find the answer to: if you use mremap() to expand a mapping and it fails, is the old mapping still valid? Answer: it's still valid. I found the Linux kernel's C code actually fairly easy to read, compared to a lot (!) of other C libraries I've tried to understand.
[0]: https://www.man7.org/linux/man-pages/man2/mmap.2.html