A ChatGPT Pro subscription costs 38.6 months of income in low-income countries (policykahani.substack.com)

There's an error here: “NT instructions are used when there is an overlap between destination and source since destination may be in cache when source is loaded.”

Non-temporal instructions don't have anything to do with correctness. They are for cache management; a non-temporal write is a hint to the cache system that you don't expect to read this data (well, address) back soon, so it shouldn't push out other things in the cache. They may skip the cache entirely, or (more likely) go into just some special small subsection of it reserved for non-temporal writes only.

orlp · 35m ago

> Non-temporal instructions don't have anything to do with correctness. They are for cache management; a non-temporal write is a hint to the cache system that you don't expect to read this data (well, address) back soon

I disagree with this statement (taken at face value, I don't necessarily agree with the wording in the OP either). Non-temporal instructions are unordered with respect to normal memory operations, so without a _mm_sfence() after doing your non-temporal writes you're going to get nasty hardware UB.

Sesse__ · 9m ago

You mean if you access it from a different core? I believe that within the same core, you still have the normal ordering, but indeed, non-temporal writes don't have an implicit write fence after them like x86 stores normally do.

In any case, if so they are potentially _less_ correct; they never help you.

m0th87 · 18m ago

I had interpreted GP to mean that you don’t slap on NTs for correctness reasons, rather you do it for performance reasons.

orlp · 16m ago

That is something I can agree with, but I can't in good faith just let "it's just a hint, they don't have anything to do with correctness" stand unchallenged.

m0th87 · 33m ago

I work on optimizations like this at work, and yes this is largely correct. But do you have a source on this?

> or (more likely) go into just some special small subsection of it reserved for non-temporal writes only.

I hadn’t heard of this before. It looks like older x86 CPUs may have had a dedicated cache.

Tuna-Fish · 37s ago

IIRC they used the write-combining buffer, which was also a cache.

A common trick is to cache it but put it directly in the last or second-to-last bin in your pseudo-LRU order, so it's in cache like normal but gets evicted quickly when you need to cache a new line in the same set. Other solutions can lead to complicated situations when the user was wrong and the line gets immediately reused by normal instructions, this way it's just in cache like normal and gets promoted to least recently used if you do that.

Sesse__ · 2m ago

A source on what? The Intel optimization manuals explain what MOVNTQ is for. I don't think they explain in detail how it is implemented behind-the-scenes.

See e.g. https://cdrdv2.intel.com/v1/dl/getContent/671200 chapter 13.5.5:

“The non-temporal move instructions (MOVNTI, MOVNTQ, MOVNTDQ, MOVNTPS, and MOVNTPD) allow data to be moved from the processor’s registers directly into system memory without being also written into the L1, L2, and/or L3 caches. These instructions can be used to prevent cache pollution when operating on data that is going to be modified only once before being stored back into system memory. These instructions operate on data in the general-purpose, MMX, and XMM registers.”

I believe that non-temporal moves basically work similar to memory marked as write-combining; which is explained in 13.1.1: “Writes to the WC memory type are not cached in the typical sense of the word cached. They are retained in an internal write combining buffer (WC buffer) that is separate from the internal L1, L2, and L3 caches and the store buffer. The WC buffer is not snooped and thus does not provide data coherency. Buffering of writes to WC memory is done to allow software a small window of time to supply more modified data to the WC buffer while remaining as non-intrusive to software as possible. The buffering of writes to WC memory also causes data to be collapsed; that is, multiple writes to the same memory location will leave the last data written in the location and the other writes will be lost.”

In the old days (Pentium Pro and the likes), I think there was basically a 4- or 8-way associative cache, and non-temporal loads/stores would go to only one of the sets, so you could only waste 1/4 (or 1/8) on your cache on it at worst.

userbinator · 4h ago

It's not clear from a skim of this article, but a common problem I've seen in the past with memory copying benchmarks is to not serialise and access the copied data in its destination to ensure that it was actually completed before concluding the timing. A simple REP MOVS should be at or near the top, especially on CPUs with ERMSB.

kachapopopow · 4h ago

Yah, these benchmarks are irrelevant since the CPU executes instructions out of order. Majority of the time the cpu will continue executing assembly while a copy operation is ongoing.

viraptor · 3h ago

The full reorder buffer is still going to be only 200-500 instructions. The actual benchmark is not linked, but it would take only a hundred or so messages to largely ignore the reordering. On the other hand, when you use the library, the write needs to actually finish in the shared memory before you notify the other process. So unless the benchmark was tiny for some reason, why would this be irrelevant?

davrosthedalek · 23m ago

> Since the loop copies data pointer by pointer, it can handle the case of overlapping data.

I don't think this loop does the right thing if destination points somewhere into source. It will start overwriting the non-copied parts of source.

Arech · 3h ago

It's not clear how the author controlled for HW caching. Without this, the results are, unfortunately, meaningless, even though some good work has been gone

jesse__ · 3h ago

Would have loved to see performance comparisons along the way, instead of just the small squashed graph at the end. Nice article otherwise :)

brucehoult · 4h ago

Conclusion

Stick to `std::memcpy`. It delivers great performance while also adapting to the hardware architecture, and makes no assumptions about the memory alignment.

----

So that's five minutes I'll never get back.

I'd make an exception for RISC-V machines with "RVV" vectors, where vectorised `memcpy` hasn't yet made it into the standard library and a simple ...

    0000000000000000 <memcpy>:
       0:   86aa                    mv      a3,a0
    
    0000000000000002 <.L1^B1>:
       2:   00267757                vsetvli a4,a2,e8,m4,tu,mu
       6:   02058007                vle8.v  v0,(a1)
       a:   95ba                    add     a1,a1,a4
       c:   8e19                    sub     a2,a2,a4
       e:   02068027                vse8.v  v0,(a3)
      12:   96ba                    add     a3,a3,a4
      14:   f67d                    bnez    a2,2 <.L1^B1>
      16:   8082                    ret

... often beats `memcpy` by a factor of 2 or 3 on copies that fit into L1 cache.

https://hoult.org/d1_memcpy.txt

viraptor · 3h ago

> So that's five minutes I'll never get back.

Confirming null hypothesis, with good supporting data is still interesting. Could save you from doing this yourself.

adwn · 3h ago

> The operation of copying data is super easy to parallelize across multiple threads. […] This will make the copy super-fast especially if the CPU has a large core count.

I seriously doubt that. Unless you have a NUMA system, a single core in a desktop CPU can easily saturate the bandwidth of the system RAM controller. If you can avoid going through main memory – e.g., when copying between the L2 caches of different cores – multi-threading can speed things up. But then you need precise knowledge of your program's memory access behavior, and this is outside the scope of a general-purpose memcpy.

bob1029 · 50m ago

> a single core in a desktop CPU can easily saturate the bandwidth of the system RAM controller.

Modern x86 machines offer far more memory bandwidth than what a single core can consume. The entire architecture is designed on purpose to ensure this.

The interesting thing to note is that this has not always been the case. The 2010s is when the transition occurred.

hugh-avherald · 3h ago

I've experienced modest but significant improvements in speed using very basic pragma omp section style parallelizing of this sort of thing.

adwn · 2h ago

Do you remember any specifics? For example, the size of the copy, whether it was a NUMA system, or the total bandwidth of your system RAM?

waschl · 4h ago

Thought about zero-copy IPC recently. In order to avoid memcopy for the complete chain, I guess it would be best if the sender allocates its payload directly on the shared memory when it’s created. Is this a standard thing in such optimized IPC and which libraries offer this?

comex · 3h ago

IPC libraries often specifically avoid zero-copy for security reasons. If a malicious message sender can modify the message while the receiver is in the middle of parsing it, you have to be very careful not to enable time-of-check-time-of-use attacks. (To be fair, not all use cases need to be robust against a malicious sender.)

o11c · 3h ago

On Linux, that's exactly what `memfd` seals are for.

That said, even without seals, it's often possible to guarantee that you only read the memory once; in this case, even if the memory is technically mutating after you start, it doesn't matter since you never see any inconsistent state.

kragen · 1h ago

Thanks for the reference! I had been wondering if there was a way to do this on Linux for years. https://lwn.net/Articles/591108/ seems to be the relevant note?

duped · 2h ago

What's the threat model where a malicious message sender has write access to shared memory

kragen · 1h ago

When you are using the shared memory to communicate with an untrusted sender. Examples might include:

- browser main processes that don't trust renderer processes

- window system compositors that don't trust all windowed applications, and vice versa

- database servers that don't trust database clients, and vice versa

- message queue brokers that don't trust publishers and subscribers, and vice versa

- userspace filesystems that don't trust normal user processes

hmry · 2h ago

How would someone send a message over shared memory without write access to that memory?

IshKebab · 1h ago

I think he meant what's the scenario where you're using IPC via shared memory and don't trust both processes. Basically it only applies if the processes are running as two different users. (I think Android does that a lot?)

a_t48 · 1h ago

I've looked into this a bit - the big blocker isn't on the transport/IPC library, but the serializer itself, assuming you _also_ want to support serializing messages to disk or over network. It's a bit of a pickle - at least in C++, tying an allocator to a structure and its children is an ugly mess. And what happens if you do something like resize a string? Does it mean a whole new allocation? I've (partially) solved it before for single process IPC by having a concept of a sharable structure and its serialization type, you could do the same for shared memory. One could also use a serializer that offers promises around allocations, FlatBuffer might fit the bill. There's also https://github.com/Verdant-Robotics/cbuf but I'm not sure how well maintained it is right now, publicly.

As for allocation - it looks like Zenoh might offer the allocation pattern necessary. https://zenoh-cpp.readthedocs.io/en/1.0.0.5/shm.html TBH most of the big wins come from not copying big blocks of memory around from sensor data and the like. A thin header and reference to a block of shared memory containing an image or point cloud coming in over UDS is likely more than performant enough for most use cases. Again, big wins from not having to serialize/deserialize the sensor data.

Another pattern which I haven't really seen anywhere is handling multiple transports - at one point I had the concept of setting up one transport as an allocator (to put into shared memory or the like) - serialize once to shared memory, hand that serialized buffer to your network transport(s) or your disk writer. It's not quite zero copy but in practice most zero copy is actually at least one copy on each end.

(Sorry, this post is a little scatterbrained, hopefully some of my points come across)

dataflow · 4h ago

> I guess it would be best if the sender allocates its payload directly on the shared memory when it’s created.

On an SMP system yes. On a NUMA system it depends on your access patterns etc.

6keZbCECT2uB · 4h ago

I've been meaning to look at Iceoryx as a way to wrap this.

Pytorch multiprocessing queues work this way, but it is hard for the sender to ensure the data is already in shared memory, so it often has a copy. It is also common for buffers to not be reused, so that can end up a bottleneck, but it can, in principle, be limited by the rate of sending fds.

throwaway81523 · 4h ago

This is one of mmap's designed-for use cases. Look at DPDK maybe.

yokaze · 3h ago

Boost.Interprocess:

https://www.boost.org/doc/libs/1_46_0/doc/html/interprocess/...

dataflow · 4h ago

I thought this was going to be about https://github.com/Blosc/c-blosc

Orangeair · 3h ago

[2020]

wolfi1 · 3h ago

the "dumb of perf": some Freudian Slip?

_ZeD_ · 3h ago

soo... time to send a patch to glibc?

bawolff · 2h ago

Given their conclusion that glibc was the best option for most use cases, i would say no.

Google paid a $250K reward for a bug (issues.chromium.org)

Vanishing from Hyundai’s data network (techno-fandom.org)

A ChatGPT Pro subscription costs 38.6 months of income in low-income countries (policykahani.substack.com)

Generic Containers in C: Safe Division Using Maybe (uecker.codeberg.page)

Basic Social Skills Guide (improveyoursocialskills.com)

Going faster than memcpy (squadrick.dev)

Dropbox announces new gen server hardware for higher efficiency and scalability (dropbox.tech)

Millau Viaduct (fosterandpartners.com)

Raised by Wolves Is Original Sci-Fi at Its Most Polarizing (2020) (rogerebert.com)

Try and (ygdp.yale.edu)

Compiling a Lisp: Lambda lifting (bernsteinbear.com)

GPT-OSS vs. Qwen3 and a detailed look how things evolved since GPT-2 (magazine.sebastianraschka.com)

Graham: Synchronizing Clocks by Leveraging Local Clock Properties (2022) [pdf] (usenix.org)

Show HN: Bolt – A super-fast, statically-typed scripting language written in C (github.com)

Lists and Lists: Basics of Lisp through interactive fiction (1996) (eblong.com)

Fight Chat Control (fightchatcontrol.eu)

Show HN: Engineering.fyi – Search across tech engineering blogs in one place (engineering.fyi)

Show HN: A Sinclair ZX81 retro web assembler+simulator

One Million Screenshots (onemillionscreenshots.com)

Creating the Longest Possible Ski Jump in “The Games: Winter Challenge” (mrwint.github.io)

The enduring puzzle of static electricity (pubs.aip.org)

1910: The year the modern world lost its mind (derekthompson.org)

Diffusion language models are super data learners (jinjieni.notion.site)

PHP compile time generics: yay or nay? (thephp.foundation)

Bouncing on trampolines to run eBPF programs (bootlin.com)

Booting 5000 Erlangs on Ampere One 192-core (underjord.io)

Writing simple tab-completions for Bash and Zsh (mill-build.org)

How I code with AI on a budget/free (wuu73.org)

TCP Client Self-Connect (2013) (sgros.blogspot.com)

Abogen – Generate audiobooks from EPUBs, PDFs and text (github.com)

Reflections on Soviet Amateur Photography (publicbooks.org)

Digital Foundry leaves IGN, now independent [video] (youtube.com)

Show HN: QuickShelf – Stop opening Finder just to drag files (quickshelf-app.slowlab.dev)

Squashing my dumb bugs and why I log build IDs (rachelbythebay.com)

Events (developer.mozilla.org)

Open Lovable (github.com)

Abusing Entra OAuth for fun and access to internal Microsoft applications (research.eye.security)

The Framework Desktop is a beast (world.hey.com)

Type (YC W23) is hiring a founding engineer to build an AI-native doc editor (ycombinator.com)

Easily run Windows software on Linux with Bottles (usebottles.com)

How Does a Blind Model See the Earth? (outsidetext.substack.com)

ECScape: Understanding IAM Privilege Boundaries in Amazon ECS (sweet.security)

Flock Now Using AI to Report to Police If Our Movement Patterns Are "Suspicious" (aclu.org)

Anti-competitive practices masquerading as security is a dangerous pattern (blog.alinelerner.com)

Curious about the training data of OpenAI's new GPT-OSS models? I was too (twitter.com)

The current state of LLM-driven development (blog.tolki.dev)

Conversations remotely detected from cell phone vibrations, researchers report (psu.edu)

Flintlock – Create and manage the lifecycle of MicroVMs, backed by containerd (github.com)

A CT scanner reveals surprises inside the 386 processor's ceramic package (righto.com)

How Potatoes Evolved (nhm.ac.uk)

Going faster than memcpy

Comments (38)