"The requeue-once rule is enforced by only allowing requeueing to the futex previously passed to futex_wait_requeue_pi as uaddr2, so it's not possible to requeue from A to B, then from B to C - but it is possible to requeue from B to B.
When this happens, if (!q.rt_waiter) passes, so rt_mutex_finish_proxy_lock is never called. (Also, AFAIK, free_pi_state is never called, which is true even without this weird requeue; in the case where futex_requeue calls requeue_pi_wake_futex directly, pi_state will sit around until it gets cleaned up in exit_pi_state_list when the thread exits. This is not a vulnerability.) futex_wait_requeue_pi exits, and various pointers to rt_waiter become dangling."
mmastrac · 2h ago
I think the coolest part of the futex is that it's a handle-less concept. There's no allocation or deallocation via syscall, just a kernel-based memory watcher that turns out to be incredibly useful as a primitive.
Everything goes cleanly away when there are no more waiters, and the kernel never even sees a mutex where there's no contention.
I would be interested in a technical deep dive of how the kernel manages these in a performant way, however.
EDIT: TIL about futex2 as well: https://docs.kernel.org/userspace-api/futex2.html
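Not a full deep dive, but for concreteness, the primitive boils down to two operations. A minimal sketch on Linux, assuming glibc's syscall(2) wrapper (error handling omitted; the _PRIVATE flags assume the futex word is not shared across processes):

    #include <linux/futex.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <atomic>
    #include <cstdint>

    // Sleep only while *addr still holds `expected`: the kernel re-checks
    // the word under its own locking, so there is no lost-wakeup race.
    static void futex_wait(std::atomic<uint32_t>* addr, uint32_t expected) {
        syscall(SYS_futex, addr, FUTEX_WAIT_PRIVATE, expected, nullptr, nullptr, 0);
    }

    // Wake up to `n` threads currently sleeping on this address.
    static void futex_wake(std::atomic<uint32_t>* addr, int n) {
        syscall(SYS_futex, addr, FUTEX_WAKE_PRIVATE, n, nullptr, nullptr, 0);
    }

The kernel keys its wait queues on the (hashed) address, which is what makes the whole thing handle-less.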
Exactly! At the same time, you also don't want to call into the kernel's internal malloc() whenever a thread ends up blocking on a lock, just to allocate the data structures needed to keep track of that lock's queue of blocked threads.
To prevent that, many operating systems allocate these 'queue objects' whenever threads are created and will attach a pointer to it from the thread object. Whenever a thread then stumbles upon a contended lock, it will effectively 'donate' this queue object to that lock, meaning that every lock having one or more waiters will have a linked list of 'queue objects' attached to it. When threads are woken up, they will each take one of those objects with them on the way out. But there's no guarantee that they will get their own queue object back; they may get shuffled! So by the time a thread terminates, it will free one of those objects, but that may not necessarily be the one it created.
I think the first operating system to use this method was Solaris. There they called these 'queue objects' turnstiles. The BSDs adopted the same approach, and kept the same name.
https://www.oreilly.com/library/view/solaristm-internals-cor...
https://www.bsdcan.org/2012/schedule/attachments/195_locking...
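A hypothetical sketch of that bookkeeping (names invented; real Solaris/BSD turnstiles run under kernel-internal locks and also handle priority inheritance):

    #include <deque>
    #include <vector>

    struct Thread;  // stands in for the real kernel thread object

    struct Turnstile {
        std::deque<Thread*> waiters;     // threads blocked on the owning lock
        std::vector<Turnstile*> spares;  // turnstiles donated by later waiters
    };

    struct Lock {
        Turnstile* ts = nullptr;  // non-null exactly while the lock has waiters
    };

    // Thread `t` blocks on `l`, donating the turnstile it was given at creation.
    void donate_and_block(Lock& l, Thread* t, Turnstile* mine) {
        if (!l.ts) l.ts = mine;             // first waiter installs its turnstile
        else l.ts->spares.push_back(mine);  // later waiters stack theirs as spares
        l.ts->waiters.push_back(t);
        // ... actually put `t` to sleep here ...
    }

    // On unlock: the front waiter runs, and leaves with *some* turnstile,
    // not necessarily the one it donated.
    Turnstile* wake_one(Lock& l) {
        l.ts->waiters.pop_front();
        if (!l.ts->spares.empty()) {        // waiters remain: hand out a spare
            Turnstile* spare = l.ts->spares.back();
            l.ts->spares.pop_back();
            return spare;
        }
        Turnstile* last = l.ts;             // last waiter reclaims the installed one
        l.ts = nullptr;
        return last;
    }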
> Many people won’t worry about crashed threads, as they often will crash the whole program. However, you can catch the signal a crash generates and keep the overall process from terminating.
That doesn't help if the entire process dies for any reason and you want to clean up the locks. The solution to that is called "robust" locks: you can register a list of held futexes with the kernel using sys_set_robust_list, and when the thread dies the kernel sets a specific bit in each entry's lock word and wakes a waiter if there is one.
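A sketch of the registration side, assuming Linux (there's no glibc wrapper for the syscall, and glibc already registers one robust list per thread to support PTHREAD_MUTEX_ROBUST, so this is illustrative rather than something to do by hand):

    #include <linux/futex.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    // One list head per thread; entries are links embedded next to each
    // held lock's futex word, at `futex_offset` bytes from the link.
    static robust_list_head head = { { &head.list }, 0, nullptr };

    int register_robust_list() {
        return syscall(SYS_set_robust_list, &head, sizeof(head));
    }
    // On thread death the kernel walks head.list, ORs FUTEX_OWNER_DIED
    // (0x40000000) into each lock word whose owner TID matches the dying
    // thread, and wakes one waiter on it if there is one.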
Joker_vD · 59m ago
> And in practice, behavior across common implementations [of recursive locks] is not remotely consistent. There’s a good reason why this was left undefined – it’s kind of hard.
This is such a frustrating stance that most standards have, honestly. "Well, obviously we can't expect the OS/language implementers to be able to reliably implement feature X ― let's just leave it to the application programmers to deal with; they are, after all, expected to have great skill sets and could easily work around it". Or rather, "well, we can't force feature X on the people who will actually implement the standard (they are the members of this very committee, after all), but we can trivially force the downstream users to cope with the feature's absence because, seriously, what can those losers do? Switch vendors?".
ori_b · 48m ago
If you overspecify, you close the door to better implementations. This is why, for example, C++ standard hash tables and regexes are an order of magnitude slower than third party ones.
The standard didn't say "you must implement std::unordered_map as a hash table with chained buckets and extra memory allocations", but it specified several guarantees that make it very difficult to implement hash tables with open addressing.
Every constraint that you specify potentially locks out a better implementation.
For recursive rwlocks, there's a lot of ways to implement them. Do you want to lock out high performance implementations that do less error checking, for example?
Joker_vD · 19m ago
And if you underspecify, you force the users to invent their own hacks and workarounds, often very poorly.
And no, I don't want "high performance" lock implementations that regularly deadlock the whole process (deadlocked processes are not very performant) unless I wrap every single one of the several hundred uses of them in a test with dubiously-predictable branches, or, worse, that just completely break the lock invariants (e.g., the lock is now permanently unlocked, even after a lock() on it succeeds) ― it really is not that important how fast you can do the wrong thing.
dragontamer · 35m ago
To clarify, I believe the issue is C++ unordered map iterators and when / where they are allowed to go invalid.
Open addressing means that the address of map[thing] can change on insert, which means iterators and pointers into the map can go stale on insert.
The C++11 standard guarantees this won't happen for unordered_map, but that forces slower implementations.
And now people rely upon the standard so we can't change it. At best we do fast_unordered_map or unordered_map2 with different guarantees.
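The guarantee in question, demonstrated (C++11 requires that insertion never invalidates pointers or references to existing elements, even though rehashing invalidates iterators):

    #include <cassert>
    #include <string>
    #include <unordered_map>

    int main() {
        std::unordered_map<int, std::string> m{{1, "one"}};
        std::string* p = &m[1];        // pointer to an element's mapped value
        for (int i = 2; i < 10000; ++i)
            m.emplace(i, "x");         // forces several rehashes
        assert(p == &m[1]);            // still valid: rehashing moves buckets
                                       // and iterators, never the nodes
    }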
I hate it, but it's true
Let's be clear here: the book sets itself up as a way to gain understanding of multiprocessor programming, in a way that promotes skilled reasoning that is applicable across many subdomains. In many places it points out that you should avoid implementing constructs yourself, but instead use a library/language/system-provided construct. It specifically calls this out for mutexes.
The book is quite clearly about concurrency in general, and not for a specific platform. The author of this article has set up a straw man to facilitate the writing and marketing of an otherwise moderately interesting article on futexes.
Personally I find the approach taken by this article more than a little distasteful - presenting from a point of exaggerated conflict is both tiresome and likely to confuse. This article could easily have been written from the perspective "what TAoMP doesn't tell you" and in that vein be taken a lot more collaboratively.
Of course it doesn't escape me that this blog is new, this article was posted by Phil, and Phil has promoted one of their other articles before.
viega · 1h ago
I wrote the article; it was motivated by reading the book, which by my estimation is not well aimed at either academics or practitioners. That's a problem across a big chunk of academia right now, and I hear it not just from industry, which would like people coming out of college better prepared, but from master's students who realize they're not learning what they need to be good.
So in no way was it meant to be a strawman around a "hey, learn about the futex!" post (as evidenced by the complaints at the end about other things lacking). The fact is, I was disappointed enough with the book that I put aside another post I was writing for it.
But as for Phil, we did work together several years ago, and he reads my stuff. I didn't just start writing, and have never had problems finding an audience in the past, Phil or not.
mwcampbell · 1h ago
As the article mentions, Windows introduced a futex-like thing in Windows 8. I know that the original Win32 critical section is based on a kernel-level semaphore. What about the SRW lock introduced in Vista?
garaetjjte · 31m ago
Neither CRITICAL_SECTION nor SRWLock enters the kernel when uncontended.
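For the futex-like thing itself: that's the WaitOnAddress family from synchapi.h (Windows 8+, link with Synchronization.lib). A minimal sketch, wrapper names invented:

    #include <windows.h>

    // Block while *addr still equals `expected` (cf. FUTEX_WAIT)...
    void wait_if_equal(volatile LONG* addr, LONG expected) {
        WaitOnAddress(addr, &expected, sizeof(LONG), INFINITE);
    }
    // ...and wake one thread sleeping on that address (cf. FUTEX_WAKE).
    void wake_one(volatile LONG* addr) {
        WakeByAddressSingle((PVOID)addr);
    }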
avodonosov · 8m ago
Can anyone suggest a good explanation of memory barriers?
smallstepforman · 28m ago
I still haven’t seen a good comparison between Futex and Benaphore. Benaphores I understand; they predate futexes by almost a decade. But what do futexes add to the equation, since hardly anyone talks about benaphores (or is it a case of not-invented-here)?
the_mitsuhiko · 16m ago
Other than avoiding a syscall on an uncontended path, they are not really similar. A benaphore is just a semaphore with an extra atomic counter in userspace to count waiters.
You can’t really use semaphores to implement things that aren’t mutexes or semaphores, so the overall utility is limited compared to futexes, which you can use for condvars and other primitives too.
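For reference, the classic benaphore pattern looks something like this (a sketch with invented names, using a POSIX semaphore for the blocking half; error handling omitted):

    #include <semaphore.h>
    #include <atomic>

    struct Benaphore {
        std::atomic<int> count{0};  // threads holding or waiting for the lock
        sem_t sem;                  // kernel semaphore, initially 0

        Benaphore() { sem_init(&sem, 0, 0); }
        ~Benaphore() { sem_destroy(&sem); }

        void lock() {
            // Old value 0 means it was free: acquired with no syscall at all.
            if (count.fetch_add(1, std::memory_order_acquire) > 0)
                sem_wait(&sem);  // contended: block in the kernel
        }
        void unlock() {
            // Old value > 1 means someone is (or is about to be) blocked.
            if (count.fetch_sub(1, std::memory_order_release) > 1)
                sem_post(&sem);
        }
    };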
reinhardt1053 · 15m ago
So what's recommended as a better alternative to The Art of Multiprocessor Programming?
https://man7.org/tlpi/
https://marabos.nl/atomics/
It's not that deep. The futex was developed just to save you from issuing a special system call to ask the OS to put you on a wait queue.
The whole point is that implementing a mutex requires doing things that only the privileged OS kernel can do (e.g. efficiently blocking/unblocking processes). Therefore, for systems like Linux, it made sense to combine the features for a fast implementation.
viega · 2h ago
Also, I should say, in user-land you can efficiently enough save thread state, go off and do something else with that thread, then come back to it, never hitting the kernel while something blocks. That's pretty much async in a nutshell (or green threads).
The point of the article anyway is that it's inexcusable to have a modern concurrency textbook and not cover the futex, since it's at the core of any efficient primitive on modern hardware.
koverstreet · 2h ago
The problem with green threads, historically, was that there was no way to do arbitrary syscalls async; if your syscall blocks it blocks all your other green threads. Doh.
io_uring is supposed to be about solving this, but it's quite the kitchen sink so I have no idea how complete it is on the "arbitrary syscall async" front.
viega · 1h ago
Yes, it's gotten quite large, but I think with far fewer wrong turns in the API compared to the futex. Enough was available async via `epoll()` plus fd interfaces to things that I was never as worried about the arbitrary latency of syscalls, but it's still incredibly cool, especially in the number of calls it avoids outright.
johncolanduoni · 1h ago
`epoll` doesn’t actually do any IO though, so it doesn’t help with syscall latency. It just avoids the overhead of doing IO via a large number of threads (memory overhead, hard context switches, etc.).
viega · 1h ago
No it doesn't, which is one key reason why I am a fan of `io_uring`. I brought `epoll` up because it does help with the blocking though, for most of the things that matter when it comes to async (at a cost to latency, of course).
viega · 2h ago
You actually issue the `futex` system call to get yourself on the wait queue tied to the memory address. It separates out the waiting from the locking.
And that can absolutely save a bunch of system calls, especially vs. polling mixed with `sleep()` or similar.
ajross · 1h ago
> It separates out the waiting from the locking.
It does not; in fact, the two are fundamentally inseparable, and the state of the memory address must be treated atomically with the waiting state. The magic of futex is that you can use a hardware atomic operation (cf. lock cmpxchg on x86) to get the lock in the common/uncontended case, but if you have to wait you need to tell the kernel both that you need to wait and the address on which you're waiting, so it can use the same hardware interlocks along with its own state locking to put you to sleep race-free.
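Concretely, that's the shape of the classic three-state futex mutex (a simplified sketch after Drepper's "Futexes Are Tricky"; error handling omitted):

    #include <linux/futex.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <atomic>
    #include <cstdint>

    // 0 = unlocked, 1 = locked, 2 = locked with (possible) waiters.
    struct Mutex {
        std::atomic<uint32_t> state{0};

        void lock() {
            uint32_t expected = 0;
            // Fast path: one cmpxchg in userspace, no syscall, when free.
            if (state.compare_exchange_strong(expected, 1, std::memory_order_acquire))
                return;
            // Slow path: advertise a waiter, then ask the kernel to sleep us.
            // The kernel only sleeps us if the word is still 2 when it re-checks.
            while (state.exchange(2, std::memory_order_acquire) != 0)
                syscall(SYS_futex, &state, FUTEX_WAIT_PRIVATE, 2, nullptr, nullptr, 0);
        }

        void unlock() {
            // Enter the kernel only when someone may actually be asleep.
            if (state.exchange(0, std::memory_order_release) == 2)
                syscall(SYS_futex, &state, FUTEX_WAKE_PRIVATE, 1, nullptr, nullptr, 0);
        }
    };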
viega · 1h ago
It quite does; the kernel is not the keeper of the lock, it only needs to detect the race condition that would result in a spurious sleep. It cares not one bit about the rest of your semantics.
It's true you could use it that way, but it's not the way it's meant to be used, defeating the purpose by requiring a system call even for uncontended locks.
ajross · 1h ago
Why is this gray!? This is absolutely correct. Futex was added as an ad hoc solution to the obvious needs of SMP processes communicating via atomic memory operations who still wanted blocking IPC. And it had to be updated and reworked repeatedly as it moved out of the original application (locks and semaphores) into stuff like condition variables and priority inheritance where it didn't work nearly as well.
In point of fact futex is really not a particularly simple syscall and has a lot of traps, see the man page. But the core idea is indeed "not that deep".
viega · 1h ago
As the article says, the futex system call is overly complicated. But one shouldn't downplay its importance. Every major OS has had a slimmed down equivalent for about a decade, and the futex is at the core of any good modern lock.
Many things are obvious after, but there was plenty of time before for other people to do the same thing, it's not like we didn't know sysv semaphores didn't scale well.
"ad hoc" feels like an opinion here. My opinion is that when separation of concerns leads to big gains like the futex did, that's elegant, and an achievement. No need to diminish the good work done!
bicolao · 1h ago
If this is an ad hoc solution, what's the "right" approach?
wizerno · 1h ago
Another great read on futexes is Ulrich Drepper’s paper "Futexes Are Tricky" [1].
[1] https://cis.temple.edu/~giorgio/cis307/readings/futex.pdf
Is it just me, or does moving things out of kernel space improve performance in general? Like context switching, mutexes, or the entire TCP stack. I wonder what else can be moved into user space.
viega · 1h ago
It depends on the relative cost of what you're doing of course, but in general, yes, the cost of entering the kernel can be impactful, if you are making a lot of calls into it, since a lot of work has to happen every time.
There are definitely user-land TCP/IP implementations, thread implementations (Go's being particularly notable) and even file systems (FUSE).
bicolao · 57m ago
FUSE probably isn't a good example here because you still have to enter kernel space, if I'm not mistaken, then come back out to the fs driver in userspace, then probably go back into kernel space (block driver). FUSE has many upsides, but I don't think performance is one of them.
viega · 48m ago
Well, you still have to enter the kernel to actually queue an OS-level thread w/ a futex. The kernel supporting being able to move stuff to userland sure doesn't guarantee better performance-- the main opportunity from that perspective is minimizing how often you cross the boundary.
You're 100% right that there are plenty of other considerations, often positive for lifting things out, like minimization of ring 0 attack surface.
https://en.wikipedia.org/wiki/Microkernel
whstl · 1h ago
You're right. With the caveat that it's the moving in and out of the kernel that's the expensive part, so putting things inside it also helps with performance... at the expense of security and reliability, of course, so userspace it is!
There's been a nice stream of improvements to futex2 since.
NUMA support (finally landing!), https://www.phoronix.com/news/FUTEX2-NUMA-Small-Futex https://www.phoronix.com/news/FUTEX2-Improvements-Linux-6.16 (see also this fantastic recent submission on NUMA in general, absolutely critical performance stuff, https://news.ycombinator.com/item?id=44936575)
io_uring support in 6.7 (2024) (with a nice write-up on it speeding up PostgreSQL AIO), https://www.phoronix.com/news/IO_uring-FUTEX-Linux-6.7
Small requeue and single wait additions in 6.7, https://www.phoronix.com/news/Linux-6.7-Locking-FUTEX2
The linux equivalent of WFMO is select/poll/epoll.
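Roughly, the WFMO-ish "block until any of these is ready" call looks like this with epoll (a sketch, error handling omitted; note epoll waits on file descriptors, not arbitrary handles like WaitForMultipleObjects does):

    #include <sys/epoll.h>
    #include <unistd.h>

    // Wait for readability on any of fds[0..n); returns the ready fd.
    int wait_for_any(const int* fds, int n) {
        int ep = epoll_create1(0);
        for (int i = 0; i < n; ++i) {
            epoll_event ev{};
            ev.events = EPOLLIN;
            ev.data.fd = fds[i];
            epoll_ctl(ep, EPOLL_CTL_ADD, fds[i], &ev);
        }
        epoll_event out{};
        epoll_wait(ep, &out, 1, -1);  // block until one fd becomes ready
        close(ep);
        return out.data.fd;
    }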
"The requeue-once rule is enforced by only allowing requeueing to the futex previously passed to futex_wait_requeue_pi as uaddr2, so it's not possible to requeue from A to B, then from B to C - but it is possible to requeue from B to B.
When this happens, if (!q.rt_waiter) passes, so rt_mutex_finish_proxy_lock is never called. (Also, AFAIK, free_pi_state is never called, which is true even without this weird requeue; in the case where futex_requeue calls requeue_pi_wake_futex directly, pi_state will sit around until it gets cleaned up in exit_pi_state_list when the thread exits. This is not a vulnerability.) futex_wait_requeue_pi exits, and various pointers to rt_waiter become dangling. "
Everything goes cleanly away when there are no more waiters, and the kernel never even sees a mutex where there's no contention.
I would be interested in a technical deep dive of how the kernel manages these in a performant way, however.
EDIT: TIL about futex2 as well: https://docs.kernel.org/userspace-api/futex2.html
To prevent that, many operating systems allocate these 'queue objects' whenever threads are created and will attach a pointer to it from the thread object. Whenever a thread then stumbles upon a contended lock, it will effectively 'donate' this queue object to that lock, meaning that every lock having one or more waiters will have a linked list of 'queue objects' attached to it. When threads are woken up, they will each take one of those objects with them on the way out. But there's no guarantee that they will get their own queue object back; they may get shuffled! So by the time a thread terminates, it will free one of those objects, but that may not necessarily be the one it created.
I think the first operating system to use this method was Solaris. There they called these 'queue objects' turnstiles. The BSDs adopted the same approach, and kept the same name.
https://www.oreilly.com/library/view/solaristm-internals-cor...
https://www.bsdcan.org/2012/schedule/attachments/195_locking...
That doesn't help if the entire process dies for any reason and you want to clean up the locks. Solution to that is called "robust" locks. You can register list of held futexes with the kernel using sys_set_robust_list, and when the thread dies kernel for each entry will set a specific bit and wake waiter if there's one.
This is such a frustrating stance that most standards have, honestly. "Well, obviously we can't expect the OS/language implementers to be able to reliably implement feature X ― let's just leave it to the application programmer to deal with; they are, after all, are expected to have great skill sets and could easily work around it". Or rather, "well, we can't force feature X on the people who will actually implement the standard (they are the members of this very committee, after all), but we can't trivially force the downstream users to cope with the feature's absence because seriously, what can those losers do? Switch the vendors?".
The standard didn't say "you must implement std::unordered_map as a hash table with chained buckets and extra memory allocations", but ithe standard specified several guarantees that make it very difficult to implement hash tables with open addressing.
Every constraint that you specify potentially locks out a better implementation.
For recursive rwlocks, there's a lot of ways to implement them. Do you want to lock out high performance implementations that do less error checking, for example?
And no, I don't want to "high performance" lock implementations that regularly completely deadlock the whole process (deadlocked processes are not very performant) unless wrap every single one of the several hundred uses of them in a test with dubiously-predictable branches, or worse, just completely break the lock invariants (e.g., the lock is now permanently unlocked, even after a lock() on it succeeds) ― it really is not that important how fast you can do the wrong thing.
OpenAddressing means that an address of map[thing] could change on insert. Which means iterators and pointer invalidation concepts can go stale on insert.
C++11 standard for unordered_map guarantees this won't happen. But that forces slower implementations.
And now people rely upon the standard so we can't change it. At best we do fast_unordered_map or unordered_map2 with different guarantees.
I hate it, but it's true
The book is quite clearly about concurrency in general, and not for a specific platform. The author of this article has set up a straw man to facilitate the writing and marketing of an otherwise moderately interesting article on futexes.
Personally I find the approach taken by this article more than a little distasteful - presenting from a point of exaggerated conflict is both tiresome and likely to confuse. This article could easily have been written from the perspective "what TAoMP doesn't tell you" and in that vein be taken a lot more collaboratively.
Of course it doesn't escape me that this blog is new, this article was posted by Phil, and Phil has promoted one of their other articles before.
So in no way was it meant to be a strawman around a "hey, learn about the futex!" post (as evidenced by other complaints at the end of things lacking). The fact is, I was disappointed enough with the book, that I put aside another post I was writing for it.
But as for Phil, we did work together several years ago, and he reads my stuff. I didn't just start writing, and have never had problems finding an audience in the past, Phil or not.
You can’t really use semaphores to implement things that can’t mutexes or semaphores so the overall utility is limited compare to futexes that you can use for condvars and other primitives too.
https://man7.org/tlpi/
https://marabos.nl/atomics/
The whole point is that implementing a mutex requires doing things that only the privileged OS kernel can do (e.g. efficiently blocking/unblocking processes). Therefore, for systems like Linux, it made sense to combine the features for a fast implementation.
The point of the article anyway is that it's inexcusable to have a modern concurrency textbook and not cover the futex, since it's at the core of any efficient primitive on modern hardware.
io_uring is supposed to be about solving this, but it's quite the kitchen sink so I have no idea how complete it is on the "arbitrary syscall async" front.
And that can absolutely save a bunch of system calls, especially vs. polling mixed with `sleep()` or similar.
It does not, in fact the two are fundamentally inseparable and the state of the memory address must be treated atomically with the waiting state. The magic of futex is that you can use a hardware atomic operation (c.f. lock cmpxchg on x86) to get the lock in the common/uncontended case, but if you have to wait you need to tell the kernel both that you need to wait and the address on which you're waiting, so it can use the same hardware interlocks along with its own state locking to put you to sleep race-free.
It's true you could use it that way, but it's not the way it's meant to be used, defeating the purpose by requiring a system call even for uncontended locks.
In point of fact futex is really not a particularly simple syscall and has a lot of traps, see the man page. But the core idea is indeed "not that deep".
Many things are obvious after, but there was plenty of time before for other people to do the same thing, it's not like we didn't know sysv semaphores didn't scale well.
"ad hoc" feels like an opinion here. My opinion is that when separation of concerns leads to big gains like the futex did, that's elegant, and an achievement. No need to diminish the good work done!
[1] https://cis.temple.edu/~giorgio/cis307/readings/futex.pdf
There are definitely user-land TCP/IP implementations, thread implementations (Go's being particularly notable) and even file systems (FUSE).
You're 100% right that there are plenty of other considerations, often positive for lifting things out, like minimization of ring 0 attack surface.
https://en.wikipedia.org/wiki/Microkernel