Fun with Futex
88 points by ingve 6/3/2025, 6:37:25 AM | 21 comments | blog.fredrb.com
As a result, uncontended locks work the same as described in the blog post above; under contention, performance is similar to a futex too. But now your locks are only one byte in size, regardless of platform – while Windows allows 1-byte futexes, they're always 4 bytes on Linux, and IIRC Darwin doesn't quite have an equivalent API (but I might be wrong there). You also get more control over parked threads if you want to implement different fairness criteria, reliable timeouts, or parking callbacks.
One drawback is that this approach only easily works within a single process, while (at least on Linux) futexes can be shared between processes.
I've written a blog post[2] about using futexes to implement monitors (reëntrant mutexes with an associated condvar) in a compact way for my toy Java Virtual Machine, though I've since switched to a parking-lot-like approach.
[0]: https://github.com/amanieu/parking_lot
[1]: https://webkit.org/blog/6161/locking-in-webkit
[2]: https://specificprotagonist.net/jvm-futex.html
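To make the parking-lot approach concrete, here is a minimal sketch (mine, not from the posts above) of a one-byte lock built on the parking_lot_core crate's park/unpark_one, following the classic three-state futex-mutex design; ByteLock and its state constants are made-up names:

    use parking_lot_core::{park, unpark_one, DEFAULT_PARK_TOKEN, DEFAULT_UNPARK_TOKEN};
    use std::sync::atomic::{AtomicU8, Ordering};

    const UNLOCKED: u8 = 0;
    const LOCKED: u8 = 1;
    const CONTENDED: u8 = 2; // locked, and waiters may be parked

    pub struct ByteLock {
        state: AtomicU8, // one byte; the wait queue lives in the global hash table
    }

    impl ByteLock {
        pub const fn new() -> Self {
            Self { state: AtomicU8::new(UNLOCKED) }
        }

        pub fn lock(&self) {
            // Fast path: uncontended acquire, no syscall, no parking lot.
            if self.state
                .compare_exchange(UNLOCKED, LOCKED, Ordering::Acquire, Ordering::Relaxed)
                .is_ok()
            {
                return;
            }
            self.lock_slow();
        }

        fn lock_slow(&self) {
            loop {
                // Mark the lock contended so unlock() knows to wake someone.
                // (May cause a spurious unpark if there are no waiters.)
                if self.state.swap(CONTENDED, Ordering::Acquire) == UNLOCKED {
                    return; // the lock was free; we now own it
                }
                let key = &self.state as *const AtomicU8 as usize;
                unsafe {
                    // Sleep, unless the state changed before we got queued -
                    // the same guard a futex value check provides.
                    park(
                        key,
                        || self.state.load(Ordering::Relaxed) == CONTENDED,
                        || {},
                        |_, _| {},
                        DEFAULT_PARK_TOKEN,
                        None,
                    );
                }
            }
        }

        pub fn unlock(&self) {
            if self.state.swap(UNLOCKED, Ordering::Release) == CONTENDED {
                unsafe {
                    unpark_one(&self.state as *const AtomicU8 as usize, |_| DEFAULT_UNPARK_TOKEN);
                }
            }
        }
    }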
That's not a very useful property, though. Because inter-core memory traffic works at cache-line granularity, packing more than one lock into a cache line is a Bad Idea™. Potentially it lets you pack the lock together with the data it protects... but alignment rules mean that you're invariably going to end up spending 4 or 8 bytes (via a regular integer or a pointer) on that lock anyway.
You can avoid this with a `parking_lot::Mutex<()>` or `parking_lot::RawMutex` guarding other contents, but then you need to use `unsafe` because the borrow checker doesn't understand what you're doing.
I coincidentally was discussing this elsewhere recently: https://www.reddit.com/r/rust/comments/1ky5gva/comment/mv3kp...
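For illustration, a sketch of that RawMutex pattern, packing a one-byte lock next to the field it guards (PackedCounter is a made-up name; the unsafe blocks are precisely the part the borrow checker can't verify for you):

    use parking_lot::lock_api::RawMutex as _; // brings lock()/unlock() into scope
    use parking_lot::RawMutex;
    use std::cell::UnsafeCell;

    // 1-byte lock + 4-byte payload: both fit comfortably in one cache line.
    struct PackedCounter {
        lock: RawMutex,
        count: UnsafeCell<u32>,
    }

    // SAFETY: every access to `count` goes through `lock`.
    unsafe impl Sync for PackedCounter {}

    impl PackedCounter {
        const fn new() -> Self {
            Self { lock: RawMutex::INIT, count: UnsafeCell::new(0) }
        }

        fn increment(&self) -> u32 {
            self.lock.lock();
            // SAFETY: the lock is held, so we have exclusive access.
            let n = unsafe {
                let c = &mut *self.count.get();
                *c += 1;
                *c
            };
            // SAFETY: we hold the lock we are releasing.
            unsafe { self.lock.unlock() };
            n
        }
    }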
With 4-byte locks you run into the exact same false-sharing issues.
Double-checks. Nope. The API is there, and the patch to implement small futexes has been posted multiple times: https://lore.kernel.org/lkml/20241025093944.707639534@infrad...
But the small-futex futex2 patch will not go forward until some users say they want/need the feature.
That's the wrong way to think about FUTEX_WAIT. What it does is "put this thread to sleep, unless the value is no longer X".
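In code (a minimal Linux sketch of mine using the libc crate), those semantics look like this; the value check and the decision to sleep happen atomically in the kernel, so a wake between the check and the enqueue can't be missed:

    use std::sync::atomic::AtomicU32;

    // Sleep only if *futex still holds `expected` when the kernel checks;
    // otherwise return immediately with EAGAIN.
    fn futex_wait(futex: &AtomicU32, expected: u32) {
        unsafe {
            libc::syscall(
                libc::SYS_futex,
                futex.as_ptr(),
                libc::FUTEX_WAIT,
                expected,
                std::ptr::null::<libc::timespec>(), // no timeout
            );
        }
    }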
> If you call futex wait and the value is unchanged, the sleeping thread will not wake up!
[I assume this was meant to be FUTEX_WAKE.] I can't be bothered to check the kernel source or to test, but I would be surprised if this were true, as it could cause missed wakeups in an ABA scenario. FUTEX_WAKE must wake at least one successful FUTEX_WAIT that happens-before the wake. Futexes are best understood as edge-triggered, stateless (outside of the wait list itself) primitives, so the value at the futex location (as opposed to its address) is not really important[1], except as a guard to avoid missed wakeups.
Unfortunately the name itself (Fast Userspace Mutex) is a bit misleading, because a mutex is only one of the many things you can do with a futex. They really are a generalized waiting and signaling primitive.
[1] for plain WAIT and WAKE at least, the bitset operations or the robust futex operations are more complex and attach semantics to the value.
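To illustrate the "generalized waiting and signaling" point, a hedged sketch of a one-shot event built directly on Linux futexes via the libc crate (Event is my name, not a real API); note the value is only consulted as the guard against missed wakeups:

    use std::sync::atomic::{AtomicU32, Ordering};

    pub struct Event {
        state: AtomicU32, // 0 = not signaled, 1 = signaled
    }

    impl Event {
        pub const fn new() -> Self {
            Self { state: AtomicU32::new(0) }
        }

        pub fn wait(&self) {
            while self.state.load(Ordering::Acquire) == 0 {
                unsafe {
                    // Sleep only if the word is still 0; a set() racing with
                    // us makes this return immediately instead of sleeping.
                    libc::syscall(
                        libc::SYS_futex,
                        self.state.as_ptr(),
                        libc::FUTEX_WAIT,
                        0u32,
                        std::ptr::null::<libc::timespec>(),
                    );
                }
            }
        }

        pub fn set(&self) {
            self.state.store(1, Ordering::Release);
            unsafe {
                // Edge-triggered: wakes everyone currently parked; the kernel
                // keeps no state beyond the wait list itself.
                libc::syscall(
                    libc::SYS_futex,
                    self.state.as_ptr(),
                    libc::FUTEX_WAKE,
                    i32::MAX,
                );
            }
        }
    }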
More importantly, what could "unchanged" even mean? For FUTEX_WAIT we provide val, a value we claim is stored at the futex address, and the kernel can check that's true. But for FUTEX_WAKE, val is a count - typically 1, meaning "only wake one", or the maximum positive value, meaning "everybody" - although in principle, if you can find a reason to wake up to 7 waiters but no more, that's allowed.
Guess we'll just never know for sure, lol.
For this reason, on Darwin if you want a mutex of the sort this article describes, you'll typically want to reach for os_unfair_lock, as that will donate the priority of the waiting threads to the thread that holds the lock, thus avoiding the priority inversion issue.
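For reference, a sketch of what that looks like from Rust through the libc crate's Darwin bindings (UnfairLock is a made-up wrapper; this assumes an Apple target, where OS_UNFAIR_LOCK_INIT is the all-zero pattern):

    use std::cell::UnsafeCell;

    struct UnfairLock {
        inner: UnsafeCell<libc::os_unfair_lock>,
    }

    // SAFETY: os_unfair_lock is designed for cross-thread use.
    unsafe impl Sync for UnfairLock {}

    impl UnfairLock {
        fn new() -> Self {
            // OS_UNFAIR_LOCK_INIT is all zeroes.
            Self { inner: UnsafeCell::new(unsafe { std::mem::zeroed() }) }
        }

        fn with<R>(&self, f: impl FnOnce() -> R) -> R {
            // While we sleep in lock(), the kernel donates our priority to
            // the current owner, avoiding the inversion described above.
            unsafe { libc::os_unfair_lock_lock(self.inner.get()) };
            let r = f();
            unsafe { libc::os_unfair_lock_unlock(self.inner.get()) };
            r
        }
    }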
I have been thinking on and off about how to create a syscall-less wake operation. One way to get almost what you want is to have a polling io_uring (sketched below). That still requires one kernel thread per application that busy-polls. Maybe this is fine in some application architectures, but it's not ideal.
It would be nice if there was a way to use Intel's debug registers to write a value to some address, which would then interrupt some kernel task, allowing that kernel task to somehow figure out what futex to wake, without the interrupter having to enter the kernel.
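On the polling-io_uring idea, a minimal sketch with the io-uring crate: SQPOLL starts a kernel thread that busy-polls the submission queue, so queueing an operation (a Nop here, standing in for whatever wake op you would actually use) avoids a syscall on the fast path while the poller is awake. This is my illustration of the mechanism only, not a futex wake:

    use io_uring::{opcode, IoUring};

    fn main() -> std::io::Result<()> {
        // SQPOLL: kernel-side poller thread, idles after 2000 ms.
        // (May need elevated privileges on older kernels.)
        let mut ring = IoUring::builder().setup_sqpoll(2000).build(64)?;

        let nop = opcode::Nop::new().build().user_data(42);
        // SAFETY: the Nop entry references no buffers that could dangle.
        unsafe { ring.submission().push(&nop).expect("submission queue full") };

        // While the poller is awake this is just a flag check, no syscall;
        // it only enters the kernel to wake a poller that has gone idle.
        ring.submit()?;

        // Busy-wait for the completion (fine for a demo).
        let cqe = loop {
            if let Some(cqe) = ring.completion().next() {
                break cqe;
            }
            std::hint::spin_loop();
        };
        println!("completed user_data = {}", cqe.user_data());
        Ok(())
    }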
What you have described is literally the syscall mechanism. That's what it is. You perform some register write (via a specific instruction) and an interrupt is taken to the kernel. Maybe you believe that an asynchronous interrupt would cost less than a synchronous interrupt for this particular objective but I'm not sure there's evidence for that claim.
Apparently Intel CPUs were supposed to get user-space interrupts, which would do exactly this. I'm not sure if hardware was ever shipped with support though.
Also look into MONITOR/MWAIT.