Without the futex, it's futile
h4x0r.org | 297 points by eatonphil 2 days ago
Windows gained a WaitForMultipleObjects, which Linux 5.16 (late 2021) aped with the new futex2 interface (futex_waitv). https://www.phoronix.com/news/Linux-5.16-sys_futex_waitv
There's been a nice stream of improvements to futex2 since.
NUMA support (finally landing!), https://www.phoronix.com/news/FUTEX2-NUMA-Small-Futex https://www.phoronix.com/news/FUTEX2-Improvements-Linux-6.16 (see also this fantastic recent submission on NUMA in general, absolutely critical performance stuff, https://news.ycombinator.com/item?id=44936575)
io_uring support in 6.7 (2024) (with a nice write-up on it speeding up PostgreSQL AIO), https://www.phoronix.com/news/IO_uring-FUTEX-Linux-6.7
Small requeue and single wait additions in 6.7, https://www.phoronix.com/news/Linux-6.7-Locking-FUTEX2
Windows did not gain a WaitForMultipleObjects, it had it since the first Windows NT, more than 30 years ago.
While WaitForMultipleObjects was an advantage of Windows NT over UNIX, it was nothing new. IBM PL/I had an equivalent function already in 1965, almost 30 years before Windows NT.
The "wait" function of IBM PL/I was actually the model for the UNIX "wait", but the UNIX function was extremely simplified and much weaker than its model, as was the case with many of the features UNIX inherited from Multics. Unfortunately, many decades had to pass before the descendants of UNIX began to gain features comparable in sophistication to those of UNIX's ancestors.
However the Microsoft WaitForSingleObject and WaitForMultipleObjects did not have an efficient implementation, which is why they had to add WaitOnAddress, the equivalent of Linux futex.
It is true however that the Linux futex had and still has some annoying limitations, like the size of only 32 bits of the futex value, instead of 64 bits, and the fact that it is possible to wait only on a single event. Using atomic bit operations on the futex value it is actually possible to wait on multiple events, though not in the most efficient way. However here is where the 32-bit size of the futex value becomes annoying.
Therefore the work that attempts to combine the advantages of "futex" with some of the advantages of WaitForMultipleObjects is very welcome.
However this does not ape Windows; it just reimplements techniques that are much older than the Microsoft company, which were well known more than half a century ago.
>However the Microsoft WaitForSingleObject and WaitForMultipleObjects did not have an efficient implementation, which is why they had to add WaitOnAddress
WaitForSingle/MultipleObjects wait for kernel objects, similar to poll. WaitOnAddress is lightweight synchronization, equivalent to futex. Windows doesn't have something like WaitForMultipleAddresses. futex_waitv was used by Wine patches because they implement NT kernel objects in userspace, and there were some semantic differences that made it hard to represent them as eventfds.
PS: But using futexes to emulate kernel objects breaks security boundaries of a process. That's why it was never merged into upstream Wine, and NTSYNC was developed.
That's very interesting history, thanks.
I agree: Linux still supporting only 32-bit futexes is a bit baffling. The only reason the width matters is the race-condition check, but that's a huge reason. I'd want the option to wait on values as wide as whatever the underlying hardware supports, at least!
The futex width matters if you implement your own waiting on multiple events, with one bit per event. The events can be signaled with atomic bit operations.
Ideally the value should be two words -- 16 bytes on 64-bit systems.
Emm, what? Why? If you mean two processor words, which I gather from what you are saying, then I think you are already in the space of full memory barriers.
In that case, why not just say, ideally it would be 256K words, or whatever?
Because mainstream modern architectures (practically speaking, x86-64-v2+ and ARMv8+) give you[1] a two-word compare-and-swap or LL/SC.
However using compare-and-swap as the atomic operation for implementing multiple events can be very inefficient, because it introduces waiting loops where the threads can waste a lot of time when there is high contention.
The signaling of multiple events is implemented efficiently with atomic bit set and bit clear operations, which are supported since Intel 80386 in 1985, so they are available in all Intel/AMD CPUs, and they are also available in 64-bit Arm CPUs starting with Armv8.2-A, since Cortex-A55 & Cortex-A75, in 2017-2018 (in theory the atomic bit operations were added in Armv8.1-A, but there were no consumer CPUs with that ISA).
With atomic bit operations, each thread signals its event independently of the others and there are no waiting loops. The monitoring thread can determine very fast the event with the highest priority using the LZCNT (count leading zeroes) or equivalent instructions, which is also available on all modern Arm-based or Intel/AMD CPUs.
When a futex is used to implement waiting for multiple events, despite the lack of proper kernel support, the thread that sets its bit to signal an event must also execute a FUTEX_WAKE, so that the monitoring thread will examine the futex value. Because the atomic bit operations are fetch-and-OP operations (like fetch-and-add or atomic exchange, they return the previous value), the signaling thread can see whether the bit from its previously signaled event was still set, i.e. whether the monitoring thread has handled it yet, and act accordingly.
So currently on Linux you are limited to waiting for up to 32 events. The number of events can be extended by using a multi-level bitmap, but then the overhead increases significantly. Using a 64-bit futex value would have been much better.
In theory compare-and-swap or the equivalent instruction pair load-exclusive/store-conditional are more universal, but in practice they should be avoided whenever high contention is expected. The high performance algorithms for accessing shared resources are all based on using only fetch-and-add, atomic exchange, atomic bit operations and load-acquire/store-release instructions.
This fact has forced the Arm company to correct their mistake from the first version of the 64-bit ARM ISA, where there were no atomic read-modify-write operations, so they have added all such operations in the first revision of the ISA, i.e. Armv8.1-A.
> In theory compare-and-swap or the equivalent instruction pair load-exclusive/store-conditional are more universal, but in practice they should be avoided whenever high contention is expected. The high performance algorithms for accessing shared resources are all based on using only fetch-and-add, atomic exchange, atomic bit operations and load-acquire/store-release instructions.
> This fact has forced ... there were no atomic read-modify-write operations, so they have added all such operations in the first revision of the ISA, i.e. Armv8.1-A.
I'm not sure if you meant for these two paragraphs to be related, but asking to make sure:
- Isn't compare-and-swap (CMPXCHG on x86) also read-modify-write, which in the first quoted paragraph you mention is slow?
- I think I've benchmarked LOCK CMPXCHG vs LOCK OR before, with various configurations of reading/writing threads. I was almost sure it was going to be an optimization, and it ended up being unobservable. IIRC, some StackOverflow posts led me to the notion that LOCK OR still needs to acquire ownership of the target address in memory (RMW). Do you have any more insights? Cases where LOCK OR is better? Or should I have used a different instruction to set a single bit atomically?