Spinning around: Please don't – Common problems with spin locks

siliceum.com

93 points by bdash 12 hours ago


Animats - 2 hours ago

I struggled with this in Wine. "malloc"-type memory allocation involves at least two levels of spinlocks. When you do a "realloc", the spinlocks are held during the copying operation. If you use Vec::push in Rust, you do a lot of reallocs. In a heavily multithreaded program, this can knock performance down by more than two orders of magnitude. It's hard to reproduce with a simple program; it takes a lot of concurrency to hit futex congestion.

Real Windows and Linux don't have this problem; only Wine's "malloc" in a DLL does.

Bug reports resulted in finger-pointing and denial.[1] "Unconfirmed", despite showing debugger output.

[1] https://bugs.winehq.org/show_bug.cgi?id=54979

pizlonator - 6 hours ago

TFA lists WebKit as a project that "does it wrong".

The author should read https://webkit.org/blog/6161/locking-in-webkit/ so that they understand what they are talking about.

WebKit does it right in the sense that:

- It has an optimal amount of spinning

- Threads wait (instead of spinning) if the lock is not available immediately-ish

And we know that the algorithms are optimal based on rigorous experiments.
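The spin-then-wait shape described above can be sketched in Rust. This is a simplified stand-in, not WebKit's actual WTF::Lock: a short bounded spin on an atomic flag (40 iterations here, an arbitrary small bound for illustration), falling back to blocking on a Condvar so waiters sleep instead of burning CPU:

```rust
use std::hint::spin_loop;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Condvar, Mutex};

// Sketch of a spin-then-wait ("adaptive") lock.
pub struct AdaptiveLock {
    locked: AtomicBool,
    sleepers: Mutex<()>, // protects the sleep/wake handshake
    wakeup: Condvar,
}

impl AdaptiveLock {
    pub fn new() -> Self {
        AdaptiveLock {
            locked: AtomicBool::new(false),
            sleepers: Mutex::new(()),
            wakeup: Condvar::new(),
        }
    }

    fn try_acquire(&self) -> bool {
        self.locked
            .compare_exchange(false, true, Ordering::Acquire, Ordering::Relaxed)
            .is_ok()
    }

    pub fn lock(&self) {
        // Fast path: a short, bounded spin catches locks that free up quickly.
        for _ in 0..40 {
            if self.try_acquire() {
                return;
            }
            spin_loop();
        }
        // Slow path: block until the holder wakes us.
        let mut guard = self.sleepers.lock().unwrap();
        while !self.try_acquire() {
            guard = self.wakeup.wait(guard).unwrap();
        }
    }

    pub fn unlock(&self) {
        self.locked.store(false, Ordering::Release);
        // Take the handshake mutex so the notify can't fire between a
        // sleeper's failed try_acquire and its wait() -- that would lose
        // the wakeup.
        let _guard = self.sleepers.lock().unwrap();
        self.wakeup.notify_one();
    }
}
```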

jcranmer - 9 hours ago

The basic rule of writing your own cross-thread datastructures like mutexes or condition variables is... don't, unless you have a very good reason. If you're in that rare circumstance where you know the library you're using isn't viable for some reason, then the next best rule is to use your OS's version of a futex as the atomic primitive, since it's going to solve most of the pitfalls for you automatically.

The only time I've manually written my own spin lock was when I had to coordinate between two different threads, one of which was running 16-bit code, so using any library was out of the question, and even relying on syscalls was sketchy because making sure the 16-bit code is in the right state to call a syscall itself is tricky. Although in this case, since I didn't need to care about things like fairness (only two threads are involved), the spinlock core ended up being simple:

    "thunk_spin:",
        "xchg cx, es:[{in_rv}]",  // atomically swap CX with the flag word
        "test cx, cx",            // did we pull out a non-zero value?
        "jnz thunk_has_data",     // yes: data is ready, leave the loop
        "pause",                  // spin-wait hint to the CPU
        "jmp thunk_spin",         // no: keep spinning
    "thunk_has_data:",
spacechild1 - 5 hours ago

Nice article! Yes, using spinlocks in normal userspace applications is not recommended.

One area where I found spinlocks to be useful is in multithreaded audio applications. Audio threads are not supposed to be preempted by other user space threads because otherwise they may not complete in time, leading to audio glitches. The threads have a very high priority (or have a special scheduling policy) and may be pinned to different CPU cores.

For example, multiple audio threads might read from the same sample buffer, whose content is occasionally modified. In that case, you could use a reader-writer-spinlock where multiple readers would be able to progress in parallel without blocking each other. Only a writer would block other threads.
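A reader-writer spinlock of the kind described above can be sketched in Rust with a single atomic counter: state > 0 is the number of active readers, -1 means a writer holds the lock, 0 means free. This is an illustration only, not a real-time-safe implementation; there is no backoff, and a steady stream of readers can starve the writer:

```rust
use std::hint::spin_loop;
use std::sync::atomic::{AtomicI32, Ordering};

pub struct RwSpinLock {
    state: AtomicI32, // >0: reader count, -1: writer, 0: free
}

impl RwSpinLock {
    pub const fn new() -> Self {
        RwSpinLock { state: AtomicI32::new(0) }
    }

    pub fn read_lock(&self) {
        loop {
            let s = self.state.load(Ordering::Relaxed);
            // Readers may enter as long as no writer holds the lock.
            if s >= 0
                && self
                    .state
                    .compare_exchange_weak(s, s + 1, Ordering::Acquire, Ordering::Relaxed)
                    .is_ok()
            {
                return;
            }
            spin_loop();
        }
    }

    pub fn read_unlock(&self) {
        self.state.fetch_sub(1, Ordering::Release);
    }

    pub fn write_lock(&self) {
        // Wait until the lock is completely free, then claim it exclusively.
        while self
            .state
            .compare_exchange_weak(0, -1, Ordering::Acquire, Ordering::Relaxed)
            .is_err()
        {
            spin_loop();
        }
    }

    pub fn write_unlock(&self) {
        self.state.store(0, Ordering::Release);
    }
}
```

Multiple readers increment the count and proceed in parallel; only the writer needs the state to drop to zero.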

What would be the potential problems in that scenario?

rdtsc - 7 hours ago

> Notice that in the Skylake Client microarchitecture the RDTSC instruction counts at the machine’s guaranteed P1 frequency independently of the current processor clock (see the INVARIANT TSC property), and therefore, when running in Intel® Turbo-Boost-enabled mode, the delay will remain constant, but the number of instructions that could have been executed will change.

rdtsc may execute out of order, so an lfence (previously cpuid) is sometimes used to serialize it; there is also rdtscp, which waits for prior instructions to complete before reading the counter.

See https://github.com/torvalds/linux/blob/master/arch/x86/inclu...

And just because the rdtsc rate is constant doesn't mean the processor clock is constant; it could be fluctuating.
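The fencing described above can be done from Rust with the intrinsics in core::arch. A hedged sketch (the non-x86 fallback just returns 0 so the snippet stays portable; real measurement code would also pin the thread to one core):

```rust
// Fence rdtsc so it can't be reordered around the code being timed:
// lfence keeps instructions from drifting across the counter read.
#[cfg(target_arch = "x86_64")]
fn fenced_rdtsc() -> u64 {
    use core::arch::x86_64::{_mm_lfence, _rdtsc};
    unsafe {
        _mm_lfence(); // keep earlier instructions from drifting past the read
        let t = _rdtsc();
        _mm_lfence(); // keep the read from drifting past later instructions
        t
    }
}

// Portability fallback for non-x86 targets (no TSC to read).
#[cfg(not(target_arch = "x86_64"))]
fn fenced_rdtsc() -> u64 {
    0
}
```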

horizion2025 - 3 hours ago

My concurrency knowledge is a bit rusty, but aren't spinlocks only supposed to be used for very brief waits, like in the hundreds of cycles (or situations where you can't block, like internal OS scheduling structures in SMP setups)? If so, how much do all this backoff and starvation of higher-priority threads even matter? If the wait is longer, you should use a blocking primitive (except in those low-level OS structures!), where most of the things discussed are not an issue. Would love to hear the use cases where spin locks are needed in e.g. user space; I don't doubt they occur.

fsckboy - 3 hours ago

what is the thread synchronization protocol called that's the equivalent of ethernet's CSMA? there's no "carrier sensing", but instead "who won, or mistakes were made" sensing. or is that just considered a form of spinlock? (you're not waiting for a lock; you perform your operation, then see if it worked. though you could make the operation be "acquire lock", in which case it's a spinlock)
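The pattern described above is usually called optimistic concurrency, or a lock-free compare-and-swap retry loop: do the work on a private copy, then try to commit it atomically, and on a collision (someone else committed first, CSMA's failed transmission) simply retry. A Rust sketch:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// "Perform your operation, then see if it worked": compute the new value
// off to the side, then commit with a single compare-and-swap.
fn optimistic_add(counter: &AtomicU64, delta: u64) {
    let mut observed = counter.load(Ordering::Relaxed);
    loop {
        let proposed = observed + delta; // the operation, done privately
        match counter.compare_exchange_weak(
            observed,
            proposed,
            Ordering::AcqRel,
            Ordering::Relaxed,
        ) {
            Ok(_) => return,                  // it worked: our commit won
            Err(actual) => observed = actual, // collision: retry from the new value
        }
    }
}
```

Unlike a spinlock, nobody ever holds anything: a preempted thread can't block the others, it just loses the race and retries.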

a-dub - 7 hours ago

i always got the sense that spinlocks were about maximum portability and reliability in the face of unreliable event driven approaches. the dumb inefficient thing that makes the heads of the inexperienced explode, but actually just works and makes the world go 'round.

jeffbee - 8 hours ago

"Unfair" paragraph is way too short. This is the main problem! The outlier starvation you get from contended spinlocks is extraordinary and, hypothetically, unbounded.

CamperBob2 - 10 hours ago

Sheesh. Can something this complicated ever truly be said to work?

gafferongames - 10 hours ago

Great article! Thanks for posting this.