The case of the UI thread that hung in a kernel call

devblogs.microsoft.com

142 points by luu 4 days ago


simscitizen - 4 days ago

Oh I've debugged this before. Native memory allocator had a scavenge function which suspended all other threads. Managed language runtime had a stop-the-world phase which suspended all mutator threads. They ran at about the same time and ended up suspending each other. To fix this you need to enforce some sort of hierarchy or mutual exclusion for suspension requests.

> Why you should never suspend a thread in your own process.

This sounds like a good general principle but suspending threads in your own process is kind of necessary for e.g. many GC algorithms. Now imagine multiple of those runtimes running in the same process.
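A minimal sketch of the "mutual exclusion for suspension requests" idea, assuming POSIX threads. Every name here (`g_suspend_arbiter`, `run_with_world_stopped`, `run_demo`) is hypothetical, and the actual suspend/resume calls are left as comments because they are platform-specific. The only point is that two subsystems that both want to stop the world take the same process-wide lock first, so their suspension windows can never overlap.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

/* Hypothetical process-wide arbiter: whoever wants to suspend other threads
 * must hold this lock for the whole suspend..resume window. */
static pthread_mutex_t g_suspend_arbiter = PTHREAD_MUTEX_INITIALIZER;

/* Demo bookkeeping only: number of open windows, and the most ever open. */
static atomic_int g_open_windows = 0;
static atomic_int g_max_open_windows = 0;

typedef void (*stw_work_fn)(void *ctx);

void run_with_world_stopped(stw_work_fn work, void *ctx)
{
    int rc = pthread_mutex_lock(&g_suspend_arbiter);
    assert(rc == 0);

    int open = atomic_fetch_add(&g_open_windows, 1) + 1;
    int seen = atomic_load(&g_max_open_windows);
    while (open > seen &&
           !atomic_compare_exchange_weak(&g_max_open_windows, &seen, open)) {
        /* a failed compare-exchange reloads 'seen'; just retry */
    }

    /* A real runtime would suspend the other threads here ... */
    work(ctx);
    /* ... and resume them here, before giving up the arbiter. */

    atomic_fetch_sub(&g_open_windows, 1);
    rc = pthread_mutex_unlock(&g_suspend_arbiter);
    assert(rc == 0);
}

int max_concurrent_windows(void)
{
    return atomic_load(&g_max_open_windows);
}

/* Demo: two threads stand in for the allocator scavenger and the GC. */
typedef struct { int iterations; long *counter; } demo_args;

static void bump(void *ctx) { *(long *)ctx += 1; } /* deliberately non-atomic */

static void *demo_thread(void *p)
{
    demo_args *a = p;
    for (int i = 0; i < a->iterations; i++)
        run_with_world_stopped(bump, a->counter);
    return NULL;
}

long run_demo(int iterations)
{
    long counter = 0;
    demo_args args = { iterations, &counter };
    pthread_t scavenger, collector;
    int rc = pthread_create(&scavenger, NULL, demo_thread, &args);
    assert(rc == 0);
    rc = pthread_create(&collector, NULL, demo_thread, &args);
    assert(rc == 0);
    pthread_join(scavenger, NULL);
    pthread_join(collector, NULL);
    return counter;
}
```

Calling `run_demo(1000)` runs both "runtimes" against each other, and `max_concurrent_windows()` then reports how many stop-the-world windows were ever open at once. This only removes the mutual-suspension deadlock. It does nothing about a suspended thread that holds some other lock the suspender needs.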

ryao - 3 days ago

Who are these customers that get developer support from Microsoft engineering teams?

ot - 4 days ago

On Linux you'd do this by sending a signal to the thread you want to analyze, and then the signal handler would take the stack trace and send it back to the watchdog.

The tricky part is ensuring that the signal handler code is async-signal-safe (which pretty much boils down to "ensure you're not acquiring any locks and be careful about reentrant code"), but at least that only has to be verified for a self-contained small function.

Is there anything similar to signals on Windows?

zavec - 4 days ago

I knew from seeing a title like that on microsoft.com that it was going to be a Raymond Chen post! He writes fascinating stuff.

pitterpatter - 4 days ago

Reminds me of a hang in the Settings UI that was because it would get stuck on an RPC call to some service.

Why was the service holding things up? Because it was waiting on acquiring a lock held by one of its other threads.

What was that other thread doing? It was deadlocked because it tried to recursively acquire an exclusive srwlock (exactly what the docs say will happen if you try).

Why was it even trying to reacquire said lock? Ultimately because of a buffer overrun that ended up overwriting some important structures.

boxed - 4 days ago

I had a support issue once at a well known and big US defense firm. We consistently got hangs in kernel space from normal user-level code. Crazy shit. I opened a support issue which eventually got closed because we used an old compiler. Fun times.

markus_zhang - 4 days ago

Although I understand nothing from these posts, reading Raymond's posts somehow always "tranquilizes" my inner struggles.

Just curious, is this customer a game studio? I have never done any serious system programming but the gist feels like one.

saagarjha - 3 days ago

> If you want to suspend a thread and capture stacks from it, you’ll have to do it from another process, so that you don’t deadlock with the thread you suspended.

Unfortunately sometimes you don't have the luxury of being able to do this (e.g. on iOS, especially pre-MetricKit). We shipped one such implementation in the Twitter app (which was still there last I checked) and as far as I can tell it's safe, but mostly by accident: I didn't want to pause things for very long, so the code just suspends the thread, grabs register state, then writes the backtrace to a stack buffer before resuming. I originally wanted to grab traces without suspending the process, which is something you can actually "do" because getting register state doesn't require suspension and you need to put guards on your frame decoding anyway ("is this address I am about to dereference actually in the stack?"). But unfortunately, after thinking about it, I added the suspension back because trying to collect a trace from a running thread could give you a fragmented backtrace as the thread modifies its stack out from under you.
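To make the "guards on your frame decoding" part concrete, here is a small sketch of a bounds-checked frame-pointer walk. It is not the code that shipped. It assumes the conventional layout where the word at the frame pointer holds the caller's frame pointer and the next word holds the return address, and the names `stack_bounds`, `frame_in_bounds` and `walk_frames` are invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Known extent of the target thread's stack: [stack_lo, stack_hi). */
typedef struct {
    uintptr_t stack_lo;
    uintptr_t stack_hi;
} stack_bounds;

/* "Is this address I am about to dereference actually in the stack?"
 * A frame record is two words, so both must fit below stack_hi. */
static int frame_in_bounds(const stack_bounds *b, uintptr_t fp)
{
    const uintptr_t need = 2 * sizeof(uintptr_t);
    if (b->stack_hi < b->stack_lo || b->stack_hi - b->stack_lo < need)
        return 0;
    return fp % sizeof(uintptr_t) == 0 &&
           fp >= b->stack_lo &&
           fp <= b->stack_hi - need;
}

/* Walks the frame-pointer chain starting at fp and stores return addresses
 * in out. Returns the number of addresses written. */
size_t walk_frames(const stack_bounds *b, uintptr_t fp,
                   uintptr_t *out, size_t max_out)
{
    assert(b != NULL && out != NULL);
    size_t n = 0;
    while (n < max_out && frame_in_bounds(b, fp)) {
        const uintptr_t *frame = (const uintptr_t *)fp;
        uintptr_t next_fp = frame[0];
        uintptr_t ret = frame[1];
        if (ret == 0)
            break;              /* conventional end of the chain */
        out[n++] = ret;
        if (next_fp <= fp)
            break;              /* callers sit at higher addresses; refuse loops */
        fp = next_fp;
    }
    return n;
}
```

With a downward-growing stack, each caller's frame sits at a higher address than its callee's, which is why a non-increasing `next_fp` ends the walk. The guards keep the walker from dereferencing garbage, but they cannot make a trace taken from a running thread consistent, which matches the reason given above for adding the suspension back.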

rat87 - 4 days ago

Reminds me of a bug that would bluescreen Windows if I stopped Visual Studio debugging while it was in the middle of calling the native Ping from C#.

Permik - 3 days ago

I have the weirdest hunch that the customer in question was Valve :D

frabona - 4 days ago

Such a clean breakdown. "Don’t suspend your own threads" should be tattooed on every Windows dev’s arm at this point.

makz - 4 days ago

Looking at the title, at first I thought “uh?”, but then I saw microsoft and it made sense.

baruchthescribe - 3 days ago

> Naturally, a suspended UI thread is going to manifest itself as a hang.

The correct terminology is 'stopped responding', Raymond. You need to consult the style guide.

brcmthrowaway - 4 days ago

Can this happen with Grand Central Dispatch?