We found a bug in Go's ARM64 compiler

blog.cloudflare.com

828 points by jgrahamc 5 days ago


Neywiny - 5 days ago

That's an incredible find and once I saw the assembly I was right along with them on the debug path. Interestingly it doesn't need to be assembly for this to work, it's just that that's where the split was. The IR could've done it, it just doesn't for very good reasons. So another win for being able to read arm assembly.

Unsure if this would be another way to do it but to save an instruction at the cost of a memory access you could push then pop the stack size maybe? Since presumably you're doing that pair of moves on function entry and exit. I'm not really sure what the garbage collector is looking for so maybe that doesn't work, but I'd be interested to hear some takes on it

pengaru - 5 days ago

For the impatient, here's the fix: https://github.com/golang/go/commit/f7cc61e7d7f77521e073137c...

Vipsy - 4 days ago

One thing that often gets missed is how hard it is to even suspect the compiler as the root cause. Most engineers waste hours chasing bugs in their own code because we’re trained to trust our tools. This mindset alone can make these rare compiler bugs much trickier to find.

renewiltord - 5 days ago

Great technical blog. Good narrative path, tight examples, and a description so clear and easy to follow that it makes me feel smarter than I am, even though the last time I read assembly seriously was x86, years ago.

Also, fulfills the marketing objective because I cannot help but think that this team is a bunch of hotshots who have the skill to do this on demand and the quality discipline to chase down rare issues.

I assume these are Ampere Altra? I was considering some of those for web servers to fill out my rack (more space than power) but ended up just going higher on power and using Epyc.

quotemstr - 4 days ago

This problem strikes me more as a debuginfo generation bug than a "compiler" bug.

> After this change, stacks larger than 1<<12 will build the offset in a temporary register and then add that to rsp in a single, indivisible opcode. A goroutine can be preempted before or after the stack pointer modification, but never during. This means that the stack pointer is always valid and there is no race condition.

Seems silly to pessimize the runtime, even slightly, to account for the partial register construction. DWARF bytecode ought to be powerful enough to express the calculations needed for restoring the true stack pointer if we're between immediate adjustments.
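To make the quoted fix concrete, here is a minimal sketch (not the Go toolchain's code) of why a frame larger than 1<<12 used to need two adds. It relies on one fact about the ISA: an ARM64 ADD (immediate) encodes a 12-bit value, optionally shifted left by 12. The function names, the example frame size 0x10050, the starting stack pointer, and the order of the two adds are all assumptions for illustration.

```go
package main

import "fmt"

// splitImmediate splits offset into the two pieces an ARM64 ADD (immediate)
// can encode: the low 12 bits, and the remaining bits (a 12-bit value << 12).
// Hypothetical helper; assumes offset < 1<<24.
func splitImmediate(offset uint64) (lo, hi uint64) {
	return offset & 0xfff, offset &^ 0xfff
}

// observableSPs lists every stack pointer value visible at an instruction
// boundary while offset is added back onto sp. With single=true the offset
// is assumed to be built in a temporary register and added in one instruction.
func observableSPs(sp, offset uint64, single bool) []uint64 {
	lo, hi := splitImmediate(offset)
	if single || lo == 0 || hi == 0 {
		// One add suffices: only "before" and "after" are observable.
		return []uint64{sp, sp + offset}
	}
	// Two adds: the middle value is neither the old nor the new frame.
	return []uint64{sp, sp + lo, sp + lo + hi}
}

func main() {
	const sp, frame = 0x40000000, 0x10050 // illustrative values only
	fmt.Printf("split adds: %#x\n", observableSPs(sp, frame, false))
	fmt.Printf("single add: %#x\n", observableSPs(sp, frame, true))
}
```

The middle entry in the split case is the window the article describes: a preemption landing there sees a stack pointer that belongs to no frame.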

riobard - 5 days ago

What ARM64 machines are you using and what are they used for? Last year you were announcing Gen 12 servers on AMD EPYC (https://blog.cloudflare.com/gen-12-servers/), but IIRC there weren’t any mentions of ARM64. But now it seems you’re running ARM64 in full production.

MarkSweep - 4 days ago

I wonder if Go had a mode where you make it single step every instruction and trigger a GC interrupt on every opcode. That would make it easier to find these kinds of bugs.
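A toy version of that idea, as a sketch only: model an epilogue as a list of steps and check an invariant at every instruction boundary, as if a preemption could land at each one. This is not an existing Go runtime mode; the `instr` type, the helper names, and the numeric values are hypothetical.

```go
package main

import "fmt"

// instr models one instruction as a pure update of the stack pointer.
type instr func(sp uint64) uint64

// add returns an instruction that adds imm to the stack pointer.
func add(imm uint64) instr {
	return func(sp uint64) uint64 { return sp + imm }
}

// firstBadBoundary "preempts" after every instruction in prog and reports
// the index of the first instruction after which sp is neither the value
// before the frame teardown nor the value after it. It returns -1 if the
// stack pointer is valid at every boundary.
func firstBadBoundary(prog []instr, sp, frame uint64) int {
	valid := map[uint64]bool{sp: true, sp + frame: true}
	for i, in := range prog {
		sp = in(sp)
		if !valid[sp] {
			return i
		}
	}
	return -1
}

func main() {
	const sp, frame = 0x40000000, 0x10050 // illustrative values only

	split := []instr{add(0x50), add(0x10000)} // two immediate adds
	single := []instr{add(frame)}             // one add of the full offset

	fmt.Println("split: ", firstBadBoundary(split, sp, frame))
	fmt.Println("single:", firstBadBoundary(single, sp, frame))
}
```

A real harness would have to single-step actual machine code and run the unwinder at each stop, which is far more expensive than this model suggests.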

wy1981 - 4 days ago

Great find and writeup.

As an aside, this is the type of a problem that I think model checkers can't help with. You can write perfect and complicated TLA+/Lean/FizzBee models and even if somehow these models can generate code for you from your correct models you can still run into bugs like these due to platform/compiler/language issues. But, thankfully, such bugs are rare.

alberth - 4 days ago

I thought Cloudflare was 100% Rust, and x86 (EPYC) these days.

Interesting to hear Go & ARM in use.

maguro_01 - a day ago

An x86-64 Windows 11 machine trying to access a previously available website now always produces a Cloudflare "obsolete protocol" error on an ordinary attempt. All browsers get the same error. Did your fix break something?

dreamcompiler - 5 days ago

Always adjust your stack pointer atomically, kids.

brcmthrowaway - 5 days ago

I don't get it, how were the machine threads being stopped in the middle of two instructions? This is bare metal, right?

pfdietz - 4 days ago

I see something like this and I wonder "what testing methodology would have found this?" It has to be general, not something that would involve knowing what the bug was ahead of time.

Agingcoder - 5 days ago

Excellent article as always from the Cloudflare blog - engineering without magic infrastructure and ML. One day I will apply!

Compiler bugs are actually quite common (I used to find several a year in GCC), but as the author says, some of them only appear when you work at a very large scale, and most people never dive that far.

Bengalilol - 4 days ago

I always appreciate articles like this, where you can clearly see the engineer’s way of thinking.

I was just puzzled by the middle part of the article, where they start investigating their code but seem to overlook the fact that it only happens on ARM64.

Still, I understand that it’s professional to proceed step by step logically.

Great article, it was a pleasure reading it!

javierhonduco - 5 days ago

Really enjoyed reading this. Thanks for writing it!

bradley13 - 4 days ago

I find it interesting how rare it has become to find a compiler bug. For me, at least, it used to be a regular event.

Even for Java, as widespread as it is, I have made half a dozen reports. None in the last several years, though.

Better testing? The sheer scale of software being produced?

wat10000 - 5 days ago

I would have thought that unwinding would use the frame pointer and this wouldn't be a problem.

mperham - 5 days ago

Did they ever explain why netlink was involved? Or was that a red herring?

yalok - 5 days ago

Classic problem of non-atomic stack pointer modification.

Used to have a lot of fun with those 3 decades ago.

lordnacho - 5 days ago

> This was a very fun problem to debug.

I'm sure it was a relief to find a thorough solution that addressed the root cause. But it doesn't seem plausible that it was fun while it was unexplained. When I have this kind of bug it eats my whole attention.

Something this deep is especially frustrating. Nobody suspects the standard library or the compiler. Devs have been taught from a young age that it's always you, not the tools you were given, and that's generally true.

One time, I actually did find a standard library bug. I ended up taking apart absolutely everything on my side, because of course the last hypothesis you test is that the pieces you have from the SDK are broken. So a huge amount of time is spent chasing the wrong lead when it actually is a fundamental problem.

On top of this, the thing is a race condition, so you can't even reliably reproduce it. You think it's gone like they did initially, and then it's back. Like cancer.

neuroelectron - 4 days ago

I've seen only one race condition in my career, and it always surprises me how they are even found.

anthk - 4 days ago

I miss the Delve debugger for OpenBSD 386 BTW.

gok - 5 days ago

The real lesson here should be that doing crazy shit like swizzling the program counter in a signal handler and writing your own assembler is not a good idea.

me2too - 4 days ago

Great write-up
