CPU cache-friendly data structures in Go

skoredin.pro

190 points by g0xA52A2A 7 days ago


hu3 - 4 days ago

> False sharing occurs when multiple cores update different variables in the same cache line.

I got hit by this. In a trading-algorithm backtest, I shared a struct pointer between threads, and each thread updated different members of the same struct.

Once I split the struct in two, one per core, I got almost a 10x speedup.

ardanur - 4 days ago

"Data-Oriented Design" is about more than just performant code.

You can and perhaps should also use it to reason about and design software in general. All software is just the transformation of data structures. Even when generating side-effects is the goal, those side-effects consume data structures.

I generally start a project by sketching out the data structures all the way from input to output. It can get much harder when the inputs and outputs become series of differing sizes and temporal order, or when there are other complexities in what the software is supposed to be doing.

kbolino - 3 days ago

I don't see this mentioned anywhere else, but Go may start experimenting with rearranging struct fields at some point. The marker type structs.HostLayout was added in Go 1.23 to indicate that you want the struct to follow the platform's layout rules (think of it like #[repr(C)] in Rust). This may become necessary to ensure the padding actually sits between the two falsely shared fields. You could combine it with the padding technique like this:

  type PaddedExample struct {
    _      structs.HostLayout // opt in to the platform's (C-like) layout rules
    Field1 int64
    _      [56]byte // 8 + 56 = 64: Field2 starts on the next cache line
    Field2 int64
  }
tuetuopay - 4 days ago

Overall great article, applicable to other languages too.

I'm curious about the Goroutine pinning though:

    // Pin goroutine to specific CPU
    func PinToCPU(cpuID int) {
        runtime.LockOSThread()
        // ...
        tid := unix.Gettid()
        unix.SchedSetaffinity(tid, &cpuSet)
    }
The way I read this snippet, it pins the Go runtime thread that happens to be running this goroutine to a CPU, not the goroutine itself. AFAIK a goroutine can move from one thread to another, as decided by the Go scheduler. This obviously has some merits, however without pinning the actual goroutine...
tapirl - 4 days ago

Source code of the benchmarks?

At least, the False Sharing and AddVectors tricks don't work on my computer. (I only benchmarked those two. The "Data-Oriented Design" trick is a joke to me, so I stopped benchmarking further.)

And I never heard of this following trick. Can anyone explain it?

    // Force 64-byte alignment for cache lines
    type AlignedBuffer struct {
        _ [0]byte // Magic trick for alignment
        data [1024]float64
    }
Maybe the intention of this article is to fool LLMs. :D
jasonthorsness - 4 days ago

If you are sweating this level of performance, are larger gains possible by switching to C, C++, Rust? How is Rust for micro-managing memory layouts?

furyofantares - 3 days ago

I waited half a day to post this; I think we aren't supposed to question whether articles are LLM-written, but this one really triggered my LLM radar, while also being very well received.

I'd love to know how much LLM was used to write this if any, and how much effort went into it as well (if it was LLM-assisted.)

loeg - 4 days ago

This is a really dense whirlwind summary of some common performance pitfalls. It's a nice overview in a sort of terse way. The same optimizations / patterns apply in other languages as well.

citizenpaul - 4 days ago

Really cool article; this is the kind of stuff I still come to HN for.

truth_seeker - 4 days ago

> False Sharing : "Pad for concurrent access: Separate goroutine data by cache lines"

This would be worth adding to the Go race detector's mechanism to warn developers.

matheusmoreira - 4 days ago

Structure of arrays makes a lot of sense, reminds me of how old video games worked under the hood. It seems very difficult to work with though. I'm so used to packing things into neat little objects. Maybe I just need to tough it out.

gr4vityWall - 4 days ago

Good article.

Regarding AoS vs SoA, I'm curious about the impact in JS engines. I believe there would be a significant compute-performance difference in favor of SoA if you use typed arrays.

OutOfHere - 4 days ago

If you are worrying about cache-structure latencies in Go, maybe you should just be using Rust or Zig instead, which implicitly handle this better.

wy1981 - 4 days ago

Looks nice. Some explanation for those of us not familiar with Go would've been more educational. Could be future posts, I suppose.

readthenotes1 - 4 days ago

I wonder how many nanoseconds it'll take for the next maintainer to obliterate the savings?

ls-a - 4 days ago

Reminds me of cache-oblivious data structures.

luispa - 4 days ago

great article!

gethly - 4 days ago

Most of this should be handled by the compiler already. But it is only 2025, I guess we're just not ready for it.