Non-Uniform Memory Access (NUMA) is reshaping microservice placement

codemia.io

97 points by signa11 4 days ago


bboreham - 3 days ago

Very detailed and accurate description. The author clearly knows way more than I do, but I would venture a few notes:

1. In the cloud, it can be difficult to know the NUMA characteristics of your VMs. AWS, Google, etc., do not publish them. I found the ‘lscpu’ command helpful (there's a small sketch after this list that reads the same data from sysfs).

2. Tools like https://github.com/SoilRos/cpu-latency plot the core-to-core latency on a 2d grid. There are many example visualisations on that page; maybe you can find the chip you are using.

3. If you get to pick VM sizes, pick ones the same size as a NUMA node on the underlying hardware. E.g. prefer the 64-core m8g.16xlarge over the 96-core m8g.24xlarge, which will span two nodes.
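
To read the same data from a program instead of eyeballing lscpu output, the standard sysfs files are enough. A minimal sketch (Linux only; the paths are the usual /sys/devices/system/node ones, error handling kept short):

    /* numa_peek.c -- print the NUMA nodes and their CPU lists as the
     * guest kernel sees them (the same data `lscpu` summarises).
     * Build: cc -o numa_peek numa_peek.c */
    #include <stdio.h>
    #include <string.h>

    static int read_line(const char *path, char *buf, size_t len) {
        FILE *f = fopen(path, "r");
        if (!f) return -1;
        if (!fgets(buf, (int)len, f)) { fclose(f); return -1; }
        fclose(f);
        buf[strcspn(buf, "\n")] = '\0';   /* strip trailing newline */
        return 0;
    }

    int main(void) {
        char buf[256], path[128];

        /* Which node IDs exist at all, e.g. "0" or "0-1". */
        if (read_line("/sys/devices/system/node/online", buf, sizeof buf) != 0) {
            fprintf(stderr, "no NUMA info exposed by this kernel/VM\n");
            return 1;
        }
        printf("online nodes: %s\n", buf);

        /* Walk a generous range of node IDs; missing ones simply fail to open. */
        for (int node = 0; node < 64; node++) {
            snprintf(path, sizeof path,
                     "/sys/devices/system/node/node%d/cpulist", node);
            if (read_line(path, buf, sizeof buf) == 0)
                printf("node%d cpus: %s\n", node, buf);
        }
        return 0;
    }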

stego-tech - 4 days ago

Solid writeup of NUMA, scheduling, and the need for pinning for folks who don't spend a lot of time on the IT side of things (where we, unfortunately, have been wrangling with this for over a decade). The long and short of it is that if you're building an HPC application, or are sensitive to throughput and latency in your cutting-edge/high-traffic system design, then you need to manually pin your workloads for optimal performance.

One thing the writeup didn't seem to get into is the lack of scalability of this approach (manual pinning). As core counts and chiplets continue to explode, we still need better ways of scaling manual pinning, or of building more NUMA-aware OSes/applications that can auto-schedule with minimal penalties. Don't get me wrong, it's a lot better than ye olden days of dual-core, multi-socket servers and stern vendor warnings against fussing with NUMA schedulers if you wanted to preserve basic functionality, but it's not a solved problem just yet.
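
For anyone who hasn't had to do it, "manually pin" on Linux usually just means a CPU affinity mask: taskset/numactl from the shell, or sched_setaffinity in the process itself. A rough sketch (not from the article; the CPU numbers are whatever node layout lscpu reports):

    /* pin_to_cpus.c -- pin the current process to an explicit CPU set,
     * the same effect as `taskset -c 0-7 ./your_app`, done in-process.
     * Build: cc -o pin_to_cpus pin_to_cpus.c */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(int argc, char **argv) {
        /* CPUs to pin to are passed as arguments, e.g. ./pin_to_cpus 0 1 2 3 */
        if (argc < 2) {
            fprintf(stderr, "usage: %s cpu [cpu ...]\n", argv[0]);
            return 1;
        }

        cpu_set_t set;
        CPU_ZERO(&set);
        for (int i = 1; i < argc; i++)
            CPU_SET(atoi(argv[i]), &set);

        /* pid 0 means "the calling process"; threads created afterwards inherit it. */
        if (sched_setaffinity(0, sizeof set, &set) != 0) {
            perror("sched_setaffinity");
            return 1;
        }

        printf("pinned pid %d; run the real workload from here\n", (int)getpid());
        /* e.g. execvp(real_binary, real_argv); */
        return 0;
    }

In practice you'd pair this with numactl --membind (or set_mempolicy) so the memory allocations stay on the same node as the CPUs.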

throw0101c - 3 days ago

If you really want to get into it, then you could start worrying about your I/O. In the AMD example here:

* https://www.thomas-krenn.com/en/wiki/Display_Linux_CPU_topol...

you'll see some NUMA nodes with networking I/O attached to them, others with NVMe, and others with no I/O at all. So if you're really worried about network latency you'd pin the process to the node with the NIC, but if you care about disk numbers (a database?) you'd potentially be looking at the node with the NVMe.
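
To check which node a given device hangs off, sysfs exposes it directly via the device's numa_node attribute. A small sketch (the eth0/nvme0 names are just examples, substitute your own devices; -1 means the kernel has no locality info for it):

    /* io_node.c -- print which NUMA node a NIC and an NVMe controller
     * are attached to, i.e. which node to pin latency-sensitive work to.
     * Build: cc -o io_node io_node.c */
    #include <stdio.h>

    static void print_numa_node(const char *label, const char *path) {
        FILE *f = fopen(path, "r");
        int node = -1;
        if (f) {
            if (fscanf(f, "%d", &node) != 1) node = -1;
            fclose(f);
        }
        /* -1: the kernel doesn't know, or it's a single-node system */
        printf("%-10s numa_node = %d\n", label, node);
    }

    int main(void) {
        print_numa_node("eth0",  "/sys/class/net/eth0/device/numa_node");
        print_numa_node("nvme0", "/sys/class/nvme/nvme0/device/numa_node");
        return 0;
    }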

In recent years there's also chiplet-level locality that may need to be considered as well.

Examining this has been a thing in the HPC space for a decade or two now:

* https://www.open-mpi.org/projects/hwloc/lstopo/

* https://slurm.schedmd.com/mc_support.html

jauntywundrkind - 4 days ago

There's a constant drum-beat of NUMA-related work going by if you follow phoronix.com.

https://www.phoronix.com/news/Linux-6.17-NUMA-Locality-Rando...

https://www.phoronix.com/news/Linux-6.13-Sched_Ext

https://www.phoronix.com/news/DAMON-Self-Tuned-Memory-Tierin...

https://www.phoronix.com/news/Linux-6.14-FUSE

There's some big work I'm missing that's more recent too, again about allocation & scheduling IIRC. Still trying to find it. The third link is about DAMON, which is trying to do a lot of optimization; a good thread to tug on more!

I have this pocket belief that eventually we might see post-NUMA, post-coherency architectures, where even a single chip acts more like multiple independent clusters that use something more like networking (CXL or UltraEthernet or something) to allow RDMA, but without coherency.

Even today, the title here is woefully under-describing the problem. An Epyc chip is actually multiple compute dies, each with its own NUMA zone and its own L3 and other caches. For now, yes, each socket's memory all goes through a single I/O die and is semi-uniform, but whether that holds is in question, and even today the multiple NUMA zones on one socket already require careful tuning for efficient workload processing.
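
If you want to see how non-uniform a single socket already is, the kernel's node distance table (the same numbers `numactl --hardware` prints) makes it visible. A libnuma sketch, assuming a Linux box with libnuma installed:

    /* numa_dist.c -- dump the node-to-node distance matrix the kernel
     * reports, which is where per-chiplet NUMA zones show up.
     * Build: cc -o numa_dist numa_dist.c -lnuma   (needs libnuma headers) */
    #include <numa.h>
    #include <stdio.h>

    int main(void) {
        if (numa_available() < 0) {
            fprintf(stderr, "kernel built without NUMA support\n");
            return 1;
        }

        int max = numa_max_node();   /* highest node id exposed to this host */
        printf("nodes 0..%d\n      ", max);
        for (int j = 0; j <= max; j++) printf("%5d", j);
        printf("\n");

        /* 10 = local, larger = further away (ACPI SLIT convention). */
        for (int i = 0; i <= max; i++) {
            printf("node%2d", i);
            for (int j = 0; j <= max; j++)
                printf("%5d", numa_distance(i, j));
            printf("\n");
        }
        return 0;
    }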

pjmlp - 3 days ago

Even with plain SMP, back in the 2000s, the Windows NT/2000 scheduler wasn't that great, re-scheduling processes/threads across CPUs; just by making use of the processor affinity mask we managed a visible performance improvement.

NUMA systems now make this even more obvious when scheduling is not done properly.
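
For reference, that affinity-mask trick looks roughly like this on Windows (a sketch, not the original code; the 0xF mask for CPUs 0-3 is just an example):

    /* affinity.c -- restrict the current process to a fixed set of CPUs
     * via the Win32 processor affinity mask, so the scheduler stops
     * bouncing it around. Build with MSVC: cl affinity.c */
    #include <windows.h>
    #include <stdio.h>

    int main(void) {
        /* Bit i set = allowed to run on logical processor i. 0xF = CPUs 0-3. */
        DWORD_PTR mask = 0xF;

        if (!SetProcessAffinityMask(GetCurrentProcess(), mask)) {
            fprintf(stderr, "SetProcessAffinityMask failed: %lu\n", GetLastError());
            return 1;
        }
        printf("process restricted to CPUs 0-3\n");
        /* ... run the actual workload here ... */
        return 0;
    }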

ashvardanian - 3 days ago

I've been struggling for a long time to find the settings that control NUMA on public cloud instances. Those are typically configured to present a single socket as a single UMA node, even on huge EPYCs. If someone has a tip on where to find those settings, I'd appreciate a link!

porridgeraisin - 3 days ago

Very nice article. Gonna try this out on our lab cluster and see what improvements it gives.
