NUMA performance notes
- false sharing on NUMA nodes is extensive:
- normal false sharing is of concern for cache lines
- on NUMA nodes, false sharing is also of concern for pages as pages have to be migrated between nodes for shared access. Avoid accessing the same page from different NUMA nodes.
- granularity: typically 4 KB pages on x86; ARM supports 4 KB, 16 KB, or 64 KB page granules. What about huge pages (2 MB)?
- keep sharing processes confined to the same NUMA node
- use per-thread memory
- no jumping across memory
- reported by VTune as front-end stall (instruction stall!)
- atomics are slower than on older non-NUMA systems; read-modify-write operations in particular
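A minimal sketch of the per-thread memory advice above, with illustrative names: each thread gets its own counter padded to a full cache line, so no two threads ever dirty the same line; on NUMA the same idea applies at page granularity.

```c
#include <stdint.h>

/* Illustrative sketch: pad each per-thread counter to a full cache line so
 * threads never write to the same line (classic false sharing). On NUMA,
 * the same trick can be applied at page granularity (e.g. 4 KB) so no two
 * nodes touch the same page. */
enum { CACHE_LINE = 64 };  /* typical x86 cache-line size */

struct padded_counter {
    _Alignas(CACHE_LINE) uint64_t value;  /* occupies its own cache line */
};

static inline void counter_add(struct padded_counter *c, uint64_t n) {
    c->value += n;  /* only this thread's line is dirtied */
}
```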
- use NUMA-aware thread pools
- Linux kernel pressure points:
- TLB: shootdowns (inter-processor interrupts!) by:
- NUMA page migrations
- changes in memory mappings
- memory access modes/protection
- (explicit) changes to the in-kernel page table
- use huge pages as relief
- /sys/kernel/mm/transparent_hugepage/enabled = always (kernel applies huge pages as it sees fit; in practice: must request huge pages explicitly)
- madvise(addr, size, MADV_HUGEPAGE); addr is 4K-aligned; size is a multiple of 2M
- memory must be reserved/come from mmap()
- disable NUMA migration:
- use numactl
- bind program/thread to a single NUMA node / set of nodes
- turn off NUMA balancing
- /proc/sys/kernel/numa_balancing = 0 (overall program likely to be slower now)
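For binding from inside the program rather than via numactl on the command line, a sketch using sched_setaffinity; the CPU list for a node is an assumption here and would normally be read from /sys/devices/system/node/node0/cpulist.

```c
#define _GNU_SOURCE  /* for cpu_set_t, CPU_SET, sched_setaffinity */
#include <sched.h>

/* Sketch: pin the calling thread to a given CPU set (assumed to be the CPUs
 * of one NUMA node). `numactl --cpunodebind=0 --membind=0 ./prog` achieves
 * the same (plus memory binding) without code changes. Returns 0 on success. */
static int pin_to_cpus(const int *cpus, int ncpus) {
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int i = 0; i < ncpus; i++)
        CPU_SET(cpus[i], &set);
    return sched_setaffinity(0, sizeof(set), &set);  /* 0 = calling thread */
}
```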
- low CPU usage on NUMA at the OS level (idle time): likely I/O-bound
- try lowering kernel.sched_migration_cost_ns
- move processes that use network NICs to node 0 (NIC connects to PCIe of node 0)
- CPU stays high (100%):
- maybe frequency drop when load is high: not enough wattage/power
- CPU simulators? Digital Ocean
@see https://www.youtube.com/watch?v=wGSSUSeaLgA
Written on February 16, 2026
