NUMA performance notes

  • on NUMA systems, false sharing is a much broader concern:
    • ordinary false sharing is a concern at cache-line granularity
    • on NUMA, false sharing is also a concern at page granularity, because a page accessed from several nodes keeps getting migrated between them. Avoid accessing the same page from different NUMA nodes.
    • granularity: 4 KB pages on x86; on ARM the base page can be 4 KB, 16 KB or 64 KB. What about huge pages (2 MB)?
    • keep processes that share memory confined to the same NUMA node
    • use per-thread memory (see the node-pinning/first-touch sketch after this list)
    • avoid jumping around in memory
    • reported by VTune as front-end stall (instruction stall!)
  • atomics are slower than on older non-NUMA systems, read-modify-write operations in particular (see the per-thread counter sketch after this list)
  • use NUMA-aware thread pools
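
A minimal sketch of the per-thread memory / NUMA-aware thread pool idea, assuming libnuma is installed (build with -lnuma -pthread). One worker is pinned to each node and allocates its own buffer on that node, so pages are never shared across nodes and never need to migrate. The worker()/BUF_SIZE names and the one-worker-per-node layout are illustrative choices, not from the video.

```c
#include <numa.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define BUF_SIZE (64UL * 1024 * 1024)   /* 64 MB working set per worker */

static void *worker(void *arg) {
    int node = (int)(long)arg;

    /* Pin this thread to the CPUs of its NUMA node. */
    numa_run_on_node(node);

    /* Allocate the working set on the same node; alternatively rely on the
     * default local policy and let the first touch place the pages here. */
    unsigned char *buf = numa_alloc_onnode(BUF_SIZE, node);
    if (!buf)
        return NULL;

    memset(buf, 0, BUF_SIZE);           /* touch the pages from the owning node */
    /* ... node-local work on buf only ... */

    numa_free(buf, BUF_SIZE);
    return NULL;
}

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this system\n");
        return 1;
    }

    int nodes = numa_max_node() + 1;    /* one worker per NUMA node */
    pthread_t tid[nodes];

    for (int n = 0; n < nodes; n++)
        pthread_create(&tid[n], NULL, worker, (void *)(long)n);
    for (int n = 0; n < nodes; n++)
        pthread_join(tid[n], NULL);
    return 0;
}
```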
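
A sketch of sidestepping slow cross-node read-modify-write atomics: each thread bumps its own cache-line-padded counter and the totals are combined once after the join, instead of every thread hammering one shared counter with atomic_fetch_add. NTHREADS, ITERS, the 64-byte padding and the USE_SHARED_ATOMIC toggle are illustrative assumptions.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define NTHREADS 4
#define ITERS    10000000UL

/* One counter per thread, padded to a cache line to avoid false sharing. */
struct padded_counter {
    unsigned long value;
    char pad[64 - sizeof(unsigned long)];
};

static struct padded_counter per_thread[NTHREADS];
static atomic_ulong shared_counter;             /* the slow alternative */

static void *work(void *arg) {
    int id = (int)(long)arg;
    for (unsigned long i = 0; i < ITERS; i++) {
#ifdef USE_SHARED_ATOMIC
        atomic_fetch_add(&shared_counter, 1);   /* cross-node RMW: slow */
#else
        per_thread[id].value++;                 /* plain local increment */
#endif
    }
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];
    for (int t = 0; t < NTHREADS; t++)
        pthread_create(&tid[t], NULL, work, (void *)(long)t);

    unsigned long total = 0;
    for (int t = 0; t < NTHREADS; t++) {
        pthread_join(tid[t], NULL);
        total += per_thread[t].value;           /* combine once, at the end */
    }
    total += atomic_load(&shared_counter);      /* zero unless USE_SHARED_ATOMIC */
    printf("total = %lu\n", total);
    return 0;
}
```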

  • Linux kernel pressure points:
    • TLB shootdowns (inter-processor interrupts!) caused by:
      • NUMA page migrations
      • changes in memory mappings
      • memory access modes/protection
      • (explicit) changes to the in-kernel page table
      • use huge pages as relief: set /sys/kernel/mm/transparent_hugepage/enabled to always (the kernel then applies them as it sees fit; in practice you must request huge pages explicitly, see the madvise sketch after this list)
        • madvise(addr, size, MADV_HUGEPAGE); addr must be page (4 KB) aligned; size should be a multiple of 2 MB
        • the memory must come from mmap()
    • disable NUMA migration:
      • use numactl (or libnuma from inside the program; see the binding sketch after this list)
      • bind the program/thread to a single NUMA node (or set of nodes)
      • turn off NUMA balancing: write 0 to /proc/sys/kernel/numa_balancing
      • the program as a whole is likely to be slower now
  • low CPU usage on a NUMA system at the OS level (idle time): likely IO bound
    • try lowering kernel.sched_migration_cost_ns
    • move processes that use the network NIC to node 0 (the NIC sits on the PCIe lanes of node 0; see the NIC-locality sketch after this list)
  • CPU stays high (100%):
    • possibly a frequency drop under high load: not enough power/wattage budget
  • CPU simulators? Digital Ocean
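
A sketch of requesting huge pages explicitly, assuming /sys/kernel/mm/transparent_hugepage/enabled is set to always or madvise. The 2 MB huge-page size and the 64 MB buffer are x86-64 example values, not from the notes.

```c
#define _DEFAULT_SOURCE          /* for MADV_HUGEPAGE */
#include <stdio.h>
#include <sys/mman.h>

#define HUGE_PAGE (2UL * 1024 * 1024)
#define BUF_SIZE  (32 * HUGE_PAGE)          /* multiple of 2 MB */

int main(void) {
    /* The memory must come from mmap(); heap memory from malloc() is not
     * guaranteed to be placed or aligned so the kernel can use huge pages. */
    void *buf = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }

    /* Ask the kernel to back this range with huge pages where it can;
     * fewer page-table entries means fewer TLB misses and shootdowns. */
    if (madvise(buf, BUF_SIZE, MADV_HUGEPAGE) != 0)
        perror("madvise(MADV_HUGEPAGE)");   /* non-fatal: falls back to 4 KB */

    /* ... use buf ... */

    munmap(buf, BUF_SIZE);
    return 0;
}
```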
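
A sketch of numactl-style binding done from inside the program with libnuma; running `numactl --cpunodebind=0 --membind=0 ./prog` would achieve roughly the same from the outside. Node 0 is an arbitrary example; build with -lnuma.

```c
#include <numa.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this system\n");
        return 1;
    }

    /* Run only on node 0's CPUs ... */
    numa_run_on_node(0);

    /* ... and allocate only from node 0's memory. */
    struct bitmask *nodes = numa_allocate_nodemask();
    numa_bitmask_setbit(nodes, 0);
    numa_set_membind(nodes);
    numa_bitmask_free(nodes);

    /* Rest of the program: no pages to migrate and no cross-node traffic,
     * but also no access to the other nodes' CPUs and memory, so the
     * program as a whole may end up slower. */
    return 0;
}
```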
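
A sketch for the NIC-locality point: the NUMA node a PCIe device hangs off is exposed in sysfs, so a launcher can read it and bind the networking processes to that node instead of assuming node 0. The interface name eth0 is an example; a value of -1 means the platform reports no NUMA affinity for the device.

```c
#include <stdio.h>

int main(void) {
    const char *path = "/sys/class/net/eth0/device/numa_node";
    FILE *f = fopen(path, "r");
    if (!f) { perror(path); return 1; }

    int node = -1;
    if (fscanf(f, "%d", &node) == 1)
        printf("eth0 is attached to NUMA node %d\n", node);
    fclose(f);
    return 0;
}
```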

@see https://www.youtube.com/watch?v=wGSSUSeaLgA

Written on February 16, 2026