ARM - why not to construct a shared L3 cache

Here is how 4-cores CPUs for Intel and ARM behave when the cache is churned hard. Both Intel and ARM caches have 64 bytes lines and 3 levels:

_config.yml _config.yml

The ARM is kind of struggling, to the point that a piece of software optimized for Intel has to be changed significantly:

  • a Intel-optimal locking mechanism is not optiaml anymore on ARM.
  • structures have to be realigned.
  • maybe algorithms running on these structures may have to be changed as well.

Luckily, the stdlib reports (correctly) a huge offset is needed on this ARM to avoid false sharing, so at least you have a hint:

Offset                                         Intel   ARM
---------------------------------------------  ------  ------
std::hardware_destructive_interference_size    64      256
std::hardware_constructive_interference_size   64      64

Per cppreference1, above values are for L1 (256 for that level?):

“These constants provide a portable way to access the L1 data cache line size”

I believe the reason of this poor performance is more likely the L3 cache, optional for ARMs2:

“All the cores in the cluster share the L3 cache”.

Written on June 1, 2024