A note on the performance of move-constructing

Everybody likes to move it move it, including King Julien of Madagascar cartoon fame. Unconditionally. But I have heard at least twice that move-constructed objects can hurt performance (compared to plain copy-constructed objects, that is). The claim is that moving causes extra memory diffusion, which worsens data access. But I have not been shown any hard data, nor could I find any.
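
One plausible reading of the "memory diffusion" claim: a move-constructed container keeps the heap buffer of its source, wherever that buffer happened to be allocated, while a copy gets a fresh allocation. Below is a minimal sketch of that mechanism, assuming std::vector and a mainstream standard library. The helper names are hypothetical, and the sketch shows only where the data ends up, not any performance effect.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical helper: does move-construction reuse the source's heap buffer?
inline bool move_reuses_buffer(std::vector<int> src) {
    assert(!src.empty());
    const int* before = src.data();
    std::vector<int> moved(std::move(src));  // takes over the existing buffer
    return moved.data() == before;
}

// Hypothetical helper: does copy-construction allocate a separate buffer?
inline bool copy_allocates_new_buffer(const std::vector<int>& src) {
    assert(!src.empty());
    std::vector<int> copied(src);            // allocates, then copies the elements
    return copied.data() != src.data();
}
```

If the claim holds, it would be because a set of moved-from-elsewhere objects inherits whatever memory layout the sources had; only a measurement on real data can tell whether that matters.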

40 ms delays

If you are plagued by mysterious delays in your TCP traffic, especially light traffic, even local traffic: consider the Nagle algorithm. In this case, the observed delays are 40 ms.
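
A common way to rule the Nagle algorithm in or out is to disable it on the socket with the TCP_NODELAY option. Below is a minimal sketch for POSIX sockets. The helper names are hypothetical, and whether disabling Nagle is the right fix depends on the traffic pattern.

```cpp
#include <cassert>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

// Hypothetical helper: turn off the Nagle algorithm on a TCP socket.
// Returns true on success.
inline bool disable_nagle(int fd) {
    assert(fd >= 0);
    int one = 1;
    return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one)) == 0;
}

// Hypothetical helper: report whether TCP_NODELAY is currently set.
inline bool nagle_disabled(int fd) {
    assert(fd >= 0);
    int value = 0;
    socklen_t len = sizeof(value);
    if (getsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &value, &len) != 0) {
        return false;
    }
    return value != 0;
}
```

Call disable_nagle(fd) right after socket() or accept(), before the first small write, and compare the latency with and without it.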

Memory model verification tools

  • Promela + mmlib + trio additions
    • Promela is architecture-agnostic so this is quite a stretch
  • c11tester-llvm
    • unmaintained
    • the pass can be compiled against llvm8
      • rumored to compile against llvm11 but I could not
      • there is an llvm12 branch; I have not tried it
      • seems to be possible to compile it as an out-of-tree pass but I have not tried it
      • llvm17 and later have a new pass manager; this would be the end of life for the c11tester pass
  • cdschecker
    • prefer relacy; too many false positives (see the shared_mutex tests)
    • useable but not with coroutines
    • original is unmaintained
  • relacy
    • to be preferred over other tools
    • still maintained: dvyukov and ccotter
    • dated codebase but useable
  • genmc
    • maintained
    • llvm C11 only
  • cppmem
  • nitpick
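
For context, this is the kind of small concurrent test these tools are meant to check: a message-passing litmus test written with plain C++11 atomics. It is only a sketch; each tool has its own harness and API, which is not shown here, and the function name is hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Message passing: the writer publishes `payload` through the `ready` flag.
// Returns the value the reader observed.
inline int message_passing_once() {
    int payload = 0;                  // plain, non-atomic data
    std::atomic<bool> ready{false};
    int seen = -1;

    std::thread writer([&] {
        payload = 42;
        ready.store(true, std::memory_order_release);   // publish
    });
    std::thread reader([&] {
        while (!ready.load(std::memory_order_acquire)) {
            std::this_thread::yield();                   // wait for the flag
        }
        seen = payload;               // ordered after the acquire load
    });

    writer.join();
    reader.join();
    assert(seen == 42);               // the property a checker would verify
    return seen;
}
```

A plain run exercises only a handful of interleavings. A model checker explores the executions the memory model allows, so weakening either ordering to relaxed should be reported as a data race on payload.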

TAU - Notes about the Tuning and Analysis Utilities Kit

  • Optional: modules
  • Download it
  • Optional, if some dependencies cannot be installed via the distro: PDT.
    • Download PDT
    • Untar the ext.tgz in the top level TAU directory: tar zxf ext.tgz. It will create an external_dependencies folder.
      • PDT can be installed using ./configure ; make ; make install
  • Optional: ./tau_setup
  • ./configure -prefix=/opt/tau && make
    • results will be in /opt/tau/x86_64/bin or similar arch folder.
    • add the bin folder to the PATH
  • ./tau_validate -v /opt/tau/x86_64
  • Optional: ./upgradetau /path/to/old/source
  • Use (see doc)

Use int as a loop variable

A little gem, courtesy of Fedor Pikus & team: use int as a loop variable, even though, most of the time, using an unsigned variable is the logical thing. There is some performance to be gained from the signed variable. For x86, the case is clear (YMMV with another CPU or compiler). For ARM, there is still a little performance that can be squeezed out (YMMV, etc.). Though the difference there is within the range of statistical noise, it always tilted in favor of int in all the runs I made.
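
The two loop shapes being compared look like the sketch below. The usual explanation is that signed overflow is undefined behavior, so the compiler may assume the index never wraps, while unsigned wraparound is well defined and must be preserved. Whether this changes the generated code for a given loop is something to check per compiler and target. The function names are hypothetical.

```cpp
#include <cassert>
#include <climits>
#include <vector>

// Signed loop variable: the compiler may assume `i` never overflows.
inline long long sum_signed(const std::vector<int>& v) {
    assert(v.size() <= static_cast<std::vector<int>::size_type>(INT_MAX));
    const int n = static_cast<int>(v.size());
    long long total = 0;
    for (int i = 0; i < n; ++i) {
        total += v[i];
    }
    return total;
}

// Unsigned loop variable: wraparound modulo 2^N is defined behavior
// that the compiler has to keep intact.
inline long long sum_unsigned(const std::vector<int>& v) {
    assert(v.size() <= static_cast<std::vector<int>::size_type>(UINT_MAX));
    const unsigned n = static_cast<unsigned>(v.size());
    long long total = 0;
    for (unsigned i = 0; i < n; ++i) {
        total += v[i];
    }
    return total;
}
```

Both functions return the same result; any difference shows up only in the generated code and the timings.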

Lock timings (II)

The std::shared_mutex (aka the readers-writer lock, taken for reading via std::shared_lock) is more complex code than std::mutex. As such, I would expect it to be slower, unless (maybe):

  • it is protecting a structure which is read more often than written. Writing is serialized anyway but reading can be parallelized.
  • the read contention is high
  • the critical section to be protected is sizeable
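
A minimal C++17 sketch of the read-mostly case described above: writers take the std::shared_mutex exclusively, readers share it. The Registry class and its members are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <map>
#include <mutex>
#include <shared_mutex>
#include <string>

// Hypothetical read-mostly table guarded by a readers-writer lock.
class Registry {
public:
    // Writers hold the lock exclusively, so writes are serialized.
    void set(const std::string& key, int value) {
        std::unique_lock<std::shared_mutex> lock(mutex_);
        table_[key] = value;
    }

    // Readers hold the lock in shared mode and can run in parallel.
    int get(const std::string& key, int fallback) const {
        std::shared_lock<std::shared_mutex> lock(mutex_);
        const auto it = table_.find(key);
        return it == table_.end() ? fallback : it->second;
    }

private:
    mutable std::shared_mutex mutex_;
    std::map<std::string, int> table_;
};

// Usage sketch: one write, then reads that could come from many threads.
inline int registry_demo() {
    Registry r;
    r.set("answer", 42);
    assert(r.get("missing", -1) == -1);
    return r.get("answer", -1);
}
```

Swapping std::shared_mutex for std::mutex (and both lock types for std::lock_guard) gives the baseline to time against.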

Some lock timings (I)

I made a set of measurements for various lock types. These can be used as a guide, not as a full substitute for measurements in a given software context and on given hardware:

  • these measurements are only indicative for a different architecture.
  • these measurements are even less useful out of the software context where used (i.e. what is the lock used for; how long is it held; etc)
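
To repeat such measurements in your own context, a timing loop can be as small as the sketch below. It times only the uncontended lock/unlock path of any lock type usable with std::lock_guard. It is not the harness behind the numbers in this post, and the names are hypothetical.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <mutex>

struct LockTiming {
    std::uint64_t counter;             // final value; shows the loop body ran
    std::chrono::nanoseconds elapsed;  // wall time for all iterations
};

// Times `iterations` lock/unlock pairs around a tiny critical section.
template <typename Lock>
LockTiming time_uncontended(Lock& lock, std::uint64_t iterations) {
    assert(iterations > 0);
    volatile std::uint64_t counter = 0;  // volatile keeps the work from being optimized out
    const auto start = std::chrono::steady_clock::now();
    for (std::uint64_t i = 0; i < iterations; ++i) {
        std::lock_guard<Lock> guard(lock);
        counter = counter + 1;
    }
    const auto stop = std::chrono::steady_clock::now();
    return {counter,
            std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start)};
}
```

For example, `std::mutex m; auto t = time_uncontended(m, 1000000);` gives a per-operation cost of `t.elapsed.count() / 1000000.0` ns. Contended runs need several threads and a critical section that resembles the real one.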

Barriers usage

  • relaxed
    • for counters
    • when decrementing: use acquire-release if used for refcounting
  • release
    • for an index that has dependent data which must be visible once the index is updated.
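
The rules above can be sketched with std::atomic as follows. The type names are hypothetical, the log assumes a single writer, and the acquire load on the reader side is the usual counterpart of the release store rather than something stated in the list.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Plain event counter: nothing else depends on its value, so relaxed is enough.
struct EventCounter {
    std::atomic<unsigned long> value{0};
    void hit() { value.fetch_add(1, std::memory_order_relaxed); }
    unsigned long read() const { return value.load(std::memory_order_relaxed); }
};

// Reference count: increments can be relaxed, the decrement is acquire-release
// so that the thread dropping the last reference sees all earlier writes.
struct RefCount {
    std::atomic<int> refs{1};
    void add_ref() { refs.fetch_add(1, std::memory_order_relaxed); }
    // Returns true when the caller released the last reference.
    bool release_ref() { return refs.fetch_sub(1, std::memory_order_acq_rel) == 1; }
};

// Index with dependent data: write the slot first, then publish the index
// with release. Single writer assumed.
struct SmallLog {
    static constexpr std::size_t capacity = 8;
    int slots[capacity] = {};
    std::atomic<std::size_t> published{0};

    void append(int v) {
        const std::size_t i = published.load(std::memory_order_relaxed);
        assert(i < capacity);
        slots[i] = v;                                        // dependent data
        published.store(i + 1, std::memory_order_release);   // make it visible
    }
    // Number of slots a reader may safely look at.
    std::size_t readable() const { return published.load(std::memory_order_acquire); }
};
```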

A note on barriers costs

Here is a set of benchmarks for one Intel CPU and one ARM CPU. While you still need to measure for your specific platform (and application), I think this is a fair generalization:

  • On Intel: avoid std::memory_order_seq_cst and std::memory_order_acq_rel if you can.
  • On ARM: use std::memory_order_relaxed if you can; everything else is expensive and about the same.

Intel:

| Threads  | R&W:plain   | R&W:cst      | R:acq/W:rel  | R:cons/W:rel  | R&W:acqrel   | R&W:rlxd
| -------  | --------    | -------      | -----------  | ------------  | -------      | -----------
| 1        | 33127114    | 137253027    | 14527492     | 14390159      | 83419552     | 15516709
| 2        | 65544331    | 111611757    | 40904954     | 29246743      | 118626628    | 51504383
| 4        | 76579883    | 324397517    | 88833451     | 93580268      | 344201518    | 68938834
| 8        | 136478453   | 674507507    | 156113484    | 160151688     | 730754759    | 142440185
| 16       | 343907409   | 1520126869   | 309321524    | 307596361     | 1449977644   | 266249518
| 32       | 535688807   | 2960453288   | 630654026    | 556672636     | 2865121020   | 605446567
| 64       | 1223356722  | 5768121877   | 1142276229   | 1156412571    | 5832937534   | 1261543095
| 128      | 2539714413  | 11670599124  | 2204149831   | 2106698462    | 11955427568  | 2328507084

ARM:

| Threads  | R&W:plain   | R&W:cst      | R:acq/W:rel  | R:cons/W:rel  | R&W:rlxd
| -------  | --------    | -------      | -----------  | ------------  | -----------
| 1        | 37050747    | 64841628     | 64885685     | 64855760      | 22342727
| 2        | 83245493    | 853507322    | 1024283791   | 1042490241    | 49977590
| 4        | 75488640    | 2167635118   | 2298913752   | 2350279620    | 78062916
| 8        | 149902079   | 4411031159   | 4679633671   | 4687182030    | 157835993
| 16       | 297785589   | 8860286175   | 9465185940   | 9225091012    | 296109828
| 32       | 588377972   | 17933268280  | 18381362453  | 18402923175   | 596372276 
| 64       | 1175653706  | 36222742891  | 36580019179  | 37389743218   | 1192288348
| 128      | 2342565804  | 73656199471  | 73646965131  | 72336900816   | 2353713656
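
To get comparable numbers for your own platform, the load/store pairs named in the column headers (for example R:acq/W:rel) can be timed with a loop like the one below. This is a single-threaded sketch with hypothetical names, not the benchmark that produced the tables above. It covers only orderings that are valid for a plain load and a plain store, so it makes no guess about how the R&W:acqrel column was measured.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>

struct BarrierTiming {
    std::uint64_t final_value;  // equals `iterations`; shows the loop ran
    std::int64_t nanoseconds;   // wall time for all iterations
};

// Times `iterations` load/store pairs on one atomic with the given orderings.
template <std::memory_order LoadOrder, std::memory_order StoreOrder>
BarrierTiming time_load_store(std::uint64_t iterations) {
    static_assert(LoadOrder != std::memory_order_release &&
                  LoadOrder != std::memory_order_acq_rel,
                  "not a valid ordering for a load");
    static_assert(StoreOrder != std::memory_order_acquire &&
                  StoreOrder != std::memory_order_consume &&
                  StoreOrder != std::memory_order_acq_rel,
                  "not a valid ordering for a store");
    assert(iterations > 0);

    std::atomic<std::uint64_t> cell{0};
    const auto start = std::chrono::steady_clock::now();
    for (std::uint64_t i = 0; i < iterations; ++i) {
        const std::uint64_t seen = cell.load(LoadOrder);
        cell.store(seen + 1, StoreOrder);
    }
    const auto stop = std::chrono::steady_clock::now();
    const auto ns =
        std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start);
    return {cell.load(std::memory_order_relaxed),
            static_cast<std::int64_t>(ns.count())};
}
```

For example, `time_load_store<std::memory_order_acquire, std::memory_order_release>(1000000)` against `time_load_store<std::memory_order_seq_cst, std::memory_order_seq_cst>(1000000)` shows the relative cost on the machine at hand. A multi-threaded variant would share `cell` between threads.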
