A note on the performance of move-constructing

Everybody likes to move it move it, including King Julien of Madagascar cartoon fame. Unconditionally. But I have heard at least twice that move-constructed objects can negatively impact performance (compared to plain copy-constructed objects, that is). The argument is that moving scatters the data in memory, which worsens data access. But I have not been shown any hard data, nor could I find any.

Locally turning off optimizations

[[gnu::optimize("O0")]] int f(...) {...}
// also see gnu::noinline
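A minimal sketch of the two attributes combined, assuming GCC. The function name `sum_to` is hypothetical and only there for illustration. Other compilers may ignore `gnu::optimize` with a warning; clang has its own `[[clang::optnone]]`.

```cpp
// GCC-specific attributes: compile this one function at -O0 and keep it out of line,
// whatever optimization level the rest of the translation unit uses.
// sum_to is a hypothetical example function, not taken from the original note.
[[gnu::optimize("O0"), gnu::noinline]]
int sum_to(int n)
{
    int total = 0;
    for (int i = 1; i <= n; ++i) {
        total += i;  // remains an actual loop when this function is built at -O0
    }
    return total;
}
```

This is handy when stepping through one function in a debugger while everything else stays optimized.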

Performance stuff

final, noexcept & more

boost concurrent_flat_map performance

How does boost::concurrent_flat_map compare to a std::unordered_map protected by a readers-writer lock:

x86_64:
- write: concurrent map wins
- read: unordered map wins by a large margin (!). Caveat emptor if the code is read-intensive.
ARM:
- write: concurrent map wins by a large margin
- read: concurrent map wins
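For reference, the "unordered map protected by a readers-writer lock" side of the comparison can be sketched as below. This is an assumed shape, not the code that was benchmarked; `LockedMap`, `put` and `get` are hypothetical names.

```cpp
#include <mutex>
#include <optional>
#include <shared_mutex>
#include <string>
#include <unordered_map>

// Hypothetical wrapper: a std::unordered_map guarded by a std::shared_mutex.
class LockedMap
{
public:
    void put(const std::string& key, int value)
    {
        std::unique_lock<std::shared_mutex> lock(mtx_);  // writers are exclusive
        map_[key] = value;
    }

    std::optional<int> get(const std::string& key) const
    {
        std::shared_lock<std::shared_mutex> lock(mtx_);  // readers may overlap
        auto it = map_.find(key);
        if (it == map_.end()) {
            return std::nullopt;
        }
        return it->second;
    }

private:
    mutable std::shared_mutex mtx_;
    std::unordered_map<std::string, int> map_;
};
```

Reads only take the shared side of the lock, which is the case where this arrangement came out ahead on x86_64 in the numbers above.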

40 ms delays

If you are the victim of mysterious delays in your TCP traffic, especially with light traffic and even for local traffic: consider the Nagle algorithm. In this case, the observed delays are 40 ms.
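The usual way to take the Nagle algorithm out of the picture is the `TCP_NODELAY` socket option. A minimal sketch for POSIX sockets follows; the helper names `disable_nagle` and `nagle_disabled` are hypothetical.

```cpp
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

// Hypothetical helper: turn off the Nagle algorithm on a TCP socket.
// Returns true when setsockopt succeeded.
bool disable_nagle(int fd)
{
    int on = 1;
    return ::setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on)) == 0;
}

// Hypothetical helper: report whether TCP_NODELAY is currently set on the socket.
bool nagle_disabled(int fd)
{
    int value = 0;
    socklen_t len = sizeof(value);
    if (::getsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &value, &len) != 0) {
        return false;
    }
    return value != 0;
}
```

Whether this is the right fix depends on the traffic pattern: with Nagle disabled, many small writes become many small packets, so batching the writes in the application may still be worth it.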

constexpr issues of note

(Romeo/Khlebnikov/Meredith):

Just use const string ref as arg

I could not measure any performance gain from using std::string_view as a function argument instead of a const std::string&

Memory model verification tools

Promela + mmlib + trio additions
- Promela is architecture-agnostic so this is quite a stretch
c11tester-llvm
- unmaintained
- the pass can be compiled against llvm8
  - rumored to compile against llvm11 but I could not get it to
  - there is an llvm12 branch; I have not tried it
  - it seems possible to compile it out-of-tree but I have not tried it
  - llvm17 and later have a new pass manager - this would be the end of life for the c11tester pass
cdschecker
- prefer relacy; too many false positives (see the shared_mutex tests)
- useable but not with coroutines
- original is unmaintained
relacy
- to be preferred over other tools
- still maintained: dvyukov and ccotter
- dated codebase but useable
genmc
- maintained
- llvm C11 only
cppmem
nitpick

c11tester notes

Notes about c11tester-llvm and details of my experiments with it

Upml - a tool to help with formal verification of UML state machines

Upml is a tool to transform a PlantUML state machine specification into either a ProMeLa or a PlusCal specification suitable for formal verification.

Half or double at the performance lottery

I have been tipped off how to speed up map searches when the key is a string: use transparent comparators.
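A minimal sketch of the idea with `std::map`, assuming C++17 for `std::string_view`. The names `Table` and `lookup` are hypothetical.

```cpp
#include <functional>
#include <map>
#include <string>
#include <string_view>

// std::less<> (that is, std::less<void>) is a transparent comparator: find() then
// accepts any type comparable with std::string, so a lookup by string_view or by a
// string literal does not build a temporary std::string.
using Table = std::map<std::string, int, std::less<>>;

// Hypothetical helper: returns fallback when the key is absent.
int lookup(const Table& table, std::string_view key, int fallback)
{
    auto it = table.find(key);  // heterogeneous lookup
    return it != table.end() ? it->second : fallback;
}
```

With the default `std::less<std::string>`, the same `find` call would first have to convert the key to a `std::string`. For the unordered containers, the equivalent needs C++20 and both a transparent hash and a transparent equality predicate.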

jemalloc - find leaks

TAU - Notes about the Tuning and Analysis Utilities Kit

Optional: modules
Download it¹
Optional, if some dependencies cannot be installed via the distro: PDT.
- Download PDT
- Untar the ext.tgz in the top level TAU directory: tar zxf ext.tgz. It will create an external_dependencies folder.
  - PDT can be installed using ./configure ; make ; make install
Optional: ./tau_setup
./configure -prefix=/opt/tau && make
- results will be in /opt/tau/x86_64/bin or similar arch folder.
- add the bin folder to the PATH
./tau_validate -v /opt/tau/x86_64
Optional: ./upgradetau /path/to/old/source
Use (see doc)

¹ https://www.cs.uoregon.edu/research/tau/home.php

ARM - why not to construct a shared L3 cache

Summary:

keeping the L3 cache line size the same as L1/2 is a good idea
a line of 256 bytes is too long: fetching 256 bytes per line instead of 64 has an extra time/bandwidth cost, and false sharing gets really annoying.
sharing L3 between CPUs is not a good idea

Basic enum reflection

As basic and simple as it can be¹. For a full reflection with heavy compiler torture and spending a basket of CPU cycles for it, use magic_enum or reflect-cpp or similar.

¹ https://github.com/melintea/lpt-tools/blob/main/include/lpt/enum_tools.hpp

Use int as a loop variable

A little gem, courtesy of Fedor Pikus & team: use int as a loop variable, even though, most of the time, an unsigned variable is the logical choice. There is some performance to be gained from the signed variable. For x86, the case is clear (YMMV with another CPU or compiler). For ARM, there is still a little performance that can be squeezed out (YMMV, etc.). Though the difference there is within the range of statistical noise, it was tilted in favor of the int in all the runs I made.
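The explanation usually given is that signed overflow is undefined behavior, so the compiler may assume a signed index never wraps, while unsigned arithmetic must wrap modulo 2^N. A minimal sketch of the two variants to compare follows; the function names are hypothetical and this is not the benchmark behind the numbers above.

```cpp
// Signed index: the compiler may assume offset + i does not overflow and can
// treat it as a plain pointer-sized induction variable.
long sum_signed(const int* data, int offset, int n)
{
    long total = 0;
    for (int i = 0; i < n; ++i) {
        total += data[offset + i];
    }
    return total;
}

// Unsigned index: offset + i is defined to wrap modulo 2^32 (for a 32-bit unsigned),
// and the generated code has to preserve that wrap wherever it could be observed.
long sum_unsigned(const int* data, unsigned offset, unsigned n)
{
    long total = 0;
    for (unsigned i = 0; i < n; ++i) {
        total += data[offset + i];
    }
    return total;
}
```

Both functions compute the same result for in-range arguments; any difference is in the generated code, so compare the assembly and measure on the target CPU and compiler.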

Lock timings (II)

The std::shared_mutex (aka the readers-writer lock) is more complex code than the std::mutex. As such, I would expect it to be slower, unless (maybe):

it is protecting a structure which is read more often than written. Writing is serialized anyway but reading can be parallelized.
the read contention is high
the critical section to be protected is sizeable

Likely to be a bit unlikely

C++20 has two new attributes to guide the compiler with branching: [[likely]] and [[unlikely]]. For the moment, there are reports of them being useful, including in the standard lib. But maybe not always so - here is a failure case. They failed me at least with clang 14 & gcc 11.4 (same ballpark results). Alternative [[unlikely]] explanation: the branch predictor is so good that it does not need any help. I ran a simple test¹.

¹ https://github.com/melintea/lpt-tools/blob/main/src/benchmark/unlikely.cpp
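For readers who have not used the attributes yet, here is a minimal sketch of the syntax, assuming a C++20 compiler. It is not the linked benchmark, and `checked_div` is a hypothetical function.

```cpp
// The attribute is attached to the statement of the branch it describes.
// checked_div is a hypothetical example: division by zero is the rare path.
int checked_div(int a, int b)
{
    if (b == 0) [[unlikely]] {
        return 0;  // rare error path, hinted as cold
    }
    return a / b;  // expected path
}
```

The hint only influences code layout and similar compiler decisions; it does not change what the function returns, so its effect can only be judged by measuring.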

Some lock timings (I)

I made a set of measurements for various lock types. These can be used as a guide, not as a full substitute for measurements in a given software context & hardware:

these measurements are only indicative for a different architecture.
these measurements are even less useful outside the software context where the lock is used (i.e. what the lock is used for, how long it is held, etc.)

Barriers usage

relaxed
- for counters
- when decrementing: use acquire-release if used for refcounting
release
- for an index that has dependent data which must be visible when the index is updated.
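The three cases above can be sketched as follows. All names (`hits`, `RefCounted`, `Log`) are hypothetical, the `Log` assumes a single writer thread, and the acquire load on the reader side is my addition to pair with the release store.

```cpp
#include <atomic>
#include <cstddef>

// relaxed: a plain event counter, no other data depends on its value.
std::atomic<unsigned long> hits{0};
void count_hit() { hits.fetch_add(1, std::memory_order_relaxed); }

// Refcounting: incrementing can be relaxed, decrementing uses acquire-release
// so the thread that drops the last reference sees all earlier writes to the object.
struct RefCounted
{
    std::atomic<int> refs{1};

    void add_ref() { refs.fetch_add(1, std::memory_order_relaxed); }

    // Returns true when the caller released the last reference.
    bool release() { return refs.fetch_sub(1, std::memory_order_acq_rel) == 1; }
};

// release: an index with dependent data. The slot is written first, then the
// index is published with release so a reader that acquires it sees the slot.
struct Log
{
    static constexpr std::size_t capacity = 16;
    int slots[capacity] = {};
    std::atomic<std::size_t> count{0};

    // Single writer assumed.
    bool push(int value)
    {
        const std::size_t i = count.load(std::memory_order_relaxed);
        if (i == capacity) {
            return false;
        }
        slots[i] = value;                               // dependent data
        count.store(i + 1, std::memory_order_release);  // publish
        return true;
    }

    long sum() const
    {
        const std::size_t n = count.load(std::memory_order_acquire);
        long total = 0;
        for (std::size_t i = 0; i < n; ++i) {
            total += slots[i];
        }
        return total;
    }
};
```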

A note on barriers costs

Here is a set of benchmarks for one Intel CPU and one ARM CPU. While you still need to measure for your specific platform (and application), I think this is a fair generalization:

On Intel: avoid std::memory_order_seq_cst and std::memory_order_acq_rel if you can.
On ARM: use std::memory_order_relaxed if you can; everything else is expensive and about the same.

Intel:

| Threads  | R&W:plain   | R&W:cst      | R:acq/W:rel  | R:cons/W:rel  | R&W:acqrel   | R&W:rlxd    |
| -------  | --------    | -------      | -----------  | ------------  | -------      | ----------- |
| 1        | 33127114    | 137253027    | 14527492     | 14390159      | 83419552     | 15516709    |
| 2        | 65544331    | 111611757    | 40904954     | 29246743      | 118626628    | 51504383    |
| 4        | 76579883    | 324397517    | 88833451     | 93580268      | 344201518    | 68938834    |
| 8        | 136478453   | 674507507    | 156113484    | 160151688     | 730754759    | 142440185   |
| 16       | 343907409   | 1520126869   | 309321524    | 307596361     | 1449977644   | 266249518   |
| 32       | 535688807   | 2960453288   | 630654026    | 556672636     | 2865121020   | 605446567   |
| 64       | 1223356722  | 5768121877   | 1142276229   | 1156412571    | 5832937534   | 1261543095  |
| 128      | 2539714413  | 11670599124  | 2204149831   | 2106698462    | 11955427568  | 2328507084  |

ARM:

| Threads  | R&W:plain   | R&W:cst      | R:acq/W:rel  | R:cons/W:rel  | R&W:rlxd
| -------  | --------    | -------      | -----------  | ------------  | -----------
| 1        | 37050747    | 64841628     | 64885685     | 64855760      | 22342727
| 2        | 83245493    | 853507322    | 1024283791   | 1042490241    | 49977590
| 4        | 75488640    | 2167635118   | 2298913752   | 2350279620    | 78062916
| 8        | 149902079   | 4411031159   | 4679633671   | 4687182030    | 157835993
| 16       | 297785589   | 8860286175   | 9465185940   | 9225091012    | 296109828
| 32       | 588377972   | 17933268280  | 18381362453  | 18402923175   | 596372276 
| 64       | 1175653706  | 36222742891  | 36580019179  | 37389743218   | 1192288348
| 128      | 2342565804  | 73656199471  | 73646965131  | 72336900816   | 2353713656

Random Bits

Lockfree gone wrong