Atomics in cache lines

Preamble

I wanted to understand how cache lines affect the performance of atomic variables. In particular, I wanted to see how bad things can get when false sharing is involved. So, I wrote a few C++ benchmarks to check this out.

Simple benchmarks

The code here has structs that keep std::atomic<int> members together or apart in memory.

I've used alignas(std::hardware_destructive_interference_size) to keep the members apart. On my machine a cache line is 64 bytes, which is the value std::hardware_destructive_interference_size reports; it's explained here.
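
The post doesn't reproduce the full source, but a minimal sketch of the two layouts, assuming hypothetical names like AtomicsTogether and AtomicsApart, looks something like this:

#include <atomic>
#include <new> // std::hardware_destructive_interference_size

// All three counters fit within a single 64-byte cache line.
struct AtomicsTogether {
    std::atomic<int> v0 = 0;
    std::atomic<int> v1 = 0;
    std::atomic<int> v2 = 0;
};

// alignas pushes each counter onto the start of its own cache line.
struct AtomicsApart {
    alignas(std::hardware_destructive_interference_size) std::atomic<int> v0 = 0;
    alignas(std::hardware_destructive_interference_size) std::atomic<int> v1 = 0;
    alignas(std::hardware_destructive_interference_size) std::atomic<int> v2 = 0;
};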

I do get a warning, which we can ignore for now, but it's telling me to be careful when using this constant:

/home/nolan-veed/nolan-veed/benchmarks/test_cache.cpp:14:18: warning: use of ‘std::hardware_destructive_interference_size’ [-Winterference-size]
   14 |     alignas(std::hardware_destructive_interference_size) std::atomic<int> v0 = 0;
      |                  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/nolan-veed/nolan-veed/benchmarks/test_cache.cpp:14:18: note: its value can vary between compiler versions or with different ‘-mtune’ or ‘-mcpu’ flags
/home/nolan-veed/nolan-veed/benchmarks/test_cache.cpp:14:18: note: if this use is part of a public ABI, change it to instead use a constant variable you define
/home/nolan-veed/nolan-veed/benchmarks/test_cache.cpp:14:18: note: the default value for the current CPU tuning is 64 bytes
/home/nolan-veed/nolan-veed/benchmarks/test_cache.cpp:14:18: note: you can stabilize this value with ‘--param hardware_destructive_interference_size=64’, or disable this warning with ‘-Wno-interference-size’
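
If this struct were ever part of a public ABI, the note's own suggestion applies: define a constant of your own instead of using the library value directly. A sketch, assuming the 64-byte default reported above:

#include <atomic>
#include <cstddef>

// Pin the value down ourselves so it can't drift across compiler versions
// or -mtune/-mcpu flags; 64 is the default the note reports on this machine.
inline constexpr std::size_t kCacheLineSize = 64;

struct AtomicsApartStable {
    alignas(kCacheLineSize) std::atomic<int> v0 = 0;
    alignas(kCacheLineSize) std::atomic<int> v1 = 0;
};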

The benchmarks then run one, two, or three threads in parallel, each thread independently incrementing its own value within the struct 10 million times.
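
The output below is in Google Benchmark format, so the two-thread "together" case plausibly looks something like this stripped-down sketch (the loop structure and names are my guesses, not the post's actual code; AtomicsTogether is the struct sketched earlier):

#include <benchmark/benchmark.h>
#include <thread>

constexpr int kIncrements = 10'000'000;

AtomicsTogether shared;

static void BM_CacheAtomicsTogetherTwoThreads(benchmark::State& state) {
    for (auto _ : state) {
        // Each thread increments only its own counter, but both counters
        // live on the same cache line, so the writes contend anyway.
        std::thread t0([] { for (int i = 0; i < kIncrements; ++i) ++shared.v0; });
        std::thread t1([] { for (int i = 0; i < kIncrements; ++i) ++shared.v1; });
        t0.join();
        t1.join();
    }
}
BENCHMARK(BM_CacheAtomicsTogetherTwoThreads);

BENCHMARK_MAIN();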

Result

The results, reproduced here, show the effect clearly:

Benchmark                                  Time             CPU   Iterations
-----------------------------------------------------------------------------
BM_CacheAtomicOneThread               42303676 ns        42831 ns          100
BM_CacheAtomicsTogetherTwoThreads    180052333 ns        50527 ns          100
BM_CacheAtomicsTogetherThreeThreads  277200032 ns        73196 ns           10
BM_CacheAtomicsApartTwoThreads        42150153 ns        54341 ns          100
BM_CacheAtomicsApartThreeThreads      42466279 ns        59058 ns          100

The behaviour was roughly what I expected. With the counters on the same cache line, the cache-coherency protocol kicks in: every write by one thread invalidates the line in the other cores' caches, forcing them to re-fetch it. Still, it's nice to get a "feel" for how long things actually take on my machine.

In theory, as I have 16 logical cores (8 physical), incrementing 8 atomics that are kept apart should take about the same time as incrementing 1 atomic, with perhaps a small overhead for each extra thread. Something for a future post, perhaps. We'll see.