Atomics in cache lines
Preamble
I wanted to understand how cache lines affect the performance of atomic variables, and in particular how bad things can get when false sharing occurs. So, I wrote a few C++ benchmarks to check this out.
Simple benchmarks
The code here has structs that keep std::atomic<int> members together or apart in memory.
I've used alignas(std::hardware_destructive_interference_size) to keep the members apart. On my machine this constant is 64 bytes, the size of a cache line; std::hardware_destructive_interference_size is explained here.
GCC also emits the warning below, which we can ignore for now. It's telling me to be careful when using this constant:
/home/nolan-veed/nolan-veed/benchmarks/test_cache.cpp:14:18: warning: use of ‘std::hardware_destructive_interference_size’ [-Winterference-size]
   14 |     alignas(std::hardware_destructive_interference_size) std::atomic<int> v0 = 0;
      |                  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/nolan-veed/nolan-veed/benchmarks/test_cache.cpp:14:18: note: its value can vary between compiler versions or with different ‘-mtune’ or ‘-mcpu’ flags
/home/nolan-veed/nolan-veed/benchmarks/test_cache.cpp:14:18: note: if this use is part of a public ABI, change it to instead use a constant variable you define
/home/nolan-veed/nolan-veed/benchmarks/test_cache.cpp:14:18: note: the default value for the current CPU tuning is 64 bytes
/home/nolan-veed/nolan-veed/benchmarks/test_cache.cpp:14:18: note: you can stabilize this value with ‘--param hardware_destructive_interference_size=64’, or disable this warning with ‘-Wno-interference-size’
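For illustration, the two layouts could look something like the sketch below. Only the alignas(...) std::atomic<int> v0 member is visible in the warning above; the struct names, the other members and the kCacheLine constant are hypothetical. The sketch follows the compiler's suggestion of defining your own constant, and assumes 64-byte cache lines.

```cpp
#include <atomic>
#include <cstddef>

// Assumption: 64-byte cache lines, the value GCC reports in the note above.
constexpr std::size_t kCacheLine = 64;

// Hypothetical "together" layout: the members are adjacent, so the whole
// struct is small enough to land inside a single cache line.
struct AtomicsTogether {
    std::atomic<int> v0{0};
    std::atomic<int> v1{0};
    std::atomic<int> v2{0};
};

// Hypothetical "apart" layout: every member starts on its own cache line.
struct AtomicsApart {
    alignas(kCacheLine) std::atomic<int> v0{0};
    alignas(kCacheLine) std::atomic<int> v1{0};
    alignas(kCacheLine) std::atomic<int> v2{0};
};

static_assert(sizeof(AtomicsTogether) <= kCacheLine,
              "together: all members fit in one cache line");
static_assert(sizeof(AtomicsApart) == 3 * kCacheLine,
              "apart: one cache line per member");
static_assert(alignof(AtomicsApart) == kCacheLine,
              "apart: the struct itself is cache-line aligned");
```

Using a self-defined constant trades portability for a stable layout: the padding no longer changes with compiler version or tuning flags, but it has to be updated by hand for a CPU with a different line size.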
The benchmarks then use one, two or three threads, each incrementing its own value within the struct 10 million times, independently and in parallel.
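A rough sketch of that measurement is shown below, using plain std::thread and std::chrono instead of a benchmark framework. Everything here is an assumption made for illustration: the struct names, the hammer and run_parallel helpers, the 64-byte constant, and the use of fetch_add with the default sequentially consistent ordering. Note that the measured time includes thread start-up and join.

```cpp
#include <atomic>
#include <chrono>
#include <cstddef>
#include <thread>
#include <vector>

constexpr std::size_t kCacheLine = 64;  // assumption: 64-byte cache lines

struct AtomicsTogether {                // hypothetical: members share a line
    std::atomic<int> v0{0}, v1{0}, v2{0};
};

struct AtomicsApart {                   // hypothetical: one line per member
    alignas(kCacheLine) std::atomic<int> v0{0};
    alignas(kCacheLine) std::atomic<int> v1{0};
    alignas(kCacheLine) std::atomic<int> v2{0};
};

// Increment a single atomic `count` times.
inline void hammer(std::atomic<int>& v, int count) {
    for (int i = 0; i < count; ++i) {
        v.fetch_add(1);
    }
}

// Start one thread per target, each incrementing only its own atomic,
// and return the elapsed wall-clock time in milliseconds.
inline double run_parallel(const std::vector<std::atomic<int>*>& targets,
                           int count) {
    const auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> threads;
    threads.reserve(targets.size());
    for (std::atomic<int>* target : targets) {
        threads.emplace_back([target, count] { hammer(*target, count); });
    }
    for (std::thread& t : threads) {
        t.join();
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

With an AtomicsTogether instance s, a call such as run_parallel({&s.v0, &s.v1}, 10000000) would correspond to the two-thread "together" case; the same call on an AtomicsApart instance gives the "apart" case. On some toolchains this needs -pthread at link time.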
Result
The results are reproduced here. The columns are the benchmark name, wall-clock time, CPU time and iteration count:
BM_CacheAtomicOneThread               42303676 ns        42831 ns          100
BM_CacheAtomicsTogetherTwoThreads    180052333 ns        50527 ns          100
BM_CacheAtomicsTogetherThreeThreads  277200032 ns        73196 ns           10
BM_CacheAtomicsApartTwoThreads        42150153 ns        54341 ns          100
BM_CacheAtomicsApartThreeThreads      42466279 ns        59058 ns          100
The behaviour was roughly what I was expecting. It's pretty normal: the cache-coherency protocol kicks in, and every write from one core invalidates the copy of that cache line held by the other cores. But, it's nice to get a "feel" for how long things actually take on my machine.
- It takes about 42 ms to increment a std::atomic<int> 10 million times.
- For the benchmarks where the atomics are together, false sharing occurs and the time taken to increment them increases with thread count: about 180 ms with two threads (~4x) and about 277 ms with three threads (~7x).
- For the benchmarks where the atomics are apart, false sharing does not occur, so the time taken stays at about 42 ms as the thread count increases.
In theory, as I have 16 logical cores (8 physical cores), incrementing 8 atomics that are apart should take about as long as incrementing 1 atomic, perhaps with a small overhead for every extra thread. Something for a future post, perhaps. We'll see.
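One way that follow-up experiment could be set up is sketched below. It has not been run for this post and no timings are claimed; the PaddedCounter type, the time_padded helper and both constants are hypothetical, and 64-byte cache lines are assumed.

```cpp
#include <array>
#include <atomic>
#include <chrono>
#include <cstddef>
#include <thread>
#include <vector>

constexpr std::size_t kCacheLine = 64;   // assumption: 64-byte cache lines
constexpr std::size_t kMaxCounters = 8;  // one per physical core in the text

// Hypothetical padded counter: each instance occupies a full cache line.
struct PaddedCounter {
    alignas(kCacheLine) std::atomic<int> value{0};
};
static_assert(sizeof(PaddedCounter) == kCacheLine,
              "each counter fills exactly one cache line");

// Increment the first `n` counters from `n` threads, one counter per thread,
// and return the elapsed wall-clock time in milliseconds.
inline double time_padded(std::array<PaddedCounter, kMaxCounters>& counters,
                          std::size_t n, int count) {
    if (n > counters.size()) {
        n = counters.size();  // never start more threads than counters
    }
    const auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> threads;
    threads.reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
        std::atomic<int>* target = &counters[i].value;
        threads.emplace_back([target, count] {
            for (int k = 0; k < count; ++k) {
                target->fetch_add(1);
            }
        });
    }
    for (std::thread& t : threads) {
        t.join();
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Calling time_padded with n from 1 to 8 and comparing the returned times would show whether the per-thread overhead really stays small. The operating system decides where threads run, so two threads may still end up on the same physical core unless they are pinned.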