Atomics in cache lines
Preamble
I wanted to understand how cache lines affect the performance of atomic variables, and in particular how bad things can get when false sharing is involved. So, I wrote a few C++ benchmarks to check this out.
Simple benchmarks
The code here has structs that keep std::atomic<int> members either together or apart in memory.
I've used alignas(std::hardware_destructive_interference_size) to keep the members apart. A cache line is 64 bytes on my machine, which is the value given by std::hardware_destructive_interference_size (explained here).
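To make this concrete, here's a minimal sketch of what the two structs could look like (the struct names and the extra members are my own; the v0 declaration is the one shown in the warning below):

```cpp
#include <atomic>
#include <new>  // std::hardware_destructive_interference_size

// Members packed next to each other, so they end up on the same cache line.
struct AtomicsTogether {
    std::atomic<int> v0 = 0;
    std::atomic<int> v1 = 0;
    std::atomic<int> v2 = 0;
};

// Each member aligned to its own cache line (64 bytes here), so threads
// incrementing different members never touch the same line.
struct AtomicsApart {
    alignas(std::hardware_destructive_interference_size) std::atomic<int> v0 = 0;
    alignas(std::hardware_destructive_interference_size) std::atomic<int> v1 = 0;
    alignas(std::hardware_destructive_interference_size) std::atomic<int> v2 = 0;
};
```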
I do get a warning as well, which we can ignore for now, but it tells me to be careful when using this constant:
/home/nolan-veed/nolan-veed/benchmarks/test_cache.cpp:14:18: warning: use of ‘std::hardware_destructive_interference_size’ [-Winterference-size]
14 | alignas(std::hardware_destructive_interference_size) std::atomic<int> v0 = 0;
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/nolan-veed/nolan-veed/benchmarks/test_cache.cpp:14:18: note: its value can vary between compiler versions or with different ‘-mtune’ or ‘-mcpu’ flags
/home/nolan-veed/nolan-veed/benchmarks/test_cache.cpp:14:18: note: if this use is part of a public ABI, change it to instead use a constant variable you define
/home/nolan-veed/nolan-veed/benchmarks/test_cache.cpp:14:18: note: the default value for the current CPU tuning is 64 bytes
/home/nolan-veed/nolan-veed/benchmarks/test_cache.cpp:14:18: note: you can stabilize this value with ‘--param hardware_destructive_interference_size=64’, or disable this warning with ‘-Wno-interference-size’
The benchmarks then use one, two, or three threads, each incrementing one of one, two, or three values within the structs 10 million times, independently and in parallel.
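The output below looks like Google Benchmark, so here's a minimal sketch of how the two-threads "apart" case might be written (the benchmark body is my guess, reusing the AtomicsApart struct sketched above):

```cpp
#include <thread>
#include <benchmark/benchmark.h>

// Two threads each increment their own atomic 10 million times. With
// AtomicsApart the counters sit on separate cache lines; swapping in
// AtomicsTogether gives the false-sharing variant.
static void BM_CacheAtomicsApartTwoThreads(benchmark::State& state) {
    AtomicsApart s;
    for (auto _ : state) {
        std::thread t0([&] { for (int i = 0; i < 10'000'000; ++i) ++s.v0; });
        std::thread t1([&] { for (int i = 0; i < 10'000'000; ++i) ++s.v1; });
        t0.join();
        t1.join();
    }
}
BENCHMARK(BM_CacheAtomicsApartTwoThreads);

BENCHMARK_MAIN();
```

If the benchmarks are structured like this, the small CPU column in the results also makes sense: the benchmark thread spends most of its wall time waiting in join() while the worker threads do the incrementing.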
Results
The results are reproduced here:
Benchmark                                    Time          CPU   Iterations
BM_CacheAtomicOneThread                42303676 ns     42831 ns          100
BM_CacheAtomicsTogetherTwoThreads     180052333 ns     50527 ns          100
BM_CacheAtomicsTogetherThreeThreads   277200032 ns     73196 ns           10
BM_CacheAtomicsApartTwoThreads         42150153 ns     54341 ns          100
BM_CacheAtomicsApartThreeThreads       42466279 ns     59058 ns          100
The behaviour was roughly what I was expecting. It's pretty normal: the cache-coherency protocol kicks in and invalidates the cache lines held by the other cores. But it's nice to get a "feel" for how long things actually take on my machine.
- It takes about 42 ms to increment a std::atomic<int> 10 million times.
- For the benchmarks where the atomics are together, false sharing occurs and the time taken to increment them increases with thread count (to 180 ms, 277 ms, ...), roughly 4x and 7x longer.
- For the benchmarks where the atomics are apart, false sharing does not occur, so the time taken to increment them remains similar as the thread count increases.
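Doing the arithmetic on the numbers above: 42 ms for 10 million increments is roughly 4.2 ns per uncontended atomic increment, while the two-threads-together case (180 ms) works out to roughly 18 ns per increment for each thread.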
In theory, as I have 16 logical cores (8 physical cores), incrementing 8 atomics that are apart should take a similar amount of time as incrementing 1, perhaps with a small overhead for each extra thread. Something for a future post, perhaps. We'll see.