Adds, muls on my superscalar processor - part 2
The question
Since the findings in my previous post left me with more questions, I wanted to dig deeper into one of them in this post:
Why is Add3 faster than Add2?
These were the timing results:
Benchmark Time CPU Iterations
-----------------------------------------------------
BM_Add 56002369 ns 55988747 ns 13
BM_Add2 174727175 ns 174686940 ns 4
BM_Add3 137654951 ns 137638465 ns 5
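As a reminder, the functions under test are shaped roughly like this (a simplified sketch rather than the exact source from part 1; the DoNotOptimize wrapper and some of the names are just illustrative):

#include <benchmark/benchmark.h>

// Three independent accumulators, each summing the same 1 billion ints.
// Add() and Add2() have the same shape with one and two accumulators.
static int Add3() {
    int sum1 = 0, sum2 = 0, sum3 = 0;
    for (int i = 0; i < 1'000'000'000; ++i) {
        sum1 += i;
        sum2 += i;
        sum3 += i;
    }
    return sum1 + sum2 + sum3;
}

static void BM_Add3(benchmark::State& state) {
    for (auto _ : state) {
        benchmark::DoNotOptimize(Add3());
    }
}
BENCHMARK(BM_Add3);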
For Add3, in Compiler Explorer, you can see the instructions for adding up 1 billion ints into 3 separate variables here:
There are 3 sections after the label .L3 for the actual accumulations in the loop. And Add2 is similar, with 2 such sections for its 2 variables.
Yet, Add3 is faster than Add2. So, I had to find out why.
Matching compiler flags
To avoid any discrepancies, one thing I should really do is replicate my benchmark project's compiler flags in Compiler Explorer. The project is a CMake project created through CLion, and I have Debug and Release build configurations. The Release configuration is what I use to get the benchmark timings.
To see my compiler flags, I've enabled verbose makefiles in CMake:
set(CMAKE_VERBOSE_MAKEFILE ON)
I have to ensure that Unix Makefiles are generated for builds (not Ninja, or something else).
Then upon build, I can see the full command-line like so:
/usr/bin/c++ -DBENCHMARK_STATIC_DEFINE -I/home/nolan-veed/nolan-veed/benchmarks/cmake-build-release/_deps/benchmark-src/include -O3 -DNDEBUG -std=gnu++23 -fdiagnostics-color=always -MD -MT CMakeFiles/benchmarks.dir/test_superscalar.cpp.o -MF CMakeFiles/benchmarks.dir/test_superscalar.cpp.o.d -o CMakeFiles/benchmarks.dir/test_superscalar.cpp.o -c /home/nolan-veed/nolan-veed/benchmarks/test_superscalar.cpp
The main bits that could generate different machine code are the optimisation level, the non-debug preprocessor macro, and the C++ standard selection:
-O3 -DNDEBUG -std=gnu++23
But, when I put some of these flags into Compiler Explorer, I don't get any output. It looks like this is because the function is optimised out, as there are no callers to it. We need a way to use those variables, so I'll just print them, like this:
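(Something along these lines; this is a sketch rather than the exact snippet I pasted, the only point being that the sums end up used by a caller:)

#include <cstdio>

int Add3() {
    int sum1 = 0, sum2 = 0, sum3 = 0;
    for (int i = 0; i < 1'000'000'000; ++i) { sum1 += i; sum2 += i; sum3 += i; }
    return sum1 + sum2 + sum3;
}

int main() {
    // Printing the result gives Add3() a caller, so it can't be discarded.
    std::printf("%d\n", Add3());
}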
Aha! It's not as simple as we saw earlier. In "release" mode, the generated code is using vector instructions (more specifically, SSE2). For example, paddd is a packed addition. The xmm registers are 128 bits wide, so they can hold 4 ints and add them simultaneously. Only 250 million iterations are done instead of 1 billion.
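(As an aside, this is all a single paddd does; a tiny hand-written illustration using the SSE2 intrinsics, nothing to do with the benchmark code itself:)

#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdio>

int main() {
    __m128i a = _mm_set_epi32(4, 3, 2, 1);    // four 32-bit ints packed into one xmm register
    __m128i b = _mm_set_epi32(40, 30, 20, 10);
    __m128i c = _mm_add_epi32(a, b);          // one paddd: four additions at once

    alignas(16) int out[4];
    _mm_store_si128(reinterpret_cast<__m128i*>(out), c);
    std::printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]);  // 11 22 33 44
}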
That was Add3; what about Add2 and Add:


Look closely at the cmp instruction!
Findings
Each loop iteration is essentially everything between the label .Lxxx and the instruction jne .Lxxx.
- The number of iterations in Add3 is less than in Add2.
  - The cmp eax, 250000000 in Add3 runs half the number of times compared to the cmp eax, 500000000 in Add2.
  - Add3 is likely to be faster than Add2 because of this.
- The number of iterations in Add3 is the same as in Add.
  - Add3 is likely to be slower than Add because of more additions.
The actual instructions emitted by the compiler are awesome. We'll need to refer to the manuals for them. Maybe another day.
Enabling assembly output locally
To be extra sure, I can look at the code generated on my machine, and then compare that with what Compiler Explorer shows me.
GCC gives you the ability to save the assembly output for the generated code. We need these flags:
-save-temps -masm=intel -fverbose-asm -Wa,-adhlmn=main.lst
Here are a few snippets of the 3 functions.
For Add3:
# /home/nolan-veed/nolan-veed/benchmarks/test_superscalar.cpp:152: static void BM_Add3(benchmark::State &state) {
pxor xmm6, xmm6 # vect__18.418
movdqa xmm4, xmm10 # vect__18.418, vect__18.418
movdqa xmm2, xmm9 # vect_vec_iv_.416, vect_vec_iv_.416
xor eax, eax # ivtmp.421
movdqa xmm5, xmm6 # vect_sum3_47.417,
movdqa xmm1, xmm8 # vect_vec_iv_.415, vect_vec_iv_.415
movdqa xmm0, xmm7 # vect_vec_iv_.414, vect_vec_iv_.414
.p2align 4,,10
.p2align 3
.L128:
add eax, 1 # ivtmp.421,
# /home/nolan-veed/nolan-veed/benchmarks/test_superscalar.cpp:27: sum3 += i;
paddd xmm4, xmm0 # vect__18.418, vect_vec_iv_.414
paddd xmm5, xmm1 # vect_sum3_47.417, vect_vec_iv_.415
paddd xmm6, xmm2 # vect__18.418, vect_vec_iv_.416
paddd xmm0, xmm3 # vect_vec_iv_.414, tmp157
paddd xmm1, xmm3 # vect_vec_iv_.415, tmp157
paddd xmm2, xmm3 # vect_vec_iv_.416, tmp157
cmp eax, 250000000 # ivtmp.421,
jne .L128 #,
For Add2:
# /home/nolan-veed/nolan-veed/benchmarks/test_superscalar.cpp:143: static void BM_Add2(benchmark::State &state) {
xor eax, eax # ivtmp.439
movdqa xmm1, xmm5 # vect_sum2_39.435, vect_sum2_39.435
movdqa xmm0, xmm4 # vect_vec_iv_.434, vect_vec_iv_.434
.p2align 4,,10
.p2align 3
.L143:
add eax, 1 # ivtmp.439,
# /home/nolan-veed/nolan-veed/benchmarks/test_superscalar.cpp:16: sum2 += i;
paddd xmm1, xmm0 # vect_sum2_39.435, vect_vec_iv_.434
paddd xmm0, xmm2 # vect_vec_iv_.434, tmp111
cmp eax, 500000000 # ivtmp.439,
jne .L143 #,
For Add:
# /home/nolan-veed/nolan-veed/benchmarks/test_superscalar.cpp:135: static void BM_Add(benchmark::State &state) {
xor eax, eax # ivtmp.457
pxor xmm1, xmm1 # vect_sum_31.453
movdqa xmm0, xmm4 # vect_vec_iv_.452, vect_vec_iv_.452
.p2align 4,,10
.p2align 3
.L158:
movdqa xmm2, xmm0 # vect_vec_iv_.452, vect_vec_iv_.452
add eax, 1 # ivtmp.457,
paddd xmm0, xmm3 # vect_vec_iv_.452, tmp104
# /home/nolan-veed/nolan-veed/benchmarks/test_superscalar.cpp:7: sum += i;
paddd xmm1, xmm2 # vect_sum_31.453, vect_vec_iv_.452
cmp eax, 250000000 # ivtmp.457,
jne .L158 #,
movdqa xmm0, xmm1 # tmp98, vect_sum_31.453
psrldq xmm0, 8 # tmp98,
paddd xmm1, xmm0 # _11, tmp98
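Incidentally, the movdqa/psrldq/paddd lines after the loop are the start of a horizontal reduction: collapsing the 4 lanes of the vector accumulator into a single int (the snippet above only shows the first step). With intrinsics, the idea looks roughly like this (a sketch, not the compiler's exact code):

#include <emmintrin.h>
#include <cstdio>

// Collapse the four 32-bit lanes of 'acc' into one int,
// the same idea as the psrldq/paddd tail after the loop.
static int horizontal_sum(__m128i acc) {
    acc = _mm_add_epi32(acc, _mm_srli_si128(acc, 8));  // lane0+lane2, lane1+lane3
    acc = _mm_add_epi32(acc, _mm_srli_si128(acc, 4));  // lane 0 now holds the total
    return _mm_cvtsi128_si32(acc);                     // extract the low lane
}

int main() {
    __m128i acc = _mm_set_epi32(4, 3, 2, 1);
    std::printf("%d\n", horizontal_sum(acc));  // 10
}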
The iteration loops are similar to what Compiler Explorer shows.
Phew!