
Adds, muls on my superscalar processor - part 2

The question

Since the findings in my previous post left me with more questions, I wanted to try and dig deeper into one of them in this post:

Why is Add3 faster than Add2?

These were the timing results:

Benchmark           Time             CPU   Iterations
-----------------------------------------------------
BM_Add       56002369 ns     55988747 ns           13
BM_Add2     174727175 ns    174686940 ns            4
BM_Add3     137654951 ns    137638465 ns            5

For Add3, in Compiler Explorer, you can see the instructions for adding up 1 billion ints into 3 separate variables here:

A screenshot of Compiler Explorer

There are 3 sections after the label .L3 for the actual accumulations in the loop. Add2 is similar, with 2 such sections for its 2 variables.

Yet, Add3 is faster than Add2. So, I had to find out why.

Matching compiler flags

To avoid any discrepancies, one thing I should really do is replicate my benchmark project's compiler flags into Compiler Explorer. The project is a CMake project created through CLion, and I have a Debug and Release build configuration. The Release configuration is what I use to get the benchmark timings.

To see my compiler flags, I've enabled verbose makefiles in CMake:

set(CMAKE_VERBOSE_MAKEFILE ON)

I also have to ensure that the Unix Makefiles generator is used for builds (not Ninja or something else).
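
Outside CLion, the equivalent configure and build steps look roughly like this (a sketch; the build directory name just mirrors the cmake-build-release path CLion uses):

cmake -G "Unix Makefiles" -DCMAKE_BUILD_TYPE=Release -S . -B cmake-build-release
cmake --build cmake-build-release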

Then upon build, I can see the full command-line like so:

/usr/bin/c++ -DBENCHMARK_STATIC_DEFINE -I/home/nolan-veed/nolan-veed/benchmarks/cmake-build-release/_deps/benchmark-src/include -O3 -DNDEBUG -std=gnu++23 -fdiagnostics-color=always -MD -MT CMakeFiles/benchmarks.dir/test_superscalar.cpp.o -MF CMakeFiles/benchmarks.dir/test_superscalar.cpp.o.d -o CMakeFiles/benchmarks.dir/test_superscalar.cpp.o -c /home/nolan-veed/nolan-veed/benchmarks/test_superscalar.cpp

The main bits that could lead to different machine code are the optimisation level, the non-debug preprocessor macro, and the C++ standard selection:

-O3 -DNDEBUG -std=gnu++23
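
As a reminder of why -DNDEBUG can change the generated code: it compiles assert checks out entirely. A generic illustration (not code from the benchmarks):

#include <cassert>

int checked_add(int a, int b) {
    // With -DNDEBUG defined, this assert disappears and the function
    // body reduces to a plain add.
    assert(a >= 0 && b >= 0);
    return a + b;
}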

But when I put some of these flags into Compiler Explorer, I don't get any output. It looks like this is because the function is actually optimised out, as there are no callers to it. We need a way to use those variables, so I'll just print them, like this:

A screenshot of Compiler Explorer - Add3
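
Something along these lines does the trick (a rough sketch of the idea, not the exact code from the screenshot; the loop is a stand-in for part 1's Add3, and I've used unsigned just to keep the sketch free of signed-overflow UB):

#include <cstdio>

void Add3() {
    unsigned sum1 = 0, sum2 = 0, sum3 = 0;
    // Stand-in for the Add3 loop: accumulate into three separate 32-bit sums.
    for (unsigned i = 0; i < 1000000000u; ++i) {
        sum1 += i;
        sum2 += i;
        sum3 += i;
    }
    // Printing the sums gives the function an observable side effect,
    // so the compiler can't discard the whole computation.
    std::printf("%u %u %u\n", sum1, sum2, sum3);
}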

Aha! It's not as simple as we saw earlier. In "release" mode, the generated code uses vector instructions (more specifically, SSE2). For example, paddd is a packed addition of 32-bit integers. The xmm registers are 128 bits wide, so each can hold 4 ints and add them simultaneously. Only 250 million iterations are done instead of 1 billion.
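
In C++ terms, one paddd corresponds to the SSE2 intrinsic _mm_add_epi32: four independent 32-bit additions in a single instruction. A minimal illustration:

#include <emmintrin.h>

// Adds four 32-bit ints at once; with SSE2 codegen this typically
// compiles down to a single paddd.
__m128i add4(__m128i a, __m128i b) {
    return _mm_add_epi32(a, b);
}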

That was Add3. What about Add2 and Add?

A screenshot of Compiler Explorer - Add2

A screenshot of Compiler Explorer - Add

Look closely at the cmp instruction!

Findings

Each loop iteration essentially runs from the label .Lxxx down to the instruction jne .Lxxx. And the cmp operands give it away: Add2's loop runs 500 million times, while the loops for Add and Add3 run only 250 million times.

The actual instructions emitted by the compiler are awesome. We'll need to refer to the manuals to work through them properly. Maybe another day.

Enabling assembly output locally

To be extra sure, I can look at the code generated on my machine, and then compare that with what Compiler Explorer shows me.

GCC gives you the ability to save the assembly output for the code it generates. We need these flags:

-save-temps -masm=intel -fverbose-asm -Wa,-adhlmn=main.lst
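
One way to wire these into the Release build is via the target's compile options (a sketch; the target name benchmarks is taken from the object-file path in the build command above):

target_compile_options(benchmarks PRIVATE
    -save-temps -masm=intel -fverbose-asm -Wa,-adhlmn=main.lst)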

Here are a few snippets of the 3 functions.

For Add3:

# /home/nolan-veed/nolan-veed/benchmarks/test_superscalar.cpp:152: static void BM_Add3(benchmark::State &state) {
  pxor	xmm6, xmm6	# vect__18.418
  movdqa	xmm4, xmm10	# vect__18.418, vect__18.418
  movdqa	xmm2, xmm9	# vect_vec_iv_.416, vect_vec_iv_.416
  xor	eax, eax	# ivtmp.421
  movdqa	xmm5, xmm6	# vect_sum3_47.417,
  movdqa	xmm1, xmm8	# vect_vec_iv_.415, vect_vec_iv_.415
  movdqa	xmm0, xmm7	# vect_vec_iv_.414, vect_vec_iv_.414
  .p2align 4,,10
  .p2align 3
.L128:
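# note: three packed accumulations (into xmm4, xmm5, xmm6) plus three induction-vector updates per iteration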
  add	eax, 1	# ivtmp.421,
# /home/nolan-veed/nolan-veed/benchmarks/test_superscalar.cpp:27:         sum3 += i;
  paddd	xmm4, xmm0	# vect__18.418, vect_vec_iv_.414
  paddd	xmm5, xmm1	# vect_sum3_47.417, vect_vec_iv_.415
  paddd	xmm6, xmm2	# vect__18.418, vect_vec_iv_.416
  paddd	xmm0, xmm3	# vect_vec_iv_.414, tmp157
  paddd	xmm1, xmm3	# vect_vec_iv_.415, tmp157
  paddd	xmm2, xmm3	# vect_vec_iv_.416, tmp157
  cmp	eax, 250000000	# ivtmp.421,
  jne	.L128	#,

For Add2:

# /home/nolan-veed/nolan-veed/benchmarks/test_superscalar.cpp:143: static void BM_Add2(benchmark::State &state) {
  xor	eax, eax	# ivtmp.439
  movdqa	xmm1, xmm5	# vect_sum2_39.435, vect_sum2_39.435
  movdqa	xmm0, xmm4	# vect_vec_iv_.434, vect_vec_iv_.434
  .p2align 4,,10
  .p2align 3
.L143:
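# note: only one packed accumulation and one induction-vector update per iteration here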
  add	eax, 1	# ivtmp.439,
# /home/nolan-veed/nolan-veed/benchmarks/test_superscalar.cpp:16:         sum2 += i;
  paddd	xmm1, xmm0	# vect_sum2_39.435, vect_vec_iv_.434
  paddd	xmm0, xmm2	# vect_vec_iv_.434, tmp111
  cmp	eax, 500000000	# ivtmp.439,
  jne	.L143	#,

For Add:

# /home/nolan-veed/nolan-veed/benchmarks/test_superscalar.cpp:135: static void BM_Add(benchmark::State &state) {
  xor	eax, eax	# ivtmp.457
  pxor	xmm1, xmm1	# vect_sum_31.453
  movdqa	xmm0, xmm4	# vect_vec_iv_.452, vect_vec_iv_.452
  .p2align 4,,10
  .p2align 3
.L158:
  movdqa	xmm2, xmm0	# vect_vec_iv_.452, vect_vec_iv_.452
  add	eax, 1	# ivtmp.457,
  paddd	xmm0, xmm3	# vect_vec_iv_.452, tmp104
# /home/nolan-veed/nolan-veed/benchmarks/test_superscalar.cpp:7:         sum += i;
  paddd	xmm1, xmm2	# vect_sum_31.453, vect_vec_iv_.452
  cmp	eax, 250000000	# ivtmp.457,
  jne	.L158	#,
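# note: the code below starts folding the 4 lanes of xmm1 into a single scalar sum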
  movdqa	xmm0, xmm1	# tmp98, vect_sum_31.453
  psrldq	xmm0, 8	# tmp98,
  paddd	xmm1, xmm0	# _11, tmp98

The iteration loops are similar to what Compiler Explorer shows.

Phew!