
This post is closely related to another one I posted a few days ago. This time, I wrote a simple piece of code that adds a pair of arrays element by element, multiplies the result by the values in another array, and stores it in a fourth array; all variables are double-precision floating point.

I made two versions of that code: one using SSE intrinsics and another one without them. I then compiled both with gcc at the -O0 optimization level. They are listed below:

// SSE VERSION
#define N 10000
#define NTIMES 100000
#include <time.h>
#include <stdio.h>
#include <xmmintrin.h>
#include <pmmintrin.h>

double a[N] __attribute__((aligned(16)));
double b[N] __attribute__((aligned(16)));
double c[N] __attribute__((aligned(16)));
double r[N] __attribute__((aligned(16)));

int main(void)
{
    int i, times;
    for (times = 0; times < NTIMES; times++) {
        for (i = 0; i < N; i += 2) {
            __m128d mm_a = _mm_load_pd(&a[i]);
            _mm_prefetch(&a[i+4], _MM_HINT_T0);
            __m128d mm_b = _mm_load_pd(&b[i]);
            _mm_prefetch(&b[i+4], _MM_HINT_T0);
            __m128d mm_c = _mm_load_pd(&c[i]);
            _mm_prefetch(&c[i+4], _MM_HINT_T0);
            __m128d mm_r;
            mm_r = _mm_add_pd(mm_a, mm_b);
            mm_a = _mm_mul_pd(mm_r, mm_c);
            _mm_store_pd(&r[i], mm_a);
        }
    }
}

// NO SSE VERSION
// same definitions as before
int main(void)
{
    int i, times;
    for (times = 0; times < NTIMES; times++) {
        for (i = 0; i < N; i++) {
            r[i] = (a[i] + b[i]) * c[i];
        }
    }
}

When compiling with -O0, gcc makes use of XMM registers and SSE instructions, unless it is specifically given the -mno-sse (and related) options. I inspected the assembly generated for the second program and noticed that it uses the movsd, addsd and mulsd instructions. So it does use SSE instructions, but only the scalar ones that operate on the lowest part of the registers, if I am not mistaken. The assembly generated for the first program used, as expected, the addpd and mulpd instructions, although it was considerably larger.

Anyway, the first version should, as far as I know, take better advantage of the SIMD paradigm, since each iteration computes two result values. Even so, the second version runs about 25 percent faster than the first one. I also ran a test with single-precision values and got similar results. What is the reason for that?

  • Comparing performance when compiling without optimizations is pretty meaningless. Commented Oct 27, 2011 at 16:43
  • You're doing 3 loads and 1 store for just 2 arithmetic operations, so you'll most likely be bandwidth-limited. Commented Oct 27, 2011 at 16:56
  • What happens when you remove the _mm_prefetch calls? I think they may be hurting you. Commented Oct 27, 2011 at 16:59
  • Those prefetch calls do indeed look pretty useless. The access pattern in the inner loop is sequential (so the hardware prefetcher will pick it up). Furthermore, you're only prefetching one iteration ahead, and you have almost as many prefetch instructions as "work" instructions. Commented Oct 27, 2011 at 17:18
  • You were right: removing the prefetch calls improves performance a little (not much). I guess prefetch should only be used when there isn't such a sequential access pattern. When compiling with -O3, the first version performs noticeably better. Commented Oct 27, 2011 at 18:55
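For reference, this is what the intrinsics version looks like with the _mm_prefetch calls removed, as the commenters suggest. It is just the question's code minus the prefetches, nothing else changed:

// Sketch: the question's SSE loop without the _mm_prefetch calls.
// N, NTIMES and the array declarations are the same as in the question.
#include <emmintrin.h>   /* SSE2: _mm_load_pd, _mm_add_pd, _mm_mul_pd, _mm_store_pd */

#define N 10000
#define NTIMES 100000

double a[N] __attribute__((aligned(16)));
double b[N] __attribute__((aligned(16)));
double c[N] __attribute__((aligned(16)));
double r[N] __attribute__((aligned(16)));

int main(void)
{
    int i, times;
    for (times = 0; times < NTIMES; times++) {
        for (i = 0; i < N; i += 2) {                 /* two doubles per iteration */
            __m128d mm_a = _mm_load_pd(&a[i]);
            __m128d mm_b = _mm_load_pd(&b[i]);
            __m128d mm_c = _mm_load_pd(&c[i]);
            __m128d mm_r = _mm_mul_pd(_mm_add_pd(mm_a, mm_b), mm_c);
            _mm_store_pd(&r[i], mm_r);
        }
    }
    return 0;
}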

2 Answers


Vectorization in GCC is enabled at -O3. That's why at -O0, you see only the ordinary scalar SSE2 instructions (movsd, addsd, etc). Using GCC 4.6.1 and your second example:

#define N 10000
#define NTIMES 100000

double a[N] __attribute__ ((aligned (16)));
double b[N] __attribute__ ((aligned (16)));
double c[N] __attribute__ ((aligned (16)));
double r[N] __attribute__ ((aligned (16)));

int main (void)
{
    int i, times;

    for (times = 0; times < NTIMES; times++)
    {
        for (i = 0; i < N; ++i)
            r[i] = (a[i] + b[i]) * c[i];
    }

    return 0;
}

and compiling with gcc -S -O3 -msse2 sse.c produces for the inner loop the following instructions, which is pretty good:

.L3:
    movapd  a(%eax), %xmm0
    addpd   b(%eax), %xmm0
    mulpd   c(%eax), %xmm0
    movapd  %xmm0, r(%eax)
    addl    $16, %eax
    cmpl    $80000, %eax
    jne     .L3

As you can see, with vectorization enabled GCC emits code to perform two loop iterations in parallel. It can be improved, though - this code uses the lower 128 bits of the SSE registers, but it can use the full 256-bit YMM registers by enabling the AVX encoding of SSE instructions (if available on the machine). So, compiling the same program with gcc -S -O3 -msse2 -mavx sse.c gives for the inner loop:

.L3:
    vmovapd a(%eax), %ymm0
    vaddpd  b(%eax), %ymm0, %ymm0
    vmulpd  c(%eax), %ymm0, %ymm0
    vmovapd %ymm0, r(%eax)
    addl    $32, %eax
    cmpl    $80000, %eax
    jne     .L3

Note the v in front of each instruction and that the instructions use the 256-bit YMM registers; four iterations of the original loop are executed in parallel.
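For comparison, a rough hand-written equivalent of that YMM loop using AVX intrinsics might look as follows. This is only a sketch: it assumes AVX support, compilation with -mavx, and - for the aligned 256-bit loads and stores - arrays declared with 32-byte alignment, which is stricter than the 16-byte alignment in the question:

#include <immintrin.h>   /* AVX: __m256d, _mm256_load_pd, _mm256_add_pd, ... */

#define N 10000

/* Assumption: 32-byte alignment, required by the aligned 256-bit loads/stores below. */
double a[N] __attribute__((aligned(32)));
double b[N] __attribute__((aligned(32)));
double c[N] __attribute__((aligned(32)));
double r[N] __attribute__((aligned(32)));

void add_mul_avx(void)
{
    int i;
    for (i = 0; i < N; i += 4) {                     /* four doubles per iteration */
        __m256d va = _mm256_load_pd(&a[i]);
        __m256d vb = _mm256_load_pd(&b[i]);
        __m256d vc = _mm256_load_pd(&c[i]);
        _mm256_store_pd(&r[i], _mm256_mul_pd(_mm256_add_pd(va, vb), vc));
    }
}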


3 Comments

I just ran this through gcc 4.7.2 on x86-64 with and without the -msse2 flag - both resulted in the same assembler output. So would it be safe to say SSE instructions are enabled by default on this platform?
@lori, yes, SSE is default on x86-64.
Note that gcc 4.6's AVX output isn't safe: vmovapd ymm will fault if the address isn't 32B-aligned, but the source only asks for 16B alignment. gcc 4.8 and later get it right and make setup / cleanup loops to handle parts of the array that aren't 32B-aligned. With -mavx2, the inner loop uses 16B loads with movapd / vinsertf128 for two arrays, and a 32B aligned memory operand for the 3rd src, but with -march=haswell it does 32B unaligned loads/stores for all arrays (after getting one of them 32B aligned). This is from the -mtune=haswell settings.
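Following up on that alignment caveat: if the arrays are only guaranteed to be 16-byte aligned, as in the question, a hand-written 256-bit loop would have to use the unaligned load/store intrinsics instead. A minimal sketch (the helper name is made up for illustration; it assumes n is a multiple of 4):

#include <immintrin.h>

/* Sketch: 256-bit vectors over arrays that are only 16-byte aligned,
   using unaligned loads/stores so vmovapd-style alignment faults cannot occur. */
static void add_mul_avx_unaligned(const double *a, const double *b,
                                  const double *c, double *r, int n)
{
    int i;
    for (i = 0; i < n; i += 4) {
        __m256d va = _mm256_loadu_pd(&a[i]);
        __m256d vb = _mm256_loadu_pd(&b[i]);
        __m256d vc = _mm256_loadu_pd(&c[i]);
        _mm256_storeu_pd(&r[i], _mm256_mul_pd(_mm256_add_pd(va, vb), vc));
    }
}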

I would like to extend chill's answer and draw your attention to the fact that GCC does not seem to be able to make the same smart use of AVX instructions when iterating backwards.

Just replace the inner loop in chill's sample code with:

for (i = N-1; i >= 0; --i)
    r[i] = (a[i] + b[i]) * c[i];

GCC (4.8.4) with options -S -O3 -mavx produces:

.L5:
    vmovsd  a+79992(%rax), %xmm0
    subq    $8, %rax
    vaddsd  b+80000(%rax), %xmm0, %xmm0
    vmulsd  c+80000(%rax), %xmm0, %xmm0
    vmovsd  %xmm0, r+80000(%rax)
    cmpq    $-80000, %rax
    jne     .L5

1 Comment

Interesting. Newer gcc auto-vectorizes that hilariously, using a vpermpd 0b00011011 for each array input/output to reverse it after loading, so the data elements within each vector go from first to last in source order. That's 4 vpermpds per iteration! Interestingly, clang auto-vectorizes it nicely.
