1

It has been a while since I started working with SSE/AVX intrinsic functions. I recently began writing a header for matrix transposition. I used a lot of if constexpr branches so that the compiler always selects the optimal instruction set depending on some template parameters. Now I wanted to check if everything works as expected by looking into the local disassembly with objdump. When using Clang, I get a clear output which basically contains only the assembly instructions corresponding to the utilized intrinsic functions. However, if I use GCC, the disassembly is quite bloated with extra instructions. A quick check on Godbolt shows me that those extra instructions in the GCC disassembly shouldn't be there.

Here is a small example:

#include <x86intrin.h> #include <array> std::array<__m256, 1> Test(std::array<__m256, 1> a) { std::array<__m256, 1> b; b[0] = _mm256_unpacklo_ps(a[0], a[0]); return b; } 

I compile with -march=native -Wall -Wextra -Wpedantic -pthread -O3 -DNDEBUG -std=gnu++1z. Then I use objdump -S -Mintel libassembly.a > libassembly.dump on the object file. For Clang (6.0.0), the result is:

In archive libassembly.a: libAssembly.cpp.o: file format elf64-x86-64 Disassembly of section .text: 0000000000000000 <_Z4TestSt5arrayIDv8_fLm1EE>: 0: c4 e3 7d 04 c0 50 vpermilps ymm0,ymm0,0x50 6: c3 ret 

which is the same as Godbolt returns: Godbolt - Clang 6.0.0

For GCC (7.4) the output is

In archive libassembly.a: libAssembly.cpp.o: file format elf64-x86-64 Disassembly of section .text: 0000000000000000 <_Z4TestSt5arrayIDv8_fLm1EE>: 0: 4c 8d 54 24 08 lea r10,[rsp+0x8] 5: 48 83 e4 e0 and rsp,0xffffffffffffffe0 9: c5 fc 14 c0 vunpcklps ymm0,ymm0,ymm0 d: 41 ff 72 f8 push QWORD PTR [r10-0x8] 11: 55 push rbp 12: 48 89 e5 mov rbp,rsp 15: 41 52 push r10 17: 48 83 ec 28 sub rsp,0x28 1b: 64 48 8b 04 25 28 00 mov rax,QWORD PTR fs:0x28 22: 00 00 24: 48 89 45 e8 mov QWORD PTR [rbp-0x18],rax 28: 31 c0 xor eax,eax 2a: 48 8b 45 e8 mov rax,QWORD PTR [rbp-0x18] 2e: 64 48 33 04 25 28 00 xor rax,QWORD PTR fs:0x28 35: 00 00 37: 75 0c jne 45 <_Z4TestSt5arrayIDv8_fLm1EE+0x45> 39: 48 83 c4 28 add rsp,0x28 3d: 41 5a pop r10 3f: 5d pop rbp 40: 49 8d 62 f8 lea rsp,[r10-0x8] 44: c3 ret 45: c5 f8 77 vzeroupper 48: e8 00 00 00 00 call 4d <_Z4TestSt5arrayIDv8_fLm1EE+0x4d> 

As you can see, there are a lot of additional instructions. In contrast to that, Godbolt does not include all these extra instructions: Godbolt - GCC 7.4

So what is going on here? I have just started learning assembly, so maybe it is totally clear to someone with assembly experience, but I am a little bit confused why GCC creates those extra instructions on my machine.

Greetings and thank you in advance.

EDIT

To avoid further confusions, I just compiled using:

gcc-7 -I/usr/local/include -O3 -march=native -Wall -Wextra -Wpedantic -pthread -std=gnu++1z -o test.o -c /<PathToFolder>/libAssembly.cpp

Output remains the same. I am not sure if this is relevant, but it generates the warning: warning: ignoring attributes on template argument ‘__m256 {aka __vector(8) float}’ [-Wignored-attributes]

Usually I surpress this warning and it shouldn't be an issue:

Implication of GCC warning: ignoring attributes on template argument (-Wignored-attributes)

Processor is Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz

Here is the gcc -v:

gcc-7 -v Using built-in specs. COLLECT_GCC=gcc-7 COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/7/lto-wrapper OFFLOAD_TARGET_NAMES=nvptx-none OFFLOAD_TARGET_DEFAULT=1 Target: x86_64-linux-gnu Configured with: ../src/configure -v --with-pkgversion='Ubuntu 7.4.0-1ubuntu1~18.04.1' --with-bugurl=file:///usr/share/doc/gcc-7/README.Bugs --enable-languages=c,ada,c++,go,brig,d,fortran,objc,obj-c++ --prefix=/usr --with-gcc-major-version-only --program-suffix=-7 --program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-libmpx --enable-plugin --enable-default-pie --with-system-zlib --with-target-system-zlib --enable-objc-gc=auto --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-offload-targets=nvptx-none --without-cuda-driver --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu Thread model: posix gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1) 
11
  • I cannot reproduce it with my local GCC install. Commented Oct 5, 2019 at 12:59
  • why GCC - wait, which gcc? You posted a link to godbolt to gcc 7.4 that generates vunpcklps ymm0, ymm0, ymm0. So what is the output you are presenting? Is it for your machine? You use -march=native, does your local machine support SSE/AVX ? Commented Oct 5, 2019 at 13:05
  • In the "bad" example, are you sure that you compiled with optimization enabled? You mentioned -O3 in the question, but that looks to me like unoptimized output. Commented Oct 5, 2019 at 13:08
  • @JasonR: Pretty sure. I am using CMake and the verbose output tells me that the file is built with -march=native -Wall -Wextra -Wpedantic -pthread -O3 -DNDEBUG -std=gnu++1z -o CMakeFiles/assembly.dir/libAssembly.cpp.o Commented Oct 5, 2019 at 13:24
  • @KamilCuk The output was generated on my machine with gcc 7.4 and my machine supports AVX2 instructions. Commented Oct 5, 2019 at 13:38

1 Answer 1

8

Use -fno-stack-protector


Your local GCC defaults to -fstack-protector-strong but Godbolt's GCC install doesn't.

mov rax,QWORD PTR fs:0x28 is the telltale clue; Thread-local storage at fs:40 aka fs:0x28 is where GCC keeps its stack cookie constant. The call after the ret is call __stack_chk_fail (but you disassembled a .o without using objdump -dr to show relocations, so the placeholder +0 offset just looked like still a target within this function).

Since you have arrays (or a class containing an array), stack-protector-strong kicks in even though their sizes are compile-time constants. So you get the code to store the stack cookie, then check it and branch on stack overflow. (Even the array of size 1 in this MVCE is enough to trigger that.)

Making arrays on the stack with 32-byte alignment (for __m256) requires 32-byte alignment, and your GCC is older than GCC8 so you get the ridiculously clunky stack-alignment code that builds a full copy of the stack frame including a return address. Generated assembly for extended alignment of stack variables (To be clear, GCC8 still does align the stack here, just wasting fewer instructions on it.)

This is pretty much a missed optimization; gcc never actually spills or reloads to those arrays so it could have just optimized them away, along with the stack alignment, like it did without stack-protector.

More recent GCC is better at optimizing away stack alignment after optimizing away the memory for aligned locals in more cases, but this has been a persistent missed optimization in AVX code. Fortunately the cost is pretty negligible in a function that loops; as long as small helper functions inline.


Compiling on Godbolt with -fstack-protector-strong reproduces your output. Newer GCC, including current trunk pre-10, still has both missed optimizations, but stack alignment costs fewer instructions because it just uses RBP as a frame pointer and aligns RSP, then references locals relative to aligned RSP. It still checks the stack cookie (with no instructions between storing it and checking it).

On your desktop, compiling with -fno-stack-protector should make good asm.

Sign up to request clarification or add additional context in comments.

9 Comments

Thank you very much. Its awesome how much knowledge some people have around here :) . Compiling with -fno-stack-protector indeed solves the problem. However, I made a distribution upgrade to ubuntu 19.04 with GCC 8.3. The problem still remains when I compile without the flag. Shouldn't it disappear with GCC 8 +? --- The array sizes of 1 are actually an edge-case of the function template I am using where I thought that the compiler would optimize it away anyways.
@wychmaster: oops, you're right. I didn't look carefully and missed that stack alignment was still there even with GCC9. The asm output is shorter but that's because GCC8 has much simpler stack alignment in functions without VLAs / alloca, not because stack alignment is optimized away entirely. Silly me, fixed my answer.
@wychmaster: After inlining it only affects the caller once, not once per invocation of an inline function. Or if the caller needed stack alignment already, then there's no extra cost. But yes it could optimize away in the x86-64 System V calling convention. A class containing one __m256[1] member is passed and returned in vector registers, unlike in Windows x64 vectorcall where being inside a class would force pass/return by reference. godbolt.org/z/piojOa (Which would still optimize away when inlining, but the standalone version would have that guaranteed overhead.)
Thanks again for your explanation. I need the clean assembly only to check if my matrix transpose functions work as expected and to see if all optimizations I expect are actually applied. Apart from that, as long as there is no significant impact on my benchmarks, I can live with it. I am curious how big the effect of the MSVC overhead is. I think I have to finally start testing my code in Windows too. However, I can always use MinGW as an alternative to MSVC :p
@wychmaster: MinGW GCC (still AFAIK) isn't usable with AVX because of a bug where it doesn't align the stack before spilling __m256 locals, or something like that. Use clang (it optimizes better than MSVC anyway, in general, and implements GNU C/C++ extensions). But anyway, both of them will still use the Windows x64 calling convention when targeting Windows. As long as your function inlines there's no actual pass/return overhead, though.
|

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.