It has been a while since I started working with SSE/AVX intrinsic functions. I recently began writing a header for matrix transposition. I used a lot of if constexpr branches so that the compiler always selects the optimal instruction set depending on some template parameters. Now I wanted to check if everything works as expected by looking into the local disassembly with objdump. When using Clang, I get a clear output which basically contains only the assembly instructions corresponding to the utilized intrinsic functions. However, if I use GCC, the disassembly is quite bloated with extra instructions. A quick check on Godbolt shows me that those extra instructions in the GCC disassembly shouldn't be there.
Here is a small example:
#include <x86intrin.h> #include <array> std::array<__m256, 1> Test(std::array<__m256, 1> a) { std::array<__m256, 1> b; b[0] = _mm256_unpacklo_ps(a[0], a[0]); return b; } I compile with -march=native -Wall -Wextra -Wpedantic -pthread -O3 -DNDEBUG -std=gnu++1z. Then I use objdump -S -Mintel libassembly.a > libassembly.dump on the object file. For Clang (6.0.0), the result is:
In archive libassembly.a: libAssembly.cpp.o: file format elf64-x86-64 Disassembly of section .text: 0000000000000000 <_Z4TestSt5arrayIDv8_fLm1EE>: 0: c4 e3 7d 04 c0 50 vpermilps ymm0,ymm0,0x50 6: c3 ret which is the same as Godbolt returns: Godbolt - Clang 6.0.0
For GCC (7.4) the output is
In archive libassembly.a: libAssembly.cpp.o: file format elf64-x86-64 Disassembly of section .text: 0000000000000000 <_Z4TestSt5arrayIDv8_fLm1EE>: 0: 4c 8d 54 24 08 lea r10,[rsp+0x8] 5: 48 83 e4 e0 and rsp,0xffffffffffffffe0 9: c5 fc 14 c0 vunpcklps ymm0,ymm0,ymm0 d: 41 ff 72 f8 push QWORD PTR [r10-0x8] 11: 55 push rbp 12: 48 89 e5 mov rbp,rsp 15: 41 52 push r10 17: 48 83 ec 28 sub rsp,0x28 1b: 64 48 8b 04 25 28 00 mov rax,QWORD PTR fs:0x28 22: 00 00 24: 48 89 45 e8 mov QWORD PTR [rbp-0x18],rax 28: 31 c0 xor eax,eax 2a: 48 8b 45 e8 mov rax,QWORD PTR [rbp-0x18] 2e: 64 48 33 04 25 28 00 xor rax,QWORD PTR fs:0x28 35: 00 00 37: 75 0c jne 45 <_Z4TestSt5arrayIDv8_fLm1EE+0x45> 39: 48 83 c4 28 add rsp,0x28 3d: 41 5a pop r10 3f: 5d pop rbp 40: 49 8d 62 f8 lea rsp,[r10-0x8] 44: c3 ret 45: c5 f8 77 vzeroupper 48: e8 00 00 00 00 call 4d <_Z4TestSt5arrayIDv8_fLm1EE+0x4d> As you can see, there are a lot of additional instructions. In contrast to that, Godbolt does not include all these extra instructions: Godbolt - GCC 7.4
So what is going on here? I have just started learning assembly, so maybe it is totally clear to someone with assembly experience, but I am a little bit confused why GCC creates those extra instructions on my machine.
Greetings and thank you in advance.
EDIT
To avoid further confusions, I just compiled using:
gcc-7 -I/usr/local/include -O3 -march=native -Wall -Wextra -Wpedantic -pthread -std=gnu++1z -o test.o -c /<PathToFolder>/libAssembly.cpp
Output remains the same. I am not sure if this is relevant, but it generates the warning: warning: ignoring attributes on template argument ‘__m256 {aka __vector(8) float}’ [-Wignored-attributes]
Usually I surpress this warning and it shouldn't be an issue:
Implication of GCC warning: ignoring attributes on template argument (-Wignored-attributes)
Processor is Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
Here is the gcc -v:
gcc-7 -v Using built-in specs. COLLECT_GCC=gcc-7 COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/7/lto-wrapper OFFLOAD_TARGET_NAMES=nvptx-none OFFLOAD_TARGET_DEFAULT=1 Target: x86_64-linux-gnu Configured with: ../src/configure -v --with-pkgversion='Ubuntu 7.4.0-1ubuntu1~18.04.1' --with-bugurl=file:///usr/share/doc/gcc-7/README.Bugs --enable-languages=c,ada,c++,go,brig,d,fortran,objc,obj-c++ --prefix=/usr --with-gcc-major-version-only --program-suffix=-7 --program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-libmpx --enable-plugin --enable-default-pie --with-system-zlib --with-target-system-zlib --enable-objc-gc=auto --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-offload-targets=nvptx-none --without-cuda-driver --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu Thread model: posix gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1)
why GCC- wait, which gcc? You posted a link to godbolt to gcc 7.4 that generatesvunpcklps ymm0, ymm0, ymm0. So what is the output you are presenting? Is it for your machine? You use-march=native, does your local machine support SSE/AVX ?-O3in the question, but that looks to me like unoptimized output.-march=native -Wall -Wextra -Wpedantic -pthread -O3 -DNDEBUG -std=gnu++1z -o CMakeFiles/assembly.dir/libAssembly.cpp.o