I find this problem interesting. GCC is known for producing less-than-optimal code, but it is fascinating to find ways to "encourage" it into producing better code (for the hottest/bottleneck code only, of course), without micro-managing too heavily. In this particular case, I looked at three "tools" I use for such situations:
volatile: If it is important that the memory accesses occur in a specific order, then volatile is a suitable tool. Note that it can be overkill, and it causes a separate load every time a volatile pointer is dereferenced.
SSE/AVX load/store intrinsics can't be used with volatile pointers, because they are functions whose pointer parameters are not volatile-qualified. Using something like _mm256_load_si256((volatile __m256i *)src); implicitly converts the pointer to const __m256i *, losing the volatile qualifier.
We can directly dereference volatile pointers, though. (load/store intrinsics are only needed when we need to tell the compiler that the data might be unaligned, or that we want a streaming store.)
    m0 = ((volatile __m256i *)src)[0];
    m1 = ((volatile __m256i *)src)[1];
    m2 = ((volatile __m256i *)src)[2];
    m3 = ((volatile __m256i *)src)[3];
Unfortunately, this doesn't help with the stores, because we want to emit streaming (non-temporal) stores. A plain *(volatile __m256i *)dst = tmp; compiles to an ordinary store, which is not what we want; the two forms are contrasted in the sketch below.
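To make the contrast concrete, here is a minimal sketch (the helper names are mine, purely illustrative):

    #include <immintrin.h>

    /* Hypothetical helpers, only to contrast the two store forms. */
    static inline void store_plain(__m256i *dst, __m256i tmp)
    {
        /* A plain volatile store compiles to an ordinary (cached) store,
           e.g. vmovdqa.                                                   */
        *(volatile __m256i *)dst = tmp;
    }

    static inline void store_streaming(__m256i *dst, __m256i tmp)
    {
        /* The streaming-store intrinsic compiles to a non-temporal store,
           e.g. vmovntdq, which bypasses the cache.                        */
        _mm256_stream_si256(dst, tmp);
    }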
__asm__ __volatile__ (""); as a compiler reordering barrier.
This is the GNU C way of writing a compiler barrier: it stops compile-time reordering without emitting an actual barrier instruction such as mfence, keeping the compiler from moving the memory accesses we care about across this statement. (Note that the conventional full compiler memory barrier in GNU C also adds a "memory" clobber, which tells the compiler that any memory may be read or written across the statement; see the sketch below.)
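As a sketch (the macro names are mine, not used in the code below):

    /* The empty form used in this answer: a volatile asm statement with no
       outputs, acting as a compile-time reordering fence.                  */
    #define barrier_empty()  __asm__ __volatile__ ("")

    /* The conventional full GNU C compiler memory barrier: the "memory"
       clobber tells the compiler any memory may be read or written here.   */
    #define barrier_full()   __asm__ __volatile__ ("" ::: "memory")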
Using an index limit for loop structures.
GCC is known for pretty poor register usage. Earlier versions made a lot of unnecessary moves between registers, although that is pretty minimal nowadays. However, testing on x86-64 across many versions of GCC indicates that in loops it is better to compare against an index limit (here, an end pointer) than to maintain an independent loop variable; the sketch below shows the contrast.
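A minimal sketch of the two loop structures, with function names of my own invention (the real code follows below):

    #include <stddef.h>

    /* Independent loop variable: the counter i is live in addition
       to the two pointers.                                          */
    static void copy_counter(char *dst, const char *src, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            dst[i] = src[i];
    }

    /* Index limit: only the pointers and the precomputed end limit
       are compared each iteration.                                  */
    static void copy_limit(char *dst, const char *src, size_t n)
    {
        const char *const end = src + n;
        while (src < end)
            *dst++ = *src++;
    }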
Combining all the above, I constructed the following function (after a few iterations):
    #include <stdlib.h>
    #include <immintrin.h>

    #define likely(x)   __builtin_expect((x), 1)
    #define unlikely(x) __builtin_expect((x), 0)

    void copy(void *const destination, const void *const source, const size_t bytes)
    {
        __m256i       *dst = (__m256i *)destination;
        const __m256i *src = (const __m256i *)source;
        const __m256i *end = (const __m256i *)source + bytes / sizeof (__m256i);

        while (likely(src < end)) {
            const __m256i m0 = ((volatile const __m256i *)src)[0];
            const __m256i m1 = ((volatile const __m256i *)src)[1];
            const __m256i m2 = ((volatile const __m256i *)src)[2];
            const __m256i m3 = ((volatile const __m256i *)src)[3];

            _mm256_stream_si256( dst,     m0 );
            _mm256_stream_si256( dst + 1, m1 );
            _mm256_stream_si256( dst + 2, m2 );
            _mm256_stream_si256( dst + 3, m3 );

            __asm__ __volatile__ ("");

            src += 4;
            dst += 4;
        }
    }
Compiling it (example.c) with GCC-4.8.4, using
gcc -std=c99 -mavx2 -march=x86-64 -mtune=generic -O2 -S example.c
yields (example.s):
.file "example.c" .text .p2align 4,,15 .globl copy .type copy, @function copy: .LFB993: .cfi_startproc andq $-32, %rdx leaq (%rsi,%rdx), %rcx cmpq %rcx, %rsi jnb .L5 movq %rsi, %rax movq %rdi, %rdx .p2align 4,,10 .p2align 3 .L4: vmovdqa (%rax), %ymm3 vmovdqa 32(%rax), %ymm2 vmovdqa 64(%rax), %ymm1 vmovdqa 96(%rax), %ymm0 vmovntdq %ymm3, (%rdx) vmovntdq %ymm2, 32(%rdx) vmovntdq %ymm1, 64(%rdx) vmovntdq %ymm0, 96(%rdx) subq $-128, %rax subq $-128, %rdx cmpq %rax, %rcx ja .L4 vzeroupper .L5: ret .cfi_endproc .LFE993: .size copy, .-copy .ident "GCC: (Ubuntu 4.8.4-2ubuntu1~14.04) 4.8.4" .section .note.GNU-stack,"",@progbits
The disassembly of the actual compiled (-c instead of -S) code is
    0000000000000000 <copy>:
       0:   48 83 e2 e0             and    $0xffffffffffffffe0,%rdx
       4:   48 8d 0c 16             lea    (%rsi,%rdx,1),%rcx
       8:   48 39 ce                cmp    %rcx,%rsi
       b:   73 41                   jae    4e <copy+0x4e>
       d:   48 89 f0                mov    %rsi,%rax
      10:   48 89 fa                mov    %rdi,%rdx
      13:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
      18:   c5 fd 6f 18             vmovdqa (%rax),%ymm3
      1c:   c5 fd 6f 50 20          vmovdqa 0x20(%rax),%ymm2
      21:   c5 fd 6f 48 40          vmovdqa 0x40(%rax),%ymm1
      26:   c5 fd 6f 40 60          vmovdqa 0x60(%rax),%ymm0
      2b:   c5 fd e7 1a             vmovntdq %ymm3,(%rdx)
      2f:   c5 fd e7 52 20          vmovntdq %ymm2,0x20(%rdx)
      34:   c5 fd e7 4a 40          vmovntdq %ymm1,0x40(%rdx)
      39:   c5 fd e7 42 60          vmovntdq %ymm0,0x60(%rdx)
      3e:   48 83 e8 80             sub    $0xffffffffffffff80,%rax
      42:   48 83 ea 80             sub    $0xffffffffffffff80,%rdx
      46:   48 39 c1                cmp    %rax,%rcx
      49:   77 cd                   ja     18 <copy+0x18>
      4b:   c5 f8 77                vzeroupper
      4e:   c3                      retq
Without any optimizations at all, the code is completely disgusting, full of unnecessary moves, so some optimization is necessary. (The above uses -O2, which is generally the optimization level I use.)
If optimizing for size (-Os), the code looks excellent at first glance,
    0000000000000000 <copy>:
       0:   48 83 e2 e0             and    $0xffffffffffffffe0,%rdx
       4:   48 01 f2                add    %rsi,%rdx
       7:   48 39 d6                cmp    %rdx,%rsi
       a:   73 30                   jae    3c <copy+0x3c>
       c:   c5 fd 6f 1e             vmovdqa (%rsi),%ymm3
      10:   c5 fd 6f 56 20          vmovdqa 0x20(%rsi),%ymm2
      15:   c5 fd 6f 4e 40          vmovdqa 0x40(%rsi),%ymm1
      1a:   c5 fd 6f 46 60          vmovdqa 0x60(%rsi),%ymm0
      1f:   c5 fd e7 1f             vmovntdq %ymm3,(%rdi)
      23:   c5 fd e7 57 20          vmovntdq %ymm2,0x20(%rdi)
      28:   c5 fd e7 4f 40          vmovntdq %ymm1,0x40(%rdi)
      2d:   c5 fd e7 47 60          vmovntdq %ymm0,0x60(%rdi)
      32:   48 83 ee 80             sub    $0xffffffffffffff80,%rsi
      36:   48 83 ef 80             sub    $0xffffffffffffff80,%rdi
      3a:   eb cb                   jmp    7 <copy+0x7>
      3c:   c3                      retq
until you notice that the last jmp is to the comparison, essentially doing a jmp, cmp, and a jae at every iteration, which probably yields pretty poor results.
Note: If you do something similar for real-world code, please do add comments (especially for the __asm__ __volatile__ ("");), and remember to periodically check with all the compilers available to you, to make sure none of them compiles the code too badly.
Looking at Peter Cordes' excellent answer, I decided to iterate the function a bit further, just for fun.
As Ross Ridge mentions in the comments, with _mm256_load_si256() the pointer is never dereferenced through a volatile-qualified type (it is converted back to a plain __m256i * when passed as the parameter), so volatile does not help when using _mm256_load_si256(). In another comment, Seb suggests a workaround: _mm256_load_si256((__m256i []){ *(volatile __m256i *)(src) }), which hands the intrinsic a pointer to a compound-literal array whose element is initialized by a volatile read of *src. For a simple aligned load, I prefer the direct volatile pointer dereference; it matches my intent in the code. (I do aim for KISS, although often I hit only the stupid part of it.) The two forms are shown side by side below.
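For comparison, a minimal sketch of the two load forms (the helper names are mine, purely illustrative):

    #include <immintrin.h>

    /* Seb's workaround: a compound-literal array initialized by a
       volatile read, passed to the load intrinsic.                 */
    static inline __m256i load_via_compound_literal(const void *src)
    {
        return _mm256_load_si256((__m256i []){ *(volatile __m256i *)(src) });
    }

    /* The direct form I prefer: an aligned volatile load. */
    static inline __m256i load_via_volatile_deref(const void *src)
    {
        return ((volatile const __m256i *)src)[0];
    }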
On x86-64, the start of the inner loop is aligned to 16 bytes, so the number of operations in the function "header" part is not really important. Still, avoiding the superfluous binary AND (which clears the five least significant bits of the byte count to copy) is certainly useful in general.
GCC provides two options for this. One is the __builtin_assume_aligned() built-in, which lets the programmer convey all sorts of alignment information to the compiler. The other is typedef'ing a type with an extra attribute, here __attribute__((aligned (32))), which can be used, for example, to convey the alignment of function parameters. Both of these should be available in clang (although the support is recent, and not in 3.5 yet), and may be available in other compilers such as icc (although ICC, AFAIK, uses __assume_aligned()).
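A minimal sketch of the two mechanisms (the helper name load_first_block is mine; the full another.c below uses both):

    #include <immintrin.h>

    /* Option 1: a typedef that carries a 32-byte alignment attribute,
       usable in parameter declarations.                               */
    typedef __m256i __m256i_aligned __attribute__((aligned (32)));

    /* Option 2: __builtin_assume_aligned() returns a pointer the
       compiler may treat as 32-byte aligned.                          */
    static inline __m256i load_first_block(const void *src)
    {
        const __m256i *p = __builtin_assume_aligned(src, 32);
        return p[0];
    }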
One way to mitigate the register shuffling GCC does is to use a helper function. After some further iterations, I arrived at the following, another.c:
    #include <stdlib.h>
    #include <immintrin.h>

    #define likely(x)   __builtin_expect((x), 1)
    #define unlikely(x) __builtin_expect((x), 0)

    #if (__clang_major__+0 >= 3)
    #define IS_ALIGNED(x, n) ((void *)(x))
    #elif (__GNUC__+0 >= 4)
    #define IS_ALIGNED(x, n) __builtin_assume_aligned((x), (n))
    #else
    #define IS_ALIGNED(x, n) ((void *)(x))
    #endif

    typedef __m256i __m256i_aligned __attribute__((aligned (32)));

    void do_copy(register          __m256i_aligned *dst,
                 register volatile __m256i_aligned *src,
                 register          __m256i_aligned *end)
    {
        do {
            register const __m256i m0 = src[0];
            register const __m256i m1 = src[1];
            register const __m256i m2 = src[2];
            register const __m256i m3 = src[3];

            __asm__ __volatile__ ("");

            _mm256_stream_si256( dst,     m0 );
            _mm256_stream_si256( dst + 1, m1 );
            _mm256_stream_si256( dst + 2, m2 );
            _mm256_stream_si256( dst + 3, m3 );

            __asm__ __volatile__ ("");

            src += 4;
            dst += 4;

        } while (likely(src < end));
    }

    void copy(void *dst, const void *src, const size_t bytes)
    {
        if (bytes < 128)
            return;

        do_copy(IS_ALIGNED(dst, 32),
                IS_ALIGNED(src, 32),
                IS_ALIGNED((void *)((char *)src + bytes), 32));
    }
which compiles with gcc -march=x86-64 -mtune=generic -mavx2 -O2 -S another.c to essentially (comments and directives omitted for brevity):
    do_copy:
    .L3:
            vmovdqa  (%rsi), %ymm3
            vmovdqa  32(%rsi), %ymm2
            vmovdqa  64(%rsi), %ymm1
            vmovdqa  96(%rsi), %ymm0
            vmovntdq %ymm3, (%rdi)
            vmovntdq %ymm2, 32(%rdi)
            vmovntdq %ymm1, 64(%rdi)
            vmovntdq %ymm0, 96(%rdi)
            subq     $-128, %rsi
            subq     $-128, %rdi
            cmpq     %rdx, %rsi
            jb       .L3
            vzeroupper
            ret

    copy:
            cmpq     $127, %rdx
            ja       .L8
            rep ret
    .L8:
            addq     %rsi, %rdx
            jmp      do_copy
Further optimization at -O3 just inlines the helper function,
    do_copy:
    .L3:
            vmovdqa  (%rsi), %ymm3
            vmovdqa  32(%rsi), %ymm2
            vmovdqa  64(%rsi), %ymm1
            vmovdqa  96(%rsi), %ymm0
            vmovntdq %ymm3, (%rdi)
            vmovntdq %ymm2, 32(%rdi)
            vmovntdq %ymm1, 64(%rdi)
            vmovntdq %ymm0, 96(%rdi)
            subq     $-128, %rsi
            subq     $-128, %rdi
            cmpq     %rdx, %rsi
            jb       .L3
            vzeroupper
            ret

    copy:
            cmpq     $127, %rdx
            ja       .L10
            rep ret
    .L10:
            leaq     (%rsi,%rdx), %rax
    .L8:
            vmovdqa  (%rsi), %ymm3
            vmovdqa  32(%rsi), %ymm2
            vmovdqa  64(%rsi), %ymm1
            vmovdqa  96(%rsi), %ymm0
            vmovntdq %ymm3, (%rdi)
            vmovntdq %ymm2, 32(%rdi)
            vmovntdq %ymm1, 64(%rdi)
            vmovntdq %ymm0, 96(%rdi)
            subq     $-128, %rsi
            subq     $-128, %rdi
            cmpq     %rsi, %rax
            ja       .L8
            vzeroupper
            ret
and even with -Os the generated code is very nice,
    do_copy:
    .L3:
            vmovdqa  (%rsi), %ymm3
            vmovdqa  32(%rsi), %ymm2
            vmovdqa  64(%rsi), %ymm1
            vmovdqa  96(%rsi), %ymm0
            vmovntdq %ymm3, (%rdi)
            vmovntdq %ymm2, 32(%rdi)
            vmovntdq %ymm1, 64(%rdi)
            vmovntdq %ymm0, 96(%rdi)
            subq     $-128, %rsi
            subq     $-128, %rdi
            cmpq     %rdx, %rsi
            jb       .L3
            ret

    copy:
            cmpq     $127, %rdx
            jbe      .L5
            addq     %rsi, %rdx
            jmp      do_copy
    .L5:
            ret
Of course, without optimizations GCC-4.8.4 still produces pretty bad code. With clang-3.5 -march=x86-64 -mtune=generic -mavx2 -O2 and -Os we get essentially
    do_copy:
    .LBB0_1:
            vmovaps  (%rsi), %ymm0
            vmovaps  32(%rsi), %ymm1
            vmovaps  64(%rsi), %ymm2
            vmovaps  96(%rsi), %ymm3
            vmovntps %ymm0, (%rdi)
            vmovntps %ymm1, 32(%rdi)
            vmovntps %ymm2, 64(%rdi)
            vmovntps %ymm3, 96(%rdi)
            subq     $-128, %rsi
            subq     $-128, %rdi
            cmpq     %rdx, %rsi
            jb       .LBB0_1
            vzeroupper
            retq

    copy:
            cmpq     $128, %rdx
            jb       .LBB1_3
            addq     %rsi, %rdx
    .LBB1_2:
            vmovaps  (%rsi), %ymm0
            vmovaps  32(%rsi), %ymm1
            vmovaps  64(%rsi), %ymm2
            vmovaps  96(%rsi), %ymm3
            vmovntps %ymm0, (%rdi)
            vmovntps %ymm1, 32(%rdi)
            vmovntps %ymm2, 64(%rdi)
            vmovntps %ymm3, 96(%rdi)
            subq     $-128, %rsi
            subq     $-128, %rdi
            cmpq     %rdx, %rsi
            jb       .LBB1_2
    .LBB1_3:
            vzeroupper
            retq
I like the another.c code (it suits my coding style), and I'm happy with the code generated by GCC-4.8.4 and clang-3.5 at -O1, -O2, -O3, and -Os on both, so I think it is good enough for me. (Note, however, that I haven't actually benchmarked any of this, because I don't have the relevant code. We use both temporal and non-temporal (nt) memory accesses, and cache behaviour (and cache interaction with the surrounding code) is paramount for things like this, so it would make no sense to microbenchmark this, I think.)
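If you want to try another.c yourself, a minimal caller might look like the following sketch (the buffer size is an arbitrary pick of mine; aligned_alloc() is C11, and this copy() expects 32-byte-aligned buffers and a byte count that is a multiple of 128, since it copies in 128-byte blocks):

    #include <stdlib.h>
    #include <string.h>
    #include <stdio.h>

    /* copy() as defined in another.c above. */
    void copy(void *dst, const void *src, size_t bytes);

    int main(void)
    {
        const size_t bytes = 1 << 20;            /* 1 MiB: a multiple of 128     */
        char *src = aligned_alloc(32, bytes);    /* 32-byte aligned buffers      */
        char *dst = aligned_alloc(32, bytes);
        if (!src || !dst)
            return EXIT_FAILURE;

        memset(src, 0x55, bytes);                /* fill the source with a pattern */
        copy(dst, src, bytes);

        printf("match: %s\n", memcmp(dst, src, bytes) ? "no" : "yes");
        free(src);
        free(dst);
        return EXIT_SUCCESS;
    }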