I find this problem interesting. GCC is known for producing less-than-optimal code, but it is fascinating to find ways to "encourage" it into producing better code (for the hottest/bottleneck code only, of course), without micro-managing too heavily. In this particular case, I looked at three "tools" I use for such situations:
volatile: If it is important that the memory accesses occur in a specific order, then volatile is a suitable tool. Note that it can be overkill, and it causes a separate load every time a volatile pointer is dereferenced.
SSE/AVX load/store intrinsics can't be used with volatile pointers, because they are functions whose pointer parameters are not volatile-qualified. Using something like _mm256_load_si256((volatile __m256i *)src); implicitly converts the pointer to const __m256i *, losing the volatile qualifier.
We can directly dereference volatile pointers, though. (load/store intrinsics are only needed when we need to tell the compiler that the data might be unaligned, or that we want a streaming store.)
    m0 = ((volatile __m256i *)src)[0];
    m1 = ((volatile __m256i *)src)[1];
    m2 = ((volatile __m256i *)src)[2];
    m3 = ((volatile __m256i *)src)[3];
Unfortunately, this doesn't help with the stores, because we want to emit streaming (non-temporal) stores. A plain *(volatile __m256i *)dst = tmp; compiles to an ordinary store, which is not what we want; the two forms are contrasted in the sketch below.
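To make the contrast concrete, here is a minimal sketch (the helper names are mine, purely illustrative):

    #include <immintrin.h>

    /* Hypothetical helpers, only to contrast the two store forms. */
    static inline void store_plain(__m256i *dst, __m256i tmp)
    {
        /* A plain volatile store compiles to an ordinary (cached) store,
           e.g. vmovdqa.                                                   */
        *(volatile __m256i *)dst = tmp;
    }

    static inline void store_streaming(__m256i *dst, __m256i tmp)
    {
        /* The streaming-store intrinsic compiles to a non-temporal store,
           e.g. vmovntdq, which bypasses the cache.                        */
        _mm256_stream_si256(dst, tmp);
    }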
__asm__ __volatile__ (""); as a compiler reordering barrier.
This is the GNU C way of writing a compiler barrier: it stops compile-time reordering without emitting an actual barrier instruction such as mfence, keeping the compiler from moving the memory accesses we care about across this statement. (Note that the conventional full compiler memory barrier in GNU C also adds a "memory" clobber, which tells the compiler that any memory may be read or written across the statement; see the sketch below.)
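As a sketch (the macro names are mine, not used in the code below):

    /* The empty form used in this answer: a volatile asm statement with no
       outputs, acting as a compile-time reordering fence.                  */
    #define barrier_empty()  __asm__ __volatile__ ("")

    /* The conventional full GNU C compiler memory barrier: the "memory"
       clobber tells the compiler any memory may be read or written here.   */
    #define barrier_full()   __asm__ __volatile__ ("" ::: "memory")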
Using an index limit for loop structures.
GCC is known for pretty poor register usage. Earlier versions made a lot of unnecessary moves between registers, although that is pretty minimal nowadays. However, testing on x86-64 across many versions of GCC indicates that in loops it is better to compare against an index limit (here, an end pointer) than to maintain an independent loop variable; the sketch below shows the contrast.
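A minimal sketch of the two loop structures, with function names of my own invention (the real code follows below):

    #include <stddef.h>

    /* Independent loop variable: the counter i is live in addition
       to the two pointers.                                          */
    static void copy_counter(char *dst, const char *src, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            dst[i] = src[i];
    }

    /* Index limit: only the pointers and the precomputed end limit
       are compared each iteration.                                  */
    static void copy_limit(char *dst, const char *src, size_t n)
    {
        const char *const end = src + n;
        while (src < end)
            *dst++ = *src++;
    }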
Combining all the above, I constructed the following function (after a few iterations):
    #include <stdlib.h>
    #include <immintrin.h>

    #define likely(x)   __builtin_expect((x), 1)
    #define unlikely(x) __builtin_expect((x), 0)

    void copy(void *const destination, const void *const source, const size_t bytes)
    {
        __m256i       *dst = (__m256i *)destination;
        const __m256i *src = (const __m256i *)source;
        const __m256i *end = (const __m256i *)source + bytes / sizeof (__m256i);

        while (likely(src < end)) {
            const __m256i m0 = ((volatile const __m256i *)src)[0];
            const __m256i m1 = ((volatile const __m256i *)src)[1];
            const __m256i m2 = ((volatile const __m256i *)src)[2];
            const __m256i m3 = ((volatile const __m256i *)src)[3];

            _mm256_stream_si256( dst,     m0 );
            _mm256_stream_si256( dst + 1, m1 );
            _mm256_stream_si256( dst + 2, m2 );
            _mm256_stream_si256( dst + 3, m3 );

            __asm__ __volatile__ ("");

            src += 4;
            dst += 4;
        }
    }
Compiling it (example.c) with GCC-4.8.4, using
gcc -std=c99 -mavx2 -march=x86-64 -mtune=generic -O2 -S example.c
yields (example.s):
.file "example.c" .text .p2align 4,,15 .globl copy .type copy, @function copy: .LFB993: .cfi_startproc andq $-32, %rdx leaq (%rsi,%rdx), %rcx cmpq %rcx, %rsi jnb .L5 movq %rsi, %rax movq %rdi, %rdx .p2align 4,,10 .p2align 3 .L4: vmovdqa (%rax), %ymm3 vmovdqa 32(%rax), %ymm2 vmovdqa 64(%rax), %ymm1 vmovdqa 96(%rax), %ymm0 vmovntdq %ymm3, (%rdx) vmovntdq %ymm2, 32(%rdx) vmovntdq %ymm1, 64(%rdx) vmovntdq %ymm0, 96(%rdx) subq $-128, %rax subq $-128, %rdx cmpq %rax, %rcx ja .L4 vzeroupper .L5: ret .cfi_endproc .LFE993: .size copy, .-copy .ident "GCC: (Ubuntu 4.8.4-2ubuntu1~14.04) 4.8.4" .section .note.GNU-stack,"",@progbits
The disassembly of the actual compiled (-c instead of -S) code is
    0000000000000000 <copy>:
       0:   48 83 e2 e0             and    $0xffffffffffffffe0,%rdx
       4:   48 8d 0c 16             lea    (%rsi,%rdx,1),%rcx
       8:   48 39 ce                cmp    %rcx,%rsi
       b:   73 41                   jae    4e <copy+0x4e>
       d:   48 89 f0                mov    %rsi,%rax
      10:   48 89 fa                mov    %rdi,%rdx
      13:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
      18:   c5 fd 6f 18             vmovdqa (%rax),%ymm3
      1c:   c5 fd 6f 50 20          vmovdqa 0x20(%rax),%ymm2
      21:   c5 fd 6f 48 40          vmovdqa 0x40(%rax),%ymm1
      26:   c5 fd 6f 40 60          vmovdqa 0x60(%rax),%ymm0
      2b:   c5 fd e7 1a             vmovntdq %ymm3,(%rdx)
      2f:   c5 fd e7 52 20          vmovntdq %ymm2,0x20(%rdx)
      34:   c5 fd e7 4a 40          vmovntdq %ymm1,0x40(%rdx)
      39:   c5 fd e7 42 60          vmovntdq %ymm0,0x60(%rdx)
      3e:   48 83 e8 80             sub    $0xffffffffffffff80,%rax
      42:   48 83 ea 80             sub    $0xffffffffffffff80,%rdx
      46:   48 39 c1                cmp    %rax,%rcx
      49:   77 cd                   ja     18 <copy+0x18>
      4b:   c5 f8 77                vzeroupper
      4e:   c3                      retq
Without any optimizations at all, the code is completely disgusting, full of unnecessary moves, so some optimization is necessary. (The above uses -O2, which is generally the optimization level I use.)
If optimizing for size (-Os), the code looks excellent at first glance,
    0000000000000000 <copy>:
       0:   48 83 e2 e0             and    $0xffffffffffffffe0,%rdx
       4:   48 01 f2                add    %rsi,%rdx
       7:   48 39 d6                cmp    %rdx,%rsi
       a:   73 30                   jae    3c <copy+0x3c>
       c:   c5 fd 6f 1e             vmovdqa (%rsi),%ymm3
      10:   c5 fd 6f 56 20          vmovdqa 0x20(%rsi),%ymm2
      15:   c5 fd 6f 4e 40          vmovdqa 0x40(%rsi),%ymm1
      1a:   c5 fd 6f 46 60          vmovdqa 0x60(%rsi),%ymm0
      1f:   c5 fd e7 1f             vmovntdq %ymm3,(%rdi)
      23:   c5 fd e7 57 20          vmovntdq %ymm2,0x20(%rdi)
      28:   c5 fd e7 4f 40          vmovntdq %ymm1,0x40(%rdi)
      2d:   c5 fd e7 47 60          vmovntdq %ymm0,0x60(%rdi)
      32:   48 83 ee 80             sub    $0xffffffffffffff80,%rsi
      36:   48 83 ef 80             sub    $0xffffffffffffff80,%rdi
      3a:   eb cb                   jmp    7 <copy+0x7>
      3c:   c3                      retq
until you notice that the last jmp is to the comparison, essentially doing a jmp, cmp, and a jae at every iteration, which probably yields pretty poor results.
Note: If you do something similar for real-world code, please do add comments (especially for the __asm__ __volatile__ ("");), and remember to periodically check with all the compilers available to you, to make sure none of them compiles the code too badly.
Looking at Peter Cordes' excellent answer, I decided to iterate the function a bit further, just for fun.
As Ross Ridge mentions in the comments, with _mm256_load_si256() the pointer is never dereferenced through a volatile-qualified type (it is converted back to a plain __m256i * when passed as the parameter), so volatile does not help when using _mm256_load_si256(). In another comment, Seb suggests a workaround: _mm256_load_si256((__m256i []){ *(volatile __m256i *)(src) }), which hands the intrinsic a pointer to a compound-literal array whose element is initialized by a volatile read of *src. For a simple aligned load, I prefer the direct volatile pointer dereference; it matches my intent in the code. (I do aim for KISS, although often I hit only the stupid part of it.) The two forms are shown side by side below.
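For comparison, a minimal sketch of the two load forms (the helper names are mine, purely illustrative):

    #include <immintrin.h>

    /* Seb's workaround: a compound-literal array initialized by a
       volatile read, passed to the load intrinsic.                 */
    static inline __m256i load_via_compound_literal(const void *src)
    {
        return _mm256_load_si256((__m256i []){ *(volatile __m256i *)(src) });
    }

    /* The direct form I prefer: an aligned volatile load. */
    static inline __m256i load_via_volatile_deref(const void *src)
    {
        return ((volatile const __m256i *)src)[0];
    }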
On x86-64, the start of the inner loop is aligned to 16 bytes, so the number of operations in the function "header" part is not really important. Still, avoiding the superfluous binary AND (which clears the five least significant bits of the byte count to copy) is certainly useful in general.
GCC provides two options for this. One is the __builtin_assume_aligned() built-in, which lets the programmer convey all sorts of alignment information to the compiler. The other is typedef'ing a type with an extra attribute, here __attribute__((aligned (32))), which can be used, for example, to convey the alignment of function parameters. Both of these should be available in clang (although the support is recent, and not in 3.5 yet), and may be available in other compilers such as icc (although ICC, AFAIK, uses __assume_aligned()).
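A minimal sketch of the two mechanisms (the helper name load_first_block is mine; the full another.c below uses both):

    #include <immintrin.h>

    /* Option 1: a typedef that carries a 32-byte alignment attribute,
       usable in parameter declarations.                               */
    typedef __m256i __m256i_aligned __attribute__((aligned (32)));

    /* Option 2: __builtin_assume_aligned() returns a pointer the
       compiler may treat as 32-byte aligned.                          */
    static inline __m256i load_first_block(const void *src)
    {
        const __m256i *p = __builtin_assume_aligned(src, 32);
        return p[0];
    }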
One way to mitigate the register shuffling GCC does is to use a helper function. After some further iterations, I arrived at the following, another.c:
    #include <stdlib.h>
    #include <immintrin.h>

    #define likely(x)   __builtin_expect((x), 1)
    #define unlikely(x) __builtin_expect((x), 0)

    #if (__clang_major__+0 >= 3)
    #define IS_ALIGNED(x, n) ((void *)(x))
    #elif (__GNUC__+0 >= 4)
    #define IS_ALIGNED(x, n) __builtin_assume_aligned((x), (n))
    #else
    #define IS_ALIGNED(x, n) ((void *)(x))
    #endif

    typedef __m256i __m256i_aligned __attribute__((aligned (32)));

    void do_copy(register          __m256i_aligned *dst,
                 register volatile __m256i_aligned *src,
                 register          __m256i_aligned *end)
    {
        do {
            register const __m256i m0 = src[0];
            register const __m256i m1 = src[1];
            register const __m256i m2 = src[2];
            register const __m256i m3 = src[3];

            __asm__ __volatile__ ("");

            _mm256_stream_si256( dst,     m0 );
            _mm256_stream_si256( dst + 1, m1 );
            _mm256_stream_si256( dst + 2, m2 );
            _mm256_stream_si256( dst + 3, m3 );

            __asm__ __volatile__ ("");

            src += 4;
            dst += 4;

        } while (likely(src < end));
    }

    void copy(void *dst, const void *src, const size_t bytes)
    {
        if (bytes < 128)
            return;

        do_copy(IS_ALIGNED(dst, 32),
                IS_ALIGNED(src, 32),
                IS_ALIGNED((void *)((char *)src + bytes), 32));
    }
which compiles with gcc -march=x86-64 -mtune=generic -mavx2 -O2 -S another.c to essentially (comments and directives omitted for brevity):
    do_copy:
    .L3:
            vmovdqa  (%rsi), %ymm3
            vmovdqa  32(%rsi), %ymm2
            vmovdqa  64(%rsi), %ymm1
            vmovdqa  96(%rsi), %ymm0
            vmovntdq %ymm3, (%rdi)
            vmovntdq %ymm2, 32(%rdi)
            vmovntdq %ymm1, 64(%rdi)
            vmovntdq %ymm0, 96(%rdi)
            subq     $-128, %rsi
            subq     $-128, %rdi
            cmpq     %rdx, %rsi
            jb       .L3
            vzeroupper
            ret

    copy:
            cmpq     $127, %rdx
            ja       .L8
            rep ret
    .L8:
            addq     %rsi, %rdx
            jmp      do_copy
Further optimization at -O3 just inlines the helper function,
    do_copy:
    .L3:
            vmovdqa  (%rsi), %ymm3
            vmovdqa  32(%rsi), %ymm2
            vmovdqa  64(%rsi), %ymm1
            vmovdqa  96(%rsi), %ymm0
            vmovntdq %ymm3, (%rdi)
            vmovntdq %ymm2, 32(%rdi)
            vmovntdq %ymm1, 64(%rdi)
            vmovntdq %ymm0, 96(%rdi)
            subq     $-128, %rsi
            subq     $-128, %rdi
            cmpq     %rdx, %rsi
            jb       .L3
            vzeroupper
            ret

    copy:
            cmpq     $127, %rdx
            ja       .L10
            rep ret
    .L10:
            leaq     (%rsi,%rdx), %rax
    .L8:
            vmovdqa  (%rsi), %ymm3
            vmovdqa  32(%rsi), %ymm2
            vmovdqa  64(%rsi), %ymm1
            vmovdqa  96(%rsi), %ymm0
            vmovntdq %ymm3, (%rdi)
            vmovntdq %ymm2, 32(%rdi)
            vmovntdq %ymm1, 64(%rdi)
            vmovntdq %ymm0, 96(%rdi)
            subq     $-128, %rsi
            subq     $-128, %rdi
            cmpq     %rsi, %rax
            ja       .L8
            vzeroupper
            ret
and even with -Os the generated code is very nice,
    do_copy:
    .L3:
            vmovdqa  (%rsi), %ymm3
            vmovdqa  32(%rsi), %ymm2
            vmovdqa  64(%rsi), %ymm1
            vmovdqa  96(%rsi), %ymm0
            vmovntdq %ymm3, (%rdi)
            vmovntdq %ymm2, 32(%rdi)
            vmovntdq %ymm1, 64(%rdi)
            vmovntdq %ymm0, 96(%rdi)
            subq     $-128, %rsi
            subq     $-128, %rdi
            cmpq     %rdx, %rsi
            jb       .L3
            ret

    copy:
            cmpq     $127, %rdx
            jbe      .L5
            addq     %rsi, %rdx
            jmp      do_copy
    .L5:
            ret
Of course, without optimizations GCC-4.8.4 still produces pretty bad code. With clang-3.5 -march=x86-64 -mtune=generic -mavx2 -O2 and -Os we get essentially
    do_copy:
    .LBB0_1:
            vmovaps  (%rsi), %ymm0
            vmovaps  32(%rsi), %ymm1
            vmovaps  64(%rsi), %ymm2
            vmovaps  96(%rsi), %ymm3
            vmovntps %ymm0, (%rdi)
            vmovntps %ymm1, 32(%rdi)
            vmovntps %ymm2, 64(%rdi)
            vmovntps %ymm3, 96(%rdi)
            subq     $-128, %rsi
            subq     $-128, %rdi
            cmpq     %rdx, %rsi
            jb       .LBB0_1
            vzeroupper
            retq

    copy:
            cmpq     $128, %rdx
            jb       .LBB1_3
            addq     %rsi, %rdx
    .LBB1_2:
            vmovaps  (%rsi), %ymm0
            vmovaps  32(%rsi), %ymm1
            vmovaps  64(%rsi), %ymm2
            vmovaps  96(%rsi), %ymm3
            vmovntps %ymm0, (%rdi)
            vmovntps %ymm1, 32(%rdi)
            vmovntps %ymm2, 64(%rdi)
            vmovntps %ymm3, 96(%rdi)
            subq     $-128, %rsi
            subq     $-128, %rdi
            cmpq     %rdx, %rsi
            jb       .LBB1_2
    .LBB1_3:
            vzeroupper
            retq
I like the another.c code (it suits my coding style), and I'm happy with the code generated by GCC-4.8.4 and clang-3.5 at -O1, -O2, -O3, and -Os on both, so I think it is good enough for me. (Note, however, that I haven't actually benchmarked any of this, because I don't have the relevant code. We use both temporal and non-temporal (nt) memory accesses, and cache behaviour (and cache interaction with the surrounding code) is paramount for things like this, so it would make no sense to microbenchmark this, I think.)
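If you want to try another.c yourself, a minimal caller might look like the following sketch (the buffer size is an arbitrary pick of mine; aligned_alloc() is C11, and this copy() expects 32-byte-aligned buffers and a byte count that is a multiple of 128, since it copies in 128-byte blocks):

    #include <stdlib.h>
    #include <string.h>
    #include <stdio.h>

    /* copy() as defined in another.c above. */
    void copy(void *dst, const void *src, size_t bytes);

    int main(void)
    {
        const size_t bytes = 1 << 20;            /* 1 MiB: a multiple of 128     */
        char *src = aligned_alloc(32, bytes);    /* 32-byte aligned buffers      */
        char *dst = aligned_alloc(32, bytes);
        if (!src || !dst)
            return EXIT_FAILURE;

        memset(src, 0x55, bytes);                /* fill the source with a pattern */
        copy(dst, src, bytes);

        printf("match: %s\n", memcmp(dst, src, bytes) ? "no" : "yes");
        free(src);
        free(dst);
        return EXIT_SUCCESS;
    }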