The fastest way to move a block of memory is usually going to be memcpy() from <string.h>. If you memcpy() from a to temp, memmove() from b to a, then memcpy() from temp to b, you’ll have a swap that uses the optimized library routines, which the compiler probably inlines. You wouldn’t want to copy the entire block at once, though, but rather in vector-sized chunks, so that temp stays small.
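As a rough sketch of that three-copy approach, the function below swaps two blocks through a small temporary buffer, one chunk at a time. The name swap_blocks_with_memcpy and the 64-byte SWAP_CHUNK size are illustrative assumptions, not tuned values, and the sketch assumes the two blocks do not overlap, which is why it uses memcpy() for every step.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical chunk size: a small multiple of a typical vector width. */
#define SWAP_CHUNK 64

/* Swap n bytes between a and b through a small temporary buffer.
 * Assumption: the two blocks do not overlap. */
void swap_blocks_with_memcpy( void* const a, void* const b, size_t n )
{
  unsigned char* p = a;
  unsigned char* q = b;
  unsigned char temp[SWAP_CHUNK];

  while ( n > 0 ) {
    /* The final chunk may be shorter than SWAP_CHUNK. */
    const size_t len = ( n < SWAP_CHUNK ) ? n : SWAP_CHUNK;

    memcpy( temp, p, len );  /* a    -> temp */
    memcpy( p, q, len );     /* b    -> a    */
    memcpy( q, temp, len );  /* temp -> b    */

    p += len;
    q += len;
    n -= len;
  }
}
```

Because each call copies a small, bounded number of bytes, the compiler has a reasonable chance of inlining the copies, but whether it does depends on the compiler and flags.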
In practice, if you write a tight loop, the compiler can probably tell that you’re swapping every element of the arrays and optimize accordingly. On most modern CPUs, you want it to generate vector instructions, and it might be able to generate faster code if you make sure all three buffers are aligned.
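One way to get aligned buffers, sketched here under assumptions, is the C11 alignas specifier from <stdalign.h>. The names block_a, block_b and is_vector_aligned, the 256-byte size, and the 16-byte alignment (the width of an SSE register) are all illustrative choices; wider vector units would want a larger alignment.

```c
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE   256  /* hypothetical block size */
#define VECTOR_ALIGN 16   /* assumed vector width in bytes (SSE) */

/* Statically allocated blocks, aligned to the assumed vector width. */
static alignas(VECTOR_ALIGN) unsigned char block_a[BLOCK_SIZE];
static alignas(VECTOR_ALIGN) unsigned char block_b[BLOCK_SIZE];

/* Returns nonzero if p is aligned to VECTOR_ALIGN bytes. */
int is_vector_aligned( const void* const p )
{
  return ( (uintptr_t)p % VECTOR_ALIGN ) == 0;
}
```

Alignment alone does not guarantee aligned vector loads in the generated code, since a function taking void* parameters cannot know how its arguments were declared.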
However, what you really want to do is make things easier for the optimizer. Take this program:

    #include <stddef.h>

    void swap_blocks_with_loop( void* const a, void* const b, const size_t n )
    {
      unsigned char* p;
      unsigned char* q;
      unsigned char* const sentry = (unsigned char*)a + n;

      for ( p = a, q = b; p < sentry; ++p, ++q ) {
         const unsigned char t = *p;
         *p = *q;
         *q = t;
      }
    }

If you translate that into machine code as literally written, it’s a terrible algorithm, copying one byte at a time, doing two increments per iteration, and so on. In practice, though, the compiler sees what you’re really trying to do.
Clang 5.0.1 with -std=c11 -O3 produces (in part) the following inner loop on x86_64:

    .LBB0_7:                                # =>This Inner Loop Header: Depth=1
            movups  (%rcx,%rax), %xmm0
            movups  16(%rcx,%rax), %xmm1
            movups  (%rdx,%rax), %xmm2
            movups  16(%rdx,%rax), %xmm3
            movups  %xmm2, (%rcx,%rax)
            movups  %xmm3, 16(%rcx,%rax)
            movups  %xmm0, (%rdx,%rax)
            movups  %xmm1, 16(%rdx,%rax)
            movups  32(%rcx,%rax), %xmm0
            movups  48(%rcx,%rax), %xmm1
            movups  32(%rdx,%rax), %xmm2
            movups  48(%rdx,%rax), %xmm3
            movups  %xmm2, 32(%rcx,%rax)
            movups  %xmm3, 48(%rcx,%rax)
            movups  %xmm0, 32(%rdx,%rax)
            movups  %xmm1, 48(%rdx,%rax)
            addq    $64, %rax
            addq    $2, %rsi
            jne     .LBB0_7

gcc 7.2.0 with the same flags also vectorizes, but unrolls the loop less:

    .L7:
            movdqa  (%rcx,%rax), %xmm0
            addq    $1, %r9
            movdqu  (%rdx,%rax), %xmm1
            movaps  %xmm1, (%rcx,%rax)
            movups  %xmm0, (%rdx,%rax)
            addq    $16, %rax
            cmpq    %r9, %rbx
            ja      .L7

Convincing the compiler to produce instructions that work on a single word at a time, instead of vectorizing the loop, is the opposite of what you want!
a and b indexes to the elements to be swapped?