This question is an extension of this one. Here I present two possible solutions and I want to know whether they are feasible. I am targeting the Haswell microarchitecture with the GCC/ICC compilers, and I assume that the memory is aligned.
OPTION 1 - I have a memory position already allocated and do 3 memory moves (I use `memmove` instead of `memcpy` to avoid the copy constructor):
```cpp
static void *aux;   // preallocated buffer of at least TO_MOVE bytes

void swap_memory(void *A, void *B, size_t TO_MOVE) {
    memmove(aux, B, TO_MOVE);
    memmove(B, A, TO_MOVE);
    memmove(A, aux, TO_MOVE);
}
```

OPTION 2 - Use AVX or AVX2 loads and stores, taking advantage of the aligned memory. For this solution I assume the data type being swapped is `int`:
```cpp
void swap_memory(int *A, int *B, int NUM_ELEMS) {
    int i, STOP_VEC = NUM_ELEMS - NUM_ELEMS % 8;
    __m256i data_A, data_B;
    for (i = 0; i < STOP_VEC; i += 8) {
        data_A = _mm256_load_si256((__m256i*)&A[i]);
        data_B = _mm256_load_si256((__m256i*)&B[i]);
        _mm256_store_si256((__m256i*)&A[i], data_B);
        _mm256_store_si256((__m256i*)&B[i], data_A);
    }
    // scalar tail for the elements that don't fill a full vector
    for (; i < NUM_ELEMS; i++) {
        std::swap(A[i], B[i]);
    }
}
```

Is option 2 the fastest? Is there another, faster implementation that I didn't mention?
With `__restrict__`, I would expect gcc/icc to vectorize the loops for you. Without `__restrict__`, I'm not sure how many compilers these days will add runtime tests for non-overlapping ranges to check whether it's safe to reorder the operations or not.
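To illustrate the point about `__restrict__`, here is a minimal sketch of a plain scalar swap loop with restrict-qualified pointers. The function name `swap_memory_restrict` is my own; `__restrict__` is a GCC/ICC extension (the C99 keyword is `restrict`). With the aliasing promise in place, GCC and ICC can typically auto-vectorize a loop like this at `-O3 -mavx2` without hand-written intrinsics:

```cpp
#include <cstddef>

// Hypothetical scalar swap. The __restrict__ qualifiers promise the
// compiler that A and B do not overlap, which is what lets it reorder
// the loads and stores and emit vector code on its own.
void swap_memory_restrict(int* __restrict__ A, int* __restrict__ B,
                          int NUM_ELEMS)
{
    for (int i = 0; i < NUM_ELEMS; i++) {
        int tmp = A[i];   // plain load/store pattern; no intrinsics needed
        A[i] = B[i];
        B[i] = tmp;
    }
}
```

Whether the generated code actually uses aligned AVX loads depends on the compiler version and flags, so it is worth checking the assembly output for your toolchain.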