2

This question is an extension of this one. Here I present two possible solutions and I want to know whether they are feasible. I am targeting the Haswell microarchitecture with the GCC/ICC compilers, and I assume that the memory is aligned.

OPTION 1 - I have a memory position already allocated and do 3 memory moves. (I use memmove instead of memcpy to avoid the copy constructor).

```cpp
// aux is a buffer of at least TO_MOVE bytes, allocated elsewhere in advance
void swap_memory(void *A, void *B, size_t TO_MOVE) {
    memmove(aux, B, TO_MOVE);
    memmove(B, A, TO_MOVE);
    memmove(A, aux, TO_MOVE);
}
```

OPTION 2 - Use AVX or AVX2 loads and stores, taking advantage of the aligned memory. For this solution I assume the data being swapped is of type int.

```cpp
void swap_memory(int *A, int *B, int NUM_ELEMS) {
    int i, STOP_VEC = NUM_ELEMS - NUM_ELEMS % 8;
    __m256i data_A, data_B;
    for (i = 0; i < STOP_VEC; i += 8) {
        data_A = _mm256_load_si256((__m256i*)&A[i]);
        data_B = _mm256_load_si256((__m256i*)&B[i]);
        _mm256_store_si256((__m256i*)&A[i], data_B);
        _mm256_store_si256((__m256i*)&B[i], data_A);
    }
    for (; i < NUM_ELEMS; i++) {
        std::swap(A[i], B[i]);
    }
}
```

Is option 2 the fastest? Is there another, faster implementation that I didn't mention?

8 Comments

  • Have you profiled it? Commented May 19, 2016 at 16:42
  • I would have guessed that (with optimizations turned on) gcc/icc would vectorize the loops for you, rather than requiring you to do it manually. Commented May 19, 2016 at 16:46
  • OP: "I use memmove instead of memcpy to avoid the copy constructor" → what? Both those functions work only with raw bytes; neither uses copy (or move) constructors or assignment operators. The first works correctly with overlapping ranges, though. Commented May 19, 2016 at 16:47
  • It's probably more of a design issue if you need to do all this copying - you should just be able to swap two pointers, surely? Commented May 19, 2016 at 16:52
  • ... scratch that -- if the pointers were marked __restrict__, I would expect gcc/icc to vectorize the loops for you. Without __restrict__, I'm not sure how many compilers these days will add tests for non-overlapping ranges to check whether it's safe to reorder the operations or not. Commented May 19, 2016 at 16:56

3 Answers

2

If you know for sure that the memory is aligned, using AVX may be best. Note that doing it explicitly may not be portable - it might be better to decorate the pointers such that they're known to be aligned (e.g. using an aligned attribute or similar.)

Most likely option 2 (or something semantically equivalent) will be faster, since the pointers aren't restrict-qualified: the compiler may not know that it's safe to reorder the memory accesses or to leave "aux" untouched.

Further, option 2 may be more threadsafe depending on how aux is set up.

It might be fine to use a local temporary and memcpy to/from that temporary in blocks or even all at once, as gcc might be able to vectorize that. Avoid using external temporaries, and make sure all of your structures are decorated as aligned.
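The block-copy idea in the last paragraph could be sketched like this (my sketch, not this answer's code; the function name and the BLOCK size are made up):

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical sketch: swap two byte ranges through a small local
// buffer, one fixed-size block at a time. The temporary lives on the
// stack, so gcc has a chance to vectorize or inline the memcpy calls.
void swap_memory_blocked(void *A, void *B, size_t n) {
    constexpr size_t BLOCK = 256;   // tuning knob, not from the answer
    unsigned char tmp[BLOCK];
    unsigned char *a = static_cast<unsigned char *>(A);
    unsigned char *b = static_cast<unsigned char *>(B);
    while (n >= BLOCK) {
        std::memcpy(tmp, a, BLOCK);
        std::memcpy(a, b, BLOCK);
        std::memcpy(b, tmp, BLOCK);
        a += BLOCK; b += BLOCK; n -= BLOCK;
    }
    if (n) {                        // leftover tail, smaller than one block
        std::memcpy(tmp, a, n);
        std::memcpy(a, b, n);
        std::memcpy(b, tmp, n);
    }
}
```

Note the per-block `memcpy` count is fixed at compile time, which is what gives the compiler something to vectorize, as opposed to one variable-size `memcpy` through an external buffer.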


4 Comments

I don't think you can use alignas to tell gcc that a pointer point to aligned memory. alignas only seems to work for aligning data itself (e.g. alignas(32) foo[32]). Option2 is better because the compiler almost certainly won't optimize the memcpy version into an interleaved loop that doesn't touch aux. Your last paragraph will probably end up making multiple calls to memcpy with a small count.
Perhaps a pointer to aligned_storage then? I usually just use gcc attributes or defines, etc. and forgot that alignas isn't a thing for pointed to memory.
The gcc-specific way is A = __builtin_assume_aligned(A, 32). You can also typedef __attribute__((aligned(32))) int align32_int, and then void foo(align32_int *A) {}. IIRC, that does work, while alignas in that same typedef doesn't
Right, the typedef method is the way I typically would do it. Shame alignas can't be used that way.
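The builtin mentioned in the comments above could be used like this (a sketch under the assumption that the caller really does pass 32-byte-aligned pointers; the function name is made up):

```cpp
// Hypothetical sketch: __builtin_assume_aligned (GCC/Clang) tells the
// auto-vectorizer the pointers are 32-byte aligned, so it can emit
// aligned vector loads/stores without a runtime alignment check.
// Passing misaligned pointers here would be undefined behavior.
void swap_ints(int *A, int *B, int n) {
    int *a = static_cast<int *>(__builtin_assume_aligned(A, 32));
    int *b = static_cast<int *>(__builtin_assume_aligned(B, 32));
    for (int i = 0; i < n; ++i) {
        int t = a[i];
        a[i] = b[i];
        b[i] = t;
    }
}
```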
0

Option 2 does fewer reads, so I would expect it to be faster (of course it all depends on the size of the data; the performance advantage will be much smaller if everything fits in the cache).

You can also use the AVX intrinsic _mm256_stream_si256 instead of the stores (then you'll need a fence before reading the memory again).
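A sketch of the streaming-store pattern, using the 128-bit SSE2 analogue _mm_stream_si128 rather than the _mm256_stream_si256 mentioned above (SSE2 is baseline on x86-64, so this compiles without extra -m flags; the function name is made up, n must be a multiple of 4, and both pointers must be 16-byte aligned):

```cpp
#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical sketch: non-temporal (streaming) stores bypass the
// cache on the way out; the sfence at the end makes them globally
// visible before any subsequent reads of the swapped memory.
void swap_ints_nt(int *A, int *B, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128i a = _mm_load_si128(reinterpret_cast<__m128i *>(&A[i]));
        __m128i b = _mm_load_si128(reinterpret_cast<__m128i *>(&B[i]));
        _mm_stream_si128(reinterpret_cast<__m128i *>(&A[i]), b);
        _mm_stream_si128(reinterpret_cast<__m128i *>(&B[i]), a);
    }
    _mm_sfence();  // fence before the memory is read again
}
```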

1 Comment

NT stores are worse if the buffer is small and you're going to read it soon. They evict the destination even if it was hot in cache.
0

I would just do the following:

```cpp
// Casts added: in C++, void* does not convert implicitly to unsigned char*
unsigned char t;
unsigned char *da = (unsigned char *)A, *db = (unsigned char *)B;
while (TO_MOVE--) {
    t = *da;
    *da++ = *db;
    *db++ = t;
}
```

On the basis that it's super clear, and the optimiser is going to have a good chance of doing a good job.

1 Comment

This is good with auto-vectorization (-O3), but absolute garbage without (-O2). You can use -fopenmp and #pragma omp simd even at -O2, though, to get good results with gcc. Since you have ints, and an element count of ints, it's silly to cast it to char. That makes the scalar cleanup loop worse.
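The pragma suggested in this comment could look like the sketch below (my illustration, with the loop kept in terms of int as the comment recommends; with gcc, build with -fopenmp or -fopenmp-simd so the pragma takes effect even at -O2, otherwise it is silently ignored and the loop still runs correctly as plain scalar code):

```cpp
// Hypothetical sketch: ask for vectorization explicitly instead of
// relying on -O3 auto-vectorization of the byte-wise loop.
void swap_ints_simd(int *A, int *B, int n) {
    #pragma omp simd
    for (int i = 0; i < n; ++i) {
        int t = A[i];
        A[i] = B[i];
        B[i] = t;
    }
}
```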
