I am looking more closely at the OpenMP simd construct and have three loops that gcc does not seem to vectorize (judging by brief performance tests), although I think they could be. So I was wondering whether it is safe to add the simd pragma, and why gcc is not vectorizing them.

The first is a matrix multiplication with the values stored in a single array:

    #pragma omp parallel for
    for(size_t row = 0; row < 100; ++row){
        #pragma omp simd  // the optional pragma in question
        for(size_t col = 0; col < 100; ++col){
            float sum = c[row * 100 + col];
            for(size_t k = 0; k < 100; k++){
                sum += a[row * 100 + k] * b[k * 100 + col];
            }
            c[row * 100 + col] = sum;
        }
    }

I am aware that b is not transposed, which hinders performance. Adding the simd pragma makes the code much faster. Is auto-vectorization not possible because of the inner loop?
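One common rearrangement for this kind of loop nest, which is not part of the code above, is the i-k-j order: the innermost loop then walks `b` and `c` with unit stride instead of striding through `b` by a full row. The sketch below assumes row-major n x n matrices and that `c` does not overlap `a` or `b`; the function name `matmul_ikj` is made up for illustration. Each element of `c` still accumulates its products in ascending `k` order, so no floating-point reassociation is required.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: c += a * b for row-major n x n matrices.
// Assumption: c does not alias a or b (the simd pragma asserts this).
void matmul_ikj(const float* a, const float* b, float* c, std::size_t n) {
    for (std::size_t row = 0; row < n; ++row) {
        for (std::size_t k = 0; k < n; ++k) {
            // Invariant in the innermost loop, so it can be broadcast.
            const float a_rk = a[row * n + k];
            // Unit-stride accesses to b and c in the innermost loop.
            #pragma omp simd
            for (std::size_t col = 0; col < n; ++col) {
                c[row * n + col] += a_rk * b[k * n + col];
            }
        }
    }
}
```

Whether this is faster than the version above depends on the compiler, its flags, and the target CPU, so it is worth measuring rather than assuming.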

For the second example I was trying out OpenMP's custom reduction declaration feature, which is not actually needed here.

    #pragma omp declare reduction(sum : double : omp_out += omp_in) initializer(omp_priv = omp_orig)

    double red_result = 0;
    #pragma omp parallel for simd reduction(sum:red_result)  // "simd" is the optional part in question
    for(size_t i = 0; i < 100; ++i){
        red_result = red_result + a[i];
    }

Does the reduction prevent vectorization? I would have thought that it should work fine.

The last example is a more complex loop, with an inner loop and function calls. Simplified, it looks something like this:

    #pragma omp parallel for simd  // "simd" is the optional part in question
    for(size_t i = 0; i < 100; ++i){
        [..]
        for(size_t j = 0; j < 100; j++){
            if(j != i){
                float k2 = a[i] - b[j];
                k = std::sqrt(k2);
            }
        }
        [do more with k]
    }

So here the problem is probably the sqrt call, which cannot be vectorized? But should the performance be better with the simd pragma? A brief test suggests that it is, but if auto-vectorization is not possible because of std::sqrt, why should it be possible with the pragma?

Thank you for your help! :)

  • FP math is not associative. Compilers can't autovectorize FP reductions without -ffast-math or an OpenMP pragma that gives them permission to sum in a different order. Commented Apr 6, 2018 at 18:16
  • x86 has hardware support for SIMD sqrt. sqrtpd has as good throughput as sqrtsd on most CPUs, but does 2 double square roots in parallel. agner.org/optimize. Commented Apr 9, 2018 at 21:49
  • In the past, gcc ignored simd in the case of omp parallel simd, so it would be reasonable to say that parallel disabled vectorization (at least where simd would be needed). The post above implies that this changed with gcc 7.1. Even with icc, my experience was that explicit nested loops were needed to accomplish parallel simd. Commented Apr 10, 2018 at 13:24
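The first comment's point about associativity can be sketched as follows. This is an illustration under assumptions, not code from the post: the function names are invented, and the pragma is only a hint that the compiler is free to ignore. `regrouping_changes_result` shows that regrouping a float sum can change the result, which is why the compiler needs explicit permission (here the `reduction(+:s)` clause) before it may keep several partial sums in vector lanes.

```cpp
#include <cassert>
#include <cstddef>

// Strict left-to-right sum: without -ffast-math the compiler must keep this order.
float sum_sequential(const float* a, std::size_t n) {
    float s = 0.0f;
    for (std::size_t i = 0; i < n; ++i) s += a[i];
    return s;
}

// Same sum, but the reduction clause permits per-lane partial sums,
// i.e. a different summation order.
float sum_simd(const float* a, std::size_t n) {
    float s = 0.0f;
    #pragma omp simd reduction(+:s)
    for (std::size_t i = 0; i < n; ++i) s += a[i];
    return s;
}

// Returns true when (big + small) - big differs from (big - big) + small.
// volatile keeps the compiler from folding or reordering the arithmetic.
bool regrouping_changes_result() {
    volatile float big = 1.0e8f;
    volatile float small = 1.0f;
    volatile float t = big + small;  // small is lost: float spacing near 1e8 is 8
    float left = t - big;            // 0
    volatile float u = big - big;    // exactly 0
    float right = u + small;         // 1
    return left != right;
}
```

For inputs such as small integers, where every partial sum is exactly representable, both summation functions return the same value; in general they may differ in the last bits.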

1 Answer

For math functions in math.h, your compiler needs to provide vectorized versions of them. GCC does this with libmvec and ICC does this with SVML. As far as I know, Clang does not have native support for vectorized math functions.

Let's consider the following code:

    #include <math.h>

    void foo(float * __restrict a, float * __restrict b) {
        a = (float*)__builtin_assume_aligned(a, 16);
        b = (float*)__builtin_assume_aligned(b, 16);
        for(int i = 0; i < 100; ++i) {
            b[i] = sqrtf(a[i]);
        }
    }

    void foo2(float * __restrict a, float * __restrict b) {
        a = (float*)__builtin_assume_aligned(a, 16);
        b = (float*)__builtin_assume_aligned(b, 16);
        for(int i = 0; i < 100; ++i) {
            b[i] = sinf(a[i]);
        }
    }

GCC, ICC, and Clang vectorize sqrtf (using one iteration of Newton's method). GCC and ICC vectorize sinf with libmvec (_ZGVbN4v_sinf) and SVML (__svml_sinf4) respectively. Clang does not vectorize sinf. See godbolt. sqrt is a special case (since the x86 instruction set has vectorized sqrt instructions) which can be inlined without a vectorized math library.
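For a user-written function called inside a loop, OpenMP offers the `declare simd` directive, which asks the compiler to also generate a vector variant of that function so a `simd` loop can call it. The following is a minimal sketch under assumptions: `root_of_difference` and `apply_all` are hypothetical names, not taken from the question, and the compiler may still decide not to vectorize.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical helper. "declare simd" requests a vector variant in addition
// to the scalar one; "notinbranch" states it is never called conditionally
// inside a simd loop.
#pragma omp declare simd notinbranch
float root_of_difference(float x, float y) {
    return std::sqrt(x - y);
}

// Calls the helper once per element from a simd loop.
// Assumption: out does not overlap a or b.
void apply_all(const float* a, const float* b, float* out, std::size_t n) {
    #pragma omp simd
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = root_of_difference(a[i], b[i]);
    }
}
```

Without OpenMP enabled (for example, no `-fopenmp` or `-fopenmp-simd` with GCC), the pragmas are ignored and the code runs as ordinary scalar code with the same results.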

9 Comments

As you haven't declared float *__restrict c, you need omp simd to assert no aliasing. This should vectorize with simd sqrt (no Newton step) if you set appropriate options. Compilers use inline code for efficiency in such simple cases.
@tim18, thanks, that's a much better solution. Now GCC, ICC, and Clang all vectorize this without OpenMP godbolt.org/g/Lsznwh. I think sqrt is a special case. If sin were used, then libmvec or SVML would be necessary.
Yes, of course sqrt is special, as it can use the built-in sqrt or a Newton iteration. The latter wouldn't make sense in SVML.
The case is somewhat nonsensical. Are you testing whether compilers know to eliminate the inner loop by discarding 99 results?
@tim18, I was just trying to answer the third question in the OP's question since that's the most interesting to me. I'm not trying to eliminate the inner loop. I think sqrt was maybe a bad choice by the OP because it has a special solution. I think the OP wants a general answer for functions in math.h.