I have been looking more closely at the OpenMP simd construct, and I have three loops that gcc does not seem to vectorize (based on brief performance tests), although I think it could. So I am wondering whether it is safe to add the simd pragma, and why gcc is not vectorizing these loops.
The first is a matrix multiplication with the values stored as a single array:
```cpp
#pragma omp parallel for
for(size_t row = 0; row < 100; ++row){
    #pragma omp simd
    for(size_t col = 0; col < 100; ++col){
        float sum = c[row * 100 + col];
        for(size_t k = 0; k < 100; k++){
            sum += a[row * 100 + k] * b[k * 100 + col];
        }
        c[row * 100 + col] = sum;
    }
}
```

I am aware that b is not transposed, which hurts performance. Adding the simd pragma makes the code much faster. Is auto-vectorization not possible because of the inner loop?
For the second example I was trying the custom reduction declaration feature of OpenMP, which is not actually needed here.
```cpp
#pragma omp declare reduction(sum : double : omp_out += omp_in) initializer(omp_priv = omp_orig)

double red_result = 0;
#pragma omp parallel for simd reduction(sum:red_result)
for(size_t i = 0; i < 100; ++i){
    red_result = red_result + a[i];
}
```

Does the reduction prevent vectorization? I would have thought it should work fine.
The last example is a more complex loop, with another inner loop and function calls. Simplified, it looks something like this:
```cpp
#pragma omp parallel for simd
for(size_t i = 0; i < 100; ++i){
    [..]
    for(size_t j = 0; j < 100; j++){
        if(j != i){
            float k2 = a[i] - b[j];
            k = std::sqrt(k2);
        }
    }
    [do more with k]
}
```

So here the problem is probably the sqrt call, which cannot be vectorized? But should the performance be better with the simd pragma? A brief test suggests that it is, but if auto-vectorization is not possible because of std::sqrt, why does it become possible with the pragma?
Thank you for your help! :)
`-ffast-math` or an OpenMP pragma that gives them permission to sum in a different order. `sqrtpd` has as good throughput as `sqrtsd` on most CPUs, but does 2 `double` square roots in parallel. agner.org/optimize