openmp parallelizing code with an internal for loop

Question

I'm trying to write code that runs on parallel hardware using MPI and OpenMP. I have the following piece of code:

    #pragma omp parallel for private(k, temp_r)
    for(j=0; j<size; j++){
        temp_r = b[j];
        for(k=0; k<rows; k++){
            temp_r = temp_r - A[j*rows + k] * x[k];
        }
        r[j] = temp_r;
    }

I know this code could be improved further because the inner for loop is a reduction. I know how to apply a reduction to a single for loop, but I'm not sure how to go about it here, since two nested for loops are involved. Any insight would be helpful.

What do you really want to do? What is your goal? What are typical values of size and rows? On what system are you executing the code? Eventually, you will have to provide a minimal reproducible example to get a good answer.
– Zulan, Commented May 3, 2017 at 14:06
As far as what you show is concerned, most compilers should handle it well. Depending on the compiler and options, #pragma omp simd reduction(+: temp_r) on the inner loop may or may not help produce SIMD optimization. If your compiler produces a warning about the usage of j, making it local with for(int j=0;... should help.
– tim18, Commented May 3, 2017 at 17:08
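
To illustrate the suggestion in the comment above, here is a minimal sketch that keeps the outer loop parallel and puts a SIMD reduction on the inner loop. The function name residual_simd and its signature are hypothetical (they are not from the question), and the subtraction is rewritten as a sum that is subtracted from b[j] once, so that a plain + reduction applies. Whether the simd pragma actually helps depends on the compiler and options.

```c
#include <stddef.h>

/* Hypothetical helper (name and signature are not from the question).
 * Computes r = b - A*x for a row-major A with `size` rows of `rows`
 * entries each, matching the A[j*rows + k] indexing in the question. */
void residual_simd(const double *A, const double *x, const double *b,
                   double *r, int size, int rows)
{
    /* j is declared in the loop header, so it is private to each thread. */
    #pragma omp parallel for
    for (int j = 0; j < size; j++) {
        double sum = 0.0;  /* declared in the loop body: private per iteration */

        /* Ask the compiler to vectorize the dot product within one thread. */
        #pragma omp simd reduction(+:sum)
        for (int k = 0; k < rows; k++) {
            sum += A[(size_t)j * rows + k] * x[k];
        }
        r[j] = b[j] - sum;
    }
}
```

With GCC or Clang, the pragmas take effect when compiling with -fopenmp; without it they are ignored and the function runs serially. Because a reduction may change the order of the floating-point additions, the result can differ from the serial loop in the last bits.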

Liran Funaro · Accepted Answer · 2017-05-04 16:11:48Z

If size >> #CPUs, using a reduction for the inner loop will only reduce performance. A reduction needs an extra log(#CPUs) steps to combine the partial results, compared with a serial for loop. Parallelizing this code any further will therefore not gain anything and will probably hurt. It would, however, improve performance if size < #CPUs, because the outer loop then yields fewer work chunks than there are CPUs, so some CPUs would otherwise sit idle.
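
For the size < #CPUs case, one possible sketch is to leave the outer loop serial and parallelize the inner loop with a reduction instead. The function name residual_inner_reduction and its signature are hypothetical, and the sum-then-subtract form is an assumption made so that a + reduction can be used; this is not code from the question.

```c
#include <stddef.h>

/* Hypothetical helper (name and signature are not from the question).
 * Computes r = b - A*x, splitting each row's dot product across threads.
 * Only worth considering when `size` is smaller than the number of CPUs. */
void residual_inner_reduction(const double *A, const double *x,
                              const double *b, double *r,
                              int size, int rows)
{
    for (int j = 0; j < size; j++) {        /* serial over the few rows */
        double sum = 0.0;

        /* Each thread accumulates a private partial sum; OpenMP combines
         * the partial sums into `sum` at the end of the loop. */
        #pragma omp parallel for reduction(+:sum)
        for (int k = 0; k < rows; k++) {
            sum += A[(size_t)j * rows + k] * x[k];
        }
        r[j] = b[j] - sum;
    }
}
```

This version starts one parallel loop per row, so its overhead is paid size times; that is part of why it only makes sense when size is small and rows is large.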

Cache optimizations are also not viable. Each basic operation (temp_r = temp_r - A[j*rows + k] * x[k]) requires reading two values (A[j][k] and x[k]), one of which (A[j][k]) is used only by that operation, which means it is not in the cache. If you are working on an out-of-order execution CPU (which you probably are), you will not gain anything from improving the cache locality of the reads from the x array, because the CPU still has to wait for the read of A[j][k]. It issues both reads simultaneously and only starts the operation once both values are ready.
