I've got a part of a Fortran program consisting of some nested loops which I want to parallelize with OpenMP.
```fortran
integer :: nstates, N, i, dima, dimb, dimc, a_row, b_row, b_col, c_row, row, col
double complex, dimension(4,4) :: mat
double complex, dimension(:), allocatable :: vecin, vecout

nstates = 2
N = 24
allocate(vecin(nstates**N), vecout(nstates**N))
vecin = ...some data
vecout = 0
mat = reshape([...some data...],[4,4])
dimb = nstates**2

!$OMP PARALLEL DO PRIVATE(dima,dimc,row,col,a_row,b_row,c_row,b_col)
do i = 1, N-1
   dima = nstates**(i-1)
   dimc = nstates**(N-i-1)
   do a_row = 1, dima
      do b_row = 1, dimb
         do c_row = 1, dimc
            row = ((a_row-1)*dimb + b_row - 1)*dimc + c_row
            do b_col = 1, dimb
               col = ((a_row-1)*dimb + b_col - 1)*dimc + c_row
               !$OMP ATOMIC
               vecout(row) = vecout(row) + vecin(col)*mat(b_row,b_col)
            end do
         end do
      end do
   end do
end do
!$OMP END PARALLEL DO
```

The program runs and the result is also correct; it's just incredibly slow, much slower than without OpenMP. I don't know much about OpenMP. Have I done something wrong with the use of `PRIVATE` or `OMP ATOMIC`? I would be grateful for any advice on how to improve the performance of this code.
Drop the `ATOMIC` directive and use a `reduction` clause for `vecout` instead. `ATOMIC` serializes every single update to `vecout(row)`, so the threads spend most of their time waiting on each other; with a reduction, each thread accumulates into its own private copy of `vecout` and the copies are summed once at the end. This should speed things up ;-) Note that you may need `ulimit -s unlimited`, since the private per-thread copies of `vecout` can overflow the default stack limit.
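A minimal sketch of the reduction variant, assuming the loop structure from the question (Fortran allows whole arrays in a `reduction` clause; be aware that with `nstates**N = 2**24` elements, each thread's private copy of `vecout` is 256 MiB of double complex data, which is why the stack limit matters):

```fortran
! Same loop nest as before, but vecout is a reduction variable:
! each thread works on a private zero-initialized copy, and the
! copies are summed into the shared vecout when the region ends.
!$OMP PARALLEL DO PRIVATE(dima,dimc,row,col,a_row,b_row,c_row,b_col) &
!$OMP REDUCTION(+:vecout)
do i = 1, N-1
   dima = nstates**(i-1)
   dimc = nstates**(N-i-1)
   do a_row = 1, dima
      do b_row = 1, dimb
         do c_row = 1, dimc
            row = ((a_row-1)*dimb + b_row - 1)*dimc + c_row
            do b_col = 1, dimb
               col = ((a_row-1)*dimb + b_col - 1)*dimc + c_row
               ! no ATOMIC needed: this thread is the only writer
               ! to its private copy of vecout
               vecout(row) = vecout(row) + vecin(col)*mat(b_row,b_col)
            end do
         end do
      end do
   end do
end do
!$OMP END PARALLEL DO
```

Before running, raise the process stack limit with `ulimit -s unlimited`, and consider setting `OMP_STACKSIZE` (e.g. `export OMP_STACKSIZE=512M`) so the worker threads' stacks are large enough for their private copies as well.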