I've got a part of a Fortran program consisting of some nested loops which I want to parallelize with OpenMP.
```fortran
integer :: nstates, N, i, dima, dimb, dimc, a_row, b_row, b_col, c_row, row, col
double complex, dimension(4,4) :: mat
double complex, dimension(:), allocatable :: vecin, vecout

nstates = 2
N = 24
allocate(vecin(nstates**N), vecout(nstates**N))
vecin = ...some data
vecout = 0
mat = reshape([...some data...],[4,4])
dimb = nstates**2

!$OMP PARALLEL DO PRIVATE(dima,dimc,row,col,a_row,b_row,c_row,b_col)
do i = 1, N-1
   dima = nstates**(i-1)
   dimc = nstates**(N-i-1)
   do a_row = 1, dima
      do b_row = 1, dimb
         do c_row = 1, dimc
            row = ((a_row-1)*dimb + b_row - 1)*dimc + c_row
            do b_col = 1, dimb
               col = ((a_row-1)*dimb + b_col - 1)*dimc + c_row
               !$OMP ATOMIC
               vecout(row) = vecout(row) + vecin(col)*mat(b_row,b_col)
            end do
         end do
      end do
   end do
end do
!$OMP END PARALLEL DO
```

The program runs and the result is also correct; it's just incredibly slow, much slower than without OpenMP. I don't know much about OpenMP. Have I done something wrong with the use of `PRIVATE` or `OMP ATOMIC`? I would be grateful for any advice on how to improve the performance of this code.
Drop the `ATOMIC` directive and use a `reduction` clause for `vecout` instead. `ATOMIC` serializes every single update to `vecout(row)`, so the threads spend most of their time waiting on each other; with a reduction, each thread accumulates into its own private copy of `vecout` and the copies are summed once at the end. This should speed things up ;-) Note that you may need `ulimit -s unlimited`, since the private per-thread copies of `vecout` can overflow the default stack limit.
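A minimal sketch of the reduction variant, assuming the loop structure from the question (Fortran allows whole arrays in a `reduction` clause; be aware that with `nstates**N = 2**24` elements, each thread's private copy of `vecout` is 256 MiB of double complex data, which is why the stack limit matters):

```fortran
! Same loop nest as before, but vecout is a reduction variable:
! each thread works on a private zero-initialized copy, and the
! copies are summed into the shared vecout when the region ends.
!$OMP PARALLEL DO PRIVATE(dima,dimc,row,col,a_row,b_row,c_row,b_col) &
!$OMP REDUCTION(+:vecout)
do i = 1, N-1
   dima = nstates**(i-1)
   dimc = nstates**(N-i-1)
   do a_row = 1, dima
      do b_row = 1, dimb
         do c_row = 1, dimc
            row = ((a_row-1)*dimb + b_row - 1)*dimc + c_row
            do b_col = 1, dimb
               col = ((a_row-1)*dimb + b_col - 1)*dimc + c_row
               ! no ATOMIC needed: this thread is the only writer
               ! to its private copy of vecout
               vecout(row) = vecout(row) + vecin(col)*mat(b_row,b_col)
            end do
         end do
      end do
   end do
end do
!$OMP END PARALLEL DO
```

Before running, raise the process stack limit with `ulimit -s unlimited`, and consider setting `OMP_STACKSIZE` (e.g. `export OMP_STACKSIZE=512M`) so the worker threads' stacks are large enough for their private copies as well.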