
Background

I am simulating the motion of N charged particles in molecular dynamics with Fortran 90 and OpenMP. The analytical expression of the force applied to each ion i is known and is a function of the position of ion i and of the other ions (r_x, r_y, r_z). I compute the Coulomb interaction between each pair of ions using a parallelised 2-nested do loop. At the end of the loop I obtain the acceleration (a2_x, a2_y, a2_z) of each ion (then update velocity and position with velocity-Verlet).
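For completeness, here is a minimal sketch of the velocity-Verlet step I refer to (only the x component is shown; dt, v_x and a_x are assumed names for my time step, velocities and current accelerations):

 ! velocity-Verlet step, x component only (sketch; dt, v_x, a_x are assumed names)
 r_x(:) = r_x(:) + v_x(:)*dt + 0.5d0*a_x(:)*dt*dt   ! positions at t+dt
 ! ... recompute a2_x from the new positions with the Coulomb loop below ...
 v_x(:) = v_x(:) + 0.5d0*(a_x(:) + a2_x(:))*dt      ! velocities at t+dt
 a_x(:) = a2_x(:)                                   ! store for the next step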

Method

I use the following code in my program to compute the Coulomb forces applied to each ion: starting from the positions (r_x) at the current time step, I compute the accelerations (a2_x) for the next time step. It is a 3D problem; I show all the lines, but most of them just repeat the same thing for x, y and z, so on a first read you can follow only the _x variables to see how this works.

I parallelize my loop over C threads; ia and ib are arrays used to split the N ions into C chunks. For instance, for C=4 threads and N=16 ions (see the edit remarks below):

 integer, parameter :: ia(C) = [1,5,9,13]
 integer, parameter :: ib(C) = [4,8,12,16]
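For reference, these bounds could be generated for any N and C instead of being hard-coded (a sketch; it assumes ia and ib are ordinary integer arrays rather than parameters):

 do im = 1, C
    ia(im) = 1 + (im-1)*N/C   ! first ion handled by thread im
    ib(im) = im*N/C           ! last ion handled by thread im
 end do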

The Coulomb forces are then computed as follows:

!$omp parallel default(none) &
!$omp private(im, i, j, rji, r2inv) &
!$omp firstprivate(r_x, r_y, r_z, N, ia, ib) &
!$omp shared(a2_x, a2_y, a2_z)

im = omp_get_thread_num() + 1   ! index of the current thread (1-based)

! Coulomb forces between each ion pair
! Compute the Coulomb force applied to ion i
do i = ia(im), ib(im)            ! loop over the ions assigned to this thread
   do j = 1, N                   ! loop over all ions
      rji(1) = r_x(j) - r_x(i)   ! distance between ions i and j over x
      rji(2) = r_y(j) - r_y(i)   ! over y
      rji(3) = r_z(j) - r_z(i)   ! over z
      ! then compute the inverse of the distance between the current ion i and the neighbour j
      r2inv = 1.d0/dsqrt(rji(1)*rji(1) + rji(2)*rji(2) + rji(3)*rji(3) + softening)
      r2inv = r2inv * r2inv * r2inv * alpha(1)   ! alpha is 1/(4.pi.eps0)
      ! computation of the accelerations
      a2_x(i) = a2_x(i) - rji(1)*r2inv
      a2_y(i) = a2_y(i) - rji(2)*r2inv
      a2_z(i) = a2_z(i) - rji(3)*r2inv
   enddo
enddo
!$omp end parallel

Problematics

I am trying to optimize this time-consuming part of my program. The number of operations is quite high and scales quickly with N (as N²). Can you give me your opinion on this code? I have some specific questions.

  1. I have been told I should have the positions r_x, r_y and r_z as private variables, which seems counter-intuitive to me because I want to enter that loop using the previously defined positions of the ions, so I use firstprivate. Is that right?

  2. I am not sure the parallelisation is optimal regarding the other variables. Shouldn't rji and r2inv be shared? To compute the distance between ions i and j, I go "beyond" threads, if you see what I mean: I need information about ions handled by two different threads.

  3. Is the way I split the ions in the outer do loop optimal?

  4. For each ion I loop over all ions, including itself, which would induce a division by zero when the distance between ion i and itself is computed. To prevent this I add a softening variable set to a very small value, so the denominator is never exactly zero. I do this to avoid an if (i == j) test that would be time-consuming (both options are sketched after this list).

  5. Isn't the square root also time-consuming?
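To make question 4 concrete, the two alternatives for the inner loop would look like this (sketch):

 ! Option A (what I do now): softening keeps the denominator non-zero when j == i
 r2inv = 1.d0/sqrt(rji(1)*rji(1) + rji(2)*rji(2) + rji(3)*rji(3) + softening)

 ! Option B: skip the self-interaction explicitly (test placed at the top of the j loop)
 if (i == j) cycle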

For any additional detail feel free to ask.

Edit (Remarks)

  1. My computer has a 10-core Xeon W-2155 CPU and 32 GB of RAM. I intend to simulate around 1000 ions, and am considering up to 4000, which requires a lot of time.

  2. This Coulomb subroutine is one among other subroutines that may consume CPU time. For instance, one potentially time-consuming routine is devoted to generating random numbers for each ion, depending on whether it is already excited or not, and applying the correct effect depending on whether it absorbs a photon or not. So that is a lot of RNG calls and if tests for each ion.

Edit (Test of the propositions)

  1. Using !$omp do in combination with schedule(dynamic,1), schedule(guided) or schedule(nonmonotonic:dynamic), and/or collapse(2), did not improve the run time; it made it at least three times longer. It has been suggested that the number of elements in my simulations (N) is too low to see a significant improvement. If I ever try to simulate a much higher number of elements (4096, 8192, ...) I will try those options again.

  2. Using !$omp do rather than a home-made ion distribution among cores turned out to be equivalent in terms of run time. Since it is easier to implement, I will keep it.

  3. Replacing the inverse dsqrt by **(-0.5) turned out to be equivalent in terms of run time.

  4. Delaying the square root and combining it with the third power of r2inv was also equivalent, so I replaced the whole series of operations by a single **(-1.5).

  5. Same idea with rji(1)*r2inv: I compute rji(:)*r2inv once beforehand and only use the result in the next lines (see the combined sketch after this list).
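Putting points 2-5 together, the Coulomb loop I now use looks roughly like this (a sketch of my current version, not a verified listing):

 !$omp do
 do i = 1, N
    do j = 1, N
       rji(1) = r_x(j) - r_x(i)
       rji(2) = r_y(j) - r_y(i)
       rji(3) = r_z(j) - r_z(i)
       r2inv = (rji(1)*rji(1) + rji(2)*rji(2) + rji(3)*rji(3) + softening)**(-1.5d0) * alpha(1)
       rji(:) = rji(:)*r2inv   ! multiply once, reuse below
       a2_x(i) = a2_x(i) - rji(1)
       a2_y(i) = a2_y(i) - rji(2)
       a2_z(i) = a2_z(i) - rji(3)
    enddo
 enddo
 !$omp end do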

Edit 2 (Test with !$omp reduction(+:...))

I have tested the program with the following instructions

  • original, which is the program I present in my question.
  • !$omp do schedule(dynamic,1)
  • !$omp reduction(+:a2_x,a2_y,a2_z) with schedule(dynamic,1).
  • !$omp reduction(+:a2_x,a2_y,a2_z) with schedule(guided) and do i = 2, N; do j = 1, i-1 for the loop (half the work).

for 1024 and 16384 ions. It turns out my original version is still faster for me, but the reduction versions are not as "catastrophic" as in the previous test without reduction.

Version                              N = 1024   N = 16384
original                             84 s       15194 s
schedule(dynamic,1)                  1300 s     not tested
reduction and schedule(dynamic,1)    123 s      24860 s
reduction and schedule(guided) (2)   121 s      24786 s

What is weird is that @PierU still gets a faster computation with reduction, while for me it is not optimal. Where could such a difference come from?

Hypothesis

  1. Does the fact that I have 10 cores make the workload lighter on each core for a given number of ions?
  2. I use double precision; maybe single precision would be faster? (See the sketch after this list.)
  3. Do you have the AVX-512 instruction set? It has specific hardware to compute the inverse square root much faster (see this article).
  4. The bottleneck is elsewhere in my program. I am aware I should time only the Coulomb part, but I wanted to test it in the context of my program, to see if it really shortens the computation time. I have a section with a lot of where constructs and RNG calls; perhaps I should work on that instead.
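Regarding hypothesis 2, a minimal sketch of how I could parameterize the precision to test it (wp is an assumed name, not in my current code):

 integer, parameter :: wp = kind(1.0d0)   ! double precision; switch to kind(1.0) to test single
 real(wp), allocatable :: r_x(:), r_y(:), r_z(:), a2_x(:), a2_y(:), a2_z(:)
 real(wp) :: rji(3), r2inv, softening
 ! ... inside the loop, use the same kind for literals and exponents:
 r2inv = (rji(1)*rji(1) + rji(2)*rji(2) + rji(3)*rji(3) + softening)**(-1.5_wp)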
  • Just a comment on the style - dsqrt is very Fortran 66. Simply sqrt has been enough for almost the last 50 years. Commented Nov 23, 2022 at 14:07
  • One comment: I assume that N=16 is just for illustration, because for such a small number of particles there is no chance to observe a significant speed-up with multithreading. What would be your typical N in real applications? Commented Nov 23, 2022 at 15:12
  • @PierU Your assumption is right. I will rather run simulations with 1024 ions, but I would like to try more, such as 4096, so I would appreciate an optimization of the code because it would be much more time-consuming with 4096 ions. At 1024 ions the wall time can be 30 to 60 minutes, which is fine, but at 4096 it would be much longer. Commented Nov 24, 2022 at 16:39
  • @PierU I have edited with run times tested with and without reduction for N=1024 and N=16384. Please be aware that I changed the time step to make it quicker compared to the previous tests, so try not to compare those run times (edit 2) with the previous tests (especially for 1024 ions). Commented Dec 2, 2022 at 10:35
  • Maybe you have other bottlenecks in your code, but overall your code should not be slower with OpenMP Commented Dec 3, 2022 at 12:13

2 Answers

  1. Generally speaking, the variables that you only need to read in the parallel region can be shared. However, having firstprivate copies for each thread can give better performance in some cases (the copies can be in the local cache of each core), particularly for variables that are repeatedly read.
  2. Definitely not! If you do that, there will be a race condition on these variables.
  3. Looks OK, but it is generally simpler (and at worst as efficient) to use an !$OMP DO directive instead of manually distributing the work to the different threads:
 !$OMP DO
 do i = 1, N      ! loop over all ions
    do j = 1, N   ! loop over all ions
  4. Why not, provided that you are able to choose a softening value that doesn't alter your simulation (this is something that you have to test against the if solution).
  5. It is, somewhat, but at some point you cannot avoid an exponentiation. I would delay the sqrt and the division like this:
 r2inv = (rji(1)*rji(1) + rji(2)*rji(2) + rji(3)*rji(3) + softening)
 r2inv = r2inv**(-1.5) * alpha(1)   ! alpha is 1/(4.pi.eps0)

Dividing the work by 2

The forces are symmetrical, and can be computed only once for a given (i,j) pair. This also naturally avoids the i==j case and the softening value. A reduction clause is needed on the a2* arrays, as the same elements can be updated by different threads. The workload between iterations is highly unbalanced, though, and a dynamic clause is needed. This is actually a case where manually distributing the iterations to the threads can be more efficient ;) ...

!$omp parallel default(none) &
!$omp private(im, i, j, rji, r2inv) &
!$omp firstprivate(r_x, r_y, r_z, N, ia, ib) &
!$omp reduction(+:a2_x, a2_y, a2_z)
! Coulomb forces between each ion pair
! Compute the Coulomb force applied to ion i
!$omp do schedule(dynamic,1)
do i = 1, N-1                    ! loop over all ions
   do j = i+1, N                 ! loop over some ions
      rji(1) = r_x(j) - r_x(i)   ! distance between ions i and j over x
      rji(2) = r_y(j) - r_y(i)   ! over y
      rji(3) = r_z(j) - r_z(i)   ! over z
      ! then compute the inverse of the distance between the current ion i and the neighbour j
      r2inv = (rji(1)*rji(1) + rji(2)*rji(2) + rji(3)*rji(3))
      r2inv = r2inv**(-1.5) * alpha(1)   ! alpha is 1/(4.pi.eps0)
      ! computation of the accelerations
      rji(:) = rji(:)*r2inv
      a2_x(i) = a2_x(i) - rji(1)
      a2_y(i) = a2_y(i) - rji(2)
      a2_z(i) = a2_z(i) - rji(3)
      a2_x(j) = a2_x(j) + rji(1)
      a2_y(j) = a2_y(j) + rji(2)
      a2_z(j) = a2_z(j) + rji(3)
   enddo
enddo
!$omp end do
!$omp end parallel

Alternatively, a guided clause could be used, with some changes in the iterations to have the low workloads in the first ones:

 !$omp do schedule(guided)
 do i = 2, N         ! loop over all ions
    do j = 1, i-1    ! loop over some ions

TIMING

I have timed the latter code (work divided by 2) on an old Core i5 from 2011 (4 cores). The code was compiled with gfortran 12.

The timed configurations are: no OpenMP; OpenMP with 1 thread; 4 threads with no explicit schedule (that is static by default); schedule(dynamic); schedule(nonmonotonic:dynamic); schedule(guided). guided was timed with 2 code versions: (1) with do i=1,N-1; do j=i+1,N, (2) with do i=2,N; do j=1,i-1.

              N=256    N=1024   N=4096   N=16384   N=65536
no omp        0.0016   0.026    0.41     6.9       116
1 thread      0.0019   0.027    0.48     8.0       118
4 threads     0.0014   0.013    0.20     3.4       55
dynamic       0.0018   0.020    0.29     5.3       84
nonmonotonic                    0.29     5.2       85
guided (1)    0.0014   0.013    0.21     3.7       61
guided (2)    0.0009   0.093    0.13     2.2       38

The guided schedule with the low-workload iterations first wins, and I get some speed-up even for low values of N. It's important to note, however, that the behavior can differ on a different CPU and with a different compiler.

I have also timed the code with do i=1,N; do j=1,N (as the work is balanced between iterations, there is no need for sophisticated schedule clauses):

            N=256    N=1024   N=4096   N=16384   N=65536
no omp      0.0028   0.047    0.72     11.5      183
4 threads   0.0013   0.019    0.25     4.0       71

9 Comments

it's also worth trying a schedule(nonmonotonic:dynamic). It can reduce the cost of the dynamic schedule significantly. While it is now allowed to be the default for schedule(dynamic), many implementations choose not to make the default nonmonotonic as it could break old code that assumes a monotonic implementation. You could also try collapse(2) on the loops, possibly with a chunk size to increase the items being scheduled a little.
@PierU I implemented the modifications suggested; they didn't work as expected. First, I tested my original version for N=1024 ions: the wall time is 417 s. Adding !$omp do and the delayed sqrt gives 417 s as well. Nevertheless, rearranging the loop as proposed (i = 1, N-1 ...) and using schedule(dynamic,1) gave a wall time above 15 minutes (>900 s). Using schedule(guided) I get a wall time above 30 minutes. I will double-check tomorrow. The Coulomb routine is embedded in code that also runs other stuff, notably random number generation for each ion at each time step; could that explain why the change has no effect?
I would also add that !$omp do alone, with softening and my original do loop i = 1, N ..., gives a wall time of 417 s as well. Does the mess come from schedule(dynamic,1) and the rearranging of the loop with do i = 1, N-1 ... do j = i+1, N? Could this be related to how the variables are defined, i.e. the private and shared attributes?
@Aldehyde The dynamic schedule has significant overheads, and the workload per iteration is maybe too low here compared to the overheads. It could be worth trying the suggestion of @JimCownie, schedule(nonmonotonic:dynamic). Beyond that, it's possible that even with N=1024 the total workload is not big enough (1024**2 iterations is not that much) to take significant advantage of OpenMP, all the more so if the surrounding serial code represents a significant percentage of the total time. You should time the parallel part only to assess the multithreading speed-up (see omp_get_wtime()).
@PierU @Jim Cownie Using schedule(nonmonotonic:dynamic), with or without collapse(2), with the halved loop, did not improve the run time; it was above 30 minutes. Those instructions were written on the same line as the !$omp do. Execution times were measured using omp_get_wtime(), but not for the Coulomb loop only, because it is integrated in a bigger program.
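To time only the Coulomb section, as suggested above, a minimal sketch with omp_get_wtime (t0 and t1 are assumed names):

 use omp_lib, only : omp_get_wtime
 double precision :: t0, t1
 t0 = omp_get_wtime()
 ! ... the parallel Coulomb loop from above goes here ...
 t1 = omp_get_wtime()
 print *, 'Coulomb wall time (s):', t1 - t0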

I did not see how you limited the number of threads; this could be an additional !$omp clause. The following could be effective with schedule(static). Should if (i==j) cycle be included? Is N = 16 in your code example?

 dimension rjm(3)

 !$omp parallel do default(none) &
 !$omp private (i, j, rji, r2inv, rjm) &
 !$omp shared (a2_x, a2_y, a2_z, r_x, r_y, r_z, N, softening, alpha) &
 !$omp schedule (static)
 ! Coulomb forces between each ion pair
 ! Compute the Coulomb force applied to ion i
 do i = 1, 16                     ! loop over threads; ignore ia(im), ib(im); is N = 16 ?
    rjm = 0
    do j = 1, N                   ! loop over all ions
       if ( i==j ) cycle
       rji(1) = r_x(j) - r_x(i)   ! distance between ions i and j over x
       rji(2) = r_y(j) - r_y(i)   ! over y
       rji(3) = r_z(j) - r_z(i)   ! over z
       ! compute the distance between the current ion i and the neighbour j
       r2inv = sqrt ( rji(1)*rji(1) + rji(2)*rji(2) + rji(3)*rji(3) + softening )
       r2inv = alpha(1) / ( r2inv * r2inv * r2inv )   ! alpha is 1/(4.pi.eps0)
       ! accumulate the acceleration in a private array
       ! rjm = rjm + rji*r2inv
       rjm(1) = rjm(1) + rji(1)*r2inv
       rjm(2) = rjm(2) + rji(2)*r2inv
       rjm(3) = rjm(3) + rji(3)*r2inv
    end do
    a2_x(i) = a2_x(i) - rjm(1)
    a2_y(i) = a2_y(i) - rjm(2)
    a2_z(i) = a2_z(i) - rjm(3)
 end do
 !$omp end parallel do

I have no experience of using firstprivate to shift shared variables onto the stack for improved performance. Is it worthwhile?

3 Comments

1) N is not necessarily 16; 2) your code is essentially the same as the one I proposed at the beginning of my answer (using an omp do); 3) the advantage of firstprivate is that each thread can have a private copy in the L3 cache, and the downside is that the copy comes with an initial cost, so it can be overall faster or slower depending on the case.
@PierU, I agree that omp do is a much better way to distribute the workload; schedule(static) helps with this. The available code does not show how the threads were limited to 4. The L3 cache is a "group" shared cache, so a local copy might need to be in the L1 cache. I would alternatively suggest that including the private accumulator "rjm" defers touching a2_x/y/z to outside the do j loop and so reduces the memory and cache-consistency overheads for "a2_". I would recommend this approach as "rjm" is more likely to be in the L1 cache.
Sorry, I indeed meant L1 cache...
