Background
I am simulating the motion of N charged particles in molecular dynamics with Fortran 90 and OpenMP. The analytical expression of the force applied to each ion i is known and is a function of the position of ion i and of the other ions (r_x, r_y, r_z). I compute the Coulomb interaction between each pair of ions using a parallelised double-nested do loop. At the end of the loop I obtain the acceleration (a2_x, a2_y, a2_z) of each ion, then update velocities and positions with velocity-Verlet.
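For context, here is a minimal C sketch of one velocity-Verlet step (the function names and the split into two half-updates are my own; the acceleration recomputation between the two calls would be the Coulomb loop discussed in this question):

```c
#include <stddef.h>

/* One velocity-Verlet step for one coordinate of each ion.
   a_old: accelerations at time t; a_new: accelerations at t + dt,
   recomputed from the updated positions between the two calls. */
static void verlet_positions(double *r, const double *v, const double *a_old,
                             size_t n, double dt)
{
    for (size_t i = 0; i < n; i++)
        r[i] += v[i] * dt + 0.5 * a_old[i] * dt * dt;
}

static void verlet_velocities(double *v, const double *a_old,
                              const double *a_new, size_t n, double dt)
{
    for (size_t i = 0; i < n; i++)
        v[i] += 0.5 * (a_old[i] + a_new[i]) * dt;
}
```

The point of the scheme is that the acceleration routine is called exactly once per time step, so its cost dominates everything else as N grows.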
Method
I use the following code in my program to compute the Coulomb forces applied to each ion. I compute the acceleration (a2_x) at the next time step, starting from the position (r_x) at the current time step. It is a 3D problem; I include all the lines, but most of them are just the same thing repeated for x, y and z, so at first read you can consider only the `_x` variables to see how this works.
I parallelise my loop over C threads; `ia` and `ib` are arrays used to split the N ions into C parts. For instance, for C = 4 threads and N = 16 ions (see edit remarks below):

```fortran
integer, parameter :: ia(C) = [1,5,9,13]
integer, parameter :: ib(C) = [4,8,12,16]
```

Then the Coulomb part is computed as follows:
```fortran
!$omp parallel default(none) &
!$omp private(im, i, j, rji, r2inv) &
!$omp firstprivate(r_x, r_y, r_z, N, ia, ib) &
!$omp shared(a2_x, a2_y, a2_z)
im = omp_get_thread_num() + 1   ! index of the current thread

! Coulomb forces between each ion pair:
! compute the Coulomb force applied to ion i
do i = ia(im), ib(im)           ! loop over the ions assigned to this thread
   do j = 1, N                  ! loop over all ions
      rji(1) = r_x(j) - r_x(i)  ! distance between ions i and j along x
      rji(2) = r_y(j) - r_y(i)  ! along y
      rji(3) = r_z(j) - r_z(i)  ! along z
      ! inverse of the (softened) distance between the current ion i and the neighbour j
      r2inv = 1.d0/dsqrt(rji(1)*rji(1) + rji(2)*rji(2) + rji(3)*rji(3) + softening)
      r2inv = r2inv * r2inv * r2inv * alpha(1)   ! alpha is 1/(4.pi.eps0)
      ! accumulate the accelerations
      a2_x(i) = a2_x(i) - rji(1)*r2inv
      a2_y(i) = a2_y(i) - rji(2)*r2inv
      a2_z(i) = a2_z(i) - rji(3)*r2inv
   enddo
enddo
!$omp end parallel
```

Problematics
I am trying to optimize this time-consuming part of my program. The number of operations is quite high and scales quickly with N. Can you give me your opinion on this program? I have some specific questions:
- I have been told I should have the positions `r_x`, `r_y` and `r_z` as `private` variables, which seems counter-intuitive to me because I want to enter the loop using the previously defined positions of the ions, so I use `firstprivate`. Is that right?
- I am not sure that the parallelisation is optimal regarding the other variables. Shouldn't `rji` and `r2inv` be `shared`? Because to compute the distance between ions i and j, I go "beyond" threads: I need information about ions spread over two different threads.
- Is the way I split the ions in the first `do` optimal?
- For each ion i the inner loop runs over all ions, including j = i, which would induce a division by zero when the distance between ion i and itself is computed. To prevent this I have a `softening` variable set to a very small value, so the distance is never exactly zero. I do this to avoid an `if (i == j)` test that I expect to be time consuming.
- Isn't the square root also time consuming?
For any additional detail feel free to ask.
Edit (Remarks)
My computer has a 10-core Xeon W-2155 CPU and 32 GB of RAM. I intend to simulate around 1000 ions, while thinking about 4000, which requires a lot of time.
I have this Coulomb subroutine among other subroutines that may consume CPU time. For instance, one potentially time-consuming routine generates random numbers for each ion, depending on whether it is already excited or not, and applies the correct effect depending on whether it absorbs a photon or not. So that is a lot of RNG calls and `if` tests for each ion.
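If that RNG section ever runs inside a parallel region, one common pattern is to give each thread its own generator state, seeded from the thread id, so threads never contend on shared RNG state. A minimal C sketch with a hypothetical xorshift64 generator (not the RNG the question's program uses):

```c
#include <stdint.h>

/* Tiny per-thread PRNG (xorshift64): each thread owns one state,
   seeded from its thread id, so no locking or sharing is needed. */
typedef struct { uint64_t s; } rng_t;

static void rng_seed(rng_t *r, uint64_t thread_id)
{
    /* mix in a constant so the all-zero state can never occur */
    r->s = 0x9E3779B97F4A7C15ULL ^ (thread_id + 1);
}

static double rng_uniform(rng_t *r)            /* uniform in [0, 1) */
{
    uint64_t x = r->s;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    r->s = x;
    return (x >> 11) * (1.0 / 9007199254740992.0);   /* top 53 bits / 2^53 */
}
```

In Fortran the same idea applies: keep the generator state `private` (or threadprivate) per thread rather than calling a shared `random_number` stream from all threads.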
Edit (Test of the propositions)
- Using `!$omp do` in combination with `schedule(dynamic,1)`, `schedule(guided)` or `schedule(nonmonotonic:dynamic)`, and/or `collapse(2)`, did not improve the run time; it made it at least three times longer. It was suggested that the number of elements in my simulations (N) is too low to see a significant improvement. If I ever try to render much higher numbers of elements (4096, 8192, ...) I will try those options again.
- Using `!$omp do` rather than a home-made ion distribution among cores showed an equivalent run time. It is easier to implement, so I will keep it.
- Replacing the inverse `dsqrt` by `**(-1/2)` showed to be equivalent in terms of run time.
- Delaying the square root and combining it with the third power of `r2inv` was also equivalent, so I replaced the whole series of operations by `**(-1.5)`.
- Same idea with `rji(1)*r2inv`: I compute `rji*r2inv` beforehand and only use the result in the next lines.
Edit 2 (Test with !$omp reduction(+:...))
I have tested the program with the following variants:

- `original`, which is the program I present in my question;
- `!$omp do schedule(dynamic,1)`;
- `!$omp reduction(+:a2_x,a2_y,a2_z)` with `schedule(dynamic,1)`;
- `!$omp reduction(+:a2_x,a2_y,a2_z)` with `schedule(guided)` and `do i = 2, N` / `do j = 1, i-1` for the loops (half the work);

for 1024 and 16384 ions. It turns out my original version is still faster for me, but the reduction version is not as "catastrophic" as the previous test without reduction.
| Version | N = 1024 | N = 16384 |
|---|---|---|
| original | 84 s | 15194 s |
| `schedule(dynamic,1)` | 1300 s | not tested |
| reduction and `schedule(dynamic,1)` | 123 s | 24860 s |
| reduction and `schedule(guided)` (2) | 121 s | 24786 s |
What is weird is that @PierU still gets a faster computation with reduction, while for me it is not optimal. Where could such a difference come from?
Hypothesis
- The fact that I have a 10-core CPU makes the workload lighter on each core for a given number of ions?
- I use double precision; maybe single precision would be faster?
- Do you have the AVX-512 instruction set? It has specific hardware to compute the inverse square root much faster (see this article).
- The bottleneck is elsewhere in my program. I am aware I should test only the Coulomb part, but I wanted to test it in the context of my program, to see if it really shortens the computation time. I have a section with a lot of `where` constructs and RNG; perhaps I should work on this.
Remark I received: `dsqrt` is very Fortran 66; simply `sqrt` has been enough for almost the last 50 years.

The `reduction` tests above were run for N = 1024 and N = 16384. Please be aware I changed the time step to make the runs quicker compared to the previous tests, so try not to compare these run times (edit 2) with the previous tests (especially for 1024 ions).