
A small example serial code, which has the same structure as my code, is shown below.

    PROGRAM MAIN
    IMPLICIT NONE
    INTEGER :: i, j
    DOUBLE PRECISION :: en, ei, es
    DOUBLE PRECISION :: ki(1000,2000), et(200), kn(2000)
    OPEN(UNIT=3, FILE='output.dat', STATUS='UNKNOWN')
    DO i = 1, 1000, 1
      DO j = 1, 2000, 1
        ki(i,j) = DBLE(i) + DBLE(j)
      END DO
    END DO
    DO i = 1, 200, 1
      en = 2.0d0/DBLE(200)*(i-1) - 1.0d0
      et(i) = en
      es = 0.0d0
      DO j = 1, 1000, 1
        kn = ki(j,:)
        CALL CAL(en, kn, ei)
        es = es + ei
      END DO
      WRITE (UNIT=3, FMT=*) et(i), es
    END DO
    CLOSE(UNIT=3)
    STOP
    END PROGRAM MAIN

    SUBROUTINE CAL (en, kn, ei)
    IMPLICIT NONE
    INTEGER :: i
    DOUBLE PRECISION :: en, ei, gf, p
    DOUBLE PRECISION :: kn(2000)
    p = 3.14d0
    ei = 0.0d0
    DO i = 1, 2000, 1
      gf = 1.0d0 / (en - kn(i) * p)
      ei = ei + gf
    END DO
    RETURN
    END SUBROUTINE CAL

I am running my code on a cluster with 32 CPUs per node and 250 GB of memory shared among the 32 CPUs on each node. I can use at most 32 nodes.

Each time the inner loop finishes, one data point is collected, so after all outer loop iterations are done there are 200 data points in total. If the inner loop is executed by only one CPU, the run takes more than 3 days (over 72 hours).

I want to parallelize both the inner loop and the outer loop. Would anyone please suggest how to parallelize this code?

Can I use the MPI technique for both the inner loop and the outer loop? If so, how do I differentiate the CPUs that execute the different loops (inner loop and outer loop)?

On the other hand, I have seen people mention parallelization with a hybrid MPI and OpenMP method. Can I use MPI for the outer loop and OpenMP for the inner loop? If so, how do I collect one data point after each inner loop finishes, and collect all 200 data points on one CPU after the outer loop is done? And how do I differentiate the CPUs that execute the inner loop from those that execute the outer loop?

Alternatively, would anyone offer any other suggestion for parallelizing the code and improving its efficiency? Thank you very much in advance.

  • I'm afraid that answering this question well really requires a lot more detail. Hybrid MPI+OpenMP may well be a good way to do this, but to say for certain you need to provide more detail, especially on memory usage and data dependencies, and a minimal example illustrating what you are trying to achieve would really help. Commented Jun 8, 2020 at 11:51
  • Note that MPI would require you to rewrite the entire loop and possibly even the entire code because it would need different start and end values of j on each processor. Have you tried any compiler switches? Commented Jun 8, 2020 at 12:10
  • @Ian Bush, High Performance Mark and wander95 Thank you very much for the reply. I have already modified my post with a small example serial code and information about the cluster where I am running my code. I would really appreciate it if you could provide any solution for the parallelization. Or would you please just modify this small serial example code with the hybrid MPI and OpenMP method? Thank you so much again. Commented Jun 8, 2020 at 15:34
  • Thanks for the example. If it gets reopened I will try to find time to answer. But one thing I should point out before you even think about parallelism is that the serial performance will be poor because you are accessing the elements of ki in the wrong order - you should really try to write your code so your fastest moving index is the first one, not the last. Thus before parallelism I suggest you rewrite the code to deal with ki transposed rather than as written above. Commented Jun 8, 2020 at 15:44
  • I won't do it by email - I try to help, but I am not a code writing service. I am preparing some teaching currently; if I can find time once that is done I will take a look. But the idea is to use MPI for the outer loop, and OpenMP for the inner on the little code you wrote above. Should be fairly easy, why don't you give it a go? Commented Jun 9, 2020 at 14:45

1 Answer


As mentioned in the comments, a good answer would require a more detailed question. However, at first sight it seems that parallelizing the inner loop

    DO j = 1, 1000, 1
      kn = ki(j,:)
      CALL CAL(en, kn, ei)
      es = es + ei
    END DO

should be enough to solve your problem, or at least it will be a good starting point. First of all, I guess that there is an error in the loop

    DO i = 1, 1000, 1
      DO j = 1, 2000, 1
        ki(j,k) = DBLE(j) + DBLE(k)
      END DO
    END DO

since k is set to 0 and there is no cell with an index of 0 (see your variable declarations). Also, ki is declared as a (1000,2000) array, while ki(j,i) would require a (2000,1000) array. Aside from these errors, I guess that ki should be calculated as

ki(i,j) = DBLE(j) + DBLE(i) 

If that is correct, I suggest the following solution:

    PROGRAM MAIN
    IMPLICIT NONE
    INTEGER :: i, j, k, icr, icr0, icr1
    DOUBLE PRECISION :: en, ei, es, timerRate
    DOUBLE PRECISION :: ki(1000,2000), et(200), kn(2000)
    INTEGER, PARAMETER :: nthreads = 1
    call system_clock(count_rate=icr)
    timerRate = real(icr)
    call system_clock(icr0)
    call omp_set_num_threads(nthreads)
    OPEN(UNIT=3, FILE='output.dat', STATUS='UNKNOWN')
    DO i = 1, 1000, 1
      DO j = 1, 2000, 1
        ki(i,j) = DBLE(j) + DBLE(i)
      END DO
    END DO
    DO i = 1, 200, 1
      en = 2.0d0/DBLE(200)*(i-1) - 1.0d0
      et(i) = en
      es = 0.0d0
      !$OMP PARALLEL DO private(j,kn,ei) firstprivate(en) shared(ki) reduction(+:es)
      DO j = 1, 1000, 1
        kn = ki(j,:)
        CALL CAL(en, kn, ei)
        es = es + ei
      END DO
      !$OMP END PARALLEL DO
      WRITE (UNIT=3, FMT=*) et(i), es
    END DO
    CLOSE(UNIT=3)
    call system_clock(icr1)
    write (*,*) (icr1-icr0)/timerRate ! report the computing time
    STOP
    END PROGRAM MAIN

    SUBROUTINE CAL (en, kn, ei)
    IMPLICIT NONE
    INTEGER :: i
    DOUBLE PRECISION :: en, ei, gf, p
    DOUBLE PRECISION :: kn(2000)
    p = 3.14d0
    ei = 0.0d0
    DO i = 1, 2000, 1
      gf = 1.0d0 / (en - kn(i) * p)
      ei = ei + gf
    END DO
    RETURN
    END SUBROUTINE CAL

I added some variables to measure the computing time ;-).

This solution runs in 5.14 s with nthreads=1 and in 2.75 s with nthreads=2. It does not halve the computing time, but it seems a good deal for a first shot. Unfortunately, this machine has a Core i3 processor, so I can't do better than nthreads=2. However, I wonder how the code would behave with nthreads=16.

Please let me know how it goes.

I hope this helps you.

Finally, a warning about the choice of variable status (private, firstprivate and shared): it must be considered carefully in the real code.
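Since the question and comments point toward a hybrid approach, here is a minimal hybrid MPI + OpenMP sketch of the example code. This is my own illustration under stated assumptions, not tested on the asker's cluster: MPI splits the 200 outer iterations into blocks across ranks, OpenMP parallelizes the inner loop within each rank, and MPI_GATHERV collects the 200 results on rank 0, which alone writes output.dat.

```fortran
PROGRAM MAIN_HYBRID
  USE mpi
  IMPLICIT NONE
  INTEGER :: i, j, ierr, rank, nprocs, i0, i1, nloc
  DOUBLE PRECISION :: en, ei, es
  DOUBLE PRECISION :: ki(1000,2000), kn(2000)
  DOUBLE PRECISION, ALLOCATABLE :: es_loc(:), et_loc(:), es_all(:), et_all(:)
  INTEGER, ALLOCATABLE :: counts(:), displs(:)

  CALL MPI_INIT(ierr)
  CALL MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
  CALL MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)

  DO i = 1, 1000
    DO j = 1, 2000
      ki(i,j) = DBLE(i) + DBLE(j)
    END DO
  END DO

  ! Block distribution of the 200 outer iterations over the ranks.
  i0 = rank*200/nprocs + 1
  i1 = (rank+1)*200/nprocs
  nloc = i1 - i0 + 1
  ALLOCATE(es_loc(nloc), et_loc(nloc))

  DO i = i0, i1
    en = 2.0d0/DBLE(200)*(i-1) - 1.0d0
    et_loc(i-i0+1) = en
    es = 0.0d0
    ! OpenMP handles the inner loop on each rank, as in the answer above.
    !$OMP PARALLEL DO private(j,kn,ei) reduction(+:es)
    DO j = 1, 1000
      kn = ki(j,:)
      CALL CAL(en, kn, ei)
      es = es + ei
    END DO
    !$OMP END PARALLEL DO
    es_loc(i-i0+1) = es
  END DO

  ! Gather the per-rank pieces on rank 0 in outer-loop order.
  ALLOCATE(counts(nprocs), displs(nprocs))
  CALL MPI_GATHER(nloc, 1, MPI_INTEGER, counts, 1, MPI_INTEGER, &
                  0, MPI_COMM_WORLD, ierr)
  IF (rank == 0) THEN
    displs(1) = 0
    DO i = 2, nprocs
      displs(i) = displs(i-1) + counts(i-1)
    END DO
    ALLOCATE(es_all(200), et_all(200))
  ELSE
    ALLOCATE(es_all(1), et_all(1))  ! dummy buffers, ignored off-root
  END IF
  CALL MPI_GATHERV(es_loc, nloc, MPI_DOUBLE_PRECISION, es_all, counts, &
                   displs, MPI_DOUBLE_PRECISION, 0, MPI_COMM_WORLD, ierr)
  CALL MPI_GATHERV(et_loc, nloc, MPI_DOUBLE_PRECISION, et_all, counts, &
                   displs, MPI_DOUBLE_PRECISION, 0, MPI_COMM_WORLD, ierr)

  ! Only rank 0 writes the file, preserving the original output order.
  IF (rank == 0) THEN
    OPEN(UNIT=3, FILE='output.dat', STATUS='UNKNOWN')
    DO i = 1, 200
      WRITE(UNIT=3, FMT=*) et_all(i), es_all(i)
    END DO
    CLOSE(UNIT=3)
  END IF
  CALL MPI_FINALIZE(ierr)
END PROGRAM MAIN_HYBRID

SUBROUTINE CAL (en, kn, ei)  ! unchanged from the question
  IMPLICIT NONE
  INTEGER :: i
  DOUBLE PRECISION :: en, ei, gf, p
  DOUBLE PRECISION :: kn(2000)
  p = 3.14d0
  ei = 0.0d0
  DO i = 1, 2000
    gf = 1.0d0 / (en - kn(i) * p)
    ei = ei + gf
  END DO
END SUBROUTINE CAL
```

Compile with something like mpif90 -fopenmp, run with one MPI rank per node, and set OMP_NUM_THREADS to the number of cores per node; the block distribution assumed here is the simplest choice, not the only one.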


4 Comments

Thank you so much for your help. I have modified the example code in my post regarding the variable ki. Your answer is very helpful to me. Can I ask one more question? You used the OpenMP technique to parallelize the inner loop, and it is indeed enough to parallelize only the inner loop here. However, the real code that I am writing is much more complex than this example, although it has the same structure. Is it possible to parallelize both the inner and outer loops? Thank you very much again for your help.
@Kieran By the design of OpenMP, it is not possible to parallelize both loops at once. It is generally best to parallelize the outer loop, but for the case you mentioned it seems efficient enough to parallelize the inner one. Indeed, since you accumulate into "es" for each index "i" while "es" is not a vector, it is more convenient to parallelize the inner loop so as to benefit from the reduction clause without using the ATOMIC or CRITICAL synchronization options. Note that reduction is the most efficient. To parallelize the outer loop you would have to rethink the structure of your variables.
@Kieran Also, if you parallelize the outer loop, the "et(i), es" pairs might be written in an order that is not the desired one.
Thank you very much for your help. I am trying to use MPI to parallelize the outer loop and OpenMP to parallelize the inner loop now. Your suggestion on the 'reduction clause' is really useful to me.
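The restructuring hinted at in these comments can be sketched like this (an assumption on my part, not the answerer's code): if es becomes an array es_arr(200) indexed by i, each outer iteration writes its own slot, so the outer loop can be parallelized with OpenMP without a reduction, and the file is written sequentially afterwards in the desired order.

```fortran
PROGRAM MAIN_OUTER
  IMPLICIT NONE
  INTEGER :: i, j
  DOUBLE PRECISION :: en, ei
  DOUBLE PRECISION :: ki(1000,2000), et(200), kn(2000), es_arr(200)
  DO i = 1, 1000
    DO j = 1, 2000
      ki(i,j) = DBLE(i) + DBLE(j)
    END DO
  END DO
  ! Each iteration i owns et(i) and es_arr(i), so no reduction,
  ! ATOMIC, or CRITICAL is needed on the outer loop.
  !$OMP PARALLEL DO private(i,j,en,kn,ei) shared(ki,et,es_arr)
  DO i = 1, 200
    en = 2.0d0/DBLE(200)*(i-1) - 1.0d0
    et(i) = en
    es_arr(i) = 0.0d0
    DO j = 1, 1000
      kn = ki(j,:)
      CALL CAL(en, kn, ei)
      es_arr(i) = es_arr(i) + ei
    END DO
  END DO
  !$OMP END PARALLEL DO
  ! The sequential write preserves the desired order of the pairs.
  OPEN(UNIT=3, FILE='output.dat', STATUS='UNKNOWN')
  DO i = 1, 200
    WRITE(UNIT=3, FMT=*) et(i), es_arr(i)
  END DO
  CLOSE(UNIT=3)
END PROGRAM MAIN_OUTER

SUBROUTINE CAL (en, kn, ei)  ! unchanged from the question
  IMPLICIT NONE
  INTEGER :: i
  DOUBLE PRECISION :: en, ei, gf, p
  DOUBLE PRECISION :: kn(2000)
  p = 3.14d0
  ei = 0.0d0
  DO i = 1, 2000
    gf = 1.0d0 / (en - kn(i) * p)
    ei = ei + gf
  END DO
END SUBROUTINE CAL
```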
