A small example serial code, which has the same structure as my code, is shown below.

    PROGRAM MAIN
      IMPLICIT NONE
      INTEGER :: i, j
      DOUBLE PRECISION :: en,ei,es
      DOUBLE PRECISION :: ki(1000,2000), et(200),kn(2000)
      OPEN(UNIT=3, FILE='output.dat', STATUS='UNKNOWN')
      DO i = 1, 1000, 1
        DO j = 1, 2000, 1
          ki(i,j) = DBLE(i) + DBLE(j)
        END DO
      END DO
      DO i = 1, 200, 1
        en = 2.0d0/DBLE(200)*(i-1)-1.0d0
        et(i) = en
        es = 0.0d0
        DO j = 1, 1000, 1
          kn=ki(j,:)
          CALL CAL(en,kn,ei)
          es = es + ei
        END DO
        WRITE (UNIT=3, FMT=*) et(i), es
      END DO
      CLOSE(UNIT=3)
      STOP
    END PROGRAM MAIN

    SUBROUTINE CAL (en,kn,ei)
      IMPLICIT NONE
      INTEGER :: i
      DOUBLE PRECISION :: en, ei, gf,p
      DOUBLE PRECISION :: kn(2000)
      p = 3.14d0
      ei = 0.0d0
      DO i = 1, 2000, 1
        gf = 1.0d0 / (en - kn(i) * p)
        ei = ei + gf
      END DO
      RETURN
    END SUBROUTINE CAL

I am running my code on a cluster. Each node has 32 CPUs that share 250 GB of memory, and I can use at most 32 nodes.
Each time the inner loop finishes, one value (es) has to be collected. After all outer loop iterations are done, there are 200 values in total. If the inner loop is executed by only one CPU, the calculation takes more than 3 days (over 72 hours).
I want to parallelize both the inner loop and the outer loop. Could anyone suggest how to parallelize this code?
Can I use MPI for both the inner loop and the outer loop? If so, how do I distinguish the CPUs that execute the inner loop from those that execute the outer loop?
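For reference, one way the MPI-only option for the outer loop could look is sketched below. This is an assumption about the approach rather than a tested solution: each process is identified by its MPI rank, takes every nprocs-th value of i, and the 200 results are combined on rank 0 with MPI_REDUCE. The names MPI_OUTER, es_loc, ne, nk and nj are made up for this sketch; CAL is the subroutine from the serial code.

```fortran
PROGRAM MPI_OUTER
  USE mpi
  IMPLICIT NONE
  INTEGER, PARAMETER :: ne = 200, nk = 1000, nj = 2000
  INTEGER :: i, j, ierr, rank, nprocs
  DOUBLE PRECISION :: en, ei
  DOUBLE PRECISION :: ki(nk,nj), kn(nj), et(ne), es_loc(ne), es(ne)

  CALL MPI_INIT(ierr)
  CALL MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)    ! rank = 0 .. nprocs-1
  CALL MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)

  ! Every rank builds its own full copy of ki (about 16 MB)
  DO i = 1, nk
    DO j = 1, nj
      ki(i,j) = DBLE(i) + DBLE(j)
    END DO
  END DO

  ! Round-robin split of the outer loop: rank r handles
  ! i = r+1, r+1+nprocs, r+1+2*nprocs, ...
  es_loc = 0.0d0
  DO i = rank + 1, ne, nprocs
    en = 2.0d0/DBLE(ne)*(i-1) - 1.0d0
    DO j = 1, nk
      kn = ki(j,:)
      CALL CAL(en, kn, ei)
      es_loc(i) = es_loc(i) + ei
    END DO
  END DO

  ! Entries a rank does not own stay zero, so summing over all
  ! ranks assembles the complete array of 200 values on rank 0
  CALL MPI_REDUCE(es_loc, es, ne, MPI_DOUBLE_PRECISION, MPI_SUM, &
                  0, MPI_COMM_WORLD, ierr)

  ! Only rank 0 writes the output file
  IF (rank == 0) THEN
    OPEN(UNIT=3, FILE='output.dat', STATUS='UNKNOWN')
    DO i = 1, ne
      et(i) = 2.0d0/DBLE(ne)*(i-1) - 1.0d0
      WRITE (UNIT=3, FMT=*) et(i), es(i)
    END DO
    CLOSE(UNIT=3)
  END IF

  CALL MPI_FINALIZE(ierr)
END PROGRAM MPI_OUTER

SUBROUTINE CAL (en,kn,ei)
  IMPLICIT NONE
  INTEGER :: i
  DOUBLE PRECISION :: en, ei, gf, p
  DOUBLE PRECISION :: kn(2000)
  p = 3.14d0
  ei = 0.0d0
  DO i = 1, 2000, 1
    gf = 1.0d0 / (en - kn(i) * p)
    ei = ei + gf
  END DO
END SUBROUTINE CAL
```

With this layout, at most 200 ranks have work to do, because the outer loop has only 200 iterations. Using more CPUs than that would require splitting the inner loop as well.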
On the other hand, I saw someone mention hybrid parallelization with MPI and OpenMP. Can I use MPI for the outer loop and OpenMP for the inner loop? If so, how do I collect the one value produced each time the inner loop finishes, and then gather all 200 values on one CPU after all outer loop iterations are done? How do I distinguish the CPUs that execute the inner loop from those that execute the outer loop?
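In a hybrid layout, the outer loop would typically be divided among MPI ranks by rank number, while the threads inside each rank share the inner loop. The OpenMP part on its own might look like the sketch below, shown here without MPI to keep it short. It assumes CAL is thread-safe (it only uses local variables and its arguments) and an OpenMP-capable compiler, for example gfortran with -fopenmp. The program name OMP_INNER and the parameters ne, nk and nj are made up for this sketch. The REDUCTION clause is what collects the one value per inner loop: every thread accumulates a private partial sum, and the partial sums are added into es when the loop ends.

```fortran
PROGRAM OMP_INNER
  IMPLICIT NONE
  INTEGER, PARAMETER :: ne = 200, nk = 1000, nj = 2000
  INTEGER :: i, j
  DOUBLE PRECISION :: en, ei, es
  DOUBLE PRECISION :: ki(nk,nj), et(ne), kn(nj)

  OPEN(UNIT=3, FILE='output.dat', STATUS='UNKNOWN')
  DO i = 1, nk
    DO j = 1, nj
      ki(i,j) = DBLE(i) + DBLE(j)
    END DO
  END DO

  DO i = 1, ne
    en = 2.0d0/DBLE(ne)*(i-1) - 1.0d0
    et(i) = en
    es = 0.0d0
    ! Threads share the j iterations. kn and ei are per-thread
    ! scratch variables; es is combined across threads at the end.
    !$OMP PARALLEL DO DEFAULT(SHARED) PRIVATE(j, kn, ei) REDUCTION(+:es)
    DO j = 1, nk
      kn = ki(j,:)
      CALL CAL(en, kn, ei)
      es = es + ei
    END DO
    !$OMP END PARALLEL DO
    ! Back in the serial region: es holds the full sum for this i
    WRITE (UNIT=3, FMT=*) et(i), es
  END DO
  CLOSE(UNIT=3)
END PROGRAM OMP_INNER

SUBROUTINE CAL (en,kn,ei)
  IMPLICIT NONE
  INTEGER :: i
  DOUBLE PRECISION :: en, ei, gf, p
  DOUBLE PRECISION :: kn(2000)
  p = 3.14d0
  ei = 0.0d0
  DO i = 1, 2000, 1
    gf = 1.0d0 / (en - kn(i) * p)
    ei = ei + gf
  END DO
END SUBROUTINE CAL
```

Because floating-point addition is not associative, the summed values can differ from the serial result in the last few digits, depending on how the iterations are split among threads.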
Alternatively, could anyone offer other suggestions for parallelizing the code and improving its efficiency? Thank you very much in advance.
jon each processor. Have you tried any compiler switches?