
I am preparing a program which must use OpenMP parallelization. The program compares two frames block by block, and OpenMP must be applied in two ways: one where the work is split across threads at the frame level, and another where it is split at the block level, finding the minimum cost of each comparison.

The skeleton of the code looks roughly like this:

    int main() {
        // code
        for () {
            for () {
                searchBlocks();
            }
        }
        // code
    }

    searchBlocks() {
        for () {
            for () {
                getCost();
            }
        }
    }

    getCost() {
        for () {
            for () {
                // operations
            }
        }
    }

Then, for parallelization at the frame level, I can simply change the main nested loop to this:

    int main() {
        // code
        omp_set_num_threads(threadNo);
        #pragma omp parallel for collapse(2) if (isFrame)
        for () {
            for () {
                searchBlocks();
            }
        }
        // code
    }

Where threadNo is specified at run time and isFrame is a parameter that says whether frame-level parallelization is wanted. This works, and the execution time of the program shrinks as the number of threads grows. However, when I tried block-level parallelization, I attempted the following:

    getCost() {
        #pragma omp parallel for collapse(2) if (isFrame)
        for () {
            for () {
                // operations
            }
        }
    }

I'm doing this in getCost() because it is the innermost function, where the comparison of each pair of corresponding blocks happens. But with this change the program takes far longer to execute: running it without OpenMP support (a single thread) finishes sooner than running it with OpenMP and 10 threads.

Is there something I'm not declaring right here? I'm setting the number of threads right before the nested loops of the main function, just as in the frame-level parallelization.

Please let me know if I need to explain this better, or what it is I could change in order to manage to run this parallelization successfully, and thanks to anyone who may provide help.

2 Answers


Remember that every time your program executes a #pragma omp parallel directive, it spawns new threads. Spawning threads is costly, and since getCost() is called many, many times and each call is not computationally heavy, you end up spending all the time spawning and joining threads (which essentially means making costly system calls).

On the other hand, when a #pragma omp for directive is executed, it doesn't spawn any threads; it lets the existing threads (spawned by an enclosing parallel directive) execute separate pieces of the work in parallel.

So what you want is to spawn threads at the top level of your computation (notice: no for):

    int main() {
        // code
        omp_set_num_threads(threadNo);
        #pragma omp parallel
        for () {
            for () {
                searchBlocks();
            }
        }
        // code
    }

and then later split the loops (notice: no parallel):

    getCost() {
        #pragma omp for collapse(2)
        for () {
            for () {
                // operations
            }
        }
    }

(Note that the if clause is only valid on the parallel directive, not on a plain omp for, so it is dropped here.)

3 Comments

Thank you. I wrote #pragma omp parallel if (isBlock) right before the nested loops of main, and #pragma omp for collapse(2) right before the nested loops of getCost. While the program does run faster (270 sec with 10 threads, against 400 sec without optimizations), I noticed it stops right before finishing the nested loops in main, so it appears to run forever (I saw this with printf statements: once it reaches the end it simply stops, but the program doesn't finish). Is it that the compiler fails to combine the results?
There is an implicit barrier at the end of a parallel block, so all threads wait there until all the other threads have finished executing the block. Maybe some thread gets lost in the loop (e.g. waiting for a signal while there are no other threads left to give such a signal)?
Thank you very much. It seems to have been exactly as you said. From this page ppc.cs.aalto.fi/ch3/nowait, a #pragma omp for will not continue until all threads are done, and since the threads modify a counter variable for the total cost (declared right before the nested loops in getCost), it's likely at least two of them are waiting on one another to see who increments that variable, creating the apparent infinite loop I saw. Adding a nowait clause removes the hang, but the cost value it computes drops dramatically. I'll see if I have to redefine this with some form of reduction. But thanks!!

You get cascading parallelization. Take the loop bounds in main as I, J and those in getCost as K, L: you end up asking for I * J * K * L threads. Any operating system will go crazy under that load; you're not far from a fork bomb.

Also, it isn't clear why you use collapse: there are still two loops inside, and I don't see much point in that parameter here. Try removing the parallelism in getCost.

1 Comment

Thank you. What you and @Alex Larionov said makes sense. I declared the parallel directive outside the main nested loop, and the for directive right outside the getCost loop. Though it is comparatively faster, it hangs right at the end somehow.
