
Basically I have a program that needs to go over several individual pictures. I do this by:

#pragma omp paralell num_threads(4)
#pragma omp paralell for
for(picture = 0; picture < 4; picture++){
    for(int row = 0; row < 1000; row++){
        for(int col = 0; col < 1000; col++){
            //do stuff with pixel[picture][row][col]
        }
    }
}

I just want to split the work among 4 cores (1 core per picture) so that each core/thread is working on a specific picture. That way core 0 is working on picture 0, core 1 on picture 1, and so on. The machine it is being tested on only has 4 cores as well. What is the best way to use OpenMP directives for this scenario? The one I posted is what I think would give the best performance for this scenario.

Keep in mind this is pseudocode. The goal of the program is not important; parallelizing these loops efficiently is the goal.

  • Really 10^12 pixels? With 4 pictures and 3 bytes/pixel that would be 12 terabytes? Where do you store these? Commented Mar 11, 2017 at 9:44
  • I don't; I'm using it to show what the scenario might look like. Changed it to realistic levels. Commented Mar 11, 2017 at 9:55
  • 1) Your real code may want to spell "parallel" correctly :-). 2) Do you intend to use nested parallelism? It's probably a bad idea... 3) Parallelizing over the middle loop (or collapsing the two inner loops) would give you much more available parallelism, so the ability to exploit larger machines... Commented Mar 13, 2017 at 16:08

1 Answer


Just adding a simple

#pragma omp parallel for 

is a good starting point for your problem. Don't bother hard-coding how many threads it should use. The runtime will usually do the right thing.

However, it is impossible to say in general what is most efficient. There are many performance factors that are impossible to tell from your limited general example. Your code may be memory bound and benefit only very little from parallelization on desktop CPUs. You may have a load imbalance, which means you need to split the work into more chunks and process them dynamically. That could be done by parallelizing the middle loop or using nested parallelism. Whether the middle loop parallelization works well depends on the amount of work done by the inner loop (and hence the ratio of useful work to overhead). The memory layout also heavily influences the efficiency of the parallelization. Or maybe you even have data dependencies in the inner loop preventing parallelization there...

The only general recommendation one can give is to always measure, never guess. Learn to use the powerful parallel performance analysis tools available and incorporate that into your workflow.
