
I am trying to parallelize some code using OpenMP. The serial time for my current input size is around 9 seconds. The code has the following form:

```c
int main() {
    /* do some stuff */
    myfunction();
}

void myfunction() {
    for (int i = 0; i < n; i++) {
        // it has some parameters but that is beyond the point I guess
        int rand = custom_random_generator();
        compute(rand);
    }
}
```

Here the random generator can be executed in parallel, since there are no dependencies between iterations, and the same goes for the compute function, so I attempted to parallelize this piece. All of my attempts failed. My first thought was to turn the loop body into tasks so they would execute in parallel, but that gave a slower result. Here is what I did:

```c
void myfunction() {
    for (int i = 0; i < n; i++) {
        #pragma omp task
        {
            // it has some parameters but that is beyond the point I guess
            int rand = custom_random_generator();
            compute(rand);
        }
    }
}
```

Result: 23 seconds, more than double the serial time

Putting the task on compute() only gave the same result.

Even worse attempt:

```c
void myfunction() {
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        // it has some parameters but that is beyond the point I guess
        int rand = custom_random_generator();
        compute(rand);
    }
}
```

Result: 45 seconds

Theoretically speaking, why could this happen? I know that to diagnose my exact problem you would need a minimal reproducible example, but my goal with this question is to understand the different theories that could explain the behavior and apply them myself. Why would parallelizing an "embarrassingly parallel" piece of code result in far worse performance?

  • Something else is causing all this overhead; it cannot be only the thread creation. I would suspect false sharing, cache invalidation, and so on. How many cores do you have? Commented May 1, 2021 at 9:52
  • I have a MacBook Air 2019 with an Intel Core i5, so I have 2 cores but can run 4 threads in parallel as it supports hyperthreading @dreamcrash Commented May 1, 2021 at 9:54
  • Can you tell me the times with #pragma omp parallel for num_threads(1), then num_threads(2) ? Commented May 1, 2021 at 9:55
  • 1
    No problem, look for race conditions shared states this include external function calls Commented May 1, 2021 at 10:53
  • 1
    Your task code has no parallelism! (There is no parallel directive anywhere in the code). You may also want to look at papers on parallel random number generation, it is not a trivial task. (E.g. Parallel Random Numbers: As Easy as 1, 2, 3 - The Salmons thesalmons.org/john/random123/papers/random123sc11.pdf ) Commented May 3, 2021 at 8:28

1 Answer


One theory is the overhead associated with creating and maintaining multiple threads.

The advantages of parallel programming only show when each iteration performs enough processor-intensive work to outweigh that overhead.

A simple for loop with a cheap routine inside will not benefit from it.


6 Comments

My random generator is pretty complex, but my compute function just returns the result of a simple equation; I think I will try to parallelize only the generator then.
Task failed successfully! It improved the time to 18 seconds, but that is still twice as slow as serial. I guess that confirms that task creation is creating overhead.
@Sergio 🤣 Great throwback; Windows still has some gems though, check the return value of successful WinAPI function calls.
:D In all fairness, I think the problem is that the OP's example does not accurately represent the code being parallelized, hence one can only speculate.
