I'm trying to understand why the following code runs much faster on 1 thread than on 4 threads with OpenMP. It is based on a similar question, "OpenMP recursive tasks", but after implementing one of the suggested answers I don't get the intended speedup, which suggests I've done something wrong (I'm not sure what). Do people get better speed running the code below on 4 threads than on 1 thread? I'm seeing a 10x slowdown on 4 cores, where I would expect a moderate speedup.
    #include <iostream>
    #include <omp.h>

    int fib(int n) {
        if (n == 0 || n == 1)
            return n;
        if (n < 20) // EDITED CODE TO INCLUDE CUTOFF
            return fib(n-1) + fib(n-2);
        int res, a, b;
        #pragma omp task shared(a)
        a = fib(n-1);
        #pragma omp task shared(b)
        b = fib(n-2);
        #pragma omp taskwait
        res = a + b;
        return res;
    }

    int main() {
        omp_set_nested(1);
        omp_set_num_threads(4);
        double start_time = omp_get_wtime();
        #pragma omp parallel
        {
            #pragma omp single
            {
                std::cout << fib(25) << std::endl;
            }
        }
        double time = omp_get_wtime() - start_time;
        std::cout << "Time(ms): " << time*1000 << std::endl;
        return 0;
    }
omp_set_nested(1)? I don't see any nested parallel sections.

fib(48), lower inputs were too quick to get relevant results.
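To try larger inputs like the fib(48) mentioned above, note that fib(48) does not fit in a 32-bit `int`, so the result type has to be wider. The following is only a sketch of how the cutoff version could be set up for that, not a confirmed fix for the slowdown. The names `fib_serial`, `fib_tasks`, `fib_parallel`, and `time_ms` and the `cutoff` parameter are made up for illustration, and timing uses `std::chrono` rather than `omp_get_wtime()` so the file still builds (serially) when OpenMP is not enabled.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Plain recursion: the serial baseline, also used below the cutoff.
std::int64_t fib_serial(int n) {
    return n < 2 ? n : fib_serial(n - 1) + fib_serial(n - 2);
}

// Task-based recursion. `cutoff` is a hypothetical tuning knob: subproblems
// smaller than it are solved serially instead of spawning more tasks.
std::int64_t fib_tasks(int n, int cutoff) {
    if (n < cutoff) return fib_serial(n);
    std::int64_t a = 0, b = 0;
    #pragma omp task shared(a)
    a = fib_tasks(n - 1, cutoff);
    #pragma omp task shared(b)
    b = fib_tasks(n - 2, cutoff);
    #pragma omp taskwait
    return a + b;
}

// Opens the parallel region; one thread creates the root tasks and the
// rest of the team picks them up. Without OpenMP this simply runs serially.
std::int64_t fib_parallel(int n, int cutoff) {
    assert(cutoff >= 2);  // smaller cutoffs would never reach the base case
    std::int64_t result = 0;
    #pragma omp parallel num_threads(4)
    #pragma omp single
    result = fib_tasks(n, cutoff);
    return result;
}

// Runs f once and returns the elapsed wall-clock time in milliseconds.
template <class F>
double time_ms(F&& f) {
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

Comparing `time_ms([&] { r = fib_serial(n); })` against `time_ms([&] { r = fib_parallel(n, 20); })` for a few values of `n` may show whether the slowdown is specific to small inputs. In the original code the timed window also covers the creation of the thread team, which can matter when the computation itself is very short.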