
I'm trying to understand why the following runs much faster on 1 thread than on 4 threads on OpenMP. The following code is actually based on a similar question: OpenMP recursive tasks but when trying to implement one of the suggested answers, I don't get the intended speedup, which suggests I've done something wrong (and not sure what it is). Do people get better speed when running the below on 4 threads than on 1 thread? I'm getting a 10 times slowdown when running on 4 cores (I should be getting moderate speedup rather than significant slowdown).

```cpp
#include <iostream>
#include <omp.h>
using std::cout;
using std::endl;

int fib(int n) {
    if (n == 0 || n == 1)
        return n;
    if (n < 20) // EDITED CODE TO INCLUDE CUTOFF
        return fib(n - 1) + fib(n - 2);
    int res, a, b;
    #pragma omp task shared(a)
    a = fib(n - 1);
    #pragma omp task shared(b)
    b = fib(n - 2);
    #pragma omp taskwait
    res = a + b;
    return res;
}

int main() {
    omp_set_nested(1);
    omp_set_num_threads(4);
    double start_time = omp_get_wtime();
    #pragma omp parallel
    {
        #pragma omp single
        {
            cout << fib(25) << endl;
        }
    }
    double time = omp_get_wtime() - start_time;
    std::cout << "Time(ms): " << time * 1000 << std::endl;
    return 0;
}
```
  • Every number in the Fibonacci sequence depends on the previous numbers. Why would this lend itself to multithreading? You can't expect any performance gain. Commented May 15, 2018 at 4:16
  • One thing you could do to speed it up is to cache Fibonacci values already calculated. That would give you a significant speedup. Commented May 15, 2018 at 4:26
  • Why omp_set_nested(1)? I don't see any nested parallel sections. Commented May 15, 2018 at 5:35
  • Please provide 1) specific performance results 2) information about your system 3) a minimal reproducible example 4) your compiler, version, and command Commented May 15, 2018 at 5:36
  • I did some experiments on 2x12 core Haswell E5 / GCC, and the speedup strongly depended on the cutoff parameter. For cutoff 20, I got speedup around 5, for cutoff 42 around 10 with 24 threads. I calculated fib(48), lower inputs were too quick to get relevant results. Commented May 15, 2018 at 5:55

2 Answers


Have you tried it with a large number?

In multi-threading, it takes some time to initialize work on the CPU cores. For smaller jobs, which are done very quickly on a single core, threading slows the job down because of this overhead.

Multi-threading shows a speed increase if the job normally takes longer than a second, not milliseconds.

There is also another bottleneck for threading. If your code tries to create too many threads, usually via recursive methods, this may delay all running threads and cause a massive setback.

This is mentioned on the OpenMP/Tasks wiki page, where a manual cut-off is suggested. There need to be two versions of the function: when the recursion goes too deep, it continues with the single-threaded version.

EDIT: The cutoff variable needs to be increased before entering the OMP zone.


The following code is provided for the OP to test:

```cpp
#include <iostream>
#include <omp.h>
using std::cout;
using std::endl;

#define CUTOFF 5

// Serial version: used once the recursion is deeper than CUTOFF
int fib_s(int n) {
    if (n == 0 || n == 1)
        return n;
    int res, a, b;
    a = fib_s(n - 1);
    b = fib_s(n - 2);
    res = a + b;
    return res;
}

// Parallel version: spawns tasks until the depth counter co reaches CUTOFF
int fib_m(int n, int co) {
    if (co >= CUTOFF)
        return fib_s(n);
    if (n == 0 || n == 1)
        return n;
    int res, a, b;
    co++;
    #pragma omp task shared(a)
    a = fib_m(n - 1, co);
    #pragma omp task shared(b)
    b = fib_m(n - 2, co);
    #pragma omp taskwait
    res = a + b;
    return res;
}

int main() {
    omp_set_nested(1);
    omp_set_num_threads(4);
    double start_time = omp_get_wtime();
    #pragma omp parallel
    {
        #pragma omp single
        {
            cout << fib_m(25, 1) << endl;
        }
    }
    double time = omp_get_wtime() - start_time;
    std::cout << "Time(ms): " << time * 1000 << std::endl;
    return 0;
}
```

RESULT: With the CUTOFF value set to 10, it took under 8 seconds to calculate the 45th term.

```
co=1    14.5s
co=2     9.5s
co=3     6.4s
co=10    7.5s
co=15    7.0s
co=20    8.5s
co=21  >18.0s
co=22  >40.0s
```

12 Comments

Even for large numbers (> 35), the Fibonacci computation is much faster on one thread than on several.
@user308485, even 35 is not as big a number as you think. I have tested this nested method and the timing only becomes relevant above 40. Try it again in that range.
Still no gain using more cores. Have you tested this code just now? Are you able to get some speedup with more cores at some particular size?
Look at the fib() subroutine: it performs one addition (i.e. compute) and "spawns" two threads (i.e. overhead), so IMHO it is no surprise there is no benefit in using OpenMP here (a single for loop would even be faster than a recursive implementation too).
@user308485, I have added an edit to solve your problem, but kept the previous parts for future reference: you need to add a cut-off to your function. Don't forget, you need two versions of your function. By the way, your code does not do parallel looping but builds a tree of threads; that is why it takes so much time.

I do not know how to tell the compiler not to create parallel tasks after a certain depth: omp_set_max_active_levels seems to have no effect, and omp_set_nested is deprecated (and also has no effect).

So I have to manually specify after which level not to create more tasks, which IMHO is unfortunate. I still believe there should be a way to do this (if somebody knows, kindly let me know). Here is how I attempted it; for input sizes above 20, the parallel version runs a bit faster than the serial one (around 70-80% of the time). Ref: the code is taken from a course assignment (no solution was provided, so I don't know how to do it efficiently): https://www.cs.iastate.edu/courses/2018/fall/com-s-527x

```c
#include <stdio.h>
#include <omp.h>
#include <math.h>

int fib(int n, int rec_height) {
    int x = 1, y = 1;
    if (n < 2)
        return n;
    // Surprisingly, without this check the parallel code is slower than the
    // serial one (I believe it is not needed, I just don't know how to use OpenMP)
    if (rec_height > 0) {
        rec_height -= 1;
        #pragma omp task shared(x)
        x = fib(n - 1, rec_height);
        #pragma omp task shared(y)
        y = fib(n - 2, rec_height);
        #pragma omp taskwait
    } else {
        x = fib(n - 1, rec_height);
        y = fib(n - 2, rec_height);
    }
    return x + y;
}

int main() {
    int tot_thread = 16;
    int recDepth = (int)log2f(tot_thread);
    if (((int)pow(2, recDepth)) < tot_thread)
        recDepth += 1;
    printf("\nrecDepth: %d\n", recDepth);
    omp_set_max_active_levels(recDepth);
    omp_set_nested(recDepth - 1);

    int n, fibonacci;
    double starttime;
    printf("\nPlease insert n, to calculate fib(n):\n"); /* fixed: was printing uninitialized n */
    scanf("%d", &n);
    omp_set_num_threads(tot_thread);
    starttime = omp_get_wtime();
    #pragma omp parallel
    {
        #pragma omp single
        {
            fibonacci = fib(n, recDepth);
        }
    }
    printf("\n\nfib(%d)=%d \n", n, fibonacci);
    printf("calculation took %lf sec\n", omp_get_wtime() - starttime);
    return 0;
}
```
