
I have this C++ code.

The loop goes through the matrix, finds the minimum element in each row, and subtracts it from each element of the corresponding row. The variable myr is the sum of all the minimum elements.

I am trying to parallelize it with parallel for:

    int min = 0;
    int myr = 0;
    int temp[SIZE][SIZE];
    int size = 0;
    ... // some initialization

    omp_set_num_threads(1);
    start_time = omp_get_wtime();
    #ifdef _OPENMP
    #pragma omp parallel for firstprivate(min, size) reduction(+:myr)
    #endif
    for (int i = 0; i < size; i++) {
        min = INT_MAX;  // was INFINITY, which is a float; INT_MAX is the proper int sentinel
        for (int j = 0; j < size; j++) {
            if (temp[i][j] < min)
                min = temp[i][j];
        }
        myr += min;
        for (int j = 0; j < size; j++)
            temp[i][j] -= min;
    }
    end_time = omp_get_wtime();

If I set omp_set_num_threads(2); instead, this part of the code starts working slower.

My processor has 2 cores.

Why does the code work slower with 2 threads?

  • First of all, OMP doesn't mean that automagically you get increased speed. Second, the conditional branch probably acts as a barrier, so the overhead is bigger. Commented Sep 10, 2012 at 13:19
  • The ultimate question is: is your algorithm suitable for data parallelism? Can thread A run one iteration of your outer for loop and thread B another iteration of the outer loop without them having to wait on each other? Commented Sep 10, 2012 at 13:20
  • From first looks, it cannot. So your adding a thread is futile. Commented Sep 10, 2012 at 13:21
  • @Tony The Lion: Why do you say it cannot? The only part where they clash is the reduction variable, which is one addition done at the end. Commented Sep 10, 2012 at 13:22
  • I've seen lots of questions on the multithreading tag where people just assume that more threads equal better performance. Maybe we should create a wiki or something explaining why it's not always the case. Commented Sep 10, 2012 at 13:22

2 Answers


There must be some aliasing or something going on. Make things simpler for OpenMP:

    int const size0 = size;
    #ifdef _OPENMP
    #pragma omp parallel for reduction(+:myr)
    #endif
    for (int i = 0; i < size0; i++) {
        int min = INT_MAX;  // INT_MAX rather than the float constant INFINITY
        int *tmp = temp[i];
        for (int j = 0; j < size0; j++) {
            if (tmp[j] < min)
                min = tmp[j];
        }
        for (int j = 0; j < size0; j++)
            tmp[j] -= min;
        myr += min;
    }

That is, keep as many variables as possible local, and make them const where you can.




The parallel part can be reinterpreted as follows (I have used the snippet by @jens-gustedt, but in my experience it didn't make much difference):

    #pragma omp parallel private(myr_private) shared(myr)
    {
        myr_private = 0;
        #pragma omp for
        for (int i = 0; i < size; i++) {
            int min = INT_MAX;
            int *tmp = temp[i];
            for (int j = 0; j < size; j++) {
                if (tmp[j] < min)
                    min = tmp[j];
            }
            for (int j = 0; j < size; j++)
                tmp[j] -= min;
            myr_private += min;
        }
        #pragma omp critical
        {
            myr += myr_private;
        }
    }

(This interpretation comes straight from http://www.openmp.org/mp-documents/OpenMP3.1.pdf, Example A.36.2c.) If the number of threads is n > 1, there is overhead when #pragma omp parallel creates the additional thread(s), and again in the critical section, where all of the threads have to wait for each other.

I have experimented with different matrix sizes, and in my limited tests two threads were considerably faster with sizes above 1000, and started lagging behind with sizes below 500.


