
I am trying to speed up my MPI project by utilizing OpenMP. I have a dataset of 1000 2D points, and am using a brute-force algorithm to find the minimum distance in the 2D graph. When I try to split the thread of execution, however, it greatly hurts the performance. How can I properly utilize OpenMP?

Here is my attempt:

double calcDistance(double input[][2], int start, int stop){
    double temp;
    //declare and initialize minimum
    double minimum = pow(((input[start+1][0]) - (input[start][0])), 2)
                   + pow(((input[start+1][1]) - (input[start][1])), 2);
    minimum = sqrt(minimum);
    closestIndex1 = start;
    closestIndex2 = start + 1;

    //Brute force algorithm to find minimum distance in dataset.
    #pragma omp parallel for
    for(int i = start; i < stop; i++){
        for(int j = start; j < stop; j++){
            temp = pow(((input[j][0]) - (input[i][0])), 2)
                 + pow(((input[j][1]) - (input[i][1])), 2);
            temp = sqrt(temp);
            if(temp < minimum && i < j){
                minimum = temp;
                closestIndex1 = i;
                closestIndex2 = j;
            }//end if
        }//end j
    }//end i
    return minimum;
}

I have to say WOW. Thank you, that was incredibly helpful, and really cleared up a bunch of questions that I had. Again, thank you, gha.st.

  • Does the program give the right answer? Don't you need something like #pragma omp critical before return minimum to force it to wait until all threads are complete? It also seems like you need some private variables. Does the program give the same answer every time it is run? Commented May 3, 2014 at 3:22
  • Yes, the program runs fine, and the answers are correct and always the same. It's just that the performance suffers. With just MPI, the program executes in ~0.06-0.08s. Adding OpenMP, ~0.5-0.8s. Commented May 3, 2014 at 3:34
  • I added some serial optimizations, giving a 5x speedup in the test case. Commented May 3, 2014 at 5:49

1 Answer


Analysis

First, it is pure luck that your program seems to work like this. You do indeed have a data race, which causes invalid results on my machine. Consider the following test harness for this post:

::std::cout << ::xtd::target_info() << "\n\n"; // [target os] [target architecture] with [compiler]

static const int count = 30000;
auto gen = ::std::bind(::std::normal_distribution<double>(0, 1000), ::std::mt19937_64(42));
std::unique_ptr<double[][2]> input(new double[count][2]);
for(size_t i = 0; i < count; ++i) {
    input[i][0] = gen();
    input[i][1] = gen();
}

::xtd::stopwatch sw; // does what its name suggests
sw.start();
double minimum = calcDistance(input.get(), 0, count);
sw.stop();

::std::cout << minimum << "\n";
::std::cout << sw << "\n";

When executing your function with the omp pragma removed, its output is:

Windows x64 with icc 14.0
0.0559233
7045 ms

or

Windows x64 with msvc VS 2013 (18.00.21005)
0.0559233
7272 ms

When executing it with the omp pragma intact, its output is:

Windows x64 with icc 14.0
0.324085
675.9 ms

or

Windows x64 with msvc VS 2013 (18.00.21005)
0.0559233
4338 ms

Since the machine uses 24 threads (on 12 cores with HT enabled), the speedup is evident, but could possibly be better, at least for msvc. The compiler that generates a faster program (icc) also shows the data race by giving wrong results which are different each run.

Note: I was also able to see an incorrect result from msvc, when compiling a debug version for x86 with 10k iterations.

Proper usage of iteration-local variables

The temp variable in your code has a lifetime of one iteration of the innermost loop. By moving its scope to match its lifetime, we can eliminate one source of data races. I also took the liberty of removing two unused variables and changing the initialization of minimum to a constant:

double calcDistance(double input[][2], int start, int stop){
    double minimum = ::std::numeric_limits<double>::infinity();
    //#pragma omp parallel for // still broken
    for(int i = start; i < stop; i++){
        for(int j = start; j < stop; j++) {
            double temp = pow(((input[j][0]) - (input[i][0])), 2)
                        + pow(((input[j][1]) - (input[i][1])), 2);
            temp = sqrt(temp);
            if(temp < minimum && i < j)
                minimum = temp;
        }
    }
    return minimum;
}

A proper OMP minimum computation

OMP has support for reductions, which will in all probability perform fairly well. To try it, we will use the following pragma, which ensures that each thread works on its own minimum variable, with the per-thread results combined using the minimum operator:

#pragma omp parallel for reduction(min: minimum) 
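Put together, a minimal sketch of the reduction version looks like this (assuming a compiler with OpenMP 3.1 min-reduction support; the function name and signature match the code above):

```cpp
#include <cmath>
#include <limits>

// Brute-force closest-pair distance, parallelized with an OpenMP
// min-reduction: each thread accumulates its own private minimum,
// and the runtime combines them at the end of the loop.
double calcDistance(double input[][2], int start, int stop) {
    double minimum = std::numeric_limits<double>::infinity();
    #pragma omp parallel for reduction(min: minimum)
    for (int i = start; i < stop; i++) {
        for (int j = start; j < stop; j++) {
            double temp = std::pow(input[j][0] - input[i][0], 2)
                        + std::pow(input[j][1] - input[i][1], 2);
            temp = std::sqrt(temp);
            if (temp < minimum && i < j)
                minimum = temp;
        }
    }
    return minimum;
}
```

Because the reduction clause handles all synchronization, no critical section is needed inside the loop.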

The results validate the approach for ICC:

Windows x64 with icc 14.0
0.0559233
622.1 ms

But MSVC howls error C3036: 'min' : invalid operator token in OpenMP 'reduction' clause, because it does not support minimum reductions. To define our own reduction, we will use a technique called double-checked locking:

double calcDistance(double input[][2], int start, int stop){
    double minimum = ::std::numeric_limits<double>::infinity();
    #pragma omp parallel for
    for(int i = start; i < stop; i++){
        for(int j = start; j < stop; j++) {
            double temp = pow(((input[j][0]) - (input[i][0])), 2)
                        + pow(((input[j][1]) - (input[i][1])), 2);
            temp = sqrt(temp);
            if(temp < minimum && i < j) {
                #pragma omp critical
                if(temp < minimum && i < j)
                    minimum = temp;
            }
        }
    }
    return minimum;
}

This is not only correct, but also leads to comparable performance for MSVC (note that this is significantly faster than the incorrect version!):

Windows x64 with msvc VS 2013 (18.00.21005)
0.0559233
653.1 ms

The performance of ICC does not suffer significantly:

Windows x64 with icc 14.0
0.0559233
636.8 ms

Serial optimizations

While the above is a proper parallelization of your serial code, it can be significantly optimized: because of your i < j condition, you compute a whole bunch of temp results that you are never going to use.

By simply starting the inner loop at j = i + 1, this wasted computation is avoided entirely, and the loop condition is simplified.

Another trick is delaying the sqrt computation until the last possible moment: since sqrt is monotonic, we can compare squared distances directly and take the root only once, for the final result.

Finally, calling pow for a square is fairly inefficient, since it incurs a ton of overhead that we do not need.
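The difference can be sketched with a small helper (sqDist is a hypothetical name for illustration, not from the original code):

```cpp
// Squaring by multiplication instead of the generic pow() call path.
// Returns the squared Euclidean distance between two 2D points,
// deferring sqrt to the caller.
inline double sqDist(const double a[2], const double b[2]) {
    double dx = a[0] - b[0];
    double dy = a[1] - b[1];
    return dx * dx + dy * dy; // no pow(), no sqrt
}
```

pow(x, 2) must handle arbitrary exponents, so unless the compiler recognizes and strength-reduces the call, a plain multiplication is cheaper.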

This leads to the final code:

double calcDistance(double input[][2], int start, int stop){
    double minimum = ::std::numeric_limits<double>::infinity();
    #pragma omp parallel for
    for(int i = start; i < stop; i++) {
        for(int j = i + 1; j < stop; j++) {
            double dx = input[j][0] - input[i][0];
            dx *= dx;
            double dy = input[j][1] - input[i][1];
            dy *= dy;
            double temp = dx + dy;
            if(temp < minimum) {
                #pragma omp critical
                if(temp < minimum)
                    minimum = temp;
            }
        }
    }
    return sqrt(minimum);
}

Leading to the final performance:

Windows x64 with icc 14.0
0.0559233
132.7 ms

or

Windows x64 with msvc VS 2013 (18.00.21005)
0.0559233
120.1 ms

7 Comments

+1 for teaching me about double-checked locking. That's interesting! However, I don't think it's the most efficient way to do this reduction. In the worst case, each thread would still have to wait every iteration (making it effectively serial). A more efficient method is to declare a private version of min for each thread and then merge them in a critical section.
@Zboson Correct, in fact, that is similar to what the reduction (min: minimum) solution does internally. In the average case, this simple variant works almost as well with a lot less effort however (as can be seen from the timings).
I think the double-checked locking only works if writes to minimum are atomic. In that case, you don't need the critical pragma.
@Patrick Yes, atomic writes are guaranteed and necessary for double checked locking. The critical pragma however is needed to guarantee that the second check is valid for the write. One could use omp atomic to ensure the atomicity of the write or a CAS instead of the critical section.
@gha.st Yes, you're right about the critical pragma. Thank you for the explanation.
