I've encountered a bug when using Clang[1] with libomp[2] whereby using omp_priv = omp_orig in the initializer of a custom OpenMP reduction silently gives erroneous output. For example:

    /* file.cpp */
    #include <iostream>
    #include <complex>
    #include <vector>

    // alias for concision; does not affect bug
    typedef std::complex<double> comp;

    // define custom OpenMP reduction for 'comp'
    #pragma omp declare \
        reduction(+ : comp : omp_out += omp_in ) \
        initializer( omp_priv = omp_orig )

    int main() {

        // comp-vector should sum to 1000+1000i
        std::vector<comp> vec(1000, comp(1,1));
        comp total = 0;

        // reduce vec using custom reduction
        #pragma omp parallel for reduction(+:total)
        for (int i=0; i<1000; i++)
            total += vec[i];

        // behold; erroneous (!=1000+1000i) total for #threads>1
        std::cout << "total = " << total << std::endl;
        return 0;
    }

compiled with

clang++ -lstdc++ -Xclang -fopenmp -lomp file.cpp 

will correctly output total = (1000,1000) when run serially, but output seemingly arbitrary, erroneous values like (3725,3725) when run in parallel (e.g. via export OMP_NUM_THREADS=4). The same code runs fine on all other tested compilers (except MSVC, where the custom reduction syntax is unrecognised, grr).

I can work around this by explicitly initialising the custom reduction to zero, i.e. setting omp_priv = 0, so that the reduction reads:

    #pragma omp declare \
        reduction(+ : comp : omp_out += omp_in ) \
        initializer( omp_priv = 0 )

This works in all tested settings. It seems omp_orig is not zero when OpenMP attempts to initialize a thread-private complex<double>. Alas, I am a bit afraid of this solution since I am unsure what omp_orig is, and why it is behaving strangely with clang and libomp.

The OpenMP spec seems a bit terse on the subject:

"The special identifier omp_orig can also appear in the initializer-clause and it will refer to the storage of the original variable to be reduced."

What is the "storage of the original variable"? Certainly it does not seem to be the value of the reduced variable before the parallel region, since the workaround above works fine even when we choose a non-zero starting total, e.g.

    comp total = 1234;

Will this workaround pose any unforeseen issues?

[1] clang v15.0.0 (specifically arm64-apple-darwin23.5.0)

[2] cannot find the version, but it was installed via brew install libomp in July 2024

  • Using 0 is not a workaround. It is the right choice for a sum reduction. omp_priv needs to be initialized with the neutral element of the operation. omp_orig is not the neutral element. You can use omp_orig if, for example, you need to access the size of a container for initializing the private instances. The OpenMP examples document shows an example implementing maxloc. Commented Jan 29 at 3:53
  • The initialization and reduction steps for different threads are not ordered. What you observe is a race condition between one thread already writing out its results and another thread starting the reduction and loading this value as its initial value. Commented Jan 29 at 4:05
  • Oh great! Alas I'm still unclear on omp_orig and was unable to find the maxloc example. Do you have a link handy? Commented Jan 30 at 8:43
  • Page 349 of this document. Commented Jan 30 at 9:33
  • Or page 455 in the latest version of the document: openmp.org/wp-content/uploads/openmp-examples-6.0.pdf Commented Jan 30 at 10:22
