Description
Hi there, GPU APIs for parallel programming often support three nested levels of parallelism.
In OpenACC, this is reflected by the directive clauses gang, worker, and vector.
OpenMP has the three corresponding parallelization levels: teams distribute, parallel for, and simd.
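To make the correspondence concrete, here is a minimal sketch, assuming the usual mapping gang ↔ teams distribute, worker ↔ parallel for, vector ↔ simd (the saxpy-style loop body is only illustrative):

// OpenACC: all three levels requested on one loop
#pragma acc parallel loop gang worker vector
for (int i = 0; i < n; i++) y[i] += a * x[i];

// OpenMP offload equivalent, one construct per level
#pragma omp target teams distribute parallel for simd
for (int i = 0; i < n; i++) y[i] += a * x[i];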
In GCC 15.2, the target simd construct now corresponds to acc vector.
That is, the construct
#pragma acc parallel loop gang vector
is equivalent to
#pragma omp target teams distribute parallel for simd
under GCC 15.2.
The code below (note the element access through a stride array, which is common for BLAS routines),
#include <omp.h>
#include <stdio.h>

int main(int argc, char** argv)
{
    int x[600];
    const int stride[2] = {1, 2};
    #pragma omp target data map(tofrom: x[0:600]) map(to: stride[0:2]) device(omp_get_default_device())
    #pragma omp target simd device(omp_get_default_device())
    for (size_t i = 0; i < 200; i++)
        x[i*stride[0] + i*stride[1]] = -1;
    printf("%d", x[597]);
    return 0;
}

when compiled with gcc-15.2 and the following options
-fopenmp -foffload=nvptx-none -fno-stack-protector -O3 -Wall
compiles without warnings and runs on the device.
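For reference, a full invocation would look like this (the compiler binary name and the source file name stride.c are placeholders):

gcc-15.2 -fopenmp -foffload=nvptx-none -fno-stack-protector -O3 -Wall stride.c -o stride
./stride    # expected to print -1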
If compiled with Clang 21.1.3 and the following options
-fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -O3 -Wall
one gets the following warning from Clang:
warning: loop not vectorized: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering [-Wpass-failed=transform-warning]
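For completeness, the corresponding invocation would be (source file name again a placeholder):

clang -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -O3 -Wall stride.c -o stride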
The CUDA code generated by Clang is very efficient compared to GCC's. However, since you already implement OpenACC with its vector construct, and if OpenACC is implemented on top of the OpenMP runtime, it would make sense to turn omp target simd into an acc vector equivalent, and to support the combined constructs omp target parallel for simd and omp target teams distribute parallel for simd, as well as the collapse and reduction clauses and the tile construct for these.
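A sketch of the kind of loop this request would cover, using the combined construct with the collapse and reduction clauses (the 2D summation is purely illustrative):

#include <stdio.h>

#define N 64

int main(void)
{
    double a[N*N], sum = 0.0;
    for (int k = 0; k < N*N; k++) a[k] = 1.0;
    /* combined target construct with collapse and reduction; the request is
       for Clang to vectorize the simd level on device, like acc vector */
    #pragma omp target teams distribute parallel for simd collapse(2) reduction(+: sum) map(to: a[0:N*N])
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i*N + j];
    printf("%f\n", sum);    /* expect 4096.0 */
    return 0;
}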