@ergawy ergawy commented Oct 30, 2025

Adds initial support for GPU by-ref reductions. The main problem for reduction by reference is that, prior to this PR, we were shuffling (from remote lanes within the same warp or across different warps within the block) pointers/references to the private reduction values rather than the private reduction values themselves.
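To make the distinction concrete, here is a minimal host-side sketch of a shuffle-down reduction in which every lane only holds a reference to its private value. It is a simulation under stated assumptions, not the code this PR adds to `OMPIRBuilder`: `PrivateRef`, `shuffleDownValue`, and `warpReduceByRef` are hypothetical names, a warp is modelled as a plain vector, and the lane count is assumed to be a power of two. The point it illustrates is that the element is loaded through the reference before being moved between lanes, so the reference itself is never shuffled.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a by-ref reduction variable: each lane holds a
// reference (here a raw pointer) to its private value, not the value itself.
struct PrivateRef {
  float *addr;
};

// Simulated shuffle-down. Lane `lane` receives the element owned by lane
// `lane + offset`, or keeps its own element when that lane does not exist.
// The element is loaded through the reference first, so the value (not the
// pointer) is what travels between lanes.
float shuffleDownValue(const std::vector<PrivateRef> &refs, std::size_t lane,
                       std::size_t offset) {
  const std::size_t src = lane + offset;
  return src < refs.size() ? *refs[src].addr : *refs[lane].addr;
}

// Tree reduction (sum) over one simulated warp; lane 0 ends up with the total.
float warpReduceByRef(std::vector<float> privateVals) {
  const std::size_t n = privateVals.size();
  assert(n > 0 && (n & (n - 1)) == 0 && "lane count must be a power of two");

  std::vector<PrivateRef> refs(n);
  for (std::size_t lane = 0; lane < n; ++lane)
    refs[lane].addr = &privateVals[lane];

  for (std::size_t offset = n / 2; offset > 0; offset /= 2) {
    // All lanes read before any lane writes, mimicking a lock-step shuffle.
    std::vector<float> remote(n);
    for (std::size_t lane = 0; lane < n; ++lane)
      remote[lane] = shuffleDownValue(refs, lane, offset);
    // Each lane combines the received element into its own private storage.
    for (std::size_t lane = 0; lane < n; ++lane)
      *refs[lane].addr += remote[lane];
  }
  return *refs[0].addr;
}
```

Calling `warpReduceByRef` with four lanes holding 1, 2, 3, and 4 leaves the sum of all four in lane 0. Only lane 0's result is meaningful after the last step, which matches the usual shuffle-down pattern.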

In particular, this diff adds support for reductions on scalar allocatables where reductions happen on loops nested in target regions. For example:

    integer :: i
    real, allocatable :: scalar_alloc
    allocate(scalar_alloc)
    scalar_alloc = 0
    !$omp target map(tofrom: scalar_alloc)
    !$omp parallel do reduction(+: scalar_alloc)
    do i = 1, 1000000
      scalar_alloc = scalar_alloc + 1
    end do
    !$omp end target

This PR supports by-ref reductions on the intra- and inter-warp levels.
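The two levels can be pictured with the following host-side sketch. It is an illustration under assumptions, not the runtime's actual algorithm: `reduceWarp` and `reduceBlock` are hypothetical names, the staging vector stands in for shared memory, and both the warp size and the number of warps are assumed to be powers of two.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// True when x is a non-zero power of two.
bool isPowerOfTwo(std::size_t x) { return x != 0 && (x & (x - 1)) == 0; }

// Intra-warp level: tree reduction (sum) over the lanes of one simulated warp.
// In each step, lane i (i < offset) combines the element of lane i + offset.
float reduceWarp(std::vector<float> lanes) {
  assert(isPowerOfTwo(lanes.size()));
  for (std::size_t offset = lanes.size() / 2; offset > 0; offset /= 2)
    for (std::size_t lane = 0; lane < offset; ++lane)
      lanes[lane] += lanes[lane + offset];
  return lanes[0];
}

// Inter-warp level: the first lane of every warp stores its partial result in
// a staging buffer, then a single warp reduces those partial results.
float reduceBlock(const std::vector<float> &threadVals, std::size_t warpSize) {
  assert(isPowerOfTwo(warpSize));
  assert(!threadVals.empty() && threadVals.size() % warpSize == 0);
  const std::size_t numWarps = threadVals.size() / warpSize;
  assert(isPowerOfTwo(numWarps) && numWarps <= warpSize);

  std::vector<float> warpPartials(numWarps);
  for (std::size_t w = 0; w < numWarps; ++w) {
    const auto first =
        threadVals.begin() + static_cast<std::ptrdiff_t>(w * warpSize);
    const auto last = first + static_cast<std::ptrdiff_t>(warpSize);
    warpPartials[w] = reduceWarp(std::vector<float>(first, last));
  }
  return reduceWarp(warpPartials);
}
```

For example, `reduceBlock` over 64 threads that each hold 1.0f, with a warp size of 32, first produces two partial sums of 32 and then combines them. Combining results across blocks is a further level that this sketch deliberately leaves out.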

So far, there are still steps to be taken for full support of by-ref reductions, for example:

  • Inter-block value combination is still not supported. Therefore, target teams distribute parallel do is still not supported.
  • Support for dynamically-sized arrays still needs to be added.
  • Support for more than one allocatable/array on the same reduction clause still needs to be added.
@llvmbot added the following labels Oct 30, 2025: clang (Clang issues not falling into any other category), clang:codegen (IR generation bugs: mangling, exceptions, etc.), mlir:llvm, mlir, flang (Flang issues not falling into any other category), mlir:openmp, flang:fir-hlfir, flang:openmp, clang:openmp (OpenMP related changes to Clang)

llvmbot commented Oct 30, 2025

@llvm/pr-subscribers-clang-codegen
@llvm/pr-subscribers-mlir

@llvm/pr-subscribers-flang-fir-hlfir

Author: Kareem Ergawy (ergawy)

Patch is 54.34 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/165714.diff

34 Files Affected:

  • (modified) clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp (+2-2)
  • (modified) flang/include/flang/Optimizer/Dialect/FIROps.td (+2-1)
  • (modified) flang/lib/Lower/Support/ReductionProcessor.cpp (+6-1)
  • (modified) flang/lib/Optimizer/OpenMP/DoConcurrentConversion.cpp (+2-1)
  • (modified) flang/test/Lower/OpenMP/delayed-privatization-reduction-byref.f90 (+1-1)
  • (modified) flang/test/Lower/OpenMP/parallel-reduction-allocatable-array.f90 (+1-1)
  • (modified) flang/test/Lower/OpenMP/parallel-reduction-array-lb.f90 (+1-1)
  • (modified) flang/test/Lower/OpenMP/parallel-reduction-array.f90 (+1-1)
  • (modified) flang/test/Lower/OpenMP/parallel-reduction-array2.f90 (+1-1)
  • (modified) flang/test/Lower/OpenMP/parallel-reduction-pointer-array.f90 (+1-1)
  • (modified) flang/test/Lower/OpenMP/parallel-reduction3.f90 (+1-1)
  • (modified) flang/test/Lower/OpenMP/reduction-array-intrinsic.f90 (+1-1)
  • (modified) flang/test/Lower/OpenMP/sections-array-reduction.f90 (+1-1)
  • (modified) flang/test/Lower/OpenMP/taskgroup-task-array-reduction.f90 (+1-1)
  • (modified) flang/test/Lower/OpenMP/wsloop-reduction-allocatable-array-minmax.f90 (+2-2)
  • (modified) flang/test/Lower/OpenMP/wsloop-reduction-allocatable.f90 (+1-1)
  • (modified) flang/test/Lower/OpenMP/wsloop-reduction-array-assumed-shape.f90 (+1-1)
  • (modified) flang/test/Lower/OpenMP/wsloop-reduction-array-lb.f90 (+1-1)
  • (modified) flang/test/Lower/OpenMP/wsloop-reduction-array-lb2.f90 (+1-1)
  • (modified) flang/test/Lower/OpenMP/wsloop-reduction-array.f90 (+1-1)
  • (modified) flang/test/Lower/OpenMP/wsloop-reduction-array2.f90 (+1-1)
  • (modified) flang/test/Lower/OpenMP/wsloop-reduction-multiple-clauses.f90 (+1-1)
  • (modified) flang/test/Lower/OpenMP/wsloop-reduction-pointer.f90 (+1-1)
  • (modified) flang/test/Lower/do_concurrent_reduce_allocatable.f90 (+1-1)
  • (modified) llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h (+17-7)
  • (modified) llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp (+119-35)
  • (modified) mlir/include/mlir/Dialect/OpenMP/OpenMPOps.td (+3-1)
  • (modified) mlir/lib/Conversion/SCFToOpenMP/SCFToOpenMP.cpp (+2-1)
  • (modified) mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp (+20-4)
  • (added) mlir/test/Target/LLVMIR/allocatable_gpu_reduction.mlir (+92)
  • (modified) mlir/test/Target/LLVMIR/omptarget-multi-block-reduction.mlir (+2-4)
  • (modified) mlir/test/Target/LLVMIR/omptarget-multi-reduction.mlir (+4-4)
  • (modified) mlir/test/Target/LLVMIR/omptarget-teams-distribute-reduction.mlir (+1-1)
  • (modified) mlir/test/Target/LLVMIR/omptarget-teams-reduction.mlir (+1-1)
diff --git a/clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp b/clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp
index fddeba98adccc..9a8c75073aa4c 100644
--- a/clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp
+++ b/clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp
@@ -1784,8 +1784,8 @@ void CGOpenMPRuntimeGPU::emitReduction(
   llvm::OpenMPIRBuilder::InsertPointTy AfterIP =
       cantFail(OMPBuilder.createReductionsGPU(
-          OmpLoc, AllocaIP, CodeGenIP, ReductionInfos, false, TeamsReduction,
-          llvm::OpenMPIRBuilder::ReductionGenCBKind::Clang,
+          OmpLoc, AllocaIP, CodeGenIP, ReductionInfos, {}, false,
+          TeamsReduction, llvm::OpenMPIRBuilder::ReductionGenCBKind::Clang,
           CGF.getTarget().getGridValue(),
           C.getLangOpts().OpenMPCUDAReductionBufNum, RTLoc));
   CGF.Builder.restoreIP(AfterIP);
diff --git a/flang/include/flang/Optimizer/Dialect/FIROps.td b/flang/include/flang/Optimizer/Dialect/FIROps.td
index 58a317cf5d691..ff4dab1136ee9 100644
--- a/flang/include/flang/Optimizer/Dialect/FIROps.td
+++ b/flang/include/flang/Optimizer/Dialect/FIROps.td
@@ -3743,7 +3743,8 @@ def fir_DeclareReductionOp : fir_Op<"declare_reduction", [IsolatedFromAbove,
   }];
   let arguments = (ins SymbolNameAttr:$sym_name,
-                       TypeAttr:$type);
+                       TypeAttr:$type,
+                       OptionalAttr<TypeAttr>:$byref_element_type);
   let regions = (region MaxSizedRegion<1>:$allocRegion,
                         AnyRegion:$initializerRegion,
diff --git a/flang/lib/Lower/Support/ReductionProcessor.cpp b/flang/lib/Lower/Support/ReductionProcessor.cpp
index 605a5b6b20b94..e02cd8fac823b 100644
--- a/flang/lib/Lower/Support/ReductionProcessor.cpp
+++ b/flang/lib/Lower/Support/ReductionProcessor.cpp
@@ -573,10 +573,15 @@ OpType ReductionProcessor::createDeclareReduction(
   mlir::OpBuilder modBuilder(module.getBodyRegion());
   mlir::Type valTy = fir::unwrapRefType(type);
+  mlir::TypeAttr boxedTy{};
+
   if (!isByRef)
     type = valTy;
-  decl = OpType::create(modBuilder, loc, reductionOpName, type);
+  if (isByRef)
+    boxedTy = mlir::TypeAttr::get(fir::unwrapPassByRefType(valTy));
+
+  decl = OpType::create(modBuilder, loc, reductionOpName, type, boxedTy);
   createReductionAllocAndInitRegions(converter, loc, decl, redId, type,
                                      isByRef);
diff --git a/flang/lib/Optimizer/OpenMP/DoConcurrentConversion.cpp b/flang/lib/Optimizer/OpenMP/DoConcurrentConversion.cpp
index 1229018bd9b3e..11609ea7b6040 100644
--- a/flang/lib/Optimizer/OpenMP/DoConcurrentConversion.cpp
+++ b/flang/lib/Optimizer/OpenMP/DoConcurrentConversion.cpp
@@ -851,7 +851,8 @@ class DoConcurrentConversion
       if (!ompReducer) {
         ompReducer = mlir::omp::DeclareReductionOp::create(
             rewriter, firReducer.getLoc(), ompReducerName,
-            firReducer.getTypeAttr().getValue());
+            firReducer.getTypeAttr().getValue(),
+            firReducer.getByrefElementTypeAttr());
         cloneFIRRegionToOMP(rewriter, firReducer.getAllocRegion(),
                             ompReducer.getAllocRegion());
diff --git a/flang/test/Lower/OpenMP/delayed-privatization-reduction-byref.f90 b/flang/test/Lower/OpenMP/delayed-privatization-reduction-byref.f90
index 4b6a643f94059..4c7b6ac5f5f9b 100644
--- a/flang/test/Lower/OpenMP/delayed-privatization-reduction-byref.f90
+++ b/flang/test/Lower/OpenMP/delayed-privatization-reduction-byref.f90
@@ -22,7 +22,7 @@ subroutine red_and_delayed_private
 ! CHECK-SAME: @[[PRIVATIZER_SYM:.*]] : i32
 ! CHECK-LABEL: omp.declare_reduction
-! CHECK-SAME: @[[REDUCTION_SYM:.*]] : !fir.ref<i32> alloc
+! CHECK-SAME: @[[REDUCTION_SYM:.*]] : !fir.ref<i32> attributes {byref_element_type = i32} alloc
 ! CHECK-LABEL: _QPred_and_delayed_private
 ! CHECK: omp.parallel
diff --git a/flang/test/Lower/OpenMP/parallel-reduction-allocatable-array.f90 b/flang/test/Lower/OpenMP/parallel-reduction-allocatable-array.f90
index 41c7d69ebb3ba..f56875dcb518b 100644
--- a/flang/test/Lower/OpenMP/parallel-reduction-allocatable-array.f90
+++ b/flang/test/Lower/OpenMP/parallel-reduction-allocatable-array.f90
@@ -18,7 +18,7 @@ program reduce
 end program
-! CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_heap_Uxi32 : !fir.ref<!fir.box<!fir.heap<!fir.array<?xi32>>>> alloc {
+! CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_heap_Uxi32 : !fir.ref<!fir.box<!fir.heap<!fir.array<?xi32>>>> attributes {byref_element_type = !fir.array<?xi32>} alloc {
 ! CHECK: %[[VAL_10:.*]] = fir.alloca !fir.box<!fir.heap<!fir.array<?xi32>>>
 ! CHECK: omp.yield(%[[VAL_10]] : !fir.ref<!fir.box<!fir.heap<!fir.array<?xi32>>>>)
 ! CHECK-LABEL: } init {
diff --git a/flang/test/Lower/OpenMP/parallel-reduction-array-lb.f90 b/flang/test/Lower/OpenMP/parallel-reduction-array-lb.f90
index aa91e1e0e8b15..d9ba3bed464f8 100644
--- a/flang/test/Lower/OpenMP/parallel-reduction-array-lb.f90
+++ b/flang/test/Lower/OpenMP/parallel-reduction-array-lb.f90
@@ -12,7 +12,7 @@ program reduce
 end program
-! CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_3x2xi32 : !fir.ref<!fir.box<!fir.array<3x2xi32>>> alloc {
+! CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_3x2xi32 : !fir.ref<!fir.box<!fir.array<3x2xi32>>> {{.*}} alloc {
 ! CHECK: %[[VAL_15:.*]] = fir.alloca !fir.box<!fir.array<3x2xi32>>
 ! CHECK: omp.yield(%[[VAL_15]] : !fir.ref<!fir.box<!fir.array<3x2xi32>>>)
 ! CHECK-LABEL: } init {
diff --git a/flang/test/Lower/OpenMP/parallel-reduction-array.f90 b/flang/test/Lower/OpenMP/parallel-reduction-array.f90
index 59595de338d50..636660f279e85 100644
--- a/flang/test/Lower/OpenMP/parallel-reduction-array.f90
+++ b/flang/test/Lower/OpenMP/parallel-reduction-array.f90
@@ -17,7 +17,7 @@ program reduce
 print *,i
 end program
-! CPU-LABEL: omp.declare_reduction @add_reduction_byref_box_3xi32 : !fir.ref<!fir.box<!fir.array<3xi32>>> alloc {
+! CPU-LABEL: omp.declare_reduction @add_reduction_byref_box_3xi32 : !fir.ref<!fir.box<!fir.array<3xi32>>> attributes {byref_element_type = !fir.array<3xi32>} alloc {
 ! CPU: %[[VAL_8:.*]] = fir.alloca !fir.box<!fir.array<3xi32>>
 ! CPU: omp.yield(%[[VAL_8]] : !fir.ref<!fir.box<!fir.array<3xi32>>>)
 ! CPU-LABEL: } init {
diff --git a/flang/test/Lower/OpenMP/parallel-reduction-array2.f90 b/flang/test/Lower/OpenMP/parallel-reduction-array2.f90
index 14338c6f50817..9cf8a63427ed1 100644
--- a/flang/test/Lower/OpenMP/parallel-reduction-array2.f90
+++ b/flang/test/Lower/OpenMP/parallel-reduction-array2.f90
@@ -13,7 +13,7 @@ program reduce
 print *,i
 end program
-! CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_3xi32 : !fir.ref<!fir.box<!fir.array<3xi32>>> alloc {
+! CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_3xi32 : !fir.ref<!fir.box<!fir.array<3xi32>>> {{.*}} alloc {
 ! CHECK: %[[VAL_8:.*]] = fir.alloca !fir.box<!fir.array<3xi32>>
 ! CHECK: omp.yield(%[[VAL_8]] : !fir.ref<!fir.box<!fir.array<3xi32>>>)
 ! CHECK-LABEL: } init {
diff --git a/flang/test/Lower/OpenMP/parallel-reduction-pointer-array.f90 b/flang/test/Lower/OpenMP/parallel-reduction-pointer-array.f90
index 36344458d1cae..3de2ba8f61f8e 100644
--- a/flang/test/Lower/OpenMP/parallel-reduction-pointer-array.f90
+++ b/flang/test/Lower/OpenMP/parallel-reduction-pointer-array.f90
@@ -19,7 +19,7 @@ program reduce
 end program
-! CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_ptr_Uxi32 : !fir.ref<!fir.box<!fir.ptr<!fir.array<?xi32>>>> alloc {
+! CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_ptr_Uxi32 : !fir.ref<!fir.box<!fir.ptr<!fir.array<?xi32>>>> attributes {byref_element_type = !fir.array<?xi32>} alloc {
 ! CHECK: %[[VAL_3:.*]] = fir.alloca !fir.box<!fir.ptr<!fir.array<?xi32>>>
 ! CHECK: omp.yield(%[[VAL_3]] : !fir.ref<!fir.box<!fir.ptr<!fir.array<?xi32>>>>)
 ! CHECK-LABEL: } init {
diff --git a/flang/test/Lower/OpenMP/parallel-reduction3.f90 b/flang/test/Lower/OpenMP/parallel-reduction3.f90
index 9af18378f0ae0..da337378862be 100644
--- a/flang/test/Lower/OpenMP/parallel-reduction3.f90
+++ b/flang/test/Lower/OpenMP/parallel-reduction3.f90
@@ -1,7 +1,7 @@
 ! RUN: bbc -emit-hlfir -fopenmp -o - %s 2>&1 | FileCheck %s
 ! RUN: %flang_fc1 -emit-hlfir -fopenmp -o - %s 2>&1 | FileCheck %s
-! CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_Uxi32 : !fir.ref<!fir.box<!fir.array<?xi32>>> alloc {
+! CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_Uxi32 : !fir.ref<!fir.box<!fir.array<?xi32>>> {{.*}} alloc {
 ! CHECK: %[[VAL_8:.*]] = fir.alloca !fir.box<!fir.array<?xi32>>
 ! CHECK: omp.yield(%[[VAL_8]] : !fir.ref<!fir.box<!fir.array<?xi32>>>)
 ! CHECK-LABEL: } init {
diff --git a/flang/test/Lower/OpenMP/reduction-array-intrinsic.f90 b/flang/test/Lower/OpenMP/reduction-array-intrinsic.f90
index 8b94d51f986f5..4a0593ff9eca4 100644
--- a/flang/test/Lower/OpenMP/reduction-array-intrinsic.f90
+++ b/flang/test/Lower/OpenMP/reduction-array-intrinsic.f90
@@ -9,7 +9,7 @@ subroutine max_array_reduction(l, r)
 !$omp end parallel
 end subroutine
-! CHECK-LABEL: omp.declare_reduction @max_byref_box_Uxi32 : !fir.ref<!fir.box<!fir.array<?xi32>>> alloc {
+! CHECK-LABEL: omp.declare_reduction @max_byref_box_Uxi32 : !fir.ref<!fir.box<!fir.array<?xi32>>> {{.*}} alloc {
 ! CHECK: %[[VAL_3:.*]] = fir.alloca !fir.box<!fir.array<?xi32>>
 ! CHECK: omp.yield(%[[VAL_3]] : !fir.ref<!fir.box<!fir.array<?xi32>>>)
 ! CHECK-LABEL: } init {
diff --git a/flang/test/Lower/OpenMP/sections-array-reduction.f90 b/flang/test/Lower/OpenMP/sections-array-reduction.f90
index 2f2808cebfc0c..0dbe9e3673395 100644
--- a/flang/test/Lower/OpenMP/sections-array-reduction.f90
+++ b/flang/test/Lower/OpenMP/sections-array-reduction.f90
@@ -14,7 +14,7 @@ subroutine sectionsReduction(x)
 end subroutine
-! CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_Uxf32 : !fir.ref<!fir.box<!fir.array<?xf32>>> alloc {
+! CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_Uxf32 : !fir.ref<!fir.box<!fir.array<?xf32>>> {{.*}} alloc {
 ! [...]
 ! CHECK: omp.yield
 ! CHECK-LABEL: } init {
diff --git a/flang/test/Lower/OpenMP/taskgroup-task-array-reduction.f90 b/flang/test/Lower/OpenMP/taskgroup-task-array-reduction.f90
index 18a4f75b86309..3a63bb09c59de 100644
--- a/flang/test/Lower/OpenMP/taskgroup-task-array-reduction.f90
+++ b/flang/test/Lower/OpenMP/taskgroup-task-array-reduction.f90
@@ -1,7 +1,7 @@
 ! RUN: bbc -emit-hlfir -fopenmp -fopenmp-version=50 -o - %s 2>&1 | FileCheck %s
 ! RUN: %flang_fc1 -emit-hlfir -fopenmp -fopenmp-version=50 -o - %s 2>&1 | FileCheck %s
-! CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_Uxf32 : !fir.ref<!fir.box<!fir.array<?xf32>>> alloc {
+! CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_Uxf32 : !fir.ref<!fir.box<!fir.array<?xf32>>> {{.*}} alloc {
 ! [...]
 ! CHECK: omp.yield
 ! CHECK-LABEL: } init {
diff --git a/flang/test/Lower/OpenMP/wsloop-reduction-allocatable-array-minmax.f90 b/flang/test/Lower/OpenMP/wsloop-reduction-allocatable-array-minmax.f90
index 2cd953de0dffa..ed81577ecce16 100644
--- a/flang/test/Lower/OpenMP/wsloop-reduction-allocatable-array-minmax.f90
+++ b/flang/test/Lower/OpenMP/wsloop-reduction-allocatable-array-minmax.f90
@@ -32,7 +32,7 @@ program reduce15
 print *,"min: ", mins
 end program
-! CHECK-LABEL: omp.declare_reduction @min_byref_box_heap_Uxi32 : !fir.ref<!fir.box<!fir.heap<!fir.array<?xi32>>>> alloc {
+! CHECK-LABEL: omp.declare_reduction @min_byref_box_heap_Uxi32 : !fir.ref<!fir.box<!fir.heap<!fir.array<?xi32>>>> {{.*}} alloc {
 ! CHECK: %[[VAL_3:.*]] = fir.alloca !fir.box<!fir.heap<!fir.array<?xi32>>>
 ! CHECK: omp.yield(%[[VAL_3]] : !fir.ref<!fir.box<!fir.heap<!fir.array<?xi32>>>>)
 ! CHECK-LABEL: } init {
@@ -93,7 +93,7 @@ program reduce15
 ! CHECK: omp.yield
 ! CHECK: }
-! CHECK-LABEL: omp.declare_reduction @max_byref_box_heap_Uxi32 : !fir.ref<!fir.box<!fir.heap<!fir.array<?xi32>>>> alloc {
+! CHECK-LABEL: omp.declare_reduction @max_byref_box_heap_Uxi32 : !fir.ref<!fir.box<!fir.heap<!fir.array<?xi32>>>> {{.*}} alloc {
 ! CHECK: %[[VAL_3:.*]] = fir.alloca !fir.box<!fir.heap<!fir.array<?xi32>>>
 ! CHECK: omp.yield(%[[VAL_3]] : !fir.ref<!fir.box<!fir.heap<!fir.array<?xi32>>>>)
 ! CHECK-LABEL: } init {
diff --git a/flang/test/Lower/OpenMP/wsloop-reduction-allocatable.f90 b/flang/test/Lower/OpenMP/wsloop-reduction-allocatable.f90
index 663851cba46c6..d8c0a36db126e 100644
--- a/flang/test/Lower/OpenMP/wsloop-reduction-allocatable.f90
+++ b/flang/test/Lower/OpenMP/wsloop-reduction-allocatable.f90
@@ -18,7 +18,7 @@ program reduce
 end program
-! CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_heap_i32 : !fir.ref<!fir.box<!fir.heap<i32>>> alloc {
+! CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_heap_i32 : !fir.ref<!fir.box<!fir.heap<i32>>> attributes {byref_element_type = i32} alloc {
 ! CHECK: %[[VAL_2:.*]] = fir.alloca !fir.box<!fir.heap<i32>>
 ! CHECK: omp.yield(%[[VAL_2]] : !fir.ref<!fir.box<!fir.heap<i32>>>)
 ! CHECK-LABEL: } init {
diff --git a/flang/test/Lower/OpenMP/wsloop-reduction-array-assumed-shape.f90 b/flang/test/Lower/OpenMP/wsloop-reduction-array-assumed-shape.f90
index 209ee9a4e0cef..28acb8f19531f 100644
--- a/flang/test/Lower/OpenMP/wsloop-reduction-array-assumed-shape.f90
+++ b/flang/test/Lower/OpenMP/wsloop-reduction-array-assumed-shape.f90
@@ -22,7 +22,7 @@ subroutine reduce(r)
 end subroutine
 end program
-! CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_Uxf64 : !fir.ref<!fir.box<!fir.array<?xf64>>> alloc {
+! CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_Uxf64 : !fir.ref<!fir.box<!fir.array<?xf64>>> {{.*}} alloc {
 ! CHECK: %[[VAL_8:.*]] = fir.alloca !fir.box<!fir.array<?xf64>>
 ! CHECK: omp.yield(%[[VAL_8]] : !fir.ref<!fir.box<!fir.array<?xf64>>>)
 ! CHECK-LABEL: } init {
diff --git a/flang/test/Lower/OpenMP/wsloop-reduction-array-lb.f90 b/flang/test/Lower/OpenMP/wsloop-reduction-array-lb.f90
index 2233a74600948..ec448cf20f111 100644
--- a/flang/test/Lower/OpenMP/wsloop-reduction-array-lb.f90
+++ b/flang/test/Lower/OpenMP/wsloop-reduction-array-lb.f90
@@ -11,7 +11,7 @@ program reduce
 !$omp end parallel do
 end program
-! CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_2xi32 : !fir.ref<!fir.box<!fir.array<2xi32>>> alloc {
+! CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_2xi32 : !fir.ref<!fir.box<!fir.array<2xi32>>> {{.*}} alloc {
 ! CHECK: } combiner {
 ! CHECK: ^bb0(%[[ARG0:.*]]: !fir.ref<!fir.box<!fir.array<2xi32>>>, %[[ARG1:.*]]: !fir.ref<!fir.box<!fir.array<2xi32>>>):
 ! CHECK: %[[ARR0:.*]] = fir.load %[[ARG0]] : !fir.ref<!fir.box<!fir.array<2xi32>>>
diff --git a/flang/test/Lower/OpenMP/wsloop-reduction-array-lb2.f90 b/flang/test/Lower/OpenMP/wsloop-reduction-array-lb2.f90
index 211bde19da8db..9da05a290ec21 100644
--- a/flang/test/Lower/OpenMP/wsloop-reduction-array-lb2.f90
+++ b/flang/test/Lower/OpenMP/wsloop-reduction-array-lb2.f90
@@ -19,7 +19,7 @@ subroutine sub(a, lb, ub)
 end program
-! CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_Uxi32 : !fir.ref<!fir.box<!fir.array<?xi32>>> alloc {
+! CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_Uxi32 : !fir.ref<!fir.box<!fir.array<?xi32>>> {{.*}} alloc {
 ! CHECK: } combiner {
 ! CHECK: ^bb0(%[[ARG0:.*]]: !fir.ref<!fir.box<!fir.array<?xi32>>>, %[[ARG1:.*]]: !fir.ref<!fir.box<!fir.array<?xi32>>>):
 ! CHECK: %[[ARR0:.*]] = fir.load %[[ARG0]] : !fir.ref<!fir.box<!fir.array<?xi32>>>
diff --git a/flang/test/Lower/OpenMP/wsloop-reduction-array.f90 b/flang/test/Lower/OpenMP/wsloop-reduction-array.f90
index afaeba27c5eae..14b657c8e180d 100644
--- a/flang/test/Lower/OpenMP/wsloop-reduction-array.f90
+++ b/flang/test/Lower/OpenMP/wsloop-reduction-array.f90
@@ -14,7 +14,7 @@ program reduce
 print *,r
 end program
-! CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_2xi32 : !fir.ref<!fir.box<!fir.array<2xi32>>> alloc {
+! CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_2xi32 : !fir.ref<!fir.box<!fir.array<2xi32>>> attributes {byref_element_type = !fir.array<2xi32>} alloc {
 ! CHECK: %[[VAL_8:.*]] = fir.alloca !fir.box<!fir.array<2xi32>>
 ! CHECK: omp.yield(%[[VAL_8]] : !fir.ref<!fir.box<!fir.array<2xi32>>>)
 ! CHECK-LABEL: } init {
diff --git a/flang/test/Lower/OpenMP/wsloop-reduction-array2.f90 b/flang/test/Lower/OpenMP/wsloop-reduction-array2.f90
index 25b2e97a1b7f7..d0a0c38e4ccb1 100644
--- a/flang/test/Lower/OpenMP/wsloop-reduction-array2.f90
+++ b/flang/test/Lower/OpenMP/wsloop-reduction-array2.f90
@@ -14,7 +14,7 @@ program reduce
 print *,r
 end program
-! CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_2xi32 : !fir.ref<!fir.box<!fir.array<2xi32>>> alloc {
+! CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_2xi32 : !fir.ref<!fir.box<!fir.array<2xi32>>> {{.*}} alloc {
 ! CHECK: %[[VAL_8:.*]] = fir.alloca !fir.box<!fir.array<2xi32>>
 ! CHECK: omp.yield(%[[VAL_8]] : !fir.ref<!fir.box<!fir.array<2xi32>>>)
 ! CHECK-LABEL: } init {
diff --git a/flang/test/Lower/OpenMP/wsloop-reduction-multiple-clauses.f90 b/flang/test/Lower/OpenMP/wsloop-reduction-multiple-clauses.f90
index edd2bcb1d6be8..60a162d8f8002 100644
--- a/flang/test/Lower/OpenMP/wsloop-reduction-multiple-clauses.f90
+++ b/flang/test/Lower/OpenMP/wsloop-reduction-multiple-clauses.f90
@@ -24,7 +24,7 @@ program main
 endprogram
-! CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_3x3xf64 : !fir.ref<!fir.box<!fir.array<3x3xf64>>> alloc {
+! CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_3x3xf64 : !fir.ref<!fir.box<!fir.array<3x3xf64>>> {{.*}} alloc {
 ! CHECK: %[[VAL_3:.*]] = fir.alloca !fir.box<!fir.array<3x3xf64>>
 ! CHECK: omp.yield(%[[VAL_3]] : !fir.ref<!fir.box<!fir.array<3x3xf64>>>)
 ! CHECK-LABEL: } init {
diff --git a/flang/test/Lower/OpenMP/wsloop-reduction-pointer.f90 b/flang/test/Lower/OpenMP/wsloop-reduction-pointer.f90
index 27b726376fbeb..f640f5caddf76 100644
--- a/flang/test/Lower/OpenMP/wsloop-reduction-pointer.f90
+++ b/flang/test/Lower/OpenMP/wsloop-reduction-pointer.f90
@@ -18,7 +18,7 @@ program reduce_pointer
 deallocate(v)
 end program
-! CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_ptr_i32 : !fir.ref<!fir.box<!fir.ptr<i32>>> alloc {
+! CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_ptr_i32 : !fir.ref<!fir.box<!fir.ptr<i32>>> {{.*}} alloc {
 ! CHECK: %[[VAL_3:.*]] = fir.alloca !fir.box<!fir.ptr<i32>>
 ! CHECK: omp.yield(%[[VAL_3]] : !fir.ref<!fir.box<!fir.ptr<i32>>>)
 ! CHECK-LABEL: } init {
diff --git a/flang/test/Lower/do_concurrent_reduce_allocatable.f90 b/flang/test/Lower/do_concurrent_reduce_allocatable.f90
index 873fd10dd1b97..4fb67c094b594 100644
--- a/flang/test/Lower/do_concurrent_reduce_allocatable.f90
+++ b/flang/test/Lower/do_concurrent_reduce_allocatable.f90
@@ -8,7 +8,7 @@ subroutine do_concurrent_allocatable
 end do
 end subroutine
-! CHECK: fir.declare_reduction @[[RED_OP:.*]] : ![[RED_TYPE:.*]] alloc {
+! CHECK: fir.declare_reduction @[[RED_OP:.*]] : ![[RED_TYPE:.*]] attributes {byref_element_type = !fir.array<?x?xf32>} alloc {
 ! CHECK: %[[ALLOC:.*]] = fir.alloca
 ! CHECK: fir.yield(%[[ALLOC]] : ![[RED_TYPE]])
 ! CHECK: } init {
diff --git a/llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h b/llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h
index 5331cb5abdc6f..f4192f9b49fd9 100644
--- a/llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h
+++ b/llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h
@@ -1448,11 +1448,15 @@ class OpenMPIRBuilder {
     ReductionInfo(Type *ElementType, Value *Variable, Value *PrivateVariable,
                   EvalKind EvaluationKind, ReductionGenCBTy ReductionGen,
                   ReductionGenClangCBTy ReductionGenClang,
-                  ReductionGenAtomicCBTy AtomicReductionGen)
+                  ReductionGenAtomicCBTy AtomicReductionGen,
+                  Type *ByRefAllocatedType = nullptr,
+                  Type *ByRefElementType = nullptr)
         : ElementType(ElementType), Vari... [truncated]
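The updated test expectations above suggest how `byref_element_type` relates to the reduction type: the `!fir.ref`, `!fir.box`, `!fir.heap`, and `!fir.ptr` wrappers are peeled away until the underlying element or array type remains. The following toy sketch models that relationship on type strings only. `byrefElementType` is a hypothetical helper written for illustration; it is not `fir::unwrapPassByRefType`, which operates on `mlir::Type` values, and it assumes the input is a well-formed type string.

```cpp
#include <cassert>
#include <string>

// Toy, string-based model of the mapping seen in the CHECK lines: peel the
// ref/box/heap/ptr wrappers and return whatever type is left inside.
// Hypothetical helper for illustration only, not a Flang or MLIR API.
std::string byrefElementType(std::string type) {
  static const char *const wrappers[] = {"!fir.ref<", "!fir.box<",
                                         "!fir.heap<", "!fir.ptr<"};
  bool peeled = true;
  while (peeled) {
    peeled = false;
    for (const char *w : wrappers) {
      const std::string prefix(w);
      // A wrapper matches when the string starts with it and ends with '>'.
      if (type.size() > prefix.size() &&
          type.compare(0, prefix.size(), prefix) == 0 && type.back() == '>') {
        // Drop the wrapper prefix and its closing '>'.
        type = type.substr(prefix.size(), type.size() - prefix.size() - 1);
        peeled = true;
        break;
      }
    }
  }
  return type;
}
```

Under this model, `!fir.ref<!fir.box<!fir.heap<i32>>>` maps to `i32` and `!fir.ref<!fir.box<!fir.ptr<!fir.array<?xi32>>>>` maps to `!fir.array<?xi32>`, in line with the attributes shown in the tests.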
llvmbot commented Oct 30, 2025

@llvm/pr-subscribers-clang

Author: Kareem Ergawy (ergawy)

Patch is 54.34 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/165714.diff

diff --git a/clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp b/clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp
index fddeba98adccc..9a8c75073aa4c 100644
--- a/clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp
+++ b/clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp
@@ -1784,8 +1784,8 @@ void CGOpenMPRuntimeGPU::emitReduction(
   llvm::OpenMPIRBuilder::InsertPointTy AfterIP =
       cantFail(OMPBuilder.createReductionsGPU(
-          OmpLoc, AllocaIP, CodeGenIP, ReductionInfos, false, TeamsReduction,
-          llvm::OpenMPIRBuilder::ReductionGenCBKind::Clang,
+          OmpLoc, AllocaIP, CodeGenIP, ReductionInfos, {}, false,
+          TeamsReduction, llvm::OpenMPIRBuilder::ReductionGenCBKind::Clang,
           CGF.getTarget().getGridValue(),
           C.getLangOpts().OpenMPCUDAReductionBufNum, RTLoc));
   CGF.Builder.restoreIP(AfterIP);
diff --git a/flang/include/flang/Optimizer/Dialect/FIROps.td b/flang/include/flang/Optimizer/Dialect/FIROps.td
index 58a317cf5d691..ff4dab1136ee9 100644
--- a/flang/include/flang/Optimizer/Dialect/FIROps.td
+++ b/flang/include/flang/Optimizer/Dialect/FIROps.td
@@ -3743,7 +3743,8 @@ def fir_DeclareReductionOp : fir_Op<"declare_reduction", [IsolatedFromAbove,
   }];
   let arguments = (ins SymbolNameAttr:$sym_name,
-                       TypeAttr:$type);
+                       TypeAttr:$type,
+                       OptionalAttr<TypeAttr>:$byref_element_type);
   let regions = (region MaxSizedRegion<1>:$allocRegion,
                         AnyRegion:$initializerRegion,
diff --git a/flang/lib/Lower/Support/ReductionProcessor.cpp b/flang/lib/Lower/Support/ReductionProcessor.cpp
index 605a5b6b20b94..e02cd8fac823b 100644
--- a/flang/lib/Lower/Support/ReductionProcessor.cpp
+++ b/flang/lib/Lower/Support/ReductionProcessor.cpp
@@ -573,10 +573,15 @@ OpType ReductionProcessor::createDeclareReduction(
   mlir::OpBuilder modBuilder(module.getBodyRegion());
   mlir::Type valTy = fir::unwrapRefType(type);
+  mlir::TypeAttr boxedTy{};
+
   if (!isByRef)
     type = valTy;
-  decl = OpType::create(modBuilder, loc, reductionOpName, type);
+  if (isByRef)
+    boxedTy = mlir::TypeAttr::get(fir::unwrapPassByRefType(valTy));
+
+  decl = OpType::create(modBuilder, loc, reductionOpName, type, boxedTy);
   createReductionAllocAndInitRegions(converter, loc, decl, redId, type,
                                      isByRef);
diff --git a/flang/lib/Optimizer/OpenMP/DoConcurrentConversion.cpp b/flang/lib/Optimizer/OpenMP/DoConcurrentConversion.cpp
index 1229018bd9b3e..11609ea7b6040 100644
--- a/flang/lib/Optimizer/OpenMP/DoConcurrentConversion.cpp
+++ b/flang/lib/Optimizer/OpenMP/DoConcurrentConversion.cpp
@@ -851,7 +851,8 @@ class DoConcurrentConversion
       if (!ompReducer) {
         ompReducer = mlir::omp::DeclareReductionOp::create(
             rewriter, firReducer.getLoc(), ompReducerName,
-            firReducer.getTypeAttr().getValue());
+            firReducer.getTypeAttr().getValue(),
+            firReducer.getByrefElementTypeAttr());
         cloneFIRRegionToOMP(rewriter, firReducer.getAllocRegion(),
                             ompReducer.getAllocRegion());
diff --git a/flang/test/Lower/OpenMP/delayed-privatization-reduction-byref.f90 b/flang/test/Lower/OpenMP/delayed-privatization-reduction-byref.f90
index 4b6a643f94059..4c7b6ac5f5f9b 100644
--- a/flang/test/Lower/OpenMP/delayed-privatization-reduction-byref.f90
+++ b/flang/test/Lower/OpenMP/delayed-privatization-reduction-byref.f90
@@ -22,7 +22,7 @@ subroutine red_and_delayed_private
 ! CHECK-SAME: @[[PRIVATIZER_SYM:.*]] : i32
 ! CHECK-LABEL: omp.declare_reduction
-! CHECK-SAME: @[[REDUCTION_SYM:.*]] : !fir.ref<i32> alloc
+! CHECK-SAME: @[[REDUCTION_SYM:.*]] : !fir.ref<i32> attributes {byref_element_type = i32} alloc
 ! CHECK-LABEL: _QPred_and_delayed_private
 ! CHECK: omp.parallel
diff --git a/flang/test/Lower/OpenMP/parallel-reduction-allocatable-array.f90 b/flang/test/Lower/OpenMP/parallel-reduction-allocatable-array.f90
index 41c7d69ebb3ba..f56875dcb518b 100644
--- a/flang/test/Lower/OpenMP/parallel-reduction-allocatable-array.f90
+++ b/flang/test/Lower/OpenMP/parallel-reduction-allocatable-array.f90
@@ -18,7 +18,7 @@ program reduce
 end program
-! CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_heap_Uxi32 : !fir.ref<!fir.box<!fir.heap<!fir.array<?xi32>>>> alloc {
+! CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_heap_Uxi32 : !fir.ref<!fir.box<!fir.heap<!fir.array<?xi32>>>> attributes {byref_element_type = !fir.array<?xi32>} alloc {
 ! CHECK: %[[VAL_10:.*]] = fir.alloca !fir.box<!fir.heap<!fir.array<?xi32>>>
 ! CHECK: omp.yield(%[[VAL_10]] : !fir.ref<!fir.box<!fir.heap<!fir.array<?xi32>>>>)
 ! CHECK-LABEL: } init {
diff --git a/flang/test/Lower/OpenMP/parallel-reduction-array-lb.f90 b/flang/test/Lower/OpenMP/parallel-reduction-array-lb.f90
index aa91e1e0e8b15..d9ba3bed464f8 100644
--- a/flang/test/Lower/OpenMP/parallel-reduction-array-lb.f90
+++ b/flang/test/Lower/OpenMP/parallel-reduction-array-lb.f90
@@ -12,7 +12,7 @@ program reduce
 end program
-! CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_3x2xi32 : !fir.ref<!fir.box<!fir.array<3x2xi32>>> alloc {
+! CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_3x2xi32 : !fir.ref<!fir.box<!fir.array<3x2xi32>>> {{.*}} alloc {
 ! CHECK: %[[VAL_15:.*]] = fir.alloca !fir.box<!fir.array<3x2xi32>>
 ! CHECK: omp.yield(%[[VAL_15]] : !fir.ref<!fir.box<!fir.array<3x2xi32>>>)
 ! CHECK-LABEL: } init {
diff --git a/flang/test/Lower/OpenMP/parallel-reduction-array.f90 b/flang/test/Lower/OpenMP/parallel-reduction-array.f90
index 59595de338d50..636660f279e85 100644
--- a/flang/test/Lower/OpenMP/parallel-reduction-array.f90
+++ b/flang/test/Lower/OpenMP/parallel-reduction-array.f90
@@ -17,7 +17,7 @@ program reduce
 print *,i
 end program
-! CPU-LABEL: omp.declare_reduction @add_reduction_byref_box_3xi32 : !fir.ref<!fir.box<!fir.array<3xi32>>> alloc {
+! CPU-LABEL: omp.declare_reduction @add_reduction_byref_box_3xi32 : !fir.ref<!fir.box<!fir.array<3xi32>>> attributes {byref_element_type = !fir.array<3xi32>} alloc {
 ! CPU: %[[VAL_8:.*]] = fir.alloca !fir.box<!fir.array<3xi32>>
 ! CPU: omp.yield(%[[VAL_8]] : !fir.ref<!fir.box<!fir.array<3xi32>>>)
 ! CPU-LABEL: } init {
diff --git a/flang/test/Lower/OpenMP/parallel-reduction-array2.f90 b/flang/test/Lower/OpenMP/parallel-reduction-array2.f90
index 14338c6f50817..9cf8a63427ed1 100644
--- a/flang/test/Lower/OpenMP/parallel-reduction-array2.f90
+++ b/flang/test/Lower/OpenMP/parallel-reduction-array2.f90
@@ -13,7 +13,7 @@ program reduce
 print *,i
 end program
-! CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_3xi32 : !fir.ref<!fir.box<!fir.array<3xi32>>> alloc {
+! CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_3xi32 : !fir.ref<!fir.box<!fir.array<3xi32>>> {{.*}} alloc {
 ! CHECK: %[[VAL_8:.*]] = fir.alloca !fir.box<!fir.array<3xi32>>
 ! CHECK: omp.yield(%[[VAL_8]] : !fir.ref<!fir.box<!fir.array<3xi32>>>)
 ! CHECK-LABEL: } init {
diff --git a/flang/test/Lower/OpenMP/parallel-reduction-pointer-array.f90 b/flang/test/Lower/OpenMP/parallel-reduction-pointer-array.f90
index 36344458d1cae..3de2ba8f61f8e 100644
--- a/flang/test/Lower/OpenMP/parallel-reduction-pointer-array.f90
+++ b/flang/test/Lower/OpenMP/parallel-reduction-pointer-array.f90
@@ -19,7 +19,7 @@ program reduce
 end program
-! CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_ptr_Uxi32 : !fir.ref<!fir.box<!fir.ptr<!fir.array<?xi32>>>> alloc {
+! CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_ptr_Uxi32 : !fir.ref<!fir.box<!fir.ptr<!fir.array<?xi32>>>> attributes {byref_element_type = !fir.array<?xi32>} alloc {
 ! CHECK: %[[VAL_3:.*]] = fir.alloca !fir.box<!fir.ptr<!fir.array<?xi32>>>
 ! CHECK: omp.yield(%[[VAL_3]] : !fir.ref<!fir.box<!fir.ptr<!fir.array<?xi32>>>>)
 ! CHECK-LABEL: } init {
diff --git a/flang/test/Lower/OpenMP/parallel-reduction3.f90 b/flang/test/Lower/OpenMP/parallel-reduction3.f90
index 9af18378f0ae0..da337378862be 100644
--- a/flang/test/Lower/OpenMP/parallel-reduction3.f90
+++ b/flang/test/Lower/OpenMP/parallel-reduction3.f90
@@ -1,7 +1,7 @@
 ! RUN: bbc -emit-hlfir -fopenmp -o - %s 2>&1 | FileCheck %s
 !
RUN: %flang_fc1 -emit-hlfir -fopenmp -o - %s 2>&1 | FileCheck %s -! CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_Uxi32 : !fir.ref<!fir.box<!fir.array<?xi32>>> alloc { +! CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_Uxi32 : !fir.ref<!fir.box<!fir.array<?xi32>>> {{.*}} alloc { ! CHECK: %[[VAL_8:.*]] = fir.alloca !fir.box<!fir.array<?xi32>> ! CHECK: omp.yield(%[[VAL_8]] : !fir.ref<!fir.box<!fir.array<?xi32>>>) ! CHECK-LABEL: } init { diff --git a/flang/test/Lower/OpenMP/reduction-array-intrinsic.f90 b/flang/test/Lower/OpenMP/reduction-array-intrinsic.f90 index 8b94d51f986f5..4a0593ff9eca4 100644 --- a/flang/test/Lower/OpenMP/reduction-array-intrinsic.f90 +++ b/flang/test/Lower/OpenMP/reduction-array-intrinsic.f90 @@ -9,7 +9,7 @@ subroutine max_array_reduction(l, r) !$omp end parallel end subroutine -! CHECK-LABEL: omp.declare_reduction @max_byref_box_Uxi32 : !fir.ref<!fir.box<!fir.array<?xi32>>> alloc { +! CHECK-LABEL: omp.declare_reduction @max_byref_box_Uxi32 : !fir.ref<!fir.box<!fir.array<?xi32>>> {{.*}} alloc { ! CHECK: %[[VAL_3:.*]] = fir.alloca !fir.box<!fir.array<?xi32>> ! CHECK: omp.yield(%[[VAL_3]] : !fir.ref<!fir.box<!fir.array<?xi32>>>) ! CHECK-LABEL: } init { diff --git a/flang/test/Lower/OpenMP/sections-array-reduction.f90 b/flang/test/Lower/OpenMP/sections-array-reduction.f90 index 2f2808cebfc0c..0dbe9e3673395 100644 --- a/flang/test/Lower/OpenMP/sections-array-reduction.f90 +++ b/flang/test/Lower/OpenMP/sections-array-reduction.f90 @@ -14,7 +14,7 @@ subroutine sectionsReduction(x) end subroutine -! CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_Uxf32 : !fir.ref<!fir.box<!fir.array<?xf32>>> alloc { +! CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_Uxf32 : !fir.ref<!fir.box<!fir.array<?xf32>>> {{.*}} alloc { ! [...] ! CHECK: omp.yield ! 
CHECK-LABEL: } init { diff --git a/flang/test/Lower/OpenMP/taskgroup-task-array-reduction.f90 b/flang/test/Lower/OpenMP/taskgroup-task-array-reduction.f90 index 18a4f75b86309..3a63bb09c59de 100644 --- a/flang/test/Lower/OpenMP/taskgroup-task-array-reduction.f90 +++ b/flang/test/Lower/OpenMP/taskgroup-task-array-reduction.f90 @@ -1,7 +1,7 @@ ! RUN: bbc -emit-hlfir -fopenmp -fopenmp-version=50 -o - %s 2>&1 | FileCheck %s ! RUN: %flang_fc1 -emit-hlfir -fopenmp -fopenmp-version=50 -o - %s 2>&1 | FileCheck %s -! CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_Uxf32 : !fir.ref<!fir.box<!fir.array<?xf32>>> alloc { +! CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_Uxf32 : !fir.ref<!fir.box<!fir.array<?xf32>>> {{.*}} alloc { ! [...] ! CHECK: omp.yield ! CHECK-LABEL: } init { diff --git a/flang/test/Lower/OpenMP/wsloop-reduction-allocatable-array-minmax.f90 b/flang/test/Lower/OpenMP/wsloop-reduction-allocatable-array-minmax.f90 index 2cd953de0dffa..ed81577ecce16 100644 --- a/flang/test/Lower/OpenMP/wsloop-reduction-allocatable-array-minmax.f90 +++ b/flang/test/Lower/OpenMP/wsloop-reduction-allocatable-array-minmax.f90 @@ -32,7 +32,7 @@ program reduce15 print *,"min: ", mins end program -! CHECK-LABEL: omp.declare_reduction @min_byref_box_heap_Uxi32 : !fir.ref<!fir.box<!fir.heap<!fir.array<?xi32>>>> alloc { +! CHECK-LABEL: omp.declare_reduction @min_byref_box_heap_Uxi32 : !fir.ref<!fir.box<!fir.heap<!fir.array<?xi32>>>> {{.*}} alloc { ! CHECK: %[[VAL_3:.*]] = fir.alloca !fir.box<!fir.heap<!fir.array<?xi32>>> ! CHECK: omp.yield(%[[VAL_3]] : !fir.ref<!fir.box<!fir.heap<!fir.array<?xi32>>>>) ! CHECK-LABEL: } init { @@ -93,7 +93,7 @@ program reduce15 ! CHECK: omp.yield ! CHECK: } -! CHECK-LABEL: omp.declare_reduction @max_byref_box_heap_Uxi32 : !fir.ref<!fir.box<!fir.heap<!fir.array<?xi32>>>> alloc { +! CHECK-LABEL: omp.declare_reduction @max_byref_box_heap_Uxi32 : !fir.ref<!fir.box<!fir.heap<!fir.array<?xi32>>>> {{.*}} alloc { ! 
CHECK: %[[VAL_3:.*]] = fir.alloca !fir.box<!fir.heap<!fir.array<?xi32>>> ! CHECK: omp.yield(%[[VAL_3]] : !fir.ref<!fir.box<!fir.heap<!fir.array<?xi32>>>>) ! CHECK-LABEL: } init { diff --git a/flang/test/Lower/OpenMP/wsloop-reduction-allocatable.f90 b/flang/test/Lower/OpenMP/wsloop-reduction-allocatable.f90 index 663851cba46c6..d8c0a36db126e 100644 --- a/flang/test/Lower/OpenMP/wsloop-reduction-allocatable.f90 +++ b/flang/test/Lower/OpenMP/wsloop-reduction-allocatable.f90 @@ -18,7 +18,7 @@ program reduce end program -! CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_heap_i32 : !fir.ref<!fir.box<!fir.heap<i32>>> alloc { +! CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_heap_i32 : !fir.ref<!fir.box<!fir.heap<i32>>> attributes {byref_element_type = i32} alloc { ! CHECK: %[[VAL_2:.*]] = fir.alloca !fir.box<!fir.heap<i32>> ! CHECK: omp.yield(%[[VAL_2]] : !fir.ref<!fir.box<!fir.heap<i32>>>) ! CHECK-LABEL: } init { diff --git a/flang/test/Lower/OpenMP/wsloop-reduction-array-assumed-shape.f90 b/flang/test/Lower/OpenMP/wsloop-reduction-array-assumed-shape.f90 index 209ee9a4e0cef..28acb8f19531f 100644 --- a/flang/test/Lower/OpenMP/wsloop-reduction-array-assumed-shape.f90 +++ b/flang/test/Lower/OpenMP/wsloop-reduction-array-assumed-shape.f90 @@ -22,7 +22,7 @@ subroutine reduce(r) end subroutine end program -! CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_Uxf64 : !fir.ref<!fir.box<!fir.array<?xf64>>> alloc { +! CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_Uxf64 : !fir.ref<!fir.box<!fir.array<?xf64>>> {{.*}} alloc { ! CHECK: %[[VAL_8:.*]] = fir.alloca !fir.box<!fir.array<?xf64>> ! CHECK: omp.yield(%[[VAL_8]] : !fir.ref<!fir.box<!fir.array<?xf64>>>) ! 
CHECK-LABEL: } init { diff --git a/flang/test/Lower/OpenMP/wsloop-reduction-array-lb.f90 b/flang/test/Lower/OpenMP/wsloop-reduction-array-lb.f90 index 2233a74600948..ec448cf20f111 100644 --- a/flang/test/Lower/OpenMP/wsloop-reduction-array-lb.f90 +++ b/flang/test/Lower/OpenMP/wsloop-reduction-array-lb.f90 @@ -11,7 +11,7 @@ program reduce !$omp end parallel do end program -! CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_2xi32 : !fir.ref<!fir.box<!fir.array<2xi32>>> alloc { +! CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_2xi32 : !fir.ref<!fir.box<!fir.array<2xi32>>> {{.*}} alloc { ! CHECK: } combiner { ! CHECK: ^bb0(%[[ARG0:.*]]: !fir.ref<!fir.box<!fir.array<2xi32>>>, %[[ARG1:.*]]: !fir.ref<!fir.box<!fir.array<2xi32>>>): ! CHECK: %[[ARR0:.*]] = fir.load %[[ARG0]] : !fir.ref<!fir.box<!fir.array<2xi32>>> diff --git a/flang/test/Lower/OpenMP/wsloop-reduction-array-lb2.f90 b/flang/test/Lower/OpenMP/wsloop-reduction-array-lb2.f90 index 211bde19da8db..9da05a290ec21 100644 --- a/flang/test/Lower/OpenMP/wsloop-reduction-array-lb2.f90 +++ b/flang/test/Lower/OpenMP/wsloop-reduction-array-lb2.f90 @@ -19,7 +19,7 @@ subroutine sub(a, lb, ub) end program -! CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_Uxi32 : !fir.ref<!fir.box<!fir.array<?xi32>>> alloc { +! CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_Uxi32 : !fir.ref<!fir.box<!fir.array<?xi32>>> {{.*}} alloc { ! CHECK: } combiner { ! CHECK: ^bb0(%[[ARG0:.*]]: !fir.ref<!fir.box<!fir.array<?xi32>>>, %[[ARG1:.*]]: !fir.ref<!fir.box<!fir.array<?xi32>>>): ! CHECK: %[[ARR0:.*]] = fir.load %[[ARG0]] : !fir.ref<!fir.box<!fir.array<?xi32>>> diff --git a/flang/test/Lower/OpenMP/wsloop-reduction-array.f90 b/flang/test/Lower/OpenMP/wsloop-reduction-array.f90 index afaeba27c5eae..14b657c8e180d 100644 --- a/flang/test/Lower/OpenMP/wsloop-reduction-array.f90 +++ b/flang/test/Lower/OpenMP/wsloop-reduction-array.f90 @@ -14,7 +14,7 @@ program reduce print *,r end program -! 
CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_2xi32 : !fir.ref<!fir.box<!fir.array<2xi32>>> alloc { +! CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_2xi32 : !fir.ref<!fir.box<!fir.array<2xi32>>> attributes {byref_element_type = !fir.array<2xi32>} alloc { ! CHECK: %[[VAL_8:.*]] = fir.alloca !fir.box<!fir.array<2xi32>> ! CHECK: omp.yield(%[[VAL_8]] : !fir.ref<!fir.box<!fir.array<2xi32>>>) ! CHECK-LABEL: } init { diff --git a/flang/test/Lower/OpenMP/wsloop-reduction-array2.f90 b/flang/test/Lower/OpenMP/wsloop-reduction-array2.f90 index 25b2e97a1b7f7..d0a0c38e4ccb1 100644 --- a/flang/test/Lower/OpenMP/wsloop-reduction-array2.f90 +++ b/flang/test/Lower/OpenMP/wsloop-reduction-array2.f90 @@ -14,7 +14,7 @@ program reduce print *,r end program -! CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_2xi32 : !fir.ref<!fir.box<!fir.array<2xi32>>> alloc { +! CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_2xi32 : !fir.ref<!fir.box<!fir.array<2xi32>>> {{.*}} alloc { ! CHECK: %[[VAL_8:.*]] = fir.alloca !fir.box<!fir.array<2xi32>> ! CHECK: omp.yield(%[[VAL_8]] : !fir.ref<!fir.box<!fir.array<2xi32>>>) ! CHECK-LABEL: } init { diff --git a/flang/test/Lower/OpenMP/wsloop-reduction-multiple-clauses.f90 b/flang/test/Lower/OpenMP/wsloop-reduction-multiple-clauses.f90 index edd2bcb1d6be8..60a162d8f8002 100644 --- a/flang/test/Lower/OpenMP/wsloop-reduction-multiple-clauses.f90 +++ b/flang/test/Lower/OpenMP/wsloop-reduction-multiple-clauses.f90 @@ -24,7 +24,7 @@ program main endprogram -! CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_3x3xf64 : !fir.ref<!fir.box<!fir.array<3x3xf64>>> alloc { +! CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_3x3xf64 : !fir.ref<!fir.box<!fir.array<3x3xf64>>> {{.*}} alloc { ! CHECK: %[[VAL_3:.*]] = fir.alloca !fir.box<!fir.array<3x3xf64>> ! CHECK: omp.yield(%[[VAL_3]] : !fir.ref<!fir.box<!fir.array<3x3xf64>>>) ! 
CHECK-LABEL: } init { diff --git a/flang/test/Lower/OpenMP/wsloop-reduction-pointer.f90 b/flang/test/Lower/OpenMP/wsloop-reduction-pointer.f90 index 27b726376fbeb..f640f5caddf76 100644 --- a/flang/test/Lower/OpenMP/wsloop-reduction-pointer.f90 +++ b/flang/test/Lower/OpenMP/wsloop-reduction-pointer.f90 @@ -18,7 +18,7 @@ program reduce_pointer deallocate(v) end program -! CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_ptr_i32 : !fir.ref<!fir.box<!fir.ptr<i32>>> alloc { +! CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_ptr_i32 : !fir.ref<!fir.box<!fir.ptr<i32>>> {{.*}} alloc { ! CHECK: %[[VAL_3:.*]] = fir.alloca !fir.box<!fir.ptr<i32>> ! CHECK: omp.yield(%[[VAL_3]] : !fir.ref<!fir.box<!fir.ptr<i32>>>) ! CHECK-LABEL: } init { diff --git a/flang/test/Lower/do_concurrent_reduce_allocatable.f90 b/flang/test/Lower/do_concurrent_reduce_allocatable.f90 index 873fd10dd1b97..4fb67c094b594 100644 --- a/flang/test/Lower/do_concurrent_reduce_allocatable.f90 +++ b/flang/test/Lower/do_concurrent_reduce_allocatable.f90 @@ -8,7 +8,7 @@ subroutine do_concurrent_allocatable end do end subroutine -! CHECK: fir.declare_reduction @[[RED_OP:.*]] : ![[RED_TYPE:.*]] alloc { +! CHECK: fir.declare_reduction @[[RED_OP:.*]] : ![[RED_TYPE:.*]] attributes {byref_element_type = !fir.array<?x?xf32>} alloc { ! CHECK: %[[ALLOC:.*]] = fir.alloca ! CHECK: fir.yield(%[[ALLOC]] : ![[RED_TYPE]]) ! 
CHECK: } init { diff --git a/llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h b/llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h index 5331cb5abdc6f..f4192f9b49fd9 100644 --- a/llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h +++ b/llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h @@ -1448,11 +1448,15 @@ class OpenMPIRBuilder { ReductionInfo(Type *ElementType, Value *Variable, Value *PrivateVariable, EvalKind EvaluationKind, ReductionGenCBTy ReductionGen, ReductionGenClangCBTy ReductionGenClang, - ReductionGenAtomicCBTy AtomicReductionGen) + ReductionGenAtomicCBTy AtomicReductionGen, + Type *ByRefAllocatedType = nullptr, + Type *ByRefElementType = nullptr) : ElementType(ElementType), Vari... [truncated] 
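For readers following the test updates above: the new `byref_element_type` attribute appears to record the type left after the reference, box, and heap/pointer wrappers are peeled off the reduction type. The following is a toy sketch of that relationship on type spellings only. It is not the FIR API: `byrefElementType` and `checkExamples` are hypothetical names, the list of wrappers is inferred from the test expectations in this diff, and the real lowering operates on `mlir::Type` objects through `fir::unwrapRefType` and `fir::unwrapPassByRefType`.

```cpp
#include <array>
#include <cassert>
#include <string>
#include <string_view>

// Toy model only: strips the wrapper spellings that occur in the updated test
// expectations. The wrapper list is an assumption inferred from those tests.
std::string byrefElementType(std::string_view type) {
  constexpr std::array<std::string_view, 4> wrappers = {
      "!fir.ref<", "!fir.box<", "!fir.heap<", "!fir.ptr<"};
  bool stripped = true;
  while (stripped) {
    stripped = false;
    for (std::string_view w : wrappers) {
      if (type.size() > w.size() && type.substr(0, w.size()) == w &&
          type.back() == '>') {
        type.remove_prefix(w.size()); // drop "!fir.xxx<"
        type.remove_suffix(1);        // drop the matching ">"
        stripped = true;
      }
    }
  }
  return std::string(type);
}

// Usage: mirrors three of the updated CHECK lines in the diff.
void checkExamples() {
  assert(byrefElementType("!fir.ref<!fir.box<!fir.heap<i32>>>") == "i32");
  assert(byrefElementType("!fir.ref<!fir.box<!fir.array<3xi32>>>") ==
         "!fir.array<3xi32>");
  assert(byrefElementType(
             "!fir.ref<!fir.box<!fir.heap<!fir.array<?xi32>>>>") ==
         "!fir.array<?xi32>");
}
```

The attribute is declared optional in `FIROps.td`, which matches the lowering code leaving it unset when the reduction is not by-ref.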
@llvmbot

llvmbot commented Oct 30, 2025

@llvm/pr-subscribers-flang-openmp

Author: Kareem Ergawy (ergawy)

Changes

Adds initial support for GPU by-ref reductions. In particular, this diff adds support for reductions on scalar allocatables where reductions happen on loops nested in target regions. For example:

    integer :: i
    real, allocatable :: scalar_alloc
    allocate(scalar_alloc)
    scalar_alloc = 0
    !$omp target map(tofrom: scalar_alloc)
    !$omp parallel do reduction(+: scalar_alloc)
    do i = 1, 1000000
      scalar_alloc = scalar_alloc + 1
    end do
    !$omp end target

This PR supports by-ref reductions on the intra- and inter-warp levels.
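
To make the two levels concrete, here is a minimal host-side C++ sketch of reducing by value within a warp and then across warps. It is not the code this PR generates: `Lane`, `makeLane`, `shuffleDownValue`, `warpReduce`, `blockReduce`, and `runExample` are hypothetical names, a warp is modelled as a plain vector, and the power-of-two warp size is an assumption of the sketch. The point it illustrates is the one from the PR description: what crosses lanes is the loaded private value, not a pointer to another lane's private storage.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical model: each lane owns a private, heap-allocated partial value,
// standing in for a by-ref reduction variable such as a scalar allocatable.
struct Lane {
  std::unique_ptr<float> priv;
};

// Builds a lane whose private value is `v`.
Lane makeLane(float v) {
  Lane lane;
  lane.priv = std::make_unique<float>(v);
  return lane;
}

// Models a shuffle-down by VALUE: the source lane's private storage is loaded
// first, and only the loaded float crosses lanes.
float shuffleDownValue(const std::vector<Lane> &warp, std::size_t id,
                       std::size_t offset) {
  std::size_t src = id + offset;
  return src < warp.size() ? *warp[src].priv : 0.0f; // 0 is the identity of +
}

// Intra-warp tree reduction; the combined result ends up in lane 0.
float warpReduce(std::vector<Lane> &warp) {
  if (warp.empty())
    return 0.0f;
  assert((warp.size() & (warp.size() - 1)) == 0 && "power-of-two size");
  for (std::size_t offset = warp.size() / 2; offset > 0; offset /= 2) {
    // Gather every remote value before any lane updates, mimicking lock-step.
    std::vector<float> remote(warp.size());
    for (std::size_t id = 0; id < warp.size(); ++id)
      remote[id] = shuffleDownValue(warp, id, offset);
    for (std::size_t id = 0; id < warp.size(); ++id)
      *warp[id].priv += remote[id];
  }
  return *warp[0].priv;
}

// Inter-warp step: each warp contributes one partial value, and the partials
// are reduced with the same tree scheme.
float blockReduce(std::vector<std::vector<Lane>> &warps) {
  std::vector<Lane> partials;
  for (std::vector<Lane> &warp : warps)
    partials.push_back(makeLane(warpReduce(warp)));
  // Pad with the identity so the tree reduction sees a power-of-two size.
  while (partials.size() & (partials.size() - 1))
    partials.push_back(makeLane(0.0f));
  return warpReduce(partials);
}

// Usage: 4 warps of 32 lanes, every lane holding 1.0f, so the sum is 128.
float runExample() {
  std::vector<std::vector<Lane>> warps(4);
  for (std::vector<Lane> &warp : warps)
    for (int lane = 0; lane < 32; ++lane)
      warp.push_back(makeLane(1.0f));
  return blockReduce(warps);
}
```

Combining partial results across blocks would need a further level on top of `blockReduce`, which this sketch does not model.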

Several steps are still needed for full support of by-ref reductions, for example:

  • Inter-block value combination is not yet supported. Therefore, target teams distribute parallel do is not yet supported either.
  • Support for dynamically-sized arrays still needs to be added.
  • Support for more than one allocatable/array on the same reduction clause still needs to be added.

Patch is 54.34 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/165714.diff

34 Files Affected:

  • (modified) clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp (+2-2)
  • (modified) flang/include/flang/Optimizer/Dialect/FIROps.td (+2-1)
  • (modified) flang/lib/Lower/Support/ReductionProcessor.cpp (+6-1)
  • (modified) flang/lib/Optimizer/OpenMP/DoConcurrentConversion.cpp (+2-1)
  • (modified) flang/test/Lower/OpenMP/delayed-privatization-reduction-byref.f90 (+1-1)
  • (modified) flang/test/Lower/OpenMP/parallel-reduction-allocatable-array.f90 (+1-1)
  • (modified) flang/test/Lower/OpenMP/parallel-reduction-array-lb.f90 (+1-1)
  • (modified) flang/test/Lower/OpenMP/parallel-reduction-array.f90 (+1-1)
  • (modified) flang/test/Lower/OpenMP/parallel-reduction-array2.f90 (+1-1)
  • (modified) flang/test/Lower/OpenMP/parallel-reduction-pointer-array.f90 (+1-1)
  • (modified) flang/test/Lower/OpenMP/parallel-reduction3.f90 (+1-1)
  • (modified) flang/test/Lower/OpenMP/reduction-array-intrinsic.f90 (+1-1)
  • (modified) flang/test/Lower/OpenMP/sections-array-reduction.f90 (+1-1)
  • (modified) flang/test/Lower/OpenMP/taskgroup-task-array-reduction.f90 (+1-1)
  • (modified) flang/test/Lower/OpenMP/wsloop-reduction-allocatable-array-minmax.f90 (+2-2)
  • (modified) flang/test/Lower/OpenMP/wsloop-reduction-allocatable.f90 (+1-1)
  • (modified) flang/test/Lower/OpenMP/wsloop-reduction-array-assumed-shape.f90 (+1-1)
  • (modified) flang/test/Lower/OpenMP/wsloop-reduction-array-lb.f90 (+1-1)
  • (modified) flang/test/Lower/OpenMP/wsloop-reduction-array-lb2.f90 (+1-1)
  • (modified) flang/test/Lower/OpenMP/wsloop-reduction-array.f90 (+1-1)
  • (modified) flang/test/Lower/OpenMP/wsloop-reduction-array2.f90 (+1-1)
  • (modified) flang/test/Lower/OpenMP/wsloop-reduction-multiple-clauses.f90 (+1-1)
  • (modified) flang/test/Lower/OpenMP/wsloop-reduction-pointer.f90 (+1-1)
  • (modified) flang/test/Lower/do_concurrent_reduce_allocatable.f90 (+1-1)
  • (modified) llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h (+17-7)
  • (modified) llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp (+119-35)
  • (modified) mlir/include/mlir/Dialect/OpenMP/OpenMPOps.td (+3-1)
  • (modified) mlir/lib/Conversion/SCFToOpenMP/SCFToOpenMP.cpp (+2-1)
  • (modified) mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp (+20-4)
  • (added) mlir/test/Target/LLVMIR/allocatable_gpu_reduction.mlir (+92)
  • (modified) mlir/test/Target/LLVMIR/omptarget-multi-block-reduction.mlir (+2-4)
  • (modified) mlir/test/Target/LLVMIR/omptarget-multi-reduction.mlir (+4-4)
  • (modified) mlir/test/Target/LLVMIR/omptarget-teams-distribute-reduction.mlir (+1-1)
  • (modified) mlir/test/Target/LLVMIR/omptarget-teams-reduction.mlir (+1-1)
CHECK-LABEL: } init { diff --git a/flang/test/Lower/OpenMP/wsloop-reduction-pointer.f90 b/flang/test/Lower/OpenMP/wsloop-reduction-pointer.f90 index 27b726376fbeb..f640f5caddf76 100644 --- a/flang/test/Lower/OpenMP/wsloop-reduction-pointer.f90 +++ b/flang/test/Lower/OpenMP/wsloop-reduction-pointer.f90 @@ -18,7 +18,7 @@ program reduce_pointer deallocate(v) end program -! CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_ptr_i32 : !fir.ref<!fir.box<!fir.ptr<i32>>> alloc { +! CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_ptr_i32 : !fir.ref<!fir.box<!fir.ptr<i32>>> {{.*}} alloc { ! CHECK: %[[VAL_3:.*]] = fir.alloca !fir.box<!fir.ptr<i32>> ! CHECK: omp.yield(%[[VAL_3]] : !fir.ref<!fir.box<!fir.ptr<i32>>>) ! CHECK-LABEL: } init { diff --git a/flang/test/Lower/do_concurrent_reduce_allocatable.f90 b/flang/test/Lower/do_concurrent_reduce_allocatable.f90 index 873fd10dd1b97..4fb67c094b594 100644 --- a/flang/test/Lower/do_concurrent_reduce_allocatable.f90 +++ b/flang/test/Lower/do_concurrent_reduce_allocatable.f90 @@ -8,7 +8,7 @@ subroutine do_concurrent_allocatable end do end subroutine -! CHECK: fir.declare_reduction @[[RED_OP:.*]] : ![[RED_TYPE:.*]] alloc { +! CHECK: fir.declare_reduction @[[RED_OP:.*]] : ![[RED_TYPE:.*]] attributes {byref_element_type = !fir.array<?x?xf32>} alloc { ! CHECK: %[[ALLOC:.*]] = fir.alloca ! CHECK: fir.yield(%[[ALLOC]] : ![[RED_TYPE]]) ! 
CHECK: } init { diff --git a/llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h b/llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h index 5331cb5abdc6f..f4192f9b49fd9 100644 --- a/llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h +++ b/llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h @@ -1448,11 +1448,15 @@ class OpenMPIRBuilder { ReductionInfo(Type *ElementType, Value *Variable, Value *PrivateVariable, EvalKind EvaluationKind, ReductionGenCBTy ReductionGen, ReductionGenClangCBTy ReductionGenClang, - ReductionGenAtomicCBTy AtomicReductionGen) + ReductionGenAtomicCBTy AtomicReductionGen, + Type *ByRefAllocatedType = nullptr, + Type *ByRefElementType = nullptr) : ElementType(ElementType), Vari... [truncated] 
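The `ReductionInfo` hunk above appends two pointer parameters, `ByRefAllocatedType` and `ByRefElementType`, both defaulting to `nullptr`. A minimal sketch of why that shape keeps existing call sites compiling is given below. `Type`, `ReductionInfoSketch`, and `hasByRefElementType` are hypothetical stand-ins invented for illustration; they are not the LLVM classes, and the real constructor takes more arguments than shown here.

```cpp
#include <string>

// Hypothetical stand-in for llvm::Type; NOT the LLVM class.
struct Type {
  std::string name;
};

// Hypothetical, reduced model of the constructor shape in the hunk.
struct ReductionInfoSketch {
  Type *ElementType;
  Type *ByRefAllocatedType;
  Type *ByRefElementType;

  // The two by-ref parameters are appended with nullptr defaults, so a
  // caller written before the change still compiles unchanged.
  explicit ReductionInfoSketch(Type *ElementType,
                               Type *ByRefAllocatedType = nullptr,
                               Type *ByRefElementType = nullptr)
      : ElementType(ElementType), ByRefAllocatedType(ByRefAllocatedType),
        ByRefElementType(ByRefElementType) {}

  // Assumption for illustration only: report whether an element type
  // was supplied. The real builder may decide by-ref handling differently.
  bool hasByRefElementType() const { return ByRefElementType != nullptr; }
};
```

An old-style call such as `ReductionInfoSketch(&f32)` leaves both new fields null, while a by-ref caller passes all three pointers.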
@llvmbot
Member

llvmbot commented Oct 30, 2025

@llvm/pr-subscribers-mlir-llvm

Author: Kareem Ergawy (ergawy)

Changes

Adds initial support for GPU by-ref reductions. In particular, this diff adds support for reductions on scalar allocatables where reductions happen on loops nested in target regions. For example:

    integer :: i
    real, allocatable :: scalar_alloc
    allocate(scalar_alloc)
    scalar_alloc = 0
    !$omp target map(tofrom: scalar_alloc)
    !$omp parallel do reduction(+: scalar_alloc)
    do i = 1, 1000000
      scalar_alloc = scalar_alloc + 1
    end do
    !$omp end target

This PR supports by-ref reductions on the intra- and inter-warp levels.
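As a rough illustration of what "intra- and inter-warp levels" means, here is a host-side C++ sketch that models one GPU block as an array of per-thread private values. It is not the code that `OMPIRBuilder` emits. The function name `reduceBlockByValue`, the array-based model, and the power-of-two `warpSize` are all assumptions made for the example. A plain array read stands in for a hardware shuffle, and what moves between lanes is the value itself, which is the behavior the PR description says by-ref reductions now need.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical host-side model: `lanes` holds one private reduction value
// per thread, grouped into warps of `warpSize` lanes (assumed power of two).
float reduceBlockByValue(std::vector<float> lanes, std::size_t warpSize) {
  // Intra-warp step: tree reduction. Lane i adds the value held by
  // lane i + offset, halving the offset on each pass.
  for (std::size_t base = 0; base < lanes.size(); base += warpSize) {
    for (std::size_t offset = warpSize / 2; offset > 0; offset /= 2) {
      for (std::size_t i = 0; i < offset; ++i) {
        std::size_t remote = base + i + offset;
        if (remote < lanes.size())
          lanes[base + i] += lanes[remote]; // the value travels, not its address
      }
    }
  }
  // Inter-warp step: lane 0 of each warp now holds that warp's partial
  // result, so combine those partial results into the block total.
  float total = 0.0f;
  for (std::size_t base = 0; base < lanes.size(); base += warpSize)
    total += lanes[base];
  return total;
}
```

For instance, calling `reduceBlockByValue(std::vector<float>(64, 1.0f), 32)` models two full warps whose lanes each contributed 1 to a `+` reduction.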

Some steps remain before by-ref reductions are fully supported, for example:

  • Inter-block value combination is not supported yet. Therefore, target teams distribute parallel do is not supported yet either.
  • Support for dynamically-sized arrays still needs to be added.
  • Support for more than one allocatable/array on the same reduction clause still needs to be added.

Patch is 54.34 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/165714.diff

34 Files Affected:

  • (modified) clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp (+2-2)
  • (modified) flang/include/flang/Optimizer/Dialect/FIROps.td (+2-1)
  • (modified) flang/lib/Lower/Support/ReductionProcessor.cpp (+6-1)
  • (modified) flang/lib/Optimizer/OpenMP/DoConcurrentConversion.cpp (+2-1)
  • (modified) flang/test/Lower/OpenMP/delayed-privatization-reduction-byref.f90 (+1-1)
  • (modified) flang/test/Lower/OpenMP/parallel-reduction-allocatable-array.f90 (+1-1)
  • (modified) flang/test/Lower/OpenMP/parallel-reduction-array-lb.f90 (+1-1)
  • (modified) flang/test/Lower/OpenMP/parallel-reduction-array.f90 (+1-1)
  • (modified) flang/test/Lower/OpenMP/parallel-reduction-array2.f90 (+1-1)
  • (modified) flang/test/Lower/OpenMP/parallel-reduction-pointer-array.f90 (+1-1)
  • (modified) flang/test/Lower/OpenMP/parallel-reduction3.f90 (+1-1)
  • (modified) flang/test/Lower/OpenMP/reduction-array-intrinsic.f90 (+1-1)
  • (modified) flang/test/Lower/OpenMP/sections-array-reduction.f90 (+1-1)
  • (modified) flang/test/Lower/OpenMP/taskgroup-task-array-reduction.f90 (+1-1)
  • (modified) flang/test/Lower/OpenMP/wsloop-reduction-allocatable-array-minmax.f90 (+2-2)
  • (modified) flang/test/Lower/OpenMP/wsloop-reduction-allocatable.f90 (+1-1)
  • (modified) flang/test/Lower/OpenMP/wsloop-reduction-array-assumed-shape.f90 (+1-1)
  • (modified) flang/test/Lower/OpenMP/wsloop-reduction-array-lb.f90 (+1-1)
  • (modified) flang/test/Lower/OpenMP/wsloop-reduction-array-lb2.f90 (+1-1)
  • (modified) flang/test/Lower/OpenMP/wsloop-reduction-array.f90 (+1-1)
  • (modified) flang/test/Lower/OpenMP/wsloop-reduction-array2.f90 (+1-1)
  • (modified) flang/test/Lower/OpenMP/wsloop-reduction-multiple-clauses.f90 (+1-1)
  • (modified) flang/test/Lower/OpenMP/wsloop-reduction-pointer.f90 (+1-1)
  • (modified) flang/test/Lower/do_concurrent_reduce_allocatable.f90 (+1-1)
  • (modified) llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h (+17-7)
  • (modified) llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp (+119-35)
  • (modified) mlir/include/mlir/Dialect/OpenMP/OpenMPOps.td (+3-1)
  • (modified) mlir/lib/Conversion/SCFToOpenMP/SCFToOpenMP.cpp (+2-1)
  • (modified) mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp (+20-4)
  • (added) mlir/test/Target/LLVMIR/allocatable_gpu_reduction.mlir (+92)
  • (modified) mlir/test/Target/LLVMIR/omptarget-multi-block-reduction.mlir (+2-4)
  • (modified) mlir/test/Target/LLVMIR/omptarget-multi-reduction.mlir (+4-4)
  • (modified) mlir/test/Target/LLVMIR/omptarget-teams-distribute-reduction.mlir (+1-1)
  • (modified) mlir/test/Target/LLVMIR/omptarget-teams-reduction.mlir (+1-1)
@llvmbot
Member

llvmbot commented Oct 30, 2025

@llvm/pr-subscribers-mlir-openmp

Author: Kareem Ergawy (ergawy)

Changes

diff --git a/clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp b/clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp index fddeba98adccc..9a8c75073aa4c 100644 --- a/clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp +++ b/clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp @@ -1784,8 +1784,8 @@ void CGOpenMPRuntimeGPU::emitReduction( llvm::OpenMPIRBuilder::InsertPointTy AfterIP = cantFail(OMPBuilder.createReductionsGPU( - OmpLoc, AllocaIP, CodeGenIP, ReductionInfos, false, TeamsReduction, - llvm::OpenMPIRBuilder::ReductionGenCBKind::Clang, + OmpLoc, AllocaIP, CodeGenIP, ReductionInfos, {}, false, + TeamsReduction, llvm::OpenMPIRBuilder::ReductionGenCBKind::Clang, CGF.getTarget().getGridValue(), C.getLangOpts().OpenMPCUDAReductionBufNum, RTLoc)); CGF.Builder.restoreIP(AfterIP); diff --git a/flang/include/flang/Optimizer/Dialect/FIROps.td b/flang/include/flang/Optimizer/Dialect/FIROps.td index 58a317cf5d691..ff4dab1136ee9 100644 --- a/flang/include/flang/Optimizer/Dialect/FIROps.td +++ b/flang/include/flang/Optimizer/Dialect/FIROps.td @@ -3743,7 +3743,8 @@ def fir_DeclareReductionOp : fir_Op<"declare_reduction", [IsolatedFromAbove, }]; let arguments = (ins SymbolNameAttr:$sym_name, - TypeAttr:$type); + TypeAttr:$type, + OptionalAttr<TypeAttr>:$byref_element_type); let regions = (region MaxSizedRegion<1>:$allocRegion, AnyRegion:$initializerRegion, diff --git a/flang/lib/Lower/Support/ReductionProcessor.cpp b/flang/lib/Lower/Support/ReductionProcessor.cpp index 605a5b6b20b94..e02cd8fac823b 100644 --- a/flang/lib/Lower/Support/ReductionProcessor.cpp +++ b/flang/lib/Lower/Support/ReductionProcessor.cpp @@ -573,10 +573,15 @@ OpType ReductionProcessor::createDeclareReduction( mlir::OpBuilder modBuilder(module.getBodyRegion()); mlir::Type valTy = fir::unwrapRefType(type); + mlir::TypeAttr boxedTy{}; + if (!isByRef) type = valTy; - decl = OpType::create(modBuilder, loc, reductionOpName, type); + if (isByRef) + boxedTy = mlir::TypeAttr::get(fir::unwrapPassByRefType(valTy)); + + decl = OpType::create(modBuilder, loc, 
reductionOpName, type, boxedTy); createReductionAllocAndInitRegions(converter, loc, decl, redId, type, isByRef); diff --git a/flang/lib/Optimizer/OpenMP/DoConcurrentConversion.cpp b/flang/lib/Optimizer/OpenMP/DoConcurrentConversion.cpp index 1229018bd9b3e..11609ea7b6040 100644 --- a/flang/lib/Optimizer/OpenMP/DoConcurrentConversion.cpp +++ b/flang/lib/Optimizer/OpenMP/DoConcurrentConversion.cpp @@ -851,7 +851,8 @@ class DoConcurrentConversion if (!ompReducer) { ompReducer = mlir::omp::DeclareReductionOp::create( rewriter, firReducer.getLoc(), ompReducerName, - firReducer.getTypeAttr().getValue()); + firReducer.getTypeAttr().getValue(), + firReducer.getByrefElementTypeAttr()); cloneFIRRegionToOMP(rewriter, firReducer.getAllocRegion(), ompReducer.getAllocRegion()); diff --git a/flang/test/Lower/OpenMP/delayed-privatization-reduction-byref.f90 b/flang/test/Lower/OpenMP/delayed-privatization-reduction-byref.f90 index 4b6a643f94059..4c7b6ac5f5f9b 100644 --- a/flang/test/Lower/OpenMP/delayed-privatization-reduction-byref.f90 +++ b/flang/test/Lower/OpenMP/delayed-privatization-reduction-byref.f90 @@ -22,7 +22,7 @@ subroutine red_and_delayed_private ! CHECK-SAME: @[[PRIVATIZER_SYM:.*]] : i32 ! CHECK-LABEL: omp.declare_reduction -! CHECK-SAME: @[[REDUCTION_SYM:.*]] : !fir.ref<i32> alloc +! CHECK-SAME: @[[REDUCTION_SYM:.*]] : !fir.ref<i32> attributes {byref_element_type = i32} alloc ! CHECK-LABEL: _QPred_and_delayed_private ! CHECK: omp.parallel diff --git a/flang/test/Lower/OpenMP/parallel-reduction-allocatable-array.f90 b/flang/test/Lower/OpenMP/parallel-reduction-allocatable-array.f90 index 41c7d69ebb3ba..f56875dcb518b 100644 --- a/flang/test/Lower/OpenMP/parallel-reduction-allocatable-array.f90 +++ b/flang/test/Lower/OpenMP/parallel-reduction-allocatable-array.f90 @@ -18,7 +18,7 @@ program reduce end program -! CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_heap_Uxi32 : !fir.ref<!fir.box<!fir.heap<!fir.array<?xi32>>>> alloc { +! 
CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_heap_Uxi32 : !fir.ref<!fir.box<!fir.heap<!fir.array<?xi32>>>> attributes {byref_element_type = !fir.array<?xi32>} alloc { ! CHECK: %[[VAL_10:.*]] = fir.alloca !fir.box<!fir.heap<!fir.array<?xi32>>> ! CHECK: omp.yield(%[[VAL_10]] : !fir.ref<!fir.box<!fir.heap<!fir.array<?xi32>>>>) ! CHECK-LABEL: } init { diff --git a/flang/test/Lower/OpenMP/parallel-reduction-array-lb.f90 b/flang/test/Lower/OpenMP/parallel-reduction-array-lb.f90 index aa91e1e0e8b15..d9ba3bed464f8 100644 --- a/flang/test/Lower/OpenMP/parallel-reduction-array-lb.f90 +++ b/flang/test/Lower/OpenMP/parallel-reduction-array-lb.f90 @@ -12,7 +12,7 @@ program reduce end program -! CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_3x2xi32 : !fir.ref<!fir.box<!fir.array<3x2xi32>>> alloc { +! CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_3x2xi32 : !fir.ref<!fir.box<!fir.array<3x2xi32>>> {{.*}} alloc { ! CHECK: %[[VAL_15:.*]] = fir.alloca !fir.box<!fir.array<3x2xi32>> ! CHECK: omp.yield(%[[VAL_15]] : !fir.ref<!fir.box<!fir.array<3x2xi32>>>) ! CHECK-LABEL: } init { diff --git a/flang/test/Lower/OpenMP/parallel-reduction-array.f90 b/flang/test/Lower/OpenMP/parallel-reduction-array.f90 index 59595de338d50..636660f279e85 100644 --- a/flang/test/Lower/OpenMP/parallel-reduction-array.f90 +++ b/flang/test/Lower/OpenMP/parallel-reduction-array.f90 @@ -17,7 +17,7 @@ program reduce print *,i end program -! CPU-LABEL: omp.declare_reduction @add_reduction_byref_box_3xi32 : !fir.ref<!fir.box<!fir.array<3xi32>>> alloc { +! CPU-LABEL: omp.declare_reduction @add_reduction_byref_box_3xi32 : !fir.ref<!fir.box<!fir.array<3xi32>>> attributes {byref_element_type = !fir.array<3xi32>} alloc { ! CPU: %[[VAL_8:.*]] = fir.alloca !fir.box<!fir.array<3xi32>> ! CPU: omp.yield(%[[VAL_8]] : !fir.ref<!fir.box<!fir.array<3xi32>>>) ! 
CPU-LABEL: } init { diff --git a/flang/test/Lower/OpenMP/parallel-reduction-array2.f90 b/flang/test/Lower/OpenMP/parallel-reduction-array2.f90 index 14338c6f50817..9cf8a63427ed1 100644 --- a/flang/test/Lower/OpenMP/parallel-reduction-array2.f90 +++ b/flang/test/Lower/OpenMP/parallel-reduction-array2.f90 @@ -13,7 +13,7 @@ program reduce print *,i end program -! CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_3xi32 : !fir.ref<!fir.box<!fir.array<3xi32>>> alloc { +! CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_3xi32 : !fir.ref<!fir.box<!fir.array<3xi32>>> {{.*}} alloc { ! CHECK: %[[VAL_8:.*]] = fir.alloca !fir.box<!fir.array<3xi32>> ! CHECK: omp.yield(%[[VAL_8]] : !fir.ref<!fir.box<!fir.array<3xi32>>>) ! CHECK-LABEL: } init { diff --git a/flang/test/Lower/OpenMP/parallel-reduction-pointer-array.f90 b/flang/test/Lower/OpenMP/parallel-reduction-pointer-array.f90 index 36344458d1cae..3de2ba8f61f8e 100644 --- a/flang/test/Lower/OpenMP/parallel-reduction-pointer-array.f90 +++ b/flang/test/Lower/OpenMP/parallel-reduction-pointer-array.f90 @@ -19,7 +19,7 @@ program reduce end program -! CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_ptr_Uxi32 : !fir.ref<!fir.box<!fir.ptr<!fir.array<?xi32>>>> alloc { +! CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_ptr_Uxi32 : !fir.ref<!fir.box<!fir.ptr<!fir.array<?xi32>>>> attributes {byref_element_type = !fir.array<?xi32>} alloc { ! CHECK: %[[VAL_3:.*]] = fir.alloca !fir.box<!fir.ptr<!fir.array<?xi32>>> ! CHECK: omp.yield(%[[VAL_3]] : !fir.ref<!fir.box<!fir.ptr<!fir.array<?xi32>>>>) ! CHECK-LABEL: } init { diff --git a/flang/test/Lower/OpenMP/parallel-reduction3.f90 b/flang/test/Lower/OpenMP/parallel-reduction3.f90 index 9af18378f0ae0..da337378862be 100644 --- a/flang/test/Lower/OpenMP/parallel-reduction3.f90 +++ b/flang/test/Lower/OpenMP/parallel-reduction3.f90 @@ -1,7 +1,7 @@ ! RUN: bbc -emit-hlfir -fopenmp -o - %s 2>&1 | FileCheck %s ! 
RUN: %flang_fc1 -emit-hlfir -fopenmp -o - %s 2>&1 | FileCheck %s -! CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_Uxi32 : !fir.ref<!fir.box<!fir.array<?xi32>>> alloc { +! CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_Uxi32 : !fir.ref<!fir.box<!fir.array<?xi32>>> {{.*}} alloc { ! CHECK: %[[VAL_8:.*]] = fir.alloca !fir.box<!fir.array<?xi32>> ! CHECK: omp.yield(%[[VAL_8]] : !fir.ref<!fir.box<!fir.array<?xi32>>>) ! CHECK-LABEL: } init { diff --git a/flang/test/Lower/OpenMP/reduction-array-intrinsic.f90 b/flang/test/Lower/OpenMP/reduction-array-intrinsic.f90 index 8b94d51f986f5..4a0593ff9eca4 100644 --- a/flang/test/Lower/OpenMP/reduction-array-intrinsic.f90 +++ b/flang/test/Lower/OpenMP/reduction-array-intrinsic.f90 @@ -9,7 +9,7 @@ subroutine max_array_reduction(l, r) !$omp end parallel end subroutine -! CHECK-LABEL: omp.declare_reduction @max_byref_box_Uxi32 : !fir.ref<!fir.box<!fir.array<?xi32>>> alloc { +! CHECK-LABEL: omp.declare_reduction @max_byref_box_Uxi32 : !fir.ref<!fir.box<!fir.array<?xi32>>> {{.*}} alloc { ! CHECK: %[[VAL_3:.*]] = fir.alloca !fir.box<!fir.array<?xi32>> ! CHECK: omp.yield(%[[VAL_3]] : !fir.ref<!fir.box<!fir.array<?xi32>>>) ! CHECK-LABEL: } init { diff --git a/flang/test/Lower/OpenMP/sections-array-reduction.f90 b/flang/test/Lower/OpenMP/sections-array-reduction.f90 index 2f2808cebfc0c..0dbe9e3673395 100644 --- a/flang/test/Lower/OpenMP/sections-array-reduction.f90 +++ b/flang/test/Lower/OpenMP/sections-array-reduction.f90 @@ -14,7 +14,7 @@ subroutine sectionsReduction(x) end subroutine -! CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_Uxf32 : !fir.ref<!fir.box<!fir.array<?xf32>>> alloc { +! CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_Uxf32 : !fir.ref<!fir.box<!fir.array<?xf32>>> {{.*}} alloc { ! [...] ! CHECK: omp.yield ! 
CHECK-LABEL: } init { diff --git a/flang/test/Lower/OpenMP/taskgroup-task-array-reduction.f90 b/flang/test/Lower/OpenMP/taskgroup-task-array-reduction.f90 index 18a4f75b86309..3a63bb09c59de 100644 --- a/flang/test/Lower/OpenMP/taskgroup-task-array-reduction.f90 +++ b/flang/test/Lower/OpenMP/taskgroup-task-array-reduction.f90 @@ -1,7 +1,7 @@ ! RUN: bbc -emit-hlfir -fopenmp -fopenmp-version=50 -o - %s 2>&1 | FileCheck %s ! RUN: %flang_fc1 -emit-hlfir -fopenmp -fopenmp-version=50 -o - %s 2>&1 | FileCheck %s -! CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_Uxf32 : !fir.ref<!fir.box<!fir.array<?xf32>>> alloc { +! CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_Uxf32 : !fir.ref<!fir.box<!fir.array<?xf32>>> {{.*}} alloc { ! [...] ! CHECK: omp.yield ! CHECK-LABEL: } init { diff --git a/flang/test/Lower/OpenMP/wsloop-reduction-allocatable-array-minmax.f90 b/flang/test/Lower/OpenMP/wsloop-reduction-allocatable-array-minmax.f90 index 2cd953de0dffa..ed81577ecce16 100644 --- a/flang/test/Lower/OpenMP/wsloop-reduction-allocatable-array-minmax.f90 +++ b/flang/test/Lower/OpenMP/wsloop-reduction-allocatable-array-minmax.f90 @@ -32,7 +32,7 @@ program reduce15 print *,"min: ", mins end program -! CHECK-LABEL: omp.declare_reduction @min_byref_box_heap_Uxi32 : !fir.ref<!fir.box<!fir.heap<!fir.array<?xi32>>>> alloc { +! CHECK-LABEL: omp.declare_reduction @min_byref_box_heap_Uxi32 : !fir.ref<!fir.box<!fir.heap<!fir.array<?xi32>>>> {{.*}} alloc { ! CHECK: %[[VAL_3:.*]] = fir.alloca !fir.box<!fir.heap<!fir.array<?xi32>>> ! CHECK: omp.yield(%[[VAL_3]] : !fir.ref<!fir.box<!fir.heap<!fir.array<?xi32>>>>) ! CHECK-LABEL: } init { @@ -93,7 +93,7 @@ program reduce15 ! CHECK: omp.yield ! CHECK: } -! CHECK-LABEL: omp.declare_reduction @max_byref_box_heap_Uxi32 : !fir.ref<!fir.box<!fir.heap<!fir.array<?xi32>>>> alloc { +! CHECK-LABEL: omp.declare_reduction @max_byref_box_heap_Uxi32 : !fir.ref<!fir.box<!fir.heap<!fir.array<?xi32>>>> {{.*}} alloc { ! 
CHECK: %[[VAL_3:.*]] = fir.alloca !fir.box<!fir.heap<!fir.array<?xi32>>> ! CHECK: omp.yield(%[[VAL_3]] : !fir.ref<!fir.box<!fir.heap<!fir.array<?xi32>>>>) ! CHECK-LABEL: } init { diff --git a/flang/test/Lower/OpenMP/wsloop-reduction-allocatable.f90 b/flang/test/Lower/OpenMP/wsloop-reduction-allocatable.f90 index 663851cba46c6..d8c0a36db126e 100644 --- a/flang/test/Lower/OpenMP/wsloop-reduction-allocatable.f90 +++ b/flang/test/Lower/OpenMP/wsloop-reduction-allocatable.f90 @@ -18,7 +18,7 @@ program reduce end program -! CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_heap_i32 : !fir.ref<!fir.box<!fir.heap<i32>>> alloc { +! CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_heap_i32 : !fir.ref<!fir.box<!fir.heap<i32>>> attributes {byref_element_type = i32} alloc { ! CHECK: %[[VAL_2:.*]] = fir.alloca !fir.box<!fir.heap<i32>> ! CHECK: omp.yield(%[[VAL_2]] : !fir.ref<!fir.box<!fir.heap<i32>>>) ! CHECK-LABEL: } init { diff --git a/flang/test/Lower/OpenMP/wsloop-reduction-array-assumed-shape.f90 b/flang/test/Lower/OpenMP/wsloop-reduction-array-assumed-shape.f90 index 209ee9a4e0cef..28acb8f19531f 100644 --- a/flang/test/Lower/OpenMP/wsloop-reduction-array-assumed-shape.f90 +++ b/flang/test/Lower/OpenMP/wsloop-reduction-array-assumed-shape.f90 @@ -22,7 +22,7 @@ subroutine reduce(r) end subroutine end program -! CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_Uxf64 : !fir.ref<!fir.box<!fir.array<?xf64>>> alloc { +! CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_Uxf64 : !fir.ref<!fir.box<!fir.array<?xf64>>> {{.*}} alloc { ! CHECK: %[[VAL_8:.*]] = fir.alloca !fir.box<!fir.array<?xf64>> ! CHECK: omp.yield(%[[VAL_8]] : !fir.ref<!fir.box<!fir.array<?xf64>>>) ! 
CHECK-LABEL: } init { diff --git a/flang/test/Lower/OpenMP/wsloop-reduction-array-lb.f90 b/flang/test/Lower/OpenMP/wsloop-reduction-array-lb.f90 index 2233a74600948..ec448cf20f111 100644 --- a/flang/test/Lower/OpenMP/wsloop-reduction-array-lb.f90 +++ b/flang/test/Lower/OpenMP/wsloop-reduction-array-lb.f90 @@ -11,7 +11,7 @@ program reduce !$omp end parallel do end program -! CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_2xi32 : !fir.ref<!fir.box<!fir.array<2xi32>>> alloc { +! CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_2xi32 : !fir.ref<!fir.box<!fir.array<2xi32>>> {{.*}} alloc { ! CHECK: } combiner { ! CHECK: ^bb0(%[[ARG0:.*]]: !fir.ref<!fir.box<!fir.array<2xi32>>>, %[[ARG1:.*]]: !fir.ref<!fir.box<!fir.array<2xi32>>>): ! CHECK: %[[ARR0:.*]] = fir.load %[[ARG0]] : !fir.ref<!fir.box<!fir.array<2xi32>>> diff --git a/flang/test/Lower/OpenMP/wsloop-reduction-array-lb2.f90 b/flang/test/Lower/OpenMP/wsloop-reduction-array-lb2.f90 index 211bde19da8db..9da05a290ec21 100644 --- a/flang/test/Lower/OpenMP/wsloop-reduction-array-lb2.f90 +++ b/flang/test/Lower/OpenMP/wsloop-reduction-array-lb2.f90 @@ -19,7 +19,7 @@ subroutine sub(a, lb, ub) end program -! CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_Uxi32 : !fir.ref<!fir.box<!fir.array<?xi32>>> alloc { +! CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_Uxi32 : !fir.ref<!fir.box<!fir.array<?xi32>>> {{.*}} alloc { ! CHECK: } combiner { ! CHECK: ^bb0(%[[ARG0:.*]]: !fir.ref<!fir.box<!fir.array<?xi32>>>, %[[ARG1:.*]]: !fir.ref<!fir.box<!fir.array<?xi32>>>): ! CHECK: %[[ARR0:.*]] = fir.load %[[ARG0]] : !fir.ref<!fir.box<!fir.array<?xi32>>> diff --git a/flang/test/Lower/OpenMP/wsloop-reduction-array.f90 b/flang/test/Lower/OpenMP/wsloop-reduction-array.f90 index afaeba27c5eae..14b657c8e180d 100644 --- a/flang/test/Lower/OpenMP/wsloop-reduction-array.f90 +++ b/flang/test/Lower/OpenMP/wsloop-reduction-array.f90 @@ -14,7 +14,7 @@ program reduce print *,r end program -! 
CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_2xi32 : !fir.ref<!fir.box<!fir.array<2xi32>>> alloc { +! CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_2xi32 : !fir.ref<!fir.box<!fir.array<2xi32>>> attributes {byref_element_type = !fir.array<2xi32>} alloc { ! CHECK: %[[VAL_8:.*]] = fir.alloca !fir.box<!fir.array<2xi32>> ! CHECK: omp.yield(%[[VAL_8]] : !fir.ref<!fir.box<!fir.array<2xi32>>>) ! CHECK-LABEL: } init { diff --git a/flang/test/Lower/OpenMP/wsloop-reduction-array2.f90 b/flang/test/Lower/OpenMP/wsloop-reduction-array2.f90 index 25b2e97a1b7f7..d0a0c38e4ccb1 100644 --- a/flang/test/Lower/OpenMP/wsloop-reduction-array2.f90 +++ b/flang/test/Lower/OpenMP/wsloop-reduction-array2.f90 @@ -14,7 +14,7 @@ program reduce print *,r end program -! CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_2xi32 : !fir.ref<!fir.box<!fir.array<2xi32>>> alloc { +! CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_2xi32 : !fir.ref<!fir.box<!fir.array<2xi32>>> {{.*}} alloc { ! CHECK: %[[VAL_8:.*]] = fir.alloca !fir.box<!fir.array<2xi32>> ! CHECK: omp.yield(%[[VAL_8]] : !fir.ref<!fir.box<!fir.array<2xi32>>>) ! CHECK-LABEL: } init { diff --git a/flang/test/Lower/OpenMP/wsloop-reduction-multiple-clauses.f90 b/flang/test/Lower/OpenMP/wsloop-reduction-multiple-clauses.f90 index edd2bcb1d6be8..60a162d8f8002 100644 --- a/flang/test/Lower/OpenMP/wsloop-reduction-multiple-clauses.f90 +++ b/flang/test/Lower/OpenMP/wsloop-reduction-multiple-clauses.f90 @@ -24,7 +24,7 @@ program main endprogram -! CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_3x3xf64 : !fir.ref<!fir.box<!fir.array<3x3xf64>>> alloc { +! CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_3x3xf64 : !fir.ref<!fir.box<!fir.array<3x3xf64>>> {{.*}} alloc { ! CHECK: %[[VAL_3:.*]] = fir.alloca !fir.box<!fir.array<3x3xf64>> ! CHECK: omp.yield(%[[VAL_3]] : !fir.ref<!fir.box<!fir.array<3x3xf64>>>) ! 
CHECK-LABEL: } init { diff --git a/flang/test/Lower/OpenMP/wsloop-reduction-pointer.f90 b/flang/test/Lower/OpenMP/wsloop-reduction-pointer.f90 index 27b726376fbeb..f640f5caddf76 100644 --- a/flang/test/Lower/OpenMP/wsloop-reduction-pointer.f90 +++ b/flang/test/Lower/OpenMP/wsloop-reduction-pointer.f90 @@ -18,7 +18,7 @@ program reduce_pointer deallocate(v) end program -! CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_ptr_i32 : !fir.ref<!fir.box<!fir.ptr<i32>>> alloc { +! CHECK-LABEL: omp.declare_reduction @add_reduction_byref_box_ptr_i32 : !fir.ref<!fir.box<!fir.ptr<i32>>> {{.*}} alloc { ! CHECK: %[[VAL_3:.*]] = fir.alloca !fir.box<!fir.ptr<i32>> ! CHECK: omp.yield(%[[VAL_3]] : !fir.ref<!fir.box<!fir.ptr<i32>>>) ! CHECK-LABEL: } init { diff --git a/flang/test/Lower/do_concurrent_reduce_allocatable.f90 b/flang/test/Lower/do_concurrent_reduce_allocatable.f90 index 873fd10dd1b97..4fb67c094b594 100644 --- a/flang/test/Lower/do_concurrent_reduce_allocatable.f90 +++ b/flang/test/Lower/do_concurrent_reduce_allocatable.f90 @@ -8,7 +8,7 @@ subroutine do_concurrent_allocatable end do end subroutine -! CHECK: fir.declare_reduction @[[RED_OP:.*]] : ![[RED_TYPE:.*]] alloc { +! CHECK: fir.declare_reduction @[[RED_OP:.*]] : ![[RED_TYPE:.*]] attributes {byref_element_type = !fir.array<?x?xf32>} alloc { ! CHECK: %[[ALLOC:.*]] = fir.alloca ! CHECK: fir.yield(%[[ALLOC]] : ![[RED_TYPE]]) ! 
CHECK: } init { diff --git a/llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h b/llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h index 5331cb5abdc6f..f4192f9b49fd9 100644 --- a/llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h +++ b/llvm/include/llvm/Frontend/OpenMP/OMPIRBuilder.h @@ -1448,11 +1448,15 @@ class OpenMPIRBuilder { ReductionInfo(Type *ElementType, Value *Variable, Value *PrivateVariable, EvalKind EvaluationKind, ReductionGenCBTy ReductionGen, ReductionGenClangCBTy ReductionGenClang, - ReductionGenAtomicCBTy AtomicReductionGen) + ReductionGenAtomicCBTy AtomicReductionGen, + Type *ByRefAllocatedType = nullptr, + Type *ByRefElementType = nullptr) : ElementType(ElementType), Vari... [truncated] 
@ergawy ergawy force-pushed the users/ergawy/allocatable_reduction_irbuilder branch from e23a9e7 to 643dafc Compare October 30, 2025 14:51
@github-actions

github-actions bot commented Oct 30, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

@ergawy ergawy force-pushed the users/ergawy/allocatable_reduction_irbuilder branch 2 times, most recently from 23caa56 to 1fdd4e7 Compare October 31, 2025 10:12
// pointer to the descriptor of the by-ref reduction element.
ShuffleType = RI.ByRefElementType;

ShuffleSrcAddr = Builder.CreateGEP(RI.ByRefAllocatedType,
Member Author

This is somewhat leaking the Fortran descriptor struct to the OMP IR builder. I think we can abstract this away somehow but I do not know if it will be worth it.

Contributor

I don't think we should do this. It defeats the purpose of OMPIRBuilder. It would be better to add optional regions to the mlir reduction op which extract the information we want and then inline them here. I suspect we will need to read much more out of the box to support variable length arrays.

Member Author

It would be better to add optional regions to the mlir reduction op which extract the information we want and then inline them here.

The issue I see with this is that we would extend the op to model what we need for the GPU on the MLIR level.

It defeats the purpose of OMPIRBuilder.

If we have the regions, the OMPIRBuilder will still have to be aware of them, I think. At least in the form of callbacks passed from MLIR to LLVM translation so that we inline these regions at the proper locations.

Contributor

@tblah tblah Nov 6, 2025

Yes, it would have to understand these new regions. I was just imagining the region would allow us to keep the codegen for accessing elements of a box inside of flang. I'm worried that one day some other MLIR user may want to lower through this (e.g. maybe the efforts to represent the clang AST in MLIR) and then hit strange bugs from these bits of code that make assumptions about the use of Fortran descriptors.

So something like this might work

omp.declare_reduction @add_reduction_byref_box_?xi32 : !fir.ref<!fir.box<!fir.array<?xi32>>> alloc {
  %0 = fir.alloca !fir.box<!fir.array<?xi32>>
  omp.yield(%0 : !fir.ref<!fir.box<!fir.array<?xi32>>>)
} init {
^bb0(%arg0: !fir.ref<!fir.box<!fir.array<?xi32>>>, %arg1: !fir.ref<!fir.box<!fir.array<?xi32>>>):
  ...
  omp.yield(%arg1 : !fir.ref<!fir.box<!fir.array<?xi32>>>)
} combiner {
^bb0(%arg0: !fir.ref<!fir.box<!fir.array<?xi32>>>, %arg1: !fir.ref<!fir.box<!fir.array<?xi32>>>):
  ...
  omp.yield(%arg0 : !fir.ref<!fir.box<!fir.array<?xi32>>>)
} cleanup {
^bb0(%arg0: !fir.ref<!fir.box<!fir.array<?xi32>>>):
  ...
  omp.yield
} data_size {
^bb0(%arg0: !fir.ref<!fir.box<!fir.array<?xi32>>>):
  %c1 = arith.constant 0 : i32
  %box = fir.load %arg0
  %ele_size = fir.box_elesize %box
  %dims:3 = fir.box_dims %box, %c1
  %num_elements = ...
  %size = arith.muli %ele_size %num_elements
  omp.yield(%size)
} data_addr {
^bb0(%arg0: !fir.ref<!fir.box<!fir.array<?xi32>>>):
  %box = fir.load %arg0
  %addr = fir.box_addr %box
  omp.yield(%addr)
}

Then other lowering routes (e.g. via mlir's memref) would be free to populate these regions differently. I guess there should be some sensible default for types which aren't passed via a descriptor.

For the Fortran case, it is important we use fir.box_elesize because the size might not be known at compile time (I think one example of that is polymorphic types). If it isn't done already, this could easily be canonicalised within flang to a constant for types which do have a size known at compile time. Similarly, a constant size should be generated where possible so that MLIR can fold the whole data_size region into a constant where possible.
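As a rough illustration of what the proposed data_size and data_addr regions are meant to compute, here is a minimal C++ sketch. The `Descriptor` struct and the `dataSize`/`dataAddr` helpers are hypothetical stand-ins invented for this example; they are not flang's real descriptor layout or any existing API.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical, simplified stand-in for a Fortran descriptor ("box").
// The real flang descriptor has a different layout and more fields.
struct Descriptor {
  void *base;                       // address of the first element
  std::size_t elemSize;             // element size in bytes (may be a run-time value)
  std::vector<std::size_t> extents; // one extent per dimension; empty for scalars
};

// Mirrors the idea of the data_size region: element size times element count.
std::size_t dataSize(const Descriptor &box) {
  std::size_t numElements = 1;
  for (std::size_t extent : box.extents)
    numElements *= extent;
  return box.elemSize * numElements;
}

// Mirrors the idea of the data_addr region: where the reduced data lives.
void *dataAddr(const Descriptor &box) { return box.base; }
```

When both the element size and every extent are compile-time constants, the whole computation folds to a constant, which is the case the comment above hopes MLIR can canonicalise.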

I'm mentioning all of this now because there is an issue asking about declare reduction inside of target regions so you might want to consider a design which is future proof. However, if you would rather not do that then I'm happy so long as we have the right error checks to ensure the compiler doesn't crash or produce nonsense code.

Member Author

Sorry for the late reply, I had to take some time off. I think your suggestion would be cleaner than my current approach. The only downside I see is that we would now be leaking both Fortran and GPU-implementation details into the op's definition. But it will be clearer what the op is encapsulating. I will prototype your suggestion and see how it looks.

Contributor

I'm okay with this either way. It is not clear to me that any other front-end is likely to use the data_addr or data_size regions so it may be specific to flang. On the other hand, hard-coding how to access the data in the OMPIRBuilder is not as clean. Prototyping seems like a good idea. Btw, what would the default implementation be for the data_addr region, just a yield %arg0?

Member Author

Btw, what would the default implementation be for the data_addr region, just a yield %arg0?

I think it should be empty. For value reductions, it won't be printed and won't be used by any parts of the code-gen.

Member Author

7da2821 contains an implementation to support scalar allocatables and const-shaped arrays.

Member

@Meinersbur Meinersbur left a comment

Is this implementing the following from the specification?
image

Where is isByRef set to true (instead of forwarding the value)?

const ReductionInfo &RI = En.value();
Value *LHS = RI.Variable;
Type *ValueType = RI.ElementType;
Value *RedValue = RI.Variable;
Member

Suggested change
- Value *RedValue = RI.Variable;
+ Value *LHS = RI.Variable;

Why the rename?

Member Author

To match the naming used in CPU reduction.

I think RedValue gives more context for the code than just LHS.

@ergawy
Member Author

ergawy commented Nov 5, 2025

Thanks for the review Michael.

Is this implementing the following from the specification?

Yes, exactly. The main problem for reduction by reference is that, prior to this PR, we were shuffling (from remote lanes within the same warp or across different warps within the block) pointers/references to the private reduction values rather than the private reduction values themselves.

The same problem is also present for inter-team reduction-value combination for which I have a fix and will open a follow-up PR. This PR only focuses on single-team (i.e. intra- and inter-warp reductions).
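To make the pointer-versus-value distinction concrete, here is a small host-side C++ model of a warp-level tree reduction. It is only a sketch: `kWarpSize`, `Box`, `shuffleDown`, `warpReduceByValue` and `warpReduceByRef` are names made up for this illustration, and the loops simulate lanes sequentially rather than using real GPU shuffle intrinsics. The point is that a lane's private storage is only meaningful to that lane, so each lane must load the element through its own descriptor and exchange the loaded value, never the descriptor or pointer itself.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t kWarpSize = 8; // deliberately small, for illustration only

// Hypothetical stand-in for a descriptor that refers to lane-private storage.
struct Box {
  float *addr;
};

// Models a shuffle-down: lane `lane` reads the value held by lane + offset.
float shuffleDown(const std::array<float, kWarpSize> &lanes, std::size_t lane,
                  std::size_t offset) {
  std::size_t src = lane + offset;
  return src < kWarpSize ? lanes[src] : lanes[lane];
}

// By-value reduction: the reduction values themselves travel between lanes.
float warpReduceByValue(std::array<float, kWarpSize> lanes) {
  for (std::size_t offset = kWarpSize / 2; offset > 0; offset /= 2) {
    std::array<float, kWarpSize> next = lanes;
    for (std::size_t lane = 0; lane + offset < kWarpSize; ++lane)
      next[lane] = lanes[lane] + shuffleDown(lanes, lane, offset);
    lanes = next;
  }
  return lanes[0];
}

// By-ref reduction done the way this PR describes: load through the local
// descriptor first, shuffle the loaded element, then combine locally.
float warpReduceByRef(std::array<Box, kWarpSize> &boxes) {
  for (std::size_t offset = kWarpSize / 2; offset > 0; offset /= 2) {
    std::array<float, kWarpSize> loaded{};
    for (std::size_t lane = 0; lane < kWarpSize; ++lane)
      loaded[lane] = *boxes[lane].addr; // each lane reads only its own storage
    for (std::size_t lane = 0; lane + offset < kWarpSize; ++lane)
      *boxes[lane].addr += shuffleDown(loaded, lane, offset);
  }
  return *boxes[0].addr;
}
```

Shuffling `Box` objects instead of the loaded floats would hand each lane an address into another lane's private memory, which is the failure mode described above.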

Where is isByRef set to true (instead of forwarding the value)?

This is determined by the ReductionProcessor (@tblah please correct me if I am wrong here).

@Meinersbur
Member

Thanks for the review Michael.

Is this implementing the following from the specification?

Yes, exactly.

Isn't scalar_alloc shared in the outer context of the worksharing-loop (e.g. parallel) in the example from your summary, so that part from the specification does not apply? I am not sure which pre-determined data-sharing attribute applies here.

@ergawy ergawy force-pushed the users/ergawy/allocatable_reduction_irbuilder branch from 1fdd4e7 to c02ae7b Compare November 5, 2025 13:20
@ergawy
Member Author

ergawy commented Nov 5, 2025

Isn't scalar_alloc shared in the outer context of the worksharing-loop (e.g. parallel) in the example from your summary, so that part from the specification does not apply?

Right, I think I did not properly understand the quote you shared.

Member

@Meinersbur Meinersbur left a comment

With the explanation

The main problem for reduction by reference is that, prior to this PR, we were shuffling (from remote lanes within the same warp or across different warps within the block) pointers/references to the private reduction values rather than the private reduction values themselves.

this PR makes sense to me. Would be good if it could be added to the summary.

LGTM

Comment on lines 581 to 585
if (!isByRef)
type = valTy;

decl = OpType::create(modBuilder, loc, reductionOpName, type);
if (isByRef)
boxedTy = mlir::TypeAttr::get(fir::unwrapPassByRefType(valTy));
Contributor

@tblah tblah Nov 5, 2025

nit

Suggested change
if (!isByRef)
type = valTy;
decl = OpType::create(modBuilder, loc, reductionOpName, type);
if (isByRef)
boxedTy = mlir::TypeAttr::get(fir::unwrapPassByRefType(valTy));
if (isByRef)
boxedTy = mlir::TypeAttr::get(fir::unwrapPassByRefType(valTy));
else
type = valTy;
@ergawy
Member Author

ergawy commented Nov 6, 2025

If I understand correctly, any FIR type in the attribute will need to be convertible to an LLVM (MLIR dialect) type. Is that always the case?

I might be missing something, but for which types would that not be the case? The attribute stores the actual type on which the reduction operator will apply, so I think it should be convertible; otherwise that would mean the compiler does not support by-value reductions on that type.

Does this have any error checking to catch the unsupported cases (e.g. dynamically sized arrays)?

No, I will try to do that.

How does the shuffling work for derived types?

Shuffling reinterprets the data as bytes and moves the size of a type regardless of what the type originally was (see: https://github.com/llvm/llvm-project/blob/main/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp#L2454).
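A minimal host-side C++ sketch of that idea follows: an element of arbitrary type is treated as raw bytes and moved in progressively smaller integer-sized chunks, each of which models one shuffle. The function name `shuffleBytes` and the 8/4/2/1-byte chunking are assumptions made for this illustration; the linked OMPIRBuilder source is the reference for what the compiler actually emits.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Copies `size` bytes from `src` to `dst` in chunks of 8, 4, 2 and then 1
// bytes. Each chunk stands in for one integer-typed shuffle between lanes.
// Returns the number of chunk transfers that were needed.
std::size_t shuffleBytes(const void *src, void *dst, std::size_t size) {
  const auto *in = static_cast<const unsigned char *>(src);
  auto *out = static_cast<unsigned char *>(dst);
  std::size_t transfers = 0;
  for (std::size_t chunk = 8; chunk >= 1; chunk /= 2) {
    while (size >= chunk) {
      std::uint64_t reg = 0;         // models the register that is shuffled
      std::memcpy(&reg, in, chunk);  // load `chunk` bytes from the source lane
      std::memcpy(out, &reg, chunk); // store them in the destination lane
      in += chunk;
      out += chunk;
      size -= chunk;
      ++transfers;
    }
  }
  return transfers;
}
```

Because only the byte count matters, a derived type is moved the same way as a scalar, provided its size is known when the copy loop is generated.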

@tblah
Contributor

tblah commented Nov 6, 2025

If I understand correctly, any FIR type in the attribute will need to be convertible to an LLVM (MLIR dialect) type. Is that always the case?

I might be missing something, but for which types would that not be the case? The attribute stores the actual type on which the reduction operator will apply, so I think it should be convertible; otherwise that would mean the compiler does not support by-value reductions on that type.

I think some time in the past, translation of a literal box (not a pointer to a box) from fir into LLVMIR (dialect) wasn't implemented. Boxes are weird because they have a different size depending on the rank (possibly only known at runtime) and whether the addendum is required.

It may now be implemented. If you already tried all of the weirder Fortran types and didn't crash the compiler then that's okay.

Shuffling reinterprets the data as bytes and moves the size of a type regardless of what the type original was (see: https://github.com/llvm/llvm-project/blob/main/llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp#L2454).

Thanks for the explanation. I suspect that implementation won't work correctly in cases where the element size is only known dynamically. That's okay so long as we catch these cases and refuse to compile it instead of crashing the compiler or generating broken code.

@ergawy ergawy force-pushed the users/ergawy/allocatable_reduction_irbuilder branch from c02ae7b to 7da2821 Compare November 24, 2025 11:25
Contributor

@tblah tblah left a comment

LGTM so long as the people who understand GPUs are happy too :)

I would suggest restricting the new data pointer region to only a single block, but that's up to you.

@ergawy ergawy force-pushed the users/ergawy/allocatable_reduction_irbuilder branch from 30b6e73 to d92d501 Compare November 25, 2025 05:08
Contributor

@bhandarkar-pranav bhandarkar-pranav left a comment

Thank you for the PR, @ergawy. I had some comments in flight that I decided to publish before I pick up the rest of the review first thing tomorrow. Sorry about the delay.

mlir::OpBuilder modBuilder(module.getBodyRegion());
mlir::Type valTy = fir::unwrapRefType(type);
// For by-ref reductions, we want to keep track of the
// boxed/referenced/allocated type. For example, a for `real, allocatable`
Contributor

sed 's/a for/for a/'

Member Author

Done

be executed after the reduction has completed.
6. The DataPtrPtr region specifies how to access the base address of a
boxed-value. This is used, in particular, for GPU reductions in order
know where partial reduction resutls are stored in remote lanes.
Contributor

sed s/resutls/results/

Also, at the top of this numbered list there is a line that says

This requires two mandatory and three optional regions. 

It should be updated to say 4 optional regions

Member Author

Deon 😛

@@ -1017,6 +1020,31 @@ makeAtomicReductionGen(omp::DeclareReductionOp decl,
return atomicGen;
}

static OwningDataPtrPtrReductionGen
Contributor

For completeness, could you please add a comment here about this function just like the other makexxxGen functions above?

Member Author

Done

@ergawy ergawy force-pushed the users/ergawy/allocatable_reduction_irbuilder branch from d92d501 to 98751f2 Compare November 25, 2025 06:35
Adds initial support for GPU by-ref reductions. In particular, this diff adds support for reductions on scalar allocatables where reductions happen on loops nested in `target` regions. For example:

```fortran
  integer :: i
  real, allocatable :: scalar_alloc

  allocate(scalar_alloc)
  scalar_alloc = 0

  !$omp target map(tofrom: scalar_alloc)
  !$omp parallel do reduction(+: scalar_alloc)
  do i = 1, 1000000
    scalar_alloc = scalar_alloc + 1
  end do
  !$omp end target
```

This PR supports by-ref reductions on the intra- and inter-warp levels.

So far, there are still steps to be taken for full support of by-ref reductions, for example:
* Inter-block value combination is still not supported. Therefore, `target teams distribute parallel do` is still not supported.
* Support for dynamically-sized arrays still needs to be added.
* Support for more than one allocatable/array on the same `reduction` clause.
@ergawy ergawy force-pushed the users/ergawy/allocatable_reduction_irbuilder branch from 98751f2 to 8f2acf6 Compare November 25, 2025 11:11
Contributor

@jsjodin jsjodin left a comment

LGTM

Contributor

@bhandarkar-pranav bhandarkar-pranav left a comment

LGTM. Ignore the nit if you don't have any other changes to make. Thank you for this PR.

contains the value of the thread-local reduction accumulator. This will
be executed after the reduction has completed.
6. The DataPtrPtr region specifies how to access the base address of a
boxed-value. This is used, in particular, for GPU reductions in order
Contributor

< nit >: in order to know
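
As a rough illustration of what the DataPtrPtr region is for, the following host-side C++ sketch models a boxed value as a descriptor holding a base address. It is not the MLIR op or the generated IR: `Box`, `dataPtrPtr`, `reduceLanes`, and `sumFourLanes` are hypothetical names, lanes are simulated with a plain loop, and the reduction is fixed to `+` on `float`. The point it shows is that the reduction follows the base address to fetch each lane's partial value instead of moving the pointer itself.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical, heavily simplified descriptor ("box") for a scalar
// allocatable: it only records where the data lives and how large it is.
struct Box {
  void *baseAddr;       // address of the private reduction value
  std::size_t elemSize; // size of that value in bytes
};

// Plays the role of the DataPtrPtr region in this sketch: given a box,
// return the address of the field that holds the base address.
inline void **dataPtrPtr(Box *box) {
  assert(box != nullptr);
  return &box->baseAddr;
}

// Simulated by-ref "+" reduction: every lane owns a box that points at its
// private partial result. The loop stands in for the cross-lane exchange.
inline float reduceLanes(Box *laneBoxes, int numLanes) {
  float total = 0.0f;
  for (int lane = 0; lane < numLanes; ++lane) {
    // Locate the remote lane's partial result through its box...
    void *remoteData = *dataPtrPtr(&laneBoxes[lane]);
    // ...and copy the value itself rather than the pointer to it.
    float partial;
    std::memcpy(&partial, remoteData, sizeof(float));
    total += partial;
  }
  return total;
}

// Small driver: four lanes holding partial sums 1, 2, 3, and 4.
inline float sumFourLanes() {
  float partials[4] = {1.0f, 2.0f, 3.0f, 4.0f};
  Box boxes[4];
  for (int i = 0; i < 4; ++i)
    boxes[i] = Box{&partials[i], sizeof(float)};
  return reduceLanes(boxes, 4);
}
```

On a real device a pointer copied from another lane would refer to that lane's private storage, which is why the value behind the base address has to be transferred.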

@ergawy ergawy merged commit f481f5b into main Nov 26, 2025
11 checks passed
@ergawy ergawy deleted the users/ergawy/allocatable_reduction_irbuilder branch November 26, 2025 10:59
@llvm-ci
Collaborator

llvm-ci commented Nov 26, 2025

LLVM Buildbot has detected a new failure on builder ppc64le-mlir-rhel-clang running on ppc64le-mlir-rhel-test while building clang,flang,llvm,mlir at step 3 "clean-build-dir".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/129/builds/33900

Here is the relevant piece of the build log for the reference
Step 3 (clean-build-dir) failure: Delete failed. (failure) (timed out)
Step 6 (test-build-check-mlir-build-only-check-mlir) failure: 1200 seconds without output running [b'ninja', b'check-mlir'], attempting to kill
...
PASS: MLIR-Unit :: Interfaces/./MLIRInterfacesTests/11/22 (3681 of 3692)
PASS: MLIR :: Bytecode/invalid/invalid-structure.mlir (3682 of 3692)
PASS: MLIR-Unit :: IR/./MLIRIRTests/0/130 (3683 of 3692)
PASS: MLIR :: mlir-tblgen/rewriter-errors.td (3684 of 3692)
PASS: MLIR-Unit :: IR/./MLIRIRTests/38/130 (3685 of 3692)
PASS: MLIR-Unit :: IR/./MLIRIRTests/39/130 (3686 of 3692)
PASS: MLIR :: Pass/ir-printing.mlir (3687 of 3692)
PASS: MLIR :: Pass/pipeline-parsing.mlir (3688 of 3692)
PASS: MLIR :: mlir-tblgen/llvm-intrinsics.td (3689 of 3692)
PASS: MLIR :: mlir-reduce/dce-test.mlir (3690 of 3692)
command timed out: 1200 seconds without output running [b'ninja', b'check-mlir'], attempting to kill
process killed by signal 9
program finished with exit code -1
elapsedTime=2284.089756
@ergawy
Member Author

ergawy commented Nov 26, 2025

The reported problem does not seem to be relevant to the PR changes. Anyone, please let me know if I missed something.

ergawy added a commit that referenced this pull request Nov 26, 2025
Extends the work started in #165714 by supporting team reductions. Similar to what was done in #165714, this PR introduces proper allocations, loads, and stores for by-ref reductions in teams-related callbacks:
* `_omp_reduction_list_to_global_copy_func`,
* `_omp_reduction_list_to_global_reduce_func`,
* `_omp_reduction_global_to_list_copy_func`, and
* `_omp_reduction_global_to_list_reduce_func`.
tanji-dg pushed a commit to tanji-dg/llvm-project that referenced this pull request Nov 27, 2025
…lvm#165714)

Adds initial support for GPU by-ref reductions. The main problem for reduction by reference is that, prior to this PR, we were shuffling (from remote lanes within the same warp or across different warps within the block) pointers/references to the private reduction values rather than the private reduction values themselves.

In particular, this diff adds support for reductions on scalar allocatables where reductions happen on loops nested in `target` regions. For example:

```fortran
  integer :: i
  real, allocatable :: scalar_alloc

  allocate(scalar_alloc)
  scalar_alloc = 0

  !$omp target map(tofrom: scalar_alloc)
  !$omp parallel do reduction(+: scalar_alloc)
  do i = 1, 1000000
    scalar_alloc = scalar_alloc + 1
  end do
  !$omp end target
```

This PR supports by-ref reductions on the intra- and inter-warp levels.

So far, there are still steps to be taken for full support of by-ref reductions, for example:
* Inter-block value combination is still not supported. Therefore, `target teams distribute parallel do` is still not supported.
* Support for dynamically-sized arrays still needs to be added.
* Support for more than one allocatable/array on the same `reduction` clause.
GeneraluseAI pushed a commit to GeneraluseAI/llvm-project that referenced this pull request Nov 27, 2025
ergawy added a commit that referenced this pull request Nov 27, 2025
ergawy added a commit to ergawy/aomp that referenced this pull request Nov 28, 2025
… static arrays

Adds smoke tests for the changes introduced in upstream PR: llvm/llvm-project#165714. These tests cannot be added as offloading tests upstream (yet) since they need to link with flang's device RT lib.
ergawy added a commit to ROCm/aomp that referenced this pull request Dec 1, 2025
… static arrays (#1718)

* [smoke-fort][OpenMP] Add tests for GPU reductions on allocatables and static arrays

  Adds smoke tests for the changes introduced in upstream PR: llvm/llvm-project#165714. These tests cannot be added as offloading tests upstream (yet) since they need to link with flang's device RT lib.

* review comments, Michael
