
Conversation

@charithaintc
Contributor

@charithaintc charithaintc commented Jul 24, 2025

This example fails in subgroup distribution:

```mlir
gpu.module @test {
  gpu.func @check_update_nd_offset_distributed_tensor_desc() {
    %c32 = arith.constant 32 : index
    %cst = arith.constant {layout_result_0 = #xegpu.layout<lane_layout = [1, 16], lane_data = [1, 1]>} dense<1.000000e+00> : vector<16x16xf32>
    %0 = "some_op"() : () -> !xegpu.tensor_desc<16x16xf32, #xegpu.layout<lane_layout = [1, 16], lane_data = [1, 1]>>
    %1 = xegpu.update_nd_offset %0, [%c32, %c32] : !xegpu.tensor_desc<16x16xf32, #xegpu.layout<lane_layout = [1, 16], lane_data = [1, 1]>>
    xegpu.store_nd %cst, %1 : vector<16x16xf32>, !xegpu.tensor_desc<16x16xf32, #xegpu.layout<lane_layout = [1, 16], lane_data = [1, 1]>>
    gpu.return
  }
}
```

With this error:

```
within split at test.mlir:108 offset :3:3: error: 'gpu.warp_execute_on_lane_0' op expected vector type for distributed operands.
gpu.func @check_update_nd_offset_distributed_tensor_desc() {
^
within split at test.mlir:108 offset :3:3: note: see current operation:
%3 = "gpu.warp_execute_on_lane_0"(%2) <{warp_size = 16 : i64}> ({
  %5 = "some_op"() : () -> !xegpu.tensor_desc<16x16xf32, #xegpu.layout<lane_layout = [1, 16], lane_data = [1, 1]>>
  "gpu.yield"(%5) : (!xegpu.tensor_desc<16x16xf32, #xegpu.layout<lane_layout = [1, 16], lane_data = [1, 1]>>) -> ()
}) : (index) -> !xegpu.tensor_desc<16x16xf32>
```

The reason is that the UpdateNdOffset source operand did not retain its layout when yielded by the warp op. The warp_execute_on_lane_0 op expects the TensorDesc type to remain unchanged when it is distributed out of its region, so we use UnrealizedConversionCast ops to reconcile this mismatch outside the warp op (via resolveDistributedTy).
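After the fix, the distributed IR should look roughly like the sketch below (SSA names are illustrative, adapted from the new test's CHECK lines): the warp op keeps yielding the original layout-carrying TensorDesc type, and an unrealized cast outside the region resolves it to the layout-free distributed type before the new update op consumes it.

```mlir
// Sketch of the expected distributed IR (names are illustrative).
%w = gpu.warp_execute_on_lane_0(%laneid)[16]
    -> (!xegpu.tensor_desc<16x16xf32, #xegpu.layout<lane_layout = [1, 16], lane_data = [1, 1]>>) {
  %t = "some_op"() : () -> !xegpu.tensor_desc<16x16xf32, #xegpu.layout<lane_layout = [1, 16], lane_data = [1, 1]>>
  // Yielded type is unchanged inside the region, as the warp op requires.
  gpu.yield %t : !xegpu.tensor_desc<16x16xf32, #xegpu.layout<lane_layout = [1, 16], lane_data = [1, 1]>>
}
// resolveDistributedTy reconciles the type mismatch outside the warp op.
%cast = builtin.unrealized_conversion_cast %w
    : !xegpu.tensor_desc<16x16xf32, #xegpu.layout<lane_layout = [1, 16], lane_data = [1, 1]>>
      to !xegpu.tensor_desc<16x16xf32>
// The new update op is created with the layout-free distributed type.
%new = xegpu.update_nd_offset %cast, [%c32, %c32] : !xegpu.tensor_desc<16x16xf32>
```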

@llvmbot
Member

llvmbot commented Jul 24, 2025

@llvm/pr-subscribers-mlir

Author: Charitha Saumya (charithaintc)

Changes

Full diff: https://github.com/llvm/llvm-project/pull/150545.diff

2 Files Affected:

  • (modified) mlir/lib/Dialect/XeGPU/Transforms/XeGPUSubgroupDistribute.cpp (+22-40)
  • (modified) mlir/test/Dialect/XeGPU/subgroup-distribute.mlir (+23-1)
```diff
diff --git a/mlir/lib/Dialect/XeGPU/Transforms/XeGPUSubgroupDistribute.cpp b/mlir/lib/Dialect/XeGPU/Transforms/XeGPUSubgroupDistribute.cpp
index 8957ea5399ea2..2088c3c7fc5ec 100644
--- a/mlir/lib/Dialect/XeGPU/Transforms/XeGPUSubgroupDistribute.cpp
+++ b/mlir/lib/Dialect/XeGPU/Transforms/XeGPUSubgroupDistribute.cpp
@@ -277,22 +277,13 @@ struct CreateNdDescDistribution final : public gpu::WarpDistributionPattern {
           descOp, "the tensor descriptor lacks layout attribute");
 
     SmallVector<size_t> newRetIndices;
-    SmallVector<Value> newYieldValues;
-    SmallVector<Type> newYieldTypes;
-
-    for (Value operand : descOp->getOperands()) {
-      newYieldValues.push_back(operand);
-      newYieldTypes.push_back(operand.getType());
-    }
     rewriter.setInsertionPoint(warpOp);
     gpu::WarpExecuteOnLane0Op newWarpOp = moveRegionToNewWarpOpAndAppendReturns(
-        rewriter, warpOp, /* new yieled values = */ newYieldValues,
-        /* new yielded types = */ newYieldTypes, newRetIndices);
+        rewriter, warpOp, /* new yieled values = */ descOp->getOperands(),
+        /* new yielded types = */ descOp.getOperandTypes(), newRetIndices);
-    SmallVector<Value> newDescOperands;
-    for (size_t i : newRetIndices) {
-      newDescOperands.push_back(newWarpOp.getResult(i));
-    }
+    SmallVector<Value> newDescOperands = llvm::map_to_vector(
+        newRetIndices, [&](size_t i) { return newWarpOp.getResult(i); });
     rewriter.setInsertionPointAfter(newWarpOp);
     xegpu::TensorDescType distributedTensorDescTy =
         descOp.getType().dropLayouts(); // Distributed tensor descriptor type
@@ -696,39 +687,30 @@ struct UpdateNdOffsetDistribution final : public gpu::WarpDistributionPattern {
           warpOp, "warp result is not a xegpu::UpdateNdOffset op");
     auto updateOp = operand->get().getDefiningOp<xegpu::UpdateNdOffsetOp>();
     unsigned operandIdx = operand->getOperandNumber();
-    // new update op does not have layout attribute.
-    xegpu::TensorDescType newTensorDescTy =
-        updateOp.getTensorDescType().dropLayouts();
-    SmallVector<Value, 3> newYieldValues;
-    SmallVector<Type, 3> newYieldTypes;
-    for (Value operand : updateOp->getOperands()) {
-      newYieldValues.push_back(operand);
-      if (isa<xegpu::TensorDescType>(operand.getType())) {
-        newYieldTypes.push_back(newTensorDescTy);
-      } else {
-        newYieldTypes.push_back(operand.getType());
-      }
-    }
     SmallVector<size_t> newRetIndices;
     gpu::WarpExecuteOnLane0Op newWarpOp = moveRegionToNewWarpOpAndAppendReturns(
-        rewriter, warpOp, newYieldValues, newYieldTypes, newRetIndices);
+        rewriter, warpOp, updateOp->getOperands(), updateOp.getOperandTypes(),
+        newRetIndices);
     rewriter.setInsertionPointAfter(newWarpOp);
-    SmallVector<Value> newUpdateOperands;
-    for (size_t i : newRetIndices) {
-      // For the tensor descriptor operand, the layout attribute is dropped
-      // after distribution. Types needs to be resolved in this case.
-      if (isa<xegpu::TensorDescType>(newWarpOp.getResult(i).getType())) {
-        newUpdateOperands.push_back(resolveDistributedTy(
-            newWarpOp.getResult(i), newTensorDescTy, rewriter));
-      } else {
-        newUpdateOperands.push_back(newWarpOp.getResult(i));
-      }
-    }
+    // new update op does not have layout attribute.
+    xegpu::TensorDescType distributedTensorDescTy =
+        updateOp.getTensorDescType().dropLayouts();
+    SmallVector<Value> newUpdateOperands =
+        llvm::map_to_vector(newRetIndices, [&](size_t i) {
+          // For the tensor descriptor operand, the layout attribute is
+          // dropped after distribution. Types needs to be resolved in this
+          // case.
+          if (isa<xegpu::TensorDescType>(newWarpOp.getResult(i).getType())) {
+            return resolveDistributedTy(newWarpOp.getResult(i),
+                                        distributedTensorDescTy, rewriter);
+          }
+          return newWarpOp.getResult(i);
+        });
     // Create a new update op outside the warp op.
     auto newUpdateOp = xegpu::UpdateNdOffsetOp::create(
-        rewriter, newWarpOp.getLoc(), newTensorDescTy, newUpdateOperands,
-        updateOp->getAttrs());
+        rewriter, newWarpOp.getLoc(), distributedTensorDescTy,
+        newUpdateOperands, updateOp->getAttrs());
     xegpu::removeLayoutAttrs(newUpdateOp);
     Value distributedVal = newWarpOp.getResult(operandIdx);
     // Resolve the distributed type with the original type.
diff --git a/mlir/test/Dialect/XeGPU/subgroup-distribute.mlir b/mlir/test/Dialect/XeGPU/subgroup-distribute.mlir
index e78ae4a17710b..4bfa797e2a9b3 100644
--- a/mlir/test/Dialect/XeGPU/subgroup-distribute.mlir
+++ b/mlir/test/Dialect/XeGPU/subgroup-distribute.mlir
@@ -1,4 +1,4 @@
-// RUN: mlir-opt -xegpu-subgroup-distribute -canonicalize -cse -split-input-file %s | FileCheck %s
+// RUN: mlir-opt -xegpu-subgroup-distribute -allow-unregistered-dialect -canonicalize -cse -split-input-file %s | FileCheck %s
 
 // CHECK-LABEL: gpu.func @store_nd_1d
 // CHECK: (%[[ARG0:[0-9a-zA-Z]+]]: memref<16xf32>) {
@@ -265,6 +265,28 @@ gpu.module @test {
   }
 }
 
+// -----
+// Explicitly check that update_nd_offset distributed tensor descriptor retains the layouts.
+// CHECK-LABEL: gpu.func @check_update_nd_offset_distributed_tensor_desc
+// CHECK: %[[W:.*]] = gpu.warp_execute_on_lane_0(%{{.*}})[16] ->
+// CHECK-SAME: (!xegpu.tensor_desc<16x16xf32, #xegpu.layout<lane_layout = [1, 16], lane_data = [1, 1]>>) {
+// CHECK: %[[T0:.*]] = "some_op"() : () -> !xegpu.tensor_desc<16x16xf32, #xegpu.layout<lane_layout = [1, 16], lane_data = [1, 1]>>
+// CHECK: gpu.yield %[[T0]] : !xegpu.tensor_desc<16x16xf32, #xegpu.layout<lane_layout = [1, 16], lane_data = [1, 1]>>
+// CHECK: }
+// CHECK: %[[T1:.*]] = builtin.unrealized_conversion_cast %[[W]] :
+// CHECK-SAME: !xegpu.tensor_desc<16x16xf32, #xegpu.layout<lane_layout = [1, 16], lane_data = [1, 1]>> to !xegpu.tensor_desc<16x16xf32> {resolve_simt_type_mismatch}
+// CHECK: xegpu.update_nd_offset %[[T1]], [%{{.*}}] : !xegpu.tensor_desc<16x16xf32>
+gpu.module @test {
+  gpu.func @check_update_nd_offset_distributed_tensor_desc() {
+    %c32 = arith.constant 32 : index
+    %cst = arith.constant {layout_result_0 = #xegpu.layout<lane_layout = [1, 16], lane_data = [1, 1]>} dense<1.000000e+00> : vector<16x16xf32>
+    %0 = "some_op"() : () -> !xegpu.tensor_desc<16x16xf32, #xegpu.layout<lane_layout = [1, 16], lane_data = [1, 1]>>
+    %1 = xegpu.update_nd_offset %0, [%c32, %c32] : !xegpu.tensor_desc<16x16xf32, #xegpu.layout<lane_layout = [1, 16], lane_data = [1, 1]>>
+    xegpu.store_nd %cst, %1 : vector<16x16xf32>, !xegpu.tensor_desc<16x16xf32, #xegpu.layout<lane_layout = [1, 16], lane_data = [1, 1]>>
+    gpu.return
+  }
+}
+
 // -----
 // CHECK-LABEL: gpu.func @prefetch_1d
 // CHECK: (%[[ARG0:[0-9a-zA-Z]+]]: memref<256xf16>) {
```
@llvmbot
Member

llvmbot commented Jul 24, 2025

@llvm/pr-subscribers-mlir-gpu

Author: Charitha Saumya (charithaintc)

(Same changes and full diff as in the comment above.)
@charithaintc charithaintc requested a review from silee2 July 24, 2025 23:18
@charithaintc
Contributor Author

Hi @adam-smnk, can you please take a look?

@silee2
Contributor

silee2 commented Aug 4, 2025

Can you add a short description including:

  • nature of the bug
  • how the fix works
@charithaintc
Contributor Author

Can you add a short description including:

  • nature of the bug
  • how the fix works

added a description.

@adam-smnk
Contributor

Nit on the description: you can skip the code snippets. The reproducer is now a test, and honestly the produced error is confusing.
A pure text description, like the sentences at the end, should be enough here.

Contributor

@adam-smnk adam-smnk left a comment


Fix itself looks fine 👍

AFAIK, the issue here was pretty generic as in type mismatch occurred during partial distribution.
Won't this problem occur for any other ops?

@silee2
Contributor

silee2 commented Aug 5, 2025

Thanks for the description.

@charithaintc
Contributor Author

Fix itself looks fine 👍

AFAIK, the issue here was pretty generic as in type mismatch occurred during partial distribution. Won't this problem occur for any other ops?

Other ops seem to be handled correctly. I made a mistake when constructing the UpdateNd op.

@charithaintc charithaintc merged commit 06884d0 into llvm:main Aug 5, 2025
7 of 9 checks passed
