Consider the following kernel, which reduces along the rows of a 2-D matrix:
```julia
using CUDA

"""out = sum(x, dims=2)"""
function row_sum!(x, ncol, out)
    # Global 1-D thread index, used as the row index (one thread per row)
    row_idx = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    for i = 1:ncol
        @inbounds out[row_idx] += x[row_idx, i]
    end
    return
end

N = 1024
x = CUDA.rand(Float64, N, 2*N)
out = CUDA.zeros(Float64, N)
# 256 threads × 4 blocks = 1024 threads, one per row
@cuda threads=256 blocks=4 row_sum!(x, size(x)[2], out)
isapprox(out, sum(x, dims=2)) # true
```

How do I write a similar kernel that reduces along the columns of a 2-D matrix instead? In particular, how do I get the index of each column, similar to how we got the index of each row with `row_idx`?
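For concreteness, here is a minimal sketch of what a column-wise analogue might look like, assuming the same 1-D launch layout as above. The name `col_sum!`, the `nrow` argument, and the bounds guard are illustrative assumptions, not part of the original kernel. The idea is that the global thread index is just a number, so it can be read as a column index as long as the launch provides at least one thread per column.

```julia
using CUDA

# Hypothetical column-wise counterpart of row_sum!: one thread per column.
function col_sum!(x, nrow, out)
    # Same global 1-D thread index as in row_sum!, now read as a column index.
    col_idx = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    # Guard in case the launch has more threads than there are columns.
    if col_idx <= length(out)
        for i = 1:nrow
            @inbounds out[col_idx] += x[i, col_idx]
        end
    end
    return
end

N = 1024
x = CUDA.rand(Float64, N, 2 * N)
out = CUDA.zeros(Float64, 2 * N)    # one entry per column

# 2N columns need at least 2N threads; cld rounds the block count up.
threads = 256
blocks = cld(size(x, 2), threads)
@cuda threads=threads blocks=blocks col_sum!(x, size(x, 1), out)

# sum(x, dims=1) is a 1×2N matrix, so flatten it before comparing.
isapprox(out, vec(sum(x, dims=1)))
```

Note that Julia arrays are column-major, so in this sketch neighbouring threads read elements that are `nrow` apart in memory, whereas in `row_sum!` neighbouring threads read adjacent elements. That may matter for performance, though it does not affect correctness.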