Julia CUDA - Reduce matrix columns

Question

Consider the following kernel, which reduces along the rows of a 2-D matrix

function row_sum!(x, ncol, out) """out = sum(x, dims=2)""" row_idx = (blockIdx().x-1) * blockDim().x + threadIdx().x for i = 1:ncol @inbounds out[row_idx] += x[row_idx, i] end return end N = 1024 x = CUDA.rand(Float64, N, 2*N) out = CUDA.zeros(Float64, N) @cuda threads=256 blocks=4 row_sum!(x, size(x)[2], out) isapprox(out, sum(x, dims=2)) # true

How do I write a similar kernel except for reducing along the columns (of a 2-D matrix)? In particular, how do I get the index of each column, similar to how we got the index of each row with row_idx?

Przemyslaw Szufel · Accepted Answer · 2022-01-21 18:57:27Z

Here is the code:

function col_sum!(x, nrow, out) """out = sum(x, dims=1)""" col_idx = (blockIdx().x-1) * blockDim().x + threadIdx().x for i = 1:nrow @inbounds out[col_idx] += x[i, col_idx] end return end N = 1024 x = CUDA.rand(Float64, N, 2N) out = CUDA.zeros(Float64, 2N) @cuda threads=256 blocks=8 col_sum!(x, size(x, 1), out)

And here is the test:

julia> isapprox(out, vec(sum(x, dims=1))) true

As you can see the size of the result vector is now 2N instead of N, hence we had to adapt the number of blocks accordingly (that is multiply by 2 and now we have 8 instead of 4)

More materials can be found here: https://juliagpu.gitlab.io/CUDA.jl/tutorials/introduction/

Awesome answer. I didn't realize that the thread indexing code could be the exact same, and changing just the block size gets us the desired behavior. Thanks!
You have blockIdx().y and blockIdx().z but these are for multi-dimensional blocks. Here the block remains one-dimensional so almost no change in the code.

Collectives™ on Stack Overflow

Julia CUDA - Reduce matrix columns

1 Answer 1

2 Comments

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

2 Comments

Related