4

Consider the following kernel, which reduces along the rows of a 2-D matrix

function row_sum!(x, ncol, out) """out = sum(x, dims=2)""" row_idx = (blockIdx().x-1) * blockDim().x + threadIdx().x for i = 1:ncol @inbounds out[row_idx] += x[row_idx, i] end return end N = 1024 x = CUDA.rand(Float64, N, 2*N) out = CUDA.zeros(Float64, N) @cuda threads=256 blocks=4 row_sum!(x, size(x)[2], out) isapprox(out, sum(x, dims=2)) # true 

How do I write a similar kernel except for reducing along the columns (of a 2-D matrix)? In particular, how do I get the index of each column, similar to how we got the index of each row with row_idx?

1 Answer 1

3

Here is the code:

function col_sum!(x, nrow, out) """out = sum(x, dims=1)""" col_idx = (blockIdx().x-1) * blockDim().x + threadIdx().x for i = 1:nrow @inbounds out[col_idx] += x[i, col_idx] end return end N = 1024 x = CUDA.rand(Float64, N, 2N) out = CUDA.zeros(Float64, 2N) @cuda threads=256 blocks=8 col_sum!(x, size(x, 1), out) 

And here is the test:

julia> isapprox(out, vec(sum(x, dims=1))) true 

As you can see the size of the result vector is now 2N instead of N, hence we had to adapt the number of blocks accordingly (that is multiply by 2 and now we have 8 instead of 4)

More materials can be found here: https://juliagpu.gitlab.io/CUDA.jl/tutorials/introduction/

Sign up to request clarification or add additional context in comments.

2 Comments

Awesome answer. I didn't realize that the thread indexing code could be the exact same, and changing just the block size gets us the desired behavior. Thanks!
You have blockIdx().y and blockIdx().z but these are for multi-dimensional blocks. Here the block remains one-dimensional so almost no change in the code.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.