10

I have a float array that needs to be referenced many times on the device, so I believe the best place to store it is in __ constant __ memory (using this reference). The array (or vector) will need to be written once at run-time when initializing, but read by multiple different functions many millions of times, so constant copying to the kernel each function call seems like A Bad Idea.

const int n = 32; __constant__ float dev_x[n]; //the array in question struct struct_max : public thrust::unary_function<float,float> { float C; struct_max(float _C) : C(_C) {} __host__ __device__ float operator()(const float& x) const { return fmax(x,C);} }; void foo(const thrust::host_vector<float> &, const float &); int main() { thrust::host_vector<float> x(n); //magic happens populate x cudaMemcpyToSymbol(dev_x,x.data(),n*sizeof(float)); foo(x,0.0); return(0); } void foo(const thrust::host_vector<float> &input_host_x, const float &x0) { thrust::device_vector<float> dev_sol(n); thrust::host_vector<float> host_sol(n); //this method works fine, but the memory transfer is unacceptable thrust::device_vector<float> input_dev_vec(n); input_dev_vec = input_host_x; //I want to avoid this thrust::transform(input_dev_vec.begin(),input_dev_vec.end(),dev_sol.begin(),struct_max(x0)); host_sol = dev_sol; //this memory transfer for debugging //this method compiles fine, but crashes at runtime thrust::device_ptr<float> dev_ptr = thrust::device_pointer_cast(dev_x); thrust::transform(dev_ptr,dev_ptr+n,dev_sol.begin(),struct_max(x0)); host_sol = dev_sol; //this line crashes } 

I tried adding a global thrust::device_vector dev_x(n), but that also crashed at run-time, and would be in __ global __ memory rather than __ constant__ memory

This can all be made to work if I just discard the thrust library, but is there a way to use the thrust library with globals and device constant memory?

1 Answer 1

10

Good question! You can't cast a __constant__ array as if it's a regular device pointer.

I will answer your question (after the line below), but first: this is a bad use of __constant__, and it isn't really what you want. The constant cache in CUDA is optimized for uniform access across threads in a warp. That means all threads in the warp access the same location at the same time. If each thread of the warp accesses a different constant memory location, then the accesses get serialized. So your access pattern, where consecutive threads access consecutive memory locations, will be 32 times slower than a uniform access. You should really just use device memory. If you need to write the data once, but read it many times, then just use a device_vector: initialize it once, and then read it many times.


To do what you asked, you can use a thrust::counting_iterator as the input to thrust::transform to generate a range of indices into your __constant__ array. Then your functor's operator() takes an int index operand rather than a float value operand, and does the lookup into constant memory.

(Note that this means your functor is now __device__ code only. You could easily overload the operator to take a float and call it differently on host data if you need portability.)

I modified your example to initialize the data and print the result to verify that it is correct.

#include <stdio.h> #include <stdlib.h> #include <thrust/device_vector.h> #include <thrust/host_vector.h> #include <thrust/iterator/counting_iterator.h> const int n = 32; __constant__ float dev_x[n]; //the array in question struct struct_max : public thrust::unary_function<float,float> { float C; struct_max(float _C) : C(_C) {} // only works as a device function __device__ float operator()(const int& i) const { // use index into constant array return fmax(dev_x[i],C); } }; void foo(const thrust::host_vector<float> &input_host_x, const float &x0) { thrust::device_vector<float> dev_sol(n); thrust::host_vector<float> host_sol(n); thrust::device_ptr<float> dev_ptr = thrust::device_pointer_cast(dev_x); thrust::transform(thrust::make_counting_iterator(0), thrust::make_counting_iterator(n), dev_sol.begin(), struct_max(x0)); host_sol = dev_sol; //this line crashes for (int i = 0; i < n; i++) printf("%f\n", host_sol[i]); } int main() { thrust::host_vector<float> x(n); //magic happens populate x for (int i = 0; i < n; i++) x[i] = rand() / (float)RAND_MAX; cudaMemcpyToSymbol(dev_x,x.data(),n*sizeof(float)); foo(x, 0.5); return(0); } 
Sign up to request clarification or add additional context in comments.

4 Comments

thanks for your help! The vector will be a power of 2 elements long, probably >=8096, so I'll drop the idea of using __ constant __ memory
If I change to a global device_vector and reference that, I get a crash at run-time (well, debug run-time time) Can I add a global device_vector or does it need to be declared in main() and passed by reference?
Power of 2 or size is not the reason not to use __constant__ here -- it's as I said: yours is not the type of memory access pattern for which __constant__ is optimized. Regarding your crash: why make it a global? The problem I see with making it global is that you would not be able to create the array with a size determined at runtime, because the constructor would be called before main(). There are also tricky issues with the order of construction of globals across compilation units. Generally I would create it in a function and pass it by reference.
@harrism Please, can you clarify? You said: "The constant cache in CUDA is optimized for uniform access across threads in a warp. That means all threads in the warp access the same location at the same time." - does it mean that if I use random access to __constant__ memory then it will not have any advantage compared with the global memory allocated by cudaMemalloc()? But how can I speedup memory access in this case, should I use LDG load? on-demand.gputechconf.com/gtc/2013/presentations/…

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.