4

My colleague came across >>();++++cudaDeviceSynchronize();++++my_print("host",+a[0]);++++return+0;}'),l:'5',n:'0',o:'CUDA+C+++source+#1',t:'0')),k:53_9311241065627,l:'4',n:'0',o:'',s:0,t:'0'),(g:!((g:!((h:compiler,i:(compiler:nvcc128u1,filters:(b:'0',binary:'1',binaryObject:'1',commentOnly:'0',debugCalls:'1',demangle:'0',directives:'0',execute:'0',intel:'0',libraryCode:'0',trim:'1',verboseDemangling:'0'),flagsViewOpen:'1',fontScale:14,fontUsePx:'0',j:1,lang:cuda,libs:!(),options:'-expt-relaxed-constexpr+-rdc=true+-std=c++17+',overrides:!(),selection:(endColumn:1,endLineNumber:1,positionColumn:1,positionLineNumber:1,selectionStartColumn:1,selectionStartLineNumber:1,startColumn:1,startLineNumber:1),source:1),l:'5',n:'0',o:'+NVCC+12_8_1+(Editor+#1)',t:'0')),k:50,l:'4',m:47_61904761904763,n:'0',o:'',s:0,t:'0'),(g:!((h:output,i:(compilerName:'x86-64+gcc+12_2',editorid:1,fontScale:14,fontUsePx:'0',j:1,wrap:'1'),l:'5',n:'0',o:'Output+of+NVCC+12_8_1+(Compiler+#1)',t:'0')),header:(),l:'4',m:23_764515257899,n:'0',o:'',s:0,t:'0'),(g:!((h:device,i:(compilerName:'NVCC+12_8_1',device:PTX,editorid:1,fontScale:14,fontUsePx:'0',j:1,selection:(endColumn:1,endLineNumber:1,positionColumn:1,positionLineNumber:1,selectionStartColumn:1,selectionStartLineNumber:1,startColumn:1,startLineNumber:1),treeid:0),l:'5',n:'0',o:'Device+Viewer+NVCC+12_8_1+(Editor+#1,+Compiler+#1)',t:'0')),l:'4',m:28_61643712305338,n:'0',o:'',s:0,t:'0')),k:46_0688758934373,l:'3',n:'0',o:'',t:'0')),l:'2',n:'0',o:'',t:'0')),version:4" rel="nofollow noreferrer">this situation where global __device__ constexpr variables are accessible from both the host and the device.

#include <array> #include <cstdio> __device__ constexpr std::array<int, 2> a = {15, 1}; __host__ __device__ void my_print(const char* c_str, const int& i) { std::printf(c_str); std::printf(" i=%d &i=%p\n", i, &i); } __global__ void display() { my_print("device", a[0]); } int main() { display<<<1, 1>>>(); cudaDeviceSynchronize(); my_print("host", a[0]); return 0; } 

Here is the output from a run, which looks reasonable:

device i=15 &i=0x7e97e9200000 host i=15 &i=0x482008 

Our situation is related to this question. However, unlike that case, we know the variable's value at compile time. Is the current situation well-defined?

Update: Thanks for the discussion in the comments. We understand that passing a constexpr device variable as const ref to a host should be treated as a bug by the CUDA compiler and it has been reported. We also understand that a global constexpr host variable is enough as a variable that should be accessed by both host and device. It is somewhat unfortunate that constexpr host variables cannot be passed as const ref to the device. Most likely nothing could go wrong.

13
  • 4
    The reverse situation is well defined - when is a constexpr host variable accessible from device code. I find no similar descriptions for variables decorated with __device__. When I reverse the scenario I get a compilation failure, which certainly gives me pause. (If we change the i parameter from pass-by-reference to pass-by-value, we can avoid that compilation failure - again - this is documented.) Commented May 14 at 19:12
  • 1
    From Robert's (first) link: Device source code cannot contain a reference to V or take the address of V. So, unless you're trying to modify i in my_print (which you can't because it's constexpr and const), there's no need for my_print to have const int& i. int i will be faster because it's a simple scalar (and work, apparently). Commented May 14 at 22:26
  • 1
    I think this is a bug and you should report it as such: forums.developer.nvidia.com/t/how-to-report-a-bug/67911 Commented May 16 at 7:54
  • 1
    @paleonix, __device__ variables leaking into the __host__ code looks like a bug to me, the host should not see device code and any use of device methods/vars should be a compilation error. Who would you make it undefined behavior? Is the C++ community not trying to get rid of undefined behavior? Commented May 16 at 9:53
  • 2
    "It is somewhat unfortunate that constexpr host variables cannot be passed as const ref to the device. Most likely nothing could go wrong." You have to understand that the difference between e.g a constexpr and other ways to define compile time constants is that constexpr "variables" do have an address and that creating a reference is taking that address (which can point to memory that is not accessible from either the host or the device). Ideally the compiler will inline the function and directly use the value of the constexpr variable. But if it can't do that, something will go wrong. Commented May 17 at 9:50

1 Answer 1

4

Indeed, as @Johan says in a comment, this is a bug.

The bug has two aspects, actually:

  • A __device__-side global variable exists on every device on which the code with that variable is used (or more exactly, in every context into which the variable's module is loaded; see §18.2 of the Programming Guide). In other words - when a device runs your code, a is well-defined; but on the host side - there may be many a's, so - which a's address could you take, at all?

  • Even supposing there were just one device (and one context), and the choice of a on the host was somehow made reasonably (e.g. the primary context on the current device) - you should still not be getting different addresses: CUDA, since version 6.0, uses a single unified memory space for all addresses on GPUs and on the host. That means that the address of an entity (like a variable or function) is the same for device-side code and host-side code - regardless of whether it's visible from both host and device.

I tried try your code with a non-constexpr variable, and of a simpler type (an int x); but I changed it to print the value on all of my (two) CUDA-capable GPUs. The resulting output is:

device 0: int x at address 0x7f826ba00000 has value 123 device 1: int x at address 0x7f826b000000 has value 123 host: int x at address 0x560032e54568 has value 0 host: int y at address 0x560032e544b0 has value 456 

so, two variables, on two devices; and yet we get a third address; and it looks very much like the address of another global host-side variable that I also defined.

Here's the program I used:

#include <cstdio> __device__ int x = 123; int y = 456; __host__ __device__ void display_x(int const* dev_idx, int const& x) { if (dev_idx) { std::printf("device %d: ", *dev_idx); } else { std::printf("host: "); } std::printf("int x at address %p has value %d\n", &x, x); } // Type your code here, or load an example. __global__ void display_x_kernel(int dev_idx) { display_x(&dev_idx, x); } int main() { int num_devices = 0; cudaGetDeviceCount(&num_devices); for(int dev_idx = 0; dev_idx < num_devices; dev_idx++) { cudaSetDevice(dev_idx); display_x_kernel<<<1, 1>>>(dev_idx); cudaDeviceSynchronize(); } display_x(nullptr, x); std::printf("host: int y at address %p has value %d\n", &y, y); } 

I'll also take this opportunity to suggest that you avoid __device__ global variables altogether - except for rare case I can't think of right now off the top of my head. (Non-constant) globals are to be avoided generally, and device globals are kind of the lame brothers of host-side-code globals - considering that your program will have to copy the values of these globals, at run-time, to the GPU, anyway. And if you want constexpr values which "disappear", due to only being used at compile-time - well, I would suggest making them local to a consteval function. Maybe it won't make a difference, but frankly - I don't want to have to prove to myself that NVCC or NVRTC or clang will actually get rid of them.

Thanks goes to @paleonix for setting me straight regarding the most salient aspect of this bug.

Sign up to request clarification or add additional context in comments.

5 Comments

According to the docs, there is a "distinct object" on each device meaning that there can't be a single unified virtual address.
@paleonix: " The address you get from cudaGetSymbolAddress() is the actual (unified) address. " <- Well, yes. And that's also the address you'll see within the kernel.
@paleonix: "there is a "distinct object" on each device" <- Yes, that's right. "meaning that there can't be a single unified virtual address " <- No, there is a single unified address space. And each device's distinct object has a single address in that space, which is different than all the other device's objects' addresses.
But that is exactly what I'm saying. Taking the address on the host can't give the UVA address because there isn't a single one but multiple.
@paleonix: Yes that's true. Which means I mis-described the problem. Thanks. ... and I've now rephrased the answer.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.