Running multiple GPUs together
As we mentioned in the chapter introduction, having multiple GPUs on the same machine is not a very common setup, due to the high cost involved. Nevertheless, it offers another form of overlapping computation that we can use to our advantage. In this section we will look at an adaptation of the previous matrix-vector multiplication program, in which the problem is divided into two parts and each part is submitted to a different GPU.
To keep the two topics separate, we will not use streams in this program. Nor will we carry out any performance measurements, because the system on which the executions were run uses a PCIe 2.0 bus, which is slow enough to distort the final timing results significantly.
The key concept in multi-GPU programming is the cudaSetDevice(int d) function, which selects the GPU device that all subsequent CUDA calls from the host thread will target, until it is called again with a different device identifier.
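The pattern can be sketched as follows. This is a minimal illustration, not the chapter's actual program: the kernel name matVec, the matrix dimensions, and the even row split between the two devices are assumptions made for the example. The host loops over the two devices, calls cudaSetDevice before allocating, copying, and launching on each one, and then synchronizes each device before collecting its partial result.

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

// Hypothetical kernel: one thread per output row of y = A * x,
// operating on the block of rows assigned to this GPU.
__global__ void matVec(const float *A, const float *x, float *y,
                       int rows, int cols) {
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r < rows) {
        float sum = 0.0f;
        for (int c = 0; c < cols; c++)
            sum += A[r * cols + c] * x[c];
        y[r] = sum;
    }
}

int main(void) {
    const int N = 1024, M = 1024;                 // example dimensions
    float *h_A = (float *)malloc(N * M * sizeof(float));
    float *h_x = (float *)malloc(M * sizeof(float));
    float *h_y = (float *)malloc(N * sizeof(float));
    // ... initialize h_A and h_x ...

    // Split the rows of A between the two devices.
    int rows[2]   = { N / 2, N - N / 2 };
    int offset[2] = { 0, N / 2 };
    float *d_A[2], *d_x[2], *d_y[2];

    for (int dev = 0; dev < 2; dev++) {
        cudaSetDevice(dev);   // all calls below now target GPU 'dev'
        cudaMalloc(&d_A[dev], rows[dev] * M * sizeof(float));
        cudaMalloc(&d_x[dev], M * sizeof(float));
        cudaMalloc(&d_y[dev], rows[dev] * sizeof(float));
        cudaMemcpy(d_A[dev], h_A + offset[dev] * M,
                   rows[dev] * M * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(d_x[dev], h_x, M * sizeof(float),
                   cudaMemcpyHostToDevice);
        int block = 256, grid = (rows[dev] + block - 1) / block;
        matVec<<<grid, block>>>(d_A[dev], d_x[dev], d_y[dev],
                                rows[dev], M);   // asynchronous launch
    }

    // Wait for each GPU and gather its slice of the result.
    for (int dev = 0; dev < 2; dev++) {
        cudaSetDevice(dev);
        cudaDeviceSynchronize();
        cudaMemcpy(h_y + offset[dev], d_y[dev],
                   rows[dev] * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(d_A[dev]); cudaFree(d_x[dev]); cudaFree(d_y[dev]);
    }
    free(h_A); free(h_x); free(h_y);
    return 0;
}
```

Note that the two kernel launches in the first loop overlap: a launch returns control to the host immediately, so the host can switch devices and launch on the second GPU while the first is still computing.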
The kernel to perform matrix-vector multiplication...