Parallelizing with streams
Now we will talk about the second feature that helps improve performance of CUDA programs: streams.
We can think of a stream as a queue onto which we enqueue kernel launches and memory transfers; the GPU then executes the operations in each stream sequentially, in the order in which they were enqueued. There is a default stream, with index 0, which is used whenever we don't specify a stream explicitly. One important characteristic of the default stream is that it synchronizes with the rest of the device: an operation issued to the default stream does not begin until previously issued work on the GPU has completed, and no subsequent work begins until it finishes. This simplifies reasoning about program behavior but limits performance, because nothing can overlap with default-stream operations.
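As a minimal sketch of this in-order behavior, the following program issues a copy, a kernel, and a copy back, all to the default stream; the three operations run strictly one after another. The `scale` kernel and the sizes here are hypothetical, chosen only for illustration:

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

// Hypothetical kernel: multiplies each element in place.
__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main(void) {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    float *h = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    float *d;
    cudaMalloc(&d, bytes);

    // All three operations go to the default stream (stream 0),
    // so each one completes before the next begins on the GPU.
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
    scale<<<(n + 255) / 256, 256>>>(d, n, 2.0f);
    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);

    printf("h[0] = %f\n", h[0]);

    cudaFree(d);
    free(h);
    return 0;
}
```

Note that the final `cudaMemcpy` also acts as a synchronization point: it returns only after the kernel and the copy itself have finished, so the host can safely read `h` afterward.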
However, we can create our own non-default streams and issue different operations to different streams, allowing memory transfers in one stream to overlap with computation in another. Since streams are not bound to any particular multiprocessor or copy engine, the hardware schedules each enqueued operation onto whatever suitable resource is available.
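The pattern above can be sketched with two streams, each running its own copy-in, kernel, copy-out pipeline. The `scale` kernel is the same hypothetical example as before; one assumption worth flagging is that asynchronous copies only overlap with computation when the host buffers are page-locked (pinned), which is why `cudaMallocHost` is used instead of `malloc`:

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: multiplies each element in place.
__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main(void) {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    // Pinned host memory is required for truly asynchronous copies.
    float *h0, *h1, *d0, *d1;
    cudaMallocHost(&h0, bytes);
    cudaMallocHost(&h1, bytes);
    cudaMalloc(&d0, bytes);
    cudaMalloc(&d1, bytes);

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // Work within each stream stays in order, but the two streams
    // may overlap: s1's host-to-device copy can run while s0's
    // kernel is computing.
    cudaMemcpyAsync(d0, h0, bytes, cudaMemcpyHostToDevice, s0);
    scale<<<(n + 255) / 256, 256, 0, s0>>>(d0, n, 2.0f);
    cudaMemcpyAsync(h0, d0, bytes, cudaMemcpyDeviceToHost, s0);

    cudaMemcpyAsync(d1, h1, bytes, cudaMemcpyHostToDevice, s1);
    scale<<<(n + 255) / 256, 256, 0, s1>>>(d1, n, 0.5f);
    cudaMemcpyAsync(h1, d1, bytes, cudaMemcpyDeviceToHost, s1);

    // Wait for both pipelines before reading the results.
    cudaStreamSynchronize(s0);
    cudaStreamSynchronize(s1);

    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFreeHost(h0); cudaFreeHost(h1);
    cudaFree(d0); cudaFree(d1);
    return 0;
}
```

In a real application the input would typically be split into chunks cycled across the streams, so that the copy of chunk k+1 hides behind the computation on chunk k.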
We must keep in mind that...