Summary
In this chapter we worked through a variety of optimization strategies and learned to use the graphical profiler that is available in the CUDA toolkit. The profiler is an essential tool for finding the hot spots in our kernels, so that we spend our optimization effort where it pays off most.
In the next chapter we will learn another strategy to accelerate our code: overlapping memory transfers with kernel execution.