
I have encountered the following problem when trying to optimize my application with C++ AMP: the data transfers. Copying data from the CPU to the GPU is not a problem for me (as I can do it in the initial state of the application). The worse part is that I need fast access to the results computed by C++ AMP kernels, so the bottleneck between the GPU and the CPU is a pain. I have read that there is a performance boost under Windows 8.1, but I am using Windows 7 and I am not planning to change it. I have read about staging arrays, but I don't know how they could help solve my problem. I need to return a single float value to the host, and it seems that this is the most time-consuming operation.

    float Subset::reduction_cascade(unsigned element_count, concurrency::array<float, 1>& a) {
        static_assert(_tile_count > 0, "Tile count must be positive!");
        //static_assert(IS_POWER_OF_2(_tile_size), "Tile size must be a positive integer power of two!");
        //assert(source.size() <= UINT_MAX); // 'source' is not in scope in this overload.
        //unsigned element_count = static_cast<unsigned>(source.size());
        assert(element_count != 0); // Cannot reduce an empty sequence.

        unsigned stride = _tile_size * _tile_count * 2;

        // Reduce tail elements.
        float tail_sum = 0.f;
        unsigned tail_length = element_count % stride;

        // Using arrays as a temporary memory.
        //concurrency::array<float, 1> a(element_count, source.begin());
        concurrency::array<float, 1> a_partial_result(_tile_count);

        concurrency::parallel_for_each(
            concurrency::extent<1>(_tile_count * _tile_size).tile<_tile_size>(),
            [=, &a, &a_partial_result] (concurrency::tiled_index<_tile_size> tidx) restrict(amp) {
                // Use tile_static as a scratchpad memory.
                tile_static float tile_data[_tile_size];
                unsigned local_idx = tidx.local[0];

                // Reduce data strides of twice the tile size into tile_static memory.
                unsigned input_idx = (tidx.tile[0] * 2 * _tile_size) + local_idx;
                tile_data[local_idx] = 0;
                do {
                    tile_data[local_idx] += a[input_idx] + a[input_idx + _tile_size];
                    input_idx += stride;
                } while (input_idx < element_count);
                tidx.barrier.wait();

                // Reduce to the tile result using multiple threads.
                for (unsigned stride = _tile_size / 2; stride > 0; stride /= 2) {
                    if (local_idx < stride) {
                        tile_data[local_idx] += tile_data[local_idx + stride];
                    }
                    tidx.barrier.wait();
                }

                // Store the tile result in the global memory.
                if (local_idx == 0) {
                    a_partial_result[tidx.tile[0]] = tile_data[0];
                }
            });

        // Reduce results from all tiles on the CPU.
        std::vector<float> v_partial_result(_tile_count);
        copy(a_partial_result, v_partial_result.begin());
        return std::accumulate(v_partial_result.begin(), v_partial_result.end(), tail_sum);
    }

I measured that in the example above the most time-consuming operation is copy(a_partial_result, v_partial_result.begin());. I am trying to find a better approach.

  • How are you timing the data copies vs. the compute parts of your code? Remember that, to some extent, C++ AMP calls are asynchronous: they queue things to the DMA buffer and only block when needed. See the following answer for more discussion on timing: stackoverflow.com/questions/13936994/copy-data-from-gpu-to-cpu/… Commented Feb 19, 2014 at 23:44
  • I am timing it the same way that I time non-parallel methods. When I commented out the copy() method, I got a boost from 800-900 ms to 300 ms. Commented Feb 19, 2014 at 23:54
  • @up When I comment out the copy function I get <200 ms. Commented Feb 20, 2014 at 0:03
  • If you are not forcing the AMP kernel to finish its compute, by either copying the data or calling synchronize() or wait(), then you may not be timing anything at all. See the link in my previous comment. Commented Feb 20, 2014 at 0:25
  • So after calling wait() explicitly I got ~640 ms without copy() and ~1300 ms with copy(). What's even worse, my previous methods seem to be slower than I expected after adding wait() everywhere. That's really bad news. Commented Feb 20, 2014 at 1:06

1 Answer


So I think there's something else going on here. Have you tried running the original sample on which your code is based? This is available on CodePlex.

Load the samples solution, build the Reduction project in Release mode, and then run it without the debugger attached. You should see output like this:

    Running kernels with 16777216 elements, 65536 KB of data ...
    Tile size:  512
    Tile count: 128
    Using device : NVIDIA GeForce GTX 570
                                                              Total : Calc
    SUCCESS: Overhead                                          0.03 :  0.00 (ms)
    SUCCESS: CPU sequential                                    9.48 :  9.45 (ms)
    SUCCESS: CPU parallel                                      5.92 :  5.89 (ms)
    SUCCESS: C++ AMP simple model                             25.34 :  3.19 (ms)
    SUCCESS: C++ AMP simple model using array_view            62.09 : 20.61 (ms)
    SUCCESS: C++ AMP simple model optimized                   25.24 :  1.81 (ms)
    SUCCESS: C++ AMP tiled model                              29.70 :  7.27 (ms)
    SUCCESS: C++ AMP tiled model & shared memory              30.40 :  7.56 (ms)
    SUCCESS: C++ AMP tiled model & minimized divergence       25.21 :  5.77 (ms)
    SUCCESS: C++ AMP tiled model & no bank conflicts          25.52 :  3.92 (ms)
    SUCCESS: C++ AMP tiled model & reduced stalled threads    21.25 :  2.03 (ms)
    SUCCESS: C++ AMP tiled model & unrolling                  22.94 :  1.55 (ms)
    SUCCESS: C++ AMP cascading reduction                      20.17 :  0.92 (ms)
    SUCCESS: C++ AMP cascading reduction & unrolling          24.01 :  1.20 (ms)

Note that none of the examples takes anywhere near the time your code does, although it's fair to say that the CPU is faster here and that data copy time is a big contributing factor.

This is to be expected. Effective use of a GPU involves moving more than individual operations like reduction onto it. You need to move a significant amount of compute to make up for the copy overhead.

Some things you should consider:

  • What happens when you run the sample from CodePlex?
  • Are you running a release build with optimization enabled?
  • Are you sure you are running against the actual GPU hardware and not against a WARP (software emulator) accelerator?

Some more information that would be helpful:

  • What hardware are you using?
  • How large is your data set, both the input data and the size of the partial result array?

2 Comments

Did this help or are you still experiencing really slow copies?
Yes, it helped me a lot. It turned out that the tests I was running were measuring in microseconds, not milliseconds; that was the problem. I want to optimize two methods (a convolution calculation and another very simple mathematical equation). The mathematical equation is very fast on the CPU (around 50 microseconds ~= 0.05 ms). Copying one float from a concurrency::array<...> to the CPU takes much more than 0.05 ms (I think at least about 0.9 ms), so the copy alone makes the GPU-accelerated computation more than 10 times slower. Or maybe I am wrong here?
