I have encountered the following problem when trying to optimize my application with C++ AMP: the data transfers. Copying data from the CPU to the GPU is not a problem for me, since I can do it in the initial state of the application. The worse part is that I need fast access to the results computed by C++ AMP kernels, so the bottleneck between the GPU and the CPU is a pain. I have read that there is a performance boost under Windows 8.1; however, I am using Windows 7 and I am not planning to change that. I have also read about staging arrays, but I don't know how they could help solve my problem. I need to return a single float value to the host, and it seems that this is the most time-consuming operation.
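For context, my understanding of a staging array is that it is a `concurrency::array` allocated in CPU-accessible memory but associated with a GPU `accelerator_view`, so the runtime can use a faster transfer path for copies. A minimal sketch of how one is created (the accelerator choices and the size are placeholders, not from my actual code):

```cpp
#include <amp.h>

void create_staging_array_sketch() {
    // The CPU accelerator hosts the staging array's memory.
    concurrency::accelerator cpu_acc(concurrency::accelerator::cpu_accelerator);
    // The GPU accelerator it will exchange data with (default device assumed).
    concurrency::accelerator gpu_acc;

    // An array allocated on the CPU accelerator_view but associated with the
    // GPU accelerator_view; copies between it and GPU arrays can be faster.
    concurrency::array<float, 1> staging(1024,
                                         cpu_acc.default_view,
                                         gpu_acc.default_view);
}
```

I am not sure whether this pattern actually helps for small result transfers, which is part of my question.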
float Subset::reduction_cascade(unsigned element_count, concurrency::array<float, 1>& a) {
    static_assert(_tile_count > 0, "Tile count must be positive!");
    //static_assert(IS_POWER_OF_2(_tile_size), "Tile size must be a positive integer power of two!");
    assert(element_count != 0); // Cannot reduce an empty sequence.

    unsigned stride = _tile_size * _tile_count * 2;
    // Tail elements (element_count % stride) are not reduced here, so the
    // caller must pass an element count that is a multiple of the stride.
    assert(element_count % stride == 0);
    float tail_sum = 0.f;

    // Per-tile partial results in GPU memory.
    concurrency::array<float, 1> a_partial_result(_tile_count);

    concurrency::parallel_for_each(
        concurrency::extent<1>(_tile_count * _tile_size).tile<_tile_size>(),
        [=, &a, &a_partial_result] (concurrency::tiled_index<_tile_size> tidx) restrict(amp) {
            // Use tile_static as scratchpad memory.
            tile_static float tile_data[_tile_size];
            unsigned local_idx = tidx.local[0];

            // Reduce data in strides of twice the tile size into tile_static memory.
            unsigned input_idx = (tidx.tile[0] * 2 * _tile_size) + local_idx;
            tile_data[local_idx] = 0;
            do {
                tile_data[local_idx] += a[input_idx] + a[input_idx + _tile_size];
                input_idx += stride;
            } while (input_idx < element_count);
            tidx.barrier.wait();

            // Reduce to the tile result using multiple threads.
            for (unsigned s = _tile_size / 2; s > 0; s /= 2) {
                if (local_idx < s) {
                    tile_data[local_idx] += tile_data[local_idx + s];
                }
                tidx.barrier.wait();
            }

            // Store the tile result in global memory.
            if (local_idx == 0) {
                a_partial_result[tidx.tile[0]] = tile_data[0];
            }
        });

    // Reduce the per-tile results on the CPU.
    std::vector<float> v_partial_result(_tile_count);
    copy(a_partial_result, v_partial_result.begin());
    return std::accumulate(v_partial_result.begin(), v_partial_result.end(), tail_sum);
}

I checked that in the example above the most time-consuming operation is copy(a_partial_result, v_partial_result.begin());. I am trying to find a better approach.
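One thing I have been considering is replacing that copy with a copy into a staging array and reading the per-tile results from it directly on the CPU. This is an untested sketch of how the end of reduction_cascade might look (it assumes the same _tile_count, a, a_partial_result, and tail_sum as above; ideally the staging array would be created once and reused across calls rather than per call):

```cpp
// Staging array backed by CPU-accessible memory, associated with the same
// accelerator_view as the input array `a` (assumption: one GPU device).
concurrency::accelerator cpu_acc(concurrency::accelerator::cpu_accelerator);
concurrency::array<float, 1> staging(_tile_count,
                                     cpu_acc.default_view,
                                     a.accelerator_view);

// Copy per-tile results from GPU memory into the staging array.
concurrency::copy(a_partial_result, staging);

// A staging array can be read directly on the CPU, with no std::vector copy.
float result = tail_sum;
for (unsigned i = 0; i < _tile_count; ++i) {
    result += staging[i];
}
return result;
```

I don't know whether this is actually faster than copying into a std::vector for a transfer this small, which is why I am asking.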