I have this piece of code that is as profiled, optimised and cache-efficient as I am likely to get it with my level of knowledge. It runs on the CPU conceptually like this:
```cpp
#pragma omp parallel for schedule(dynamic)
for (int i = 0; i < numberOfTasks; ++i)
{
    result[i] = RunTask(i); // result is some array where I store the result of RunTask.
}
```

It just so happens that `RunTask()` is essentially a set of linear algebra operations that operate repeatedly on the same, very large dataset every time, so it's suitable to run on a GPU. So I would like to achieve the following:
- Offload some of the tasks to the GPU
- While the GPU is busy, process the rest of the tasks on the CPU
- For the CPU-level operations, keep my super-duper `RunTask()` function without having to modify it to comply with `restrict(amp)`. I could of course design a `restrict(amp)`-compliant lambda for the GPU tasks.
Initially I thought of doing the following:
```cpp
// assume we know exactly how much time the GPU/CPU needs per task,
// and that this is the most time-efficient combination:
int numberOfTasks = 1000;
int ampTasks = 800;

// RunTasksAMP(start, end) sends a restrict(amp) kernel to the GPU, and stores
// the result in the returned array_view on the GPU
Concurrency::array_view<ResultType, 1> concurrencyResult = RunTasksAMP(0, ampTasks);

// perform the rest of the tasks on the CPU while we wait
#pragma omp parallel for schedule(dynamic)
for (int i = ampTasks; i < numberOfTasks; ++i)
{
    result[i] = RunTask(i); // RunTask is thread-safe
}

// do something to wait for the parallel_for_each in RunTasksAMP to finish
concurrencyResult.synchronize();

// ... now load the concurrencyResult array into the first elements of "result"
```

But I doubt you can do something like this, because:
> A call to parallel_for_each behaves as though it's synchronous

(http://msdn.microsoft.com/en-us/library/hh305254.aspx)
So is it possible to achieve all three of the goals above, or do I have to ditch the third one? Either way, how would I implement it?