edited Aug 30, 2013 at 20:52

CUDATestFunction = CUDAFunctionLoad[...,{"Float", 2, "InputOutput"},{16,16}];

Real code for the actual problem, with the input simplified (this does not change the problem that occurs):

(*data initialization*)
imageData = Table[RandomReal[], {i, 200}, {j, 300}]; (*this would usually be some GrayScale ImageData*)
f = Table[RandomReal[], {i, 200}, {j, 300}]; (*this would be some function evaluated on the imageData*)
maxiter = 10000;
testImageData = Table[{RandomReal[], RandomReal[]}, {i, 200}, {j, 300}]; (*would be the gradient of the imageData*)

(*the for loop that leads to the memory overflow*)
For[i = 1, i < maxiter, i++,
 test = First@CUDATestFunction[testImageData, 0.1, f, imageGradientNormalized, 1, Sequence @@ Dimensions[imageData]];
 testImageData = test;
 ]

(*CUDACode*)
CUDATestFunction = CUDAFunctionLoad["
__device__ float length(const float2& a) {
    return sqrtf(a.x*a.x+a.y*a.y);
}
__device__ float2 operator+(const float2& a, const float2& b) {
    return make_float2(a.x + b.x, a.y + b.y);
}
__device__ float2 operator-(const float2& a, const float2& b) {
    return make_float2(a.x - b.x, a.y - b.y);
}
__device__ float2 operator*(const float& a, const float2& b) {
    return make_float2(a * b.x, a * b.y);
}
__device__ float2 operator/(const float2& a, const float& b) {
    return make_float2(a.x / b, a.y / b);
}
__global__ void resolvFs(float* p, float sigma, float* f, float* imageGradientNormalized, float lambda1, mint width, mint height) {
    int xIndex = threadIdx.x + blockIdx.x * blockDim.x;
    int yIndex = threadIdx.y + blockIdx.y * blockDim.y;
    int index = 2*(xIndex + yIndex * width);
    if(xIndex < width && yIndex < height) {
        float2 vecP = make_float2(p[index], p[index+1]);
        float2 vecN = make_float2(imageGradientNormalized[index], imageGradientNormalized[index+1]);
        vecP = vecP + 2 * sqrtf(lambda1) * vecN;
        vecP = vecP/fmaxf(1, length(vecP)/(2*sqrtf(f[index/2]+lambda1)));
        vecP = vecP - 2 * sqrtf(lambda1) * vecN;
        p[index] = vecP.x;
        p[index+1] = vecP.y;
    }
}",
 "resolvFs",
 {{"Float", 3, "InputOutput"}, "Float", {"Float", 2, "Input"}, {"Float", 3, "Input"}, "Float", _Integer, _Integer},
 {16, 16}]

asked Aug 30, 2013 at 16:02

Wizard

CUDALink ran out of available memory

I have a problem with automatic GPU memory management when using CUDA. The situation is the following:

I create a simple CUDA function that operates on a 2D array (an image), which does not take up much memory. This 2D array is passed as input, modified, and output again. The function is loaded into Mathematica using CUDAFunctionLoad (pseudocode):

CUDATestFunction = CUDAFunctionLoad[...,{"Float", 2, "Input"},{16,16}];

The function compiles fine and works. Then I create a loop that calls this CUDA function, assigns the output to a new variable, and repeats until some number of iterations is reached (again, only pseudocode):

Array2Dold = imageData; (*assign raw image data*)
For[i=1,i<maxiter,i++,
 Array2Dnew = CUDATestFunction[Array2Dold];
 Array2Dold = Array2Dnew;
 ]

The problem is that when the number of iterations gets high, I run out of GPU memory and get the error message: "CUDAFunction::outmem: CUDALink ran out of available memory, possibly due to not freeing memory using the memory manager".

Does anybody know why that happens?

My guess is that Mathematica, for some reason, keeps Array2Dold (or some modified version of it) in GPU memory after the CUDA computation and does not free that memory when the CUDA function (CUDATestFunction) finishes. I know that there is a CUDALink memory manager, but from what I read in the Mathematica documentation, Mathematica should at least be able to handle the above situation automatically, without a memory overflow.
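For comparison, here is a minimal sketch of what using the CUDALink memory manager explicitly might look like for the pseudocode loop above. It assumes CUDATestFunction takes a single {"Float", 2, "InputOutput"} argument, as in the pseudocode. The name mem is hypothetical, and whether this actually avoids the outmem error in this case is untested.

```mathematica
Needs["CUDALink`"]

(* Assumption: CUDATestFunction has already been loaded with a single
   {"Float", 2, "InputOutput"} argument, and imageData and maxiter are defined *)

(* Register the data with the memory manager; "mem" is a hypothetical name *)
mem = CUDAMemoryLoad[imageData, "Float"];

(* Call the kernel repeatedly on the same CUDAMemory handle, so no new
   host array is passed in on each iteration *)
Do[
 CUDATestFunction[mem],
 {maxiter}
 ];

(* Copy the result back to the host once, then release the GPU memory *)
result = CUDAMemoryGet[mem];
CUDAMemoryUnload[mem];
```

The idea is that the array is allocated on the GPU only once and freed explicitly, rather than relying on automatic cleanup after every call.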

memory cudalink