
WHAT I WANT TO DO:

I have a script that I use for listing all the prime numbers within a given range:

```python
# Python program to display all the prime numbers within an interval
lower = 900
upper = 1000

print("Prime numbers between", lower, "and", upper, "are:")

for num in range(lower, upper + 1):
    # all prime numbers are greater than 1
    if num > 1:
        for i in range(2, num):
            if (num % i) == 0:
                break
        else:
            print(num)
```

I would like to run this script on the GPU instead of the CPU so that it runs faster.
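
Independently of the GPU question, the trial-division loop itself can be made far faster by stopping at √n and skipping even divisors. A minimal CPU-only sketch (not part of the original script):

```python
# Baseline speed-up on the CPU: stop trial division at sqrt(n)
# and skip even divisors (a sketch; not yet GPU code).
from math import isqrt

def is_prime(n):
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2          # 2 is the only even prime
    for d in range(3, isqrt(n) + 1, 2):
        if n % d == 0:
            return False
    return True

print([n for n in range(900, 1000) if is_prime(n)])
```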

THE PROBLEM:

I don't have an NVIDIA GPU on my Intel NUC NUC8i7HVK, but a "Discrete GPU".


If I run this code to check what my GPUs are:

```python
import pyopencl as cl
import numpy as np

a = np.arange(32).astype(np.float32)
res = np.empty_like(a)

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
mf = cl.mem_flags

a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
dest_buf = cl.Buffer(ctx, mf.WRITE_ONLY, res.nbytes)

prg = cl.Program(ctx, """
    __kernel void sq(__global const float *a, __global float *c)
    {
      int gid = get_global_id(0);
      c[gid] = a[gid] * a[gid];
    }
    """).build()

prg.sq(queue, a.shape, None, a_buf, dest_buf)
cl.enqueue_copy(queue, res, dest_buf)
print(a, res)
```

I receive:

  • [0] <pyopencl.Platform 'AMD Accelerated Parallel Processing' at 0x7ffb3d492fd0>
  • [1] <pyopencl.Platform 'Intel(R) OpenCL HD Graphics' at 0x187b648ed80>

THE POSSIBLE APPROACH TO THE PROBLEM:

I found a guide that takes you by the hand and explains step by step how to run your code on the GPU. But all the Python libraries that pipe Python through the GPU, like PyOpenGL, PyOpenCL, TensorFlow (Force python script on GPU), PyTorch, etc., are tailored for NVIDIA.

If you have an AMD GPU, all these libraries ask for ROCm, but as far as I know that software still doesn't support integrated or discrete GPUs like mine (see my own answer below).

I found only one guide that covers this approach, but I cannot make it work.

Is there hope, or am I just trying to do something impossible?

EDIT: Reply to @chapelo

If I choose 0 the reply is:

```
Set the environment variable PYOPENCL_CTX='0' to avoid being asked again.
[ 0.  1.  2.  3.  4.  5.  6.  7.  8.  9. 10. 11. 12. 13. 14. 15. 16. 17.
 18. 19. 20. 21. 22. 23. 24. 25. 26. 27. 28. 29. 30. 31.]
[  0.   1.   4.   9.  16.  25.  36.  49.  64.  81. 100. 121. 144. 169.
 196. 225. 256. 289. 324. 361. 400. 441. 484. 529. 576. 625. 676. 729.
 784. 841. 900. 961.]
```

If I choose 1 the reply is:

```
Set the environment variable PYOPENCL_CTX='1' to avoid being asked again.
[ 0.  1.  2.  3.  4.  5.  6.  7.  8.  9. 10. 11. 12. 13. 14. 15. 16. 17.
 18. 19. 20. 21. 22. 23. 24. 25. 26. 27. 28. 29. 30. 31.]
[  0.   1.   4.   9.  16.  25.  36.  49.  64.  81. 100. 121. 144. 169.
 196. 225. 256. 289. 324. 361. 400. 441. 484. 529. 576. 625. 676. 729.
 784. 841. 900. 961.]
```
  • Did you look at the documentation of numba for AMD GPUs using ROCm? Commented Jan 13, 2021 at 15:01
  • @Lescurel Thank you, it looks like a good starting point. Apparently it's only for Linux, but I can sort that out. I will have a look at it. Commented Jan 13, 2021 at 17:02
  • @Lescurel, I'm following this guide right now shawonashraf.github.io/rocm-tf-ubuntu . BTW, I'm doing all this on an Intel NUC8i7HVK ark.intel.com/content/www/us/en/ark/products/126143/… which doesn't really have a GPU but instead has two "integrated" GPUs. Am I wasting my time, or can TensorFlow, PyOpenGL, etc. also work with integrated GPUs? Commented Jan 13, 2021 at 23:22
  • I don't know, I never tried. Good luck! Commented Jan 14, 2021 at 7:29
  • I had the same problem and basically I had to go to Nvidia. Commented Jan 20, 2021 at 3:37

3 Answers


After extensive research and several tries, I reached the following conclusions:

  1. PyOpenGL: mainly works with NVIDIA. If you have an AMD GPU you need to install ROCm.
  2. PyOpenCL: mainly works with NVIDIA. If you have an AMD GPU you need to install ROCm.
  3. TensorFlow: mainly works with NVIDIA. If you have an AMD GPU you need to install ROCm.
  4. PyTorch: mainly works with NVIDIA. If you have an AMD GPU you need to install ROCm.

I installed ROCm, but if I run rocminfo it returns:

```
ROCk module is NOT loaded, possibly no GPU devices
Unable to open /dev/kfd read-write: No such file or directory
Failed to get user name to check for video group membership
hsa api call failure at: /src/rocminfo/rocminfo.cc:1142
Call returned HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to
allocate the necessary resources. This error may also occur when the core
runtime library needs to spawn threads or create internal OS-specific events.
```

clinfo returns:

```
Number of platforms                               1
  Platform Name                                   AMD Accelerated Parallel Processing
  Platform Vendor                                 Advanced Micro Devices, Inc.
  Platform Version                                OpenCL 2.0 AMD-APP (3212.0)
  Platform Profile                                FULL_PROFILE
  Platform Extensions                             cl_khr_icd cl_amd_event_callback
  Platform Extensions function suffix             AMD

  Platform Name                                   AMD Accelerated Parallel Processing
Number of devices                                 0

NULL platform behavior
  clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...)  No platform
  clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...)   No platform
  clCreateContext(NULL, ...) [default]            No platform
  clCreateContext(NULL, ...) [other]              No platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT)      No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU)          No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU)          No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM)       No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL)          No devices found in platform
```

rocm-smi returns:

```
Segmentation fault
```

This is because the official guide says that "The integrated GPUs of Ryzen are not officially supported targets for ROCm." Since mine is an integrated GPU, I'm out of scope.

I will stop wasting my time and will probably buy an NVIDIA or AMD eGPU (external GPU).


4 Comments

Do you have the ICD's for your GPUs?
@chapelo I have the file C:\Windows\System32\OpenCL.dll and /etc/OpenCL/vendors/amdocl64.icd in Linux
@FrancescoMantovani I believe you did an extensive research on that. However, I'm not sure if this should be considered as an answer, especially when you said "I will stop wasting my time and probably buy an NVIDIA or AMD eGPU (external GPU)". I suggest that you do an edit to your post instead. Thanks!
@AndrewNaguib, yeah, definitely. I'm waiting for chapelo's answer, which I believe is a good solution. I will test it and will probably remove my answer.

pyopencl does work with both your AMD and your Intel GPUs, and you have checked that your installation is working. Just set the environment variable PYOPENCL_CTX='0' to use the AMD every time without being asked.
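
The same selection can be made from inside the script by setting the variable before pyopencl builds the context. A sketch (the pyopencl lines are commented out since they need an installed OpenCL runtime; '0' is the AMD platform in the listing above):

```python
# Pre-select the OpenCL platform so create_some_context() never prompts.
# This must run before pyopencl creates the context.
import os

os.environ["PYOPENCL_CTX"] = "0"   # "0" = AMD platform in this listing

# Then, in the real script:
# import pyopencl as cl
# ctx = cl.create_some_context()   # uses platform 0 without asking

print(os.environ["PYOPENCL_CTX"])
```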

Or, instead of using ctx = cl.create_some_context(), you could define the context in your program with:

```python
platforms = cl.get_platforms()
ctx = cl.Context(
    dev_type=cl.device_type.ALL,
    properties=[(cl.context_properties.PLATFORM, platforms[0])])
```

Don't take for granted that your AMD is better than your Intel in every case. I have had cases where the Intel surpasses the AMD. I think this has to do with the cost of copying the data from the CPU over to the other GPU.

Having said that, I think that running your script in parallel won't give you much of an improvement compared with using a better algorithm:

  • With a sieving algorithm, get the prime numbers up to the square root of your upper bound.
  • Applying a similar sieving algorithm, use the primes from the previous step to sieve the numbers between your lower and upper bounds.
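
The first step is an ordinary Sieve of Eratosthenes. A minimal standalone sketch of that step (a plain-list illustration; the answer's own generator version appears further down):

```python
def sieve_below(n):
    """Sieve of Eratosthenes: all primes strictly below n (n >= 2)."""
    flags = [False, False] + [True] * (n - 2)
    for p in range(2, int(n ** 0.5) + 1):
        if flags[p]:
            # every multiple of p from p*p upward is composite
            flags[p * p::p] = [False] * len(flags[p * p::p])
    return [i for i, is_p in enumerate(flags) if is_p]

print(sieve_below(32))  # primes up to sqrt(1000) ~ 31
```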

Perhaps this is not a good example of an algorithm that can be easily run in parallel, but you are all set up to try another example.

However, to show you how you can solve this problem using your GPU, consider the following changes:

The serial algorithm would be something like the following:

```python
from math import sqrt

def primes_below(number):
    n = lambda a: 2 if a == 0 else 2*a + 1
    limit = int(sqrt(number)) + 1
    size = number // 2
    primes = [True] * size
    for i in range(1, size):
        if primes[i]:
            num = n(i)
            for j in range(i + num, size, num):
                primes[j] = False
    for i, flag in enumerate(primes):
        if flag:
            yield n(i)

def primes_between(lo, hi):
    primes = list(primes_below(int(sqrt(hi)) + 1))
    size = (hi - lo - (0 if hi % 2 else 1)) // 2 + 1
    n = lambda a: 2*a + lo + (0 if lo % 2 else 1)
    numbers = [True] * size
    for i, prime in enumerate(primes):
        if i == 0:
            continue
        start = 0
        while (n(start) % prime) != 0:
            start += 1
        for j in range(start, size, prime):
            numbers[j] = False
    for i, flag in enumerate(numbers):
        if flag:
            yield n(i)
```

This prints the primes between 1e6 and 5e6 in 0.64 seconds.

Trying your script on my GPU, it didn't finish in more than 5 minutes. For a problem 10 times smaller, the primes between 1e5 and 5e5, it took roughly 29 seconds.

Modifying the script so that each GPU work item divides one odd number (there's no point testing even numbers) by a list of pre-computed primes up to the square root of the upper bound, stopping once a prime exceeds the square root of the number itself, completes the same task in 0.50 seconds. That's an improvement!

The code is the following:

```python
import numpy as np
import pyopencl as cl
import pyopencl.algorithm
import pyopencl.array

def primes_between_using_cl(lo, hi):
    primes = list(primes_below(int(sqrt(hi)) + 1))
    numbers_h = np.arange(lo + (0 if lo & 1 else 1),
                          hi + (0 if hi & 1 else 1),
                          2,
                          dtype=np.int32)
    size = (hi - lo - (0 if hi % 2 else 1)) // 2 + 1
    code = """\
    __kernel void is_prime(__global const int *primes,
                           __global       int *numbers) {
      int gid = get_global_id(0);
      int num = numbers[gid];
      int max = (int)(sqrt((float)num) + 1.0);
      for (; *primes; ++primes) {
        if (*primes <= max && num % *primes == 0) {
          numbers[gid] = 0;
          return;
        }
      }
    }
    """
    platforms = cl.get_platforms()
    ctx = cl.Context(dev_type=cl.device_type.ALL,
                     properties=[(cl.context_properties.PLATFORM, platforms[0])])
    queue = cl.CommandQueue(ctx)
    prg = cl.Program(ctx, code).build()
    numbers_d = cl.array.to_device(queue, numbers_h)
    primes_d = cl.array.to_device(queue,
                                  np.array(primes[1:],  # don't need 2
                                           dtype=np.int32))
    prg.is_prime(queue, (size, ), None, primes_d.data, numbers_d.data)
    array, length = cl.algorithm.copy_if(numbers_d, "ary[i]>0")[:2]
    yield from array.get()[:length.get()]
```

5 Comments

Check the edit I made, I posted an example at the end. Try to run it.
Your code looks promising, but both scripts return nothing snipboard.io/M7N1Dy.jpg . Maybe because there isn't actually a print() in them? Also, how can I set hi and lo?
The functions I gave you are generators. You have to call them from your program with whatever input values are relevant, and use the output for something useful. Just printing the results does very little. Call the function from your program with your values of hi and lo (like your lower and upper values) and consume the results one by one, or put them in a list. I thought you knew these simple things.
My knowledge is definitely lower than yours; for instance, this is the first time I've seen a yield, and I didn't know it existed before. I just want to print the results to the terminal. If I substitute yield from with return, it says NameError: name 'primes_below' is not defined. How can I just print the numbers from 1 to 1000? I just want to check whether your solution works. Thank you
I will post my response as another answer, to keep it clear and to the point.

The following code is a sample of a complete Python program, which usually includes:

  • The import statements
  • The function definitions
  • The main() function
  • The if __name__ == "__main__": section.
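
A bare skeleton of that layout, with hypothetical names, just to show the shape:

```python
"""Minimal skeleton of the program layout listed above."""

# 1. The import statements
from math import sqrt

# 2. The function definitions
def work(x):
    """A hypothetical stand-in for the real computation."""
    return int(sqrt(x))

# 3. The main() function
def main():
    print(work(1000))

# 4. The `if __name__ == "__main__":` section
if __name__ == "__main__":
    main()
```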

I hope this helps you solve your problem.

```python
import pyprimes
from math import sqrt
import numpy as np
import pyopencl as cl
import pyopencl.algorithm
import pyopencl.array

def primes_below(number):
    """Generate a list of prime numbers below a specified `number`"""
    n = lambda a: 2 if a == 0 else 2*a + 1
    limit = int(sqrt(number)) + 1
    size = number // 2
    primes = [True] * size
    for i in range(1, size):
        if primes[i]:
            num = n(i)
            if num > limit:
                break
            for j in range(i + num, size, num):
                primes[j] = False
    for i, flag in enumerate(primes):
        if flag:
            yield n(i)

def primes_between(lo, hi):
    """Generate a list of prime numbers between `lo` and `hi` numbers"""
    primes = list(primes_below(int(sqrt(hi)) + 1))
    size = (hi - lo - (0 if hi % 2 else 1)) // 2 + 1
    n = lambda a: 2*a + lo + (0 if lo % 2 else 1)
    numbers = [True] * size
    for i, prime in enumerate(primes):
        if i == 0:
            continue  # avoid dividing by 2
        nlo = n(0)
        # slower:
        # start = prime * (nlo//prime + 1) if nlo%prime else 0
        start = 0
        while (n(start) % prime) != 0:
            start += 1
        for j in range(start, size, prime):
            numbers[j] = False
    for i, flag in enumerate(numbers):
        if flag:
            yield n(i)

def primes_between_using_cl(lo, hi):
    """Generate a list of prime numbers between `lo` and `hi` numbers;
    this is a parallel algorithm using pyopencl"""
    primes = list(primes_below(int(sqrt(hi)) + 1))
    size_primes_h = np.array((len(primes) - 1, ), dtype=np.int32)
    numbers_h = np.arange(lo + (0 if lo & 1 else 1),
                          hi + (0 if hi & 1 else 1),
                          2,
                          dtype=np.int32)
    size = (hi - lo - (0 if hi % 2 else 1)) // 2 + 1
    code = """\
    __kernel void is_prime(__global const int *primes,
                           __global       int *numbers) {
      int gid = get_global_id(0);
      int num = numbers[gid];
      int max = (int)(sqrt((float)num) + 1.0);
      for (; *primes; ++primes) {
        if (*primes > max) break;
        if (num % *primes == 0) {
          numbers[gid] = 0;
          return;
        }
      }
    }
    """
    platforms = cl.get_platforms()
    ctx = cl.Context(dev_type=cl.device_type.ALL,
                     properties=[(cl.context_properties.PLATFORM, platforms[0])])
    queue = cl.CommandQueue(ctx)
    prg = cl.Program(ctx, code).build()
    numbers_d = cl.array.to_device(queue, numbers_h)
    primes_d = cl.array.to_device(queue,
                                  np.array(primes[1:], dtype=np.int32))
    prg.is_prime(queue, (size, ), None, primes_d.data, numbers_d.data)
    array, length = cl.algorithm.copy_if(numbers_d, "ary[i]>0")[:2]
    yield from array.get()[:length.get()]

def test(f, lo, hi):
    """Test that all prime numbers are generated by comparing
    with the output of the library `pyprimes`"""
    a = filter(lambda p: p > lo, pyprimes.primes_below(hi))
    b = f(lo, hi)
    result = True
    for p, q in zip(a, b):
        if p != q:
            print(p, q)
            result = False
    return result

def main():
    lower = 1000
    upper = 5000
    print("The prime numbers between {} and {} are:".format(lower, upper))
    print()
    for p in primes_between_using_cl(lower, upper):
        print(p, end=' ')
    print()

if __name__ == '__main__':
    main()
```

1 Comment

Thank you so much @chapelo. However, I still have tons of doubts: whether I set platforms[0] or platforms[1], the script always uses the Radeon and never the Intel. Also, the script only uses about 10% of the GPU, not more than that: snipboard.io/sNqzWk.jpg
