
I am using AWS to train a CNN on a custom dataset. I launched a p2.xlarge instance, uploaded my (Python) scripts to the virtual machine, and I am running my code via the CLI.

I activated a virtual environment for TensorFlow(+Keras2) with Python3 (CUDA 10.0 and Intel MKL-DNN), which was a default option via AWS.

I am now running my code to train the network, but it feels like the GPU is not 'activated'. The training goes just as fast (slow) as when I run it locally with a CPU.

This is the script that I am running:

https://github.com/AntonMu/TrainYourOwnYOLO/blob/master/2_Training/Train_YOLO.py

I also tried to alter it by putting with tf.device('/device:GPU:0'): after the parser (line 142) and indenting everything below it inside that block. However, this doesn't seem to have changed anything.
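For reference, a minimal sketch of what I mean, assuming the TF 1.x API that this environment ships (the matmul is just a stand-in for the Keras training code from the script, not my actual code):

import tensorflow as tf

# Pin graph construction onto the first GPU, the same way I indented the
# training code under this block. The matmul is only a placeholder.
with tf.device('/device:GPU:0'):
    a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
    b = tf.matmul(a, a)

# log_device_placement prints where each op actually runs; with
# allow_soft_placement=True, ops silently fall back to the CPU if no GPU
# is visible, which would explain seeing no change in speed.
config = tf.ConfigProto(log_device_placement=True, allow_soft_placement=True)
with tf.Session(config=config) as sess:
    print(sess.run(b))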

Any tips on how to activate the GPU (or check if the GPU is activated)?

3 Answers


Check out this answer for listing available GPUs:

from tensorflow.python.client import device_lib

def get_available_gpus():
    local_device_protos = device_lib.list_local_devices()
    return [x.name for x in local_device_protos if x.device_type == 'GPU']
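As a quick sanity check (a small sketch, assuming the helper above is defined in the same file), an empty list means TensorFlow itself does not see any GPU, even if nvidia-smi does:

# Uses the helper defined above.
print(get_available_gpus())   # e.g. ['/device:GPU:0'] when a GPU is visible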

You can also use PyTorch's CUDA bindings to check whether CUDA is available and which device is current, and set the device if necessary:

import torch

print(torch.cuda.is_available())
print(torch.cuda.current_device())
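If a CUDA device does show up, you can also select it explicitly; a minimal sketch using the same PyTorch calls mentioned in the comments below:

import torch

# Only meaningful when at least one CUDA device is visible.
if torch.cuda.is_available():
    torch.cuda.set_device(0)              # select the first GPU
    print(torch.cuda.get_device_name(0))  # prints the GPU model name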

5 Comments

Thanks for your answer! I am able to view the available GPUs, also by running nvidia-smi. So I do know that there is an available GPU; it is just not activated when I run my code. And that's exactly the problem I want to solve.
Do you get any errors setting the device?
Hi Miles. Thanks again for your comment! I tried doing that, and I didn't get any errors. I ran the following commands, with these outputs: torch.cuda.is_available() --> True, torch.cuda.is_initialized() --> False, torch.cuda.set_device(0), torch.cuda.is_initialized() --> True. However, the processing speed does not go up and nvidia-smi still gives me 'no running processes', unfortunately.
I also ran the first command you proposed, and weirdly it doesn't return a GPU. The output is: [name: "/device:CPU:0" device_type: "CPU" memory_limit: 268435456 locality { } incarnation: 15573112437867445376 , name: "/device:XLA_CPU:0" device_type: "XLA_CPU" memory_limit: 17179869184 locality { } incarnation: 9660188961145538128 physical_device_desc: "device: XLA_CPU device" ]
If the first command doesn't return anything, the GPU isn't available to TensorFlow. There can be a couple of causes for this, but I would 1) check that the GPU is visible to the OS: lspci | grep VGA should return the NVIDIA GPU, and 2) check that the versions of tensorflow and CUDA support your GPU. What AMI are you using?
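A minimal sketch of the build check mentioned in the comment above (assuming the TF 1.x API): if the installed wheel was not compiled with CUDA, no GPU will ever be listed, regardless of drivers.

import tensorflow as tf

# False here means the installed wheel is a CPU-only build (plain
# 'tensorflow' rather than 'tensorflow-gpu' in the TF 1.x era), so
# device_lib will never list a GPU no matter what nvidia-smi shows.
print(tf.test.is_built_with_cuda())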

In the end it had to do with my tensorflow package! I had to uninstall tensorflow and install tensorflow-gpu. After that the GPU was automatically activated.

For documentation see: https://www.tensorflow.org/install/gpu
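A quick way to verify the switch worked, as a minimal sketch assuming the TF 1.x tensorflow-gpu package:

import tensorflow as tf

# With tensorflow-gpu installed, TensorFlow should be able to see and
# initialize the device.
print(tf.test.is_gpu_available())   # expect True
print(tf.test.gpu_device_name())    # expect something like '/device:GPU:0'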


Option 1) pre-installed drivers e.g. "AWS Deep Learning Base GPU AMI (Ubuntu 20.04)"

This AMI is documented at https://aws.amazon.com/releasenotes/aws-deep-learning-base-gpu-ami-ubuntu-20-04/ and can be found in the AWS EC2 web UI's "Launch instance" flow by searching for "gpu" under the "Quickstart AMIs" section (their search is terrible, btw). I believe it is maintained by Amazon.

I have tested it on a g5.xlarge, documented at https://aws.amazon.com/ec2/instance-types/g5/, which I believe is the most powerful single-Nvidia-GPU machine available (Nvidia A10G) as of December 2023. Make sure to use a US region, as prices are cheaper there; us-east-1 (North Virginia) was one of the cheapest when I checked, at 1.006 USD/hour, a negligible cost for most people in a developed country. Just make sure to shut down the VM each time so you don't keep paying!!!

Another working alternative is g4dn.xlarge, which is the cheapest GPU machine at 0.526 USD/hour on us-east-1 and runs an Nvidia T4. I don't think there's much point in it, though: it is only half the price of the most powerful GPU choice, so why not just go for the most powerful one, which might save you some of your precious time by making such interactive experiments faster? This one should only be a consideration when optimizing deployment costs.

Also, to get access to g5.xlarge, you first have to request that your vCPU limit be increased to 4 (otherwise you hit the error "You have requested more vCPU capacity than your current vCPU limit of 0"), since the GPU machines all seem to require at least 4 vCPUs. It is supremely annoying.

Once you finally get the instance and the image, running:

nvidia-smi 

just works and returns:

Tue Dec 19 18:43:59 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.12             Driver Version: 535.104.12   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A10G                    On  | 00000000:00:1E.0 Off |                    0 |
|  0%   18C    P8               9W / 300W |      4MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

This means the drivers are working, and from then on I managed to run several pieces of software that use the GPU and watch nvidia-smi show the GPU usage go up.
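For example, here is a minimal PyTorch sketch (assuming a CUDA-enabled torch build is installed; any framework would do) that you can leave running while watching nvidia-smi in another terminal:

import time
import torch

# Keep the GPU busy with matmuls so that nvidia-smi (in another terminal)
# shows a python process, memory usage, and non-zero utilization.
x = torch.rand(8000, 8000, device='cuda')
for _ in range(200):
    x = x @ x
    x = x / x.norm()              # keep the values from blowing up
torch.cuda.synchronize()
print(f'{torch.cuda.memory_allocated() / 2**20:.0f} MiB allocated on',
      torch.cuda.get_device_name(0))
time.sleep(30)                    # keep the process alive long enough to watch it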

The documentation page also links to https://docs.aws.amazon.com/dlami/latest/devguide/gs.html, a guide to the so-called "AWS Deep Learning AMI" (DLAMI), which appears to be a selection of deep learning AMI variants by AWS, though unfortunately many of the ones documented there use Amazon Linux (RPM-based) rather than Ubuntu.

A sample AWS CLI command that launches it is:

aws ec2 run-instances --image-id ami-095ff65813edaa529 --count 1 --instance-type g5.xlarge \
    --key-name <yourkey> --security-group-ids sg-<yourgroup>
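The rough boto3 equivalent from Python is sketched below; the key pair and security group placeholders are the same as in the CLI command above, and the region is an assumption (AMI ids are region-specific, so use the region where you found the AMI):

import boto3

# Rough equivalent of the CLI command above; fill in your own key pair
# name and security group id.
ec2 = boto3.client('ec2', region_name='us-east-1')
response = ec2.run_instances(
    ImageId='ami-095ff65813edaa529',
    InstanceType='g5.xlarge',
    MinCount=1,
    MaxCount=1,
    KeyName='<yourkey>',
    SecurityGroupIds=['sg-<yourgroup>'],
)
print(response['Instances'][0]['InstanceId'])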

Option 2) install the drivers yourself on the base Ubuntu image "Ubuntu Server 22.04 LTS (HVM)"

This option adds extra time to the installation, but it has the advantage of giving you a newer Ubuntu and greater understanding of what the image contains. Driver installation on Ubuntu 22.04 was super easy, so this is definitely a viable option.

Just pick the first Ubuntu AMI Amazon suggests when launching an instance and run:

sudo apt update
sudo apt install nvidia-driver-510 nvidia-utils-510
sudo reboot

and from there on nvidia-smi and everything else just works on g5.xlarge.

Related question: https://askubuntu.com/questions/1397934/how-to-install-nvidia-cuda-driver-on-aws-ec2-instance

