Option 1) pre-installed drivers e.g. "AWS Deep Learning Base GPU AMI (Ubuntu 20.04)"
This AMI is documented at: https://aws.amazon.com/releasenotes/aws-deep-learning-base-gpu-ami-ubuntu-20-04/ and can be found in the AWS EC2 web UI "Launch instance" page by searching for "gpu" under the "Quickstart AMIs" section (their search is terrible, btw). I believe it is maintained by Amazon.
I have tested it on a g5.xlarge, documented at: https://aws.amazon.com/ec2/instance-types/g5/ which I believe is currently the most powerful single Nvidia GPU machine available (Nvidia A10G) as of December 2023. Make sure to use a US region, as they are cheaper: us-east-1 (North Virginia) was one of the cheapest when I checked, at 1.006 USD / hour, which is a negligible cost for most people in a developed country. Just make sure to shut down the VM each time so you don't keep paying!!!
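If you want to script the shutdown, stopping (but not terminating) the instance from the AWS CLI should look roughly like this; the instance ID is a placeholder of course:

aws ec2 stop-instances --instance-ids i-<yourinstance>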
Another working alternative is g4dn.xlarge, which is the cheapest GPU machine at 0.526 USD / hour on us-east-1 and runs an Nvidia T4. I don't see much point in it though: it is only half the price of the most powerful single-GPU choice, so why not just go for the most powerful one, which might save you some of your precious time by making such interactive experiments faster? The g4dn should only be a consideration when optimizing deployment costs.
Also, to get access to g5.xlarge, you first have to request your vCPU limit to be increased to 4, as per the error message "You have requested more vCPU capacity than your current vCPU limit of 0", since the GPU machines all seem to require at least 4 vCPUs. It is supremely annoying.
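The limit request can presumably also be made from the CLI via the Service Quotas API; I believe the relevant EC2 quota is the one named "Running On-Demand G and VT instances", but look up its quota code with the first command before requesting the increase:

aws service-quotas list-service-quotas --service-code ec2 \
  --query "Quotas[?QuotaName=='Running On-Demand G and VT instances'].[QuotaCode,Value]"
aws service-quotas request-service-quota-increase --service-code ec2 \
  --quota-code <code-from-above> --desired-value 4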
Once you finally get the instance and the image, running:
nvidia-smi
just works and returns:
Tue Dec 19 18:43:59 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.12             Driver Version: 535.104.12   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A10G                    On  | 00000000:00:1E.0 Off |                    0 |
|  0%   18C    P8               9W / 300W |      4MiB / 23028MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
This means the drivers are working, and from then on I managed to run several pieces of software that use the GPU and watched nvidia-smi show the GPU usage go up.
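If you want to keep an eye on utilization while something runs, either of these works; the second is just nvidia-smi's built-in query mode sampling once per second:

watch -n 1 nvidia-smi
nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv -l 1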
The documentation page also links to: https://docs.aws.amazon.com/dlami/latest/devguide/gs.html which is a guide on the so-called "AWS Deep Learning AMI" (DLAMI), which appears to be a family of deep learning AMI variants maintained by AWS, though unfortunately many of the ones documented there use Amazon Linux (RPM-based) rather than Ubuntu.
A sample AWS CLI command that launches it is:
aws ec2 run-instances --image-id ami-095ff65813edaa529 --count 1 --instance-type g5.xlarge \
  --key-name <yourkey> --security-group-ids sg-<yourgroup>
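Then, to find the instance's public IP and SSH in (on the Ubuntu-based AMIs the default login user should be ubuntu; adjust the key and IP as needed):

aws ec2 describe-instances --filters "Name=instance-state-name,Values=running" \
  --query "Reservations[].Instances[].[InstanceId,PublicIpAddress]" --output text
ssh -i <yourkey>.pem ubuntu@<public-ip>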
Option 2) install the drivers yourself on the base Ubuntu image "Ubuntu Server 22.04 LTS (HVM)"
This option adds extra time to the installation, but it has the advantage of giving you a newer Ubuntu and greater understanding of what the image contains. Driver installation on Ubuntu 22.04 was super easy, so this is definitely a viable option.
Just pick the first Ubuntu AMI Amazon suggests when launching an instance and run:
sudo apt update
sudo apt install nvidia-driver-510 nvidia-utils-510
sudo reboot
and from then on nvidia-smi and everything else just works on g5.xlarge.
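For an extra sanity check that the kernel module actually loaded after the reboot, beyond nvidia-smi itself, something like:

lsmod | grep nvidia
cat /proc/driver/nvidia/version
nvidia-smi -L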
Related question: https://askubuntu.com/questions/1397934/how-to-install-nvidia-cuda-driver-on-aws-ec2-instance