
I have a machine with CUDA 10.1 and both tensorflow and tensorflow-gpu 1.14.0 installed. I am running a Python script that trains a CNN in a virtualenv. I indicate in the source code that I want to use the GPU, as follows:

    import os
    os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    os.environ["CUDA_VISIBLE_DEVICES"] = "0"

However, when I run the script, the training epochs take a long time to finish. Here is the output of nvidia-smi:

[Image: nvidia-smi output]

What I find strange is that the GPU utilization is so low and that my Python script does not appear in the process list. Here are the outputs of some commands I have tried:

    >>> import tensorflow as tf
    >>> sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))

The output is

    2019-10-14 09:53:12.674719: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
    2019-10-14 09:53:12.679047: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
    2019-10-14 09:53:12.784993: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
    2019-10-14 09:53:12.785744: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55f155c59650 executing computations on platform CUDA. Devices:
    2019-10-14 09:53:12.785771: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): GeForce RTX 2080 Ti, Compute Capability 7.5
    2019-10-14 09:53:12.806453: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3600000000 Hz
    2019-10-14 09:53:12.807345: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55f15605dfc0 executing computations on platform Host. Devices:
    2019-10-14 09:53:12.807408: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): ,
    2019-10-14 09:53:12.807829: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
    2019-10-14 09:53:12.808859: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
    name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
    pciBusID: 0000:01:00.0
    2019-10-14 09:53:12.809148: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcudart.so.10.0'; dlerror: libcudart.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64:
    2019-10-14 09:53:12.809313: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcublas.so.10.0'; dlerror: libcublas.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64:
    2019-10-14 09:53:12.809481: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcufft.so.10.0'; dlerror: libcufft.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64:
    2019-10-14 09:53:12.809531: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcurand.so.10.0'; dlerror: libcurand.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64:
    2019-10-14 09:53:12.809572: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcusolver.so.10.0'; dlerror: libcusolver.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64:
    2019-10-14 09:53:12.809611: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcusparse.so.10.0'; dlerror: libcusparse.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64:
    2019-10-14 09:53:12.811997: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
    2019-10-14 09:53:12.812038: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1663] Cannot dlopen some GPU libraries. Skipping registering GPU devices...
    2019-10-14 09:53:12.812059: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
    2019-10-14 09:53:12.812067: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187] 0
    2019-10-14 09:53:12.812072: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0: N
    Device mapping:
    /job:localhost/replica:0/task:0/device:XLA_GPU:0 -> device: XLA_GPU device
    /job:localhost/replica:0/task:0/device:XLA_CPU:0 -> device: XLA_CPU device
    2019-10-14 09:53:12.812372: I tensorflow/core/common_runtime/direct_session.cc:296] Device mapping:
    /job:localhost/replica:0/task:0/device:XLA_GPU:0 -> device: XLA_GPU device
    /job:localhost/replica:0/task:0/device:XLA_CPU:0 -> device: XLA_CPU device

Another command I tried is

    >>> with tf.Session() as sess:
    ...     devices = sess.list_devices()

The output is

    2019-10-14 09:55:52.398317: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
    2019-10-14 09:55:52.399249: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
    name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
    pciBusID: 0000:01:00.0
    2019-10-14 09:55:52.399355: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcudart.so.10.0'; dlerror: libcudart.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64:
    2019-10-14 09:55:52.399399: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcublas.so.10.0'; dlerror: libcublas.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64:
    2019-10-14 09:55:52.399437: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcufft.so.10.0'; dlerror: libcufft.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64:
    2019-10-14 09:55:52.399475: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcurand.so.10.0'; dlerror: libcurand.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64:
    2019-10-14 09:55:52.399509: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcusolver.so.10.0'; dlerror: libcusolver.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64:
    2019-10-14 09:55:52.399544: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcusparse.so.10.0'; dlerror: libcusparse.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64:
    2019-10-14 09:55:52.399552: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
    2019-10-14 09:55:52.399557: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1663] Cannot dlopen some GPU libraries. Skipping registering GPU devices...
    2019-10-14 09:55:52.402143: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
    2019-10-14 09:55:52.402162: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]

Finally, I also tried this

    >>> from tensorflow.python.client import device_lib
    >>> print(device_lib.list_local_devices())

With the following output

    2019-10-14 10:00:52.389511: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
    2019-10-14 10:00:52.390582: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
    name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
    pciBusID: 0000:01:00.0
    2019-10-14 10:00:52.390741: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcudart.so.10.0'; dlerror: libcudart.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64:
    2019-10-14 10:00:52.390811: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcublas.so.10.0'; dlerror: libcublas.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64:
    2019-10-14 10:00:52.390854: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcufft.so.10.0'; dlerror: libcufft.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64:
    2019-10-14 10:00:52.390897: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcurand.so.10.0'; dlerror: libcurand.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64:
    2019-10-14 10:00:52.390934: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcusolver.so.10.0'; dlerror: libcusolver.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64:
    2019-10-14 10:00:52.390968: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcusparse.so.10.0'; dlerror: libcusparse.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64:
    2019-10-14 10:00:52.390975: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
    2019-10-14 10:00:52.390980: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1663] Cannot dlopen some GPU libraries. Skipping registering GPU devices...
    2019-10-14 10:00:52.390990: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
    2019-10-14 10:00:52.390994: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187] 0
    2019-10-14 10:00:52.390998: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0: N
    [name: "/device:CPU:0"
    device_type: "CPU"
    memory_limit: 268435456
    locality {
    }
    incarnation: 17281747132467712783
    , name: "/device:XLA_GPU:0"
    device_type: "XLA_GPU"
    memory_limit: 17179869184
    locality {
    }
    incarnation: 3885020928213180904
    physical_device_desc: "device: XLA_GPU device"
    , name: "/device:XLA_CPU:0"
    device_type: "XLA_CPU"
    memory_limit: 17179869184
    locality {
    }
    incarnation: 15667518323180153095
    physical_device_desc: "device: XLA_CPU device"
    ]

Interestingly, when I run these commands, the python process appears in the NVIDIA-SMI monitor.

What am I missing here?

2 Answers


From your log:

Could not dlopen library 'libcudart.so.10.0'; dlerror: libcudart.so.10.0: cannot open shared object file: No such file or directory;

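You installed CUDA 10.1, but tensorflow-gpu 1.14.0 is built against CUDA 10.0: it looks for libcudart.so.10.0 and the other *.so.10.0 libraries, which a 10.1 installation does not provide. Because those libraries cannot be loaded, TensorFlow skips registering the GPU and training falls back to the CPU. You need to install CUDA 10.0. There is no need to uninstall 10.1; the two versions can coexist.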
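To see which of these libraries the dynamic loader can actually find, something like the following may help. It is only a rough sketch and not part of TensorFlow: the helper name `missing_cuda_libs` and its `loader` parameter are made up for illustration, and the library names are copied from the "Could not dlopen library" lines in the log.

```python
import ctypes

# Library names taken from the "Could not dlopen library" lines in the log.
CUDA_10_0_LIBS = [
    "libcudart.so.10.0",
    "libcublas.so.10.0",
    "libcufft.so.10.0",
    "libcurand.so.10.0",
    "libcusolver.so.10.0",
    "libcusparse.so.10.0",
]


def missing_cuda_libs(names=CUDA_10_0_LIBS, loader=ctypes.CDLL):
    """Return the entries of `names` that the dynamic loader cannot open.

    `loader` is a hypothetical hook so the check can be exercised on a
    machine without CUDA; by default it is ctypes.CDLL, which raises
    OSError when the shared object cannot be loaded.
    """
    missing = []
    for name in names:
        try:
            loader(name)
        except OSError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    for lib in missing_cuda_libs():
        print("cannot load", lib)
```

Run it from the same shell and virtualenv as the training script, so that LD_LIBRARY_PATH is the one TensorFlow sees. After a working CUDA 10.0 install, it should print nothing.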


1 Comment

Let me try that. Thank you so much!

I recently sent some friends instructions for installing CUDA and tensorflow-gpu using conda (because this is the fastest way). After some searching on the internet, my protocol is this:

    ##########################
    # Install Miniconda
    ##########################
    mkdir -p ~/install
    cd ~/install
    wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
    # I guess on a mac you should do
    # wget https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh
    bash Miniconda3-latest-Linux-x86_64.sh

    #########################
    # install nvidia driver
    # so these are the linux (ubuntu) commands
    # for mac, maybe one should follow the scheme
    # removing nvidia drivers first
    # and then download newest nvidia driver
    # and install it
    # and reboot
    #
    # If you are using a laptop without gpu, just skip this block
    #########################
    sudo apt purge nvidia-* # remove all nvidia driver first
    sudo add-apt-repository ppa:graphics-drivers/ppa
    sudo apt install nvidia-driver-418
    sudo apt install nvidia-cuda-toolkit
    # reboot
    sudo reboot

    #########################
    # install machine learning stuff keras tensorflow-gpu
    #
    # if you are installing in a laptop without gpu,
    # replace 'tensorflow-gpu' by 'tensorflow'!
    #########################
    conda create --name keras
    conda activate keras
    conda install python ipython jupyter pandas scipy seaborn scikit-learn tensorflow-gpu keras pytest openpyxl graphviz

    #########################
    # finally, test a successful installation by:
    # entering: ipython
    # and there trying:
    from tensorflow.python.client import device_lib
    print(device_lib.list_local_devices())
    # should list gpu
    # sth like:
    # physical_device_desc: "device: 0, name: GeForce GTX 1050 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1"
    # , name: "/device:XLA_GPU:0"
    # device_type: "XLA_GPU"
    # memory_limit: 17179869184
    # locality {
    # }
    # incarnation: 14085000268159177816
    # physical_device_desc: "device: XLA_GPU device"
    # ]
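When reading the output of that final check, note that an `XLA_GPU` entry on its own does not mean the regular GPU device was registered: the output in the question lists `/device:XLA_GPU:0` even though the log says "Skipping registering GPU devices". A stricter check could look like the sketch below. It assumes the TF 1.x `device_lib` API used above, and `gpu_device_names` is just a helper name I made up.

```python
def gpu_device_names(devices):
    """Return the names of devices whose device_type is exactly "GPU".

    XLA_GPU and XLA_CPU entries are deliberately ignored.
    """
    return [d.name for d in devices if d.device_type == "GPU"]


if __name__ == "__main__":
    try:
        # Imported here so the helper above is usable without TensorFlow.
        from tensorflow.python.client import device_lib
    except ImportError:
        print("TensorFlow is not installed in this environment")
    else:
        names = gpu_device_names(device_lib.list_local_devices())
        print(names if names else "no GPU device registered")
```

If this reports that no GPU device is registered, the "Could not dlopen library" lines in the TensorFlow log are the first place to look.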
