
I am having some difficulty understanding exactly why GPU and CPU speeds are similar with small networks (the CPU is sometimes faster), while the GPU is faster with larger networks. The code at the bottom of the question runs in 103.7 seconds on an i7-6700K, but when using tensorflow-gpu, it runs in 29.5 seconds.

However, when I train a network that has 100 hidden neurons instead of 1000 as in the example below, I get ~20 seconds when using the GPU and ~15 seconds when using the CPU.

I read in another Stack Overflow answer that CPU->GPU transfers take a long time; I'm assuming this refers to loading the data examples onto the GPU.

Can someone explain why this occurs, and possibly suggest some change I can make to the code to maximize speed?

import numpy as np
import tensorflow as tf
import keras
from keras.models import Sequential
from keras.utils import np_utils
from keras.layers.core import Dense, Activation, Flatten, Dropout
from sklearn.preprocessing import normalize

## Importing the MNIST dataset using Keras
from keras.datasets import mnist
(X_train, y_train), (X_test, y_test) = mnist.load_data()

# reshape for vector input
N, x, y = X_train.shape
X_train = normalize(np.reshape(X_train, (N, x * y)))

N, x, y = X_test.shape
X_test = normalize(np.reshape(X_test, (N, x * y)))

# one-hot encoding
y_train = np_utils.to_categorical(y_train)
y_test = np_utils.to_categorical(y_test)

model = Sequential()
model.add(Dense(output_dim=750, input_dim=784))
model.add(Activation('relu'))
model.add(Dropout(0.2))

model.add(Dense(150))
model.add(Activation('relu'))
model.add(Dropout(0.2))

model.add(Dense(50))
model.add(Activation('relu'))
model.add(Dropout(0.2))

model.add(Dense(50))
model.add(Activation('relu'))
model.add(Dropout(0.2))

model.add(Dense(10))
model.add(Activation('softmax'))

model.compile(loss='categorical_crossentropy', optimizer='Nadam', metrics=['accuracy'])

fit = model.fit(X_train, y_train, batch_size=128, nb_epoch=10, verbose=0)

## Printing the accuracy of our model, according to the loss function specified in model.compile above
score = model.evaluate(X_test, y_test, verbose=0)
print('Test score:', score[0])
print('Test accuracy:', score[1])
  • What GPU are you using? Note that completely saturating a top-of-the-line GPU requires tens of thousands of threads. Assuming each thread handles the computation of one neuron, a system with 100 neurons would underutilize the GPU. Conversely, if you were to increase the number of neurons to, say, 10K, the relative advantage of the GPU over the CPU is likely to increase further. Commented Feb 7, 2017 at 18:41
  • Whoops, totally forgot to include that. I have a GTX 1070. And I see; that makes sense. Commented Feb 7, 2017 at 18:42
  • I actually noticed the same behaviour on my GTX 1070 GPU. I don't see any difference between running my model (which has similar dimensions to the one you are using) on the CPU (i7-7700) and the GPU. I need to try increasing the capacity of the network to evaluate the difference. Commented Oct 4, 2017 at 7:36

1 Answer


In the case of tiny networks, batch loading may be the culprit here.

Keras loads each minibatch from RAM to the GPU at the start of each iteration, which creates a bottleneck for tiny networks (where the forward/backward computation itself is very quick). You can try using model.fit_generator instead of plain fit, so that the CPU thread that loads minibatches works in parallel with the GPU; a sketch of such a generator is shown below.
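
A minimal sketch, assuming the in-memory X_train/y_train arrays from the question; the batch_generator helper is hypothetical, and the samples_per_epoch/nb_epoch argument names follow the Keras 1.x API used in the question (newer Keras versions replace them with steps_per_epoch/epochs):

import numpy as np

# Hypothetical generator that shuffles and yields minibatches indefinitely,
# as fit_generator expects.
def batch_generator(X, y, batch_size=128):
    n = X.shape[0]
    while True:
        idx = np.random.permutation(n)
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            yield X[batch], y[batch]

# Keras 1.x-style call; in Keras 2 use steps_per_epoch (number of batches)
# and epochs instead.
fit = model.fit_generator(batch_generator(X_train, y_train, batch_size=128),
                          samples_per_epoch=X_train.shape[0],
                          nb_epoch=10,
                          verbose=0)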

Unfortunately, there is no way that I am aware of to preload the whole dataset onto the GPU for Keras (see my issue).

If you're using the TensorFlow backend, you can use the Google Timeline profiling tool to see what causes the slowdowns. For reference, see this issue. A sketch of collecting a timeline trace is shown below.
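
A minimal sketch of the TensorFlow 1.x timeline API, assuming a toy matmul graph as a stand-in (wiring the trace into a Keras fit() call takes additional plumbing); the resulting timeline.json can be opened in chrome://tracing:

import tensorflow as tf
from tensorflow.python.client import timeline

# Toy graph used only for illustration.
a = tf.random_normal([1000, 1000])
b = tf.random_normal([1000, 1000])
c = tf.matmul(a, b)

# Ask TensorFlow to record per-op timing for this run.
run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

with tf.Session() as sess:
    sess.run(c, options=run_options, run_metadata=run_metadata)

# Dump a Chrome-trace file showing where time was spent (CPU ops, GPU ops,
# and memory copies each appear as separate tracks).
trace = timeline.Timeline(step_stats=run_metadata.step_stats)
with open('timeline.json', 'w') as f:
    f.write(trace.generate_chrome_trace_format())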


4 Comments

Thanks, batch loading was the issue for me. Now it runs much faster.
Can you explain to me how to write a good generator as you described?
Not sure I understand what you mean by good; there are a couple of examples searchable with Google, like this one: kaggle.com/ezietsman/simple-keras-model-with-data-generator
Worth mentioning: slow performance on the GPU can sometimes be solved by using a cuDNN layer; see this question: stackoverflow.com/questions/41948406/…
