
I am currently testing 5 different optimizers to compare their training loss and their test accuracy. The optimizers are AdaGrad, AdaDelta, RMSprop, Adam, and Nadam. I am using a fairly simple model with only two hidden layers, each with 1000 hidden nodes, and the dataset is CIFAR-10. I am also testing it with and without dropout. Here are my models and my setup:

    import matplotlib as mpl
    import matplotlib.pyplot as plt
    import numpy as np
    import os
    import pandas as pd
    import sklearn
    import sys
    import tensorflow as tf
    import time

    ## Load data
    (x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()

    ## Build Neural Network, take optimizer as parameter and compile
    def get_model(optimizer):
        model = tf.keras.Sequential()
        model.add(tf.keras.layers.Flatten(input_shape=(32, 32, 3)))
        model.add(tf.keras.layers.Dense(1000, activation='relu',
                                        kernel_regularizer='l2',
                                        kernel_initializer='he_normal'))  # Hidden
        model.add(tf.keras.layers.Dense(1000, activation='relu',
                                        kernel_regularizer='l2',
                                        kernel_initializer='he_normal'))  # Hidden
        model.add(tf.keras.layers.Dense(10, activation='softmax', name='output'))  # Output
        model.compile(optimizer=optimizer,
                      loss=tf.keras.losses.SparseCategoricalCrossentropy(),
                      metrics=[tf.keras.metrics.SparseCategoricalAccuracy()])
        return model

    optimizers = dict()
    optimizers['AdaGrad'] = tf.keras.optimizers.Adagrad()
    optimizers['AdaDelta'] = tf.keras.optimizers.Adadelta()
    optimizers['RMSProp'] = tf.keras.optimizers.RMSprop()
    optimizers['Adam'] = tf.keras.optimizers.Adam()
    optimizers['Nadam'] = tf.keras.optimizers.Nadam()

    ### Fit
    histories = dict()
    for optimizer in optimizers:
        print("Running", optimizer)
        model = get_model(optimizers[optimizer])
        history = model.fit(x_train, y_train, epochs=200, batch_size=128, verbose=1)
        histories[optimizer] = history

Next is my very similar code, but with dropout. I must admit that my proficiency with dropout in TensorFlow/Keras is not very high, so I may well be doing something wrong here. For example, I'm not sure on which side of a layer the dropout is supposed to go, or whether combining He initialization with dropout is acceptable.

    ## Build Neural Network with dropout
    def get_dropout_model(optimizer):
        model = tf.keras.Sequential()
        model.add(tf.keras.layers.Flatten(input_shape=(32, 32, 3)))
        model.add(tf.keras.layers.Dropout(0.2))
        model.add(tf.keras.layers.Dense(1000, activation='relu',
                                        kernel_initializer='he_normal'))  # Hidden
        model.add(tf.keras.layers.Dropout(0.5))
        model.add(tf.keras.layers.Dense(1000, activation='relu',
                                        kernel_initializer='he_normal'))  # Hidden
        model.add(tf.keras.layers.Dropout(0.5))
        model.add(tf.keras.layers.Dense(10, activation='softmax'))
        model.compile(optimizer=optimizer,
                      loss=tf.keras.losses.SparseCategoricalCrossentropy(),
                      metrics=[tf.keras.metrics.SparseCategoricalAccuracy()])
        return model

    ## Fit with dropout
    dropout_histories = dict()
    for optimizer in optimizers:
        print("Running", optimizer, "with dropout")
        model = get_dropout_model(optimizers[optimizer])
        history = model.fit(x_train, y_train, epochs=200, batch_size=128, verbose=1)
        dropout_histories[optimizer] = history

What my training process shows is that the loss monotonically decreases over the course of training, and for almost all of the optimizers it reaches an exceptionally low value. However, both training accuracy and test accuracy remain very bad. AdaGrad and AdaDelta are the exceptions: they achieved decent accuracy when not using dropout. With dropout, however, all models got an accuracy of 0.1, no better than a random guess.
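For context, below is a minimal sketch of how the recorded History objects could be plotted to compare loss and accuracy across optimizers. It assumes the histories dict from the code above and is purely illustrative, not the exact code behind the screenshots.

    # Compare training loss and training accuracy across optimizers
    # using the histories dict filled in the training loop above.
    fig, (ax_loss, ax_acc) = plt.subplots(1, 2, figsize=(12, 4))
    for name, history in histories.items():
        ax_loss.plot(history.history['loss'], label=name)
        ax_acc.plot(history.history['sparse_categorical_accuracy'], label=name)
    ax_loss.set_xlabel('epoch')
    ax_loss.set_ylabel('training loss')
    ax_acc.set_xlabel('epoch')
    ax_acc.set_ylabel('training accuracy')
    ax_loss.legend()
    ax_acc.legend()
    plt.show()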

Some of the logs from training: [training log screenshot]

Loss for AdaGrad monotonically decreases and accuracy steadily goes up.

Last 5 epochs of RMSprop: [training log screenshot]

Loss is very low, but training accuracy is bad.

Adam finishes with worse training accuracy than it started with (first few epochs ~0.2 accuracy): [training log screenshot]

Nadam: [training log screenshot]

AdaGrad with Dropout: [training log screenshot]

Final evaluation: [test-set evaluation screenshots]
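For reference, the test numbers come from evaluating on the test set with something like the model.evaluate sketch below. This is not my exact evaluation code, and since the loops above only keep the last trained model in model, in practice the evaluation has to happen per optimizer (or each model has to be stored).

    # Sketch of the test-set evaluation for one trained model.
    # With one compiled metric, evaluate() returns [loss, accuracy].
    test_loss, test_acc = model.evaluate(x_test, y_test, verbose=0)
    print("Test loss:", test_loss, "- test accuracy:", test_acc)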

Some ideas about what I might be doing wrong:

  • My loading of the data is wrong. Perhaps the labels need to be converted to one-hot vectors with tf.keras.utils.to_categorical() (see the sketch after this list).
  • I am using the wrong metric for this data. Perhaps sparse categorical accuracy is not applicable to the format of the dataset. From my understanding, though, it is, and besides, the first few optimizers do achieve a decent accuracy.
  • Something about my model architecture is wrong.
  • For the dropout results specifically, I am implementing dropout wrong. I know the weights need to be scaled at inference time when using dropout, but from my understanding Keras/TF does this automatically. If I do need to scale them manually, how would I go about doing that? (Also see the sketch after this list.)
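To make the first and last bullets concrete, here is a rough, purely illustrative sketch (not what I actually ran): the to_categorical variant of the setup, and a check that relies on Keras' Dropout layer being inverted dropout, i.e. the scaling happens during training and the layer is a no-op at inference (predict/evaluate run with training=False automatically).

    # First bullet: one-hot labels plus the non-sparse loss/metric.
    y_train_cat = tf.keras.utils.to_categorical(y_train, num_classes=10)
    y_test_cat = tf.keras.utils.to_categorical(y_test, num_classes=10)

    check_model = get_model(tf.keras.optimizers.Adam())
    check_model.compile(optimizer=tf.keras.optimizers.Adam(),
                        loss=tf.keras.losses.CategoricalCrossentropy(),
                        metrics=[tf.keras.metrics.CategoricalAccuracy()])
    # check_model.fit(x_train, y_train_cat, epochs=5, batch_size=128)

    # Last bullet: dropout is only active when training=True, so no manual
    # rescaling of the weights is needed for inference.
    drop_model = get_dropout_model(tf.keras.optimizers.Adam())
    x_sample = x_test[:8].astype('float32')
    preds_train_mode = drop_model(x_sample, training=True)   # dropout active
    preds_eval_mode = drop_model(x_sample, training=False)   # dropout disabled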

What is very strange is that there is no difference between the models except the optimizer used. I am using the same get_model function to generate each network! So if AdaGrad and AdaDelta get decent accuracies, then so should Adam, Nadam, and RMSprop with the exact same architecture.

If this were simply a case of my models overfitting the training data, and therefore generalizing poorly to the test set, then I would expect at least the training accuracy to be high. But this is not the case.

Any input on this would be greatly appreciated!


1 Answer


Have you tried adding an extra layer to your network? Adding another layer with 1000 neurons allows me to reach 40+% accuracy with the Adam-based optimizers, with some runs reaching more than 60% (without dropout). Using RMSProp with the same additional layer also seems to improve training accuracy, although it converges much more slowly. This doesn't seem to be just about the number of parameters: an additional layer with fewer neurons per layer (e.g. [1000, 500, 250], roughly 3.7M parameters vs. roughly 4M for the original two-layer network) still achieves better accuracy. I didn't have time to also test the networks with dropout, but I'd suggest first making sure the model can overfit the training data before adding dropout to prevent overfitting.
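Roughly, the variant I mean looks like the sketch below; it reuses the compile settings from the question's get_model, and the get_deeper_model name and the widths parameter are just illustrative.

    # Sketch of the three-hidden-layer variant (same compile settings as the
    # question's get_model); widths=(1000, 500, 250) gives the narrower version.
    def get_deeper_model(optimizer, widths=(1000, 1000, 1000)):
        model = tf.keras.Sequential()
        model.add(tf.keras.layers.Flatten(input_shape=(32, 32, 3)))
        for width in widths:
            model.add(tf.keras.layers.Dense(width, activation='relu',
                                            kernel_regularizer='l2',
                                            kernel_initializer='he_normal'))
        model.add(tf.keras.layers.Dense(10, activation='softmax'))
        model.compile(optimizer=optimizer,
                      loss=tf.keras.losses.SparseCategoricalCrossentropy(),
                      metrics=[tf.keras.metrics.SparseCategoricalAccuracy()])
        return model

    # e.g. model = get_deeper_model(tf.keras.optimizers.Adam(), widths=(1000, 500, 250))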

  • That's a useful comment for sure. Unfortunately my task was specified as using only 2 hidden layers, so I don't have that flexibility. What I am curious about, though, is whether you got the same testing (and training) accuracy as me when running with only 2 hidden layers, to confirm whether it is something specific to what I did or a more general result. (Commented Nov 9, 2022 at 21:18)
