dataset split for image classification

Question

I am trying to do image classification for 14 categories (around 1000 images for each cat). And i initially created two folders for training and validation. In this case, do I still need to set a validation split or a subset in a code? or I can use the whole files as train_ds and val_ds by deleting them

Folder names in the training and validation directory are same.

data_dir = 'trainingdatav1' data_val = 'Validationv1' train_ds = tf.keras.preprocessing.image_dataset_from_directory( data_dir, validation_split=0.1, #is it required if I'm gonna use the whole folders and files for training? subset="training", seed=123, image_size=(img_height, img_width), batch_size=batch_size) val_ds = tf.keras.preprocessing.image_dataset_from_directory( data_val, validation_split=0.8, #need to check subset="validation", seed=455, image_size=(img_height, img_width), batch_size=batch_size) num_classes = 14 model = tf.keras.Sequential([ layers.experimental.preprocessing.Rescaling(1./255, input_shape=(img_height, img_width, 3)), layers.Conv2D(16, 3, padding='same', activation='softmax'), layers.MaxPooling2D(), layers.Conv2D(32, 3, padding='same', activation='relu'), #from renu layers.MaxPooling2D(), layers.Conv2D(64, 3, padding='same', activation='relu'), layers.MaxPooling2D(), layers.Dropout(.2), #prevent overfitting layers.Flatten(), layers.Dense(128, activation='sigmoid'), layers.Dense(num_classes) ]) model.compile(optimizer='SGD', #adam loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True), metrics=['accuracy']) model.summary() epochs=50 history = model.fit( train_ds, validation_data=val_ds, epochs=epochs )

Another question is the overfitting issue - validation accuracy is not over 0.4 and val_loss is around 2.xxx. Suggestions from Stacexchange are:

Reduce the layers of the neural network.
Reduce the number of neurons in each layer of the network to reduce the number of parameters.
Add dropout and tune its rate.
Use L2 normalisation on the parameter weights and tune the lambda value.
If possible add more data for training.

Are there any other suggestions?

VRaina · Accepted Answer · 2021-04-16 00:06:26Z

If you have already split your training and validation sets into separate directories then there is no need to technically do the splitting in your code. However, the problem with a pre-defined validation set is that it can lead to overfitting more easily: the primary purpose of a validation set is to detect overfitting and if you keep tuning your hyperparameters for your model using a fixed validation set every time you train, then your model hyperparameters may be overfitting to that specific validation set.

Another problem with a fixed validation set is that it prevents you from exploiting approaches such as K-fold cross validation, where you split your data into K groups and each each group takes its turn as a validation set during training (prevents hyperparameter overfitting to a specific validation set as discussed above)...

With regard to your question on overfitting, your list is very comprehensive. However, I would add Early Stopping to that list - this you may already be doing - but good to mention explicitly.

Luis Ezequiel Muñoz · Accepted Answer · 2021-04-16 02:18:29Z

The problem with a pre-defined validation set is that it can lead to overfitting more easily: the primary purpose of a validation set is to detect overfitting and if you keep tuning your hyperparameters for your model using a fixed validation set every time you train, then your model hyperparameters may be overfitting to that specific validation set.

Stack Exchange Network

dataset split for image classification

2 Answers 2

Hot Network Questions

dataset split for image classification

2 Answers 2

Related

Hot Network Questions