
I am training a model using the Keras Python library to recognize images of drawings that belong to two artists. Here is the training output showing the fluctuations I am seeing:

Epoch 1/100
587/587 ━━━━━━━━━━━━━━━━━━━━ 906s 2s/step - accuracy: 0.5640 - loss: 0.7947 - val_accuracy: 0.4535 - val_loss: 2.7776
Epoch 2/100
587/587 ━━━━━━━━━━━━━━━━━━━━ 72s 120ms/step - accuracy: 0.6875 - loss: 0.6460 - val_accuracy: 0.4521 - val_loss: 2.3714
Epoch 3/100
587/587 ━━━━━━━━━━━━━━━━━━━━ 886s 2s/step - accuracy: 0.6059 - loss: 0.6735 - val_accuracy: 0.4670 - val_loss: 1.0623
Epoch 4/100
587/587 ━━━━━━━━━━━━━━━━━━━━ 72s 120ms/step - accuracy: 0.6562 - loss: 0.5518 - val_accuracy: 0.6021 - val_loss: 0.6561
Epoch 5/100
587/587 ━━━━━━━━━━━━━━━━━━━━ 874s 1s/step - accuracy: 0.6801 - loss: 0.5918 - val_accuracy: 0.5806 - val_loss: 0.7847
Epoch 6/100
587/587 ━━━━━━━━━━━━━━━━━━━━ 72s 120ms/step - accuracy: 0.5938 - loss: 0.6360 - val_accuracy: 0.5635 - val_loss: 1.1081
Epoch 7/100
587/587 ━━━━━━━━━━━━━━━━━━━━ 879s 1s/step - accuracy: 0.7032 - loss: 0.5737 - val_accuracy: 0.7170 - val_loss: 0.5536
Epoch 8/100
587/587 ━━━━━━━━━━━━━━━━━━━━ 71s 119ms/step - accuracy: 0.6562 - loss: 0.5647 - val_accuracy: 0.5729 - val_loss: 1.0483
Epoch 9/100
587/587 ━━━━━━━━━━━━━━━━━━━━ 887s 2s/step - accuracy: 0.7040 - loss: 0.5656 - val_accuracy: 0.5951 - val_loss: 1.3292

Why does one epoch take about 900 s while the next takes only about 70 s? Does anyone know why this is happening and whether it is considered normal behaviour?

Adding the code for more details:

STEP1: augmentation

import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras.preprocessing.image import ImageDataGenerator

image_gen = ImageDataGenerator(rotation_range=30, width_shift_range=0.1,
                               height_shift_range=0.1, rescale=1/255,
                               shear_range=0.2, zoom_range=0.2,
                               horizontal_flip=True, fill_mode='nearest')

STEP2: model architecture

model = tf.keras.models.Sequential([
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(300, 300, 3)),
    layers.MaxPooling2D(2, 2),
    layers.Conv2D(64, (3, 3), activation='relu', input_shape=(300, 300, 3)),
    layers.MaxPooling2D(2, 2),
    layers.Conv2D(64, (3, 3), activation='relu', input_shape=(300, 300, 3)),
    layers.MaxPooling2D(2, 2),
    layers.Conv2D(64, (3, 3), activation='relu', input_shape=(300, 300, 3)),
    layers.MaxPooling2D(2, 2),
    layers.Flatten(),
    layers.Dense(512, activation='relu'),
    layers.BatchNormalization(),
    layers.Dense(512, activation='relu'),
    layers.Dropout(0.1),
    layers.BatchNormalization(),
    layers.Dense(512, activation='relu'),
    layers.Dropout(0.2),
    layers.BatchNormalization(),
    layers.Dense(1, activation='sigmoid')
])

model.summary()

model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

train_datagen = image_gen.flow_from_directory('data2/train', target_size=(300, 300), batch_size=32, class_mode='binary')

test_datagen = image_gen.flow_from_directory('data2/test', target_size=(300, 300), batch_size=32, class_mode='binary')

train_datagen.class_indices

history = model.fit(train_datagen, epochs=100, steps_per_epoch=18814//32, validation_data=test_datagen, validation_steps=2911//32)

STEP3: saving the model

model.save('my_model_11.keras')

  • Obviously, every second epoch is short. My first assumption would be that the sampling process for your epochs is somehow screwed up, so that one epoch gets to see very few samples. If, for example, one epoch takes 90% of the samples and the next epoch only takes from the remaining ones, then you would see such effects. – Broele, commented Aug 22 at 19:41
  • To give you a non-speculative answer, you should probably add some more details about your setup and source code. As @Broele already stated, this could be due to the sampler. But it might also be that your sample images have different resolutions, and hence some take longer to process than others. Another reason might be that some other process is running on the same GPU (e.g. at a university a colleague might run a job in parallel and partially use the same GPU). One way to gather more insight (on NVIDIA devices) is to run "nvidia-smi" in a terminal and observe VRAM usage, GPU usage and the running jobs. – SDwarfs, commented Aug 26 at 12:36
  • Thank you for the comments. I have added the code for more details. – Commented Aug 26 at 16:55

1 Answer


Large, regular fluctuation in epoch times—alternating between around 900 seconds and 70 seconds—is a strong indicator of an underlying performance bottleneck rather than a bug in Keras. The comments from @Broele and @SDwarfs correctly point towards issues with the data pipeline and resource utilisation, which we will look into in detail in what follows.

Likely Diagnosis: Filesystem Caching and I/O Bottlenecks

The most likely cause of this odd/even pattern is the interaction between the data-loading process and the operating system's filesystem cache (page cache). This is a core OS feature that uses available RAM to store recently accessed data from slower storage devices such as an SSD or HDD.

The alternating times are a classic case of the filesystem cache being successively populated and then partially evicted. The slow epochs (the first, third, etc.) represent a "cold read" phase. This occurs when the required training data is not fully available in the cache, forcing the system to read images directly from disk. While this is common when a script first runs, the key is that the pattern recurs. The fast epochs (the second, fourth, etc.) happen because the preceding slow epoch has populated the cache, turning subsequent file requests into "cache hits" that read from fast RAM. The cycle resets after each fast epoch because the validation step reads thousands of different files, causing cache eviction and forcing the next training epoch to read from disk again.

Resolving the Bottleneck

To achieve consistent performance, we can move from relying on this unpredictable, OS-level cache to a deliberate, in-application caching strategy.

The most robust solution is to replace the ImageDataGenerator with the modern tf.data API. This gives us granular control to build a more performant pipeline. Specifically, we can use the dataset.cache() transformation. This function explicitly creates a stable cache for the dataset in memory (or a local file) after the data is loaded during the first epoch. This should ensure that all subsequent epochs read from a predictable, high-speed source, completely bypassing the disk I/O bottleneck and the variability of the OS page cache.
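A minimal sketch of such a pipeline is shown below. It assumes the directory layout and settings from the question ('data2/train', 'data2/test', 300x300 images, batch size 32) and approximates the original augmentation with Keras preprocessing layers; the exact layers and factors are illustrative, not a drop-in equivalent of the original ImageDataGenerator.

import tensorflow as tf

IMG_SIZE = (300, 300)   # matches target_size in the question
BATCH = 32
AUTOTUNE = tf.data.AUTOTUNE

# Labels are inferred from the sub-directory names, as with flow_from_directory.
train_ds = tf.keras.utils.image_dataset_from_directory(
    'data2/train', image_size=IMG_SIZE, batch_size=BATCH, label_mode='binary')
val_ds = tf.keras.utils.image_dataset_from_directory(
    'data2/test', image_size=IMG_SIZE, batch_size=BATCH, label_mode='binary')

# Augmentation as preprocessing layers, roughly mirroring the original settings.
augment = tf.keras.Sequential([
    tf.keras.layers.RandomFlip('horizontal'),
    tf.keras.layers.RandomRotation(30 / 360),    # ~30 degrees
    tf.keras.layers.RandomTranslation(0.1, 0.1),
    tf.keras.layers.RandomZoom(0.2),
])
rescale = tf.keras.layers.Rescaling(1.0 / 255)

# cache() keeps the decoded images after the first pass. With no argument it
# caches in RAM; with a filename (e.g. .cache('train.cache')) it caches to a
# local file, which is safer if ~19k decoded 300x300 images do not fit in memory.
# Note: caching after the loader's shuffle fixes the batch order from epoch 1
# onward; a .shuffle(...) can be inserted after .cache() if reshuffling matters.
train_ds = (train_ds
            .cache()
            .map(lambda x, y: (augment(rescale(x), training=True), y),
                 num_parallel_calls=AUTOTUNE)
            .prefetch(AUTOTUNE))

# Validation pipeline: rescaling only, no augmentation.
val_ds = (val_ds
          .map(lambda x, y: (rescale(x), y), num_parallel_calls=AUTOTUNE)
          .cache()
          .prefetch(AUTOTUNE))

# Training then uses the datasets directly; steps_per_epoch and
# validation_steps are no longer needed:
# history = model.fit(train_ds, validation_data=val_ds, epochs=100)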

Also, a critical error in the original setup is using the same augmented generator for both training and validation. A separate, non-augmenting data pipeline should be used for the validation set to ensure a consistent measure of model performance.
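If you prefer to keep ImageDataGenerator for now, the minimal fix is a second, augmentation-free generator for the validation data. The variable names below are illustrative:

from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Augmentation only for the training data.
train_gen = ImageDataGenerator(rotation_range=30, width_shift_range=0.1,
                               height_shift_range=0.1, rescale=1/255,
                               shear_range=0.2, zoom_range=0.2,
                               horizontal_flip=True, fill_mode='nearest')

# Validation data: rescaling only, no random transformations.
val_gen = ImageDataGenerator(rescale=1/255)

train_data = train_gen.flow_from_directory('data2/train', target_size=(300, 300),
                                           batch_size=32, class_mode='binary')
val_data = val_gen.flow_from_directory('data2/test', target_size=(300, 300),
                                       batch_size=32, class_mode='binary',
                                       shuffle=False)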

Diagnostics and Architectural Adjustments

To confirm the diagnosis, we can monitor GPU utilisation during training. Consistently low utilisation during the slow epochs would verify that the GPU is idling while waiting for data, confirming an I/O or CPU bottleneck.
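As a complement to watching nvidia-smi in a separate terminal, a simple Keras callback can record per-epoch wall-clock times so the slow/fast alternation can be tracked programmatically. This is a generic sketch; the class name EpochTimer is just an illustrative choice:

import time
import tensorflow as tf

class EpochTimer(tf.keras.callbacks.Callback):
    """Records wall-clock time per epoch (including validation), so the
    alternating pattern can be correlated with GPU utilisation."""

    def on_train_begin(self, logs=None):
        self.epoch_times = []

    def on_epoch_begin(self, epoch, logs=None):
        self._start = time.time()

    def on_epoch_end(self, epoch, logs=None):
        elapsed = time.time() - self._start
        self.epoch_times.append(elapsed)
        print(f'Epoch {epoch + 1}: {elapsed:.1f}s')

# Usage: pass an instance via the callbacks argument of model.fit(), e.g.
# history = model.fit(train_ds, validation_data=val_ds, epochs=100,
#                     callbacks=[EpochTimer()])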

Finally, the model architecture seems generally OK, but for clarity, the input_shape argument should only be specified on the first layer of a Keras Sequential model. Subsequent layers automatically infer their input shapes, so including the argument on intermediate Conv2D layers is redundant.
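For reference, a cleaned-up version of the same architecture, with the shape declared only on the first layer, might look like this:

import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.models.Sequential([
    # The input shape is declared once, on the first layer only.
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(300, 300, 3)),
    layers.MaxPooling2D(2, 2),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D(2, 2),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D(2, 2),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D(2, 2),
    layers.Flatten(),
    layers.Dense(512, activation='relu'),
    layers.BatchNormalization(),
    layers.Dense(512, activation='relu'),
    layers.Dropout(0.1),
    layers.BatchNormalization(),
    layers.Dense(512, activation='relu'),
    layers.Dropout(0.2),
    layers.BatchNormalization(),
    layers.Dense(1, activation='sigmoid'),
])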

References

Gregg, B. (n.d.). Linux Performance: File System Caching. brendangregg.com. Retrieved from https://www.brendangregg.com/linuxperf.html#FileSystemCaching

Red Hat. (2023). What is the Linux page cache? Red Hat Sysadmin. Retrieved from https://www.redhat.com/sysadmin/linux-page-cache

TensorFlow Team. (n.d.). Better performance with the tf.data API: Caching. TensorFlow. Retrieved from https://www.tensorflow.org/guide/data_performance#caching
