I am training a model using TensorFlow 2.18.0 with tf.distribute.MirroredStrategy across two GPUs. Training works fine on a single GPU, but when I run it on two GPUs the process crashes with a segmentation fault during validation.
Here is a snippet of my code:
from config import MainConfig
from dataset import dataset
from model2 import build_tf_model
from utils import CustomModelCheckpoint, get_lr_callback
import tensorflow as tf

checkpoint_callback_val = CustomModelCheckpoint(
    "models/val_model_{epoch:02d}_{val_acc_l:.1f}.keras",
    monitor="val_acc_l",
    save_best_only=True,
    mode="max",
    verbose=0,
    start_epoch=5,
)

gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)

input_shape = (MainConfig.sequence_length, MainConfig.features)

# numpy arrays of shape (samples, sequences, features)
train_sequences, train_labels, validation_sequences, validation_labels = dataset()

strategy = tf.distribute.MirroredStrategy(devices=["/GPU:0", "/GPU:1"])
with strategy.scope():
    model = build_tf_model(input_shape)

model.fit(
    train_sequences, train_labels,
    validation_data=(validation_sequences, validation_labels),
    epochs=MainConfig.epochs,
    shuffle=True,
    batch_size=MainConfig.train_batch_size,
    callbacks=[checkpoint_callback_val, get_lr_callback()],
)

I have tried the following for the validation dataset:
- Passing validation_sequences and validation_labels as NumPy arrays
- Using a tf.data.Dataset object
- Using a distributed dataset created with strategy.experimental_distribute_dataset

Regardless of these attempts, the segmentation fault persists when using two GPUs.
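For reference, the tf.data.Dataset and explicitly distributed variants I tried looked roughly like this. This is a sketch rather than the exact code from my runs; it reuses model, strategy, and MainConfig from the snippet above, and the batching details may have differed slightly:

# Build a tf.data.Dataset for validation from the same NumPy arrays
val_ds = tf.data.Dataset.from_tensor_slices(
    (validation_sequences, validation_labels)
).batch(MainConfig.train_batch_size)

# Variant 1: pass the tf.data.Dataset directly as validation_data
model.fit(
    train_sequences, train_labels,
    validation_data=val_ds,
    epochs=MainConfig.epochs,
    shuffle=True,
    batch_size=MainConfig.train_batch_size,
    callbacks=[checkpoint_callback_val, get_lr_callback()],
)

# Variant 2: distribute the validation dataset explicitly and pass that instead
dist_val_ds = strategy.experimental_distribute_dataset(val_ds)
model.fit(
    train_sequences, train_labels,
    validation_data=dist_val_ds,
    epochs=MainConfig.epochs,
    shuffle=True,
    batch_size=MainConfig.train_batch_size,
    callbacks=[checkpoint_callback_val, get_lr_callback()],
)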
File "/usr/local/lib/python3.11/dist-packages/tensorflow/python/ops/gen_experimental_dataset_ops.py", line 335 in auto_shard_dataset File "/usr/local/lib/python3.11/dist-packages/tensorflow/python/data/experimental/ops/distribute.py", line 74 in __init__ File "/usr/local/lib/python3.11/dist-packages/tensorflow/python/distribute/input_ops.py", line 56 in auto_shard_dataset File "/usr/local/lib/python3.11/dist-packages/tensorflow/python/distribute/input_lib.py", line 919 in _create_cloned_datasets_from_dataset File "/usr/local/lib/python3.11/dist-packages/tensorflow/python/distribute/input_lib.py", line 834 in build File "/usr/local/lib/python3.11/dist-packages/tensorflow/python/distribute/input_lib.py", line 804 in __init__ File "/usr/local/lib/python3.11/dist-packages/tensorflow/python/distribute/input_util.py", line 65 in get_distributed_dataset File "/usr/local/lib/python3.11/dist-packages/tensorflow/python/distribute/mirrored_strategy.py", line 592 in _experimental_distribute_dataset File "/usr/local/lib/python3.11/dist-packages/tensorflow/python/distribute/distribute_lib.py", line 1468 in experimental_distribute_dataset File "/usr/local/lib/python3.11/dist-packages/keras/src/backend/tensorflow/trainer.py", line 668 in __init__ File "/usr/local/lib/python3.11/dist-packages/keras/src/backend/tensorflow/trainer.py", line 334 in fit File "/usr/local/lib/python3.11/dist-packages/keras/src/utils/traceback_utils.py", line 117 in error_handler File "/mnt/train/train2.py", line 28 in <module> Has anyone encountered this issue or have any insights into why this might be happening with two GPUs? Are there any specific considerations or configurations needed for validation datasets when using MirroredStrategy with multiple GPUs?
Thank you in advance for your help!