In the official paper "Skin Lesion Analysis Toward Melanoma Detection 2018: A Challenge Hosted by the International Skin Imaging Collaboration (ISIC)", the organizers released a dataset of over 12,500 images across three tasks in skin image analysis. I am using this dataset for research, focusing specifically on Task 1 (skin lesion segmentation). The paper describes this task as follows:
"For Part 1, 2,594 dermoscopic images with ground truth segmentation masks were provided for training. For validation and test sets, 100 and 1,000 images were provided, respectively, without ground truth masks."
The ground truth masks for the validation and test sets have since been released. The dataset can be found here.
Now I am unsure whether to use the official split to develop the segmentation model or to define a custom split. My main concern is that a 100-image validation set may be too small to give stable metrics for model checkpointing and early stopping.
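To give a rough sense of the noise: if the per-image Dice scores had a standard deviation of around 0.15 (purely an illustrative figure, not measured on this dataset), the standard error of the mean over 100 images would be about 0.15 / sqrt(100) = 0.015, so the validation Dice could wander by roughly a point or more between epochs from sampling variability alone, which is on the order of the improvements that checkpointing and early stopping are supposed to detect.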
Another paper, "LAMFFNet: Lightweight Adaptive Multi-layer Feature Fusion Network for Medical Image Segmentation", states:
"The ISIC 2018 dataset is a challenging and representative 2D skin lesion boundary segmentation dataset in the field of computer-aided diagnosis. It contains 2,594 JPEG dermatoscopy images and 2,594 PNG ground truth (GT) images. The segmentation task uses 2,594 images as training, while the testing set has 1,000 images."
According to this description, they appear to have ignored the official validation set, and it is not clear whether they carved an internal validation set out of the training data.
So I am unsure which protocol to follow. The options I am considering are:

1. Split the official training set 85/15 into internal training and validation sets, and merge the original validation and test sets into a single 1,100-image test set.
2. Follow the official split and use the original 100-image validation set for validation.
3. Recombine all of the data and run 5-fold cross-validation.
4. Pick one of the splits above (e.g., the 85/15 internal split), repeat the experiment multiple times on that same split, and report the average.

Is any of these a reasonable approach, or is there a safer alternative?
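To make options 1 and 3 concrete, here is a minimal sketch of the splitting I have in mind, using scikit-learn; the directory names and file extensions are placeholders rather than the actual ISIC 2018 folder layout.

```python
# Rough sketch of options 1 and 3, assuming scikit-learn is installed.
# Directory names below are placeholders, not the real ISIC 2018 layout.
from pathlib import Path
from sklearn.model_selection import KFold, train_test_split

train_imgs = sorted(Path("Task1_Training_Input").glob("*.jpg"))   # 2,594 images
val_imgs = sorted(Path("Task1_Validation_Input").glob("*.jpg"))   # 100 images
test_imgs = sorted(Path("Task1_Test_Input").glob("*.jpg"))        # 1,000 images

# Option 1: 85/15 internal split of the official training set,
# then report results on the merged 1,100-image validation + test set.
tr, va = train_test_split(train_imgs, test_size=0.15, random_state=42)
held_out = val_imgs + test_imgs

# Option 3: pool everything and run 5-fold cross-validation,
# reporting the mean metric across the 5 folds.
all_imgs = train_imgs + val_imgs + test_imgs
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for fold, (tr_idx, va_idx) in enumerate(kf.split(all_imgs)):
    fold_train = [all_imgs[i] for i in tr_idx]
    fold_val = [all_imgs[i] for i in va_idx]
    # train one model per fold and average Dice/IoU across folds
```

(The corresponding ground truth masks would of course be split with the same indices; I left that out to keep the sketch short.)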