I’m studying the basics of ML and trying to train a random forest model on a .csv dataset in which each row contains the pixel values of the red, green and blue bands (each ranging from 0 to 255) plus a binary target variable (0 or 1). The pixels were sampled without replacement from 20 RGB images.
My question is: I have 2,000,000 rows, which are split into train and test data. However, I noticed that some combinations of values appear more than once, and in both the train and test partitions. For example:
Train dataset (70%)
| R | G | B | Class |
|---|---|---|---|
| 20 | 30 | 40 | 0 |
| 30 | 40 | 50 | 1 |
| 50 | 20 | 5 | 0 |
Test dataset (30%)
| R | G | B | Class |
|---|---|---|---|
| 20 | 30 | 40 | 0 |
| 30 | 40 | 50 | 1 |
| 30 | 40 | 50 | 1 |
| 10 | 20 | 30 | 1 |
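
To make the overlap shown in the two tables above concrete, here is a minimal sketch of how it could be counted. It assumes pandas is available and that the columns are named `R`, `G`, `B` and `Class` (the names are an assumption based on the table headers). The toy frames simply reproduce the example rows, not the real 2,000,000-row data.

```python
import pandas as pd

# Toy data copied from the two example tables; column names are assumed.
train = pd.DataFrame(
    {"R": [20, 30, 50], "G": [30, 40, 20], "B": [40, 50, 5], "Class": [0, 1, 0]}
)
test = pd.DataFrame(
    {"R": [20, 30, 30, 10], "G": [30, 40, 40, 20], "B": [40, 50, 50, 30], "Class": [0, 1, 1, 1]}
)

features = ["R", "G", "B"]

# Unique RGB combinations that occur in the training partition.
train_keys = train[features].drop_duplicates()

# Left-join the test rows against those combinations. The indicator column
# "_merge" is "both" when a test row's RGB combination also occurs in train.
flagged = test.merge(train_keys, on=features, how="left", indicator=True)
seen_in_train = flagged["_merge"] == "both"

n_overlap = int(seen_in_train.sum())   # test rows whose RGB values also occur in train
share_overlap = n_overlap / len(test)  # fraction of the test partition affected
print(n_overlap, share_overlap)
```

On the real data the same check would run on the two partitions after loading them from the .csv file. The resulting share would show how much of the test partition consists of RGB combinations the model has already seen.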
Is it OK to train and test a random forest model on a dataset whose train/test partitions contain occurrences of the same combination of values? Or is this a data leakage scenario? If so, how should I deal with it? Thank you very much.