4
$\begingroup$

I’m studying the basics of ML and trying to train a random forest model on a .csv dataset in which each row contains the pixel values in the red, green and blue bands (each ranging from 0 to 255) plus a binary target variable (0 or 1). The pixels were sampled without replacement from 20 RGB images.

My question is: I have 2,000,000 rows, which are split into train and test data. But I noticed that some combinations of values appear more than once, and in both the train and test partitions. For example:

Train dataset (70%)

R G B Class
20 30 40 0
30 40 50 1
50 20 5 0

Test dataset (30%)

R G B Class
20 30 40 0
30 40 50 1
30 40 50 1
10 20 30 1

Is it OK for a random forest model to be trained and tested on a dataset whose train/test partitions contain occurrences of the same combination of values? Or is this a data leakage scenario? If so, how should I deal with it? Thank you very much.

$\endgroup$
4
  • $\begingroup$ Please clarify your specific problem or provide additional details to highlight exactly what you need. As it's currently written, it's hard to tell exactly what you're asking. $\endgroup$ Commented May 11 at 17:09
  • $\begingroup$ I tried to reword my question in the original post, but I would like to know if samples that coincidentally are both in train and test datasets can be a problem for a random forest model. For example see the tables from the post. Thanks! $\endgroup$ Commented May 12 at 0:24
  • $\begingroup$ It depends on the data-generating process and on how the model will be applied. Do all rows with the same combination of values have the same target? In practice, will you see similar matches when the model is applied? $\endgroup$ Commented May 14 at 3:36
  • $\begingroup$ Yes! If there is, for example, (40,30,20,1) in training, I found the same (40,30,20,1) in the test set. However, they come from different samples. In practice this will occur, because they are samples from satellite images (for example, one hundred pixels extracted as R,G,B values representing soil are expected to have similar values). I am now wondering whether removing duplicates from the csv would help? $\endgroup$ Commented May 14 at 22:06

1 Answer

5
$\begingroup$

The 20 images may share a common color pattern (e.g., green forest), which would be reflected in your data. So you should expect some <R, G, B> combinations to occur quite often.
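To get a feel for how common such repeats are, one option is to count how many test rows have an <R, G, B> triple that also appears in training, and to check whether any triple appears with both classes. The sketch below uses the small tables from the question; the helper name `overlap_report` is made up for illustration and is not part of any library.

```python
# Rows copied from the example tables in the question: (R, G, B, Class).
train = [(20, 30, 40, 0), (30, 40, 50, 1), (50, 20, 5, 0)]
test = [(20, 30, 40, 0), (30, 40, 50, 1), (30, 40, 50, 1), (10, 20, 30, 1)]


def overlap_report(train_rows, test_rows):
    """Hypothetical helper: summarize repeated RGB triples across partitions.

    Returns the fraction of test rows whose (R, G, B) triple also occurs
    in training, and the set of triples that occur with more than one class.
    """
    train_rgb = {row[:3] for row in train_rows}
    shared = sum(1 for row in test_rows if row[:3] in train_rgb)

    # Collect every class label observed for each RGB triple.
    labels_per_rgb = {}
    for row in train_rows + test_rows:
        labels_per_rgb.setdefault(row[:3], set()).add(row[3])
    conflicting = {rgb for rgb, labels in labels_per_rgb.items() if len(labels) > 1}

    return shared / len(test_rows), conflicting


fraction_shared, conflicting = overlap_report(train, test)
print(fraction_shared)  # 0.75: three of the four test rows repeat a training color
print(conflicting)      # set(): no color appears with both classes here
```

A high shared fraction by itself is expected for pixel data. Triples that appear with both classes are more informative, because no model that sees only R, G and B can separate them.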

Also, you should not remove the commonly occurring points in this case. These are pixels sampled from images, not a table of independent records, so repeated values carry real information about how often each color occurs, and removing them could badly affect your performance.
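If you still want to check whether the repeats make the test score look too optimistic, one possible approach (an assumption on my part, not something stated in the question) is to keep all rows but hold out whole images, so that no image contributes pixels to both partitions. This needs a record of which image each pixel came from; the `image_ids` list and the `split_by_image` helper below are hypothetical, and the toy rows are made up only to show the mechanics.

```python
import random


def split_by_image(rows, image_ids, test_fraction=0.3, seed=0):
    """Hypothetical helper: hold out whole images instead of single pixels.

    rows      -- list of (R, G, B, Class) tuples
    image_ids -- list of the same length giving each row's source image
    """
    unique_ids = sorted(set(image_ids))
    rng = random.Random(seed)
    rng.shuffle(unique_ids)

    # Reserve roughly test_fraction of the images (at least one) for testing.
    n_test = max(1, round(len(unique_ids) * test_fraction))
    test_ids = set(unique_ids[:n_test])

    train_rows = [r for r, i in zip(rows, image_ids) if i not in test_ids]
    test_rows = [r for r, i in zip(rows, image_ids) if i in test_ids]
    return train_rows, test_rows, test_ids


# Toy data: 20 images with 5 sampled pixels each (values are placeholders).
image_ids = [img for img in range(20) for _ in range(5)]
rows = [(img, img, img, img % 2) for img in image_ids]

train_rows, test_rows, held_out = split_by_image(rows, image_ids)
print(len(held_out), len(train_rows), len(test_rows))  # 6 70 30
```

If the score on held-out images is close to the score from a random row split, the repeated colors are probably not a concern. scikit-learn's `GroupShuffleSplit` offers the same kind of group-wise split if you are already using that library.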

$\endgroup$
