When the same combination of variable values appears in both train and test datasets, is it data leakage?

Question

I’m studying the basics of ML and trying to train a random forest model on a .csv dataset in which each row contains the pixel values in the red, green, and blue bands (each ranging from 0 to 255) plus a binary target variable (0 or 1). The pixels were sampled without replacement from 20 RGB images.

My question is: I have 2,000,000 rows, which are split into train and test data. I noticed that some combinations appear more than once, and in both the train and test partitions. For example:

Train dataset (70%)

R	G	B	Class
20	30	40	0
30	40	50	1
50	20	5	0

Test dataset (30%)

R	G	B	Class
20	30	40	0
30	40	50	1
30	40	50	1
10	20	30	1

Is it OK for a random forest model to be trained and tested on a dataset whose train/test partitions contain occurrences of the same combination of values? Or is this a data leakage scenario? If so, how should I deal with it? Thank you very much.

Please clarify your specific problem or provide additional details to highlight exactly what you need. As it's currently written, it's hard to tell exactly what you're asking. — Community
– Community Bot, Commented May 11 at 17:09
I tried to reword my question in the original post, but I would like to know whether samples that coincidentally appear in both the train and test datasets can be a problem for a random forest model. For example, see the tables in the post. Thanks! — Kol Rocket
– Kol Rocket, Commented May 12 at 0:24
It depends on the data generation process and how the model will be applied. Do all occurrences of the same combination of values have the same target? In practice, will you see similar matches when the model is applied? — Lucas Morin
– Lucas Morin, Commented May 14 at 3:36
Yes! If there is, for example, (40,30,20,1) in training, I found the same (40,30,20,1) in the test set. However, they come from different samples. In reality this will occur because they are samples from satellite images (for example, extracting one hundred pixels whose R,G,B values represent soil, which I expect to have similar values). I am now wondering whether removing duplicates from the csv would help? — Kol Rocket
– Kol Rocket, Commented May 14 at 22:06
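
The check raised in the comments (does each combination of values always carry the same target?) can be sketched in a few lines. This is only an illustration using the standard library and the rows from the example tables in the question; the helper names `labels_per_combination` and `conflicting_combinations` are made up for this sketch, and on the real 2,000,000-row file the rows would be read from the .csv instead.

```python
from collections import defaultdict

# Rows are (R, G, B, Class), copied from the example tables in the question.
train = [(20, 30, 40, 0), (30, 40, 50, 1), (50, 20, 5, 0)]
test = [(20, 30, 40, 0), (30, 40, 50, 1), (30, 40, 50, 1), (10, 20, 30, 1)]


def labels_per_combination(rows):
    """Map each (R, G, B) combination to the set of class labels seen with it."""
    labels = defaultdict(set)
    for r, g, b, cls in rows:
        labels[(r, g, b)].add(cls)
    return dict(labels)


def conflicting_combinations(rows):
    """Return only the combinations that appear with more than one class label."""
    return {
        rgb: seen
        for rgb, seen in labels_per_combination(rows).items()
        if len(seen) > 1
    }


conflicts = conflicting_combinations(train + test)
print(conflicts)  # {} for the example rows: every combination has a single label
```

An empty result means repeated combinations are at least consistent with each other; a non-empty one would show colours that no model can classify perfectly from R, G, B alone.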

The_Data_Scientist_Man · Accepted Answer · 2025-05-15 14:23:15Z

The 20 images may share a common color pattern (e.g., green forest), which would be reflected in your data. So you should expect some <R, G, B> combinations to occur quite often.

Also, you should not remove the commonly occurring points in this case. These are pixels from images, not regular numerical or sample data, and removing them could badly affect your model's performance.
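
Without removing anything, it can still be useful to measure how much of the test set repeats colours already present in training. The following is a minimal sketch under the same assumptions as the example tables in the question (rows as `(R, G, B, Class)` tuples); `overlap_report` is a hypothetical helper name, not part of any library.

```python
# Rows are (R, G, B, Class), copied from the example tables in the question.
train = [(20, 30, 40, 0), (30, 40, 50, 1), (50, 20, 5, 0)]
test = [(20, 30, 40, 0), (30, 40, 50, 1), (30, 40, 50, 1), (10, 20, 30, 1)]


def overlap_report(train_rows, test_rows):
    """Count test rows whose (R, G, B) values also occur in the training rows."""
    if not test_rows:
        raise ValueError("test_rows must not be empty")
    train_rgb = {row[:3] for row in train_rows}  # unique colours in training
    seen = sum(1 for row in test_rows if row[:3] in train_rgb)
    return {
        "test_rows": len(test_rows),
        "seen_in_train": seen,
        "unseen_in_train": len(test_rows) - seen,
        "seen_fraction": seen / len(test_rows),
    }


report = overlap_report(train, test)
print(report)
# {'test_rows': 4, 'seen_in_train': 3, 'unseen_in_train': 1, 'seen_fraction': 0.75}
```

If the fraction is high, one option is to report the test score separately for the "seen" and "unseen" rows, which shows how much the overall number depends on repeated colours while keeping every pixel in the data.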
