9
$\begingroup$

I am new to ML and am working on a dataset with a lot of high-cardinality categorical variables.

I have noticed that in a lot of encoding tutorials, like here, the encoding is applied after the train/test split.

Can I check why is it done so?

Why can't we apply the encoding even before the train test split?

Can't we apply the encoding to the full dataset and after encoding, split it into train and test sets?

What difference does it make?

$\endgroup$

3 Answers

12
$\begingroup$

If you perform the encoding before the split, it will lead to data leakage (train-test contamination). The encoder is then fitted on rows that later end up in the validation or test set (for example, the integers a label encoder assigns), and your model is trained on features built with that information. This affects the end results: good validation scores but poor performance in deployment.

Once the categories in the train and validation data have been matched up, call fit_transform on the train data and only transform on the validation data, so that the validation data is encoded with the mappings learned from the train data.

Almost all feature engineering steps, such as standardisation and normalisation, should be fitted after the train/test split. Hope it helps.
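To make the fit-on-train, transform-on-validation pattern concrete, here is a minimal sketch in plain Python. The helper names `fit_label_map` and `transform`, the toy `city` column, and the choice of `-1` for unseen categories are all assumptions for illustration; they stand in for whatever library encoder you actually use.

```python
def fit_label_map(values):
    """Learn a category -> integer mapping from the training column only."""
    return {cat: idx for idx, cat in enumerate(sorted(set(values)))}


def transform(values, mapping, unknown=-1):
    """Encode values with an already-fitted mapping.

    Categories that were not seen during fitting get the `unknown` code
    (an assumed convention for this sketch).
    """
    return [mapping.get(v, unknown) for v in values]


# Hypothetical toy data: the split has already been made.
train_city = ["paris", "rome", "paris", "oslo"]
valid_city = ["rome", "lima"]  # "lima" never appears in the training data

mapping = fit_label_map(train_city)             # "fit": training data only
train_encoded = transform(train_city, mapping)  # "transform" on train
valid_encoded = transform(valid_city, mapping)  # "transform" only on validation
```

The validation column never influences `mapping`, so a category such as "lima" shows up as unknown, which is what the model would also face in deployment.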

$\endgroup$
5
$\begingroup$

If you fit the encoder on the whole dataset before you split it into train/validation/test sets, you will introduce bias into the training. The bias arises because the encoded categories will now contain information about the samples that end up in your validation and/or test sets.

This is commonly called data leakage, and it is a problem because the purpose of your validation and test sets is to apply your trained model to data that it has not seen before. But this would not be the case if your encoder has information about the data distribution of the entire dataset.
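A small sketch of how this can happen, using frequency encoding as an assumed example of an encoder that depends on the data distribution. The helper `fit_frequency` and the toy data are hypothetical, not taken from any particular library.

```python
from collections import Counter


def fit_frequency(values):
    """Map each category to its relative frequency in `values`."""
    counts = Counter(values)
    total = len(values)
    return {cat: n / total for cat, n in counts.items()}


# Hypothetical toy column, already split.
train = ["a", "a", "b", "a"]
test = ["b", "b", "b", "c"]

leaky = fit_frequency(train + test)  # fitted on everything: sees test rows
clean = fit_frequency(train)         # fitted on the training rows only

# The value the model would see for category "b" in its training features
# depends on whether the test rows were included when fitting.
print(leaky["b"], clean["b"])
# The leaky encoder even knows a category that exists only in the test set.
print("c" in leaky, "c" in clean)
```

With the leaky fit, the training features already reflect how often "b" occurs in the test set, which is exactly the information the test set is supposed to withhold.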

$\endgroup$
-1
$\begingroup$

If encoding methods like one-hot encoding (OHE) or label encoding are used, they should be done before splitting, as they don’t rely on the dataset distribution. If target encoding or mean encoding (which depend on target labels) is used, it must be done after splitting to avoid data leakage.
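For the target-encoding case, a rough sketch of doing it after the split could look like the following. The helpers `fit_target_mean` and `transform_target_mean`, the toy data, and the fallback to the global training mean for unseen categories are assumptions for illustration only.

```python
from collections import defaultdict


def fit_target_mean(categories, targets):
    """Learn the per-category mean of the target from training rows only."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    means = {cat: sums[cat] / counts[cat] for cat in sums}
    global_mean = sum(targets) / len(targets)
    return means, global_mean


def transform_target_mean(categories, means, fallback):
    """Encode with fitted means; unseen categories get `fallback`."""
    return [means.get(cat, fallback) for cat in categories]


# Hypothetical toy data, already split. Test targets are never used.
train_cat = ["x", "x", "y", "y"]
train_y = [1, 0, 1, 1]
test_cat = ["y", "z"]

means, global_mean = fit_target_mean(train_cat, train_y)
test_encoded = transform_target_mean(test_cat, means, global_mean)
```

Because only `train_y` enters `fit_target_mean`, no test label can leak into the encoded features.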

$\endgroup$
