I am not sure if I am following the steps correctly. First, I do feature selection:
```python
from sklearn.feature_selection import SelectKBest, f_regression

selection = SelectKBest(score_func=f_regression, k=15).fit(X, y)
X_features = selection.transform(X)
```

Then, I use cross-validation to calculate the alpha_ with the selected features (X_features):
```python
from sklearn.linear_model import LassoCV

model1 = LassoCV(cv=10, fit_intercept=True, normalize=False, n_jobs=-1)
model1.fit(X_features, y)
myalpha = model1.alpha_
```

Then, I train a new model with the calculated alpha_ value:
```python
from sklearn import linear_model

model2 = linear_model.Lasso(alpha=myalpha)
model2.fit(X_features, y)
```

Finally, I use cross-validation to test the trained model on a new data set (test data):
```python
from sklearn.model_selection import cross_val_predict

pred_r = cross_val_predict(model2, X_test, y_test, cv=10)
```

I wonder if these steps are correct or not. For example, I am not sure whether I also need to fit & transform the feature selector on the test set. I appreciate any guidance.
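To make that doubt concrete, this is the variant I am considering (just a sketch, assuming X_test/y_test have the same columns as X/y): the selector fitted on the training data is only used to transform the test set, and model2 predicts on it directly:

```python
# Sketch only: reuse the selector fitted above on the training data,
# so the test set is transformed but never used for fitting.
X_test_features = selection.transform(X_test)
pred_test = model2.predict(X_test_features)
```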
UPDATE
I also wonder about the scenario where there is no separate test set and I need to do cross-validation on the whole data set:
```python
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn import linear_model

selection = SelectKBest(score_func=f_regression, k=150).fit(X, y)
X_features = selection.transform(X)
reg = linear_model.LassoCV(cv=10, fit_intercept=True, normalize=False, n_jobs=-1)
reg.fit(X_features, y)
pred_r = reg.predict(X_features)
```

With this code, I am afraid I am evaluating my model on the training set, which will give biased results. Therefore, to reduce the bias and possible overfitting, instead of reg.predict(X_features) I think I need to do cross-validation again for testing:
```python
from sklearn.model_selection import cross_val_predict

pred_r = cross_val_predict(reg, X_features, y, cv=10)
```

I wonder if this would make sense?
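An alternative I have been considering (just a sketch, not code I have actually run, and I left out normalize=False here) is wrapping the feature selection and the Lasso in a Pipeline, so that both steps are re-fitted inside each CV fold and the selection never sees the held-out fold:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LassoCV
from sklearn.model_selection import cross_val_predict

# Sketch: the selector and LassoCV are re-fitted within each outer CV fold,
# so the fold being predicted is not used for feature selection.
pipe = Pipeline([
    ("select", SelectKBest(score_func=f_regression, k=150)),
    ("lasso", LassoCV(cv=10, fit_intercept=True, n_jobs=-1)),
])

pred_r = cross_val_predict(pipe, X, y, cv=10)
```

Would this be the cleaner way to do it, or is my step-by-step version above also acceptable?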