I am not sure if I am following the steps correctly. First, I do feature selection:
```python
from sklearn.feature_selection import SelectKBest, f_regression

selection = SelectKBest(score_func=f_regression, k=15).fit(X, y)
X_features = selection.transform(X)
```

Then, I use cross-validation to calculate the alpha_ with the selected features (X_features):
```python
from sklearn.linear_model import LassoCV

model1 = LassoCV(cv=10, fit_intercept=True, normalize=False, n_jobs=-1)
model1.fit(X_features, y)
myalpha = model1.alpha_
```

Then, I train a new model with the calculated alpha_ value:
```python
from sklearn import linear_model

model2 = linear_model.Lasso(alpha=myalpha)
model2.fit(X_features, y)
```

Finally, I use cross-validation to test the trained model on a new data set (test data):
```python
from sklearn.model_selection import cross_val_predict

pred_r = cross_val_predict(model2, X_test, y_test, cv=10)
```

I wonder if these steps are correct or not. For example, I am not sure whether I also need to fit & transform the feature selector on the test set. I appreciate any guidance.
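To make that doubt concrete, this is the variant I am considering (just a sketch, assuming X_test/y_test have the same columns as X/y): the selector fitted on the training data is only used to transform the test set, and model2 predicts on it directly:

```python
# Sketch only: reuse the selector fitted above on the training data,
# so the test set is transformed but never used for fitting.
X_test_features = selection.transform(X_test)
pred_test = model2.predict(X_test_features)
```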
UPDATE
I also wonder about the scenario where there is no separate test set and I need to do cross-validation on the whole data set:
```python
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn import linear_model

selection = SelectKBest(score_func=f_regression, k=150).fit(X, y)
X_features = selection.transform(X)
reg = linear_model.LassoCV(cv=10, fit_intercept=True, normalize=False, n_jobs=-1)
reg.fit(X_features, y)
pred_r = reg.predict(X_features)
```

With this code, I am afraid I am evaluating my model on the training set, which will give biased results. Therefore, to reduce the bias and possible overfitting, instead of reg.predict(X_features) I think I need to do cross-validation again for testing:
```python
from sklearn.model_selection import cross_val_predict

pred_r = cross_val_predict(reg, X_features, y, cv=10)
```

I wonder if this would make sense?
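An alternative I have been considering (just a sketch, not code I have actually run, and I left out normalize=False here) is wrapping the feature selection and the Lasso in a Pipeline, so that both steps are re-fitted inside each CV fold and the selection never sees the held-out fold:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LassoCV
from sklearn.model_selection import cross_val_predict

# Sketch: the selector and LassoCV are re-fitted within each outer CV fold,
# so the fold being predicted is not used for feature selection.
pipe = Pipeline([
    ("select", SelectKBest(score_func=f_regression, k=150)),
    ("lasso", LassoCV(cv=10, fit_intercept=True, n_jobs=-1)),
])

pred_r = cross_val_predict(pipe, X, y, cv=10)
```

Would this be the cleaner way to do it, or is my step-by-step version above also acceptable?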