Penalized regression, covariate selection, and the random element in covariate selection

Question

I'm doing logistic regression with too many predictors and too few datapoints, so I'm using elastic net logistic regression as a principled way to do variable selection. (The larger goal is to test a predictive model as measured by AUC, but I don't think that matters here.)

However, I've noticed that the set of coefficients modelled as non-zero is not consistent: the cross-validation component of cv.glmnet() introduces randomness, and different runs can select different predictors (for, of course, the same value of $\lambda$).

So I made up a rule of thumb, ran the elastic net model 100 times, and decided to keep the covariates that had non-zero coefficients, at $\lambda = \lambda_{min}$, in at least 90% of the runs.
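The rule of thumb above can be sketched roughly as follows. This is an illustrative Python sketch using only the standard library, not the R code behind the question; the helper name `stable_covariates`, the covariate names, and the toy `runs` list are all made up for the example, and each element of `runs` stands in for the set of covariates with non-zero coefficients from one cv.glmnet() run.

```python
from collections import Counter

def stable_covariates(runs, threshold=0.9):
    """Return the covariates selected in at least `threshold` of the runs.

    `runs` is a list of collections; each holds the names of the covariates
    that had non-zero coefficients in one run (hypothetical input format).
    """
    # Count, for each covariate, the number of runs in which it was selected.
    counts = Counter(name for selected in runs for name in set(selected))
    n_runs = len(runs)
    return sorted(name for name, c in counts.items() if c / n_runs >= threshold)

# Made-up selections from 10 runs, for illustration only.
runs = [{"age", "bmi"}] * 9 + [{"age", "smoker"}]
stable = stable_covariates(runs)
print(stable)
```

Raising `threshold` to 1.0 keeps only the covariates that were selected in every run.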

But I hadn't thought far enough ahead. Now that I've selected my variables, how do I go back and make elastic net regression give me a model including only those covariates? After all, I need my coefficient estimates.

Possibilities are:

1. This is essentially misguided, and no way to do variable selection. Instead, I should be doing X, where X is. . . .
2. I could run cv.glmnet() over and over again until it happens to give me a model including just the covariates I want. This is clearly nuts.
3. I could use penalty factors to force my chosen covariates into the model, and other covariates out. This, also, seems dubious. . . .
4. I could simply do conventional logistic regression on these covariates. (But these covariates were chosen as "best" in the context of the elastic net penalty. Of course coefficient estimates will be different when computed under conventional logistic regression. So this doesn't seem quite right. . . .)

I think I'm missing something. Grateful for any advice on where to go from here, or what I could be doing instead. Thanks!

Firebug · Accepted Answer · 2017-01-10 19:45:53Z

"This is essentially misguided, and no way to do variable selection. Instead, I should be doing X, where X is. . ."

Basically, yes.

We do cross-validation to obtain a performance estimate. The selected variables are the ones with non-zero coefficients in a model fit to the whole data set.
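To make the second point concrete, here is a minimal sketch, assuming a fixed penalty strength and synthetic data. It is written in plain Python with the standard library only, so it does not call glmnet; the solver is a simple proximal gradient loop swapped in for illustration, and every name in it (`fit_enet_logistic`, `selected_variables`, `x0` to `x4`, the value of `lam`) is hypothetical. The penalty is written as lam * (alpha * L1 + (1 - alpha) / 2 * L2), mirroring the usual elastic net parameterization.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def soft_threshold(v, t):
    # Proximal operator of the L1 penalty: shrink v toward zero by t.
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0

def fit_enet_logistic(X, y, lam, alpha=0.5, step=0.1, n_iter=500):
    """Elastic net logistic regression by proximal gradient descent.

    Minimizes mean log-loss + lam * (alpha * |b|_1 + (1 - alpha) / 2 * |b|_2^2).
    The intercept b0 is not penalized. The fixed step size is an assumption
    that suits roughly standardized features.
    """
    n, p = len(X), len(X[0])
    b0, b = 0.0, [0.0] * p
    shrink = 1.0 + step * lam * (1.0 - alpha)  # ridge part of the proximal step
    for _ in range(n_iter):
        resid = [sigmoid(b0 + sum(bj * xj for bj, xj in zip(b, row))) - yi
                 for row, yi in zip(X, y)]
        grad0 = sum(resid) / n
        grad = [sum(r * row[j] for r, row in zip(resid, X)) / n for j in range(p)]
        b0 -= step * grad0
        b = [soft_threshold(b[j] - step * grad[j], step * lam * alpha) / shrink
             for j in range(p)]
    return b0, b

def selected_variables(X, y, names, lam, alpha=0.5):
    # Fit once on ALL the data; the selected variables are the non-zero ones.
    _, b = fit_enet_logistic(X, y, lam, alpha)
    return [name for name, bj in zip(names, b) if bj != 0.0]

# Synthetic data for illustration: only x0 and x1 influence the outcome.
rng = random.Random(0)
names = ["x0", "x1", "x2", "x3", "x4"]
X = [[rng.gauss(0.0, 1.0) for _ in names] for _ in range(60)]
y = [1 if rng.random() < sigmoid(2.0 * row[0] - 2.0 * row[1]) else 0 for row in X]

print(selected_variables(X, y, names, lam=0.05))
```

Nothing in the fit itself is random: for a given penalty strength, refitting on the same full data set returns the same selected variables. In this sketch the random fold assignment would only enter through the choice of `lam`.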
