I'm doing logistic regression with too many predictors and too few datapoints, so I'm using elastic net logistic regression as a principled way to do variable selection. (The larger goal is to test a predictive model as measured by AUC, but I don't think that matters here.)
However, I've noticed that the set of coefficients estimated as non-zero is not consistent: the cross-validation component of `cv.glmnet()` introduces randomness, and different runs can select different predictors (for, of course, the same value of $\lambda$).
So I made up a rule of thumb, ran the elastic net model 100 times, and decided to keep the covariates that had non-zero coefficients, at $\lambda = \lambda_{min}$, in at least 90% of the runs.
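A minimal sketch of this repeated-run rule, for concreteness. The simulated data, the dimensions, and `alpha = 0.5` are assumptions made purely for illustration; only `cv.glmnet()`, $\lambda_{min}$, the 100 runs, and the 90% threshold come from the description above.

```r
library(glmnet)

set.seed(1)                       # simulated placeholder data, not the real data
n <- 80; p <- 30
x <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("x", 1:p)))
y <- rbinom(n, 1, plogis(x[, 1] - x[, 2]))

n_runs <- 100
alpha  <- 0.5                     # assumed elastic net mixing parameter

# One row per run, one column per predictor: TRUE if the coefficient is non-zero
selected <- t(replicate(n_runs, {
  fit <- cv.glmnet(x, y, family = "binomial", alpha = alpha)
  b   <- as.matrix(coef(fit, s = "lambda.min"))[-1, 1]  # drop the intercept
  b != 0
}))

freq <- colMeans(selected)        # selection frequency of each predictor
keep <- names(freq)[freq >= 0.9]  # the 90% rule of thumb
```

Here `freq` makes the instability visible directly: predictors with frequencies well away from 0 and 1 are the ones that come and go between runs.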
But I hadn't thought far enough ahead. Now that I've selected my variables, how do I go back and make elastic net regression give me a model including only those covariates? After all, I need my coefficient estimates.
Possibilities are:

- This is essentially misguided, and no way to do variable selection. Instead, I should be doing X, where X is. . . .
- I could run `cv.glmnet()` over and over again until it happens to give me a model including just the covariates I want. This is clearly nuts.
- I could use penalty factors to force my chosen covariates into the model, and other covariates out. This, also, seems dubious. . . .
- I could simply do conventional logistic regression on these covariates. (But these covariates were chosen as "best" in the context of the elastic net penalty. Of course coefficient estimates will be different when computed under conventional logistic regression. So this doesn't seem quite right. . . .)
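To show what the last two options might look like in code, here is a hedged sketch on simulated data. The data, `alpha = 0.5`, and the contents of `keep` are hypothetical. For the restricted elastic net it uses glmnet's `exclude` argument as a stand-in for the penalty-factor idea: it keeps the unwanted covariates out, but it does not force the chosen ones in, since they can still be shrunk to zero.

```r
library(glmnet)

set.seed(1)                       # simulated placeholder data, not the real data
n <- 80; p <- 30
x <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("x", 1:p)))
y <- rbinom(n, 1, plogis(x[, 1] - x[, 2]))
keep <- c("x1", "x2")             # hypothetical output of the selection step

# Elastic net restricted to the chosen covariates:
# `exclude` takes the column indices to leave out of the fit.
drop_idx <- which(!colnames(x) %in% keep)
fit_en <- cv.glmnet(x, y, family = "binomial", alpha = 0.5, exclude = drop_idx)
coef(fit_en, s = "lambda.min")    # penalised estimates; excluded columns are 0

# Conventional (unpenalised) logistic regression on the same covariates
dat <- data.frame(y = y, x[, keep, drop = FALSE])
fit_glm <- glm(y ~ ., data = dat, family = binomial)
coef(fit_glm)
```

Comparing the two sets of coefficients shows the concern raised above: the penalised estimates are shrunk toward zero relative to the `glm()` ones.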
I think I'm missing something. Grateful for any advice on where to go from here, or what I could be doing instead. Thanks!