I am using cv.glmnet to find predictors. The setup I use is as follows:
```r
lassoResults <- cv.glmnet(x = countDiffs, y = responseDiffs, alpha = 1, nfolds = cvfold)
bestlambda   <- lassoResults$lambda.min
results      <- predict(lassoResults, s = bestlambda, type = "coefficients")
choicePred   <- rownames(results)[which(results != 0)]
```

To make sure the results are reproducible I `set.seed(1)`. Even so, the results are highly variable: I ran the exact same code 100 times to see how variable the selections were. One particular predictor was selected in 98 of the 100 runs (sometimes on its own); other predictors were selected (coefficient non-zero) in roughly 50 of the 100 runs.
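A minimal sketch of that 100-run experiment, on simulated data (the `x`/`y` here are made up stand-ins for `countDiffs`/`responseDiffs`): rerun `cv.glmnet()` repeatedly and tally how often each coefficient comes out non-zero.

```r
# Sketch: repeat cv.glmnet() and count non-zero coefficients per run.
# Data are simulated; only predictor 1 is truly informative.
library(glmnet)
set.seed(1)
n <- 100; p <- 20
x <- matrix(rnorm(n * p), n, p)
y <- x[, 1] + rnorm(n)

nruns    <- 100
selCount <- integer(p + 1)            # +1 for the intercept row
for (i in seq_len(nruns)) {
  fit <- cv.glmnet(x, y, alpha = 1, nfolds = 10)   # folds re-randomized each call
  cf  <- predict(fit, s = fit$lambda.min, type = "coefficients")
  selCount <- selCount + (as.vector(cf) != 0)
}
names(selCount) <- rownames(cf)
sort(selCount, decreasing = TRUE)     # the informative predictor should dominate
```

The variability across runs comes entirely from the random fold assignment inside each `cv.glmnet()` call, which shifts `lambda.min` and hence which borderline coefficients survive.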
So it tells me that each time the cross-validation runs it will probably select a different best lambda, because the initial randomization of the folds matters. Others have seen this problem (CV.glmnet results), but no solution was suggested there.
I am thinking that maybe the one predictor which shows up in 98/100 runs is pretty highly correlated with all the others? The results do stabilize if I just run LOOCV ($\text{fold-size} = n$), but I am curious why they are so variable when $\text{nfold} < n$.
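The LOOCV stabilization can be checked directly: with `nfolds = n` every observation is its own fold, so the fold assignment carries no randomness and `lambda.min` is the same on every call. A sketch on simulated data (`x`, `y`, and the sizes are made up):

```r
# With nfolds = n (LOOCV) the leave-one-out fits are the same set of fits
# no matter how the fold labels are permuted, so lambda.min is deterministic.
library(glmnet)
set.seed(1)
n <- 40; p <- 10
x <- matrix(rnorm(n * p), n, p)
y <- x[, 1] + rnorm(n)

fit1 <- cv.glmnet(x, y, alpha = 1, nfolds = n, grouped = FALSE)
fit2 <- cv.glmnet(x, y, alpha = 1, nfolds = n, grouped = FALSE)
stopifnot(fit1$lambda.min == fit2$lambda.min)  # identical: no fold randomness left
```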
`set.seed(1)` once and then run `cv.glmnet()` 100 times? That's not great methodology for reproducibility; it is better to `set.seed()` right before each run, or else keep the fold IDs constant across runs. Each of your calls to `cv.glmnet()` calls `sample()` N times, so if the length of your data ever changes, the reproducibility changes.
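A sketch of the second suggestion, keeping the folds constant: draw the fold assignment once and pass it through the `foldid` argument, so every `cv.glmnet()` call sees the same split (the data and the 10-fold count here are made up):

```r
# Fix the folds once via foldid=, then every cv.glmnet() call is identical.
library(glmnet)
set.seed(1)
n <- 100; p <- 20
x <- matrix(rnorm(n * p), n, p)
y <- x[, 1] + rnorm(n)

foldid <- sample(rep(seq_len(10), length.out = n))  # fixed 10-fold assignment
fitA <- cv.glmnet(x, y, alpha = 1, foldid = foldid)
fitB <- cv.glmnet(x, y, alpha = 1, foldid = foldid)
stopifnot(fitA$lambda.min == fitB$lambda.min)       # same folds, same lambda
```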