I am using cv.glmnet to find predictors. The setup I use is as follows:
```r
lassoResults <- cv.glmnet(x = countDiffs, y = responseDiffs, alpha = 1, nfolds = cvfold)
bestlambda   <- lassoResults$lambda.min
results      <- predict(lassoResults, s = bestlambda, type = "coefficients")
choicePred   <- rownames(results)[which(results != 0)]
```

To make sure the results are reproducible I `set.seed(1)`. Even so, the results are highly variable: I ran the exact same code 100 times to see how variable the selections were. One particular predictor was selected in 98 of the 100 runs (sometimes on its own); other predictors were selected (coefficient non-zero) in roughly 50 of the 100 runs.
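A minimal sketch of that 100-run experiment, on simulated data (the `x`/`y` here are made up stand-ins for `countDiffs`/`responseDiffs`): rerun `cv.glmnet()` repeatedly and tally how often each coefficient comes out non-zero.

```r
# Sketch: repeat cv.glmnet() and count non-zero coefficients per run.
# Data are simulated; only predictor 1 is truly informative.
library(glmnet)
set.seed(1)
n <- 100; p <- 20
x <- matrix(rnorm(n * p), n, p)
y <- x[, 1] + rnorm(n)

nruns    <- 100
selCount <- integer(p + 1)            # +1 for the intercept row
for (i in seq_len(nruns)) {
  fit <- cv.glmnet(x, y, alpha = 1, nfolds = 10)   # folds re-randomized each call
  cf  <- predict(fit, s = fit$lambda.min, type = "coefficients")
  selCount <- selCount + (as.vector(cf) != 0)
}
names(selCount) <- rownames(cf)
sort(selCount, decreasing = TRUE)     # the informative predictor should dominate
```

The variability across runs comes entirely from the random fold assignment inside each `cv.glmnet()` call, which shifts `lambda.min` and hence which borderline coefficients survive.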
So it tells me that each time the cross-validation runs it will probably select a different best lambda, because the initial randomization of the folds matters. Others have seen this problem (CV.glmnet results), but no solution was suggested there.
I am thinking that maybe the one predictor which shows up in 98/100 runs is pretty highly correlated with all the others? The results do stabilize if I just run LOOCV ($\text{fold-size} = n$), but I am curious why they are so variable when $\text{nfold} < n$.
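The LOOCV stabilization can be checked directly: with `nfolds = n` every observation is its own fold, so the fold assignment carries no randomness and `lambda.min` is the same on every call. A sketch on simulated data (`x`, `y`, and the sizes are made up):

```r
# With nfolds = n (LOOCV) the leave-one-out fits are the same set of fits
# no matter how the fold labels are permuted, so lambda.min is deterministic.
library(glmnet)
set.seed(1)
n <- 40; p <- 10
x <- matrix(rnorm(n * p), n, p)
y <- x[, 1] + rnorm(n)

fit1 <- cv.glmnet(x, y, alpha = 1, nfolds = n, grouped = FALSE)
fit2 <- cv.glmnet(x, y, alpha = 1, nfolds = n, grouped = FALSE)
stopifnot(fit1$lambda.min == fit2$lambda.min)  # identical: no fold randomness left
```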
`set.seed(1)` once and then run `cv.glmnet()` 100 times? That's not great methodology for reproducibility; it is better to `set.seed()` right before each run, or else keep the fold IDs constant across runs. Each of your calls to `cv.glmnet()` calls `sample()` N times, so if the length of your data ever changes, the reproducibility changes.
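A sketch of the second suggestion, keeping the folds constant: draw the fold assignment once and pass it through the `foldid` argument, so every `cv.glmnet()` call sees the same split (the data and the 10-fold count here are made up):

```r
# Fix the folds once via foldid=, then every cv.glmnet() call is identical.
library(glmnet)
set.seed(1)
n <- 100; p <- 20
x <- matrix(rnorm(n * p), n, p)
y <- x[, 1] + rnorm(n)

foldid <- sample(rep(seq_len(10), length.out = n))  # fixed 10-fold assignment
fitA <- cv.glmnet(x, y, alpha = 1, foldid = foldid)
fitB <- cv.glmnet(x, y, alpha = 1, foldid = foldid)
stopifnot(fitA$lambda.min == fitB$lambda.min)       # same folds, same lambda
```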