$\begingroup$

I've come from comments here. I'm applying Elastic Net regularization to a Cox model for miRNA biomarker selection. To compare across alpha values, I use a fixed foldid for cross-validation, as recommended, all using the glmnet package.

Surprisingly, changing the seed (e.g., set.seed(849) vs another value) dramatically changes the number of variables selected: from almost all (e.g., 111) down to a medium-sized subset (38) or a small one (~17). I expected some variability due to resampling, but not this much.

Is this level of instability normal in penalized Cox models? Would it make sense to repeat the process with multiple seeds and select variables that way? Does the same happen with the hdnom package and its aenet or enet functions? The comment from @smci was: The usual advice is try many random seeds and pick the ones with lowest CV score, or average their coefficients.

Thanks for any guidance!

EDIT1: My study is made up of 247 individuals (58 cases) and 111 miRNAs. From this database I am trying to associate biomarkers with incident events (acute myocardial infarction); therefore the elastic net uses the Cox family. Following the recommendation of the glmnet package, if the goal is iterating to find an alpha value, foldid must be fixed and passed as an argument for cross-validation.

I have also been gathering information from other threads where they suggest ways to improve how the lambda value is obtained.

The code:

```r
# iam: event indicator (case or control)
# tocoro: time to event
# 111 miRNA columns matching "hsa"

# === 1. Prepare data ===
dat2 <- dat %>%
  select(iam, tocoro, matches("hsa")) %>%
  dplyr::mutate(across(matches("hsa"), scale))
x <- as.matrix(dat2[, -c(1:2)])   # predictors
y <- Surv(dat2$tocoro, dat2$iam)  # survival response

# === 2. Alpha grid (fixed foldid) ===
set.seed(849)
foldid <- sample(1:10, size = nrow(x), replace = TRUE)
alpha_grid <- seq(0, 1, by = 0.1)
cv_errors <- numeric(length(alpha_grid))
cv_models <- list()
for (i in seq_along(alpha_grid)) {
  a <- alpha_grid[i]
  cv_fit <- cv.glmnet(x, y, family = "cox", alpha = a, foldid = foldid)
  cv_errors[i] <- min(cv_fit$cvm)
  cv_models[[i]] <- cv_fit
}
best_index <- which.min(cv_errors)
best_alpha <- alpha_grid[best_index]
cat("Best alpha:", best_alpha, "\n")

# === 3. CV to estimate lambda (multiple seeds) ===
n <- 100
lambdas <- NULL
for (i in 1:n) {
  set.seed(i)
  fit <- cv.glmnet(x, y, family = "cox", alpha = best_alpha, nfolds = 10)
  errors <- data.frame(lambda = fit$lambda, cvm = fit$cvm)
  lambdas <- rbind(lambdas, errors)
}
# Average the CV error per lambda
lambda_summary <- aggregate(cvm ~ lambda, data = lambdas, mean)
# Pick the lambda with the best average error
bestindex <- which.min(lambda_summary$cvm)
bestlambda <- lambda_summary$lambda[bestindex]
cat("Best average lambda:", bestlambda, "\n")

# === 4. Final fit at best alpha and lambda ===
final_model <- glmnet(x, y, family = "cox", alpha = best_alpha, lambda = bestlambda)
selected_vars <- coef(final_model)
selected_vars <- selected_vars[selected_vars[, 1] != 0, ]
print("Selected variables:")
print(selected_vars)
```

Modifying the seed, the selected variables are quite different. Here is the PCA of the whole database (not stratified by cases and controls!):

[Three images: PCA plots of the full dataset]

$\endgroup$
    $\begingroup$ This does seem like a lot more variability than one might expect, even given the somewhat arbitrary nature of this type of variable selection. Could you please show the code that you are using to fit the elastic net? Also, please edit the question to describe the scope of your data set: how many individuals, how many events (deaths?), whether/how many standard clinical variables are in your model, how many miRNA species you analyze... It might further help to see principal component analysis of your miRNA values. $\endgroup$ Commented May 16 at 20:30
    $\begingroup$ This is not surprising. Elastic net and lasso like almost all variable selection methods cannot select the right variables. If you enclose elastic net inside a bootstrap loop you’ll see this quite easily, or simulate from a true model and see how poorly the procedure recovers the right variables. Your effective sample size is about 100 and you have more features than sample size. Expect the worst. The sample size needed to well analyze a SINGLE feature is more than 100. It takes n=400 to well estimate a single correlation coefficient. $\endgroup$ Commented May 19 at 11:35
    $\begingroup$ We need to do simulations to figure out the minimal needed sample size. The scary fact is that n=400 is needed to estimate a single correlation coefficient well, so more than that is needed to estimate a whole correlation matrix as needed in PCA or SPCA. But if we don’t care about stability of PCA/SPCA itself we may be forced to use these to keep from overfitting the main outcome model. $\endgroup$ Commented May 20 at 11:48
    $\begingroup$ Concerning repeating with multiple seeds: see stats.stackexchange.com/questions/310119 for general commentary. In a post at stats.stackexchange.com/a/157646/919 I identify the potential abuse latent in changing the seed. $\endgroup$ Commented May 30 at 16:59
    $\begingroup$ @whuber if I understood correctly, the problem lies in purposely changing the seed to obtain the desired results. What I'm trying to do here is change it 100 times to make the results reproducible. I'm not an expert, so maybe what I'm doing amounts to much the same thing, but in my mind it was a robust approach to obtain an optimal lambda value. (Obviously invalidated by my sample size, but not wrong in spirit) $\endgroup$ Commented Jun 4 at 9:12
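The bootstrap check suggested in the comments can be sketched as follows. This is illustrative only: the simulated `x`, `y`, and `best_alpha <- 0.5` stand in for the question's real objects, and `B` is kept small for speed.

```r
library(glmnet)
library(survival)

# Toy data standing in for the question's miRNA matrix and Surv response;
# with the real data, reuse x, y, and best_alpha from the posted code.
set.seed(1)
n <- 200; p <- 20
x <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("miR", 1:p)))
y <- Surv(rexp(n), rbinom(n, 1, 0.4))
best_alpha <- 0.5

B <- 30  # number of bootstrap resamples (small, for illustration)
sel_counts <- setNames(numeric(p), colnames(x))
for (b in seq_len(B)) {
  idx <- sample(n, replace = TRUE)  # bootstrap resample of individuals
  fit <- cv.glmnet(x[idx, ], y[idx], family = "cox", alpha = best_alpha)
  sel_counts <- sel_counts + (as.numeric(coef(fit, s = "lambda.min")) != 0)
}

# Selection frequency per predictor; genuinely stable predictors should be
# selected in most resamples, unstable ones will bounce around
sel_freq <- sort(sel_counts / B, decreasing = TRUE)
head(sel_freq)
```

With only ~58 events in the real data, expect these selection frequencies to be far from 0 or 1 for most miRNAs, which is exactly the instability described in the question.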

1 Answer

$\begingroup$

The number of retained coefficients in the elastic net will be related to how close the chosen model comes to ridge (L2) versus LASSO (L1) penalization: the closer to ridge, the more retained coefficients.

I suspect that, even if you had a lot more cases, with these miRNA data you would have a fairly broad, flat optimum (in partial likelihood) over your 2-dimensional grid of alpha (the L1/L2 mixing parameter) and lambda (the overall penalty strength). That would lead to variation in the number of retained coefficients due to imprecision in finding the "best" alpha value.

Your problem is exacerbated by the small number of cases (events), a limiting factor in Cox regressions. With only 58 events, in 10-fold cross validation with a single setting of the folds you will have on average 5.8 events in a test fold. Poisson sampling would mean that about 30% of test folds have 4 or fewer events, providing very little power for evaluating the quality of the model trained on the other folds. Your results thus might be expected to depend heavily on the particular choice of folds, as determined by your random seed. Combine that with an inherently flat 2-dimensional optimum, and your results might not be so surprising.
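The ~30% figure can be checked directly under the Poisson approximation, with a per-fold mean of 58/10 = 5.8 events:

```r
# Probability that a 10-fold CV test fold contains 4 or fewer of the 58
# events, approximating the per-fold event count as Poisson with mean 5.8
mean_events <- 58 / 10
p_few <- ppois(4, lambda = mean_events)
round(p_few, 3)  # ~0.313, i.e. about 30% of test folds
```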

I disagree with the suggestion to "try many random seeds and pick the ones with lowest CV score" to optimize alpha and lambda. That risks overfitting: with so few events, the result will be extremely dependent on the vagaries of the sampling used to set up the folds on this particular data set.

Multiple repeated cross validations might work better. See Frank Harrell's post on split-sample validation. Although that focuses on model validation, the principles should apply to your 2-dimensional optimization.
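A repeated-CV scheme over a joint (alpha, lambda) grid might look like the sketch below, averaging the CV deviance across several fold assignments rather than trusting one seed. The toy `x` and `y` are illustrative stand-ins, and the grids are deliberately coarse; note that `cv.glmnet` chooses its own lambda path per alpha, so only `lambda.min` per fit is recorded here.

```r
library(glmnet)
library(survival)

# Toy data; with the real data, use the question's x and y instead
set.seed(1)
n <- 200; p <- 20
x <- matrix(rnorm(n * p), n, p)
y <- Surv(rexp(n), rbinom(n, 1, 0.4))

alphas <- seq(0.1, 1, by = 0.3)
R <- 5  # number of repeated fold assignments
results <- expand.grid(alpha = alphas, rep = seq_len(R))
results$cvm <- NA_real_
results$lambda_min <- NA_real_

for (r in seq_len(R)) {
  foldid <- sample(rep(1:10, length.out = n))  # same folds for every alpha
  for (a in alphas) {
    fit <- cv.glmnet(x, y, family = "cox", alpha = a, foldid = foldid)
    row <- which(results$alpha == a & results$rep == r)
    results$cvm[row] <- min(fit$cvm)
    results$lambda_min[row] <- fit$lambda.min
  }
}

# Average the CV error per alpha across the repeats, then pick the best
avg <- aggregate(cvm ~ alpha, data = results, mean)
best_alpha <- avg$alpha[which.min(avg$cvm)]
```

How much `best_alpha` and the recorded `lambda_min` values wobble across the repeats is itself a useful diagnostic of how flat the optimum is.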

Finally, consider whether elastic net based solely on miRNA values is wise. As Frank Harrell points out in a comment, it "cannot select the right variables." Also, clinical outcomes depend heavily on standard clinical variables. If your Cox model does not include standard clinical variables, you are likely either to suffer from omitted-variable bias (which tends to lower the magnitudes of the coefficients for variables that aren't omitted) or to identify miRNAs that happen to be associated with the clinical variables.

$\endgroup$
  • $\begingroup$ Acknowledging the limitations of my data, I took trying as many seeds as possible as a robust approach to estimate the best value of lambda. I guess with my data this is not possible because of the size, and the flat optimum I'm moving over. Otherwise, would it be a fair approach along with multiple CV? Which I implemented. The standard variables were omitted; I don't see why it would be better to include them. The standard variables are selected in a standard Cox model according to biological background $\endgroup$ Commented Jun 4 at 8:51
  • $\begingroup$ @JavierHernando see the last sentence of my answer about omitting variables. On the one hand, omitting standard clinical variables can reduce your ability to find true associations of mRNAs with outcome, due to the omitted-variable bias toward lower coefficient magnitudes that's inherent in Cox models. On the other hand, if you omit the standard variables and find mRNAs that are associated with outcome just because they happen to be correlated with the standard variables, you haven't done much to advance clinically relevant knowledge. $\endgroup$ Commented Jun 4 at 13:40
  • $\begingroup$ @JavierHernando combining repeated CV with multiple seeds seems OK. I'm a bit worried by the new code you posted, however. It seems to first perform a search for the best alpha, then search over that alpha for the best lambda. I recall that your prior code instead did a combined 2-dimensional search to optimize both of them together. As I understand it, a combined 2-dimensional search is more likely to provide reproducible results. This page provides some related information and links for further study and implementations. $\endgroup$ Commented Jun 4 at 15:10
  • $\begingroup$ I read your last sentence but I wasn't sure about the approach. What if the regularization method drops one of my standard clinical variables that should remain? Say hypertension. I considered including just the molecules to be selected and exploring potential synergies or antagonisms among them (if the method can capture these interactions). I'm not saying it's not correct to include clinical variables, but including just the "high-dimensional variables" is widely reflected in high-impact peer-reviewed articles. Maybe it's possible to not penalize these. Thanks for your suggestion, I will try it $\endgroup$ Commented Jun 5 at 3:58
    $\begingroup$ @JavierHernando the penalty.factor argument to glmnet() allows you to specify whether specific predictors should be penalized: "Can be 0 for some variables, which implies no shrinkage, and that variable is always included in the model." I recognize that "high-impact peer-reviewed articles" sometimes don't include clinical variables in survival analysis, but that doesn't make it good practice. If the focus is on biochemical/genomic mechanisms and the study is a prelude to detailed experimental follow up that might be defensible. It won't, however, provide useful clinical guidance. $\endgroup$ Commented Jun 5 at 13:29
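The `penalty.factor` mechanism mentioned above can be sketched like this; the clinical variables (`age` and `hypertension` here, chosen for illustration) get a penalty factor of 0 so they are never shrunk out, while the miRNAs remain penalized. The data are simulated stand-ins; with the real data, `cbind` the clinical columns onto the miRNA matrix.

```r
library(glmnet)
library(survival)

# Toy data: two unpenalized clinical columns plus 20 penalized miRNA columns
set.seed(1)
n <- 200
clinical <- cbind(age = rnorm(n, 60, 10), hypertension = rbinom(n, 1, 0.4))
mirna <- matrix(rnorm(n * 20), n, 20,
                dimnames = list(NULL, paste0("miR", 1:20)))
x <- cbind(clinical, mirna)
y <- Surv(rexp(n), rbinom(n, 1, 0.4))

pf <- c(rep(0, ncol(clinical)),  # clinical variables: no shrinkage, always kept
        rep(1, ncol(mirna)))     # miRNAs: standard penalty
fit <- cv.glmnet(x, y, family = "cox", alpha = 0.5, penalty.factor = pf)

beta <- coef(fit, s = "lambda.min")
rownames(beta)[as.numeric(beta) != 0]  # clinical variables appear alongside any selected miRNAs
```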
