I've come here from the comments on a related question. I'm applying Elastic Net regularization to a Cox model for miRNA biomarker selection, using the glmnet package. To compare models across alpha values, I use a fixed foldid for cross-validation, as recommended.
Surprisingly, changing the seed (e.g., set.seed(849) vs. another value) dramatically changes the number of variables selected: from almost all (111) to a medium subset (38) to a small one (~17). I expected some variability due to resampling, but not this much.
Is this level of instability normal in penalized Cox models? Would it make sense to repeat the process with multiple seeds and select variables from the aggregate? Does the same happen with the hdnom package and its aenet or enet functions? The comment from @smci was: "The usual advice is to try many random seeds and pick the ones with the lowest CV score, or average their coefficients."
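One concrete way to act on @smci's advice is to tally how often each variable survives selection across repeated CV runs. The sketch below uses simulated data (not my real miRNA set) and an arbitrary alpha of 0.5, just to illustrate the bookkeeping:

```r
library(glmnet)
library(survival)

# Toy sketch: selection frequency of each variable across CV repeats.
# Simulated data stand in for the real x (miRNA matrix) and y (Surv object).
set.seed(1)
n <- 200; p <- 30
x <- matrix(rnorm(n * p), n, p,
            dimnames = list(NULL, paste0("mir", 1:p)))
time  <- rexp(n, rate = exp(0.5 * x[, 1] - 0.5 * x[, 2]))
event <- rbinom(n, 1, 0.4)
y <- Surv(time, event)

n_seeds <- 20
sel_counts <- setNames(numeric(p), colnames(x))
for (s in seq_len(n_seeds)) {
  set.seed(s)
  cvfit <- cv.glmnet(x, y, family = "cox", alpha = 0.5, nfolds = 5)
  beta <- as.matrix(coef(cvfit, s = "lambda.min"))
  sel_counts <- sel_counts + (beta[, 1] != 0)
}
# Proportion of runs in which each variable was kept
sel_freq <- sort(sel_counts / n_seeds, decreasing = TRUE)
head(sel_freq, 10)
```

Variables with a high selection frequency are the stable ones; a hard threshold (e.g., kept in >50% of runs) is one common, if ad hoc, choice.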
Thanks for any guidance!
EDIT 1: My study comprises 247 individuals (58 cases) and 111 miRNAs. From this database I am trying to associate biomarkers with incident events (acute myocardial infarction), so the elastic net model uses the Cox family. Following the glmnet package's recommendation, when iterating to find an alpha value, foldid must be supplied as a fixed argument for cross-validation.
I have also been gathering information from other threads, where they suggest ways to improve the estimation of lambda.
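One detail worth pinning down when averaging CV error across seeds is the lambda grid itself: fixing a common lambda sequence up front guarantees the cvm values from different seeds are compared at exactly the same lambda values. A self-contained sketch with simulated data (the real x, y, and best_alpha would come from the study):

```r
library(glmnet)
library(survival)

# Simulated stand-ins for the real data
set.seed(1)
n <- 150; p <- 20
x <- matrix(rnorm(n * p), n, p)
y <- Surv(rexp(n), rbinom(n, 1, 0.4))
best_alpha <- 0.5  # placeholder for the alpha chosen earlier

# Fit the full path once to get a shared lambda sequence
path <- glmnet(x, y, family = "cox", alpha = best_alpha)
lambda_seq <- path$lambda

# Repeat CV with different seeds on the SAME lambda grid
cvm_mat <- sapply(1:10, function(s) {
  set.seed(s)
  fit <- cv.glmnet(x, y, family = "cox", alpha = best_alpha,
                   lambda = lambda_seq, nfolds = 10)
  fit$cvm
})
bestlambda <- lambda_seq[which.min(rowMeans(cvm_mat))]
```

This replaces the per-lambda `aggregate()` step with a simple row mean, since every column of `cvm_mat` is aligned to `lambda_seq`.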
The code:

```r
# iam: event or control
# tocoro: time to event
# 111 miRNAs

# === 1. Prepare data ===
dat2 <- dat %>% select(iam, tocoro, matches("hsa"))
dat2 <- dat2 %>% dplyr::mutate(across(matches("hsa"), scale))
x <- as.matrix(dat2[, -c(1:2)])   # predictors
y <- Surv(dat2$tocoro, dat2$iam)  # survival response

# === 2. Alpha grid ===
set.seed(849)
foldid <- sample(1:10, size = nrow(x), replace = TRUE)
alpha_grid <- seq(0, 1, by = 0.1)
cv_errors <- numeric(length(alpha_grid))
cv_models <- list()
for (i in seq_along(alpha_grid)) {
  a <- alpha_grid[i]
  cv_fit <- cv.glmnet(x, y, family = "cox", alpha = a, foldid = foldid)
  cv_errors[i] <- min(cv_fit$cvm)
  cv_models[[i]] <- cv_fit
}
best_index <- which.min(cv_errors)
best_alpha <- alpha_grid[best_index]
cat("Best alpha:", best_alpha, "\n")

# === 3. CV to estimate lambda (multiple seeds) ===
n <- 100
lambdas <- NULL
for (i in 1:n) {
  set.seed(i)
  fit <- cv.glmnet(x, y, family = "cox", alpha = best_alpha, nfolds = 10)
  errors <- data.frame(lambda = fit$lambda, cvm = fit$cvm)
  lambdas <- rbind(lambdas, errors)
}
# Average the error per lambda
lambda_summary <- aggregate(cvm ~ lambda, data = lambdas, mean)
# Select the best lambda
bestindex <- which.min(lambda_summary$cvm)
bestlambda <- lambda_summary$lambda[bestindex]
cat("Best average lambda:", bestlambda, "\n")

# === 4. Final model with best alpha and lambda ===
final_model <- glmnet(x, y, family = "cox", alpha = best_alpha, lambda = bestlambda)
selected_vars <- coef(final_model)
selected_vars <- selected_vars[selected_vars[, 1] != 0, ]
print("Selected variables:")
print(selected_vars)
```

Modifying the seed, the selected variables are quite different. The PCA was computed from the whole database (not stratifying by cases and controls!).
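A further source of instability worth ruling out (my own suggestion, not from the original analysis): with only 58 events among 247 individuals, a purely random foldid can put very different numbers of cases in each fold, which inflates the variability of the CV curve. A foldid stratified on the event indicator keeps the case count per fold roughly equal:

```r
# Sketch: build a foldid that balances events across folds.
# `iam` is reconstructed here with the study's counts (58 cases, 189 controls);
# in the real analysis it would be dat2$iam.
set.seed(849)
iam <- c(rep(1, 58), rep(0, 189))
nfolds <- 10
foldid <- integer(length(iam))
for (g in unique(iam)) {
  idx <- which(iam == g)
  # assign fold labels 1..nfolds within each stratum, then shuffle
  foldid[idx] <- sample(rep_len(seq_len(nfolds), length(idx)))
}
table(foldid, iam)  # each fold should hold 5-6 cases and 18-19 controls
# Then pass foldid = foldid to every cv.glmnet() call being compared
```

The same stratified foldid can be reused across the whole alpha grid, so the folds are held constant while only alpha varies.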


