I have an unbalanced dataset of 247 individuals with 58 events, and I am trying to select features via lasso regression. My sample size has been questioned: lasso is supposed to handle n << p scenarios, yet I was previously advised that my sample size is a limitation. At the same time, I have found many published studies with similar sample sizes that use elastic net for variable selection.
Given this, my first question is: do I need to account for the imbalance between events and controls?
My second question is: does it make sense to iterate to calculate lambda, or is it better to just find alpha and let the function calculate lambda? In previous threads I saw the recommendation to tune alpha first and then lambda, but not both simultaneously. However, the cross-validation used to find alpha already yields a lambda value. Despite this, we then calculate the best lambda value separately:
    library(glmnet)

    alpha_grid <- seq(0, 1, by = 0.1)
    cv_errors <- numeric(length(alpha_grid))
    cv_models <- list()

    # Getting the best alpha value (internally obtaining lambda)
    for (i in seq_along(alpha_grid)) {
      print(alpha_grid[i])
      a <- alpha_grid[i]
      cv_fit <- cv.glmnet(x, y, family = "cox", alpha = a, nfolds = 5)
      cv_errors[i] <- min(cv_fit$cvm)  # lowest mean CV error along the lambda path
      cv_models[[i]] <- cv_fit
    }

    best_index_p6 <- which.min(cv_errors)
    best_alpha_p6 <- alpha_grid[best_index_p6]
    cat("Best alpha (p6):", best_alpha_p6, "\n")

    # Get the best lambda by iterating over seeds
    # (although we already got a lambda value when getting alpha)
    n <- 100
    lambdas_p6 <- NULL
    for (i in 1:n) {
      set.seed(i)
      fit <- cv.glmnet(x, y, family = "cox", alpha = best_alpha_p6)
      errors <- data.frame(lambda = fit$lambda, cvm = fit$cvm)
      lambdas_p6 <- rbind(lambdas_p6, errors)
    }

    # Average the CV error per lambda across the repeated runs
    lambda_summary_p6 <- aggregate(cvm ~ lambda, data = lambdas_p6, mean)
    bestindex_p6 <- which.min(lambda_summary_p6$cvm)
    bestlambda_p6 <- lambda_summary_p6$lambda[bestindex_p6]
    cat("Best average lambda (p6):", bestlambda_p6, "\n")

    final_model_p6 <- glmnet(x, y, family = "cox", alpha = best_alpha_p6, lambda = bestlambda_p6)
    selected_vars_p6 <- coef(final_model_p6)
    selected_vars_p6 <- selected_vars_p6[selected_vars_p6[, 1] != 0, ]  # keep non-zero coefficients
    print("Selected variables (p6):")
    print(selected_vars_p6)
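To make the alpha/lambda question concrete, here is a minimal sketch of the alternative I am asking about: one shared fold assignment (`foldid`) for every alpha, so the CV errors are comparable, with lambda taken from the same `cv.glmnet` fit. The data (`x_sim`, `y_sim`), the shorter alpha grid, and the effect sizes are all made up for illustration and are not my real dataset. I am not claiming this is the correct approach, only showing what I mean.

```r
library(glmnet)

set.seed(1)

# Simulated stand-in data (assumption, not the real dataset)
n_obs <- 247
p <- 20
x_sim <- matrix(rnorm(n_obs * p), n_obs, p)
colnames(x_sim) <- paste0("v", seq_len(p))

# Event times depend on the first two predictors; independent censoring
true_time <- rexp(n_obs, rate = exp(0.8 * x_sim[, 1] - 0.8 * x_sim[, 2]))
cens_time <- rexp(n_obs, rate = 2)
y_sim <- cbind(time = pmin(true_time, cens_time),
               status = as.numeric(true_time <= cens_time))

# One fixed fold assignment, reused for every alpha
foldid <- sample(rep(seq_len(5), length.out = n_obs))

# Shorter grid than in my real code, just to keep the sketch quick
alpha_grid <- c(0.25, 0.5, 0.75, 1)
cv_fits <- lapply(alpha_grid, function(a) {
  cv.glmnet(x_sim, y_sim, family = "cox", alpha = a, foldid = foldid)
})

# Lowest mean CV error (partial likelihood deviance) for each alpha
cv_min <- vapply(cv_fits, function(f) min(f$cvm), numeric(1))
best <- which.min(cv_min)
best_alpha <- alpha_grid[best]
best_lambda <- cv_fits[[best]]$lambda.min  # lambda from the same CV fit

# Non-zero coefficients at the chosen (alpha, lambda) pair
cf <- as.matrix(coef(cv_fits[[best]], s = "lambda.min"))[, 1]
selected <- names(cf)[cf != 0]

cat("Best alpha:", best_alpha, " lambda.min:", best_lambda, "\n")
print(selected)
```

With this version alpha and lambda come out of the same set of folds, which is the part I am unsure about compared with re-estimating lambda over 100 seeds as in my code above.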