
I am learning about ridge regression. I was under the impression that ridge regression is valuable because it provides better out-of-sample predictive accuracy than standard linear models. For example, see the bottom of page 217 in this well-known statistical learning text: http://faculty.marshall.usc.edu/gareth-james/ISL/ISLR%20Seventh%20Printing.pdf. I tried setting up a short simulation to demonstrate this, but my results aren't showing that ridge models are superior.

First, I simulated the exact multi-arm design using DeclareDesign in R (the only difference is that I boosted N to 300). I then set up a simulation in which I generated a data set 1,000 times, split it into training and test sets, and fit a linear model and a ridge regression model to the training data. I then examined how well each model predicted responses in the test data. Surprisingly, the linear model doesn't do any worse. I must be going wrong somewhere, right? Below is my code - it doesn't take long to run and I'd appreciate any tips on where I might have gone wrong.

    # Add libraries
    library(DeclareDesign)
    library(ridge)
    library(tidyverse)
    library(fastDummies)

    # Use DeclareDesign to get a design that can simulate data
    N <- 300
    outcome_means <- c(0.5, 1, 2, 0.5)
    sd_i <- 1
    outcome_sds <- c(0, 0, 0, 0)

    population <- declare_population(
      N = N,
      u_1 = rnorm(N, 0, outcome_sds[1L]),
      u_2 = rnorm(N, 0, outcome_sds[2L]),
      u_3 = rnorm(N, 0, outcome_sds[3L]),
      u_4 = rnorm(N, 0, outcome_sds[4L]),
      u = rnorm(N) * sd_i)

    potential_outcomes <- declare_potential_outcomes(
      formula = Y ~ (outcome_means[1] + u_1) * (Z == "1") +
        (outcome_means[2] + u_2) * (Z == "2") +
        (outcome_means[3] + u_3) * (Z == "3") +
        (outcome_means[4] + u_4) * (Z == "4") + u,
      conditions = c("1", "2", "3", "4"),
      assignment_variables = Z)

    estimand <- declare_estimands(
      ate_Y_2_1 = mean(Y_Z_2 - Y_Z_1),
      ate_Y_3_1 = mean(Y_Z_3 - Y_Z_1),
      ate_Y_4_1 = mean(Y_Z_4 - Y_Z_1),
      ate_Y_3_2 = mean(Y_Z_3 - Y_Z_2),
      ate_Y_4_2 = mean(Y_Z_4 - Y_Z_2),
      ate_Y_4_3 = mean(Y_Z_4 - Y_Z_3))

    assignment <- declare_assignment(
      num_arms = 4,
      conditions = c("1", "2", "3", "4"),
      assignment_variable = Z)

    reveal_Y <- declare_reveal(assignment_variables = Z)

    estimator <- declare_estimator(handler = function(data) {
      estimates <- rbind.data.frame(
        ate_Y_2_1 = difference_in_means(formula = Y ~ Z, data = data, condition1 = "1", condition2 = "2"),
        ate_Y_3_1 = difference_in_means(formula = Y ~ Z, data = data, condition1 = "1", condition2 = "3"),
        ate_Y_4_1 = difference_in_means(formula = Y ~ Z, data = data, condition1 = "1", condition2 = "4"),
        ate_Y_3_2 = difference_in_means(formula = Y ~ Z, data = data, condition1 = "2", condition2 = "3"),
        ate_Y_4_2 = difference_in_means(formula = Y ~ Z, data = data, condition1 = "2", condition2 = "4"),
        ate_Y_4_3 = difference_in_means(formula = Y ~ Z, data = data, condition1 = "3", condition2 = "4"))
      names(estimates)[names(estimates) == "N"] <- "N_DIM"
      estimates$estimator_label <- c("DIM (Z_2 - Z_1)", "DIM (Z_3 - Z_1)", "DIM (Z_4 - Z_1)",
                                     "DIM (Z_3 - Z_2)", "DIM (Z_4 - Z_2)", "DIM (Z_4 - Z_3)")
      estimates$estimand_label <- rownames(estimates)
      estimates$estimate <- estimates$coefficients
      estimates$term <- NULL
      return(estimates)
    })

    multi_arm_design <- population + potential_outcomes + assignment +
      reveal_Y + estimand + estimator

    # Holding matrix for out-of-sample R2 values (one row per simulation)
    rsq_values <- matrix(nrow = 1000, ncol = 2)

    # Simulate 1,000 data sets
    for (i in 1:1000) {

      # Get simulated data set
      input_data <- draw_data(multi_arm_design)

      # Format data for analysis (dummy-code treatment, drop first level)
      input_data <- input_data %>%
        fastDummies::dummy_cols(select_columns = "Z", remove_first_dummy = TRUE) %>%
        select(Y:Z_4)

      # Prep training and test data
      # set.seed(206) # set seed to replicate results
      training_index <- sample(1:nrow(input_data), 0.7 * nrow(input_data)) # indices for 70% training data - arbitrary
      training_data <- input_data[training_index, ] # training data
      test_data <- input_data[-training_index, ]    # test data

      # Fit linear model
      lm_mod <- lm(Y ~ ., data = training_data)

      # Fit ridge regression
      ridge_mod <- linearRidge(Y ~ ., data = training_data)

      # Get actual (from test data) and fitted values for each model
      actual <- test_data$Y
      lm_predicted <- predict(lm_mod, test_data)       # predict linear model on test data
      ridge_predicted <- predict(ridge_mod, test_data) # predict ridge model on test data

      # See how well the linear model from training data fits test data (expressed as R2)
      lm_rss <- sum((lm_predicted - actual) ^ 2)
      lm_tss <- sum((actual - mean(actual)) ^ 2)
      lm_rsq <- 1 - lm_rss / lm_tss
      rsq_values[i, 1] <- lm_rsq

      # See how well the ridge model from training data fits test data (expressed as R2)
      ridge_rss <- sum((ridge_predicted - actual) ^ 2)
      ridge_tss <- sum((actual - mean(actual)) ^ 2)
      ridge_rsq <- 1 - ridge_rss / ridge_tss
      rsq_values[i, 2] <- ridge_rsq
    }

    # Make matrix into data frame
    rsq_values <- data.frame(rsq_values)

    # Summarize R2 values for linear model
    summary(rsq_values$X1)

    # Summarize R2 values for ridge model
    summary(rsq_values$X2)
  • Are you doing any kind of cross-validation to tune your ridge parameter? I’m trying to read code on my phone (again) but don’t see you doing that. Commented Jul 24, 2020 at 16:45
  • ^ No, I didn't tune the ridge parameter. I just used the default from the ridge package. Not sure how it finds the lambdas. I could do that myself I guess and see if that changes the results. Commented Jul 24, 2020 at 16:47
  • That’s essential. I think glmnet has a ridge regression cross-validation function. (You’ll have to set the elastic net mixing parameter to $0$ to force it to do ridge rather than lasso or a hybrid of the two. The documentation will say what to do.) Commented Jul 24, 2020 at 16:49
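
For reference, a minimal sketch of that tuning step with glmnet. This reuses the training_data and test_data frames from the loop above and is only an illustration of the commenter's suggestion, not the ridge package's own method; in cv.glmnet, alpha = 0 gives the pure ridge penalty and lambda is chosen by cross-validation:

    # Cross-validated ridge with glmnet (alpha = 0 = pure ridge penalty)
    library(glmnet)

    # glmnet wants a numeric predictor matrix rather than a formula
    x_train <- model.matrix(Y ~ ., data = training_data)[, -1] # drop intercept column
    y_train <- training_data$Y
    x_test  <- model.matrix(Y ~ ., data = test_data)[, -1]

    # 10-fold cross-validation over a grid of lambda values
    cv_ridge <- cv.glmnet(x_train, y_train, alpha = 0)

    # Predict on the test set at the lambda that minimized cross-validated error
    ridge_predicted <- predict(cv_ridge, newx = x_test, s = "lambda.min")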

1 Answer


You are doing nothing wrong. Ridge regression, the LASSO, and other penalized-coefficient regressions yield biased estimates. The idea is that maybe accepting a little bias will greatly reduce the variance.
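
In symbols, for any estimator $\hat\beta$ of a coefficient $\beta$, the expected squared error splits into the two pieces being traded off:

$$\mathbb{E}\big[(\hat\beta - \beta)^2\big] = \big(\mathbb{E}[\hat\beta] - \beta\big)^2 + \operatorname{Var}(\hat\beta),$$

so shrinking coefficients is only a win when the reduction in the variance term outweighs the squared bias it introduces.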

However, there is nothing in how ridge regression, the LASSO, etc. are formulated that guarantees they will perform better at out-of-sample prediction. Sometimes a simple linear model, informed by theory and built by an analyst who knows the problem domain, can trounce a model fit by ridge regression. This is true across problem domains and in all sorts of circumstances.

This is, essentially, a question about model selection. There is no need for code; the issue is not specific to your data or method of inference. Your findings illustrate that model selection (or what ML/AI people call feature selection) is not a solved problem.

  • +1 "The idea is that maybe accepting a little bias will greatly reduce the variance" is key. We hope to reduce the variance a lot by accepting a little bit of bias, but we might be wrong: we might not be accepting enough bias, or we might be biasing too much. Commented Jul 24, 2020 at 16:43
  • Interesting. My interpretation from reading books on statistical learning was that methods like ridge regression should do better than OLS. For example, see the bottom of page 217 in this well-known text, An Introduction to Statistical Learning: faculty.marshall.usc.edu/gareth-james/ISL/… Commented Jul 24, 2020 at 16:45
  • Ridge regression and LASSO can perform better (a sketch of one such setting follows this thread), and they are useful, informative modeling tools. However, I've seen (and presented on) many situations where they perform worse or suggest obviously wrong models. Like everything in modeling, most methods are useful but none is a silver bullet. That said, LASSO and ridge regression are useful enough that I advise students to learn about them. Commented Jul 24, 2020 at 17:38
  • ^ This makes sense Commented Jul 24, 2020 at 18:02
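
To make the "can perform better" case concrete, here is a minimal sketch of the kind of setting where the penalty usually pays off out of sample: many correlated predictors with small effects and a modest sample size. Everything in it (the sample size, the correlation structure, the use of glmnet and MASS) is illustrative rather than taken from the question's design:

    # A setting where ridge typically beats OLS out of sample:
    # p close to n, correlated predictors, many small true effects
    library(MASS)   # for mvrnorm
    library(glmnet)

    set.seed(1)
    n <- 100
    p <- 50
    Sigma <- 0.7 ^ abs(outer(1:p, 1:p, "-"))       # AR(1)-style correlation among predictors
    X <- mvrnorm(n, mu = rep(0, p), Sigma = Sigma)
    beta <- rnorm(p, 0, 0.3)                       # many small true coefficients
    y <- as.vector(X %*% beta + rnorm(n))

    train <- sample(1:n, 0.7 * n)
    ols   <- lm(y[train] ~ X[train, ])
    ridge <- cv.glmnet(X[train, ], y[train], alpha = 0)

    ols_pred   <- cbind(1, X[-train, ]) %*% coef(ols)                  # manual OLS prediction
    ridge_pred <- predict(ridge, newx = X[-train, ], s = "lambda.min")

    mean((y[-train] - ols_pred) ^ 2)   # OLS test MSE
    mean((y[-train] - ridge_pred) ^ 2) # ridge test MSE (typically smaller in this setup)

As n grows relative to p the gap narrows or reverses, which is exactly the "no guarantee" point in the answer above.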
