
This is my understanding of glmnet:

If OLS minimizes the RSS, where

$ \mathrm{RSS} = \sum_i (y_i - x_i^\top \beta)^2 $

I believe glmnet is minimizing:

$ \mathrm{RSS} + \sum_j \left( \alpha |\beta_j| + (1-\alpha) \beta_j^2 \right) $ where $\alpha=\lambda_1/(\lambda_1+\lambda_2) $

$\lambda_1$ and $\lambda_2$ come from the lasso and ridge penalties, respectively, but I'm confused: is $\lambda_1 = \lambda_2$, so that cv.glmnet in the R package glmnet solves for a single variable $\lambda$ (along the whole path)? But then $\alpha = 0.5$ always.

If $\lambda_1 = \lambda_2 $, is the glmnet objective equivalent to $\mathrm{RSS} + \lambda \sum_j |\beta_j| + \lambda \sum_j \beta_j^2 $?

I've read through Hastie et al. (2009), The Elements of Statistical Learning, and Zou and Hastie (2005), so now I'm trying to get some clarification on the lambdas and alpha. Thanks.

EDIT:

I found a useful formulation in Friedman et al. (2010), Regularization Paths for Generalized Linear Models via Coordinate Descent:

$$ \frac{1}{2N} \sum_{i=1}^{N} \left( y_i - \beta_0 - x_i^\top \beta \right)^2 + \lambda P_\alpha(\beta) $$ where $$ P_\alpha(\beta) = \sum_{j=1}^{p} \left[ \tfrac{1}{2}(1-\alpha) \beta_j^2 + \alpha |\beta_j| \right] $$ I thought it provided some intuition for how lambda and alpha exist together.
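
To see the roles of $\lambda$ and $\alpha$ concretely, here is a minimal R sketch that evaluates this objective directly; the data and the values of beta0, beta, lambda, and alpha are made up purely for illustration, not fitted:

    # Evaluate the elastic net objective from Friedman et al. (2010)
    # on made-up data; all values are arbitrary, for illustration only.
    set.seed(1)
    N <- 100; p <- 5
    X <- matrix(rnorm(N * p), N, p)
    y <- rnorm(N)
    beta0  <- 0.1        # intercept (illustrative)
    beta   <- rnorm(p)   # coefficients (illustrative)
    lambda <- 0.5        # overall penalty strength
    alpha  <- 0.5        # elastic net mixing parameter

    loss    <- sum((y - beta0 - X %*% beta)^2) / (2 * N)
    penalty <- sum(0.5 * (1 - alpha) * beta^2 + alpha * abs(beta))
    objective <- loss + lambda * penalty

Here a single $\lambda$ scales the whole penalty, while $\alpha$ only splits it between the $\ell_1$ and $\ell_2$ terms.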

  • Relevant: stats.stackexchange.com/questions/67736/… Commented Oct 5, 2016 at 7:04
  • Relevant and helpful, but I still don't understand how a single $\lambda$ comes into the picture when the penalty is formulated with $\alpha$. Commented Oct 5, 2016 at 16:15
  • You haven't correctly given the penalty: the entire term must be multiplied by a second independent parameter, called "$\lambda$" in the documentation. It functions like the $\lambda$ in your edit (and is directly proportional to it). This lambda has nothing to do with your $\lambda_1$ and $\lambda_2$. cv.glmnet helps you find $\lambda$, but you have to specify $\alpha$. Commented May 26, 2017 at 17:21

2 Answers


$\alpha=\frac{\lambda_1}{\lambda_1+\lambda_2}$ and $1-\alpha=\frac{\lambda_2}{\lambda_1+\lambda_2}$. And because $\lambda_i\ge0,$ it should be clear that $\alpha\in[0,1].$ So in glmnet, $\lambda=\lambda_1+\lambda_2$, and each penalty term has a coefficient of either $\alpha(\lambda_1+\lambda_2)$ or $(1-\alpha)(\lambda_1+\lambda_2)$.

But treating $\alpha$ separately from $\lambda_1, \lambda_2$ is convenient as a conceptual model, because $\alpha$ controls how much of the ridge and lasso penalties is applied, with either extreme arising as a special case. You can make a model "more lasso" or "more ridge" by adjusting $\alpha$ without having to worry about how to adjust $\lambda_i$ relative to the size of $\lambda_j, j\neq i$. That is, treated separately, $\alpha$ controls the range of elastic net compositions on a continuum of ridge to lasso, while $\lambda$ controls the overall magnitude of the penalty. The two can be thought of as distinct model hyperparameters, whereas the formulation with two lambdas ties the two penalties together.

And if both $\lambda_1$ and $\lambda_2$ are 0, that should correspond to no penalty, but the fraction $\frac{\lambda_1}{\lambda_1+\lambda_2}=\frac{0}{0}$ is unsightly and indeterminate.
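
To make the reparameterization concrete, here is a tiny R sketch showing how to move between the two parameterizations; the values of lambda1 and lambda2 are arbitrary:

    # Hypothetical lasso and ridge weights (arbitrary values)
    lambda1 <- 0.3   # lasso weight
    lambda2 <- 0.1   # ridge weight

    # glmnet's parameterization
    lambda <- lambda1 + lambda2              # overall penalty magnitude
    alpha  <- lambda1 / (lambda1 + lambda2)  # mixing parameter in [0, 1]

    # Recovering the original weights
    alpha * lambda        # equals lambda1
    (1 - alpha) * lambda  # equals lambda2

Changing $\alpha$ with $\lambda$ held fixed redistributes the same total penalty between the lasso and ridge terms; it doesn't change the total.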

  • Thanks. Could you elaborate on your third sentence? How did you get to $\lambda = \lambda_1 + \lambda_2$? Commented Oct 5, 2016 at 18:01
  • That's just the definition/convention that glmnet uses. Commented Oct 5, 2016 at 18:52
  • OK. And when you said "$\alpha$ controls the range of elastic net compositions on a continuum of ridge to lasso, while $\lambda$ controls the overall magnitude of the penalty", doesn't $\alpha$ scale $\lambda$, therefore also controlling the magnitude of the penalty? Commented Oct 5, 2016 at 19:17
  • The total magnitude $\lambda$ isn't changed by $\alpha$. $\alpha$ just changes how much of the penalty is applied to the lasso term and how much to the ridge term. Because it's a convex combination, the total penalty remains constant even as you change $\alpha$. If $\alpha$ is 1 or 0, you're doing pure lasso or pure ridge regression, respectively, because the penalty on the other term is 0. Commented Oct 5, 2016 at 19:25
  • That's addressed in this thread: stats.stackexchange.com/questions/74542/… Commented Oct 6, 2016 at 13:58

Just to add: From the help file of glmnet, we read:

Note that cv.glmnet does NOT search for values for alpha. A specific value should be supplied, else alpha=1 is assumed by default. If users would like to cross-validate alpha as well, they should call cv.glmnet with a pre-computed vector foldid, and then use this same fold vector in separate calls to cv.glmnet with different values of alpha.

This shows that cv.glmnet doesn't cross-validate over $\alpha$, so the cross-validation is just one-dimensional (over the $\lambda$ path), as I think you suspected. A sketch of the suggested two-step approach follows.
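
As a minimal sketch of that suggestion, the following R code cross-validates over a grid of alpha values using a shared foldid; the data, fold count, and alpha grid are arbitrary illustrative choices:

    library(glmnet)
    set.seed(1)
    x <- matrix(rnorm(100 * 20), 100, 20)  # made-up predictors
    y <- rnorm(100)                        # made-up response

    # Fix the folds so every alpha is judged on the same splits
    foldid <- sample(rep(1:10, length.out = nrow(x)))

    alphas <- c(0, 0.25, 0.5, 0.75, 1)
    fits <- lapply(alphas, function(a)
      cv.glmnet(x, y, alpha = a, foldid = foldid))

    # Pick the (alpha, lambda) pair with the smallest CV error
    best <- which.min(sapply(fits, function(f) min(f$cvm)))
    alphas[best]              # chosen alpha
    fits[[best]]$lambda.min   # chosen lambda for that alpha

Reusing the same foldid across calls is what makes the CV errors comparable across different values of alpha.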

