
I have been trying to get my head around $R^2$ in a bit more detail instead of just seeing it as a number.

So far I have looked at the process in the following manner:

If I knew very little about the $x$ and $y$ variables and was asked to make a prediction of $y$ for any $x$, I'd guess the mean of $y$ as my prediction. However, if I could see the regression line, I'd use it instead to give a more accurate prediction.

Here's where my confusion starts…

I measure the difference between each $y$ point and the mean of $y$, and then square it. I repeat this for every $y$ point and sum the results; this is the total variation in $y$. Let's label this B.

I also measure the difference between every $y$ point and its predicted value on the line, and again square it. I take the total of all of these; this is the unexplained variation (the part the line fails to capture). Let's label this A.

So $1 - (A/B)$ gives me my $R^2$. A/B is some kind of ratio, then? But I don't get what is meant by explained variation. How does the regression line explain these points?

What does it mean for the regression line to account for variance?

It seems that the further the points are from the mean, while being closer to the line, the better?
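To make my question concrete, here is how I'm computing A and B (a small Python/numpy sketch with made-up numbers):

```python
import numpy as np

# Made-up toy data, just to make the question concrete.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Least-squares regression line y = b0 + b1 * x.
b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x

B = np.sum((y - y.mean()) ** 2)  # squared distances from the mean of y
A = np.sum((y - y_hat) ** 2)     # squared distances from the line

print(1 - A / B)  # my R^2
```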

Thanks.

  • (+1) The thread at stats.stackexchange.com/questions/13314 addresses some aspects of this question. The various answers offer examples and critical interpretation of $R^2$. (Commented Jun 22, 2015 at 14:29)

1 Answer


Bravo for the intuition that, knowing nothing else, predicting the mean of $y$ every time is the best you can do (at least when "best" is measured in terms of squared deviations between observed and predicted values). I believe this is a critical component of understanding what $R^2$ and its generalizations mean.
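To see why the mean is optimal under squared error, note that for any constant prediction $c$, the total squared error is minimized where its derivative with respect to $c$ vanishes:

$$ \frac{d}{dc}\sum_{i=1}^{n}\left(y_i - c\right)^2 = -2\sum_{i=1}^{n}\left(y_i - c\right) = 0 \quad\Longrightarrow\quad c = \frac{1}{n}\sum_{i=1}^{n} y_i = \bar y, $$

and the second derivative, $2n > 0$, confirms this is a minimum.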

There are many equivalent ways of writing $R^2$ in the simple cases, such as in-sample for ordinary least squares linear regression. Using standard notation where $n$ is the sample size, $y_i$ are the observed values, $\hat y_i$ are the predicted values, and $\bar y$ is the usual mean of all $y_i$, the one that makes the most sense to me is the following:

$$ R^2=1-\left(\dfrac{ \overset{n}{\underset{i=1}{\sum}}\left( y_i-\hat y_i \right)^2 }{ \overset{n}{\underset{i=1}{\sum}}\left( y_i-\bar y \right)^2 }\right) $$

(For (in-sample) OLS linear regression, this turns out to be equal to the squared correlation between predicted and observed values, also equal to the squared correlation between the $x$ and $y$ variables in a simple linear regression.)
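Here is a quick numerical check of those equivalences, a sketch using numpy with synthetic data (the seed and coefficients are arbitrary):

```python
import numpy as np

# Synthetic linear data with noise, purely for illustration.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 + 1.5 * x + rng.normal(size=100)

slope, intercept = np.polyfit(x, y, 1)
y_hat = intercept + slope * x

r2_formula = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
r2_corr_pred = np.corrcoef(y, y_hat)[0, 1] ** 2  # squared corr(observed, predicted)
r2_corr_xy = np.corrcoef(x, y)[0, 1] ** 2        # squared corr(x, y)

print(r2_formula, r2_corr_pred, r2_corr_xy)  # all three agree up to rounding
```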

A slight modification of the notation gives a relationship to variance.

$$ R^2=1-\left(\dfrac{ \dfrac{1}{n}\overset{n}{\underset{i=1}{\sum}}\left( y_i-\hat y_i \right)^2 }{ \dfrac{1}{n}\overset{n}{\underset{i=1}{\sum}}\left( y_i-\bar y \right)^2 }\right) $$

Since the $\dfrac{1}{n}$ terms in the numerator and denominator cancel, this is equal to the earlier formula. But now the numerator and denominator are the variances of the residuals and of the original data, respectively.

$$ R^2=1-\left(\dfrac{ \dfrac{1}{n}\overset{n}{\underset{i=1}{\sum}}\left( y_i-\hat y_i \right)^2 }{ \dfrac{1}{n}\overset{n}{\underset{i=1}{\sum}}\left( y_i-\bar y \right)^2 }\right)\\ = 1 - \left( \dfrac{ \mathbb V\text{ar}\left( Y - \hat Y \right) }{ \mathbb V\text{ar}\left( Y \right) } \right) $$
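The same check in variance form, again a numpy sketch with arbitrary synthetic data; note that `np.var` divides by $n$ by default, matching the $\dfrac{1}{n}$ above:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 1.0 + 0.8 * x + rng.normal(size=200)  # synthetic data for illustration

slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)

r2_ss = 1 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)
r2_var = 1 - np.var(residuals) / np.var(y)  # Var(Y - Yhat) / Var(Y), ddof=0

print(r2_ss, r2_var)  # identical: the 1/n factors cancel
```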

Next, I will borrow some of the explanation from another answer of mine.

$$ y_i-\bar{y} = (y_i - \hat{y_i} + \hat{y_i} - \bar{y}) = (y_i - \hat{y_i}) + (\hat{y_i} - \bar{y}) $$

$$( y_i-\bar{y})^2 = \Big[ (y_i - \hat{y_i}) + (\hat{y_i} - \bar{y}) \Big]^2 = (y_i - \hat{y_i})^2 + (\hat{y_i} - \bar{y})^2 + 2(y_i - \hat{y_i})(\hat{y_i} - \bar{y}) $$

$$SSTotal := \sum_i ( y_i-\bar{y})^2 = \sum_i(y_i - \hat{y_i})^2 + \sum_i(\hat{y_i} - \bar{y})^2 + 2\sum_i\Big[ (y_i - \hat{y_i})(\hat{y_i} - \bar{y}) \Big]$$

$$ :=SSRes + SSReg + Other $$

Divide through by the sample size $n$ (or $n-1$) to get variance estimates.

In OLS linear regression (with an intercept), $Other$ drops to zero. Consequently, all of the variance in $Y$ is accounted for by the residual variance (unexplained) and the regression variance (explained). We can therefore describe the proportion of total variance explained by the regression: the variance explained by the regression model $(SSReg/n)$ divided by the total variance $(SSTotal/n)$.
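You can verify numerically that the cross term vanishes for an OLS fit with an intercept (a numpy sketch with arbitrary synthetic data):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=150)
y = 3.0 - 0.5 * x + rng.normal(size=150)  # synthetic data for illustration

slope, intercept = np.polyfit(x, y, 1)
y_hat = intercept + slope * x

ss_total = np.sum((y - y.mean()) ** 2)
ss_res = np.sum((y - y_hat) ** 2)
ss_reg = np.sum((y_hat - y.mean()) ** 2)
other = 2 * np.sum((y - y_hat) * (y_hat - y.mean()))  # the cross term

print(other)                                      # ~0 (floating-point noise)
print(ss_total, ss_res + ss_reg)                  # SSTotal = SSRes + SSReg
print(ss_reg / ss_total, 1 - ss_res / ss_total)   # both equal R^2
```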

$$ \begin{aligned} \dfrac{SSReg/n}{SSTotal/n} &= \dfrac{SSReg}{SSTotal} \\ &= \dfrac{SSTotal - SSRes - Other}{SSTotal} \\ &= 1-\dfrac{SSRes}{SSTotal} \\ &= 1-\left(\dfrac{ \overset{n}{\underset{i=1}{\sum}}\left( y_i-\hat y_i \right)^2 }{ \overset{n}{\underset{i=1}{\sum}}\left( y_i-\bar y \right)^2 }\right) \end{aligned} $$

For an intuition, you observe some phenomenon and record the values produced. As you notice that they are not all equal, you begin to wonder why. Different starting conditions (values of the features) can account for some of that. As an example, consider why people are not all the same height. One reason for this is that not everyone is the same age, and people tend to get taller as they grow up. If you only consider adults (so age is a feature), you will have a much narrower range of heights than if you consider all people. If you start to consider genetics and lifestyle, you might be able to get a rather tight distribution of plausible heights, thus explaining much of the variation in the combined values of all human heights.
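As a toy version of that story (entirely made-up numbers and effect sizes, just to show $R^2$ growing as features account for more of the variation):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000
age = rng.uniform(2, 18, size=n)    # synthetic ages (still-growing children)
genes = rng.normal(size=n)          # synthetic "genetic" score
height = 80 + 5.5 * age + 6.0 * genes + rng.normal(0, 4, size=n)

def r_squared(y, *features):
    """In-sample R^2 of an OLS fit of y on the given features (with intercept)."""
    X = np.column_stack([np.ones(len(y)), *features])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    y_hat = X @ beta
    return 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

print(r_squared(height, age))         # age alone explains much of the variation
print(r_squared(height, age, genes))  # adding "genetics" explains still more
```

Each added feature shrinks the residual variance, and $R^2$ records what fraction of the original variance has been accounted for.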

