
This is a common way to define $R^2$ in a regression problem.

$$ R^2=1-\left(\dfrac{ \overset{N}{\underset{i=1}{\sum}}\left( y_i-\hat y_i \right)^2 }{ \overset{N}{\underset{i=1}{\sum}}\left( y_i-\bar y \right)^2 }\right) $$

In an OLS linear regression with an intercept, this winds up being equivalent to other calculations.

  1. Squared Pearson correlation between true and predicted values: $\left[ \text{corr}\left( y, \hat y \right) \right]^2$.

  2. In a simple linear regression with just one predictor, $x$, the squared Pearson correlation between the outcome $y$ and that predictor, $x$: $\left[ \text{corr}\left( y, x \right) \right]^2$.

Consequently, there is a straightforward interpretation of $\sqrt{1-\left(\frac{ \overset{N}{\underset{i=1}{\sum}}\left( y_i-\hat y_i \right)^2 }{ \overset{N}{\underset{i=1}{\sum}}\left( y_i-\bar y \right)^2 }\right)} $ as a correlation coefficient, at least in the case of OLS linear regression (with an intercept).
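
For concreteness, here is a quick check with simulated data (just a sketch with made-up numbers) showing all three calculations agreeing for an OLS fit with an intercept:

    set.seed(1)
    x <- rnorm(50)
    y <- 1 + 2 * x + rnorm(50)   # simulated data
    fit <- lm(y ~ x)             # OLS with an intercept
    yhat <- fitted(fit)

    r2_def  <- 1 - sum((y - yhat)^2) / sum((y - mean(y))^2)  # definition above
    r2_pred <- cor(y, yhat)^2    # squared correlation between true and predicted values
    r2_x    <- cor(y, x)^2       # squared correlation between outcome and the single predictor
    c(r2_def, r2_pred, r2_x)     # all three agree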

This leads me to two related questions.

  1. In the situation of OLS linear regression with an intercept, $R^2\ge0$, and typically $R^2>0$. Thus, we (typically) get two square roots, one positive and one negative. Only one of those values equals a correlation from above. How can we interpret the other root, particularly in multiple regression with multiple $x$ variables?

  2. In general, $R^2$ can be negative (it need not be bounded below by zero), so we are not guaranteed to have real square roots. When $R^2 < 0$, that is, when the fit is so poor that a "predict $\bar y$ every time" model gives a lower square loss, how do we interpret the complex roots arising from such a poor model?

[Image: complex roots of $R^2 < 0$, produced by the code below]

    # https://stackoverflow.com/questions/14966814/multiple-roots-in-the-complex-plane-with-r
    nRoot <- function(x, root) {
      polyroot(c(-x, rep(0, root - 1), 1))
    }

    y <- c(1, 2, 3)        # Observed outcomes
    yhat <- c(-2, -3, -4)  # Predictions from some (bad) model
    r2 <- 1 - sum((y - yhat)^2) / sum((y - mean(y))^2)

    # Take the complex square roots
    zs <- nRoot(r2, 2)
    real_parts <- imaginary_parts <- rep(NA, 2)
    for (i in 1:2){
      real_parts[i] <- Re(zs[i])
      imaginary_parts[i] <- Im(zs[i])
    }

    # Prepare "circle data"
    # https://stackoverflow.com/a/22266105/11751799
    radius <- sqrt(abs(r2))
    center_x <- 0
    center_y <- 0
    theta <- seq(0, 2 * pi, length = 200)  # angles for drawing points around the circle

    # Draw a circle
    plot(
      x = radius * cos(theta) + center_x,
      y = radius * sin(theta) + center_y,
      type = "l",
      xlab = "Real",
      ylab = "Imaginary",
      main = paste("Complex Roots of \nR^2 =", round(r2, 3))
    )

    # Plot the complex roots
    points(0, 0)
    points(real_parts[1], imaginary_parts[1], col = 'red')
    lines(c(0, real_parts[1]), c(0, imaginary_parts[1]), col = 'red')
    points(real_parts[2], imaginary_parts[2], col = 'blue')
    lines(c(0, real_parts[2]), c(0, imaginary_parts[2]), col = 'blue')

IDEA

We solve for the OLS solution using matrix calculus and get an explicit formula for the estimate of the regression parameters, the usual $\hat\beta = (X^TX)^{-1}X^Ty$. Conceptually, however, OLS means calculating the predictions across all possible parameter vectors and then picking the parameter vector giving the predictions with the lowest sum of squared residuals.

$$\hat\beta_{\text{OLS}} \in \left\{\underset{\left( \tilde\beta_0, \tilde\beta_1, \dots,\tilde\beta_p \right)\in\mathbb{R}^{p+1}}{\arg\min}\left\{ \underset{i = 1}{\overset{N}{\sum}}\left( y_i - \hat y_i \right)^2\bigg\vert \hat y_i = \tilde\beta_0 + \tilde\beta_1 x_{i1} +\dots + \tilde\beta_p x_{ip} \right\}\right\}$$

While the familiar $\hat\beta = (X^TX)^{-1}X^Ty$ is such an $\arg\min$, every combination of real numbers is a candidate estimate: each one gives a set of predictions, a sum of squared residuals, and an $R^2$ value with associated square roots in $\mathbb C$.
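
As a sketch of this view (simulated data, with optim() used as a generic numerical minimiser standing in for the search over all candidate parameter vectors), directly minimising the sum of squared residuals lands on essentially the same estimate as the closed-form formula:

    set.seed(42)
    x <- rnorm(30)
    y <- 3 + 2 * x + rnorm(30)   # simulated data

    sse <- function(beta) sum((y - (beta[1] + beta[2] * x))^2)  # sum of squared residuals
    fit_optim <- optim(c(0, 0), sse)   # generic numerical minimisation over (beta0, beta1)
    fit_lm <- lm(y ~ x)                # closed-form OLS

    rbind(optim = fit_optim$par, lm = coef(fit_lm))  # estimates agree (up to optimiser tolerance)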

In the image below, I created some synthetic data and made three guesses about the simple linear regression parameters that might minimize the sum of squared residuals. Sure, there is more to regression than OLS linear regression, but might the green and red roots have some interpretation?

[Image: complex roots of $R^2$ for multiple simple linear regression estimates, produced by the code below]

    library(data.table)
    library(ggplot2)

    x <- c(0, 1, 2)
    y <- c(2, 7, 6)

    params1 <- c(2, 7)
    yhat1 <- cbind(1, x) %*% params1
    r2_1 <- 1 - sum((y - yhat1)^2) / sum((y - mean(y))^2)

    params2 <- c(-1, -4)
    yhat2 <- cbind(1, x) %*% params2
    r2_2 <- 1 - sum((y - yhat2)^2) / sum((y - mean(y))^2)

    params3 <- c(3, 2)  # This happens to be the OLS solution, confirmed by lm(y ~ x)
    yhat3 <- cbind(1, x) %*% params3
    r2_3 <- 1 - sum((y - yhat3)^2) / sum((y - mean(y))^2)

    params <- rbind(params1, params2, params3)

    # https://stackoverflow.com/questions/14966814/multiple-roots-in-the-complex-plane-with-r
    nRoot <- function(x, root = 2) {
      polyroot(c(-x, rep(0, root - 1), 1))
    }

    r2_1  # -6.428571
    r2_2  # -26
    r2_3  # 0.5714286
    r2 <- c(r2_1, r2_2, r2_3)

    L <- list()
    for (i in 1:length(r2)){
      zs <- nRoot(r2[i])
      real_parts <- imaginary_parts <- rep(NA, 2)
      for (j in 1:2){
        real_parts[j] <- Re(zs[j])
        imaginary_parts[j] <- Im(zs[j])
      }
      L[[i]] <- data.frame(
        Real = c(0, 0, real_parts),
        Imaginary = c(0, 0, imaginary_parts),
        Estimate = paste(params[i, 1], ", ", params[i, 2], sep = "")
      )
    }
    d <- data.table::rbindlist(L)

    ggplot(d, aes(x = Real, y = Imaginary, col = Estimate)) +
      geom_line() +
      geom_point()

1 Answer


As a preliminary note, it is not entirely clear what you mean by the "roots" of a constant real number, but you do note that $R^2$ has both a positive and a negative square root in the univariate case, one of which is the correlation between the response and the explanatory variable. I presume you have in mind some extension of this result to the case of multiple regression. In any case, I'm going to give some information that I think will elucidate your problem, and hopefully it is helpful.


This kind of analysis and decomposition of the coefficient of determination in multiple regression is examined in O'Neill (2019), along with various geometric properties of the linear regression model. That paper bears closely on the questions you are asking here, so it may be of interest to you. The analysis I offer here is taken from results in that paper. In the case of a multiple linear regression with $m$ explanatory variables, following the above paper, suppose we denote the relevant correlations between the vectors as:

$$R_i = \text{Corr}(\mathbf{y}, \mathbf{x}_i) \quad \quad \quad \quad \quad R_{i,k} = \text{Corr}(\mathbf{x}_i, \mathbf{x}_k),$$

and we organise these into the goodness of fit vector and design correlation matrix given respectively by:

$$\boldsymbol{\Omega} = \begin{bmatrix} R_{1} \\ R_{2} \\ \vdots \\ R_{m} \\ \end{bmatrix} \quad \quad \quad \quad \quad \boldsymbol{\Theta} = \begin{bmatrix} 1 & R_{1,2} & R_{1,3} & \cdots & R_{1,m} \\ R_{2,1} & 1 & R_{2,3} & \cdots & R_{2,m} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ R_{m,1} & R_{m,2} & R_{m,3} & \cdots & 1 \\ \end{bmatrix}.$$

Then we can write the coefficient of determination as the quadratic form:

$$R^2 = \boldsymbol{\Omega}^\text{T} \boldsymbol{\Theta}^{-1} \boldsymbol{\Omega}.$$
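
As a quick numerical illustration of this quadratic form (my own simulated data, not an example from the paper), it matches the usual coefficient of determination reported by lm():

    set.seed(123)                       # made-up data for illustration
    n <- 200
    x1 <- rnorm(n); x2 <- 0.5 * x1 + rnorm(n); x3 <- rnorm(n)
    X <- cbind(x1, x2, x3)
    y <- 1 + 2 * x1 - x2 + 0.5 * x3 + rnorm(n)

    Omega <- cor(X, y)                  # goodness of fit vector
    Theta <- cor(X)                     # design correlation matrix
    r2_quad <- as.numeric(t(Omega) %*% solve(Theta) %*% Omega)
    r2_lm   <- summary(lm(y ~ X))$r.squared
    all.equal(r2_quad, r2_lm)           # TRUE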

Now, suppose we use the spectral decomposition $\boldsymbol{\Theta} = \boldsymbol{\upsilon}^\text{T} \boldsymbol{\Lambda} \boldsymbol{\upsilon}$ and let $\mathbf{z}_1,...,\mathbf{z}_m$ be the principal components of the design matrix with respect to the eigenvectors in this decomposition. Then it can be shown (O'Neill 2019, pp. 10-14) that:

$$R^2 = \sum_{i=1}^m \text{Corr}(\mathbf{y}, \mathbf{z}_i)^2.$$

This result shows that the coefficient of determination can be decomposed into the sum of squares of the correlations between the response vector in the regression and the principal components of the design matrix. These correlations can be positive or negative, but they enter into the coefficient of determination only through their square. This result extends the phenomenon you have observed that $R^2 = \text{Corr}(\mathbf{y}, \mathbf{x})^2$ in the univariate case.
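
Again as a rough numerical check (my own simulated data; here I take the principal components to be those of the column-standardised design matrix, i.e., built from the eigenvectors of $\boldsymbol{\Theta}$), the sum of squared correlations with the principal components reproduces the usual coefficient of determination:

    set.seed(123)                       # same made-up data as above
    n <- 200
    x1 <- rnorm(n); x2 <- 0.5 * x1 + rnorm(n); x3 <- rnorm(n)
    X <- cbind(x1, x2, x3)
    y <- 1 + 2 * x1 - x2 + 0.5 * x3 + rnorm(n)

    Theta <- cor(X)                     # design correlation matrix
    eig   <- eigen(Theta)               # spectral decomposition of Theta
    Z     <- scale(X) %*% eig$vectors   # principal components z_1, ..., z_m

    r2_pc <- sum(cor(y, Z)^2)           # sum of squared correlations with the principal components
    r2_lm <- summary(lm(y ~ X))$r.squared
    all.equal(r2_pc, r2_lm)             # TRUE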

This particular aspect of the coefficient of determination is just one part of a broader set of geometric behaviours that are exhibited in multiple linear regression. Moreover, there are some interesting and counter-intuitive behaviours of the coefficient, owing to complexities in the interaction of the response vector and the principal components of the design matrix. I recommend you read O'Neill (2019) to get a good overview of the geometric properties of multiple linear regression and the specific properties of the coefficient of determination.
