It is stated that the equation for the best $\hat{x}$ uses the covariance matrix $V$: $$ A^TV^{-1}A\hat{x}=A^TV^{-1}b $$ and that the quantity to minimize is the weighted error $E=(b-Ax)^TV^{-1}(b-Ax)$, which for a diagonal $V$ with entries $\sigma_i^2$ becomes $E=\sum_{i=1}^m\dfrac{(b-Ax)_i^2}{\sigma_i^2}$.
How do I make sense of this idea of weighted least squares in comparison with the ordinary least squares method?
And how do the inverses of the variances become the respective weights?
My Understanding
Least square
We have a system $Ax=b$ and $b$ need not have to be in the column space of $A$.
We need the vector $A\hat{x}$ in $C(A)$ that is the best fit to $b$, i.e., closest to $b$.
So we need to minimize the squared length of the error vector $\vec{e}=b-Ax$, i.e., we need to minimize $e=e_1^2+\cdots+e_m^2=(b-Ax)^T(b-Ax)=\|b-Ax\|^2=\sum_{i=1}^m(b-Ax)_i^2$
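A small NumPy sketch of this projection idea (the system $Ax=b$ here is made up for illustration): solving the normal equations $A^TA\hat{x}=A^Tb$ gives the $\hat{x}$ whose error $b-A\hat{x}$ is orthogonal to $C(A)$.

```python
import numpy as np

# Hypothetical overdetermined system: three equations, two unknowns,
# so b is generally not in the column space of A.
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 2.0, 4.0])

# Normal equations A^T A x_hat = A^T b project b onto C(A).
x_hat = np.linalg.solve(A.T @ A, A.T @ b)

# The error vector e = b - A x_hat is orthogonal to every column of A.
e = b - A @ x_hat
print(x_hat)
print(A.T @ e)  # ≈ [0, 0]: error is perpendicular to C(A)
```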
Weighted Least square
I think here we assign a weight $w_i$ to each error, since the individual errors are not all equally reliable.
Then we need to minimize the weighted error $e=\sum_{i=1}^m w_ie_i^2=(b-Ax)^TW(b-Ax)=\sum_{i=1}^m w_i(b-Ax)_i^2$, where $W$ is the diagonal matrix with entries $w_i$.
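If it helps to connect this quadratic with the normal equations quoted at the top: setting its gradient with respect to $x$ to zero gives, for symmetric positive definite $W$,
$$\nabla_x\,(b-Ax)^TW(b-Ax)=-2A^TW(b-Ax)=0\quad\Longrightarrow\quad A^TWA\hat{x}=A^TWb,$$
which matches the quoted equation exactly when $W=V^{-1}$.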
But why is the weight matrix $W$ the inverse of the covariance matrix, $W=V^{-1}$?
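One linear-algebra way to check the claim numerically (matrix and $\sigma_i$ values are made up for illustration): with diagonal $V$, solving $A^TV^{-1}A\hat{x}=A^TV^{-1}b$ gives the same answer as ordinary least squares applied to the rescaled system in which equation $i$ is divided by $\sigma_i$, so that every rescaled error has variance $1$.

```python
import numpy as np

A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 2.0, 4.0])
sigma = np.array([1.0, 0.5, 2.0])   # assumed standard deviations per equation
Vinv = np.diag(1.0 / sigma**2)      # W = V^{-1} for diagonal V

# Weighted normal equations: A^T V^{-1} A x_hat = A^T V^{-1} b
x_w = np.linalg.solve(A.T @ Vinv @ A, A.T @ Vinv @ b)

# Ordinary least squares on the rescaled system (row i divided by sigma_i):
As = A / sigma[:, None]
bs = b / sigma
x_s = np.linalg.solve(As.T @ As, As.T @ bs)

print(np.allclose(x_w, x_s))  # True: weighting by V^{-1} = rescaling to unit variance
```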
Reference: Page 557, Introduction to Linear Algebra, Gilbert Strang
Note: I am actually looking for an explanation based on linear algebra, without going deep into statistics.

