
I'm trying to understand the connection between the geometric interpretation of solving an inconsistent system $A\mathbf{x} = \mathbf{b}$ and the name "least squares."

I understand the geometric approach perfectly. When an exact solution doesn't exist, the best approximate solution $\hat{\mathbf{x}}$ is the one that makes the vector $A\hat{\mathbf{x}}$ the closest possible vector to $\mathbf{b}$ within the column space of $A$. This closest vector is the orthogonal projection of $\mathbf{b}$ onto the column space.

This leads to the condition that the error vector, $\mathbf{e} = \mathbf{b} - A\hat{\mathbf{x}}$, must be orthogonal to the column space of $A$. This orthogonality is expressed by the normal equation: $$A^T(\mathbf{b} - A\hat{\mathbf{x}}) = \mathbf{0} \implies A^T A \hat{\mathbf{x}} = A^T \mathbf{b}$$ This derivation is based entirely on the geometry of vector spaces and the concept of orthogonality.
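For concreteness, here is a small numerical check I put together (the data are arbitrary and the routines are just NumPy's standard ones, so this is only an illustration): solving the normal equations gives the same $\hat{\mathbf{x}}$ as a generic least-squares solver, and the residual comes out orthogonal to the columns of $A$.

```python
import numpy as np

# A small inconsistent 3x2 system (data chosen arbitrarily for illustration).
A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0, 2.0])

# Solve the normal equations A^T A x = A^T b directly.
x_normal = np.linalg.solve(A.T @ A, A.T @ b)

# Compare with NumPy's generic least-squares solver.
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

print(x_normal, x_lstsq)          # same x-hat from both routes
print(A.T @ (b - A @ x_normal))   # ~[0, 0]: residual is orthogonal to col(A)
```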

My confusion is with the terminology. I know that one can also find this solution by using calculus to minimize the sum of the squared errors. This calculus-based method is clearly a "least squares" problem.

My question is: Why do we also call the purely geometric/matrix method a "least squares" solution? Is it simply because its solution happens to be identical to the one derived from the calculus method? Or is there a more fundamental reason that the geometric condition of orthogonality is itself an expression of "least squares," independent of the calculus perspective?

Thank you.

  • Does it help to consider the following? The distances you are calculating in your geometrical approach, in your vector space, are measured with the Euclidean norm, AKA the L2 norm, so called because it is the square root of the sum of the squares of the distances along each perpendicular axis. So your notion of distance fundamentally has squares baked into it. Commented Sep 23 at 1:51
  • Take a look at the Wikipedia page for Coefficient of determination; you'll find a related image in the definition section which shows how the squares are minimized. Commented Sep 23 at 10:35

4 Answers


You write “the best approximate solution $\hat x$”, but the best by what criterion? What you describe is the solution that minimises $\left\|A\hat x-b\right\|$ for the Euclidean norm, and that norm is a choice. Now minimising $\left\|A\hat x-b\right\|$ is the same as minimising $\left\|A\hat x-b\right\|^2=\sum_i (A\hat x-b)_i^2$, and those are the squares to which “least squares” refers. So you are indeed solving a least squares problem. Note that if you see $Ax=b$ as a system of equations, then $(A\hat x-b)_i$ is the error in the $i$-th equation.
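If it helps to see that numerically, here is a minimal sketch (my own arbitrary data, nothing canonical): the quantity the projection construction minimises is exactly $\sum_i (A\hat x-b)_i^2$, and any other $x$ gives a value at least as large.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 2))   # 5 equations in 2 unknowns: generically inconsistent
b = rng.normal(size=5)

# x_hat from the orthogonal-projection / normal-equation construction.
x_hat = np.linalg.solve(A.T @ A, A.T @ b)

def sum_of_squared_errors(x):
    # sum over the equations i of the squared error (A x - b)_i^2, i.e. ||Ax - b||^2
    return np.sum((A @ x - b) ** 2)

# Any other x gives a sum of squared errors at least as large.
for _ in range(5):
    x_other = x_hat + rng.normal(scale=0.1, size=2)
    assert sum_of_squared_errors(x_other) >= sum_of_squared_errors(x_hat)
```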

  • This is the correct answer. Commented Sep 23 at 18:33

I'll focus on this piece:

Or is there a more fundamental reason that the geometric condition of orthogonality is itself an expression of "least squares," independent of the calculus perspective?

Note that for any $\mathbf x$, the error vector $\mathbf e(\mathbf x) = \mathbf b - A \mathbf x$ can always be decomposed into two components: one parallel to (i.e., an element of) the column space of $A$ and the other perpendicular. That is, $$ \mathbf e(\mathbf x) = \mathbf e_{\|}(\mathbf x) + \mathbf e_{\perp}. $$ Note that the parallel component depends on our choice of $\mathbf x$, but the perpendicular component does not. Indeed: for any attainable error vectors $\mathbf e(\mathbf x_1), \mathbf e(\mathbf x_2),$ the difference $\mathbf e(\mathbf x_2) - \mathbf e(\mathbf x_1)$ is always parallel to the column space.

With that, we can see by the Pythagorean theorem (generalized to $n$-dimensional space) that the sum of squares error corresponding to a given $\mathbf x$ is given by $$ \|\mathbf e(\mathbf x)\|^2 = \|\mathbf e_{\|}(\mathbf x)\|^2 + \|\mathbf e_{\perp}\|^2 $$ So, finding $\mathbf x$ such that $\mathbf e(\mathbf x)$ is orthogonal to the column space of $A$ (equivalently, such that $\mathbf e_\| = 0$) is equivalent to minimizing the sum-of-squares error.
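If a numerical illustration helps, here is a small sketch (my own example; I build the projector onto the column space from NumPy's pseudoinverse): for arbitrary $\mathbf x$ the Pythagorean identity holds, and the perpendicular component never changes.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(6, 3))
b = rng.normal(size=6)

# Orthogonal projector onto col(A), built from the pseudoinverse: P = A A^+.
P = A @ np.linalg.pinv(A)

for _ in range(3):
    x = rng.normal(size=3)    # an arbitrary candidate solution
    e = b - A @ x             # error vector e(x)
    e_par = P @ e             # component inside col(A); depends on x
    e_perp = e - e_par        # component orthogonal to col(A)

    # Pythagorean identity: ||e||^2 = ||e_par||^2 + ||e_perp||^2.
    assert np.isclose(e @ e, e_par @ e_par + e_perp @ e_perp)

    # The perpendicular component is the same for every x: it equals b - P b.
    assert np.allclose(e_perp, b - P @ b)
```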


Yes, it is called "least squares" because it solves the problem of minimizing the sum of the squares of the errors. Geometrically, as you say, we are finding a "closest possible vector", and "closest" is defined in terms of the Euclidean norm, which is the square root of the sum of the squares of the coordinates.

According to https://mathshistory.st-andrews.ac.uk/Miller/mathword/m/, the term "Method of least squares" (in French: "la Méthode des moindres quarrés") goes back to Legendre in 1805. This was before the development of the modern machinery of linear algebra. So Legendre and Gauss would not be using vectors and matrices as such.


We want to make the error $\mathbf e=\mathbf b-A\mathbf x$ as close as possible to zero, so we can minimize the squared error $e^2=\mathbf e\cdot\mathbf e$. We can actually solve for this using calculus by setting $\frac{\partial}{\partial \mathbf x}e^2=0$. Working with vector derivatives makes me confused, so I will be using the Einstein summation convention: every time you see two repeated indices there is an implicit sum, and every time you see an unpaired index you are looking at a vector component. For example, $x_ix_i\equiv \sum_{i=1}^n x_ix_i$, and $y_i=2x_i$ means that the $i$-th component of $\mathbf y$ is twice the $i$-th component of $\mathbf x$. Using this notation we get \begin{align} \frac{\partial}{\partial x_i}e^2&=\frac{\partial}{\partial x_i}(b_j-A_{jk}x_k)(b_j-A_{jl}x_l)\\ &=\frac{\partial}{\partial x_i}\left[b_jb_j-2b_jA_{jk}x_k+x_kA_{jk}A_{jl}x_l\right]. \end{align} The $b_jb_j$ term drops out, so let's look at the second and third terms individually. Here $\delta_{ij}$ is the Kronecker delta.

Second term: \begin{align} \frac{\partial}{\partial x_i}\left[-2b_jA_{jk}x_k\right]&=-2b_jA_{jk}\delta_{ik}\\ &=-2b_jA_{ji}\\ &\equiv-2A^T\mathbf b. \end{align}

Third term:

\begin{align} \frac{\partial}{\partial x_i}\left[x_kA_{jk}A_{jl}x_l\right]&=\delta_{ik}A_{jk}A_{jl}x_l+x_kA_{jk}A_{jl}\delta_{il}\\ &=A_{ji}A_{jl}x_l+x_kA_{jk}A_{ji}\\ &=2A_{ji}A_{jl}x_l\\ &\equiv2A^TA\mathbf x. \end{align}

Combined, they give \begin{align} \frac{\partial}{\partial \mathbf x}e^2&=-2A^T\mathbf b+2A^TA\mathbf x\\ &=2A^T(A\mathbf x -\mathbf b)=\mathbf 0, \end{align} which is exactly the normal equation $A^TA\mathbf x=A^T\mathbf b$.

Einstein notation may seem confusing at first, but all the individual steps make total sense. Converting back to vector notation is actually the hard part.
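If you want to sanity-check the index gymnastics, here is a small finite-difference comparison (my own sketch with arbitrary data) of the derived gradient $2A^T(A\mathbf x-\mathbf b)$ against a numerical derivative of $e^2$:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(4, 3))
b = rng.normal(size=4)
x = rng.normal(size=3)

def e2(x):
    # squared error e . e = ||Ax - b||^2
    r = A @ x - b
    return r @ r

# Gradient derived above via index notation.
grad_analytic = 2 * A.T @ (A @ x - b)

# Central finite differences, one component at a time.
h = 1e-6
grad_numeric = np.zeros(3)
for i in range(3):
    dx = np.zeros(3)
    dx[i] = h
    grad_numeric[i] = (e2(x + dx) - e2(x - dx)) / (2 * h)

assert np.allclose(grad_analytic, grad_numeric, atol=1e-4)
```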

