3.1 Summary

Course subject(s) 3. Least Squares Estimation (LSE)

Least squares estimation

Dealing with inconsistent models

Given a set of observations which contain noise, and a model that is assumed to explain these data, the goal is to estimate the unknown parameters of that model. The least squares principle can be used to solve this task. 


Mathematically, we use the ‘system of observation equations’, with the vector of observations \(y\), the vector of unknowns \(x\), and the design matrix \(A\). In a simple form, we can write this system of observation equations as

\[y = Ax\]

If all the observations fit perfectly to this model, the system is called consistent. In practice this is only the case if the number of observations is equal to (or smaller than) the number of unknowns. 


If the number of observations is greater than the number of unknowns (and the design matrix \(A\) is of full column rank), it is very unlikely that the system is consistent: physical reality would need to be perfectly described by the conceptual mathematical model. In real life this is never the case, since (i) our observations are always contaminated by some form of noise, and (ii) physical reality is often more complex than can be described by a simplified mathematical model. 


Therefore, the mathematical model of choice is usually inconsistent. This means that the mathematical model is not a perfect description of physical reality, either because (i) the physical measurements are affected by noise that is not included in the model, or because (ii) the true relation between observations and unknowns is more complex than the model describes.


Thus, in the case in which there are more observations than unknowns (and the design matrix \(A\) is of full column rank), the equation \(y=Ax\) has no solution. In other words, every 'solution' would be wrong, since the model would not 'fit' the data.


We solve the problem of having an ‘unsolvable’ equation by considering errors for each observation:

\[y=Ax + e\]

The length of the error vector (or vector of residuals) \(e\) is equal to the length of the vector of observations \(y\).
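
As a small illustration, consider fitting a straight line through \(m\) observations \(y_i\) taken at known times \(t_i\) (the symbols \(t_i\), \(x_0\) and \(x_1\) are used only for this example):

\[y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{bmatrix}, \qquad A = \begin{bmatrix} 1 & t_1 \\ 1 & t_2 \\ \vdots & \vdots \\ 1 & t_m \end{bmatrix}, \qquad x = \begin{bmatrix} x_0 \\ x_1 \end{bmatrix}\]

so that each individual observation equation reads \(y_i = x_0 + x_1 t_i + e_i\).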


Least squares principle

We are looking for a solution for \(x\); this solution will be denoted by \(\hat{x}\). Based on this solution, the 'adjusted observations' would be \(\hat{y}= A\hat{x}\) (solution of the forward model).


To find a solution for an inconsistent system of equations, we prefer the one for which the observations are as ‘close as possible’ to the model. This is a ‘minimum distance’ objective. In mathematical terms, we look at the vector of residuals \(\hat{e}\): the difference between the observations \(y\) and the model realization \(\hat{y}\):

\[\hat{e}=y-\hat{y}=y-A\hat{x}\]

We want this vector of residuals to be as small as possible (i.e., the minimum distance objective). The ‘least squares’ principle achieves this by minimizing (‘least’) the sum of the squared residuals (‘squares’). 


If we take the square root of this sum, we obtain the length of the vector, also known as the ‘norm’ of the vector. Thus, an error vector with \(m\) entries has a norm defined as:

\[\left \| e \right \| = \sqrt{e_1^2+e_2^2+...+e_m^2}=\sqrt{e^Te}\]
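
As a minimal sketch in Python (assuming NumPy; the residual values below are arbitrary), the norm follows directly from this definition:

```python
import numpy as np

e = np.array([0.3, -0.1, 0.2])          # arbitrary example residual vector

norm_from_definition = np.sqrt(e @ e)   # sqrt(e^T e)
norm_from_numpy = np.linalg.norm(e)     # same quantity via NumPy

print(norm_from_definition, norm_from_numpy)  # both ~0.3742
```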

Finding the minimum of (a) the sum of the squared differences, or (b) the square root of the sum of the squared differences, yields the same result, since the square root is a monotonically increasing function of its (non-negative) argument.


If we find this minimum, the corresponding solution \(\hat{x}\) is then the least squares solution of the system of observation equations. In mathematical terms this is written as

\[\hat{x}_{\text{LS}} = \arg \underset{x}{\min} \left \| e \right \|^2= \arg \underset{x}{\min} {e^Te}\]

From the system of observation equations \(y=Ax+e\), it follows directly that \(e=y-Ax\), and therefore the least squares solution follows as:

\[\hat{x}_{\text{LS}} =\arg \underset{x}{\min} {(y-Ax)^T (y-Ax)}.\]

 In other words, we find \(\hat{x}_{\text{LS}}\) by finding the minimum of \((y-Ax)^T (y-Ax)\).


Least squares solution

We can find the minimum of a function by taking its first derivative with respect to the 'free variable'. Since the observation vector \(y\) is given and the design matrix \(A\) is fixed, the only variable we can vary is \(x\). The first derivative of the objective function should be equal to zero to reach a minimum:

\[\begin{align}  \partial_x (y-Ax)^T (y-Ax) &=0\\  \partial_x ( y^Ty -(Ax)^T y -y^T Ax + (Ax)^T(Ax) )&=0\\ \partial_x ( y^Ty -x^TA^T y -y^T Ax + x^T A^TAx )&=0\\ \partial_x (y^Ty -2x^TA^T y + x^T A^TAx) &=0\\ -2A^T y +  2A^TAx &=0\\ A^TAx &=A^T y \end{align}\]

This last equation is known as the normal equation, and the matrix \(N=A^T A\) is known as the normal matrix. Since the second derivative of the objective function, \(2A^TA\), is positive definite when \(A\) has full column rank, this stationary point is indeed a minimum.
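
A minimal numerical sketch (assuming NumPy, and reusing the straight-line example above with made-up numbers) of forming and solving the normal equation:

```python
import numpy as np

# Made-up observation times and observations for a straight-line fit (illustrative only)
t = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 1.9, 3.2, 3.8])

# Design matrix A for the model y_i = x0 + x1 * t_i + e_i
A = np.column_stack([np.ones_like(t), t])

# Normal matrix N = A^T A and right-hand side A^T y
N = A.T @ A
rhs = A.T @ y

# Solve the normal equation N x = A^T y (avoids forming the explicit inverse)
x_hat = np.linalg.solve(N, rhs)
print(x_hat)   # approximately [1.09, 0.94]
```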


If the normal matrix is invertible (which is the case when \(A\) has full column rank), the normal equation can be rewritten to express the estimate of \(x\), denoted by \(\hat{x}\), as a linear function of the observations: 

\[\hat{x}_{\text{LS}}= (A^T A)^{-1} A^T y\]

This last equation shows the standard 'least squares' form used to estimate unknown parameters from (i) a vector of observations, and (ii) a design matrix \(A\).


Overview

In summary, the least squares estimates of \(x\), \(y\) and \(e\) are given by:

\[\hat{x}= (A^T A)^{-1} A^T y\]

\[\hat{y} = A \hat{x}\]

\[\hat{e} = y - A\hat{x} = y - \hat{y}\]

Here we omitted the subscript \(\text{LS}\) for notational convenience.
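
A short sketch tying the three expressions together (same made-up straight-line data as above; np.linalg.lstsq is shown as a numerically preferable alternative to the explicit inverse):

```python
import numpy as np

# Same hypothetical straight-line data as before
t = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 1.9, 3.2, 3.8])
A = np.column_stack([np.ones_like(t), t])

# Least squares estimates following the summary formulas
x_hat = np.linalg.inv(A.T @ A) @ A.T @ y   # x-hat = (A^T A)^{-1} A^T y
y_hat = A @ x_hat                          # adjusted observations
e_hat = y - y_hat                          # residuals

# In practice, np.linalg.lstsq is preferred over forming the explicit inverse
x_hat_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)

print(x_hat)                             # approximately [1.09, 0.94]
print(A.T @ e_hat)                       # ~0: residuals orthogonal to the columns of A
print(np.allclose(x_hat, x_hat_lstsq))   # True
```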


Note that in this module we do not yet consider the fact that the observations and errors are stochastic variables.

Observation Theory: Estimating the Unknown by TU Delft OpenCourseWare is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Based on a work at https://ocw.tudelft.nl/courses/observation-theory-estimating-unknown.