# Least squares and regularization

*This post is part 1 of a 3 part series: Part I, Part II, Part III.*

Imagine that we have a bunch of points that we want to fit a line to:

In the plot above, $y_i = a\cdot x_i + b + n_i$ where $n_i$ is random noise. We know every point $(x_i, y_i)$ and would like to estimate $a$ and $b$ in the presence of noise.

What method can we use to find an estimation for $a$ and $b$? What constraints
does this method have and when does it fail? This is a classic problem in
signal processing – we are given some noisy observations and would like to
determine *how* the data was generated.

## Least squares

After we express this problem in a matrix multiplication language, the solution becomes more straightforward. We abstract some of the details away and focus on larger concepts. In doing this I will denote scalars with plain math font, vectors with lower-case bold math font and matrices with upper-case bold math font.

To represent this with matrix multiplication, we’ll stack our observations $y_i$ into a vector $\SSy$. When we define $\SSA$ and $\SSw$ (our data and weights respectively) in certain ways, we can say that $\SSy = \SSA\cdot\SSw + \SSn$ when $\SSn$ is a vector of noise. In particular, we’ll define $\SSA$ so that each row is $[x_i,~1]$ and our weights $\SSw$ as $[a, b]^T$.

We are collecting more data than we have unknowns and are fitting a linear line
which means that this problem is overdetermined – we have more measurements
than unknowns. Our only unknowns are $a$ and $b$ and we have many
measurements.^{1}

Our approximation of $\SSw$ will be called $\SSwh$. The key behind finding this approximation is realizing the objective. Should we minimize the error? Should we say that large errors should be penalized more than small errors?

We want to minimize the error, or the distance between our observations and the value that our approximation would predict. That is, we want to minimize the distance between $\SSy$ and $\SSA \cdot \SSwh$ and penalize large errors more. We can represent this a concise mathematical equation:

This says that our estimate $\SSwh$ will be the *arg*ument that *min*imizes the
standard error term. After writing this equation down and knowing what to
Google, we can find a closed form solution with the help of the Wikipedia page for least squares!

With this equation we find that $\SSwh = [\widehat{a},~\widehat{b}]^T \approx [0.89,~0.03]$, and not that far away from the true solution $\SSx = [1,~0]^T$.

I won’t cover on how to *solve* this equation, but know that matrix inversion
is often expensive time- and memory-wise. Instead, we often rely on methods
like gradient descent to find a solution to this equation.

This method works and is (and should!) be widely used in similar use cases, but it’s also old. What other solutions are out there and where does it fail?

## Noisy blurry signals

Let’s say we are given noisy observations of a blurry signal. How can we reconstruct the values of the original signal from our observations? Similar situations happen all the time in the real world. We see averaging input data together all the time.

Let’s say that we make $n$ observations of a signal $\SSx$ and put the observations into a vector $\SSy$. We know that the averaging of $k$ values can be represented by a matrix $\SSH_k$ because blurring is a linear transformation. When we represent noise with standard deviation $\sigma$ as $\SSn_\sigma$,

The naive approach would be to use the inverse transform with the the inverse filter, which is possible because $\SSH_k$ is a square matrix. It would seem natural just to reverse the operation by making our estimate $\SSxh$ by

But the noise ruins this approximation and similar approximation (such as the least squares approximation). In general, $\SSH^{-1}$ is adding or subtracting many noisy values. Even worse, it is common for $\SSH^{-1}$ to be ill-conditioned and include large values. This means that our estimate $\SSxh$ quickly gets a lot of error as (typically) many noisy values are multiplied by large values and then added.

Of course, in the noise free case this method works *perfectly.* But in noisy
cases, the estimate does not perform well… and I’ve made an interactive widget
to help explore that fact!