Least squares and regularization
This post is part 1 of a 3 part series: Part I, Part II, Part III.
Imagine that we have a bunch of points that we want to fit a line to:
In the plot above, $y_i = a\cdot x_i + b + n_i$ where $n_i$ is random noise. We know every point $(x_i, y_i)$ and would like to estimate $a$ and $b$ in the presence of noise.
What method can we use to find an estimation for $a$ and $b$? What constraints does this method have and when does it fail? This is a classic problem in signal processing – we are given some noisy observations and would like to determine how the data was generated.
Least squares
Once we express this problem in the language of matrix multiplication, the solution becomes more straightforward: we abstract some of the details away and focus on larger concepts. In doing this, I will denote scalars with plain math font, vectors with lowercase bold math font, and matrices with uppercase bold math font.
To represent this with matrix multiplication, we’ll stack our observations $y_i$ into a vector $\yb$. When we define $\Ab$ and $\wb$ (our data and weights respectively) in certain ways, we can say that $\yb = \Ab\cdot\wb + \nb$ where $\nb$ is a vector of noise. In particular, we’ll define $\Ab$ so that each row is $[x_i,~1]$ and our weights $\wb$ as $[a, b]^T$.
We are fitting a line with only two unknowns, $a$ and $b$, to many measurements, which means that this problem is overdetermined – we have more measurements than unknowns.^{1}
Our approximation of $\wb$ will be called $\est{\wb}$. The key to finding this approximation is choosing the right objective. Should we minimize the error? Should large errors be penalized more than small errors?
We want to minimize the error, or the distance between our observations and the values our approximation would predict. That is, we want to minimize the distance between $\yb$ and $\Ab \cdot \est{\wb}$, penalizing large errors more heavily. We can represent this with a concise mathematical equation:

$$\est{\wb} = \arg\min_{\wb} \left\|\yb - \Ab\wb\right\|_2^2$$
This says that our estimate $\est{\wb}$ will be the argument that minimizes the squared error term. After writing this equation down and knowing what to Google, we can find a closed form solution with the help of the Wikipedia page for least squares:

$$\est{\wb} = (\Ab^T \Ab)^{-1} \Ab^T \yb$$
With this equation we find that $\est{\wb} = [\widehat{a},~\widehat{b}]^T \approx [0.89,~0.03]$, which is not far from the true solution $\wb = [1,~0]^T$.
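As a concrete sketch, here is this fit in Python with NumPy. The data below is synthetic with true weights $a = 1$, $b = 0$ as above; the number of points and the noise level are illustrative assumptions, not the values behind the plot.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y_i = a*x_i + b + n_i with true weights a=1, b=0
a_true, b_true = 1.0, 0.0
x = np.linspace(0, 1, 50)
y = a_true * x + b_true + 0.1 * rng.standard_normal(x.size)

# Each row of A is [x_i, 1]; the weight vector is w = [a, b]^T
A = np.column_stack([x, np.ones_like(x)])

# Closed-form least squares solution: w = (A^T A)^{-1} A^T y
w_hat = np.linalg.inv(A.T @ A) @ A.T @ y

# np.linalg.lstsq computes the same estimate and is numerically preferred
w_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)

print(w_hat)  # estimates of [a, b], close to [1, 0]
```

Both lines give the same answer here; `lstsq` avoids explicitly inverting $\Ab^T\Ab$, which matters when the matrix is poorly conditioned.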
I won’t cover how to solve this equation, but know that matrix inversion is often expensive in both time and memory. Instead, we often rely on methods like gradient descent to find a solution.
This method works and is (and should be!) widely used in similar use cases, but it’s also old. What other solutions are out there, and where does it fail?
Noisy blurry signals
Let’s say we are given noisy observations of a blurry signal. How can we reconstruct the values of the original signal from our observations? Similar situations happen all the time in the real world – input data is averaged together all the time.
Let’s say that we make $n$ observations of a signal $\xb$ and put the observations into a vector $\yb$. Because blurring is a linear transformation, the averaging of $k$ values can be represented by a matrix $\Hb_k$. When we represent noise with standard deviation $\sigma$ as $\nb_\sigma$,

$$\yb = \Hb_k \xb + \nb_\sigma$$
The naive approach would be to reverse the operation with the inverse filter, which is possible because $\Hb_k$ is a square matrix. It would seem natural to form our estimate $\est{\xb}$ by

$$\est{\xb} = \Hb_k^{-1} \yb$$
But the noise ruins this approximation and similar approximations (such as the least squares approximation). In general, $\Hb^{-1}$ adds or subtracts many noisy values. Even worse, it is common for $\Hb^{-1}$ to be ill-conditioned and include large values. This means that our estimate $\est{\xb}$ quickly accumulates error as (typically) many noisy values are multiplied by large values and then added together.
Of course, in the noise-free case this method works perfectly. But in noisy cases, the estimate does not perform well… and I’ve made an interactive widget to help explore that fact!
We can see that in the noise-free case, this works perfectly, which is to be expected: we know the output $\yb$, and we know how the signal became blurred because we also know $\Hb$.
However, once we add any noise, our estimate is way off. This occurs because with noise, our approximation becomes $\est{\xb} = \xb + \Hb^{-1} \nb$. The term $\Hb^{-1}\nb$ adds up $k$ noisy values when we average over $k$ pixels. This means our noise grows and the approximation gets worse and worse. Note that the noise doesn’t grow in expectation, but its standard deviation/variance grows, meaning that we’re more likely to see large values.
We know our error comes from $\Hb^{-1}\nb$. When we have a moving average filter over $k$ values, this adds up $k$ noisy values. We would expect the value of our noise to remain constant because expectation is linear. However, the variance of the sum of our independent noisy values is

$$\textrm{Var}\left[\sum_{i=1}^{k} n_i\right] = \sum_{i=1}^{k} \textrm{Var}\left[n_i\right] = k\sigma^2$$
Meaning that even a single estimate $\widehat{x}_i$ behaves poorly – there’s nothing limiting how large it can grow!
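We can see this failure numerically. The sketch below assumes a causal moving-average blur and an arbitrary sinusoidal test signal (the widget’s exact setup may differ) and inverts the blur with and without noise:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, sigma = 100, 5, 0.05  # illustrative sizes, not the widget's values

# An arbitrary smooth "ground truth" signal
x = np.sin(2 * np.pi * np.arange(n) / n)

# H_k: causal moving average over the previous k samples.
# Lower triangular with 1/k on the diagonal, so it is invertible.
H = np.tril(np.triu(np.ones((n, n)), -k + 1)) / k

y_clean = H @ x
y_noisy = y_clean + sigma * rng.standard_normal(n)

x_hat_clean = np.linalg.solve(H, y_clean)  # perfect reconstruction
x_hat_noisy = np.linalg.solve(H, y_noisy)  # H^{-1} amplifies the noise

print(np.linalg.norm(x_hat_clean - x))  # essentially zero
print(np.linalg.norm(x_hat_noisy - x))  # far larger than the input noise
```

The noise-free inversion recovers $\xb$ exactly, while the noisy inversion produces an error much larger than the noise we put in.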
How can we do better?
We might know something about our signal – why don’t we use that information? This is the basis behind a prior probability. That’s a fancy term, but it just says that we know something about our signal beforehand. For example, I know the weather will not drastically change throughout the day, so I will not bring several changes of clothes, a winter coat, an umbrella and sunscreen with me.
With the above blurred signal, what facts do we know? We know that our signal does not grow to infinity, unlike the reconstruction we saw when we increased the noise in our observation. So, let’s use this information to our advantage and say that our reconstruction should not grow too large! To do this, we’ll use regularization.
Regularization
Oftentimes we want to balance two objectives but can’t have both. For example, we might want to make a car fast, gas efficient, safe, and cheap, but that’s not possible – we can only choose a few of those. Regularization in mathematics plays the same game. In the case above, we want to balance the error and the energy of the reconstructed signal. We can’t minimize both, so we’ll have to find a balance between the two.
Quoting the Wikipedia page on regularization,
A theoretical justification for regularization is that it attempts to impose Occam’s razor on the solution. From a Bayesian point of view, many regularization techniques correspond to imposing certain prior distributions on model parameters.
In the example above, we want to say that the energy of our estimate shouldn’t grow too large and that the error should be relatively small. We are balancing the error and the norm (or energy), and we can represent this balance with an addition. If $\lambda > 0$ is some constant that balances the error against the energy,

$$\est{\xb} = \arg\min_{\xb} \left\|\yb - \Hb\xb\right\|_2^2 + \lambda \left\|\xb\right\|_2^2$$
which minimizes both the error between what we observe and what our estimate predicts, and the energy in our estimate, $\sum_i x_i^2$. When $\lambda \approx 0$, this solution is the minimum norm least squares solution. When $\lambda \gg 1$, the solution converges to $x_i = 0$ for all $i$.
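This regularized problem has a closed form, $(\Hb^T\Hb + \lambda \mathbf{I})^{-1}\Hb^T\yb$ (the Tikhonov/ridge solution). The sketch below compares it against the naive inverse filter, using the same assumed moving-average blur and test signal as before; the value of $\lambda$ is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, sigma, lam = 100, 5, 0.05, 0.1  # illustrative values

x = np.sin(2 * np.pi * np.arange(n) / n)           # arbitrary test signal
H = np.tril(np.triu(np.ones((n, n)), -k + 1)) / k  # causal moving average
y = H @ x + sigma * rng.standard_normal(n)

# Naive inverse filter vs. the regularized (Tikhonov/ridge) solution:
# x_hat = argmin ||y - Hx||^2 + lam*||x||^2 = (H^T H + lam*I)^{-1} H^T y
x_naive = np.linalg.solve(H, y)
x_ridge = np.linalg.solve(H.T @ H + lam * np.eye(n), H.T @ y)

print(np.linalg.norm(x_naive - x))  # large: noise amplified
print(np.linalg.norm(x_ridge - x))  # much smaller
```

The $\lambda \mathbf{I}$ term keeps the system well conditioned, so the noise is no longer multiplied by the huge values hiding in $\Hb^{-1}$.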
This results in a better approximation, one that is much more tolerant to noise. To see this, I have repeated the same example above. Here, we’ll graph the “ground truth” $\xb$ and our estimate $\est{\xb}$. Our noisy observations $\yb$ are the noise-free observations plus noise of standard deviation $\sigma$. That is,

$$\yb = \Hb_k \xb + \nb_\sigma$$
And here’s another interactive widget!
This is a much more noise-tolerant estimate. Why?
The regularization term makes sure that each value in our estimate doesn’t grow too large: it restricts the estimate $\est{\xb}$ from growing without bound. Increasing the noise increases its variance, and regularization helps limit the effect of that variance.
Another way to look at this is that we have two objectives: minimize the error and minimize the norm of the estimate $\est{\xb}$. These two minima occur at different points, and we can’t achieve both at the same time.
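One way to see this tradeoff is to sweep $\lambda$ and watch the two terms move in opposite directions (same assumed blur and test signal as in the sketches above; the particular $\lambda$ values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, sigma = 100, 5, 0.05  # illustrative values

x = np.sin(2 * np.pi * np.arange(n) / n)           # arbitrary test signal
H = np.tril(np.triu(np.ones((n, n)), -k + 1)) / k  # causal moving average
y = H @ x + sigma * rng.standard_normal(n)

fits, energies = [], []
for lam in [1e-3, 1e-1, 1e1]:
    x_hat = np.linalg.solve(H.T @ H + lam * np.eye(n), H.T @ y)
    fits.append(np.linalg.norm(y - H @ x_hat))  # error term
    energies.append(np.linalg.norm(x_hat))      # energy term
    print(lam, fits[-1], energies[-1])
```

As $\lambda$ grows, the energy term shrinks while the error term grows – we trade one objective for the other.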
Of course, we can use different regularization terms. What if we know that only a small number of coefficients are nonzero? What if our signal has a sparse representation in some transform? This is discussed in Part II of this series!

Underdetermined systems pose many challenges because they have infinitely many solutions; they are the focus of the rest of this series ↩