This post is a part 2 of a 3 part series: Part I, Part II, Part III

We often have fewer measurements than unknowns, which happens all the time in genomics and medical imaging. For example, we might be collecting 8,000 gene measurements in 300 patients and we’d like to determine which ones are most important in determining cancer.

This means that we typically have an underdetermined system because we’re collecting more measurement than unknowns. This is an unfavorable situation – there are infinitely may solutions to this problem. However, in the case of breast cancer, biological intuition might tell us that most of the 8,000 genes aren’t important and have zero important in cancer expression. \newcommand{\SSy}{\mathbf{y}} \newcommand{\SSX}{\mathbf{X}} \newcommand{\SSw}{\mathbf{w}} \newcommand{\SSb}{\mathbf{b}} \newcommand{\SSx}{\mathbf{x}} \newcommand{\SSxh}{\widehat{\SSx}} \newcommand{\SSA}{\mathbf{A}} \newcommand{\SSmatrix}[1]{\begin{bmatrix} #1 \end{bmatrix}} \newcommand{\SSalign}[1]{\begin{aligned} #1 \end{aligned}} \newcommand{\SSnorm}[1]{\left|\left| #1 \right|\right|} \newcommand{\SSabs}[1]{\left| #1 \right|} \newcommand{\pder}[2]{\frac{\partial#1}{\partial#2}}

How do we enforce that most of the variables are 0? This post will try and give intuition for the problem formulation and dig into the algorithm to solve the posed problem. I’ll use a real-world cancer dataset1 to predict which genes are important for cancer expression. It should be noted that we’re more concerned with the type of solution we obtain rather than how well it performs.

This post will build of Part I of this series. In this post, we will only change the type of solution we obtain by changing the regularization parameter. In Part I, we saw that we could get an acceptable solution by introducing a regularization parameter, $\norm{\xb}_2^2$. In this post, we’ll examine changing that to $\norm{\xb}_1$.

Before getting started, we’ll have to use norms, as they provide a nice syntax for working with vectors and matrices. We define the $\ell_p$ norms as

which means that $\ell_2$ norm of $\SSx$ is $\norm{\xb}_2 := \sqrt{\sum_i x_i^2}$ (meaning $\norm{\xb}_2^2 = \sum_i x_i^2$) and $\ell_1$ norm of $\SSx$ is $\norm{\xb}_1 := \sum_i \abs{x_i}$. This definition doesn’t define $\ell_0$ norms, but we define it to be the number of non-zero terms.

## LASSO problem formulation

Through external information, we know that most of our solution is 0. Therefore, we want to limit the number of non-zeros in our solution. We can enforce adding a penalty for the number of non-zeros.

That is, after observing $\SSy$ and $\SSX$, we’re trying to find a solution $\widehat{\SSw}$ that minimizes the squared error and has a small number of zeros. We can do this by adding a penalty on the number of non-zeros:

but this problem is exceptionally hard to solve because this is non-convex and NP-hard. The best algorithms to solve this by allowing $k$ variables to be non-zero and runs through all $2^n$ ways to do that. This takes exponential time… how can we find a more efficient method?

The closest convex relaxation of the $\ell_0$ norm is the $\ell_1$ norm. In our new problem with the $\ell_1$ norm as defined above, we can make the $\ell_1$ norm small by making many of the terms zero. This means we’re trying to solve

The type of regularization matters characterizes the signal we get as output. In this problem formulation, we are including the term $\norm{\wb}_1$ or the $\ell_1$ regularization parameter. This gives a much different result than the $\ell_2$ regularization parameter $\norm{\wb}_2^2$. To help see this, I have developed an interactive widget that highlights the difference between $\ell_1$ and $\ell_2$ regularization!

$\lambda =$ 0.05

The type of regularization parameter we use matters a ton – there’s a huge difference in the output when using $\norm{\wb}^2_2$ instead of $\norm{\wb}_1$.

Why? We can think about the physical interpretations. If trying to optimize for the engine to buy and normalizing for appropriate units, we might use the

• $\ell_2$ norm if we’re trying to use as little gas as possible. This corresponds to using as little energy as possible. This makes sense, because energy typically comes in squares (i.e., kinetic energy is $\frac{1}{2}mv^2$. See Table 3 at tauday.com for more examples).
• $\ell_0$ or $\ell_1$ norm if we want to run the engine as little as possible. We don’t care about how much gas we use, just how long it’s running. Because the engine is off most of the time, this corresponds to a sparse solution.
• $\ell_\infty$ norm, or the maximum element in a vector. This would correspond to being limited to how much power we can use. We can have the engine on as long as we want and use as much gas as we want. For example, the state could regulate that cars have to be less powerful than some limit.

To help provide more intuition, I have provided a 2D example using an interactive widget. In this 2D example, we can think that $\SSX = [c_1, c_2]$ and $y$ as a scalar. We’re trying to find $\wb = [w_1, w_2]^T \in \R^2$, but we only have one measurement; there are infinitely many solutions.

• All possible solutions to $y = c_1 w_1 + c_2 w_2$ are graphed by the purple line. We know $y$ and are trying to estimate $w_1$ and $w_2$ from our knowledge of $c_1$ and $c_2$.

• The $\ell_1$ solution vector and the the $\ell_2$ solution vector are in blue and green. The norm balls for the solution vectors are also graphed.

When we change $c_1$, we see that the solution tends to minimize the distance between the norm ball and the line of all possible solutions. We can see when we increase $\lambda$, our estimate gets smaller. This makes sense because we are placing more emphasis on this value, and it reaches it’s optimum at the origin.