This post is a part 2 of a 3 part series: Part I, Part II, Part III

We often have fewer measurements than unknowns, which happens all the time in genomics and medical imaging. For example, we might collect 8,000 gene measurements from 300 patients and want to determine which of those genes are most important in determining cancer.

This means that we typically have an underdetermined system because we have more unknowns than measurements. This is an unfavorable situation – there are infinitely many solutions to this problem. However, in the case of breast cancer, biological intuition might tell us that most of the 8,000 genes aren’t important and have zero influence on cancer expression.

How do we enforce that most of the variables are 0? This post will try to give intuition for the problem formulation and dig into the algorithm that solves the posed problem. I’ll use a real-world cancer dataset1 to predict which genes are important for cancer expression. It should be noted that we’re more concerned with the type of solution we obtain than with how well it performs.

This post builds on Part I of this series. In Part I, we saw that we could get an acceptable solution by introducing a regularization parameter, $\norm{\xb}_2^2$. In this post, the only change is the type of regularization: we’ll examine what happens when we swap that term for $\norm{\xb}_1$.

Before getting started, we’ll have to use norms, as they provide a nice syntax for working with vectors and matrices. We define the $\ell_p$ norm as

$$ \norm{\xb}_p = \parens{\sum_i \abs{x_i}^p}^{1/p} $$

which means that the $\ell_2$ norm of $\xb$ is $\norm{\xb}_2 := \sqrt{\sum_i x_i^2}$ (meaning $\norm{\xb}_2^2 = \sum_i x_i^2$) and the $\ell_1$ norm of $\xb$ is $\norm{\xb}_1 := \sum_i \abs{x_i}$. This definition doesn’t cover the $\ell_0$ “norm” (which isn’t a true norm), but we define it to be the number of non-zero terms.
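As a quick numerical companion (a sketch of my own using NumPy), these norms are easy to compute: `np.linalg.norm` handles the $\ell_p$ norms directly, and the $\ell_0$ “norm” is just a count of non-zero entries.

```python
import numpy as np

x = np.array([3.0, 0.0, -4.0])

ell2 = np.linalg.norm(x, ord=2)          # sqrt(3^2 + 0^2 + (-4)^2) = 5
ell1 = np.linalg.norm(x, ord=1)          # |3| + |0| + |-4| = 7
ell0 = np.count_nonzero(x)               # number of non-zero entries = 2
ell_inf = np.linalg.norm(x, ord=np.inf)  # largest absolute entry = 4

print(ell2, ell1, ell0, ell_inf)  # 5.0 7.0 2 4.0
```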

LASSO problem formulation

Through external information, we know that most of our solution is 0. Therefore, we want to limit the number of non-zeros in our solution. We can enforce this by adding a penalty on the number of non-zeros.

That is, after observing $\yb$ and $\Xb$, we’re trying to find a solution $\widehat{\wb}$ that minimizes the squared error and has a small number of non-zeros. We can do this by adding a penalty on the number of non-zeros:

$$ \widehat{\wb} = \arg \min_\wb \norm{\yb - \Xb\wb}_2^2 + \lambda\sum_i 1_{w_i \not= 0} $$

but this problem is exceptionally hard to solve because it is non-convex and NP-hard. The best known algorithms allow $k$ variables to be non-zero and search over all the ways to choose them – up to $2^n$ subsets in total. This takes exponential time… how can we find a more efficient method?
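To make the exponential cost concrete, here is a toy brute-force solver (my own illustration, not an algorithm used later in this post): it tries every possible support set of size at most $k$ and solves least squares restricted to those columns.

```python
import itertools
import numpy as np

def best_subset(X, y, k):
    """Exhaustively search all supports of size <= k -- exponential in n."""
    n = X.shape[1]
    best_err, best_w = np.inf, None
    for size in range(k + 1):
        for support in itertools.combinations(range(n), size):
            w = np.zeros(n)
            if support:
                cols = X[:, list(support)]
                # least squares restricted to the chosen columns
                w[list(support)], *_ = np.linalg.lstsq(cols, y, rcond=None)
            err = np.linalg.norm(y - X @ w)**2
            if err < best_err:
                best_err, best_w = err, w
    return best_w
```

Even for the 8,000-gene problem with only $k = 10$ genes allowed, this loop would visit on the order of $10^{32}$ supports – hence the need for a convex surrogate.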

The closest convex relaxation of the $\ell_0$ norm is the $\ell_1$ norm. Replacing the $\ell_0$ penalty with the $\ell_1$ norm as defined above still encourages zeros, because making many terms zero keeps the $\ell_1$ norm small. This means we’re trying to solve

$$ \widehat{\wb} = \arg \min_\wb \norm{\yb - \Xb\wb}_2^2 + \lambda\norm{\wb}_1 $$

The type of regularization characterizes the signal we get as output. In this problem formulation, we are including the term $\norm{\wb}_1$, the $\ell_1$ regularization parameter. This gives a much different result than the $\ell_2$ regularization parameter $\norm{\wb}_2^2$. To help see this, I have developed an interactive widget that highlights the difference between $\ell_1$ and $\ell_2$ regularization!

The type of regularization parameter we use matters a ton – there’s a huge difference in the output when using $\norm{\wb}^2_2$ instead of $\norm{\wb}_1$.

Why? We can think about the physical interpretations. If we’re trying to decide which engine to buy, after normalizing for appropriate units, we might use the

  • $\ell_2$ norm if we’re trying to use as little gas as possible. This corresponds to using as little energy as possible. This makes sense, because energy typically comes in squares (i.e., kinetic energy is $\frac{1}{2}mv^2$; see Table 3 for more examples).
  • $\ell_0$ or $\ell_1$ norm if we want to run the engine as little as possible. We don’t care about how much gas we use, just how long it’s running. Because the engine is off most of the time, this corresponds to a sparse solution.
  • $\ell_\infty$ norm, or the maximum element in a vector. This would correspond to being limited to how much power we can use. We can have the engine on as long as we want and use as much gas as we want. For example, the state could regulate that cars have to be less powerful than some limit.

To help provide more intuition, I have provided a 2D example using an interactive widget. In this 2D example, we can think of $\Xb = [c_1, c_2]$ as a single row and $y$ as a scalar. We’re trying to find $\wb = [w_1, w_2]^T \in \R^2$, but we only have one measurement; there are infinitely many solutions.

  • All possible solutions to $y = c_1 w_1 + c_2 w_2$ are graphed by the purple line. We know $y$ and are trying to estimate $w_1$ and $w_2$ from our knowledge of $c_1$ and $c_2$.

  • The $\ell_1$ solution vector and the $\ell_2$ solution vector are in blue and green. The norm balls for the solution vectors are also graphed.

When we change $c_1$, we see that the solution tends to minimize the distance between the norm ball and the line of all possible solutions. We can also see that when we increase $\lambda$, our estimate gets smaller. This makes sense because we are placing more emphasis on the regularization term, which reaches its optimum at the origin.

We can think of this optimization for $\ell_2$ and $\ell_1$ as minimizing the distance between the norm balls and the line of all possible solutions. We see that the $\ell_1$ norm tends to give solutions with more zeros in them such as $(1, 0)$ or $(0, 1)$. The $\ell_2$ solution gives more non-zero values off the axis which means, by definition, the $\ell_1$ solution is more sparse than the $\ell_2$ solution.
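We can check this intuition numerically in the single-measurement setting $y = c_1 w_1 + c_2 w_2$. This is a small sketch of my own: the minimum-$\ell_2$-norm solution has the closed form $y \cdot \cb / \norm{\cb}_2^2$ (the pseudoinverse solution), while the minimum-$\ell_1$-norm solution puts all the weight on the coordinate with the largest $\abs{c_i}$, landing on an axis.

```python
import numpy as np

c = np.array([1.0, 3.0])   # one measurement: y = c1*w1 + c2*w2
y = 6.0

# minimum ell_2 norm solution (the pseudoinverse solution): dense
w_l2 = y * c / np.dot(c, c)

# minimum ell_1 norm solution: all weight on the largest |c_i| -> sparse
w_l1 = np.zeros(2)
i = np.argmax(np.abs(c))
w_l1[i] = y / c[i]

print(w_l2)  # [0.6 1.8] -- both coordinates non-zero
print(w_l1)  # [0. 2.]   -- lands on an axis
```

Both vectors satisfy the measurement exactly, but only the $\ell_1$ solution is sparse.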

Now that we know what tool to use, we can start to tackle this cancer problem!

Predicting breast cancer

In a class this semester, Rob Nowak and Laurent Lessard introduced a breast cancer dataset described in a New England Journal of Medicine article.2 This dataset measures gene expression levels in cancerous and healthy patients (295 patients in total, roughly 8,000 genes). We’d like to design a predictor for breast cancer based on levels of gene expression.

In this dataset, we observe whether someone has cancer, indicated by $\yb$, with each element being $\pm 1$ to indicate the presence of cancer. For these 295 patients, the dataset also provides the expression levels of 8,000 genes, collected in $\Xb$. The $i$th row of $\Xb$ corresponds to $y_i$ – it contains the gene expression levels for patient $i$.

In this problem, we’d like to determine how important each gene is for cancer. We will assign a weight to each gene, and a weight of 0 means it’s not at all important. We will assume that the underlying model takes the sign of our prediction, or $\yb = \sign{\left(\Xb\wb\right)}$ where $\sign{\left(\cdot\right)}$ is applied element-wise.

We will solve this problem formulation we saw above, the LASSO problem formulation:

$$ \widehat{\wb} = \arg \min_\wb \norm{\yb - \Xb\wb}_2^2 + \lambda \norm{\wb}_1 $$

As we saw above, it will encourage that most $w_i$’s are 0, or have no importance in determining if someone has breast cancer or not. We saw above why this formulation made sense, but now let’s see how to solve it!

This optimization problem has no closed-form solution, meaning we have to find our solution iteratively. We’ll choose to solve this with an alternating minimization method (a method for biconvex optimization).

This method utilizes the fact that both the error term and the regularization term are positive. Given the sum of two positive terms, it’s natural to minimize one, then the other. (If we were minimizing a product instead, it’d be natural to drive one factor as close to 0 as possible.)

One such alternating method is the proximal gradient method, which has rigorous justification in academic papers.3456 Given suitable loss functions and regularization terms, this method repeats two steps (typically in a for-loop):

  1. Take a step towards the solution that minimizes the error as defined by the loss function. This is a step in the negative gradient (scalar case: derivative) direction, because the gradient/derivative points in the direction the function increases.
  2. Enforce that the solution from the gradient step also keeps the regularization term small. This takes the solution found in (1) and maps it to one that is acceptable.

These steps can be written as

$$ \bm{z} = \wb_k - \tau \nabla F\left(\wb_k\right) $$
$$ \wb_{k+1} = \arg \min_\wb \norm{\wb - \bm{z}}_2^2 + \tau \lambda \norm{\wb}_1 $$

where $\nabla F(\wb_k)$ represents the gradient of $\norm{\yb - \Xb\wb}_2^2$ at the point $\wb_k$ (the estimate at iteration $k$) and $\tau$ represents some step size. The equations correspond to steps (1) and (2) above.

The derivations are in the appendix and result in

$$ \bm{z} = \wb_k - \tau \cdot 2 \Xb^T \left(\Xb\wb_k - \yb\right) $$
$$ \wb_{k+1} = \textrm{sign}(\bm{z}) \cdot \left(\abs{\bm{z}} - \tau\lambda/2\right)_+ $$

where $(\cdot)_+ := \max(\cdot, 0)$ keeps the positive elements of the input and sets the rest to $0$; the whole update is known as soft thresholding. All the operations ($\textrm{sign}$, $(\cdot)_+$, multiplication, etc.) are done element-wise.

We can implement these functions as follows:

import numpy as np

def proximity(z, threshold):
    """ The prox operator for L1 regularization/LASSO. Returns
            sign(z) * (abs(z) - threshold/2)_+
        where (.)_+ sets negative elements to 0 (soft thresholding) """
    x_hat = np.abs(z) - threshold/2
    x_hat[x_hat < 0] = 0
    return x_hat * np.sign(z)

def gradient(A, x, y):
    """ Computes the gradient of least squares loss at the point x """
    return 2*A.T @ (A@x - y)

We can also implement the alternating minimization. As steps (1) and (2) above describe, the output of the gradient step is fed into the proximity operator.

# X is a fat matrix, y a label vector, y_i \in {-1, 1}
X, y = breast_cancer_data()

# our initial guess; most of the values stay zero
w_k = np.zeros(X.shape[1])
lam = 0.01  # regularization strength, a tunable choice (`lambda` is reserved in Python)
tau = 1 / np.linalg.norm(X)**2  # conservative step size; guarantees convergence
for k in range(int(1e3)):
    z = w_k - tau * gradient(X, w_k, y)
    w_k = proximity(z, tau*lam)

# binary classification tends to take the sign
# (we're only interested in the properties of the weights here, not y_hat)
y_hat = np.sign(X @ w_k)

After wrapping these functions in a class, this turns into the basics of sklearn.linear_model.Lasso. I have done this and the source is available on GitHub.

When finding the optimal weights, we don’t want to use all the data. We can hold out a set of test data and never look at it until it’s time to evaluate. Using only 80% of the dataset to train our weights, we get the following set of weights!
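In code, the 80/20 hold-out split might look like this (a minimal sketch of my own; the array shapes and random seed are stand-ins for the real data, not values from the post):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((295, 50))     # stand-in for the gene-expression matrix
y = rng.choice([-1.0, 1.0], size=295)  # stand-in for the cancer labels

# shuffle patients, then reserve 20% that training never sees
idx = rng.permutation(len(y))
split = int(0.8 * len(y))
X_train, y_train = X[idx[:split]], y[idx[:split]]
X_test, y_test = X[idx[split:]], y[idx[split:]]

print(X_train.shape, X_test.shape)  # (236, 50) (59, 50)
```

Only `X_train`, `y_train` are used to fit the weights; accuracy is then measured on the held-out 20%.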

With this $\lambda$, this prediction gives an accuracy of 81.36% on the 20% of the data reserved for testing. If we had a balanced dataset (we don’t), we’d get an accuracy of 50% by flipping a coin. While we don’t have ideal accuracy, we do have a solution with many zeros, which is what we set out to find.

The focus of this blog post was finding sparse solutions, but how can we improve these results? We are performing binary classification – we’re predicting either “yes” or “no”… but we don’t really use that information. We just naïvely used a least squares loss, which penalizes even points that are easy to classify and lie far on the correct side of the decision boundary!

The next post will focus on support vector machines, classifiers that don’t punish predictions that are “too correct.” They do this by using the hinge loss and logistic loss.


For ease, we will drop the boldface math, meaning $x := \xb$, $A := \Ab$, $y := \yb$. Also note that all operators are evaluated element-wise (except matrix multiplication); this applies to $(\cdot)_+$, multiplication, and the sign operator.

This “proof” details the iterative soft thresholding algorithm. This method can be accelerated by algorithms such as FISTA7 or FASTA8, which choose a different step size $\tau_k$ at each iteration.
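A rough sketch of the FISTA-style acceleration (my own adaptation to this post’s conventions; the momentum sequence $t_k$ follows Beck and Teboulle7): take the same proximal gradient step, but evaluate it at an extrapolated point.

```python
import numpy as np

def ista_step(w, X, y, tau, lam):
    """One proximal gradient step, with the conventions used in this post."""
    z = w - tau * 2 * X.T @ (X @ w - y)
    return np.sign(z) * np.maximum(np.abs(z) - tau * lam / 2, 0)

def fista(X, y, lam, iters=1000):
    """FISTA: proximal gradient steps evaluated at an extrapolated point."""
    tau = 1 / (2 * np.linalg.norm(X, ord=2)**2)  # 1/L for this loss
    w = np.zeros(X.shape[1])
    v, t = w.copy(), 1.0
    for _ in range(iters):
        w_next = ista_step(v, X, y, tau, lam)
        t_next = (1 + np.sqrt(1 + 4 * t**2)) / 2
        v = w_next + ((t - 1) / t_next) * (w_next - w)  # momentum/extrapolation
        w, t = w_next, t_next
    return w
```

Both iterations share the same fixed points, so FISTA reaches the same solution as plain iterative soft thresholding – just faster ($O(1/k^2)$ versus $O(1/k)$ in objective value).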

For rigorous justification of the proximal gradient method, see academic papers.3456

Least squares gradient

Given $\phi(x) = \norm{Ax - y}_2^2$, the gradient is $2A^T(Ax - y)$.

Proof: We can choose to represent squared error as $(Ax - y)^T (Ax-y)$. Then using intuition from scalar calculus and some gradient identities,

$$ \begin{aligned} \nabla \phi(x) &= \frac{d (Ax - y)}{dx} \cdot 2\cdot (Ax - y)\\ &= 2 A^T (Ax - y) \end{aligned} $$
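A quick way to check this identity numerically (my own sanity check, not part of the proof): compare $2A^T(Ax - y)$ against centered finite differences of $\phi$.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))
x = rng.standard_normal(3)
y = rng.standard_normal(5)

phi = lambda v: np.linalg.norm(A @ v - y)**2
grad = 2 * A.T @ (A @ x - y)

# centered finite differences, one coordinate at a time
eps = 1e-6
fd = np.array([(phi(x + eps * e) - phi(x - eps * e)) / (2 * eps)
               for e in np.eye(3)])

print(np.max(np.abs(grad - fd)))  # should be tiny
```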

Proximity operator

Given the proximity operator for $\ell_1$ regularization as $\phi(x) = \norm{y - x}_2^2 + \lambda\norm{x}_1$, the optimum solution is given by $\widehat{x} = \sign(y)\left(\abs{y} - \lambda/2\right)_+$ where $(\cdot)_+$ denotes the soft thresholding operator.

Proof: The definition of the proximity operator results in a separable equation that allows us to write

$$ \begin{aligned} \phi(x) &= \norm{y - x}_2^2 + \lambda \norm{x}_1\\ &= \sum_i (y_i - x_i)^2 + \lambda \abs{x_i} \end{aligned} $$

This equation can be minimized by minimizing each term separately.

$$ \pder{\phi(x)_i}{x_i} = -2(y_i - x_i) + \lambda\pder{\abs{x_i}}{x_i} $$

This last term on the end, $\pder{\abs{x_i}}{x_i}$ is tricky: at $x_i = 0$, this term is not differentiable. After using subgradients to skirt around that fact, we can say that $\pder{\abs{x_i}}{x_i} = \sign(x_i) = \pm 1$ which makes sense when we’re not at the origin.

This function is convex, which allows us to set the derivative to 0 and find the global minimum.

$$ \pder{\phi(x)_i}{x_i} = 0 = -2(y_i - x_i) + \lambda\sign(x_i) $$

$y_i > 0$ implies that $x_i > 0$, which allows us to write

$$ \begin{aligned} x_i &= y_i - \frac{\lambda}{2}\sign(y_i)\\ &= \sign(y_i) (\abs{y_i} - \lambda/2) \end{aligned} $$

But when $\abs{y_i} < \lambda/2$ we run into difficulty because $y_i > 0 \implies x_i > 0$. To get around this, we set all the values where $\abs{y_i} < \lambda/2$ to 0 and subtract $\lambda/2$ from the rest (this is detailed further in a StackExchange post). With this, we can now write

$$ x_i = \sign(y_i) \cdot (\abs{y_i} - \lambda/2)_+ $$

where $(x)_+ := \max(x, 0)$.
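We can sanity-check this result with brute force (my own verification, not part of the derivation): for scalar $y$ and $\lambda$, minimize $(y - x)^2 + \lambda\abs{x}$ over a fine grid and compare against $\sign(y)(\abs{y} - \lambda/2)_+$.

```python
import numpy as np

def prox_closed_form(y, lam):
    """The soft thresholding formula derived above."""
    return np.sign(y) * max(abs(y) - lam / 2, 0)

def prox_brute_force(y, lam):
    """Minimize (y - x)^2 + lam*|x| over a fine grid."""
    xs = np.linspace(-5, 5, 200001)  # grid spacing of 5e-5
    return xs[np.argmin((y - xs)**2 + lam * np.abs(xs))]

for y in [-2.0, -0.3, 0.0, 0.4, 3.0]:
    for lam in [0.5, 1.0, 2.0]:
        assert abs(prox_closed_form(y, lam) - prox_brute_force(y, lam)) < 1e-4
```

The asserts pass for every $(y, \lambda)$ pair on the grid, covering both regimes ($\abs{y} > \lambda/2$, where the formula shrinks toward 0, and $\abs{y} \le \lambda/2$, where it returns exactly 0).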

  1. This data set is detailed in the section titled Predicting Breast Cancer 

  2. One of the trickiest issues is finding real world data to apply your problems to. 

  3. Wright, S. J., Nowak, R. D., & Figueiredo, M. A. (2009). Sparse reconstruction by separable approximation. IEEE Transactions on Signal Processing, 57(7), 2479-2493. 

  4. Hale, E. T., Yin, W., & Zhang, Y. (2007). A fixed-point continuation method for l1-regularized minimization with applications to compressed sensing. CAAM TR07-07, Rice University, 43, 44. 

  5. Daubechies, I., Defrise, M., & De Mol, C. (2003). An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. arXiv preprint math/0307152. 

  6. Figueiredo, M. A., Bioucas-Dias, J. M., & Nowak, R. D. (2007). Majorization–minimization algorithms for wavelet-based image restoration. IEEE Transactions on Image Processing, 16(12), 2980-2991. 

  7. Beck, A., & Teboulle, M. (2009). A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1), 183-202. 

  8. Goldstein, T., Studer, C., & Baraniuk, R. (2014). A field guide to forward-backward splitting with a FASTA implementation. arXiv preprint arXiv:1411.3406.