This post is part 3 of a 3-part series: Part I, Part II, Part III.

We often make observations from some system and would like to infer something about the system parameters, and many practical problems such as the Netflix Prize can be reformulated this way. Typically, this involves making observations of the form $y = f(x)$ or $\yb = \Ab \cdot \xb$1, where $y$ is observed, $f$ (or $\Ab$) is known, and $x$ is the unknown variable of interest.

Finding the true $x$ that gave us our observations $y$ involves inverting a function or matrix, which can be costly time-wise and, in the matrix case, often impossible. Instead, methods such as gradient descent, a technique common in machine learning and optimization, are often used.

In this post, I will try to provide calculus-level intuition for gradient descent. I will also introduce and demonstrate the heavy-ball acceleration method for gradient descent2 and provide a physical interpretation.

In doing this, I will interchangeably use the words derivative (aka $\deriv$) and gradient (aka $\grad$). The gradient is just the high-dimensional version of the derivative; all the intuition for the derivative applies to the gradient.

Intuition for gradient descent

In the problem setup, it's assumed we know the derivative/gradient of the function $f$, a fair assumption. The derivative yields information about (a) the direction in which the function increases and (b) how fast it increases.

For example, we might be given a function, say $f(x) = x^2$. At the point $x = 3$, the derivative $\deriv f(x)\at{x=3} = 2\cdot 3 = 6$ tells us the function increases in the positive $x$ direction at a rate of 6.
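
As a quick numeric check (a sketch of mine, not from the post), a finite-difference approximation recovers the same rate:

def f(x):
    return x**2

# Finite-difference approximation of the derivative at x = 3
h = 1e-6
slope = (f(3 + h) - f(3)) / h
print(slope)  # approximately 6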

The important piece of this is the direction in which the function increases. Because the derivative points in the direction the function increases, the negative gradient points in the direction the function decreases. Function minimization is common in optimization, meaning that the negative gradient direction is typically used. If we're at $x_k$ and we take a step in the negative gradient direction of $f$, our function will get smaller, or

$$ x_{k+1} = x_k - \tau \grad f(x_k) $$

where $\tau$ is some step size. This will converge to a minimum, possibly a local one. We're always stepping in the direction of the negative gradient; in every step, we know that the function value gets smaller (provided $\tau$ is small enough).

To implement this, we would use the code below. Note that to implement it in higher dimensions, we would only need to define x_hat = np.zeros(N) and make small changes to our function grad.

x_hat = 2
tau = 0.02

for k in range(40):
    # grad is the gradient for some function, not shown
    x_hat = x_hat - tau*grad(x_hat)
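
The function grad isn't shown in the post. As a minimal sketch, assuming the toy cost $f(x) = x^2$ (my assumption), grad would be:

def grad(x):
    # Gradient of the assumed toy cost f(x) = x**2
    return 2*x

With this choice, the update becomes $x_{k+1} = (1 - 2\tau)\, x_k = 0.96\, x_k$, so x_hat decays geometrically toward the minimum at $x = 0$.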

This makes no guarantees that the minimum is global; as soon as the gradient is 0, this process stops. It can't get past any bumps because it reads the derivative at a single point. In its simplest form, gradient descent makes sense: we know the function gets smaller in a certain direction, so we should step in that direction.

Heavy-ball acceleration

Our cost function may include many smaller bumps, as in the picture below. Gradient descent will fail here because the gradient goes to 0 before the global minimum is reached.

Gradient descent seems fragile in this sense. If it runs into any spot where the gradient is 0, it will stop. It can run into a local minimum and be unable to get past it.

One way of getting around this is by using some momentum. Instead of focusing on the gradient at one point, it would be advantageous to include momentum to overcome temporary setbacks. This is analogous to a heavy ball rolling down a hill: bumps only moderately affect it. The differential equation that governs this motion in a gravity well described by $f$ is

$$ \ddot{x} + a \dot{x} + b \grad f(x) = 0 $$

for positive constants $a > 0$ and $b > 0$. This continuous-time equation isn't that useful for computers, though; the discretization of this equation is given by

$$ (x_{k+1}-x_{k-1}) + a(x_{k+1} - x_k) + b\grad f(x_k) = 0 $$

After some algebraic manipulations (shown in the appendix) and defining $\tau := \frac{b}{1 + a}$ and $\beta := \frac{1}{1 + a}$, we can find

$$ x_{k+1} = x_k - \tau \grad f(x_k) + \beta \cdot(x_{k-1} - x_{k}) $$

When interpreting this as a ball rolling down a hill, $a$ is friction and $b$ is the strength of gravity. We would never expect friction to accelerate objects; if it did, the ball would never settle and would climb out of any bowl. Correspondingly, when $a < 0$, i.e., when $\beta > 1$, this algorithm diverges! For a concrete mapping, the constants $\tau = 0.02$ and $\beta = 0.8$ used in the simulation code later in this section correspond to friction $a = 1/\beta - 1 = 0.25$ and gravity $b = \tau/\beta = 0.025$.
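
As a quick numeric check of the divergence claim (a sketch of my own, not from the post), take a perfectly flat cost, so the gradient is 0 everywhere; with $\beta > 1$, the gap between successive iterates grows by a factor of $\beta$ every step:

beta = 1.2          # beta > 1 corresponds to negative friction, a = 1/beta - 1 < 0
x = [1.0, 0.0]      # x[0] is the current iterate, x[1] the previous one

for k in range(20):
    update = x[0] + beta*(x[1] - x[0])  # heavy-ball update with grad f(x) = 0
    x = [update, x[0]]
    print(abs(x[0] - x[1]))             # grows by a factor of beta each iteration

The iterates oscillate with ever-growing swings instead of settling, just as the physical picture predicts.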

Physical intuition has been provided, but does it hold in simulation? While simulating, we should compare against plain gradient descent. The code below implements this ball-accelerated method and is used to produce the video below the code.

def ball_acceleration(x, c):
    """
    :input x: iterates. x[0] is most recent iterate, x[1] is last iterate
    :input c: Array of constants. c[0] is step size, c[1] is momentum constant
    :returns: Array of new iterates.
    """
    # grad is the gradient for some function, not shown
    update = x[0] - c[0]*grad(x[0]) + c[1]*(x[1] - x[0])
    return [update, x[0]]

x_hat = [2, 2]
tau, weight = 0.02, 0.8

for k in range(40):
    x_hat = ball_acceleration(x_hat, [tau, weight])
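
Once more, grad isn't shown. As a hypothetical stand-in for the post's bumpy cost (my assumption, not the author's actual function), take $f(x) = x^2 + 2\sin(4x)$, which has several local minima; its gradient is:

import numpy as np

def grad(x):
    # Gradient of the assumed bumpy cost f(x) = x**2 + 2*np.sin(4*x)
    return 2*x + 8*np.cos(4*x)

With a cost like this, the plain gradient descent loop from earlier can stall in the first local minimum it reaches, while the momentum term above can carry the iterates through the small bumps.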

We can see that this heavy-ball method acts like a ball rolling down a hill with friction: it nearly stops and falls back down into the local minima, but it settles at the bottom, near the global minimum.

This heavy-ball method is not guaranteed to converge to the global minimum, even though it does in this example. Typically, the heavy-ball method is used to get into the same region as the global minimum, and then a normal optimization method is used that is guaranteed to converge to the global minimum.

Higher dimensions

To show this in higher dimensions, I quickly coded up the same algorithm as above but in higher dimensions (as shown in the appendix). Essentially, only small changes to the function grad were made. While doing this, I used $\yb = \Ab \xb^\star$ to generate ground truth and plotted the distance from $\xb^\star$. I made $\Ab \in \R^{50 \times 35}$ a tall matrix to represent an overdetermined system and to find the pseudoinverse. This is a problem formulation where gradient descent methods are used: the cost is $f(\xb) = \frac{1}{2}\|\Ab\xb - \yb\|_2^2$, whose gradient is $\grad f(\xb) = \Ab^T(\Ab\xb - \yb)$.

I then graphed the convergence rate. Here, I knew the ground truth and plotted the $\ell_2$ distance to the true solution for both classic gradient descent and this accelerated ball method.

This only shows that this ball-acceleration method is faster for linear systems – a linear least-squares problem is convex, so it doesn't have any saddle points or local minima like a non-convex function!

Further reading

Appendix

High dimensional gradient descent code

import numpy as np

M, N = 50, 35
A = np.random.rand(M, N)
x = np.random.rand(N)
y = A @ x

np.random.seed(42)
initial_guess = np.random.rand(N)
x_ball = [initial_guess.copy(), initial_guess.copy()]
x_grad = initial_guess.copy()

# c[0] is the step size (1/||A||_F^2 is a conservative choice), c[1] the momentum constant
c = [1/np.linalg.norm(A)**2, 0.9]

for k in range(500):
    x_grad = x_grad - c[0]*grad(A, x_grad, y)
    x_ball = ball_acceleration(A, x_ball, y, c)
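
The script above assumes least-squares versions of grad and ball_acceleration that aren't shown. A minimal sketch consistent with the earlier one-dimensional definitions, assuming the cost $f(\xb) = \frac{1}{2}\|\Ab\xb - \yb\|_2^2$:

def grad(A, x, y):
    # Gradient of f(x) = 0.5*||A @ x - y||^2
    return A.T @ (A @ x - y)

def ball_acceleration(A, x, y, c):
    # Same heavy-ball update as before, with the least-squares gradient
    update = x[0] - c[0]*grad(A, x[0], y) + c[1]*(x[1] - x[0])
    return [update, x[0]]

Both would need to be defined before the loop above runs.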

Algebraic manipulations

The discretization of $\ddot{x} + a\dot{x} + b\grad f(x) = 0$ is given by

$$ (x_{k+1} - x_{k-1}) + a (x_{k+1} - x_k) + b\grad f(x_k) = 0 $$

Simplifying, we see that (the "if and only if"s are left out of this algebraic manipulation)

$$ \align{
(1 + a) x_{k+1} &= ax_k + x_{k-1} - b \grad f(x_k)\\
(1 + a) x_{k+1} &= (a + 1)x_k + x_{k-1} - x_k - b \grad f(x_k)\\
x_{k+1} &= x_k + \frac{1}{1+a} (x_{k-1} - x_k) - \frac{b}{1+a} \grad f(x_k)\\
} $$
  1. I’ll use plain font/bold font for scalars/vectors (respectively) as per my notation sheet

  2. With a video