Gradient descent and physical intuition for heavyball acceleration with visualization
This post is a part 3 of a 3 part series: Part I, Part II, Part III.
We often make observations from some system and would like to infer something about the system parameters, and many practical problems such as the Netflix Prize can be reformulated this way. Typically, this involves making observations of the form $y = f(x)$ or $\yb = \Ab \cdot \xb$^{1} where $y$ is observed, $f/\Ab$ is known and $x$ is the unknown variable of interest.
Finding the true $x$ that gave us our observations $y$ involves inverting a function/matrix which can be costly timewise and in the matrix case often impossible. Instead, methods such as gradient descent are often involved, a technique common in machine learning and optimization.
In this post, I will try to provide calculuslevel intuition for gradient descent. I will also introduce and show the heavyball acceleration method for gradient descent^{2} and provide a physical interpretation.
In doing this, I will interchangeably use the words derivative (aka $\deriv$) and gradient (aka $\grad$). The gradient is just the highdimensional version of the derivative; all the intuition for the derivative applies to the gradient.
Intuition for gradient descent
In the problem setup it’s assumed we know the derivative/gradient of this function, a fair assumption. The derivative yields information about (a) the direction the function increases and (b) how fast it increases.
For example, we might be given a function, say $f(x) = x^2$. At the point $x = 3$, the derivative $\deriv f(x)\at{x=3} = 2\cdot 3$ tells us the function increases in the positive $x$ direction at a rate of 6.
The important piece of this is the direction the function increases in. Because the derivate points in the direction the function increases in, the negative gradient points in the direction the function is decreases in. Function minimization is common in optimization, meaning that the negative gradient direction is typically used. If we’re at $x_k$ and we take a step in the negative gradient direction of $f$ our function will get smaller, or
where $\tau$ is some stepsize. This will converge to a minima, possibly local. We’re always stepping in the direction of the negative gradient; in every step, we know that the function value gets smaller.
To implement this, we would use the code below. Note that to implement this
piece of code in higher dimensions, we would only define x_hat = np.zeros(N)
and make small changes to our function grad
.
x_hat = 2
tau = 0.02
for k in range(40):
# grad is the gradient for some function, not shown
x_hat = x_hat  tau*grad(x_grad, tau)
This makes no guarantees that the minima is global; as soon as the gradient is 0, this process stops. It can’t get past any bumps because it reads the derivative at a single point. In it’s simplest form, gradient descent makes sense. We know the function gets smaller in a certain direction; we should step in that direction.
Heavyball acceleration
Our cost function may include many smaller bumps, as in the picture below. Gradient descent will fail here because the gradient goes to 0 before the global minima is reached.
Gradient descent seems fragile in this sense. If it runs into any spots where the gradient is 0, it will stop. It can run into a local minima and can’t get past it.
One way of getting around that is by using some momentum. Instead of focusing on the gradient at one point, it would be advantageous to include momentum to overcome temporary setbacks. This is analogous to thinking of as a heavy ball is rolling hill and bumps only moderately effect it. The differential equation that governs this motion in a gravity well described by $f$ is
for positive constants $a > 0$ and $b > 0$ but isn’t that useful for computers. The discretization of this equation is given by
After some algebraic manipulations (shown in the appendix) and defining $\tau := \frac{b}{1 + a}$ and $\beta := \frac{1}{1 + a}$, we can find
When crafting this as a ball rolling down a hill, $a$ is friction and $b$ is the strength of gravity. We would never expect friction to accelerate objects; if it did, the ball would never settle and would climb out of any bowl. Correspondingly, when $a < 0$ aka when $\beta > 1$ this algorithm diverges!
Physical intuition has been provided, but does this hold in simulation? While simulating, we should compare with the gradient descent method. The code below implements this ballaccelerated method and is used to produce the video below the code.
def ball_acceleration(x, c):
"""
:input x: iterates. x[0] is most recent iterate, x[1] is last iterate
:input c: Array of constants. c[0] is step size, c[1] is momentum constant
:returns: Array of new iterates.
"""
# grad is the gradient for some function, not shown
update = x[0]  c[0]*grad(x[0]) + c[1]*(x[1]  x[0])
return [update, x[0]]
x_hat = [2, 2]
tau, weight = 0.02, 0.8
for k in range(40):
x_hat = ball_acceleration(x_hat, [tau, weight])
We can see that this heavyball method acts like a ball rolling down a hill with friction. It nearly stops and falls back down into the local minima. It settles at the bottom, near the global minima.
This heavyball method is not guaranteed to converge to the global minima, even though it does in this example. Typically, the heavyball method is used to get in the same region as the global minima, and then a normal optimization method is used that is guaranteed to converge to the global minima.
Higher dimensions
To show this in higher dimensions, I quickly coded up the same algorithm above
but in higher dimensions (as shown in the appendix). Essentially, only made
small changes to the function grad
were made. While doing this, I used
$\yb = \Ab \xb^\star$ to generate ground truth and plotted the distance from
$\xb^\star$. I made $\Ab \in \R^{50 \times 35}$ a tall matrix to represent
an overdetermined system and to find the pseudoinverse. This is a problem
formulation that where gradient descent methods are used.
I then graphed the convergence rate. In this, I knew ground truth and plotted and plotted the $\ell_2$ distance to the true solution for both classic gradient descent and this accelerated ball method.
This only shows that this ballacceleration method is faster for linear systems – it doesn’t have any saddle points like a nonconvex function!
Further reading
 An introduction to gradient descent, an explanation of gradient descent with code examples.
 A mathematical approach to the heavyball acceleration method
 Another mathematical approach on Nesterov acceleration.
 The lecture notes for lectures 20101010 and 20121015 in Ben Recht’s class CS 726: Nonlinear optimization I.
 The academic paper that introduced the heavyball method, published by the Russian mathematician Polyak in 1964.
Appendix
High dimensional gradient descent code
M, N = 50, 35
A = np.random.rand(M, N)
x = np.random.rand(N)
y = A @ x
np.random.seed(42)
initial_guess = np.random.rand(N)
x_ball = [initial_guess.copy(), initial_guess.copy()]
x_grad = initial_guess.copy()
c = [1/np.linalg.norm(A)**2, 0.9]
for k in range(int(500)):
x_grad = x_grad  c[0]*grad(A, x_grad, y)
x_ball = ball_acceleration(A, x_ball, y, c)
Algebraic manipulations
The discretization of $\ddot{x} + a\dot{x} + \grad{f}(x) = 0$ is given by
Simplifying, we see that (if and only if’s left out for this algebraic manipulation)

I’ll use plain font/bold font for scalars/vectors (respectively) as per my notation sheet. ↩