Gradient descent and physical intuition for heavy-ball acceleration with visualization
This post is part 3 of a 3-part series: Part I, Part II, Part III.
Practical problems like the Netflix Prize start from observations and try to recover the underlying model, as in Part I of this series. Typically, this involves making observations of the form $y = f(x)$ or $\yb = \Ab \cdot \xb$^{1} where $\yb$ is observed, $f$ or $\Ab$ is known, and $\xb$ is the underlying model or variable of interest.
Finding the true $\xb$ that gave us our observations $\yb$ involves inverting a function/matrix, which can be computationally costly and, in the matrix case, often impossible. Instead, methods such as gradient descent are often used, a technique common in machine learning and optimization.
In this post, I will try to provide calculus-level intuition for gradient descent. I will also introduce and illustrate the heavy-ball acceleration method for gradient descent^{2} and provide a physical interpretation.^{3}
In doing this, I will interchangeably use the words derivative (aka $\deriv$) and gradient (aka $\grad$). The gradient is just the high-dimensional version of the derivative; all the intuition for the derivative applies to the gradient.
Intuition for gradient descent
In the problem setup it’s assumed we know the derivative/gradient of this function, a fair assumption. The derivative yields information about (a) the direction the function increases and (b) how fast it increases.
For example, we might be given a function, say $f(x) = x^2$. At the point $x = 3$, the derivative $\deriv f(x)\at{x=3} = 2\cdot 3$ tells us the function increases in the positive $x$ direction at a rate of 6.
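As a quick sanity check (my own addition, not from the original post), we can confirm this derivative numerically with a centered finite difference:

```python
def f(x):
    return x**2

# Centered finite difference: (f(x + h) - f(x - h)) / (2h) approximates f'(x)
h = 1e-6
numerical_derivative = (f(3 + h) - f(3 - h)) / (2 * h)
print(numerical_derivative)  # approximately 6
```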
The important piece of this is the direction the function increases in. Because the derivative points in the direction the function increases, the negative gradient points in the direction the function decreases. Function minimization is common in optimization, so the negative gradient direction is typically used. If we're at $x_k$ and we take a step in the negative gradient direction of $f$, our function value will get smaller, or
$$x_{k+1} = x_k - \tau \grad{f}(x_k)$$

where $\tau$ is some step size. This will converge to a minimum, possibly a local one. We're always stepping in the direction of the negative gradient; in every step, we know that the function value gets smaller.
To implement this, we would use the code below. Note that to implement this piece of code in higher dimensions, we would only define `x_hat = np.zeros(N)` and make small changes to our function `grad`.
```python
x_hat = 2
lr = 0.02  # "learning rate" or "step size"
for k in range(40):
    # grad is the gradient for some function, not shown
    x_hat = x_hat - lr*grad(x_hat)
```
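As a minimal self-contained sketch (the original leaves `grad` undefined; here I assume the toy function $f(x) = x^2$, so its gradient is $2x$):

```python
def grad(x):
    # Gradient of the toy function f(x) = x**2
    return 2*x

x_hat = 2
lr = 0.02  # "learning rate" or "step size"
for k in range(40):
    x_hat = x_hat - lr*grad(x_hat)

# Each step multiplies x_hat by (1 - 2*lr) = 0.96, so after 40 steps
# x_hat = 2 * 0.96**40, roughly 0.39 -- moving toward the minimum at 0
print(x_hat)
```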
This makes no guarantees that the minimum is global; as soon as the gradient is 0, this process stops. It can't get past any bumps because it reads the derivative at a single point. In its simplest form, gradient descent makes sense: we know the function gets smaller in a certain direction, so we should step in that direction.
Heavy-ball acceleration
Our cost function may include many smaller bumps, as in the picture below.^{4} Gradient descent will fail here because the gradient goes to 0 before the global minimum is reached.
Gradient descent seems fragile in this sense. If it runs into any spot where the gradient is 0, it will stop. It can run into a local minimum and can't get past it.
One way of getting around that is by using some momentum. Instead of focusing on the gradient at one point, it would be advantageous to include momentum to overcome temporary setbacks. This is analogous to a heavy ball rolling down a hill: bumps only moderately affect the velocity/location of the ball. The differential equation that governs this motion over a surface $f$ is

$$\ddot{x} + \mu\dot{x} + g\grad{f}(x) = 0$$

for positive constants $\mu > 0$ and $g > 0$. The coefficient of friction $\mu$ is (nearly always) between 0 and 1,^{5} and the force of gravity is given by $g$.
The discretization of this equation is a lot more useful practically and is given by

$$(x_{k+1} - 2x_k + x_{k-1}) + \mu(x_{k+1} - x_k) + g\grad{f}(x_k) = 0.$$

After some algebraic manipulations (shown in the appendix) and defining $\tau := \frac{g}{1 + \mu}$ and $\beta := \frac{1}{1 + \mu}$, we can find

$$x_{k+1} = x_k - \tau\grad{f}(x_k) + \beta(x_k - x_{k-1}).$$
We would never expect friction to accelerate objects; if it did, the ball would never settle and would climb out of any bowl. Correspondingly, when $\mu < 0$, i.e., when $\beta > 1$, this algorithm diverges!
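This divergence is easy to see numerically. The sketch below (my own illustration, assuming the toy function $f(x) = x^2$) runs the heavy-ball update with $\beta = 1.05 > 1$:

```python
def grad(x):
    # Gradient of the toy function f(x) = x**2
    return 2*x

tau, beta = 0.02, 1.05  # beta > 1 corresponds to "negative friction"
prev, curr = 2.0, 2.0
trajectory = [curr]
for k in range(300):
    # Heavy-ball update: x_{k+1} = x_k - tau*grad(x_k) + beta*(x_k - x_{k-1})
    prev, curr = curr, curr - tau*grad(curr) + beta*(curr - prev)
    trajectory.append(curr)

# The iterates oscillate with exponentially growing amplitude
print(max(abs(v) for v in trajectory))
```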
Physical intuition has been provided, but does this hold in simulation? While simulating, we should compare with the gradient descent method. The code below implements this ball-accelerated method and is used to produce the video below the code.
```python
def ball_acceleration(x, step_size=0.02, momentum=0.8):
    """
    x : List[float]
        x[0] is the current location, x[1] is the most recent location
    step_size : float
        The step size. Small step sizes mirror the continuous diff eq;
        large step sizes diverge.
    momentum : float
        The momentum coefficient. How strongly do previous locations `x[1]`
        influence the result of this function? Note: `0 <= momentum <= 1`.

    Returns: List[float]
        An array of new locations. It can be used as an input to this function.
    """
    # grad is the gradient for some function, not shown
    update = step_size*grad(x[0]) + momentum*(x[1] - x[0])
    new_point = x[0] - update
    return [new_point, x[0]]

x_hat = [2, 2]
cof = 0.25  # coefficient of friction
momentum = 1 / (1 + cof)
lr = 0.02  # small step size for good approximation
for k in range(40):
    x_hat = ball_acceleration(x_hat, step_size=lr, momentum=momentum)
```
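To make the comparison concrete, here is a self-contained sketch (assuming the toy function $f(x) = x^2$, since the original `grad` is not shown) that runs both plain gradient descent and the heavy-ball method from the same starting point:

```python
def grad(x):
    # Gradient of the toy convex function f(x) = x**2
    return 2*x

def ball_acceleration(x, step_size=0.02, momentum=0.8):
    # Heavy-ball update: x_new = x - step_size*grad(x) + momentum*(x - x_prev)
    update = step_size*grad(x[0]) + momentum*(x[1] - x[0])
    return [x[0] - update, x[0]]

lr = 0.02
x_grad = 2.0         # plain gradient descent iterate
x_ball = [2.0, 2.0]  # heavy-ball iterates [current, previous]
for k in range(40):
    x_grad = x_grad - lr*grad(x_grad)
    x_ball = ball_acceleration(x_ball, step_size=lr, momentum=0.8)

# On this smooth bowl, the heavy ball ends up much closer to the minimum at 0
print(abs(x_grad), abs(x_ball[0]))
```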
We can see that this heavy-ball method acts like a ball rolling down a hill with friction. It nearly stops and falls back down into the local minimum. It settles at the bottom, near the global minimum.
This heavy-ball method is not guaranteed to converge to the global minimum, even though it does in this example. Typically, the heavy-ball method is used to get into the same region as the global minimum, and then a normal optimization method is used that is guaranteed to converge to the global minimum.
Higher dimensions
To show this in higher dimensions, I quickly coded up the same algorithm above but in higher dimensions (as shown in the appendix). Essentially, only small changes to the function `grad` were made. While doing this, I used $\yb = \Ab \xb^\star$ to generate ground truth and plotted the distance from $\xb^\star$. I made $\Ab \in \R^{50 \times 35}$ a tall matrix to represent an overdetermined system and to find the pseudo-inverse. This is a problem formulation where gradient descent methods are used.
I then graphed the convergence rate. In this, I knew the ground truth and plotted the $\ell_2$ distance to the true solution for both classic gradient descent and this accelerated ball method.
This only shows that this ball-acceleration method is faster for linear systems. Simple linear functions are convex functions, so they correspond to a round bowl without any bumps. A ball rolling down a smooth bowl should accelerate, which means the heavy-ball acceleration will reach the minimum faster than gradient descent.^{6}
Further reading
“A Differential Equation for Modeling Nesterov’s Accelerated Gradient Method: Theory and Insights” has some good analysis and intuition on the differential equation governing this post. It’s far more detailed than this post.
Here are some more mathematical introductions to heavy-ball methods:
- Two more mathematical introductions to heavy-ball acceleration: one, two.
- The lecture notes for lectures 2010-10-10 and 2012-10-15 in Ben Recht's class CS 726: Nonlinear optimization I.
Here are the academic articles that pioneered this method:
- "Some methods of speeding up the convergence of iteration methods" by the Russian mathematician Polyak in 1964.
- "A method of solving a convex programming problem with convergence rate $O(1/k^2)$" by Yuri Nesterov, published in volume 27 of Soviet Math Dokl (pdf).
And here's an introduction to gradient descent:
- An introduction to gradient descent, an explanation of gradient descent with code examples.
Appendix
Newton’s second law
For any body, the net force on an object is $F = ma$ by Newton's second law of motion. Let's write down the relevant forces^{7} for a ball rolling in a bumpy landscape $f$:

$$F_{\text{net}} = m\ddot{x}, \qquad F_{\text{friction}} = -m\mu\dot{x}, \qquad F_{\text{gravity}} = -mg\grad{f}(x)$$
with body mass $m$ and coefficient of friction $\mu$. This assumes friction is the same regardless of the slope $\grad f(x)$; that would only apply if the friction came from the fluid surrounding the ball (e.g., air or water).
The net force must equal the sum of these forces, so after rearranging, all the terms sum to 0. Under mild assumptions,^{8} this means that

$$m\ddot{x} + m\mu\dot{x} + mg\grad{f}(x) = 0,$$

and dividing through by $m$ gives the differential equation above.
Algebraic manipulations
The discretization of $\ddot{x} + \mu\dot{x} + g\grad{f}(x) = 0$ is given by

$$(x_{k+1} - 2x_k + x_{k-1}) + \mu(x_{k+1} - x_k) + g\grad{f}(x_k) = 0.$$

Simplifying,

$$x_{k+1} = \frac{(2 + \mu)x_k - x_{k-1} - g\grad{f}(x_k)}{1 + \mu}
= x_k - \frac{g}{1 + \mu}\grad{f}(x_k) + \frac{1}{1 + \mu}(x_k - x_{k-1})
= x_k - \tau\grad{f}(x_k) + \beta(x_k - x_{k-1}).$$
High dimensional gradient descent code
```python
import numpy as np

# Tall matrix A represents an overdetermined system y = A @ x
M, N = 50, 35
A = np.random.rand(M, N)
x = np.random.rand(N)
y = A @ x

np.random.seed(42)
initial_guess = np.random.rand(N)
x_ball = [initial_guess.copy(), initial_guess.copy()]
x_grad = initial_guess.copy()

# c[0] is the step size, c[1] the momentum coefficient;
# grad and ball_acceleration are higher-dimensional variants, not shown
c = [1/np.linalg.norm(A)**2, 0.9]
for k in range(int(500)):
    x_grad = x_grad - c[0]*grad(A, x_grad, y)
    x_ball = ball_acceleration(A, x_ball, y, c)
```
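The code above leaves the higher-dimensional `grad` and `ball_acceleration` undefined. A self-contained sketch, assuming the least-squares objective $f(\xb) = \frac{1}{2}\|\Ab\xb - \yb\|_2^2$ (whose gradient is $\Ab^\top(\Ab\xb - \yb)$), might look like:

```python
import numpy as np

def grad(A, x, y):
    # Gradient of the least-squares objective 0.5*||A @ x - y||**2
    return A.T @ (A @ x - y)

def ball_acceleration(A, x, y, c):
    # Heavy-ball update with step size c[0] and momentum c[1]
    update = c[0]*grad(A, x[0], y) + c[1]*(x[1] - x[0])
    return [x[0] - update, x[0]]

np.random.seed(42)
M, N = 50, 35
A = np.random.rand(M, N)
x_star = np.random.rand(N)  # ground truth
y = A @ x_star

initial_guess = np.random.rand(N)
x_ball = [initial_guess.copy(), initial_guess.copy()]
x_grad = initial_guess.copy()

c = [1/np.linalg.norm(A)**2, 0.9]
for k in range(500):
    x_grad = x_grad - c[0]*grad(A, x_grad, y)
    x_ball = ball_acceleration(A, x_ball, y, c)

# Distance to the true solution for both methods
err_grad = np.linalg.norm(x_grad - x_star)
err_ball = np.linalg.norm(x_ball[0] - x_star)
print(err_grad, err_ball)
```

Note that `np.linalg.norm(A)` is the Frobenius norm, which upper-bounds the spectral norm, so `c[0]` is a conservative (safe) step size for this quadratic.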

1. I'll use plain font/bold font for scalars/vectors (respectively) as per my notation sheet. ↩
2. The heavy-ball method was first introduced by Boris Polyak; Yuri Nesterov later developed a closely related accelerated method in "A method of solving a convex programming problem with convergence rate $O(1/k^2)$", published in volume 27 of Soviet Math Dokl (pdf). ↩
3. See "A Differential Equation for Modeling Nesterov's Accelerated Gradient Method: Theory and Insights" by Su, Boyd and Candes for a more detailed version of this post. ↩
4. Formally, this function is non-convex. ↩
5. If the friction comes from the fluid the ball is rolling through (e.g., air or water), then the "viscous damping coefficient" is always between 0 and 1. ↩
6. Unless the step size for gradient descent is tuned properly. ↩
7. The irrelevant forces are the purely vertical forces exerted by the surface and by gravity. ↩
8. Assumptions like "there's enough fluid and it's not sticky enough to influence the force of friction." ↩