This post is a part 3 of a 3 part series: Part I, Part II, Part III.

Practical problems like the Netflix Prize make observations and try to find the underlying model, as in Part I of this series. Typically, this involves making observations of the form $y = f(x)$ or $\yb = \Ab \cdot \xb$,[1] where $\yb$ is observed, $f$ or $\Ab$ is known, and $\xb$ is the underlying model or variable of interest.

Finding the true $\xb$ that gave us our observations $\yb$ involves inverting a function or matrix, which can be costly time-wise and, in the matrix case, often impossible. Instead, methods such as gradient descent are often used, a technique common in machine learning and optimization.

In this post, I will try to provide calculus-level intuition for gradient descent. I will also introduce and show the heavy-ball acceleration method for gradient descent[2] and provide a physical interpretation.[3]

In doing this, I will interchangeably use the words derivative (aka $\deriv$) and gradient (aka $\grad$). The gradient is just the high-dimensional version of the derivative; all the intuition for the derivative applies to the gradient.

In this problem setup, it's assumed we know the derivative/gradient of the function being minimized, a fair assumption. The derivative yields information about (a) the direction the function increases and (b) how fast it increases.

For example, we might be given a function, say $f(x) = x^2$. At the point $x = 3$, the derivative $\deriv f(x)\at{x=3} = 2\cdot 3$ tells us the function increases in the positive $x$ direction at a rate of 6.
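As a quick sanity check, we can compare this analytic derivative against a numerical one. This is a minimal sketch of my own; the central-difference formula and the helper name `numerical_derivative` are not from the post.

```python
def f(x):
    return x**2

def numerical_derivative(f, x, h=1e-6):
    # central-difference approximation to f'(x)
    return (f(x + h) - f(x - h)) / (2 * h)

# at x = 3, this is very close to the analytic derivative 2 * 3 = 6
print(numerical_derivative(f, 3))
```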

The important piece of this is the direction in which the function increases. Because the derivative points in the direction the function increases, the negative gradient points in the direction the function decreases. Function minimization is common in optimization, so the negative gradient direction is typically used. If we're at $x_k$ and we take a step in the negative gradient direction of $f$, our function value will get smaller, or

$$x_{k+1} = x_k - \tau \grad f(x_k)$$

where $\tau$ is some step size. This will converge to a minimum, possibly a local one. We're always stepping in the direction of the negative gradient; at every step, for a suitably small step size, we know that the function value gets smaller.

To implement this, we can use the code below. To run it in higher dimensions, we would only define x_hat = np.zeros(N) and make small changes to our function grad.

```python
def grad(x):
    # derivative of f(x) = x**2
    return 2 * x

x_hat = 2
lr = 0.02  # "learning rate" or "step size"

for k in range(40):
    x_hat = x_hat - lr * grad(x_hat)
```

This makes no guarantees that the minimum is global; as soon as the gradient is 0, this process stops. It can't get past any bumps because it only reads the derivative at a single point. In its simplest form, gradient descent makes sense: we know the function gets smaller in a certain direction, so we step in that direction.

## Heavy-ball acceleration

Our cost function may include many smaller bumps, as in the picture below.[4] Gradient descent will fail here because the gradient goes to 0 before the global minimum is reached. Gradient descent seems fragile in this sense: if it runs into any spot where the gradient is 0, it stops. It can run into a local minimum and can't get past it.
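To make this failure concrete, here is a small sketch using a bumpy function of my own choosing (not the one pictured): gradient descent started at $x = 2$ settles into the nearest local minimum and never reaches the global one.

```python
import numpy as np

def f(x):
    # a bumpy, non-convex function: a parabola plus ripples
    return x**2 + 2 * np.sin(3 * x)

def grad(x):
    return 2 * x + 6 * np.cos(3 * x)

x_hat = 2.0
lr = 0.02
for k in range(500):
    x_hat = x_hat - lr * grad(x_hat)

# x_hat sits in a local minimum near x = 1.4, where the gradient is 0,
# even though the global minimum (near x = -0.47) has a lower value of f
```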

One way of getting around this is by using some momentum. Instead of relying on the gradient at one point, it is advantageous to include momentum to overcome temporary setbacks. This is analogous to a heavy ball rolling down a hill: bumps only moderately affect the velocity and location of the ball. The differential equation that governs this motion over a surface $f$ is

$$\ddot{x} + \mu \dot{x} + g \grad f(x) = 0$$

for positive constants $\mu > 0$ and $g > 0$. The coefficient of friction $\mu$ is (nearly always) between 0 and 1,[5] and the force of gravity is given by $g$.

A discretization of this equation is far more useful in practice and is given by

$$(x_{k+1} - 2x_k + x_{k-1}) + \mu(x_{k+1} - x_k) + g\grad f(x_k) = 0$$

After some algebraic manipulations (shown in the appendix) and defining $\tau := \frac{g}{1 + \mu}$ and $\beta := \frac{1}{1 + \mu}$, we can find

$$x_{k+1} = x_k - \tau \grad f(x_k) + \beta \cdot(x_{k} - x_{k-1})$$

We would never expect friction to accelerate objects; if it did, the ball would never settle and would climb out of any bowl. Correspondingly, when $\mu < 0$, i.e., when $\beta > 1$, this algorithm diverges!
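This divergence is easy to see numerically. The sketch below runs the heavy-ball update $x_{k+1} = x_k - \tau \grad f(x_k) + \beta (x_k - x_{k-1})$ on the simple bowl $f(x) = x^2$; the particular constants are arbitrary choices of mine.

```python
def heavy_ball(grad, x0, step_size, momentum, n_steps):
    # iterate the heavy-ball update, tracking the current and previous points
    x_prev, x = x0, x0
    for _ in range(n_steps):
        x, x_prev = x - step_size * grad(x) + momentum * (x - x_prev), x
    return x

grad = lambda x: 2 * x  # gradient of f(x) = x**2

x_good = heavy_ball(grad, 1.0, step_size=0.04, momentum=0.8, n_steps=300)
x_bad = heavy_ball(grad, 1.0, step_size=0.04, momentum=1.1, n_steps=300)
# x_good is essentially 0; x_bad has blown up because momentum > 1
```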

Physical intuition has been provided, but does it hold in simulation? While simulating, we should compare with the gradient descent method. The code below implements this ball-acceleration method and is used to produce the video below the code.

```python
# assumes a function grad(x) returning the gradient of f at x is defined
def ball_acceleration(x, step_size=0.02, momentum=0.8):
    """
    x : List[float]
        x[0] is the current location, x[1] is the previous location.
    step_size : float
        The step size. Small step sizes mirror the continuous diff eq;
        large step sizes diverge.
    momentum : float
        The momentum coefficient. How strongly do previous locations x
        influence the result of this function? Note: 0 <= momentum <= 1.

    Returns : List[float]
        The new pair of locations. It can be used as an input to this function.
    """
    update = step_size * grad(x[0]) - momentum * (x[0] - x[1])
    new_point = x[0] - update
    return [new_point, x[0]]

x_hat = [2, 2]  # current and previous locations
cof = 0.25  # coefficient of friction

momentum = 1 / (1 + cof)
lr = 0.02  # small step size for a good approximation

for k in range(40):
    x_hat = ball_acceleration(x_hat, step_size=lr, momentum=momentum)
```


We can see that this heavy-ball method acts like a ball rolling down a hill with friction. It nearly stops in the local minimum but rolls through it, and settles at the bottom, near the global minimum.

This heavy-ball method is not guaranteed to converge to the global minimum, even though it does in this example. Typically, the heavy-ball method is used to get into the same region as the global minimum, and then a normal optimization method that is guaranteed to converge is used to finish the job.
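A hypothetical sketch of that two-phase strategy might look like the following; the function name and phase lengths are my own invention, not a standard recipe.

```python
def two_phase_minimize(grad, x0, lr=0.02, momentum=0.8,
                       ball_steps=200, descent_steps=200):
    # phase 1: heavy ball, to roll past small bumps and local minima
    x_prev, x = x0, x0
    for _ in range(ball_steps):
        x, x_prev = x - lr * grad(x) + momentum * (x - x_prev), x
    # phase 2: plain gradient descent, to settle into the nearby minimum
    for _ in range(descent_steps):
        x = x - lr * grad(x)
    return x
```

On a convex bowl like $f(x) = x^2$, both phases head to the same minimum; on a bumpy function, phase 1 does the exploring and phase 2 does the settling.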

## Higher dimensions

To show this in higher dimensions, I quickly coded up the same algorithm as above, but in higher dimensions (as shown in the appendix). Essentially, only small changes to the function grad were made. While doing this, I used $\yb = \Ab \xb^\star$ to generate ground truth and plotted the distance from $\xb^\star$. I made $\Ab \in \R^{50 \times 35}$ a tall matrix to represent an overdetermined system and to find the pseudoinverse. This is a problem formulation where gradient descent methods are used.

I then graphed the convergence rate. Here, I knew the ground truth and plotted the $\ell_2$ distance to the true solution for both classic gradient descent and this accelerated ball method. This only shows that the ball-acceleration method is faster for linear systems. The least-squares objective for a linear system is convex, so it corresponds to a round bowl without any bumps. A ball rolling down a smooth bowl should accelerate, which means heavy-ball acceleration will reach the minimum faster than gradient descent.[6]
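I don't have the original plotting code, but a sketch of how such a comparison might be set up is below; the matrix sizes match the post, while the seed, step size, and momentum value are assumptions of mine.

```python
import numpy as np

np.random.seed(0)
A = np.random.rand(50, 35)
x_star = np.random.rand(35)
y = A @ x_star  # noiseless observations of the ground truth

def grad(x):
    # gradient of the least squares loss 0.5 * ||A @ x - y||**2
    return A.T @ (A @ x - y)

lr = 1 / np.linalg.norm(A, ord=2) ** 2  # safe step size for this loss
beta = 0.9  # momentum

x_gd = np.zeros(35)
x_hb, x_hb_prev = np.zeros(35), np.zeros(35)
err_gd, err_hb = [], []
for k in range(500):
    x_gd = x_gd - lr * grad(x_gd)
    x_hb, x_hb_prev = x_hb - lr * grad(x_hb) + beta * (x_hb - x_hb_prev), x_hb
    err_gd.append(np.linalg.norm(x_gd - x_star))
    err_hb.append(np.linalg.norm(x_hb - x_star))
# err_hb shrinks much faster than err_gd
```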

“A Differential Equation for Modeling Nesterov’s Accelerated Gradient Method: Theory and Insights” has some good analysis and intuition on the differential equation governing this post. It’s far more detailed than this post.

Here are some more mathematical introductions to heavy-ball methods:

Here are the academic articles that pioneered this method:

And here’s an introduction to gradient descent:

## Appendix

### Newton’s second law

For any body, the net force on an object is $F = ma$ by Newton’s second law of motion. Let’s write down the relevant forces[7] for a ball rolling in a bumpy landscape $f$:

$$F= m\ddot{x} = -\underbrace{\mu m\dot{x}}_{\text{friction}} - \underbrace{mg\grad f(x)}_{\text{gravity}}$$

with body mass $m$ and coefficient of friction $\mu$. This assumes friction is the same regardless of the slope $\grad f(x)$; that would only apply if the friction came from the fluid surrounding the ball (e.g., air or water).

Dividing through by the mass $m$ and moving everything to one side, under mild assumptions,[8] this means that

$$0 = \ddot{x} +\mu \dot{x} + g\grad f(x)$$

### Algebraic manipulations

The discretization of $\ddot{x} + \mu\dot{x} + g\grad{f}(x) = 0$ is given by

$$(x_{k+1} - 2x_k + x_{k-1}) + \mu (x_{k+1} - x_k) + g\grad{f}(x_k) = 0$$

Simplifying,

$$\begin{aligned} (1 + \mu) x_{k+1} &= (2 + \mu) x_k - x_{k-1} - g \grad{f}(x_k)\\ (1 + \mu) x_{k+1} &= (1 + \mu) x_k + (x_k - x_{k-1}) - g \grad{f}(x_k)\\ x_{k+1} &= x_k + \frac{1}{1+\mu} (x_k - x_{k-1}) - \frac{g}{1+\mu} \grad{f}(x_k) \end{aligned}$$

### High dimensional gradient descent code

```python
import numpy as np

np.random.seed(42)
M, N = 50, 35
A = np.random.rand(M, N)
x_star = np.random.rand(N)
y = A @ x_star

def grad(x):
    # gradient of the least squares loss 0.5 * ||A @ x - y||**2
    return A.T @ (A @ x - y)

initial_guess = np.random.rand(N)
x_ball = [initial_guess.copy(), initial_guess.copy()]

lr = 1 / np.linalg.norm(A)**2  # step size
momentum = 0.9

for k in range(500):
    x_ball = ball_acceleration(x_ball, step_size=lr, momentum=momentum)
```

1. I’ll use plain font/bold font for scalars/vectors (respectively) as per my notation sheet

2. The heavy-ball method was introduced by Boris Polyak in “Some methods of speeding up the convergence of iteration methods” (1964). A closely related accelerated method was introduced by Yurii Nesterov in “A method of solving a convex programming problem with convergence rate $O(1/k^2)$”, published in volume 27 of Soviet Math Dokl (pdf).

3. See “A Differential Equation for Modeling Nesterov’s Accelerated Gradient Method: Theory and Insights” by Su, Boyd and Candes for a more detailed version of this post.

4. Formally, this function is non-convex.

5. If the friction comes from the fluid the ball is rolling through (e.g., air or water), then “viscous damping coefficient” is always between 0 and 1.

6. Unless the step size for gradient descent is tuned properly.

7. The irrelevant forces are the purely vertical forces exerted by the surface and by gravity.

8. Assumptions like “there’s enough fluid and it’s not sticky enough to influence the force of friction.”