I recently came across PyTorch, a new technology primed for optimization and machine learning. The docs make it look attractive, so immediately I wondered “how does it compare with NumPy?”

Turns out it’s a pretty nice framework that’s fast and straightforward to use. I’ll detail the speed before talking about ease-of-use.

Speed

The largest difference, and the largest potential slow-down, is gradient computation. PyTorch automatically computes the gradient given past computations, whereas in NumPy gradients have to be explicitly computed. Computing gradients is part of my daily workflow, and slowness here would mean that I could not use PyTorch.

I expected NumPy to be faster while computing gradients. How could it not be? It’s been around for a long time and has been heavily optimized. It’s a mature piece of software and widely used. Because of this, I expected NumPy to be at least 2x faster than PyTorch. Turns out that’s what we see when our implementation is tuned [1].

PyTorch is fast. For even moderate dimensions (say 1000 observations) it’s within a factor of 2 of NumPy. For more realistic dimensions (like 10,000 observations) it’s remarkably close.

The benchmark measures the time to compute the least squares gradient (the gradient with respect to $x$ of $\|y - Ax\|_2^2$, where $A \in \mathbb{R}^{10d \times d}$ and $10 \cdot d$ is the number of observations).
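
To make the benchmarked quantity concrete, here is a minimal sketch of the two approaches: a hand-written NumPy gradient and the one PyTorch derives automatically. The problem size, the seed, and the function names (`numpy_grad`, `torch_grad`) are assumptions for illustration; this is not the benchmark code behind the timings.

```python
import numpy as np
import torch

def numpy_grad(A, x, y):
    """Hand-derived gradient of ||y - A x||_2^2 with respect to x."""
    return 2 * (A.T @ (A @ x - y))

def torch_grad(A, x, y):
    """Same gradient, obtained from PyTorch's autograd."""
    A_t = torch.from_numpy(A)
    y_t = torch.from_numpy(y)
    # Clone so that tracking gradients does not touch the caller's array.
    x_t = torch.from_numpy(x).clone().requires_grad_(True)
    loss = torch.sum((y_t - A_t @ x_t) ** 2)
    loss.backward()          # fills x_t.grad
    return x_t.grad.numpy()

rng = np.random.default_rng(0)
d = 20                                 # illustrative size only
A = rng.standard_normal((10 * d, d))   # 10*d observations, d features
x = rng.standard_normal(d)
y = rng.standard_normal(10 * d)

g_np = numpy_grad(A, x, y)
g_torch = torch_grad(A, x, y)
print(np.allclose(g_np, g_torch))
```

With NumPy the derivative has to be worked out and typed in by hand; with PyTorch only the loss is written down.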

For moderate dimensions, PyTorch is as fast as NumPy when bound to the CPU – using a GPU with PyTorch can provide additional acceleration. Plus, PyTorch avoids the nasty pitfalls like the one above; due to a small mistake, my NumPy code ran 8x slower than it could.
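
The pitfall from note 1 is about where the scalar multiplication happens. The sketch below assumes the slow version scaled a large matrix before a matrix-vector product; the shapes and the names `scale_matrix_first` and `scale_result_last` are hypothetical, and the measured ratio will depend on the machine and the array sizes.

```python
import numpy as np
from timeit import timeit

rng = np.random.default_rng(0)
A = rng.standard_normal((2000, 200))   # assumed shapes, for illustration
r = rng.standard_normal(2000)

def scale_matrix_first():
    # Allocates and scales a 200 x 2000 temporary before multiplying.
    return (2 * A.T) @ r

def scale_result_last():
    # Scales only the length-200 result of the product.
    return 2 * (A.T @ r)

same = np.allclose(scale_matrix_first(), scale_result_last())
t_first = timeit(scale_matrix_first, number=20)
t_last = timeit(scale_result_last, number=20)
print(same, t_first, t_last)
```

Both groupings return the same vector; only the amount of intermediate work differs.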

I got curious and applied this to HIPS/autograd as well – it’s the most straightforward solution to connect automatic differentiation with NumPy. It’s fairly close to the PyTorch performance, at least for not-small $d$. I believe that HIPS/autograd and PyTorch are both using reverse mode automatic differentiation [2].

If I had a GPU on my local machine, PyTorch would be even faster. I could have made NumPy faster by using Numba’s CUDA GPU support and my earlier post “NumPy GPU acceleration”, but I wanted to test Anaconda’s default configuration [3].

There are other libraries [4] that have these same speed results – what else does PyTorch offer?

Extending PyTorch

PyTorch is not a Python binding to a monolithic C++ framework. Instead, most of the functionality is implemented as Python classes. This means that it’s easy to subclass these classes to write the code you want while having the functionality of PyTorch, and it’s easy to compare against other methods implemented in PyTorch. They even have a page titled “Extending PyTorch” in their docs!

NumPy/SciPy integration

The conversion between PyTorch tensors and NumPy arrays is simple, as the NumPy ndarray and PyTorch Tensor share the same memory locations (source). This can lead to significant time savings, especially when large arrays are used.
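
A small sketch of what sharing memory means in practice, assuming a CPU tensor: a write through either object is visible through the other, because no copy is made in either direction. The variable names are illustrative.

```python
import numpy as np
import torch

a = np.zeros(3)
t = torch.from_numpy(a)   # no copy: the tensor wraps the array's buffer
b = t.numpy()             # no copy in this direction either (CPU tensors)

a[0] = 1.0                # write through the original array
t[1] = 2.0                # write through the tensor

shares_memory = np.shares_memory(a, b)
print(a, t, shares_memory)
```

The flip side is that an in-place change on one side silently changes the other, so copy explicitly when independent data is needed.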

This means that it’s easy and fast to extend PyTorch with NumPy and SciPy. In the docs, they step through creating an extension with SciPy.
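
As a rough sketch of such an extension (in the spirit of the docs’ example, not a copy of it), the block below wraps a NumPy function in a `torch.autograd.Function`. The class name `NumpyExp` is hypothetical, and `np.exp` stands in for whatever NumPy or SciPy routine you want to call; the backward pass has to be supplied by hand because autograd cannot see inside NumPy.

```python
import numpy as np
import torch

class NumpyExp(torch.autograd.Function):
    """exp(x) computed by NumPy, with a hand-written backward pass."""

    @staticmethod
    def forward(ctx, x):
        # Leave autograd, compute with NumPy, and wrap the result again.
        out = torch.from_numpy(np.exp(x.detach().numpy()))
        ctx.save_for_backward(out)
        return out

    @staticmethod
    def backward(ctx, grad_output):
        (out,) = ctx.saved_tensors
        # d/dx exp(x) = exp(x), chained with the incoming gradient.
        return grad_output * out

x = torch.tensor([0.0, 1.0, 2.0], dtype=torch.float64, requires_grad=True)
y = NumpyExp.apply(x)
y.sum().backward()
grad_matches = torch.allclose(x.grad, torch.exp(x.detach()))
print(grad_matches)
```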

This is significant, and there are large speed benefits to this! When I compare converting to a NumPy $n\times n$ array from a TensorFlow or PyTorch tensor, I see this timing comparison:

This is run with TensorFlow’s static computation graph.

PyTorch is over 1000x faster than TensorFlow when converting to a 1000 $\times$ 1000 NumPy array!

This means we can use all of NumPy and SciPy without any fear of slowing our program down.

Dynamic computation graph

The biggest difference between PyTorch and other ML frameworks (TensorFlow, CNTK, MXNet, etc.) is that PyTorch has a dynamic computational graph, not a static computational graph. This allows for significant ease of use.

One benefit of this is that code executes when you expect. With dynamic computation graphs, tracebacks are easy to follow and you can use Python’s control flow as expected. Libraries with a static computation graph have to define and implement their own control flow. For example, see TensorFlow’s control flow docs or an SO question on the difficulty of timing in TensorFlow.
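
Here is a minimal sketch of ordinary Python control flow inside a differentiable computation. The function `repeated_scale` and its stopping rule are made up for illustration; the point is that the `while` loop depends on tensor values and autograd still follows whichever path was actually executed.

```python
import torch

def repeated_scale(x, max_steps=10):
    """Double x until its norm reaches 10, using a plain Python loop."""
    steps = 0
    while x.norm() < 10 and steps < max_steps:
        x = x * 2
        steps += 1
    return x, steps

x = torch.ones(3, requires_grad=True)
out, steps = repeated_scale(x)
out.sum().backward()   # gradient of the path that was actually taken
print(steps, x.grad)
```

The number of loop iterations is decided at run time from the data, and no graph-specific loop construct is needed.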

In my experience, TensorFlow models require holding more mental state than PyTorch models. PyTorch has clear function arguments because the code executes when expected. It’s not necessary to link together the input data to the model and (in my experience) there are fewer global variables.

TensorFlow’s dynamic computation graph

Note: I created this section on 2020-07-22.

How does modern TensorFlow’s dynamic graph compare? TensorFlow enabled “eager execution” by default in v2.0.0. How does that affect performance?

Unlike the graph above, we are converting an array of length $N$, not an $N \times N$ array. Either way, it’s still converting up to 10 million floating point numbers. That’s a realistic size: even a small deep neural network like ResNet-18 has over 11 million weights:

>>> from torchvision.models import resnet18
>>> layer_weights = [x.nelement() for x in resnet18().parameters()]
>>> "{:0.2f} million parameters".format(sum(layer_weights) / 1e6)
'11.69 million parameters'


Further benefits

PyTorch is already an attractive package, but they also offer:

• torch.multiprocessing. Similar to the standard Python multiprocessing, but “with magical memory sharing of torch Tensors across processes.”
• torch.distributed to communicate between distributed machines.
• GPU access, which can speed up code as mentioned above.
• PyTorch is memory efficient: “The memory usage in PyTorch is extremely efficient compared to Torch or some of the alternatives”, according to pytorch.org.

Notes

1. Refactoring my NumPy code to use 2 * (x*y) instead of (2*x) * y led to an 8x improvement in speed, as discovered in ScratchNet#3.

2. I added this paragraph on 2018-08-04.

3. Which includes MKL and has other optimizations (maybe Intel’s TBB?).

4. Like TensorFlow, MXNet and Theano (google trends).