PyTorch: fast and simple
I recently came across PyTorch, a new framework for optimization and machine learning. The docs make it look attractive, so I immediately wondered, “how does it compare with NumPy?”
Turns out it’s a pretty nice framework that’s fast and straightforward to use. I’ll detail the speed before talking about ease of use.
Speed
The largest difference, and the largest potential slowdown, is gradient computation. PyTorch automatically computes gradients from past computations, whereas in NumPy they have to be derived and coded by hand. Computing gradients is part of my daily workflow, and slowness here would mean that I could not use PyTorch.
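To make the difference concrete, here’s a minimal sketch (the variable names are mine, not from any benchmark): the least squares loss $\|y - Ax\|_2^2$ has the hand-derived gradient $2A^T(Ax - y)$, which NumPy requires you to write out, while PyTorch derives it from the forward computation.

```python
import numpy as np
import torch

A_np = np.random.randn(50, 5)
y_np = np.random.randn(50)
x_np = np.random.randn(5)

# NumPy: the gradient must be derived by hand and coded explicitly
grad_np = 2 * A_np.T @ (A_np @ x_np - y_np)

# PyTorch: autograd derives the same gradient from the forward pass
A, y = torch.from_numpy(A_np), torch.from_numpy(y_np)
x = torch.from_numpy(x_np).clone().requires_grad_()
loss = ((y - A @ x) ** 2).sum()
loss.backward()  # fills in x.grad

assert np.allclose(x.grad.numpy(), grad_np)
```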
I expected NumPy to be faster at computing gradients. How could it not be? It’s been around for a long time and has been heavily optimized. It’s a mature piece of software and widely used. Because of this, I expected NumPy to be at least 2x faster than PyTorch. Turns out that’s not what we see, even when our implementation is tuned.^{1}
PyTorch is fast. For even moderate dimensions (say 1000 observations) it’s within a factor of 2 of NumPy. For more realistic dimensions (like 10,000 observations) it’s remarkably close.
The figure shows the time to compute the least squares gradient (the gradient with respect to $x$ of $\|y - Ax\|_2^2$ when $A \in \mathbb{R}^{10d \times d}$, so $10 \cdot d$ is the number of observations).
For moderate dimensions, PyTorch is as fast as NumPy when bound to the CPU – using a GPU with PyTorch can provide additional acceleration. Plus, PyTorch helps avoid nasty pitfalls; due to a small mistake, my NumPy code ran 8x slower than it could have.
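As a hedged sketch (the variable names and sizes are mine), the same gradient code targets CPU or GPU just by choosing a device:

```python
import torch

# Use the GPU when one is available; otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

A = torch.randn(1000, 100, device=device)
y = torch.randn(1000, device=device)
x = torch.randn(100, device=device, requires_grad=True)

loss = ((y - A @ x) ** 2).sum()
loss.backward()  # x.grad lands on the same device as x
```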
I got curious and benchmarked HIPS/autograd as well – it’s the most straightforward way to connect automatic differentiation with NumPy. It’s fairly close to PyTorch’s performance, at least for not-small $d$. I believe that HIPS/autograd and PyTorch both use reverse-mode automatic differentiation.^{2}
If I had a GPU on my local machine, PyTorch would be even faster. I could have made NumPy faster by using Numba’s CUDA GPU support, as in my earlier post “NumPy GPU acceleration”, but I wanted to test Anaconda’s default configuration.^{3}
There are other libraries^{4} that achieve similar speeds – what else does PyTorch offer?
Extending PyTorch
PyTorch is not a Python binding to a monolithic C++ framework. Instead, most of the functionality is implemented as Python classes. This means that it’s easy to subclass these classes to write the code you want while keeping the functionality of PyTorch, and it’s easy to compare against other methods implemented in PyTorch. They even have a page titled “Extending PyTorch” in their docs!
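For instance, a model can be written as an ordinary subclass of torch.nn.Module (this toy module is my own illustration, not an example from the docs):

```python
import torch

class LeastSquares(torch.nn.Module):
    """A toy module holding the variable x for the loss ||y - A x||^2."""

    def __init__(self, d):
        super().__init__()
        self.x = torch.nn.Parameter(torch.zeros(d))

    def forward(self, A, y):
        return ((y - A @ self.x) ** 2).sum()

model = LeastSquares(5)
A, y = torch.randn(50, 5), torch.randn(50)
loss = model(A, y)
loss.backward()  # gradients appear in model.x.grad as usual
```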
NumPy/SciPy integration
The conversion between PyTorch tensors and NumPy arrays is simple because the NumPy ndarray and PyTorch Tensor share the same memory locations (source). This can lead to significant time savings, especially when large arrays are used.
This means that it’s easy and fast to extend PyTorch with NumPy and SciPy. In the docs, they step through creating an extension with SciPy.
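A small sketch of the shared memory (my own example): mutating the tensor in place is visible through the NumPy array, because no copy was made.

```python
import numpy as np
import torch

a = np.zeros(3)
t = torch.from_numpy(a)  # no copy: t and a view the same memory
t += 1                   # modify the tensor in place...
print(a)                 # ...and the NumPy array sees the change: [1. 1. 1.]
```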
This is significant, and there are large speed benefits! When I compare converting an $n \times n$ tensor to a NumPy array from Tensorflow and from PyTorch, I see this timing comparison:
This is run with Tensorflow’s static computation graph.
PyTorch is over 1000x faster than TensorFlow when converting to a 1000 $\times$ 1000 NumPy array!
This means we can use all of NumPy and SciPy without any fear of slowing our program down.
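One way to see why (a sketch of mine; exact numbers vary by machine): because Tensor.numpy() makes no copy, the conversion takes roughly constant time no matter how large the array is.

```python
import timeit
import torch

t = torch.randn(1000, 1000)
seconds = timeit.timeit(t.numpy, number=1000) / 1000
print(f"{seconds * 1e6:.2f} µs per conversion")  # tiny: no data is copied
```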
Dynamic computation graph
The biggest difference between PyTorch and other ML frameworks (Tensorflow, CNTK, MXNet, etc) is that PyTorch has a dynamic computational graph, not a static computational graph. This allows for significant ease of use.
One benefit of this is that code executes when you expect. With dynamic computation graphs, tracebacks are easy to follow and control flow works as expected. Libraries with a static computation graph have to define their own control flow constructs. For example, see Tensorflow’s control flow docs or an SO question on the difficulty of timing in Tensorflow.
In my experience, Tensorflow models require holding more mental state than PyTorch models. PyTorch has clear function arguments because the code executes when expected. It’s not necessary to wire the input data into the model ahead of time, and (in my experience) there are fewer global variables.
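Because the graph is built as the code runs, plain Python control flow participates in differentiation. A minimal sketch (my own example):

```python
import torch

def maybe_double(x):
    # An ordinary Python `if` -- no special control-flow ops needed
    if x.sum() > 0:
        return 2 * x
    return x

x = torch.ones(3, requires_grad=True)
maybe_double(x).sum().backward()
print(x.grad)  # tensor([2., 2., 2.]) since the `if` branch was taken
```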
Tensorflow’s dynamic computation graph
Note: I created this section on 2020-07-22.
How does modern Tensorflow’s dynamic graph compare? Tensorflow enabled “eager execution” by default in v2.0.0. How does that affect performance?
Unlike the graph above, we are converting an array of length $N$, not an $N \times N$ array. Either way, it’s still converting up to 10 million floating point numbers. That’s pretty common: even a small deep neural network like ResNet-18 has about 11 million weights:
>>> from torchvision.models import resnet18
>>> layer_weights = [x.nelement() for x in resnet18().parameters()]
>>> "{:0.2f} million parameters".format(sum(layer_weights) / 1e6)
'11.69 million parameters'
Further benefits
 torch.multiprocessing. Similar to the standard Python multiprocessing,
but “with magical memory sharing of torch Tensors across processes.”
 They even have an example Hogwild implementation!
 torch.distributed to communicate between distributed machines.
 GPU access which can speed up code as exemplified above.
 PyTorch is memory efficient: “The memory usage in PyTorch is extremely efficient compared to Torch or some of the alternatives”, according to pytorch.org.
PyTorch is already an attractive package, but they also offer
 Datasets and pretrained models at pytorch/vision
 Many examples and implementations, with a subset available at
 A strong community with a discussion board and an SO tag
Notes
 Chainer has a good comparison of many deep learning frameworks including PyTorch, Tensorflow and MXNet: http://chainer.readthedocs.io/en/latest/comparison.html (added 2017-09-19)
 If you care about speed (added 2017-09-19):
 A speed comparison between many different frameworks can be found at soumith/convnet-benchmarks. This measures many different frameworks, notably Torch but not PyTorch (though this tweet by a PyTorch core dev says they both call the same C libraries).
 Some anecdotal evidence (tensorflow#9322, tensorflow#7065, PyTorch forum thread) points to PyTorch being faster than Tensorflow.
 On exporting models (added 2018-02-17):
 PyTorch has good support for exporting to the Open Neural Network Exchange (ONNX) through torch.onnx
 A well written article on deploying PyTorch trained NN on iOS: “How I Shipped a Neural Network on iOS with CoreML, PyTorch, and React Native”
 And here’s a good graph I found on the export options of different libraries and the systems they can run on: http://tvmlang.org/2017/10/06/nnvm-compiler-announcement.html
 ONNX to CoreML: https://github.com/onnx/onnx-coreml
 fast.ai announced they’re “Introducing PyTorch for fast.ai” (added 2017-09-08). Their motivation includes
 “in a recent Kaggle competition [PyTorch] was used by nearly all of the top 10 finishers”
 “Much to our surprise, we also found that many models trained quite a lot faster on pytorch than they had on Tensorflow.”
 “Because Pytorch allowed us, and our students, to use all of the flexibility and capability of regular python code to build and train neural networks, we were able to tackle a much wider range of problems.”
 O’Reilly podcast on PyTorch, part of my motivation for checking out PyTorch
 PyTorch’s core development team has 4 members
 I think PyTorch performs reverse-mode automatic differentiation.
 Other autograd implementations (and inspiration for PyTorch): HIPS/autograd, twitter/torch-autograd, Chainer.
 PyTorch can work with TensorBoard via tensorboard-pytorch
 A good overview between Theano+Lasagne, PyTorch and Tensorflow on Reddit’s /r/machinelearning by /u/ajmooch
 Inspired by Chainer and similar to TensorFlow, Theano, Caffe and CNTK

Refactoring my NumPy code to use 2 * (x*y) instead of (2*x) * y led to an 8x improvement in speed, as discovered in ScratchNet#3. ↩
I added this paragraph on 2018-08-04. ↩

Which includes MKL and has other optimizations (maybe Intel’s TBB?) ↩

like Tensorflow, MXNet and Theano (google trends). ↩