Scott Sievert: a grad student at the University of Wisconsin, interested in optimization and passionate about skiing!
http://stsievert.com/
Fri, 08 Sep 2017 21:28:44 -0500 (Jekyll v3.1.3)

PyTorch: fast and simple

<p>I recently came across PyTorch, a new technology primed for optimization and
machine learning. The docs make it look attractive, so immediately I wondered
“how does it compare with NumPy?”</p>
<p>Turns out it’s a pretty nice framework that’s fast and straightforward to use.
I’ll detail the speed before talking about ease-of-use.</p>
<!--More-->
<h2 id="speed">Speed</h2>
<p>The largest difference, and the largest potential slow-down, is gradient
computation. PyTorch automatically computes the gradient given past
computations, whereas in NumPy gradients have to be derived and implemented by
hand. Computing gradients is part of my daily workflow, and slowness here would
mean that I could not use PyTorch.</p>
<p>I expected NumPy to be faster while computing gradients. How could it not be?
It’s been around for a long time and has been heavily optimized. It’s a mature
piece of software and widely used. Because of this, I expected NumPy to be at
least 2x faster than PyTorch.</p>
<p>PyTorch is faster. Not by a small margin like 10% (which would still be
significant!) but by a whopping 8x. That’s right – an explicit NumPy gradient
computation takes 8 times longer than having PyTorch do more work to compute
the gradient automatically. This is because PyTorch uses parallelism by
default, something that’s not obvious to enable with NumPy (even with
Anaconda).</p>
<p>That convinced me that PyTorch is a serious contender. I decided to verify
that other, less fancy computations (e.g., <code class="highlighter-rouge">svd</code> or <code class="highlighter-rouge">sqrt</code>) were as fast as NumPy’s.
The most important result from <a href="https://github.com/stsievert/pytorch-timing-comparisons">my timing comparisons</a> is</p>
<p><img src="/images/posts/2017-pytorch/pytorch-vs-numpy.png" alt="" /></p>
<p>which shows the time to compute the least squares gradient (the gradient with
respect to $x$ of $\norm{y - Ax}^2_2$ when $A \in \R^{10d~\times~d}$).</p>
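<p>For concreteness, the explicit NumPy gradient being timed can be sketched as below (the variable names and the finite-difference sanity check are my own; in PyTorch, the same gradient comes from calling <code class="highlighter-rouge">backward()</code> on the loss):</p>

```python
import numpy as np

def grad(A, x, y):
    """Explicit gradient of ||y - A @ x||_2^2 with respect to x."""
    return 2 * A.T @ (A @ x - y)

rng = np.random.RandomState(42)
d = 50
A = rng.randn(10 * d, d)   # A is a (10d x d) matrix, as in the timing test
x = rng.randn(d)
y = rng.randn(10 * d)
g = grad(A, x, y)

# sanity check: compare one coordinate against a finite-difference estimate
f = lambda x: np.linalg.norm(y - A @ x) ** 2
eps = 1e-6
e0 = np.zeros(d)
e0[0] = eps
estimate = (f(x + e0) - f(x - e0)) / (2 * eps)
assert abs(g[0] - estimate) / abs(estimate) < 1e-4
```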
<p>This test can be run in parallel. My machine has 8 virtual cores, which is why
PyTorch is 8x faster than NumPy for large $d$. If I had a GPU on my local
machine this would be even faster. I could have made NumPy faster by using
<a href="http://numba.pydata.org/numba-doc/dev/cuda/index.html">Numba’s CUDA GPU support</a> and my earlier post “<a href="/blog/2016/07/01/numpy-gpu/">NumPy GPU acceleration</a>”, but I
wanted to test Anaconda’s default configuration<sup id="fnref:mkl"><a href="#fn:mkl" class="footnote">1</a></sup>.</p>
<p>There are other libraries<sup id="fnref:lib"><a href="#fn:lib" class="footnote">2</a></sup> that have these same speed results – what else does
PyTorch offer?</p>
<h2 id="extending-pytorch">Extending PyTorch</h2>
<p>PyTorch is not a Python binding to a monolithic C++ framework. Instead, most of
the functionality is implemented as Python classes. This means that it’s easy
to subclass these classes, writing the code you want while keeping the
functionality of PyTorch, and it’s easy to compare against other methods
implemented in PyTorch. They even have a page titled “<a href="http://pytorch.org/docs/master/notes/extending.html">Extending PyTorch</a>” in their
docs!</p>
<h3 id="numpyscipy-integration">NumPy/SciPy integration</h3>
<p>The conversion between PyTorch tensors and NumPy arrays is <em>simple</em> because
the NumPy <code class="highlighter-rouge">ndarray</code> and the PyTorch <code class="highlighter-rouge">Tensor</code> share the same memory locations
(<a href="http://pytorch.org/tutorials/beginner/former_torchies/tensor_tutorial.html#numpy-bridge">source</a>). This can lead to significant time savings, especially when
large arrays are used.</p>
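<p>A minimal sketch of this bridge (assuming PyTorch is installed; <code class="highlighter-rouge">torch.from_numpy</code> and <code class="highlighter-rouge">Tensor.numpy</code> are from the linked tutorial). Because no copy is made, a mutation on one side is visible on the other:</p>

```python
import numpy as np
import torch

a = np.ones(3)
t = torch.from_numpy(a)     # wraps the same memory; no copy
a[0] = 42.0
assert t[0].item() == 42.0  # the change shows up through the Tensor

b = t.numpy()               # Tensor -> ndarray, also without a copy
t[1] = 7.0
assert b[1] == 7.0          # ...and the change shows up through the ndarray
```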
<p>This means that it’s easy and fast to extend PyTorch with NumPy and SciPy. In
the docs, they step through creating <a href="http://pytorch.org/tutorials/advanced/numpy_extensions_tutorial.html">an extension with SciPy</a>.</p>
<p>This is significant, and there are large speed benefits to this! When I compare
converting to a NumPy $n\times n$ array from a Tensorflow or PyTorch tensor, I
see this timing comparison:</p>
<p><img src="/images/posts/2017-pytorch/tensorflow-eval-log.png" alt="" /></p>
<p>That’s right – PyTorch is over 1000x faster than TensorFlow when converting to
a 1000 $\times$ 1000 NumPy array!</p>
<p>This means we can use all of NumPy and SciPy without any fear of slowing our
program down.</p>
<h2 id="dynamic-computation-graph">Dynamic computation graph</h2>
<p>The biggest difference between PyTorch and other ML frameworks (Tensorflow,
CNTK, MXNet, etc) is that PyTorch has a <strong>dynamic</strong> computational graph, not a
<strong>static</strong> computational graph. This allows for significant ease of use.</p>
<p>One benefit of this is that code executes when you expect. Tensorflow uses
function definitions with an asynchronous C++ library, meaning that the
computational graph is defined <em>before</em> running. In PyTorch, the graph is
defined <em>by</em> running. As a result, PyTorch tracebacks are easy to follow –
they’re not an additional asynchronous traceback on top of the traceback of
interest.</p>
<p>In my experience, Tensorflow models require holding more mental state than
PyTorch models. PyTorch has clear function arguments because the code executes
when expected. It’s not necessary to link the input data to the model, and (in
my experience) there are fewer global variables.</p>
<h2 id="further-benefits">Further benefits</h2>
<ul>
<li><strong><a href="http://pytorch.org/docs/master/multiprocessing.html">torch.multiprocessing</a></strong>. Similar to the standard Python multiprocessing,
but “with magical memory sharing of torch Tensors across processes.”
<ul>
<li>They even have <a href="http://pytorch.org/docs/master/notes/multiprocessing.html#hogwild">an example Hogwild implementation</a>!</li>
</ul>
</li>
<li><strong><a href="http://pytorch.org/docs/master/distributed.html">torch.distributed</a></strong> to communicate between distributed machines.</li>
<li><strong>GPU access</strong> which can speed up code as exemplified above.</li>
<li>PyTorch is memory efficient: “The memory usage in PyTorch is extremely
efficient compared to Torch or some of the alternatives”, according to <a href="http://pytorch.org">pytorch.org</a></li>
</ul>
<p>PyTorch is already an attractive package, but they also offer</p>
<ul>
<li><strong>Datasets and pretrained models</strong> at <a href="https://github.com/pytorch/vision">pytorch/vision</a></li>
<li><strong>Many examples and implementations</strong>, with a subset available at
<ul>
<li><a href="https://github.com/pytorch/examples">pytorch/examples</a></li>
<li><a href="https://github.com/bharathgs/Awesome-pytorch-list">bharathgs/Awesome-pytorch-list</a></li>
<li><a href="https://github.com/ritchieng/the-incredible-pytorch">ritchieng/the-incredible-pytorch</a></li>
</ul>
</li>
<li><strong>A strong community</strong> with <a href="https://discuss.pytorch.org/">a discussion board</a> and <a href="https://stackoverflow.com/questions/tagged/pytorch">an SO tag</a></li>
</ul>
<h2 id="notes">Notes</h2>
<ul>
<li><a href="https://www.oreilly.com/ideas/why-ai-and-machine-learning-researchers-are-beginning-to-embrace-pytorch">O’Reilly podcast on PyTorch</a>, part of my motivation for checking out PyTorch
<ul>
<li>PyTorch’s core development team has 4 members</li>
</ul>
</li>
<li>I think PyTorch performs reverse-mode auto-differentiation.
<ul>
<li>Other autograd implementations (and inspiration for PyTorch):
<a href="https://github.com/HIPS/autograd">HIPS/autograd</a>, <a href="https://github.com/twitter/torch-autograd">twitter/torch-autograd</a>, <a href="https://chainer.org">Chainer</a>.</li>
<li>It looks like it performs <a href="https://en.wikipedia.org/wiki/Automatic_differentiation#Reverse_accumulation">reverse accumulation automatic differentiation</a></li>
</ul>
</li>
<li>PyTorch can work with <a href="https://github.com/tensorflow/tensorboard">tensorboard</a> with <a href="https://github.com/lanpa/tensorboard-pytorch">tensorboard-pytorch</a></li>
<li><a href="https://www.reddit.com/r/MachineLearning/comments/5w3q74/d_so_pytorch_vs_tensorflow_whats_the_verdict_on/de72nnr/">A good overview</a> between Theano+Lasagne, PyTorch and Tensorflow on Reddit’s
/r/machinelearning by <a href="https://www.reddit.com/user/ajmooch">/u/ajmooch</a></li>
<li>Inspired by <a href="https://chainer.org">Chainer</a> and similar to <a href="https://www.tensorflow.org">TensorFlow</a>, <a href="http://www.deeplearning.net/software/theano/">Theano</a>, <a href="https://caffe2.ai">Caffe</a> and <a href="https://docs.microsoft.com/en-us/cognitive-toolkit/">CNTK</a></li>
<li>[added 2017-09-08] fast.ai announced they’re “<a href="http://www.fast.ai/2017/09/08/introducing-pytorch-for-fastai/">Introducing PyTorch for fast.ai</a>”. Their motivation includes
<ul>
<li>“in <a href="https://www.kaggle.com/c/planet-understanding-the-amazon-from-space">a recent Kaggle competition</a> [PyTorch] was used by nearly all of the top 10
finishers”</li>
<li>“Much to our surprise, we also found that many models trained quite a lot
faster on pytorch than they had on Tensorflow.”</li>
<li>“Because Pytorch allowed us, and our students, to use all of the
flexibility and capability of regular python code to build and train
neural networks, we were able to tackle a much wider range of problems.”</li>
</ul>
</li>
</ul>
<div class="footnotes">
<ol>
<li id="fn:mkl">
<p>Which includes MKL and other optimizations (maybe Intel’s TBB?) <a href="#fnref:mkl" class="reversefootnote">↩</a></p>
</li>
<li id="fn:lib">
<p>like Tensorflow, MXNet and Theano (<a href="https://trends.google.com/trends/explore?date=today%205-y&q=mxnet,theano,tensorflow,pytorch">google trends</a>). <a href="#fnref:lib" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
Thu, 07 Sep 2017 03:30:00 -0500
http://stsievert.com/blog/2017/09/07/pytorch/
http://stsievert.com/blog/2017/09/07/pytorch/
optimization, computing, speed, framework

Holoviews interactive visualization

<p>I often want to provide some simple interactive visualizations for this blog. I
like to include visualization to give some sense of how the data change as
various parameters are changed. Examples can be found in <em><a href="/blog/2015/12/09/inverse-part-2/">Finding sparse
solutions to linear systems</a></em>, <em><a href="/blog/2015/11/19/inverse-part-1/">Least squares and regularization</a></em>, and
<em><a href="/blog/2015/04/23/image-sqrt/">Computer color is only kinda broken</a></em>.</p>
<p>I have discovered a new tool, Holoviews, to create these widgets. I want to
create interactive widgets for my blog, meaning I want to embed them in
a static HTML page. Previously, I used Jake Vanderplas’s <a href="https://github.com/jakevdp/ipywidgets-static">ipywidgets-static</a>,
but in this post I’ll walk through creating a widget with Holoviews.</p>
Please visit my blog to see the full post – it includes an interactive widget! (That’s the reason I can’t include the full post here.)
Sat, 22 Jul 2017 03:30:00 -0500
http://stsievert.com/blog/2017/07/22/holoviews/
http://stsievert.com/blog/2017/07/22/holoviews/
data-visualization, web

Apple CoreML model conversion

<p>Apple has created a new file format for machine learning models. These files
can be used to generate predictions easily, regardless of the creation process,
which is why “<a href="http://deepdojo.com/apple-introduces-core-ml">Apple Introduces Core ML</a>” draws an analogy between these files and
PDFs. It’s possible to generate predictions with <em>only</em> this file, and none of
the creation libraries.</p>
<p>Generating predictions is a pain point faced by data scientists today, and it
often involves the underlying math. At best, this involves training the model
in Python and then calling the underlying C library in the production
app.</p>
<p>This file format will only become widely used if easy conversion from popular
machine learning libraries is possible and predictions are simple to generate.
Apple made these claims during their WWDC 2017 keynote. I want to investigate
their claim.</p>
<p><img src="/images/posts/2017-coreml/conversion.jpg" alt="" /></p>
<!--More-->
<p>Specifically, Apple claimed easy integration between their <code class="highlighter-rouge">.mlmodel</code> file format and
various Python libraries. It’s easy to integrate these files into an app (literally
via drag-and-drop) or into another Python program.</p>
<h2 id="file-creation">File creation</h2>
<p>Apple’s <a href="https://pythonhosted.org/coremltools">coremltools</a> Python package makes generation of this <code class="highlighter-rouge">.mlmodel</code> file
straightforward:</p>
<ol>
<li>Train a model via scikit-learn, Keras, Caffe or XGBoost (see docs for
<a href="https://pythonhosted.org/coremltools/#conversion-support">conversion support</a> for different library versions)</li>
<li>Generate a <code class="highlighter-rouge">coreml_model</code> with <code class="highlighter-rouge">converters.[library].convert(model)</code></li>
<li>(optional) Add metadata (e.g., feature names, author, short description)</li>
<li>Save the model with <code class="highlighter-rouge">coreml_model.save</code></li>
</ol>
<p>coremltools prints helpful error messages in my (brief) experience. When using
<code class="highlighter-rouge">converters.sklearn.convert</code> it gave a helpful error message indicating that
class labels should either be of type <code class="highlighter-rouge">int</code> or <code class="highlighter-rouge">str</code> (not <code class="highlighter-rouge">float</code> like I was
using).</p>
<p>Here’s the complete script for the <code class="highlighter-rouge">.mlmodel</code> file generation:</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">coremltools</span>
<span class="kn">from</span> <span class="nn">sklearn.svm</span> <span class="kn">import</span> <span class="n">LinearSVC</span>
<span class="k">def</span> <span class="nf">train_model</span><span class="p">():</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">LinearSVC</span><span class="p">()</span>
<span class="c"># ...</span>
<span class="k">return</span> <span class="n">model</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">train_model</span><span class="p">()</span>
<span class="n">coreml_model</span> <span class="o">=</span> <span class="n">coremltools</span><span class="o">.</span><span class="n">converters</span><span class="o">.</span><span class="n">sklearn</span><span class="o">.</span><span class="n">convert</span><span class="p">(</span><span class="n">model</span><span class="p">)</span>
<span class="n">coreml_model</span><span class="o">.</span><span class="n">author</span> <span class="o">=</span> <span class="s">'Scott Sievert'</span> <span class="c"># other attributes can be added</span>
<span class="n">coreml_model</span><span class="o">.</span><span class="n">save</span><span class="p">(</span><span class="s">'sklearn.mlmodel'</span><span class="p">)</span>
</code></pre>
</div>
<p>Yup, creation of these <code class="highlighter-rouge">.mlmodel</code> files is as easy as Apple claims. Even
better, it appears this file format has <a href="http://pythonhosted.org/coremltools/generated/coremltools.models.MLModel.html">integration with named features</a> and
Pandas.</p>
<p>The generation of this file is <em>easy</em>. Now, where can these files be used?</p>
<p>These <code class="highlighter-rouge">.mlmodel</code> files can be included on any device that supports CoreML. The
format will not be tied to iOS/macOS apps, though these files will certainly be used
there. It will allow general and easy use in Python for both saving and
prediction. Given Apple’s expansion of Swift to other operating systems, I
don’t believe it will be tied to a particular operating system.</p>
<h2 id="prediction">Prediction</h2>
<p>Prediction is as easy as saving:</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="n">coremlmodel</span> <span class="o">=</span> <span class="n">coremltools</span><span class="o">.</span><span class="n">models</span><span class="o">.</span><span class="n">MLModel</span><span class="p">(</span><span class="s">'sklearn.mlmodel'</span><span class="p">)</span>
<span class="n">coremlmodel</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">example</span><span class="p">)</span> <span class="c"># `example` format should mirror training examples</span>
</code></pre>
</div>
<p>However, I can’t test it as macOS 10.13 (currently in beta) is needed.</p>
<h2 id="difficulties">Difficulties</h2>
<p>These difficulties were resolved quickly. Here’s what I ran into while generating
this post:</p>
<ul>
<li>CoreML depends on Python 2.7</li>
<li>Limited version support when converting (e.g., Keras 2 is not supported but Keras 1.2 is).</li>
</ul>
<p>The largest potential difficulty I see is the limited scope of coremltools.
There could be issues with versions of different libraries, and not all
classifiers in sklearn are supported (<a href="https://pythonhosted.org/coremltools/generated/coremltools.converters.sklearn.convert.html#coremltools.converters.sklearn.convert">supported
sklearn models</a>).</p>
Sun, 11 Jun 2017 03:30:00 -0500
http://stsievert.com/blog/2017/06/11/coreml/
http://stsievert.com/blog/2017/06/11/coreml/
apple, ios, machine-learning

Atmosphere and entropy

<p>I recently learned an abstract mathematical theorem, and stumbled across a
remarkably direct physical measurement of it. I’ll give background to this
theorem before introducing it, then I’ll show the direct measurement with
physical data.</p>
<p>This theorem has to do with entropy, which is clouded in mystery. There are
<a href="https://en.wikipedia.org/wiki/Entropy_(disambiguation)">several types of entropy</a> and, during the naming of one type, <a href="https://en.wikipedia.org/wiki/John_von_Neumann">Von Neumann</a>
suggested the name “entropy” to <a href="https://en.wikiquote.org/wiki/Claude_Elwood_Shannon">Claude Shannon</a> in 1948 because</p>
<blockquote>
<p>In the first place your uncertainty function has been used in statistical
mechanics under that name, so it already has a name. In the second place,
and more important, no one really knows what entropy really is, so in a
debate you will always have the advantage.</p>
</blockquote>
<!--More-->
<p><a href="https://en.wikipedia.org/wiki/Entropy_(disambiguation)">Entropy</a> fundamentally measures the uncertainty in a system. In “information
theory” it formalizes the “randomness” of a random variable. If a random
variable is uncertain, it has entropy. If a random variable is deterministic or
only takes one value, it has 0 entropy.</p>
<p>Entropy is fundamentally related to the flow of information, which is studied
in information theory. If a message can only take one state, no information can
be transmitted; how would communication happen if everything is static? But if
it can take two states, it’s possible to communicate one bit of information
(i.e., an answer to “is the light on?”)</p>
<p>We care about maximizing entropy because we want to receive information
quickly. If a message can take 4 states instead of 2, it’s possible to transmit
twice as much information.</p>
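<p>A small numerical illustration of these two facts (my own sketch, using <code class="highlighter-rouge">scipy.stats.entropy</code>):</p>

```python
import math
from scipy.stats import entropy

# a message with only one possible state carries no information
assert entropy([1.0]) == 0.0

# a fair coin (2 states) carries one bit; 4 equally likely states carry two
assert math.isclose(entropy([0.5, 0.5], base=2), 1.0)
assert math.isclose(entropy([0.25] * 4, base=2), 2.0)
```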
<p>A typical statement about entropy maximization looks like:</p>
<blockquote>
<p>For a positive random variable $X$ with fixed mean $\mu$, the entropy of $X$ is
maximized when $X$ is an exponential random variable with mean $\mu$ (rate
parameter $\frac{1}{\mu}$).</p>
</blockquote>
<p>It doesn’t seem like there’d be a direct physical example that supports this
theorem. But, there is, and it has to do with air pressure as a function of
height.</p>
<p>Let’s take a column of the earth’s atmosphere, and ignore any weather or
temperature effects. <strong>How does the air pressure in this column vary with
height?</strong> The air pressure at sea level is very different than the air pressure
where the ISS orbits.</p>
<p>An air particle’s height is the random variable we’ll study. Height is a
positive variable, and a column of air will have a mean height $\mu$. We can
apply the statement above if we can assume air particles maximize entropy.
When this example was presented in lecture<sup id="fnref:info-theory-varun"><a href="#fn:info-theory-varun" class="footnote">1</a></sup>, I somewhat incredulously asked if this could be applied to Earth’s air pressure.</p>
<p>Air particles maximizing entropy is a fair assumption. An air particle’s
position is uniformly random when it’s contained in a small region. Given the
well-known fact that uniform random variables have maximum entropy, this seems
like a safe assumption.</p>
<p>So by the statement above we’d expect the air molecules’ heights to follow an
exponential distribution. Pressure is a proxy for how many air particles are
present, and we’d expect that pressure as a function of
height to look like the <a href="https://en.wikipedia.org/wiki/Barometric_formula">barometric formula</a>:</p>
<script type="math/tex; mode=display">P(h) = c_1 e^{-c_2 h}</script>
<p>where $c_1$ and $c_2$ are constants that hide weather/temperature.</p>
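<p>This maximum-entropy statement can also be checked numerically. The sketch below (my own, using <code class="highlighter-rouge">scipy.stats</code>) compares a few positive distributions constructed to share the same mean; the exponential has the largest differential entropy:</p>

```python
import numpy as np
from scipy import stats

mu = 8.0  # the fixed mean (think: average particle height, in km)

# three positive distributions, each constructed to have mean mu
expon = stats.expon(scale=mu)                             # mean = mu
uniform = stats.uniform(loc=0, scale=2 * mu)              # mean = mu
halfnorm = stats.halfnorm(scale=mu * np.sqrt(np.pi / 2))  # mean = mu

for dist in (expon, uniform, halfnorm):
    assert np.isclose(dist.mean(), mu)

# among distributions with the same mean, the exponential maximizes entropy
assert expon.entropy() > uniform.entropy()
assert expon.entropy() > halfnorm.entropy()
```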
<p>NOAA collects the data required to test this theory. They launch weather
balloons through the <a href="https://www.ncdc.noaa.gov/data-access/weather-balloon/integrated-global-radiosonde-archive">NOAA IGRA</a> program. These weather balloons collect
atmospheric data at different altitudes, and these data are available
online<sup id="fnref:data-descrip"><a href="#fn:data-descrip" class="footnote">2</a></sup>. They record monthly averages of pressure at different
heights at each station from 1909 to 2017<sup id="fnref:entropy-def-time"><a href="#fn:entropy-def-time" class="footnote">3</a></sup>. These data are
from at least 600,000 weather balloon launches from a total of 1,532
stations.</p>
<p>We use these data to visualize pressure at different heights. We know that
this curve is characterized by $\mu$, the average height of an air molecule.
I’ve calculated this value from these data and have plotted expected pressure
at any height, given by $P(h) = \frac{1}{\mu}\Exp{\frac{-h}{\mu}}$.</p>
<p>I show the measured pressures at different heights using the 314 million NOAA
data points. This is shown alongside an appropriately scaled version of $P(h)$ given
the average air particle height $\mu$.</p>
<p><img src="/images/posts/2017-entropy/pressures.png" alt="" /></p>
<p>The theorem as stated earlier holds true for an air particle’s height: air
pressure<sup id="fnref:proxy"><a href="#fn:proxy" class="footnote">4</a></sup> at different heights follows an exponential distribution.</p>
<p><em>The plot for this post was generated at <a href="https://github.com/stsievert/air-pressure-heights">stsievert/air-pressure-height</a></em></p>
<div class="footnotes">
<ol>
<li id="fn:info-theory-varun">
<p>In ECE 729: Information Theory taught by Varun Jog <a href="#fnref:info-theory-varun" class="reversefootnote">↩</a></p>
</li>
<li id="fn:data-descrip">
<p>Summarized at <a href="https://www1.ncdc.noaa.gov/pub/data/igra/monthly/igra2-monthly-format.txt">igra2-monthly-format.txt</a> and available in <a href="https://www1.ncdc.noaa.gov/pub/data/igra/monthly/monthly-por/">monthly-por</a> as <code class="highlighter-rouge">ghgt_*.txt.zip</code> <a href="#fnref:data-descrip" class="reversefootnote">↩</a></p>
</li>
<li id="fn:entropy-def-time">
<p>Before information theory entropy was even defined! <a href="#fnref:entropy-def-time" class="reversefootnote">↩</a></p>
</li>
<li id="fn:proxy">
<p>a proxy for number of air particles <a href="#fnref:proxy" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
Sun, 09 Apr 2017 03:30:00 -0500
http://stsievert.com/blog/2017/04/09/entropy/
http://stsievert.com/blog/2017/04/09/entropy/
probability, information-theory

Motivation for sexual reproduction

<p>Of course, the purpose of sexual reproduction is to perpetuate our species by
having offspring. Combined with natural selection, it enables fitting our genes to
our environment <em>quickly</em>. But why is it required to have two mates to produce
a single offspring? Would asexual reproduction or having 3+ parents be more
advantageous?</p>
<!--More-->
<p>The end goal of reproduction is to have the offspring breed their own offspring.
Since genes influence the probability of survival/breeding, one goal of
reproduction is to pass on as many genes that aid offspring
survival/reproduction as possible. This is a very information-theoretic
approach – how quickly can biology transfer important information?</p>
<p>A species can evolve more quickly (reaching optimal fitness sooner) when its
members share genetic information, especially when that sharing is combined
with natural selection. Correspondingly, most species use two mates when
reproducing.</p>
<p>This isn’t surprising when compared with asexual reproduction, which involves
only one parent. Sharing information used to survive is more advantageous than
not sharing that information.</p>
<p>But <strong>would having three or more parents help our genome advance more
quickly?</strong> I’ll simulate to guess an answer to this question in this post.</p>
<p><em>This post is inspired by an information theory lecture by <a href="https://sites.google.com/wisc.edu/vjog/">Varun Jog</a>, which is
in turn inspired by a chapter in the textbook “<a href="http://www.inference.phy.cam.ac.uk/itprnn/book.pdf">Information Theory, Inference,
and Learning Algorithms</a>” by David MacKay.</em></p>
<h2 id="model">Model</h2>
<p>Simulation of evolution needs to have 4 main parts:</p>
<ul>
<li>an individual’s gene</li>
<li>determination of an individual’s fitness</li>
<li>inheritance of genes when producing offspring</li>
<li>natural selection</li>
</ul>
<h3 id="individual-gene-and-fitness">Individual gene and fitness</h3>
<p>Each individual’s DNA will be a binary sequence. This is a fair representation
of DNA – the only difference is that actual DNA has 2 bits of information per
base pair.</p>
<p>We’ll model fitness when an individual has genes $g_i \in \braces{0, 1}$ as</p>
<script type="math/tex; mode=display">\text{fitness } = \sum_i g_i</script>
<p>This is a sensible model because it mirrors actual fitness. If one is already
really fit, a little more fitness won’t help much (e.g., a fitness change of 99 → 100 is
a small percentage change). If one is not fit, getting a little more fit helps a ton
(e.g., a fitness change of 1 → 2 is a large percentage change).</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="kn">as</span> <span class="nn">np</span>
<span class="k">def</span> <span class="nf">fitness</span><span class="p">(</span><span class="n">member</span><span class="p">):</span>
<span class="n">member</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">asarray</span><span class="p">(</span><span class="n">member</span><span class="p">)</span>
<span class="k">if</span> <span class="n">member</span><span class="o">.</span><span class="n">ndim</span> <span class="o">==</span> <span class="mi">2</span><span class="p">:</span>
<span class="n">_</span><span class="p">,</span> <span class="n">n_genes</span> <span class="o">=</span> <span class="n">member</span><span class="o">.</span><span class="n">shape</span>
<span class="n">indiv_fitnesses</span> <span class="o">=</span> <span class="n">member</span><span class="o">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="k">return</span> <span class="n">indiv_fitnesses</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span>
<span class="k">return</span> <span class="n">member</span><span class="o">.</span><span class="nb">sum</span><span class="p">()</span>
</code></pre>
</div>
<h3 id="inheritance">Inheritance</h3>
<p>We’ll model the probability of pulling parent $i$’s gene as $\frac{1}{n}$ when there are $n$ parents. This mirrors how reproduction works.</p>
<p>While implementing this we include mutation. This will flip each gene with
probability $p$. This is a naturally occurring process.</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="k">def</span> <span class="nf">produce_offspring</span><span class="p">(</span><span class="n">parents</span><span class="p">,</span> <span class="n">mutate</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span> <span class="n">p</span><span class="o">=</span><span class="mf">0.00</span><span class="p">):</span>
<span class="n">n_parents</span><span class="p">,</span> <span class="n">n_genes</span> <span class="o">=</span> <span class="n">parents</span><span class="o">.</span><span class="n">shape</span>
<span class="n">gene_to_pull</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">randint</span><span class="p">(</span><span class="n">n_parents</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="n">n_genes</span><span class="p">)</span>
<span class="n">child</span> <span class="o">=</span> <span class="p">[</span><span class="n">parents</span><span class="p">[</span><span class="n">gene_to_pull</span><span class="p">[</span><span class="n">k</span><span class="p">]][</span><span class="n">k</span><span class="p">]</span> <span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">n_genes</span><span class="p">)]</span>
<span class="n">child</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">child</span><span class="p">)</span>
<span class="k">if</span> <span class="n">mutate</span><span class="p">:</span>
<span class="n">genes_to_flip</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">choice</span><span class="p">([</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">],</span> <span class="n">size</span><span class="o">=</span><span class="n">child</span><span class="o">.</span><span class="n">shape</span><span class="p">,</span> <span class="n">p</span><span class="o">=</span><span class="p">[</span><span class="mi">1</span><span class="o">-</span><span class="n">p</span><span class="p">,</span> <span class="n">p</span><span class="p">])</span>
<span class="n">i</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">argwhere</span><span class="p">(</span><span class="n">genes_to_flip</span> <span class="o">==</span> <span class="mi">1</span><span class="p">)</span>
<span class="n">child</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="mi">1</span> <span class="o">-</span> <span class="n">child</span><span class="p">[</span><span class="n">i</span><span class="p">]</span>
<span class="k">return</span> <span class="n">child</span>
</code></pre>
</div>
<h3 id="one-generation">One generation</h3>
<p>Each generation of parents will produce twice as many children. We’ll kill half
those children to simulate natural selection.</p>
<p>We’ll produce $2N$ children when we have $N$ parents, regardless of how many
parents are required to produce each offspring. If we produced more children
for some groups, then there would be more children to hand to natural selection.
This would lead to a bias, because natural selection selects the strongest
candidates.</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="k">def</span> <span class="nf">one_generation</span><span class="p">(</span><span class="n">members</span><span class="p">,</span> <span class="n">n_parents</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">):</span>
<span class="s">""" members: 2D np.ndarray
members.shape = (n_parents, n_genes) """</span>
<span class="n">parents</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">permutation</span><span class="p">(</span><span class="n">members</span><span class="p">)</span>
<span class="n">children</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">parent_group</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">parents</span><span class="p">)</span> <span class="o">//</span> <span class="n">n_parents</span><span class="p">):</span>
<span class="n">parents_group</span> <span class="o">=</span> <span class="n">parents</span><span class="p">[</span><span class="n">parent_group</span> <span class="o">*</span> <span class="n">n_parents</span> <span class="p">:</span> <span class="p">(</span><span class="n">parent_group</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span> <span class="o">*</span> <span class="n">n_parents</span><span class="p">]</span>
<span class="n">children</span> <span class="o">+=</span> <span class="p">[</span><span class="n">produce_offspring</span><span class="p">(</span><span class="n">parents_group</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">)</span> <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">2</span><span class="o">*</span><span class="n">n_parents</span><span class="p">)]</span>
<span class="n">children</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">children</span><span class="p">)</span>
<span class="c"># make sure we produce (approximately) 2*N children when we have N parents</span>
<span class="k">assert</span> <span class="n">np</span><span class="o">.</span><span class="nb">abs</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">children</span><span class="p">)</span> <span class="o">-</span> <span class="mi">2</span><span class="o">*</span><span class="nb">len</span><span class="p">(</span><span class="n">parents</span><span class="p">))</span> <span class="o"><</span> <span class="n">n_parents</span> <span class="o">+</span> <span class="mi">1</span>
<span class="k">return</span> <span class="n">children</span>
</code></pre>
</div>
<h2 id="simulations">Simulations</h2>
<p>Now we want to simulate evolution with an initial population:</p>
<p>In the natural selection process we’ll kill off half the children, meaning
there will be $N$ parents for the next generation.</p>
<p>At each generation we’ll record relevant data. We’ll look at the fitness below.</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="k">def</span> <span class="nf">evolve</span><span class="p">(</span><span class="n">n_parents</span><span class="o">=</span><span class="mi">2000</span><span class="p">,</span> <span class="n">n_mates</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">n_genes</span><span class="o">=</span><span class="mi">200</span><span class="p">,</span> <span class="n">n_generations</span><span class="o">=</span><span class="mi">100</span><span class="p">,</span> <span class="n">p_fit</span><span class="o">=</span><span class="mf">0.5</span><span class="p">,</span> <span class="n">verbose</span><span class="o">=</span><span class="mi">10</span><span class="p">):</span>
<span class="n">parents</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">choice</span><span class="p">([</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">],</span> <span class="n">size</span><span class="o">=</span><span class="p">(</span><span class="n">n_parents</span><span class="p">,</span> <span class="n">n_genes</span><span class="p">),</span> <span class="n">p</span><span class="o">=</span><span class="p">[</span><span class="mi">1</span><span class="o">-</span><span class="n">p_fit</span><span class="p">,</span> <span class="n">p_fit</span><span class="p">])</span>
<span class="n">child</span> <span class="o">=</span> <span class="n">produce_offspring</span><span class="p">(</span><span class="n">parents</span><span class="p">)</span>
<span class="n">data</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">generation</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">n_generations</span><span class="p">):</span>
<span class="k">if</span> <span class="n">verbose</span> <span class="ow">and</span> <span class="n">generation</span> <span class="o">%</span> <span class="n">verbose</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
<span class="k">print</span><span class="p">(</span><span class="s">'Generation {} for n_mates = {} with {} parents'</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">generation</span><span class="p">,</span> <span class="n">n_mates</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">parents</span><span class="p">)))</span>
<span class="n">children</span> <span class="o">=</span> <span class="n">one_generation</span><span class="p">(</span><span class="n">parents</span><span class="p">,</span> <span class="n">n_parents</span><span class="o">=</span><span class="n">n_mates</span><span class="p">,</span> <span class="n">mutate</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">p</span><span class="o">=</span><span class="mf">0.01</span><span class="p">)</span>
<span class="c"># kill half the children</span>
<span class="n">children_fitness</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="n">fitness</span><span class="p">(</span><span class="n">child</span><span class="p">)</span> <span class="k">for</span> <span class="n">child</span> <span class="ow">in</span> <span class="n">children</span><span class="p">])</span>
<span class="n">i</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">argsort</span><span class="p">(</span><span class="n">children_fitness</span><span class="p">)</span>
<span class="n">children</span> <span class="o">=</span> <span class="n">children</span><span class="p">[</span><span class="n">i</span><span class="p">]</span>
<span class="n">parents</span> <span class="o">=</span> <span class="n">children</span><span class="p">[</span><span class="nb">len</span><span class="p">(</span><span class="n">children</span><span class="p">)</span> <span class="o">//</span> <span class="mi">2</span><span class="p">:]</span><span class="o">.</span><span class="n">copy</span><span class="p">()</span>
<span class="n">data</span> <span class="o">+=</span> <span class="p">[{</span><span class="s">'fitness'</span><span class="p">:</span> <span class="n">fitness</span><span class="p">(</span><span class="n">parents</span><span class="p">),</span> <span class="s">'generation'</span><span class="p">:</span> <span class="n">generation</span><span class="p">,</span>
<span class="s">'n_mates'</span><span class="p">:</span> <span class="n">n_mates</span><span class="p">,</span> <span class="s">'n_parents'</span><span class="p">:</span> <span class="n">n_parents</span><span class="p">,</span>
<span class="s">'n_genes'</span><span class="p">:</span> <span class="n">n_genes</span><span class="p">,</span> <span class="s">'n_generations'</span><span class="p">:</span> <span class="n">n_generations</span><span class="p">}]</span>
<span class="k">return</span> <span class="n">data</span>
</code></pre>
</div>
<h3 id="data-collection">Data collection</h3>
<p>Then we can run this for different numbers of mates required to produce one
offspring:</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="n">data</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">n_mates</span> <span class="ow">in</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">]:</span>
<span class="n">data</span> <span class="o">+=</span> <span class="n">evolve</span><span class="p">(</span><span class="n">n_mates</span><span class="o">=</span><span class="n">n_mates</span><span class="p">,</span> <span class="n">p_fit</span><span class="o">=</span><span class="mf">0.50</span><span class="p">)</span>
</code></pre>
</div>
<h3 id="results">Results</h3>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">altair</span> <span class="kn">import</span> <span class="n">Chart</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="kn">as</span> <span class="nn">pd</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
<span class="n">Chart</span><span class="p">(</span><span class="n">df</span><span class="p">)</span><span class="o">.</span><span class="n">mark_line</span><span class="p">()</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span>
<span class="n">x</span><span class="o">=</span><span class="s">'generation'</span><span class="p">,</span> <span class="n">y</span><span class="o">=</span><span class="s">'fitness'</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'n_mates'</span><span class="p">)</span>
</code></pre>
</div>
<p><img src="/images/posts/2017-sex/fitness.png" alt="" /></p>
<p>Asexual reproduction requires ~75 generations to reach the fitness
sexual reproduction reaches in ~15 generations. Sexual reproduction
appears to be fairly close to optimal in this model.</p>
<p><em>This post is available for download as a Jupyter notebook:
<a href="/assets/2017-reproduction/Reproduction.ipynb">Reproduction.ipynb</a></em></p>
Sat, 11 Mar 2017 02:30:00 -0600
http://stsievert.com/blog/2017/03/11/sexual-reproduction/
http://stsievert.com/blog/2017/03/11/sexual-reproduction/reproductioninformation-theoryEasy powerful parallel code execution and use on a UW cluster<p>I often have highly optimized code that I want to run independently for
different parameters. For example, I might want to see how reconstruction
quality varies as I change two parameters. My code takes a moderate amount of
time to run, maybe 1 minute. This isn’t huge, but if I want to average
performance over 5 random runs for $20^2$ different input combinations, using a
naïve for-loop means about 1.5 days. Using <a href="http://distributed.readthedocs.io/en/latest/">dask.distributed</a>, I distribute these
independent jobs across different machines and different cores for a
significant speedup.</p>
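The arithmetic behind that estimate can be checked directly (a quick sketch; the 1-minute figure is the rough per-run time from above):

```python
# Back-of-the-envelope check: 5 random runs for 20**2 parameter
# combinations at ~60 seconds per run, executed serially.
runs, combos, seconds_per_run = 5, 20**2, 60
total_seconds = runs * combos * seconds_per_run
days = total_seconds / (24 * 60 * 60)
print(round(days, 2))  # ~1.4 days of serial compute
```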
<!--More-->
<p>Testing these input combinations requires at least one embarrassingly parallel
for-loop – each iteration runs independently of the others. The
simplified example takes the form of</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="kn">as</span> <span class="nn">np</span>
<span class="k">def</span> <span class="nf">test_model</span><span class="p">(</span><span class="n">x</span><span class="p">):</span>
<span class="c"># ...</span>
<span class="k">return</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">choice</span><span class="p">([</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">])</span>
<span class="n">y</span> <span class="o">=</span> <span class="p">[</span><span class="n">test_model</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">np</span><span class="o">.</span><span class="n">linspace</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">)]</span>
</code></pre>
</div>
<p><a href="http://distributed.readthedocs.io/en/latest/">dask.distributed</a> is a tool to optimize these for loops<sup id="fnref:optimization"><a href="#fn:optimization" class="footnote">1</a></sup>. It can
distribute a single loop of this for-loop onto different cores and different
machines. This is perfect for me – as a grad student involved with the
Wisconsin Institute for Discovery, I have a cluster of about 30 machines ready
for my use.</p>
<p>I’ll first illustrate basic dask use then explain how I personally set it up on
the cluster. I’ll then go over some advanced use that covers how to use it with
the cluster at UW–Madison.</p>
<h2 id="daskdistributed-example">dask.distributed example</h2>
<p>Using dask.distributed is easy, and the <a href="http://distributed.readthedocs.io/en/latest/">dask.distributed documentation</a> is helpful.
Functions such as <code class="highlighter-rouge">Client.map</code> and <code class="highlighter-rouge">Client.submit</code>
submit jobs, and <code class="highlighter-rouge">Client.gather</code> collects the pending results.</p>
<p>In the use case above,</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="kn">as</span> <span class="nn">np</span>
<span class="kn">from</span> <span class="nn">distributed</span> <span class="kn">import</span> <span class="n">Client</span>
<span class="n">client</span> <span class="o">=</span> <span class="n">Client</span><span class="p">()</span> <span class="c"># will start scheduler and worker automatically</span>
<span class="k">def</span> <span class="nf">test_model</span><span class="p">(</span><span class="n">x</span><span class="p">):</span>
<span class="c"># ...</span>
<span class="k">return</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">choice</span><span class="p">([</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">])</span>
<span class="n">y</span> <span class="o">=</span> <span class="p">[</span><span class="n">client</span><span class="o">.</span><span class="n">submit</span><span class="p">(</span><span class="n">test_model</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">np</span><span class="o">.</span><span class="n">linspace</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">)]</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">client</span><span class="o">.</span><span class="n">gather</span><span class="p">(</span><span class="n">y</span><span class="p">)</span> <span class="c"># collect the results</span>
</code></pre>
</div>
<p>This is a simple setup – we have to add 3 lines and change 1. For the speedup
that dask.distributed provides, that’s remarkably simple.</p>
<p>Using <code class="highlighter-rouge">Client()</code> is easy – it starts a worker and scheduler for you. This
only works for one machine though; other tools exist to use many machines as
detailed on <em><a href="http://distributed.readthedocs.io/en/latest/setup.html">Setup Network</a></em>.</p>
<p>That really covers all you need to know; the <a href="http://distributed.readthedocs.io/en/latest/">dask.distributed docs</a> are decent and
the above example is enough to get started. In what follows, I’ll explain my
work flow: using dask.distributed on the UW–Madison optimization cluster with a
Jupyter notebook.</p>
<h2 id="using-daskdistributed-with-the-uw-cluster">Using dask.distributed with the UW cluster</h2>
<p>After installing a personal Python install on the UW cluster, following
dask.distributed’s <em><a href="http://distributed.readthedocs.io/en/latest/quickstart.html">Quickstart</a></em> gets you 99% of the way to using dask.distributed on
the UW Optimization Cluster. The <a href="http://distributed.readthedocs.io/en/latest/">dask.distributed documentation</a> is rather
complete – please, give it a look. Most of the content below can be found in
<em><a href="http://distributed.readthedocs.io/en/latest/quickstart.html">Quickstart</a></em>, <em><a href="http://distributed.readthedocs.io/en/latest/web.html">Web interface</a></em> and <em><a href="http://distributed.readthedocs.io/en/latest/faq.html">FAQ</a></em>.</p>
<p>Setting up many workers on the cluster with many machines is a little trickier
because the cluster is not my personal machine and I (thankfully) don’t manage
it. I’ll describe how I use dask.distributed and what workarounds I had to find
to get dask.distributed to run on the UW–Madison cluster. Additionally I’ll
describe how I use this in conjunction with Jupyter notebooks.</p>
<p>I set up dask.distributed like below, using SSH port
forwarding to view the web UI.</p>
<div class="language-shell highlighter-rouge"><pre class="highlight"><code><span class="c"># visit localhost:8070/status to see dask's web UI</span>
<span class="gp">scott@local$ </span>ssh -L 8070:localhost:8787 ssievert@cluster-1
<span class="gp">scott@cluster-1$ </span>dask-scheduler
<span class="c"># `dask-scheduler` prints "INFO - Scheduler at 123.1.28.1:8786"</span>
<span class="gp">scott@cluster-1$ </span><span class="nb">export </span><span class="nv">OMP_NUM_THREADS</span><span class="o">=</span>1; dask-worker 123.1.28.1:8786
<span class="gp">scott@cluster-2$ </span><span class="nb">export </span><span class="nv">OMP_NUM_THREADS</span><span class="o">=</span>1; dask-worker 123.1.28.1:8786
</code></pre>
</div>
<p>When I run dask-worker without setting <code class="highlighter-rouge">OMP_NUM_THREADS</code>, the worker throws an
error and fails. Setting <code class="highlighter-rouge">OMP_NUM_THREADS=1</code> resolves this issue; see the SO
question titled <a href="http://stackoverflow.com/questions/39422092/error-with-omp-num-threads-when-using-dask-distributed">“Error with OMP_NUM_THREADS when using dask distributed”</a> for
more detail.</p>
<p>A nice tool to manage running the same commands on many machines is <a href="https://github.com/brockgr/csshx">csshx</a>
for OS X’s Terminal.app (not iTerm) and <a href="https://github.com/duncs/clusterssh">cssh</a> for Linux (<code class="highlighter-rouge">cssh</code> stands for
“Cluster SSH”).</p>
<p>I use <code class="highlighter-rouge">tmux</code> to handle my dask-scheduler and dask-workers. This allows me to</p>
<ul>
<li>log out without killing the processes I want running in the background</li>
<li>view the output of both dask-scheduler and dask-worker even after logging out</li>
<li>always have a scheduler and workers available when I log in</li>
</ul>
<p>This is enough to use dask.distributed on this cluster. Now, I’ll touch on how I use
it with Jupyter notebooks using port forwarding. This allows me to quickly
visualize the result on the cluster and provides a native editing environment.</p>
<h3 id="jupyter-notebooks--daskdistributed">Jupyter notebooks + dask.distributed</h3>
<p>I also use dask.distributed with the Jupyter notebook, which provides a nice
interface to view results and edit code. This means I don’t have to
<code class="highlighter-rouge">rsync</code> results to my local machine to visualize them. Additionally,
it feels like I’m editing on my local machine even though the code lives on this remote
cluster.</p>
<div class="language-shell highlighter-rouge"><pre class="highlight"><code><span class="gp">scott@local$ </span>ssh -L 8080:localhost:8888 -L 8081:localhost:8787 scott@cluster-1
<span class="gp">scott@cluster-1$ </span>jupyter notebook
<span class="c"># on local machine, localhost:8080 views notebook running on the cluster</span>
</code></pre>
</div>
<p>With the process above, I can <em>quickly</em> visualize results <em>directly</em> on the
server. Even better, I can fully utilize the cluster and use as many machines
as I wish.</p>
<p>With this, I can also view <a href="http://distributed.readthedocs.io/en/latest/web.html">dask.distributed’s web UI</a>. This allows me to see the
progress of the jobs on the cluster; I can check to see how far I’ve come and
how close I am to finishing.</p>
<table>
<thead>
<tr>
<th style="text-align: center">Notebook</th>
<th style="text-align: center">Web UI</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center"><img src="/images/posts/2016-dask/notebook.png" alt="" /></td>
<td style="text-align: center"><img src="/images/posts/2016-dask/webui.png" alt="" /></td>
</tr>
</tbody>
</table>
<h2 id="actual-use--visualization">Actual use + visualization</h2>
<p>Oftentimes I am measuring model performance for different input combinations.
During this I typically average the results by calling <code class="highlighter-rouge">test_model</code> many
times.</p>
<p>In the example below, I show a personal use case of dask.distributed. In this, I
include the method of visualization (which relies on pandas and seaborn).</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="kn">as</span> <span class="nn">np</span>
<span class="kn">import</span> <span class="nn">seaborn</span> <span class="kn">as</span> <span class="nn">sns</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="kn">as</span> <span class="nn">pd</span>
<span class="kn">from</span> <span class="nn">distributed</span> <span class="kn">import</span> <span class="n">Client</span>
<span class="kn">from</span> <span class="nn">distributed.diagnostics</span> <span class="kn">import</span> <span class="n">progress</span>
<span class="k">def</span> <span class="nf">test_model</span><span class="p">(</span><span class="n">k</span><span class="p">,</span> <span class="n">n</span><span class="p">,</span> <span class="n">seed</span><span class="o">=</span><span class="mi">42</span><span class="p">):</span>
<span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">seed</span><span class="p">(</span><span class="n">seed</span><span class="p">)</span>
<span class="c"># ...</span>
<span class="k">return</span> <span class="p">{</span><span class="s">'sparsity'</span><span class="p">:</span> <span class="n">k</span><span class="p">,</span> <span class="s">'n_observations'</span><span class="p">:</span> <span class="n">n</span><span class="p">,</span> <span class="s">'success'</span><span class="p">:</span> <span class="mi">1</span> <span class="k">if</span> <span class="n">error</span> <span class="o"><</span> <span class="mf">0.1</span> <span class="k">else</span> <span class="mi">0</span><span class="p">}</span>
<span class="n">client</span> <span class="o">=</span> <span class="n">Client</span><span class="p">(</span><span class="s">'127.61.142.160:8786'</span><span class="p">)</span>
<span class="n">data</span> <span class="o">=</span> <span class="p">[</span><span class="n">client</span><span class="o">.</span><span class="n">submit</span><span class="p">(</span><span class="n">test_model</span><span class="p">,</span> <span class="n">k</span><span class="p">,</span> <span class="n">n</span><span class="p">,</span> <span class="n">seed</span><span class="o">=</span><span class="n">repeat</span><span class="o">*</span><span class="n">n</span><span class="o">*</span><span class="n">k</span><span class="p">)</span>
<span class="k">for</span> <span class="n">n</span> <span class="ow">in</span> <span class="n">np</span><span class="o">.</span><span class="n">logspace</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="n">num</span><span class="o">=</span><span class="mi">60</span><span class="p">)</span>
<span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="n">np</span><span class="o">.</span><span class="n">logspace</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mf">1.7</span><span class="p">,</span> <span class="n">num</span><span class="o">=</span><span class="mi">40</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="nb">int</span><span class="p">)</span>
<span class="k">for</span> <span class="n">repeat</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">10</span><span class="p">)]</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">progress</span><span class="p">(</span><span class="n">data</span><span class="p">)</span> <span class="c"># allows notebook/console progress bar</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">client</span><span class="o">.</span><span class="n">gather</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
<span class="n">show</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">pivot_table</span><span class="p">(</span><span class="n">index</span><span class="o">=</span><span class="s">'sparsity'</span><span class="p">,</span> <span class="n">columns</span><span class="o">=</span><span class="s">'n_observations'</span><span class="p">,</span> <span class="n">values</span><span class="o">=</span><span class="s">'success'</span><span class="p">)</span>
<span class="n">sns</span><span class="o">.</span><span class="n">heatmap</span><span class="p">(</span><span class="n">show</span><span class="p">)</span>
<span class="c"># small amount of matplotlib/seaborn code</span>
</code></pre>
</div>
<p><img src="/images/posts/2016-dask/heatmap-blog-p-recovery.png" alt="" /></p>
<p>This plot is $40\times 60$ with each job averaged over 10 trials. In total,
that makes for $40 \cdot 60 \cdot 10 = 24\cdot 10^3$ jobs. This plot was generated on
my local machine with 8 cores; at most, we can see a speedup of 8.</p>
<h3 id="further-speedups">Further speedups</h3>
<p>This approach only parallelizes different jobs, not tasks within that job. This
means that if a core finishes quickly and another job isn’t available, that
core sits empty and isn’t used.</p>
<p>For more details on this setup, see dask.distributed’s page on <em><a href="http://distributed.readthedocs.io/en/latest/related-work.html">Related Work</a></em>.
Using any of these frameworks should allow for further speedups. I would
recommend <a href="http://dask.pydata.org/en/latest/">dask</a> the most as it has <code class="highlighter-rouge">dask.array</code> and <code class="highlighter-rouge">dask.DataFrame</code>, parallel
implementations of NumPy’s <code class="highlighter-rouge">array</code> and pandas’ <code class="highlighter-rouge">DataFrame</code>.</p>
<p>Additionally, <a href="http://dask.pydata.org/en/latest/">dask</a> also has a <a href="http://dask.pydata.org/en/latest/delayed-overview.html#delayed-function">delayed</a> function decorator. This allows
running functions decorated with <code class="highlighter-rouge">@delayed</code> on all available cores of one
machine. Of course, make sure you need to optimize before decorating a function.</p>
<h3 id="notes">Notes</h3>
<ul>
<li>I couldn’t include nested map functions or use <a href="http://distributed.readthedocs.io/en/latest/joblib.html">dask.distributed’s joblib
frontend</a> inside a function submitted to dask.distributed as detailed in
<a href="https://github.com/dask/distributed/issues/465">dask.distributed#465</a> and <a href="https://github.com/joblib/joblib/issues/389">joblib#389</a>. Note that <a href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.pivot_table.html"><code class="highlighter-rouge">pd.pivot_table</code></a> alleviates
many of these concerns as illustrated above.</li>
<li>The pseudo-random number generator produced the same random numbers in every
job. To get around this and generate different seeds for every iteration, I passed <code class="highlighter-rouge">i_repeat *
some_model_param</code> as the seed.</li>
</ul>
<div class="footnotes">
<ol>
<li id="fn:optimization">
<p>Of course before you optimize, be sure you need to optimize. <a href="#fnref:optimization" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
Fri, 09 Sep 2016 03:30:00 -0500
http://stsievert.com/blog/2016/09/09/dask-cluster/
http://stsievert.com/blog/2016/09/09/dask-cluster/pythonnumpyspeedparallelNumPy GPU acceleration<p>I recently had to compute many inner products with a given matrix $\Ab$ for
many different vectors $\xb_i$, or <script type="math/tex">\xb_i^T \Ab \xb_i</script>. Each vector $\xb_i$
represents a shoe from Zappos and there are 50k vectors $\xb_i \in \R^{1000}$.
This computation took place behind a user-facing web interface and during
testing had a delay of 5 minutes. This is clearly unacceptable; how can we make
it faster?<sup id="fnref:other_libraries"><a href="#fn:other_libraries" class="footnote">1</a></sup></p>
<!--More-->
<p>I spent a couple hours trying to get the best possible performance from my
functions… and through this, I found a speed optimization<sup id="fnref:optimization"><a href="#fn:optimization" class="footnote">2</a></sup> that
put most of the computation on NumPy’s shoulders. After I made this change, the
naïve for-loop and NumPy were about a factor of 2 apart, not enough to write a
blog post about.</p>
<p>Use of an NVIDIA GPU significantly outperformed NumPy. Given that most of the
optimization seemed to be focused on a single matrix multiplication, let’s
focus on speed in matrix multiplication.</p>
<p>We know that matrix multiplication has <a href="https://en.wikipedia.org/wiki/Computational_complexity_theory">computational complexity</a> of something
like $O(n^{2.8074})$<sup id="fnref:strassan"><a href="#fn:strassan" class="footnote">3</a></sup>, but very likely greater than
$O(n^{2.375477})$<sup id="fnref:copper"><a href="#fn:copper" class="footnote">4</a></sup> when multiplying two $n\times n$ matrices. We can’t
get around this without diving into theory, but we can change the constant that
dictates exactly how fast these algorithms run.</p>
<p>The tools I’ll test are</p>
<ul>
<li>the default NumPy install, with no MKL (even though it’s now provided by
default with Anaconda)</li>
<li><a href="https://software.intel.com/en-us/intel-mkl/">Intel MKL</a>, a tool that provides acceleration for BLAS/LAPACK</li>
<li>the GPU. To do this, I’ll need an Amazon AWS machine and the NVIDIA CUDA
Toolkit. An easy interface is available through <a href="https://github.com/cudamat/cudamat">cudamat</a>
but <a href="https://github.com/lebedov/scikit-cuda">scikit-cuda</a> and <a href="https://docs.continuum.io/accelerate/index">Accelerate</a> also have nice interfaces and provide more access.</li>
</ul>
<p>I had planned to test other tools but these tests didn’t pan out for reasons in
the <a href="#footnotes">footnotes</a>. My test script can be summarized as follows:</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="kn">as</span> <span class="nn">np</span>
<span class="kn">import</span> <span class="nn">cudamat</span> <span class="kn">as</span> <span class="nn">cm</span>
<span class="n">n</span><span class="p">,</span> <span class="n">p</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="mf">2e3</span><span class="p">),</span> <span class="nb">int</span><span class="p">(</span><span class="mf">40e3</span><span class="p">)</span>
<span class="n">A</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">randn</span><span class="p">(</span><span class="n">n</span><span class="p">,</span> <span class="n">p</span><span class="p">)</span>
<span class="n">B</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">randn</span><span class="p">(</span><span class="n">p</span><span class="p">,</span> <span class="n">n</span><span class="p">)</span>
<span class="o">%</span><span class="n">timeit</span> <span class="n">A</span> <span class="err">@</span> <span class="n">B</span>
<span class="n">cm</span><span class="o">.</span><span class="n">cublas_init</span><span class="p">()</span>
<span class="n">cm</span><span class="o">.</span><span class="n">CUDAMatrix</span><span class="o">.</span><span class="n">init_random</span><span class="p">()</span>
<span class="n">A_cm</span> <span class="o">=</span> <span class="n">cm</span><span class="o">.</span><span class="n">empty</span><span class="p">((</span><span class="n">n</span><span class="p">,</span> <span class="n">p</span><span class="p">))</span><span class="o">.</span><span class="n">fill_with_randn</span><span class="p">()</span>
<span class="n">B_cm</span> <span class="o">=</span> <span class="n">cm</span><span class="o">.</span><span class="n">empty</span><span class="p">((</span><span class="n">p</span><span class="p">,</span> <span class="n">n</span><span class="p">))</span><span class="o">.</span><span class="n">fill_with_randn</span><span class="p">()</span>
<span class="o">%</span><span class="n">timeit</span> <span class="n">A_cm</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">B_cm</span><span class="p">)</span>
<span class="n">cm</span><span class="o">.</span><span class="n">cublas_shutdown</span><span class="p">()</span>
</code></pre>
</div>
<p>When doing this, I generate the following graph:</p>
<p><img src="/images/posts/2016-numpy-speed/matmul-timings.png" alt="" /></p>
<table>
<thead>
<tr>
<th style="text-align: center">Environment</th>
<th style="text-align: center">NumPy + no MKL</th>
<th style="text-align: center">NumPy + MKL</th>
<th style="text-align: center">cudamat</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center">Time (seconds)</td>
<td style="text-align: center">7.18</td>
<td style="text-align: center">4.057</td>
<td style="text-align: center">0.2898</td>
</tr>
</tbody>
</table>
<p><br /> Relative to the default Anaconda environment (i.e., with MKL), our
script runs <strong>80%</strong> slower without MKL and enjoys a <strong>14x</strong> speedup with cudamat!</p>
<p>This simple test shows that using the GPU is powerful, but cudamat is limited
in that it only provides basic mathematical capability for the GPU (the dot
product is as far as it goes).</p>
<p>However, the libraries <a href="https://docs.continuum.io/accelerate/index">Accelerate</a> and <a href="https://github.com/lebedov/scikit-cuda">scikit-cuda</a> use the GPU and both
provide more complex mathematical functions, including <code class="highlighter-rouge">fft</code>, <code class="highlighter-rouge">svd</code> and <code class="highlighter-rouge">eig</code>.</p>
<p>Accelerate and scikit-cuda are fairly similar. In choosing whether to use
Accelerate or scikit-cuda, there are two obvious tradeoffs:</p>
<ul>
<li>scikit-cuda has access to linear algebra functions (e.g., <code class="highlighter-rouge">eig</code>) and Accelerate
does not. However, access to these higher level mathematical functions
comes through <a href="http://www.culatools.com">CULA</a>, another framework that requires a license (free
academic licenses are available).</li>
<li>Accelerate can accept raw <code class="highlighter-rouge">ndarray</code>s while scikit-cuda needs to have <code class="highlighter-rouge">gpuarray</code>s
passed in (meaning more setup/cleanup).</li>
</ul>
<p><em>edit: Other GPU libraries <a href="http://numba.pydata.org/numba-doc/dev/cuda/index.html"><code class="highlighter-rouge">numba.cuda.jit</code></a>, <a href="http://numba.pydata.org/numba-doc/dev/hsa/index.html"><code class="highlighter-rouge">numba.hsa.jit</code></a> also exist.</em></p>
<p>Whichever is chosen, large speed enhancements exist. I have timed a common
function (<code class="highlighter-rouge">fft</code>) over different values of <code class="highlighter-rouge">n</code>; there is some overhead to moving
to the GPU and I wanted to see where that is. I provide a summary of my testing
script in the <a href="#appendix">appendix</a>.</p>
<p><img src="/images/posts/2016-numpy-speed/fft-gpu.png" alt="" /></p>
<p>CULA has benchmarks for a few higher-level mathematical functions
(source: the <a href="http://www.culatools.com/dense/">CULA Dense homepage</a>):</p>
<p><img src="/images/posts/2016-numpy-speed/dense-summary-bench.png" alt="" /></p>
<h2 id="appendix">Appendix</h2>
<h3 id="other-untested-gpu-libraries">Other untested GPU libraries</h3>
<ul>
<li>PyCUDA and PyOpenCL are not tested because they
require C++ code (<a href="https://andreask.cs.illinois.edu/PyCuda/Examples/MatrixmulSimple">PyCUDA example</a>, <a href="https://github.com/stefanv/PyOpenCL/blob/master/examples/matrix-multiply.py">PyOpenCL example</a>).</li>
<li>gnumpy was not tested because it doesn’t support Python 3 and hasn’t been
touched in 4 years</li>
<li>I tried to install <a href="https://github.com/andersbll/cudarray">cudarray</a> but ran into install difficulties</li>
<li><a href="http://www.deeplearning.net/software/theano/">theano</a> supports the GPU (see “<a href="http://deeplearning.net/software/theano/tutorial/using_gpu.html">Using the GPU</a>”) but was not tested – it
seems to be primarily a machine learning library</li>
</ul>
<p>…and of course I didn’t optimize any loop-based functions. To do optimize
loop speed, I would look at <a href="http://numba.pydata.org">numba</a> first and then possibly <a href="http://cython.org">Cython</a>.</p>
<h3 id="fft-timing-script-summary">FFT timing script summary</h3>
<p>In this script, I show preparing for the FFT and preparing for linear algebra
functions (e.g., <code class="highlighter-rouge">culinalg.init()</code>). I found that it’s useful to look at the
<a href="https://github.com/lebedov/scikit-cuda/tree/master/demos">scikit-cuda demos</a>.</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="kn">as</span> <span class="nn">np</span>
<span class="kn">from</span> <span class="nn">accelerate.cuda.blas</span> <span class="kn">import</span> <span class="n">Blas</span>
<span class="kn">import</span> <span class="nn">accelerate.cuda.fft</span> <span class="kn">as</span> <span class="nn">acc_fft</span>
<span class="kn">import</span> <span class="nn">pycuda.autoinit</span>
<span class="kn">import</span> <span class="nn">pycuda.gpuarray</span> <span class="kn">as</span> <span class="nn">gpuarray</span>
<span class="kn">import</span> <span class="nn">skcuda.fft</span> <span class="kn">as</span> <span class="nn">cu_fft</span>
<span class="kn">import</span> <span class="nn">skcuda.linalg</span> <span class="kn">as</span> <span class="nn">culinalg</span>
<span class="kn">import</span> <span class="nn">skcuda.misc</span> <span class="kn">as</span> <span class="nn">cumisc</span>
<span class="c"># for scikit-cuda</span>
<span class="n">culinalg</span><span class="o">.</span><span class="n">init</span><span class="p">()</span>
<span class="c"># for accelerate when calling wrapped BLAS functions (e.g., blas.dot)</span>
<span class="n">blas</span> <span class="o">=</span> <span class="n">Blas</span><span class="p">()</span>
<span class="k">def</span> <span class="nf">fft_accelerate</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">):</span>
<span class="n">f</span> <span class="o">=</span> <span class="n">acc_fft</span><span class="o">.</span><span class="n">FFTPlan</span><span class="p">(</span><span class="n">shape</span><span class="o">=</span><span class="n">x</span><span class="o">.</span><span class="n">shape</span><span class="p">,</span> <span class="n">itype</span><span class="o">=</span><span class="n">x</span><span class="o">.</span><span class="n">dtype</span><span class="p">,</span> <span class="n">otype</span><span class="o">=</span><span class="n">y</span><span class="o">.</span><span class="n">dtype</span><span class="p">)</span>
<span class="n">f</span><span class="o">.</span><span class="n">forward</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">out</span><span class="o">=</span><span class="n">y</span><span class="p">)</span> <span class="c"># note: we're passing np.ndarrays</span>
<span class="k">return</span> <span class="n">y</span>
<span class="k">def</span> <span class="nf">fft_scikit</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">):</span>
<span class="n">plan_forward</span> <span class="o">=</span> <span class="n">cu_fft</span><span class="o">.</span><span class="n">Plan</span><span class="p">(</span><span class="n">x</span><span class="o">.</span><span class="n">shape</span><span class="p">,</span> <span class="n">np</span><span class="o">.</span><span class="n">float32</span><span class="p">,</span> <span class="n">np</span><span class="o">.</span><span class="n">complex64</span><span class="p">)</span>
<span class="n">cu_fft</span><span class="o">.</span><span class="n">fft</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">plan_forward</span><span class="p">)</span>
<span class="k">return</span> <span class="n">y</span><span class="o">.</span><span class="n">get</span><span class="p">()</span>
<span class="n">n</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="mf">40e4</span><span class="p">)</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">randn</span><span class="p">(</span><span class="n">n</span><span class="p">)</span><span class="o">.</span><span class="n">astype</span><span class="p">(</span><span class="s">'float32'</span><span class="p">)</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">n</span><span class="p">)</span><span class="o">.</span><span class="n">astype</span><span class="p">(</span><span class="s">'complex64'</span><span class="p">)</span> <span class="c"># needed because fft has complex output</span>
<span class="o">%</span><span class="n">timeit</span> <span class="n">fft_accelerate</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">gpuarray</span><span class="o">.</span><span class="n">to_gpu</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">gpuarray</span><span class="o">.</span><span class="n">empty</span><span class="p">(</span><span class="n">n</span><span class="o">//</span><span class="mi">2</span> <span class="o">+</span> <span class="mi">1</span><span class="p">,</span> <span class="n">np</span><span class="o">.</span><span class="n">complex64</span><span class="p">)</span>
<span class="o">%</span><span class="n">timeit</span> <span class="n">fft_scikit</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>
</code></pre>
</div>
<div class="footnotes">
<ol>
<li id="fn:other_libraries">
<p>Note: I link to the libraries I discovered in “<a href="#other-gpu-libraries">Other GPU libraries</a>” and show some speed results (both generated and from other sources). <a href="#fnref:other_libraries" class="reversefootnote">↩</a></p>
</li>
<li id="fn:optimization">
<p>which was calculating $\Ab \Xb^T$ outside the loop <a href="#fnref:optimization" class="reversefootnote">↩</a></p>
</li>
<li id="fn:strassan">
<p>using the <a href="https://en.wikipedia.org/wiki/Strassen_algorithm">Strassen algorithm</a> <a href="#fnref:strassan" class="reversefootnote">↩</a></p>
</li>
<li id="fn:copper">
<p>using the <a href="https://en.wikipedia.org/wiki/Coppersmith–Winograd_algorithm">Coppersmith-Winograd algorithm</a> <a href="#fnref:copper" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
Fri, 01 Jul 2016 03:30:00 -0500
http://stsievert.com/blog/2016/07/01/numpy-gpu/
http://stsievert.com/blog/2016/07/01/numpy-gpu/pythonnumpygpuspeedparallelProbability of a powder day<p>This last spring break, I had a <em>ton</em> of fun! Why?</p>
<p><img src="/assets/2016-powder/faceshot.jpg" alt="" /></p>
<p>I had the good fortune of catching a powder day with powder skis this spring
break! While riding the Born Free chair at Vail, I wondered: what are the chances
of this happening in a given trip?<sup id="fnref:github"><a href="#fn:github" class="footnote">1</a></sup></p>
<!--More-->
<p>A naïve approach is to simply count up the number of powder days in a season
and divide by the number of weeks in a season (assuming a week-long trip)…
but that doesn’t capture any of the time dependence. Big storms are more likely
in the depths of winter, and during spring break, when I typically go, I
believe a powder day is less likely.</p>
<p>To estimate the probability of a powder day occurring during our trip, let’s
plot the number of powder days over the last 31 years using the data from <a href="http://www.ncdc.noaa.gov/cdo-web/datasets/ANNUAL/stations/COOP:058575/detail">NOAA weather
station 058575</a> in Eagle County, CO, the location of my ski trips.</p>
<p><img src="/assets/2016-powder/vail/avg-snow-fall.png" alt="" /></p>
<p>How does this information help us determine the probability of a powder day? At
first, this seems like a challenging problem. If you live in Colorado, you’ll
certainly see a powder day eventually, a clue that the chance of a powder day is
not just the sum of the probabilities on each day.</p>
<p>Fortunately, probabilists have spent time developing frameworks for exactly
this type of problem. The method that naturally lends itself to waiting for a
powder day is a <a href="https://en.wikipedia.org/wiki/Poisson_point_process">Poisson process</a>. Quoting Wikipedia,</p>
<blockquote>
<p>[a Poisson process can model] customers arriving and being served or phone
calls arriving at a phone exchange.</p>
</blockquote>
<p>This seems like the perfect framework for powder days! But first, how much snow do we
need before we declare a powder day? Speaking from experience, probably 1.5
inches or about 38mm. Because the weather station is in the valley, this is
reasonable. If the bottom of the valley sees 1.5 inches, the peak might see 4-5
inches.</p>
<p>Luckily, this threshold doesn’t matter for our end use case: in practice, I’m
more interested in <em>when</em> to book my trip. This means I want the highest
probability of a powder day, which tends to remain fairly constant with
different thresholds.</p>
<p>The weather station reports the amount of snow each day. Using this definition
of a powder day, we can graph the number of powder days that occur on (for
example) January 4th over 31 years.</p>
<p><img src="/assets/2016-powder/vail/num-pow-days.png" alt="" /></p>
<p>We now know when powder days occur, and Poisson processes have some nice
properties that align with powder days. Most importantly, the parameter
$\lambda$ that characterizes a Poisson process is determined by the number of
events seen in a time interval. We can define $N(a, b]$ to be the number of
events we see between days $a$ and $b$, and Poisson processes have the property that</p>
<script type="math/tex; mode=display">\E{N(a, b]} = \lambda (b - a)</script>
<p>or the expected value in an interval is just some number times the length of
time. This means that we can easily compute $\lambda$ in a time range: it’s
just the number of snow storms that we observe. This parameter corresponds to
the frequency of events. Given a high $\lambda$, the chances of a powder day
are much higher.</p>
<p>In this sense, waiting for a powder day is definitely <em>not</em> a homogeneous
Poisson process: the probability of a powder day changes over time. During the
summer, a powder day definitely won’t happen. We can get around this because
weather is a <em>slowly</em> changing system. We can model each short window of
time as a stationary homogeneous Poisson process that follows the equation above.</p>
<p>When estimating $\lambda$, we have to choose an amount of time over which the
system can be modeled as a homogeneous Poisson process, i.e., how long the
weather stays roughly the same. We’ll decide on two weeks/14 days for this.</p>
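<p>As a sketch of this estimate with synthetic snowfall (the real post uses the NOAA records; the numbers below are invented for illustration), $\hat\lambda$ for each two-week window is just the average number of powder days observed per day in that window:</p>

```python
import numpy as np

rng = np.random.RandomState(0)
SEASONS, DAYS, WINDOW = 31, 364, 14      # 364 days = 26 two-week windows
THRESHOLD = 38                           # mm, the ~1.5 inch powder-day cutoff

# Synthetic stand-in for the NOAA records: daily snowfall in mm, with
# storms more likely mid-winter.
day = np.arange(DAYS)
storm_prob = 0.05 + 0.15 * np.cos(np.pi * (day - 30) / DAYS) ** 2
storms = rng.rand(SEASONS, DAYS) < storm_prob
snow = storms * rng.exponential(scale=40, size=(SEASONS, DAYS))

powder = snow > THRESHOLD    # boolean powder-day indicator per season and day
# lambda = (events observed) / (days observed), per two-week window,
# averaged over all 31 seasons
lam = powder.reshape(SEASONS, -1, WINDOW).sum(axis=(0, 2)) / (SEASONS * WINDOW)
assert lam.shape == (DAYS // WINDOW,)
```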
<p><img src="/assets/2016-powder/vail/lambda.png" alt="" /></p>
<p>To generate this graph, I did use some physical intuition. I said that the
probability of a powder day couldn’t change too quickly over time, and smoothed
out this curve over time. This corresponds to placing a Bayesian prior on
the likelihood of a powder day.</p>
<p>This plot just tells us when it snows, and it doesn’t say anything about how
much it snows. We know that in January, Colorado receives more snow (as
indicated by the first plot), but here $\lambda$ is lower.</p>
<p>But now we have the value of $\lambda$. Let’s see how probable powder days are
throughout the year! This will depend on how long our trip to
Colorado is – if we lived out there, we’d certainly see a powder day.</p>
<p>To calculate the probability that at least one powder day will occur on a 5-day
trip, given the value of $\lambda$ in the graph above, we compute</p>
<script type="math/tex; mode=display">% <![CDATA[
\Align{
\prob{N(a, b] \geq 1} &= 1 - \prob{N(a, b] = 0}\\
&=1 - \Exp{\lambda (b - a)}
} %]]></script>
<p>which is directly related to the <a href="https://en.wikipedia.org/wiki/Poisson_distribution">Poisson random variable</a> probability mass
function (which means that it’s easy to implement and found in <code class="highlighter-rouge">scipy.stats</code>).</p>
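<p>A minimal sketch of that computation, assuming a made-up rate <code class="highlighter-rouge">lam</code> rather than the fitted values from the plot:</p>

```python
import numpy as np
from scipy.stats import poisson

def prob_powder_day(lam, trip_days):
    """P(at least one powder day) for a Poisson process with rate `lam`
    (powder days per day) observed over `trip_days` days."""
    return 1 - poisson.pmf(0, lam * trip_days)

lam = 0.1   # hypothetical rate: one powder day every 10 days
p = prob_powder_day(lam, trip_days=5)
# P(N = 0) = e^{-lambda * (b - a)}, so this matches 1 - e^{-0.5}
assert np.isclose(p, 1 - np.exp(-0.5))
```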
<p><img src="/assets/2016-powder/vail/prob-of-pow-on-5-day-trip.png" alt="" /></p>
<hr />
<p>This is of great practical importance! I’m planning on taking a weekend trip to
<a href="http://www.mtbohemia.com">Mt. Bohemia</a> in Michigan to tap that <a href="https://en.wikipedia.org/wiki/Lake-effect_snow">lake effect snow</a>. Using <a href="http://www.ncdc.noaa.gov/cdo-web/datasets/GHCND/stations/GHCND:USC00201780/detail">NOAA weather
station 201789</a>, we can find out when to book a trip.</p>
<p><img src="/assets/2016-powder/michigan/prob-of-pow-on-2-day-trip.png" alt="" /></p>
<p>It looks like I’ll schedule my trip to be in early January or mid-February!</p>
<div class="footnotes">
<ol>
<li id="fn:github">
<p>The source for this post is available on GitHub at <a href="https://github.com/stsievert/powder-day-probability">stsievert/powder-day-probability</a> <a href="#fnref:github" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
Sun, 22 May 2016 03:30:00 -0500
http://stsievert.com/blog/2016/05/22/powder-days/
http://stsievert.com/blog/2016/05/22/powder-days/skiingpowderprobabilityA Bayesian analysis of Clinton’s 6 heads<p>Clinton recently won 6 coin flips during an Iowa caucus. On Facebook and in the
news, I’ve only seen information about how unlikely this is – the chances
of 6 heads are 1.56% with a fair coin.</p>
<p>Yes, 6 heads is unlikely but these coin flips could have occurred by chance. I
mean, on the <a href="https://www.washingtonpost.com/news/the-fix/wp/2016/02/02/heres-just-how-unlikely-hillary-clintons-6-for-6-coin-toss-victories-were/">Washington Post coin flip demo</a>, I got all heads on my 5th try.
Instead, it makes more sense to ask a different question: given that we observed
these 6 heads, what are the chances this coin wasn’t fair?<sup id="fnref:hypothesis"><a href="#fn:hypothesis" class="footnote">1</a></sup></p>
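<p>As a sketch of that Bayesian calculation (the interactive widget lives only on the blog; the uniform prior and the 0.6 cutoff below are my own choices), we can put a prior on the coin’s bias $p$ and update it after observing 6 heads:</p>

```python
import numpy as np

# Discretize the coin's bias p on a grid with a uniform prior.
p = np.linspace(0, 1, 10001)
prior = np.ones_like(p) / len(p)

# Likelihood of observing 6 heads in 6 flips for each candidate bias.
likelihood = p ** 6

posterior = prior * likelihood
posterior /= posterior.sum()

# Posterior probability that the coin is noticeably biased toward heads
# (the p > 0.6 cutoff is arbitrary, for illustration only).
print(posterior[p > 0.6].sum())

# With a uniform Beta(1, 1) prior, the posterior is Beta(7, 1),
# whose mean is 7/8; the grid approximation should agree.
mean = (p * posterior).sum()
assert abs(mean - 7 / 8) < 1e-3
```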
<div class="footnotes">
<ol>
<li id="fn:hypothesis">
<p>If we were <em>really</em> testing to see if the coin was unfair, it’d make more sense to do <a href="https://en.wikipedia.org/wiki/Statistical_hypothesis_testing">hypothesis testing</a> <a href="#fnref:hypothesis" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
Please visit my blog to see the full post -- it includes an interactive widget! (the reason I can't include the full post here)
Fri, 26 Feb 2016 06:30:00 -0600
http://stsievert.com/blog/2016/02/26/coin-flips/
http://stsievert.com/blog/2016/02/26/coin-flips/probabilitystatisticsbayesianGradient descent and physical intuition for heavy-ball acceleration with visualization<p><em>This post is a part 3 of a 3 part series: <a href="/blog/2015/11/19/inverse-part-1/">Part I</a>, <a href="/blog/2015/12/09/inverse-part-2/">Part II</a>, <a href="/blog/2016/01/30/inverse-3/">Part
III</a>.</em></p>
<p>We often make observations from some system and would like to infer something
about the system parameters, and many practical problems such as the <a href="https://en.wikipedia.org/wiki/Netflix_Prize">Netflix
Prize</a> can be reformulated this way. Typically, this involves making
observations of the form
$y = f(x)$ or $\yb = \Ab \cdot \xb$<sup id="fnref:scalar"><a href="#fn:scalar" class="footnote">1</a></sup> where $y$ is observed, $f/\Ab$ is
known and $x$ is the unknown variable of interest.</p>
<p>Finding the true $x$ that gave us our observations $y$ involves inverting a
function/matrix which can be costly time-wise and in the matrix case often
impossible. Instead, methods such as <a href="https://en.wikipedia.org/wiki/Gradient_descent">gradient descent</a> are often involved, a
technique common in machine learning and optimization.</p>
<p>In this post, I will try to provide calculus-level intuition for gradient
descent. I will also introduce and show the heavy-ball acceleration method for
gradient descent<sup id="fnref:video"><a href="#fn:video" class="footnote">2</a></sup> and provide a physical interpretation.</p>
<!--More-->
<p>In doing this, I will interchangeably use the words <a href="https://en.wikipedia.org/wiki/Derivative">derivative</a> (aka
$\deriv$) and <a href="https://en.wikipedia.org/wiki/Gradient">gradient</a> (aka $\grad$). The gradient is just the
<a href="https://en.wikipedia.org/wiki/Dimension_(vector_space)">high-dimensional</a> version of the derivative; all the intuition for the
derivative applies to the gradient.</p>
<h2 id="intuition-for-gradient-descent">Intuition for <a href="https://en.wikipedia.org/wiki/Gradient_descent">gradient descent</a></h2>
<p>In the problem setup it’s assumed we know the derivative/gradient of this
function, a fair assumption. The derivative yields information about (a) the
direction the function increases and (b) how fast it increases.</p>
<p>For example, we might be given a function, say $f(x) = x^2$. At the point
$x = 3$, the derivative $\deriv f(x)\at{x=3} = 2\cdot 3$ tells us the function
increases in the positive $x$ direction at a rate of 6.</p>
<p>The important piece of this is the <em>direction</em> the function increases in.
Because the derivative points in the direction the function increases in, the
negative gradient points in the direction the function <em>decreases</em> in.
Function minimization is common in optimization, meaning that the negative
gradient direction is typically used. If we’re at $x_k$ and we take a step in
the negative gradient direction of $f$, our function will get smaller, or</p>
<script type="math/tex; mode=display">x_{k+1} = x_k - \tau \grad{f}(x_k)</script>
<p>where $\tau$ is some step-size. This will converge to a minimum, possibly a
local one. We’re always stepping in the direction of the negative gradient; in
every step, we know that the function value gets smaller.</p>
<p>To implement this, we would use the code below. Note that to implement this
piece of code in higher dimensions, we would only define <code class="highlighter-rouge">x_hat = np.zeros(N)</code>
and make small changes to our function <code class="highlighter-rouge">grad</code>.</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="n">x_hat</span> <span class="o">=</span> <span class="mi">2</span>
<span class="n">tau</span> <span class="o">=</span> <span class="mf">0.02</span>
<span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">40</span><span class="p">):</span>
<span class="c"># grad is the gradient for some function, not shown</span>
<span class="n">x_hat</span> <span class="o">=</span> <span class="n">x_hat</span> <span class="o">-</span> <span class="n">tau</span><span class="o">*</span><span class="n">grad</span><span class="p">(</span><span class="n">x_hat</span><span class="p">)</span>
</code></pre>
</div>
<p>This makes no guarantees that the minimum is global; as soon as the gradient is
0, this process stops. It can’t get past any bumps because it reads the
derivative at a single point. In its simplest form, gradient descent makes
sense. We know the function gets smaller in a certain direction; we should step
in that direction.</p>
<h2 id="heavy-ball-acceleration">Heavy-ball acceleration</h2>
<p>Our cost function may include many smaller bumps, as in the picture below.
Gradient descent will fail here because the gradient goes to 0 before the
global minimum is reached.</p>
<p><img src="/assets/2016-inverse-3/objective_function_stay.png" alt="" width="350px" class="right" /></p>
<p>Gradient descent seems fragile in this sense. If it runs into any spots where
the gradient is 0, it will stop. It can run into a local minimum and can’t get
past it.</p>
<p>One way of getting around that is by using some momentum. Instead of focusing
on the gradient at one point, it would be advantageous to include momentum to
overcome temporary setbacks. This is analogous to a heavy ball rolling down a
hill: bumps only moderately affect it. The differential equation
that governs this motion in a gravity well described by $f$ is</p>
<script type="math/tex; mode=display">\ddot{x} + a \dot{x} + b \grad{f(x)} = 0</script>
<p>for positive constants $a > 0$ and $b > 0$, but this continuous equation isn’t that useful for computers. The discretization of this equation is given by</p>
<script type="math/tex; mode=display">(x_{k+1}-x_{k-1}) + a(x_{k+1} - x_k) + b\grad{f}(x_k) = 0\\</script>
<p>After some algebraic manipulations (shown in <a href="/blog/2016/01/19/inverse-3/#appendix">the appendix</a>) and defining $\tau := \frac{b}{1 + a}$ and $\beta := \frac{1}{1 + a}$, we can find</p>
<script type="math/tex; mode=display">x_{k+1} = x_k - \tau \grad{f}(x_k) + \beta \cdot(x_{k-1} - x_{k})</script>
<p>When casting this as a ball rolling down a hill, $a$ is friction and $b$ is
the strength of gravity. We would never expect friction to accelerate objects;
if it did, the ball would never settle and would climb out of any bowl.
Correspondingly, when $a < 0$ (i.e., when $\beta > 1$) this algorithm diverges!</p>
<p>Physical intuition has been provided, but does this hold in simulation? While
simulating, we should compare with the gradient descent method. The code below
implements this ball-accelerated method and is used to produce the video below
the code.</p>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="k">def</span> <span class="nf">ball_acceleration</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">c</span><span class="p">):</span>
<span class="s">"""
:input x: iterates. x[0] is most recent iterate, x[1] is last iterate
:input c: Array of constants. c[0] is step size, c[1] is momentum constant
:returns: Array of new iterates.
"""</span>
<span class="c"># grad is the gradient for some function, not shown</span>
<span class="n">update</span> <span class="o">=</span> <span class="n">x</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">-</span> <span class="n">c</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">*</span><span class="n">grad</span><span class="p">(</span><span class="n">x</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span> <span class="o">+</span> <span class="n">c</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span><span class="o">*</span><span class="p">(</span><span class="n">x</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">-</span> <span class="n">x</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span>
<span class="k">return</span> <span class="p">[</span><span class="n">update</span><span class="p">,</span> <span class="n">x</span><span class="p">[</span><span class="mi">0</span><span class="p">]]</span>
<span class="n">x_hat</span> <span class="o">=</span> <span class="p">[</span><span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">]</span>
<span class="n">tau</span><span class="p">,</span> <span class="n">weight</span> <span class="o">=</span> <span class="mf">0.02</span><span class="p">,</span> <span class="mf">0.8</span>
<span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">40</span><span class="p">):</span>
<span class="n">x_hat</span> <span class="o">=</span> <span class="n">ball_acceleration</span><span class="p">(</span><span class="n">x_hat</span><span class="p">,</span> <span class="p">[</span><span class="n">tau</span><span class="p">,</span> <span class="n">weight</span><span class="p">])</span>
</code></pre>
</div>
<p><a name="video"></a></p>
<div class="row text-center">
<video width="350" height="320" controls="">
<source src="/assets/2016-inverse-3/grad_descent.mp4" type="video/mp4" />
</video>
</div>
<p>We can see that this heavy-ball method acts like a ball rolling down a hill
with friction. It nearly stops and falls back down into the local minimum. It
settles at the bottom, near the global minimum.</p>
<p>This heavy-ball method is not guaranteed to converge to the global minimum,
even though it does in this example. Typically, the heavy-ball method is used to
get into the same region as the global minimum, and then a standard
optimization method with convergence guarantees takes over.</p>
<h2 id="higher-dimensions">Higher dimensions</h2>
<p>To show this in higher dimensions, I quickly coded up the same algorithm
above in higher dimensions (as shown in <a href="/blog/2016/01/19/inverse-3/#appendix">the appendix</a>). Essentially, only
small changes to the function <code class="highlighter-rouge">grad</code> were made. While doing this, I used
$\yb = \Ab \xb^\star$ to generate ground truth and plotted the distance from
$\xb^\star$. I made $\Ab \in \R^{50 \times 35}$ a tall matrix to represent
an overdetermined system, so solving it amounts to applying the <a href="https://en.wikipedia.org/wiki/Moore–Penrose_pseudoinverse">pseudoinverse</a>. This is a
problem formulation where gradient descent methods are used.</p>
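<p>As a quick sanity check on this formulation (my addition, not from the original post): for a noiseless overdetermined system, the least-squares solution is the pseudoinverse applied to $\yb$, and it recovers the ground truth exactly:</p>

```python
import numpy as np

M, N = 50, 35
A = np.random.rand(M, N)        # tall matrix: an overdetermined system
x_star = np.random.rand(N)      # ground truth
y = A @ x_star                  # noiseless measurements

x_pinv = np.linalg.pinv(A) @ y                   # Moore-Penrose pseudoinverse
x_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)  # least-squares solver

# y is noiseless and A almost surely has full column rank,
# so both recover x_star (up to floating-point error)
print(np.allclose(x_pinv, x_star), np.allclose(x_lstsq, x_star))
```

<p>Gradient descent on $\frac{1}{2}\|\Ab\xb - \yb\|_2^2$ converges to this same least-squares solution; the point of the plot below is how fast each method gets there.</p>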
<p>I then graphed the convergence rate. Since I knew the ground truth, I
plotted the $\ell_2$ distance to the true solution for both classic gradient
descent and this accelerated ball method.</p>
<p><img src="/assets/2016-inverse-3/high_dim.png" alt="" width="500px" class="center" /></p>
<p>This only shows that this ball-acceleration method is faster for linear systems;
a linear least-squares objective is convex and doesn’t have any <a href="https://en.wikipedia.org/wiki/Saddle_point">saddle points</a> like a non-convex function!</p>
<h3 id="further-reading">Further reading</h3>
<ul>
<li>An <a href="http://spin.atomicobject.com/2014/06/24/gradient-descent-linear-regression/">introduction to gradient descent</a>, an explanation of gradient descent
with code examples.</li>
<li><a href="http://blog.mrtz.org/2013/09/07/the-zen-of-gradient-descent.html">A mathematical approach</a> to the heavy-ball acceleration method</li>
<li><a href="https://blogs.princeton.edu/imabandit/2015/06/30/revisiting-nesterovs-acceleration/">Another mathematical approach</a> on Nesterov acceleration.</li>
<li>The lecture notes for lectures <a href="http://pages.cs.wisc.edu/~brecht/cs726docs/HeavyBall.pdf">2010-10-10</a> and <a href="http://pages.cs.wisc.edu/~brecht/cs726docs/HeavyBallLinear.pdf">2012-10-15</a> in <a href="http://www.eecs.berkeley.edu/~brecht/">Ben Recht</a>’s
class <a href="http://pages.cs.wisc.edu/~brecht/cs726.html">CS 726: Nonlinear optimization I</a>.</li>
<li><a href="http://www.sciencedirect.com/science/article/pii/0041555364901375">The academic paper</a> that introduced the heavy-ball method, published by the
Russian mathematician Polyak in 1964.</li>
</ul>
<h2 id="appendix">Appendix</h2>
<h3 id="high-dimensional-gradient-descent-code">High dimensional gradient descent code</h3>
<div class="language-python highlighter-rouge"><pre class="highlight"><code><span class="n">M</span><span class="p">,</span> <span class="n">N</span> <span class="o">=</span> <span class="mi">50</span><span class="p">,</span> <span class="mi">35</span>
<span class="n">A</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">rand</span><span class="p">(</span><span class="n">M</span><span class="p">,</span> <span class="n">N</span><span class="p">)</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">rand</span><span class="p">(</span><span class="n">N</span><span class="p">)</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">A</span> <span class="o">@</span> <span class="n">x</span>
<span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">seed</span><span class="p">(</span><span class="mi">42</span><span class="p">)</span>
<span class="n">initial_guess</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">rand</span><span class="p">(</span><span class="n">N</span><span class="p">)</span>
<span class="n">x_ball</span> <span class="o">=</span> <span class="p">[</span><span class="n">initial_guess</span><span class="o">.</span><span class="n">copy</span><span class="p">(),</span> <span class="n">initial_guess</span><span class="o">.</span><span class="n">copy</span><span class="p">()]</span>
<span class="n">x_grad</span> <span class="o">=</span> <span class="n">initial_guess</span><span class="o">.</span><span class="n">copy</span><span class="p">()</span>
<span class="n">c</span> <span class="o">=</span> <span class="p">[</span><span class="mi">1</span><span class="o">/</span><span class="n">np</span><span class="o">.</span><span class="n">linalg</span><span class="o">.</span><span class="n">norm</span><span class="p">(</span><span class="n">A</span><span class="p">)</span><span class="o">**</span><span class="mi">2</span><span class="p">,</span> <span class="mf">0.9</span><span class="p">]</span>
<span class="k">for</span> <span class="n">k</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">500</span><span class="p">):</span>
<span class="n">x_grad</span> <span class="o">=</span> <span class="n">x_grad</span> <span class="o">-</span> <span class="n">c</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">*</span><span class="n">grad</span><span class="p">(</span><span class="n">A</span><span class="p">,</span> <span class="n">x_grad</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>
<span class="n">x_ball</span> <span class="o">=</span> <span class="n">ball_acceleration</span><span class="p">(</span><span class="n">A</span><span class="p">,</span> <span class="n">x_ball</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">c</span><span class="p">)</span>
</code></pre>
</div>
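<p>The loop above assumes high-dimensional versions of <code class="highlighter-rouge">grad</code> and <code class="highlighter-rouge">ball_acceleration</code>. Here is my reconstruction of what they could look like for the least-squares objective $f(\xb) = \frac{1}{2}\|\Ab\xb - \yb\|_2^2$ (a sketch under that assumption, not the post’s exact code):</p>

```python
import numpy as np

def grad(A, x, y):
    # gradient of f(x) = 0.5 * ||A @ x - y||**2
    return A.T @ (A @ x - y)

def ball_acceleration(A, x, y, c):
    # x = [current iterate, previous iterate]
    # c = [step size, momentum weight]
    update = x[0] - c[0]*grad(A, x[0], y) + c[1]*(x[1] - x[0])
    return [update, x[0]]
```

<p>The only change from the one-dimensional version is that the gradient now depends on $\Ab$ and $\yb$; the momentum term is untouched.</p>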
<h3 id="algebraic-manipulations">Algebraic manipulations</h3>
<p>The discretization of $\ddot{x} + a\dot{x} + b\grad{f}(x) = 0$ is given by</p>
<script type="math/tex; mode=display">\align{
(x_{k+1} - x_{k-1}) + a (x_{k+1} - x_k) + b \grad{f}(x_k) = 0\\
}</script>
<p>Simplifying (with the if-and-only-if’s omitted from this algebraic
manipulation), we see that</p>
<script type="math/tex; mode=display">% <![CDATA[
\align{
(1 + a) x_{k+1} &= ax_k + x_{k-1} - b \grad{f}(x_k)\\
(1 + a) x_{k+1} &= (a + 1)x_k + x_{k-1} - x_k - b \grad{f}(x_k)\\
x_{k+1} &= x_k + \frac{1}{1+a} (x_{k-1} - x_k) - \frac{b}{1+a} \grad{f}(x_k)\\
} %]]></script>
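<p>A quick numerical check (my addition) that the implicit form $(1+a)x_{k+1} = ax_k + x_{k-1} - b\grad{f}(x_k)$ and the rearranged explicit update in the last line agree at every step:</p>

```python
a, b = 0.5, 0.1                # arbitrary damping and step constants
grad_f = lambda x: 2 * x       # f(x) = x**2, an arbitrary test function

x_prev, x_curr = 1.0, 0.9
for _ in range(20):
    # implicit form: (1 + a) x_{k+1} = a x_k + x_{k-1} - b grad f(x_k)
    implicit = (a*x_curr + x_prev - b*grad_f(x_curr)) / (1 + a)
    # explicit rearrangement derived above
    explicit = x_curr + (x_prev - x_curr)/(1 + a) - b*grad_f(x_curr)/(1 + a)
    assert abs(implicit - explicit) < 1e-12
    x_prev, x_curr = x_curr, implicit
```

<p>Both forms produce identical iterates, as the algebra says they must.</p>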
<div class="footnotes">
<ol>
<li id="fn:scalar">
<p>I’ll use plain font/bold font for scalars/vectors (respectively) as per my <a href="/math/notation">notation sheet</a>. <a href="#fnref:scalar" class="reversefootnote">↩</a></p>
</li>
<li id="fn:video">
<p>With <a href="/blog/2016/01/30/inverse-3/#video">a video</a>! <a href="#fnref:video" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
Sat, 30 Jan 2016 06:30:00 -0600
http://stsievert.com/blog/2016/01/30/inverse-3/
http://stsievert.com/blog/2016/01/30/inverse-3/machine-learningoptimizationlinear-algebra