Dask’s machine learning package, Dask-ML, now implements Hyperband, an advanced “hyperparameter optimization” algorithm that performs rather well. This post will

- describe “hyperparameter optimization”, a common problem in machine learning
- describe Hyperband’s benefits and why it works
- show how to use Hyperband via example alongside performance comparisons

In this post, I’ll walk through a practical example and highlight key portions of the paper “Better and faster hyperparameter optimization with Dask”, which is also summarized in a ~25 minute SciPy 2019 talk.


Machine learning requires data, an untrained model and “hyperparameters”: parameters chosen before training begins that help adapt the model to the data. The user needs to specify values for these hyperparameters in order to use the model. A good example is adapting ridge regression or LASSO to the amount of noise in the data with the regularization parameter.^{1}
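To make that concrete, here’s a small sketch using the closed-form ridge solution (the data and `alpha` values here are illustrative, not from the post):

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(100, 20)
y = X @ rng.randn(20) + 0.5 * rng.randn(100)  # noisy linear data

def ridge_weights(X, y, alpha):
    """Closed-form ridge regression: w = (X'X + alpha*I)^{-1} X'y.
    alpha is a hyperparameter -- it's chosen before fitting, and the
    best value depends on how noisy the data are."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ y)

# Larger alpha shrinks the weights toward zero (more regularization).
for alpha in [0.01, 1.0, 100.0]:
    w = ridge_weights(X, y, alpha)
    print(alpha, np.linalg.norm(w))
```

The “right” `alpha` isn’t knowable in advance; it has to be chosen by the user or found by a search.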

Model performance strongly depends on the hyperparameters provided. A fairly complex example is with a particular visualization tool, t-SNE. This tool requires (at least) three hyperparameters, and performance depends radically on them. In fact, the first section in “How to Use t-SNE Effectively” is titled “Those hyperparameters really matter”.

Finding good values for these hyperparameters is critical and has an entire Scikit-learn documentation page, “Tuning the hyperparameters of an estimator.” Briefly, finding decent values of hyperparameters is difficult and requires guessing or searching.
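For illustration, the “searching” approach might look like this with Scikit-learn’s `RandomizedSearchCV` (the model and parameter range here are my own illustrative choices):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=200, random_state=0)

# Sample 10 hyperparameter values at random and cross-validate each one.
search = RandomizedSearchCV(
    SGDClassifier(random_state=0),
    {"alpha": np.logspace(-5, -1, num=100)},  # regularization strength
    n_iter=10,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_)
```

Every sampled value is trained to completion, which is exactly the inefficiency Hyperband addresses below.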

**How can these hyperparameters be found quickly and efficiently with an advanced task scheduler like Dask?** Parallelism will pose some challenges, but the Dask architecture enables some advanced algorithms.

*Note: this post presumes knowledge of Dask basics. This material is covered in Dask’s documentation on Why Dask?, a ~15 minute video introduction to Dask, a video introduction to Dask-ML and a blog post I wrote on my first use of Dask.*

Dask-ML can quickly find high-performing hyperparameters. I will back this claim with intuition and experimental evidence.

Specifically, this is because Dask-ML now implements an algorithm introduced by Li et al. in “Hyperband: A novel bandit-based approach to hyperparameter optimization”. The pairing of Dask and Hyperband enables some exciting new performance opportunities, especially because Hyperband has a simple implementation and Dask is an advanced task scheduler.^{2}

Let’s go through the basics of Hyperband, then illustrate its use and performance with an example. This will highlight some key points of the corresponding paper.

The motivation for Hyperband is to find high performing hyperparameters with minimal training. Given this goal, it makes sense to spend more time training high performing models – why waste more training time on a model that has performed poorly in the past?

One method to spend more time on high performing models is to initialize many models, start training all of them, and then stop training low performing models before training is finished. That’s what Hyperband does. At the most basic level, Hyperband is a (principled) early-stopping scheme for RandomizedSearchCV.
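A minimal sketch of this early-stopping idea (this is “successive halving”, the subroutine Hyperband builds on; the scoring here is simulated, not a real training loop):

```python
import random

def successive_halving(models, score, n_calls=1, eta=3):
    """Train all models a little, keep the best 1/eta, repeat.
    `score(model, n)` returns the model's score after n more
    partial_fit-style calls (simulated here)."""
    models = list(models)
    while len(models) > 1:
        scores = {m: score(m, n_calls) for m in models}
        models.sort(key=lambda m: scores[m], reverse=True)
        models = models[: max(1, len(models) // eta)]  # cull low performers
        n_calls *= eta  # survivors get more training
    return models[0]

# Simulated example: each "model" is a convergence rate; higher is better.
random.seed(0)
rates = [random.random() for _ in range(27)]
best = successive_halving(rates, score=lambda m, n: m * (1 - 0.5 ** n))
print(best)
```

With 27 models and `eta=3`, only 1 model receives the full training budget; the other 26 are stopped early.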

Deciding when to stop the training of models depends on how strongly the training data affect the score. There are two extremes:

- when only the training data matter (i.e., when the hyperparameters don’t influence the score at all)
- when only the hyperparameters matter (i.e., when the training data don’t influence the score at all)

Hyperband balances these two extremes by sweeping over how frequently models are stopped. This sweep allows a mathematical proof that Hyperband will find the best model possible with minimal `partial_fit` calls.^{3}

Hyperband has significant parallelism because it has two “embarrassingly parallel” for-loops – Dask can exploit this. Hyperband has been implemented in Dask, specifically in Dask’s machine learning library Dask-ML.
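As a rough sketch of those two loops, here’s how the brackets can be enumerated (this follows Algorithm 1 of Li et al. up to rounding choices; it is not Dask-ML’s exact internals):

```python
def hyperband_brackets(max_iter, eta=3):
    """Enumerate Hyperband's brackets as (n models, r initial partial_fit
    calls per model). Each bracket is an independent successive-halving
    run, so the outer loop over brackets and the inner loop over models
    within a bracket are both embarrassingly parallel."""
    s_max = 0
    while eta ** (s_max + 1) <= max_iter:
        s_max += 1
    brackets = []
    for s in range(s_max, -1, -1):  # one bracket per aggressiveness level s
        n = -(-((s_max + 1) * eta ** s) // (s + 1))  # ceiling division
        r = max_iter // eta ** s
        brackets.append((n, r))
    return brackets

print(hyperband_brackets(81))
```

The most aggressive bracket starts many models and stops most of them almost immediately; the least aggressive bracket trains a handful of models to completion.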

How well does it perform? Let’s illustrate via example. Some setup is required before the performance comparison in *Performance*.

*Note: want to try HyperbandSearchCV out yourself? Dask has an example use. It can even be run in-browser!*

I’ll illustrate with a synthetic example. Let’s build a dataset with 4 classes:

```python
>>> from experiment import make_circles
>>> X, y = make_circles(n_classes=4, n_features=6, n_informative=2)
>>> scatter(X[:, :2], color=y)
```

*Note: this content is pulled from stsievert/dask-hyperband-comparison, with slight modifications.*

Let’s build a fully connected neural net with 24 neurons for classification:

```python
>>> from sklearn.neural_network import MLPClassifier
>>> model = MLPClassifier()
```

Building the neural net with PyTorch is also possible^{4} (and what I used in development).

This neural net’s behavior is dictated by 7 hyperparameters. Only one controls the architecture of the optimal model (`hidden_layer_sizes`, the number of neurons in each layer). The rest control finding the best model of that architecture. Details on the hyperparameters are in the *Appendix*.

```python
>>> params = ...  # details in appendix
>>> params.keys()
dict_keys(['hidden_layer_sizes', 'alpha', 'batch_size', 'learning_rate',
           'learning_rate_init', 'power_t', 'momentum'])
>>> params["hidden_layer_sizes"]  # always 24 neurons
[(24, ), (12, 12), (6, 6, 6, 6), (4, 4, 4, 4, 4, 4), (12, 6, 3, 3)]
```

I chose these hyperparameters to have a complex search space that mimics the searches performed for most neural networks. These searches typically involve hyperparameters like “dropout”, “learning rate”, “momentum” and “weight decay”.^{5} End users don’t care about hyperparameters like these; they don’t change the model architecture, only how the best model of a particular architecture is found.

How can high performing hyperparameter values be found quickly?

First, let’s look at the parameters required for Dask-ML’s implementation of Hyperband (which is in the class `HyperbandSearchCV`).

`HyperbandSearchCV` has two inputs:

- `max_iter`, which determines how many times to call `partial_fit`
- the chunk size of the Dask array, which determines how much data each `partial_fit` call receives

These fall out pretty naturally once it’s known how long to train the best model and very approximately how many parameters to sample:

```python
n_examples = 50 * len(X_train)  # 50 passes through dataset for best model
n_params = 299  # sample about 300 parameters

# inputs to hyperband
max_iter = n_params
chunk_size = n_examples // n_params
```

The inputs to this rule-of-thumb are exactly what the user cares about:

- a measure of how complex the search space is (via `n_params`)
- how long to train the best model (via `n_examples`)

Notably, there’s no tradeoff between `n_examples` and `n_params` like with Scikit-learn’s `RandomizedSearchCV` because `n_examples` is only for *some* models, not for *all* models. There are more details on this rule-of-thumb in the “Notes” section of the `HyperbandSearchCV` docs.

With these inputs a `HyperbandSearchCV` object can easily be created. This model selection algorithm, Hyperband, is implemented in the class `HyperbandSearchCV`; let’s create an instance:

```python
>>> from dask_ml.model_selection import HyperbandSearchCV
>>>
>>> search = HyperbandSearchCV(
...     model, params, max_iter=max_iter, aggressiveness=4
... )
```

`aggressiveness` defaults to 3. `aggressiveness=4` is chosen because this is an *initial* search; I know nothing about this search space, so this search should be more aggressive in culling off bad models.

Hyperband hides some details from the user (which enables the mathematical guarantees), specifically the details on the amount of training and the number of models created. These details are available in the `metadata` attribute:

```python
>>> search.metadata["n_models"]
378
>>> search.metadata["partial_fit_calls"]
5721
```

Now that we have some idea of how long the computation will take, let’s ask it to find the best set of hyperparameters:

```python
>>> from dask_ml.model_selection import train_test_split
>>> X_train, X_test, y_train, y_test = train_test_split(X, y)
>>>
>>> X_train = X_train.rechunk(chunk_size)
>>> y_train = y_train.rechunk(chunk_size)
>>>
>>> search.fit(X_train, y_train)
```

The dashboard will be active during this time^{6}:


How well do these hyperparameters perform?

```python
>>> search.best_score_
0.9019221418447483
```

`HyperbandSearchCV` mirrors Scikit-learn’s API for RandomizedSearchCV, so it has access to all the expected attributes and methods:

```python
>>> search.best_params_
{"batch_size": 64, "hidden_layer_sizes": [6, 6, 6, 6], ...}
>>> search.score(X_test, y_test)
0.8989070100111217
>>> search.best_estimator_
MLPClassifier(...)
```

Details on the attributes and methods are in the HyperbandSearchCV documentation.

I ran this 200 times on my personal laptop with 4 cores. Let’s look at the distribution of final validation scores:

The “passive” comparison is really `RandomizedSearchCV` configured so it takes an equal amount of work as `HyperbandSearchCV`. Let’s see how this does over time:

This graph shows the mean score over the 200 runs with the solid line, and the shaded region represents the interquartile range. The dotted green line indicates the data required to train 4 models to completion. “Passes through the dataset” is a good proxy for “time to solution” because there are only 4 workers.

This graph shows that `HyperbandSearchCV` will find parameters at least 3 times quicker than `RandomizedSearchCV`.

What opportunities does combining Hyperband and Dask create? `HyperbandSearchCV` has a lot of internal parallelism and Dask is an advanced task scheduler.

The most obvious opportunity involves job prioritization. Hyperband fits many models in parallel and Dask might not have that many workers available. This means some jobs have to wait for other jobs to finish. Of course, Dask can prioritize jobs^{7} and choose which models to fit first.

Let’s assign the priority for fitting a certain model to be the model’s most recent score. How does this prioritization scheme influence the score? Let’s compare the prioritization schemes in a single run of the 200 above:

These two lines are the same in every way except for the prioritization scheme. This graph compares the “high scores” prioritization scheme and Dask’s default prioritization scheme (“fifo”).

This graph is certainly helped by the fact that it’s run with only 4 workers. Job priority does not matter if every job can be run right away (there’s nothing to assign priority to!).
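Dask exposes these priorities through the public `priority=` keyword of `Client.submit`; a sketch of how a model’s recent score might be used as its priority (the fit-and-score function is a stand-in, not Dask-ML’s internals):

```python
from distributed import Client

def partial_fit_and_score(model_state):
    """Stand-in for one block of training; returns (state, score)."""
    state, score = model_state
    return state + 1, score + 0.1  # pretend the score improved

if __name__ == "__main__":
    client = Client(processes=False)  # in-process scheduler for the demo
    models = {i: (0, i / 10) for i in range(4)}  # id -> (state, score)
    futures = {
        # Higher recent score => higher priority => fit sooner.
        i: client.submit(partial_fit_and_score, ms, priority=ms[1])
        for i, ms in models.items()
    }
    results = client.gather(list(futures.values()))
    client.close()
    print(results)
```

When jobs queue up, the scheduler works on the highest-priority (highest-scoring) models first.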

How does Hyperband scale with the number of workers?

I ran another separate experiment to measure this. This experiment is described more in the corresponding paper, but the relevant difference is that a PyTorch neural network is used through skorch instead of Scikit-learn’s MLPClassifier.

I ran the *same* experiment with a different number of Dask workers.^{8} Here’s how `HyperbandSearchCV` scales:

Training one model to completion requires 243 seconds (which is marked by the white line). This is a comparison with `patience`, which stops training models if their scores aren’t increasing enough. Functionally, this is very useful because the user might accidentally specify `n_examples` to be too large.

It looks like the speedups start to saturate somewhere between 16 and 24 workers, at least for this example. Of course, `patience` doesn’t work as well for a large number of workers.^{9}

There are some ongoing pull requests to improve `HyperbandSearchCV`. The most significant of these involves tweaking some Hyperband internals so `HyperbandSearchCV` works better with initial or very exploratory searches (dask/dask-ml #532).

The biggest improvement I see is treating *dataset size* as the scarce resource that needs to be preserved instead of *training time*. This would allow Hyperband to work with any model, instead of only models that implement `partial_fit`.

Serialization is an important part of the distributed Hyperband implementation in `HyperbandSearchCV`. Scikit-learn and PyTorch can easily handle this because they support the Pickle protocol,^{10} but Keras/Tensorflow/MXNet present challenges. The use of `HyperbandSearchCV` could be increased by resolving this issue.

I chose to tune 7 hyperparameters, which are

- `hidden_layer_sizes`, which controls the number of neurons in each layer
- `alpha`, which controls the amount of regularization

More hyperparameters control finding the best neural network:

- `batch_size`, which controls the number of examples the optimizer uses to approximate the gradient
- `learning_rate`, `learning_rate_init` and `power_t`, which control some basic hyperparameters for the SGD optimizer I’ll be using
- `momentum`, a more advanced hyperparameter for SGD with Nesterov’s momentum

1. Which amounts to choosing `alpha` in Scikit-learn’s Ridge or LASSO
2. To the best of my knowledge, this is the first implementation of Hyperband with an advanced task scheduler
3. More accurately, Hyperband will find close to the best model possible with $N$ `partial_fit` calls in expected score with high probability, where “close” means “within log terms of the upper bound on score”. For details, see Corollary 1 of the corresponding paper or Theorem 5 of Hyperband’s paper.
5. There’s less tuning for adaptive step size methods like Adam or Adagrad, but they might under-perform on the test data (see “The Marginal Value of Adaptive Gradient Methods for Machine Learning”)
6. But it probably won’t be this fast: the video is sped up by a factor of 3.
7. See Dask’s documentation on Prioritizing Work
8. Everything is the same between different runs: the hyperparameters sampled, the model’s internal random state, the data passed for fitting. Only the number of workers varies.
9. There’s no time benefit to stopping jobs early if there are infinite workers; there’s never a queue of jobs waiting to be run
10. “Pickle isn’t slow, it’s a protocol” by Matthew Rocklin

**Launching tasks from workers with Dask**
*Scott Sievert, 2018-01-04, http://stsievert.com/blog/2018/01/04/dask-get-client/*
Dask recently surprised me with its flexibility in a recent use case, even more than the basic use detailed in a previous post. I’ll walk through my use case and the interesting problem it highlights. I’ll show a toy solution and point to the relevant parts of the Dask documentation.


I need to perform hyperparameter optimization on deep neural networks in my job. Most networks assume certain values are given, but actually *finding* those values is difficult and requires “hyperparameter optimization.” A scikit-learn documentation page has more detail on this issue.

For my use case, model evaluations take about 15 minutes to complete and I have to tune 4 different values for 18 different models. Grid search or random search (two popular methods for hyperparameter optimization) are infeasible because they take too long for the resolution I desire.^{1}
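To make “too long” concrete, a back-of-the-envelope calculation (the resolution of 10 values per hyperparameter is my illustrative choice, not a number from the post):

```python
# 15 minutes per evaluation, 4 hyperparameters, 18 models.
minutes_per_eval = 15
n_models = 18
resolution = 10  # illustrative: 10 candidate values per hyperparameter

evals_per_model = resolution ** 4  # grid search over 4 hyperparameters
total_days = n_models * evals_per_model * minutes_per_eval / 60 / 24
print(total_days)  # 1875.0 days of compute on a single machine
```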

I eventually gave up on tuning 4 parameters for each model, and decided to tune only 1 parameter per model. However, grid or random search would still take too long. These algorithms don’t adapt to previous evaluations and don’t use information they’ve already collected.

This one hyperparameter is pretty simple, and doesn’t require a complex search. A simple recursive function would do and could be implemented quickly.^{2} Either way, my implementation highlighted an interesting software problem.
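The kind of simple recursive function I mean might look like this bisection-style refinement over one hyperparameter (a sketch, not the exact function I used):

```python
def refine(score, lo, hi, depth=3):
    """Recursively zoom in on the best value of one hyperparameter.
    `score` maps a hyperparameter value to a validation score."""
    if depth == 0:
        return (lo + hi) / 2
    mid = (lo + hi) / 2
    # Keep the half around the better midpoint and recurse into it.
    if score((lo + mid) / 2) >= score((mid + hi) / 2):
        return refine(score, lo, mid, depth - 1)
    return refine(score, mid, hi, depth - 1)

# Toy score function peaked at 0.3:
best = refine(lambda x: -(x - 0.3) ** 2, 0.0, 1.0, depth=5)
print(best)
```

Unlike grid or random search, each recursive call uses the scores already collected to decide where to look next.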

Dask is the right tool for my setup, especially given the fact that one machine can only fit one model and I want to evaluate many models in parallel. I’ve illustrated my use case below with a Fibonacci function, which is the simplest example that illustrates how Dask can be used to solve my problem.

Both my hyperparameter search and this Fibonacci function are recursive.^{3} The solution I first started out with is

```python
from distributed import Client

def fib(n):
    if n <= 1:
        return n
    return fib(n - 1) + fib(n - 2)

if __name__ == "__main__":
    client = Client()

    x = [6, 7, 8, 9]
    futures = client.map(fib, x)
    y = client.gather(futures)
```

This only distributes the evaluations of the first call to `fib`, not any recursive calls to `fib`. This means that `fib(1)` is evaluated by every node, and the scheduler never knows `fib(1)` is evaluated.

But I want this recursive function to submit more jobs to the scheduler. That is, when each recursive function runs on a worker, I want that worker to submit more tasks to the scheduler.

Dask does support this, and it’s described in “Launching Tasks from Tasks”. It steps through the same example in more detail, and I encourage you to read it. Discovering this simplified the way I approached the problem; I didn’t need to perform all task submission on the scheduler.

The core of the documentation page mentions that I can turn my function into

```python
from distributed import get_client, secede, rejoin

def fib(n):
    if n <= 1:
        return n
    client = get_client()
    futures = client.map(fib, [n - 1, n - 2])
    secede()
    a, b = client.gather(futures)
    rejoin()
    return a + b
```

This functionality allows me to use my Amazon EC2 instances optimally. I can dedicate one instance to run the scheduler, and the rest of my spot instances can be workers. I can add workers freely, and removing workers will have a limited but not massive impact.

1. And especially with Amazon’s constraints on `p2.xlarge` machines
2. If it wasn’t so simple or I had more hyperparameters, I would have needed one of the many other hyperparameter optimization algorithms.
3. The Fibonacci function doesn’t have to be recursive: see my post “Applying eigenvalues to the Fibonacci problem”

**PyTorch: fast and simple**
*Scott Sievert, 2017-09-07, http://stsievert.com/blog/2017/09/07/pytorch/*

I recently came across PyTorch, a new technology primed for optimization and machine learning. The docs make it look attractive, so immediately I wondered “how does it compare with NumPy?”

Turns out it’s a pretty nice framework that’s fast and straightforward to use. I’ll detail the speed before talking about ease-of-use.

The largest difference is gradient computation, and the largest potential slow-down. PyTorch automatically computes the gradient given past computations, whereas in NumPy gradients have to be explicitly computed. Computing gradients is part of my daily workflow, and slowness here would mean that I could not use PyTorch.

I expected NumPy to be faster while computing gradients. How could it not be? It’s been around for a long time and has been heavily optimized. It’s a mature piece of software and widely used. Because of this, I expected NumPy to be at least 2x faster than PyTorch. Turns out that’s what we see when our implementation is tuned.^{1}

PyTorch is fast. For even moderate dimensions (say 1000 observations) it’s within a factor of 2 of NumPy. For more realistic dimensions (like 10,000 observations) it’s remarkably close.

\n\n\n\nwhich shows the time to compute the least squares gradient (the gradient with\nrespect to $x$ of $\\norm{y - Ax}^2_2$ when $A \\in \\R^{10d~\\times~d}$ where\n$10\\cdot d$ is the number of observations).

For moderate dimensions, PyTorch is as fast as NumPy when bound to the CPU – using a GPU with PyTorch can provide additional acceleration. Plus, PyTorch avoids nasty pitfalls like the one above; due to a small mistake, my NumPy code ran 8x slower than it could.

I got curious and applied this to HIPS/autograd as well – it’s the most straightforward solution to connect automatic differentiation with NumPy. It’s fairly close to the PyTorch performance, at least for not-small $d$. I believe that HIPS/autograd and PyTorch are both using reverse mode automatic differentiation.^{2}

If I had a GPU on my local machine PyTorch would be even faster. I could have made NumPy faster by using Numba’s CUDA GPU support and my earlier post “NumPy GPU acceleration”, but I wanted to test Anaconda’s default configuration.^{3}

There are other libraries^{4} that have these same speed results – what else does PyTorch offer?

PyTorch is not a Python binding to a monolithic C++ framework. Instead, most of the functionality is implemented as Python classes. This means that it’s easy to subclass these methods to write the code you want while having the functionality of PyTorch, and it’s easy to compare against other methods implemented in PyTorch. They even have a page titled “Extending PyTorch” in their docs!

The conversion between PyTorch tensors and NumPy arrays is *simple*, as the NumPy `ndarray` and PyTorch `Tensor` share the same memory locations (source). This can lead to significant time savings, especially when large arrays are used.
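A quick demonstration of that memory sharing, using `torch.from_numpy` and `Tensor.numpy`:

```python
import numpy as np
import torch

a = np.zeros(3)
t = torch.from_numpy(a)  # no copy: tensor and array share memory

a[0] = 42.0          # mutate the NumPy array...
print(t[0].item())   # ...and the tensor sees the change

b = t.numpy()        # the reverse direction also shares memory
t[1] = 7.0
print(b[1])          # the array sees the tensor's change
```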

This means that it’s easy and fast to extend PyTorch with NumPy and SciPy. In the docs, they step through creating an extension with SciPy.

This is significant, and there are large speed benefits to this! When I compare converting to a NumPy $n \times n$ array from a Tensorflow or PyTorch tensor, I see this timing comparison:

\n\n\n\nThat’s right – PyTorch is over 1000x faster than TensorFlow when converting to\na 1000 $\\times$ 1000 NumPy array!

This means we can use all of NumPy and SciPy without any fear of slowing our program down.

The biggest difference between PyTorch and other ML frameworks (Tensorflow, CNTK, MXNet, etc) is that PyTorch has a **dynamic** computational graph, not a **static** computational graph. This allows for significant ease of use.

One benefit of this is that code executes when you expect. With dynamic computation graphs, tracebacks are easy to follow and they can use control flow as expected. Libraries that have a static computation graph have to implement their own control flow; for example, see Tensorflow’s control flow docs or an SO question on difficulty with timing in Tensorflow.
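A small example of that, written with the current PyTorch API (the 2017-era `Variable` wrapper this post predates is no longer needed):

```python
import torch

def f(x):
    # Ordinary Python control flow decides the graph at run time --
    # no special tf.cond / tf.while_loop-style constructs needed.
    if x.sum() > 0:
        return (x * 2).sum()
    return (x ** 2).sum()

x = torch.ones(3, requires_grad=True)
y = f(x)       # takes the first branch: y = 2 * sum(x)
y.backward()   # gradient of 2 * sum(x) w.r.t. x is [2, 2, 2]
print(x.grad)
```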

In my experience, it’s required to hold more mental state for Tensorflow models than with PyTorch. PyTorch has clear function arguments because the code executes when expected. It’s not necessary to link together the input data to the model and (in my experience) there are fewer global variables.

- **torch.multiprocessing**. Similar to the standard Python multiprocessing, but “with magical memory sharing of torch Tensors across processes.”
  - They even have an example Hogwild implementation!
- **torch.distributed** to communicate between distributed machines.
- **GPU access** which can speed up code as exemplified above.
- PyTorch is memory efficient: “The memory usage in PyTorch is extremely efficient compared to Torch or some of the alternatives”, according to pytorch.org.

PyTorch is already an attractive package, but they also offer

- **Datasets and pretrained models** at pytorch/vision
- **Many examples and implementations**, with a subset available at
- **A strong community** with a discussion board and an SO tag

- Chainer has a good comparison of many deep learning frameworks including PyTorch, Tensorflow and MXNet: http://chainer.readthedocs.io/en/latest/comparison.html (added 2017-09-19)
- If you care about speed (added 2017-09-19):
  - A speed comparison between many different frameworks can be found at soumith/convnet-benchmarks. This measures many different frameworks, but notably Torch and not PyTorch (but this tweet by a PyTorch core dev says they both call the same C libraries).
  - Some anecdotal evidence (tensorflow#9322, tensorflow#7065, PyTorch forum thread) points to PyTorch being faster than Tensorflow.
- On exporting models (added 2018-02-17):
  - PyTorch has good support for exporting to the Open Neural Network Exchange (ONNX) through torch.onnx
  - A well written article on deploying a PyTorch-trained NN on iOS: “How I Shipped a Neural Network on iOS with CoreML, PyTorch, and React Native”
  - And here’s a good graph I found on the export options of different libraries and the systems they can run on: http://tvmlang.org/2017/10/06/nnvm-compiler-announcement.html
  - ONNX to CoreML: https://github.com/onnx/onnx-coreml
- fast.ai announced they’re “Introducing PyTorch for fast.ai” (added 2017-09-08). Their motivation includes:
  - “in a recent Kaggle competition [PyTorch] was used by nearly all of the top 10 finishers”
  - “Much to our surprise, we also found that many models trained quite a lot faster on pytorch than they had on Tensorflow.”
  - “Because Pytorch allowed us, and our students, to use all of the flexibility and capability of regular python code to build and train neural networks, we were able to tackle a much wider range of problems.”
- O’Reilly podcast on PyTorch, part of my motivation for checking out PyTorch
  - PyTorch’s core development team has 4 members
- I think PyTorch performs reverse-mode auto-differentiation.
  - Other autograd implementations (and inspiration for PyTorch): HIPS/autograd, twitter/torch-autograd, Chainer.
  - It looks like it performs reverse accumulation automatic differentiation
- PyTorch can work with tensorboard with tensorboard-pytorch
- A good overview between Theano+Lasagne, PyTorch and Tensorflow on Reddit’s /r/machinelearning by /u/ajmooch
- Inspired by Chainer and similar to TensorFlow, Theano, Caffe and CNTK
1. Refactoring my NumPy code to use `2 * (x*y)` instead of `(2*x) * y` led to an 8x improvement in speed, as discovered in ScratchNet#3.
2. I added this paragraph on 2018-08-04.
3. Which includes MKL and has other optimizations (maybe Intel’s TBB?)
4. like Tensorflow, MXNet and Theano (google trends).

**Holoviews interactive visualization**
*Scott Sievert, 2017-07-22, http://stsievert.com/blog/2017/07/22/holoviews/*

I often want to provide some simple interactive visualizations for this blog. I like to include visualization to give some sense of how the data change as various parameters are changed. Examples can be found in *Finding sparse solutions to linear systems*, *Least squares and regularization*, and *Computer color is only kinda broken*.

I have discovered a new tool, Holoviews, to create these widgets. I want to create these interactive widgets for my blog, meaning I want to embed them in a static HTML page. Previously, I used Jake Vanderplas’s ipywidgets-static, but in this post I’ll walk through creating a widget with Holoviews.

```python
import holoviews as hv
hv.extension('bokeh')
import numpy as np
import pandas as pd
```

When I create a widget, I’m often showing the result of some function over different parameters. Let’s define that function:

```python
def f(param1, param2, opt_param=0):
    result = param1 + param2
    result += np.random.randn() * opt_param
    return {'param1': param1, 'param2': param2,
            'result': result, 'opt_param': opt_param}
```

The visualization will use `param1` and `param2`. It will provide some interactivity with `opt_param`. But of course we need to evaluate this function with these parameters:

```python
param1 = np.logspace(0, 1)
param2 = np.logspace(0, 1)
opt_param = np.linspace(0, 1, num=5)

data = [f(p1, p2, opt_param=op)
        for p1 in param1 for p2 in param2 for op in opt_param]
df = pd.DataFrame(data)
```

Now the Holoviews-specific material starts, though it’s pretty minimal. First, I’m going to convert my `pd.DataFrame` to a Holoviews Table:

```python
# ignore extra columns so holoviews doesn't show extra sliders
to_keep = ['param1', 'param2', 'result', 'opt_param']
table = hv.Table(df[to_keep])
```

Now that we have the data, let’s visualize it:

```python
%%opts HeatMap (cmap='viridis') [tools=['hover'] xticks=10 yticks=5 colorbar=True toolbar='above' logx=True show_title=False]
%%output filename="holoviews" fig="html"
table.to.heatmap(kdims=['param1', 'param2'], vdims='result')
```

And that’s an easy-to-create static HTML interactive visualization.

- `kdims` stands for “key dimensions”, `vdims` stands for “value dimensions”. See Annotating your data for more detail.
- More detail on these interactive visualizations can be found in Gridded Datasets and Tabular Datasets.
  - Using tabular data to show a gridded visualization is not space efficient; an example that uses pandas is at holoviews#1745 (comment)
- Holoviews uses param to declare parameters for classes… meaning that the documentation is good! Try misspelling a value; param will print a list of accepted values.
  - This means that the code is rather nice; it specifies exactly what parameters can be passed in, and includes documentation.
- Issue #1745: hv.Table(df).to.image throws warning and error
  - Filed because using `table.to.image` would remove the lines in the heatmap
- Issue #84: Improving customizability and layout of widgets
  - Resolving will allow positioning/styling sliders

Apple has created a new file format for machine learning models. These files can be used easily to predict, regardless of the creation process, which means that “Apple Introduces Core ML” draws an analogy between these files and PDFs. It's possible to generate predictions with *only* this file, and none of the creation libraries.

Generating predictions is a pain point faced by data scientists today and often involves the underlying math. At best, this involves training the model in Python and then calling the underlying C library in the production app.

This file format will only become widely used if easy conversion from popular machine learning libraries is possible and predictions are simple to generate. Apple made these claims during their WWDC 2017 keynote. I want to investigate their claim.

Specifically, Apple claimed easy integration between their `.mlmodel` file format and various Python libraries. It's easy to integrate these files into an app (literally via drag-and-drop) or another Python program.

Apple's coremltools Python package makes generation of this `.mlmodel` file straightforward:

1. Train a model via scikit-learn, Keras, Caffe or XGBoost (see the docs for conversion support for different library versions).
2. Generate a `coreml_model` with `converters.[library].convert(model)`.
3. (optional) Add metadata (e.g., feature names, author, short description).
4. Save the model with `coreml_model.save`.

coremltools prints helpful error messages, in my (brief) experience. When using `converters.sklearn.convert` it gave a helpful error message indicating that class labels should either be of type `int` or `str` (not `float` like I was using).

Here's the complete script for the `.mlmodel` file generation:

```python
import coremltools
from sklearn.svm import LinearSVC

def train_model():
    model = LinearSVC()
    # ...
    return model

model = train_model()

coreml_model = coremltools.converters.sklearn.convert(model)
coreml_model.author = 'Scott Sievert'  # other attributes can be added
coreml_model.save('sklearn.mlmodel')
```

Yup, creation of these `.mlmodel` files is as easy as Apple claims. Even better, it appears this file format has integration with named features and Pandas.

The generation of this file is *easy*. Now, where can these files be used?

These `.mlmodel` files can be included on any device that supports CoreML. They will not be tied to iOS/macOS apps, though these files will certainly be used there. They will allow general and easy use in Python for both saving and prediction. Given Apple's expansion of Swift to other operating systems, I don't believe this format will be tied to a particular operating system.

Prediction is as easy as saving:

```python
coremlmodel = coremltools.models.MLModel('sklearn.mlmodel')
coremlmodel.predict(example)  # `example` format should mirror training examples
```

However, I can’t test it as macOS 10.13 (currently in beta) is needed.

These difficulties were resolved quickly. Here's what I ran into while generating this post:

- CoreML depends on Python 2.7.
- Version support in converting (e.g., Keras 2 is not supported but 1.2 is).

The largest potential difficulty I see is the limited scope of coremltools. There could be issues with versions of different libraries, and not all classifiers in sklearn are supported (see the supported sklearn models).

**Atmosphere and entropy** (2017-04-09)
I recently learned an abstract mathematical theorem, and stumbled across a remarkably direct measure. I'll give background to this theorem before introducing it, then I'll show the direct measure of this theorem with physical data.

This theorem has to do with entropy, which is clouded in mystery. There are several types of entropy and, during the naming of one type, Von Neumann suggested the name “entropy” to Claude Shannon in 1948 because

> In the first place your uncertainty function has been used in statistical mechanics under that name, so it already has a name. In the second place, and more important, no one really knows what entropy really is, so in a debate you will always have the advantage.

Entropy fundamentally measures the uncertainty in a system. In “information theory” it formalizes the “randomness” of a random variable. If a random variable is uncertain, it has entropy. If a random variable is deterministic or only takes one value, it has 0 entropy.

Entropy is fundamentally related to the flow of information, which is studied in information theory. If a message can only take one state, no information can be transmitted; how would communication happen if everything were static? But if it can take two states, it's possible to communicate one bit of information (i.e., an answer to “is the light on?”).

We care about maximizing entropy because we want to receive information quickly. If a message can take 4 states instead of 2, it's possible to transmit twice as much information.
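That doubling can be checked directly with the Shannon entropy $H = -\sum_i p_i \log_2 p_i$; this snippet is a quick illustration of mine, not from the original post:

```python
from math import log2

def entropy(probs):
    """Shannon entropy, in bits, of a discrete distribution."""
    return -sum(p * log2(p) for p in probs if p > 0)

# a uniform 2-state message carries 1 bit; 4 states carry 2 bits
assert entropy([0.5, 0.5]) == 1.0
assert entropy([0.25] * 4) == 2.0

# a deterministic message (one state) carries no information
assert entropy([1.0]) == 0.0
```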

\n\nA typical statement about entropy maximization looks like:

\n\n\n\n\nFor a positive random variable $X$ with fixed mean $\\mu$, the entropy of $X$ is\nmaximized when $X$ is an exponential random variable with mean\n$\\frac{1}{\\mu}$.

\n

It doesn't seem like there'd be a direct physical example that supports this theorem. But there is, and it has to do with air pressure as a function of height.

Let's take a column of the earth's atmosphere, and ignore any weather or temperature effects. **How does the air pressure in this column vary with height?** The air pressure at sea level is very different than the air pressure where the ISS orbits.

An air particle's height is the random variable we'll study. Height is a positive variable, and a column of air will have a mean height $\mu$. We can apply the statement above if we can assume air particles maximize entropy. When this example was presented in lecture^1, I somewhat incredulously asked if this could be applied to Earth's air pressure.

Air particles maximizing entropy is a fair assumption. An air particle's position is uniformly random when it's contained in a small region. Given the well-known fact that uniform random variables have maximum entropy, this seems like a safe assumption.

\n\nSo by the statement above we’d expect the air molecules position to follow an\nexponential distribution. Pressure is a proxy for how many air particles are\npresent, and we’d expect that pressure as a function of\nheight to look like the barometric formula:

\n\n$$\nP(h) = c_1 e^{-c_2 h}\n$$

\n\nwhen $c_1$ and $c_2$ are constants that hide weather/temperature.
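As a quick numerical check of the maximum-entropy model, we can draw synthetic particle heights from an exponential distribution and see that their histogram falls off like $e^{-h/\mu}$. This is a sketch of mine with made-up data, not the NOAA data below; the scale $\mu = 8$ km is just an illustrative stand-in for the mean particle height:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = 8.0  # assumed mean particle height in km (illustrative only)

# heights of "air particles" under the maximum-entropy (exponential) model
heights = rng.exponential(scale=mu, size=200_000)

# empirical density at several heights vs the model (1/mu) * exp(-h/mu)
counts, edges = np.histogram(heights, bins=50, range=(0, 40), density=True)
centers = (edges[:-1] + edges[1:]) / 2
model = np.exp(-centers / mu) / mu

# the histogram should track the exponential curve closely
max_err = np.max(np.abs(counts - model))
print(max_err)  # the histogram tracks the curve closely
```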

\n\nNOAA collects the data required to test this theory. The launch weather\nballoons through the NOAA IGRA program. These weather balloons collect\natmospheric data at different altitudes, and these data are available\nonline^{2}. They record monthly averages of pressure at different\nheights at each station from 1909 to 2017^{3}. These data are\nfrom at least 600,000 weather balloon launches from a total of 1,532\nstations.

We use these data to visualize pressure at different heights. We know that\nthis curve is characterized by $\\mu$, the average height of an air molecule.\nI’ve calculated this value from these data and have plotted expected pressure\nat any height, given by $P(h) = \\frac{1}{\\mu}\\Exp{\\frac{-h}{\\mu}}$.

\n\nI show the measured pressures at different heights using the 314 million NOAA\ndata. This is shown alongside an appropriately scaled version of $P(h)$ given\nthe average air particle height $\\mu$.

The theorem as stated earlier holds true for an air particle's height: air pressure^4 at different heights follows an exponential distribution.

*The plot for this post was generated at stsievert/air-pressure-height*

1. In ECE 729: Information Theory, taught by Varun Jog. ↩
2. Summarized at igra2-monthly-format.txt and available in monthly-por as `ghgt_*.txt.zip`. ↩
3. Before information theory entropy was even defined! ↩
4. A proxy for the number of air particles. ↩

**Motivation for sexual reproduction** (2017-03-11)
Of course, the purpose of sexual reproduction is to perpetuate our species by having offspring. Combined with natural selection, it enables fitting our genes to our environment *quickly.* But why is it required to have two mates to produce a single offspring? Would asexual reproduction or having 3+ parents be more advantageous?

The end goal of reproduction is to have the offspring breed their own offspring. Since genes influence the probability of survival/breeding, one goal of reproduction is to pass on as many genes as possible that allow offspring survival/reproduction. This is a very information-theoretic approach – how quickly can biology transfer important information?

A species can evolve more quickly (quicker to reach optimal fitness) when its members share genetic information! The sharing of genetic material is important when combined with natural selection. Correspondingly, most species use two mates when reproducing.

This isn't surprising when compared with asexual reproduction, which involves one parent. Sharing information used to survive is more advantageous than not sharing that information.

But **would having three or more parents help our genome advance more quickly?** I'll simulate to guess an answer to this question in this post.

*This post is inspired by an information theory lecture by Varun Jog, which is in turn inspired by a chapter in the textbook “Information Theory, Inference, and Learning Algorithms” by David MacKay.*

A simulation of evolution needs 4 main parts:

- an individual's genes
- determination of an individual's fitness
- inheritance of genes when producing offspring
- natural selection

Each individual's DNA will be a binary sequence. This is a fair representation of DNA – the only difference is that actual DNA has 2 bits of information per base pair.

We'll model fitness when an individual has genes $g_i \in \{0, 1\}$ as

$$
\text{fitness } = \sum_i g_i
$$

This is a sensible model because it mirrors actual fitness. If one is already really fit, a little more fitness won't help much (i.e., a fitness of 99 → 100 is a small percentage change). If one is not fit, getting a little fit helps a ton (i.e., a fitness of 1 → 2 is a large percentage change).

```python
import numpy as np

def fitness(member):
    member = np.asarray(member)
    if member.ndim == 2:
        _, n_genes = member.shape
        indiv_fitnesses = member.sum(axis=1)
        return indiv_fitnesses.mean()

    return member.sum()
```

We'll model the probability of pulling parent $i$'s gene as $\frac{1}{n}$ when there are $n$ parents. This mirrors how reproduction works.

While implementing this, we include mutation, which flips each gene with probability $p$. This is a naturally occurring process.

```python
def produce_offspring(parents, mutate=False, p=0.00):
    n_parents, n_genes = parents.shape
    gene_to_pull = np.random.randint(n_parents, size=n_genes)
    child = [parents[gene_to_pull[k]][k] for k in range(n_genes)]
    child = np.array(child)

    if mutate:
        genes_to_flip = np.random.choice([0, 1], size=child.shape, p=[1 - p, p])
        i = np.argwhere(genes_to_flip == 1)
        child[i] = 1 - child[i]

    return child
```

Each generation of parents will produce twice as many children. We'll kill half of those children to simulate natural selection.

We'll produce $2N$ children when we have $N$ parents, regardless of how many parents are required to produce each offspring. If we produced more children for some groups, there would be more children to hand to natural selection, which would introduce a bias because natural selection selects the strongest candidates.

```python
def one_generation(members, n_parents=2, **kwargs):
    """ members: 2D np.ndarray
        members.shape = (n_parents, n_genes) """
    parents = np.random.permutation(members)
    children = []
    for parent_group in range(len(parents) // n_parents):
        parents_group = parents[parent_group * n_parents:(parent_group + 1) * n_parents]
        children += [produce_offspring(parents_group, **kwargs) for _ in range(2 * n_parents)]
    children = np.array(children)

    # make sure we produce (approximately) 2*N children when we have N parents
    assert np.abs(len(children) - 2 * len(parents)) < n_parents + 1
    return children
```

Now we want to simulate evolution with an initial population.

In the natural selection process we'll kill off half the children, meaning there will be $N$ parents for the next generation.

At each generation we'll record relevant data; we'll look at the fitness below.

```python
def evolve(n_parents=2000, n_mates=2, n_genes=200, n_generations=100, p_fit=0.5, verbose=10):
    parents = np.random.choice([0, 1], size=(n_parents, n_genes), p=[1 - p_fit, p_fit])

    data = []
    for generation in range(n_generations):
        if verbose and generation % verbose == 0:
            print('Generation {} for n_mates = {} with {} parents'.format(generation, n_mates, len(parents)))

        children = one_generation(parents, n_parents=n_mates, mutate=True, p=0.01)

        # kill the less-fit half of the children
        children_fitness = np.array([fitness(child) for child in children])
        i = np.argsort(children_fitness)
        children = children[i]
        parents = children[len(children) // 2:].copy()

        data += [{'fitness': fitness(parents), 'generation': generation,
                  'n_mates': n_mates, 'n_parents': n_parents,
                  'n_genes': n_genes, 'n_generations': n_generations}]

    return data
```

Then we can run this for different numbers of mates required to produce one offspring:

```python
data = []
for n_mates in [1, 2, 3, 4]:
    data += evolve(n_mates=n_mates, p_fit=0.50)
```

```python
from altair import Chart
import pandas as pd

df = pd.DataFrame(data)
Chart(df).mark_line().encode(
    x='generation', y='fitness', color='n_mates')
```

Asexual reproduction requires ~75 generations to reach the fitness sexual reproduction reaches in ~15 generations. Sexual reproduction appears to be fairly close to optimal in this model.

*This post can be downloaded as a Jupyter notebook, Reproduction.ipynb*

I often have highly optimized code that I want to run independently for different parameters. For example, I might want to see how reconstruction quality varies as I change two parameters. My code takes a moderate amount of time to run, maybe 1 minute. This isn't huge, but if I want to average performance over 5 random runs for $20^2$ different input combinations, using a naïve for-loop means about 1.5 days. Using dask.distributed, I distribute these independent jobs across different machines and different cores for a significant speedup.


Testing these input combinations requires at least one embarrassingly parallel for-loop – each iteration runs independently of the last. The simplified example takes the form of

```python
import random

import numpy as np

def test_model(x):
    # ...
    return random.choice([0, 1])

y = [test_model(x) for x in np.linspace(0, 1)]
```

dask.distributed is a tool to optimize these for-loops^1. It can distribute single iterations of a for-loop onto different cores and different machines. This is perfect for me – as a grad student involved with the Wisconsin Institute for Discovery, I have a cluster of about 30 machines ready for my use.

I'll first illustrate basic dask use, then explain how I personally set it up on the cluster. I'll then go over some advanced use that covers how to use it with the cluster at UW–Madison.

\n\nUsing dask.distributed is easy, and the dask.distributed documentation is helpful.\nFunctions for submitting jobs such as `Client.map`

and `Client.submit`

\nexist, and `Client.gather`

exists to collect pending jobs.

In the use case above,

\n\n`from distributed import Client\nclient = Client() # will start scheduler and worker automatically\n\ndef test_model(x):\n # ...\n return random.choice([0, 1])\n\ny = [client.submit(test_model, x) for x in np.linspace(0, 1)]\ny = client.gather(y) # collect the results\n`

That's the whole setup – we have to add 3 lines and change 1. For the speedup that dask.distributed gives access to, that's remarkably simple.

\n\nUsing `Client()`

is easy – it starts a worker and scheduler for you. This\nonly works for one machine though; other tools exist to use many machines as\ndetailed on *Setup Network*.

That really covers all you need to know; the dask.distributed docs are decent and\nthe above example is enough to get started. In what follows, I’ll explain my\nwork flow: using dask.distributed on the UW–Madison optimization cluster with a\nJupyter notebook.

\n\nAfter installing a personal Python install on the UW cluster, following\ndask.distributed’s *Quickstart* gets you 99% of the way to using dask.distributed on\nthe UW Optimization Cluster. The dask.distributed documentation are rather\ncomplete – please, give them a look. Most of the content below can be found in\n*Quickstart*, *Web interface* and *FAQ*.

Setting up many workers on the cluster with many machines is a little trickier\nbecause the cluster is not my personal machine and I (thankfully) don’t manage\nit. I’ll describe how I use dask.distributed and what workarounds I had to find\nto get dask.distributed to run on the UW–Madison cluster. Additionally I’ll\ndescribe how I use this in conjunction with Jupyter notebooks.

\n\nI setup dask.distributed like below, using SSH port\nforwarding to view the web UI.

\n\n`# visit localhost:8070/status to see dask's web UI\nscott@local$ ssh -L 8070:localhost:8787 ssievert@cluser-1\nscott@cluster-1$ dask-scheduler\n# `dask-scheduler` prints \"INFO - Scheduler at 123.1.28.1:8786\"\nscott@cluster-1$ export OMP_NUM_THREADS=1; dask-worker 123.1.28.1:8786\nscott@cluster-2$ export OMP_NUM_THREADS=1; dask-worker 123.1.28.1:8786\n`

When I run dask-worker without setting `OMP_NUM_THREADS`, the worker throws an error and fails. Setting `OMP_NUM_THREADS=1` resolves this issue; see the SO question titled “Error with OMP_NUM_THREADS when using dask distributed” for more detail.
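The variable can also be set from Python rather than the shell, as long as it happens before the numerical libraries load, since OpenMP runtimes typically read it when the library is first imported. A minimal sketch:

```python
import os

# must run before importing numpy or starting a worker;
# OpenMP runtimes typically read this at library load time
os.environ['OMP_NUM_THREADS'] = '1'

import numpy as np  # now limited to one OpenMP thread
print(os.environ['OMP_NUM_THREADS'])  # 1
```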

A nice tool to manage running the same commands on many machines is csshx for OS X's Terminal.app (not iTerm) and cssh for Linux (`cssh` stands for “Cluster SSH”).

I use `tmux` to handle my dask-scheduler and dask-workers. This allows me to

- log out and not kill the processes I want running in the background
- view the output of both dask-scheduler and dask-worker even after logging out
- always have a scheduler and workers available when I log in

This is enough to use dask.distributed on this cluster. Now I'll touch on how I use it with Jupyter notebooks using port forwarding. This allows me to quickly visualize results on the cluster and provides a native editing environment.

I also use dask.distributed with the Jupyter notebook, which provides a nice interface to view the results and edit the code. This means I don't have to `rsync` my results to my local machine to visualize them. Additionally, I feel like I'm editing on my local machine while editing code on this remote cluster.

```shell
scott@local$ ssh -L 8080:localhost:8888 -L 8081:localhost:8787 scott@cluster-1
scott@cluster-1$ jupyter notebook
# on local machine, localhost:8080 views the notebook running on the cluster
```

With the process above, I can *quickly* visualize results *directly* on the server. Even better, I can fully utilize the cluster and use as many machines as I wish.

With this, I can also view dask.distributed's web UI. This allows me to see the progress of the jobs on the cluster; I can check how far I've come and how close I am to finishing.

*(Screenshots: the Jupyter notebook and the dask.distributed web UI.)*

Oftentimes, I am finding model performance for different input combinations. During this I typically average the results by calling `test_model` many times.

In the example below, I show a personal use case of dask.distributed. In this, I include the method of visualization (which relies on pandas and seaborn).

```python
import numpy as np
import seaborn as sns
import pandas as pd
from distributed import Client
from distributed.diagnostics import progress

def test_model(k, n, seed=42):
    np.random.seed(seed)
    # ...
    return {'sparsity': k, 'n_observations': n, 'success': 1 if error < 0.1 else 0}

client = Client('127.61.142.160:8786')
data = [client.submit(test_model, k, n, seed=repeat*n*k)
        for n in np.logspace(1, 3, num=60)
        for k in np.logspace(1, 1.7, num=40, dtype=int)
        for repeat in range(10)]
data = progress(data)  # allows notebook/console progress bar
data = client.gather(data)

df = pd.DataFrame(data)
show = df.pivot_table(index='sparsity', columns='n_observations', values='success')

sns.heatmap(show)
# small amount of matplotlib/seaborn code
```

This plot is $40\times 60$ with each job averaged over 10 trials. In total, that makes for $40 \cdot 60 \cdot 10 = 24\cdot 10^3$ jobs. This plot was generated on my local machine with 8 cores; at most, we can see a speedup of 8.

\n\nThis approach only parallelizes different jobs, not tasks within that job. This\nmeans that if a core finishes quickly and another job isn’t available, that\ncore sits empty and isn’t used.

\n\nFor more details on this setup, see dask.distributed’s page on *Related Work*.\nUsing any of these frameworks should allow for further speedups. I would\nrecommend dask the most as it has `dask.array`

and `dask.DataFrame`

, parallel\nimplementations of NumPy’s `array`

and Panda’s `DataFrame`

.

Additionally, dask also has a delayed function decorator. This allows\nrunning functions decorated with `@delayed`

on all available cores of one\nmachine. Of course, make you need to optimize before decorating a function.

- I couldn't include nested map functions or use dask.distributed's joblib frontend inside a function submitted to dask.distributed, as detailed in dask.distributed#465 and joblib#389. Note that `pd.pivot_table` alleviates many of these concerns, as illustrated above. *EDIT 2017-11-22: This is resolved by distributed.get_client.*
- The pseudo-random number generator generated the same random number in every job. To get around this and generate different seeds for every iteration, I passed `i_repeat * some_model_param` as the seed.
1. Of course, before you optimize, be sure you need to optimize. ↩

**NumPy GPU acceleration** (2016-07-01)

I recently had to compute many inner products with a given matrix $\Ab$ for many different vectors $\xb_i$, or $\xb_i^T \Ab \xb_i$. Each vector $\xb_i$ represents a shoe from Zappos and there are 50k vectors $\xb_i \in \R^{1000}$. This computation took place behind a user-facing web interface and during testing had a delay of 5 minutes. This is clearly unacceptable; how can we make it faster?


*edit, 2018-03-17: Looking for the libraries? Check out the libraries section*

I spent a couple hours trying to get the best possible performance from my functions… and through this, I found a speed optimization^1 that put most of the computation on NumPy's shoulders. After I made this change, the naïve for-loop and NumPy were about a factor of 2 apart, not enough to write a blog post about.
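The post doesn't show the optimization itself, but a common way to push a loop of quadratic forms $\xb_i^T \Ab \xb_i$ onto NumPy is to stack the vectors as rows and vectorize. This sketch is my reconstruction, not necessarily the exact change made here, and uses much smaller dimensions than the post's 50k vectors in $\R^{1000}$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 100  # smaller than 50k x 1000, same structure
X = rng.standard_normal((n, d))  # row i is x_i
A = rng.standard_normal((d, d))

# naive loop over quadratic forms x_i^T A x_i
loop = np.array([x @ A @ x for x in X])

# vectorized: multiply (X A) elementwise by X, then sum over columns --
# one matrix multiply instead of n small ones
fast = ((X @ A) * X).sum(axis=1)

assert np.allclose(loop, fast)
```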

Use of an NVIDIA GPU significantly outperformed NumPy. Given that most of the optimization seemed to be focused on a single matrix multiplication, let's focus on speed in matrix multiplication.

We know that matrix multiplication has computational complexity of something like $\mathcal{O}(n^{2.8074})$^2, but very likely greater than $\mathcal{O}(n^{2.375477})$^3 when multiplying two $n\times n$ matrices. We can't get around this without diving into theory, but we can change the constant that dictates exactly how fast these algorithms run.

The tools I’ll test are

- the default NumPy install, with no MKL (even though it's now provided by default with Anaconda)
- Intel MKL, a tool that provides acceleration for BLAS/LAPACK
- the GPU. To do this, I'll need an Amazon AWS machine and the NVIDIA CUDA Toolkit. An easy interface is available through cudamat, but scikit-cuda and Accelerate also have nice interfaces and provide more access.
I had planned to test other tools but these tests didn’t pan out for reasons in\nthe footnotes. My test script can be summarized in the appendix, but I saw\nthat the GPU offered significant speedup with the following graph:

| Environment | NumPy + no MKL | NumPy + MKL | cudamat |
|---|---|---|---|
| Time (seconds) | 7.18 | 4.057 | 0.2898 |

Under the default Anaconda environment (i.e., with MKL), we see that our script runs **80%** slower without MKL and has a **14x** speedup under cudamat!
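Those percentages follow directly from the timings in the table; a quick check of the arithmetic:

```python
no_mkl, mkl, gpu = 7.18, 4.057, 0.2898  # seconds, from the table above

slowdown_without_mkl = no_mkl / mkl - 1  # vs the default (MKL) environment
speedup_cudamat = mkl / gpu

print(round(slowdown_without_mkl * 100))  # 77 -- roughly the 80% quoted
print(round(speedup_cudamat, 1))          # 14.0
```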

*begin edits on 2018-03-17*

This simple test shows that using the GPU is powerful. However, this is a simple test with only one library, cudamat. Many more libraries exist and have better usage, including:

- CuPy, which has a NumPy interface for arrays allocated on the GPU. The transition from NumPy *should* be one line.
- Numba, which allows defining functions (in Python!) that can be used as GPU kernels through numba.cuda.jit and numba.hsa.jit.
- PyTorch, which supports arrays allocated on the GPU. It has other useful features, including optimizers, loss functions and multiprocessing to support its use in machine learning.

CuPy tries to copy NumPy's API, which means that transitioning *should* be very easy. I mean, they even have a page on “CuPy and NumPy Differences”. But they also offer some low-level CUDA support, which could be convenient. It looks like Numba support is coming for CuPy (numba/numba#2786, relevant tweet).

Numba supports defining GPU kernels in Python and then compiling them for the GPU. This is a powerful usage (JIT compiling Python for the GPU!), and Numba is designed for high-performance Python and has shown powerful speedups. More advanced use cases (large arrays, etc.) may benefit from some of their memory management. Numba does have support for other lower-level details (e.g., calling the kernel with different thread/block sizes).

PyTorch is useful in machine learning and has a small core development team of 4, sponsored by Facebook. It's what I (a machine learning researcher) use every day, and it's inspired another blog post, “PyTorch: fast and simple”. Its API does not exactly conform to NumPy's API, but this library does have pretty good support (easy debugging, nice NumPy/SciPy integration, etc).

\n\n*end edits on 2018-03-17*

Accelerate and scikit-cuda are both fairly similar. In choosing whether to use Accelerate or scikit-cuda, there are two obvious tradeoffs:

- scikit-cuda has access to linear algebra functions (e.g., `eig`) and Accelerate does not. However, access to these higher-level mathematical functions comes through CULA, another framework that requires a license (free academic licenses are available).
- Accelerate can accept raw `ndarray`s and scikit-cuda needs to have `gpuarray`s passed in (meaning more setup/cleanup).

Whichever is chosen, large speed enhancements exist. I have timed a common function (`fft`) over different values of `n`; there is some overhead to moving to the GPU and I wanted to see where that is. I provide a summary of my testing script in the appendix.

CULA has benchmarks for a few higher-level mathematical functions\n(source: the CULA Dense homepage):

Anaconda has published a good overview titled “Getting started with GPU computing”. I think I would start with Numba: it has debugging and supports some notion of kernels. [updated 2017-11]

- Numba has `numba.cuda.jit` and `numba.hsa.jit`. It has good debugging and looks like a wrapper around CUDA kernels.
- Anaconda has developed pyculib. This provides access to `cu{BLAS, FFT, RAND}` and CUDA sorting.
- PyCUDA and PyOpenCL are not tested because they require C++ code (PyCUDA example, PyOpenCL example).
- gnumpy not tested because it didn’t support Python 3 and hasn’t been touched in 4 years
- I tried to install cudarray but ran into install difficulties
- theano supports the GPU (see “Using the GPU”) but not tested – this seems to be primarily a machine learning library

…and of course I didn’t optimize any loop-based functions. To optimize loop speed, I would look at Numba first and then possibly Cython.

```python
import numpy as np
import cudamat as cm

n, p = int(2e3), int(40e3)
A = np.random.randn(n, p)
B = np.random.randn(p, n)
%timeit A @ B

cm.cublas_init()
cm.CUDAMatrix.init_random()
A_cm = cm.empty((n, p)).fill_with_randn()
B_cm = cm.empty((p, n)).fill_with_randn()
%timeit A_cm.dot(B_cm)
cm.cublas_shutdown()
```

In this script, I show preparing for the FFT and preparing for linear algebra functions (e.g., `culinalg.init()`). I found that it’s useful to look at the scikit-cuda demos.

```python
import numpy as np
from accelerate.cuda.blas import Blas
import accelerate.cuda.fft as acc_fft
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
import skcuda.fft as cu_fft
import skcuda.linalg as culinalg
import skcuda.misc as cumisc

# for scikit-cuda
culinalg.init()

# for accelerate when calling wrapped BLAS functions (e.g., blas.dot)
blas = Blas()


def fft_accelerate(x, y):
    f = acc_fft.FFTPlan(shape=x.shape, itype=x.dtype, otype=y.dtype)
    f.forward(x, out=y)  # note: we're passing np.ndarrays
    return y


def fft_scikit(x, y):
    plan_forward = cu_fft.Plan(x.shape, np.float32, np.complex64)
    cu_fft.fft(x, y, plan_forward)
    return y.get()


n = int(40e4)
x = np.random.randn(n).astype('float32')
y = np.zeros(n).astype('complex64')  # needed because fft has complex output
%timeit fft_accelerate(x, y)

x = gpuarray.to_gpu(x)
y = gpuarray.empty(n//2 + 1, np.complex64)
%timeit fft_scikit(x, y)
```

\n

\n",
"tags": [],
"image": ""
},
{
"id": "http://stsievert.com/blog/2016/05/22/powder-days/",
"url": "http://stsievert.com/blog/2016/05/22/powder-days/",
"title": "Probability of a powder day",
"date_published": "2016-05-22T03:30:00-05:00",
"date_modified": "2016-05-22T03:30:00-05:00",
"author": {
"name": "Scott Sievert",
"url": "http://stsievert.com"
},
"summary": "

This last spring break, I had a *ton* of fun! Why?

I had the good fortune of catching a powder day with powder skis this spring break! While riding the Born Free chair at Vail, I wondered: what are the chances of this happening on a given trip?

\n\n", "content_html": "This last spring break, I had a *ton* of fun! Why?

I had the good fortune of catching a powder day with powder skis this spring break! While riding the Born Free chair at Vail, I wondered: what are the chances of this happening on a given trip?

A naïve approach is to simply count up the number of powder days in a season and divide by the number of weeks in a season (assuming a week-long trip)… but that doesn’t capture any of the time dependence. In the depths of winter, big storms are more likely. When I typically go, during spring break, I believe a powder day is less likely.

To estimate the probability of a powder day occurring during our trip, let’s plot the number of powder days of the last 31 years using the data from NOAA weather station 058575 in Eagle County, CO, the location of my ski trips.

How does this information help us determine the probability of a powder day? At first, this seems like a challenging problem. If you live in Colorado, you’ll certainly see a powder day eventually, a clue that the chance of a powder day is not just the sum of the probabilities on each day.

\n\nFortunately, probabilists have spent time developing frameworks for exactly\nthis type of problem. The method that naturally lends itself to waiting for a\npowder day is a Poisson process. Quoting Wikipedia,

\n\n\n\n\n[a Poisson process can model] customers arriving and being served or phone\ncalls arriving at a phone exchange.

\n

This seems like the perfect framework for powder days! But first, how much snow do we need before we declare a powder day? Speaking from experience, probably 1.5 inches, or about 38mm. Because the weather station is in the valley, this is reasonable: if the bottom of the valley sees 1.5 inches, the peak might see 4-5 inches.

\n\nLuckily, this threshold doesn’t matter for our end use case: in practice, I’m\nmore interested in *when* to book my trip. This means I want the highest\nprobability of a powder day, which tends to remain fairly constant with\ndifferent thresholds.

The weather station reports the amount of snow each day. Using this definition\nof a powder day, we can graph the number of powder days that occur on (for\nexample) January 4th over 31 years.

\n\n\n\nWe have when powder days occur, and Poisson processes have some nice properties\nthat align with powder days. Most importantly, the value that characterizes\nPoisson processes, $\\lambda$, is characterized by the number of events seen in\na time interval. We can define $N(a, b)$ to be the number of events we see\nbetween days $a$ and $b$, and Poisson processes have the property that

\n\n$$\n\\E{N(a, b]} = \\lambda (b - a)\n$$

or the expected number of events in an interval is just some number times the length of time. This means that we can easily estimate $\lambda$ in a time range: it’s just the number of snow storms we observe divided by the length of the interval. This parameter corresponds to the frequency of events. Given a high $\lambda$, the chances of a powder day are much higher.
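Concretely, estimating $\lambda$ for one two-week window is just counting; a sketch with made-up counts standing in for the NOAA data:

```python
import numpy as np

np.random.seed(0)
n_years, window_days = 31, 14

# Made-up data: powder days observed in one two-week window, for each season.
powder_days_per_season = np.random.poisson(2.0, size=n_years)

# lambda = (total events observed) / (total days observed), in events per day
lam = powder_days_per_season.sum() / (n_years * window_days)
print(f"estimated lambda: {lam:.3f} powder days per day")
```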

In this sense, waiting for a powder day is definitely *not* a homogeneous Poisson process: the probability of a powder day changes over time. During the summer, a powder day definitely won’t happen. We can get around this because weather is a *slow*-changing system, so we can model each week as a stationary, homogeneous Poisson process that follows the equation above.

When estimating $\\lambda$, we have to choose an amount of time that the system\ncan be modeled as a homogeneous Poisson process, or how long does the weather\nstay the same? We’ll decide on two weeks/14 days for this.

To generate this graph, I did use some physical intuition. I said that the probability of a powder day couldn’t change too quickly over time, and smoothed out this curve over time. This corresponds to taking some [Bayesian prior] on the likelihood of a powder day.

\n\nThis plot just tells us when it snows, and it doesn’t say anything about how\nmuch it snows. We know that in January, Colorado receives more snow (as\nindicated by the first plot), but here $\\lambda$ is lower.

\n\nBut now we have the value of $\\lambda$. Let’s see how probable powder days are\nthroughout the year! This will definitely depend on how long our trip to\nColorado is – if we lived out there, we’d definitely see a powder day.

To calculate the probability that at least 1 powder day will occur on a 5 day trip, given the value of $\lambda$ in the graph above, we compute

\n\n$$\n\\Align{\n\\prob{N(a, b] > 1} &= 1 - \\prob{N(a, b] = 0}\\\\\n&=1 - \\Exp{\\lambda (b - a)}\n}\n$$

which is directly related to the Poisson probability mass function (which means that it’s easy to implement and found in `scipy.stats`).
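As a sketch, with an illustrative $\lambda$ this computation is one line with `scipy.stats`:

```python
import numpy as np
from scipy.stats import poisson

lam = 0.15        # illustrative: powder days per day for this window
trip_days = 5

# P(N >= 1) = 1 - P(N = 0), where N ~ Poisson(lam * trip_days)
p_powder = 1 - poisson.pmf(0, lam * trip_days)
print(f"P(at least one powder day) = {p_powder:.3f}")
```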

\n\n

This is of great practical importance! I’m planning on taking a weekend trip to Mt. Bohemia in Michigan to tap that lake effect snow. Using NOAA weather station 201789, we can find out when to book a trip.

\n\n\n\nIt looks like I’ll schedule my trip to be in early January or mid-February!

\n\n", "tags": [], "image": "" }, { "id": "http://stsievert.com/blog/2016/02/26/coin-flips/", "url": "http://stsievert.com/blog/2016/02/26/coin-flips/", "title": "A Bayesian analysis of Clintons 6 heads", "date_published": "2016-02-26T06:30:00-06:00", "date_modified": "2016-02-26T06:30:00-06:00", "author": { "name": "Scott Sievert", "url": "http://stsievert.com" }, "summary": "Clinton recently won 6 coin flips during an Iowa caucus. On facebook and in the\nnews, I’ve only seen information about how unlikely this is – the chances\nof 6 heads are 1.56% with a fair coin.

Yes, 6 heads is unlikely, but these coin flips could have occurred by chance. I mean, on the Washington Post coin flip demo, I got all heads on my 5th try. Instead, it makes more sense to ask a different question: given we observed these 6 heads, what are the chances this coin wasn’t fair?^{1}

\n

\n",
"content_html": "- If we were *really* testing to see if the coin was unfair, it’d make more sense to do hypothesis testing ↩

Clinton recently won 6 coin flips during an Iowa caucus. On Facebook and in the news, I’ve only seen information about how unlikely this is – the chances of 6 heads are 1.56% with a fair coin.

Yes, 6 heads is unlikely, but these coin flips could have occurred by chance. I mean, on the Washington Post coin flip demo, I got all heads on my 5th try. Instead, it makes more sense to ask a different question: given we observed these 6 heads, what are the chances this coin wasn’t fair?^{1}

This is a Bayesian approach; given our observations, what probabilities can we infer? This is not the classic frequentist approach that asks “how likely are my observations given all parameters about the model (the coin)?” This Bayesian approach makes complete sense when *only* the observations are known.

To formulate this problem as a probability problem, we’ll have to define some\nvariables:

\n\n- \n
- $\\theta$ is the probability of getting a heads. \n
- $f_i$ is the $i$th flip and is either 0 or 1. $f_i = 1$ with probability\n$\\theta$. \n
- $Y = \\sum_{i=1}^6 f_i$ is the number of heads we saw. In Clinton’s case, \n$Y = 6$. $Y$ is a binomial random variable. \n

Then, given the probability of a heads $\theta$, the probability of flipping $6$ heads with this binomial random variable is

\n\n$$\n\\prob{Y = 6 \\given \\theta} = \\binom{6}{6} \\theta^6 (1 - \\theta)^{6-6} = 0.015625\n$$

which is exactly what Facebook/news articles focus on. In their analysis, they are given $\theta = 1/2$ and show how unlikely it is. However, it could be that Clinton got 6 heads by chance – maybe she got lucky enough to be in the 1.56%?
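That 1.56% figure is easy to verify; it’s just the binomial probability mass function evaluated at 6 heads out of 6 flips:

```python
from scipy.stats import binom

# P(Y = 6 | theta = 1/2): 6 heads in 6 flips of a fair coin
p = binom.pmf(6, 6, 0.5)
print(p)  # 0.015625
```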

\n\nTo do this, we do need a prior probability, or we need to guess how likely\nit is that Clinton made the coin unfair, and by how much. We’re guessing at\nsomething and that guess will tend to bias our result! This is a big deal; we\ncan’t be certain our estimate is unbiased.

\n\nTo do that, let’s take it that the probability of a heads $\\theta$ has this\nprobability density function (higher values in the graph below are considered\nmore likely):

\n\n\n\n\n\n\nHere, you can play with two sliders that determine the shape of this beta\ndistribution. At the risk of spoiling the rest of this post, when we take this\nprior there is a chance Clinton biased her coin.

To do this, we’ll need to use Bayes’ rule,

\n\n$$\n\\prob{\\theta \\given Y=k} = \\frac{\\prob{Y = k \\given \\theta}\\prob{\\theta}}{\\prob{Y=k}}\n$$

\n\nbut this is exceptionally easy because we chose that $\\theta \\sim \\text{beta}(a, b)$.\nWhen we do this, we find that $\\theta \\given Y=k \\sim \\text{beta}(a + k, b + n - k)$\nbecause the beta distribution is a conjugate prior for the \nBernoulli distribution.

After we do this, we have $\prob{\theta \given Y=k}$. However, we want to know how likely it is the coin was unfair, or the probability that $\theta > 1/2 \given Y=6$. It turns out that this probability is just an integral of the probability density function from 0.5 to 1, or

\n\n$$\n\\prob{\\theta > 1/2 \\given Y=6} = \\int_{0.5}^1 \\prob{\\theta \\given Y=6} d\\theta\n$$

Performing the calculation, $\prob{\theta > 1/2 \given Y=6} = $ 0.95. That’s right: there’s a 95% chance Clinton’s coin was biased, given the 6 heads we saw!
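The whole computation fits in a few lines with `scipy.stats`. The prior parameters below are illustrative stand-ins for whatever the sliders are set to; a uniform prior gives a different number than the 95% quoted above:

```python
from scipy.stats import beta

a, b = 1, 1      # illustrative prior (uniform); the post's sliders choose these
k, n = 6, 6      # 6 heads out of 6 flips

# Conjugacy: theta | Y=k  ~  Beta(a + k, b + n - k)
posterior = beta(a + k, b + n - k)

# P(theta > 1/2 | Y = 6): integrate the posterior density from 0.5 to 1
p_biased = 1 - posterior.cdf(0.5)
print(f"P(theta > 1/2 | Y = 6) = {p_biased:.4f}")
```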

We can see that we have weak evidence for Clinton biasing her coin. When we assume that the coin is probably fair and was only biased a small amount, the largest probability of $\theta>1/2$ is 85.2%. This is not a strong probability; we’re looking for at least a 95% chance that Clinton biased her coin to even put moderate faith in this belief.

\n\nYes, we are asserting something is true before we *prove* it is true.\nTypically, there is strong reason behind this. For example, we might know that\na small number of genes are important in a disease and we typically enforce\nthat. With this, I think it’s reasonable to assume that if the coin isn’t fair\nit’s only not fair by a small amount.

The Washington post article gives a similar conclusion after considering that\nthere were other coin flips that Sanders won:

\n\n\n\n\nThere were other coin tosses that emerged today which Sanders won – so, yes.\nVery slim.

\n

I’m chalking this up to chance: Clinton got lucky that she flipped 6 heads. It looks like these 6 flips were performed with a fair coin.

If we wanted to formalize this method, we could take an even more scientific approach. We could use hypothesis testing with two hypotheses to find p-values and numbers for how likely it is that these 6 heads were generated under each hypothesis (null hypothesis: the coins are fair; alternative hypothesis: the coins are not fair).

\n\n\n

\n",
"tags": [],
"image": ""
},
{
"id": "http://stsievert.com/blog/2016/01/30/inverse-3/",
"url": "http://stsievert.com/blog/2016/01/30/inverse-3/",
"title": "Gradient descent and physical intuition for heavy-ball acceleration with visualization",
"date_published": "2016-01-30T06:30:00-06:00",
"date_modified": "2016-01-30T06:30:00-06:00",
"author": {
"name": "Scott Sievert",
"url": "http://stsievert.com"
},
"summary": "

*This post is part 3 of a 3-part series: Part I, Part II, Part III.*

We often make observations from some system and would like to infer something\nabout the system parameters, and many practical problems such as the Netflix\nPrize can be reformulated this way. Typically, this involves making\nobservations of the form\n$y = f(x)$ or $\\yb = \\Ab \\cdot \\xb$^{1} where $y$ is observed, $f/\\Ab$ is\nknown and $x$ is the unknown variable of interest.

Finding the true $x$ that gave us our observations $y$ involves inverting a\nfunction/matrix which can be costly time-wise and in the matrix case often\nimpossible. Instead, methods such as gradient descent are often involved, a\ntechnique common in machine learning and optimization.

\n\nIn this post, I will try to provide calculus-level intuition for gradient\ndescent. I will also introduce and show the heavy-ball acceleration method for\ngradient descent^{2} and provide a physical interpretation.

\n

\n",
"content_html": "- I’ll use plain font/bold font for scalars/vectors (respectively) as per my notation sheet. ↩
- ↩

*This post is part 3 of a 3-part series: Part I, Part II, Part III.*

We often make observations from some system and would like to infer something\nabout the system parameters, and many practical problems such as the Netflix\nPrize can be reformulated this way. Typically, this involves making\nobservations of the form\n$y = f(x)$ or $\\yb = \\Ab \\cdot \\xb$^{1} where $y$ is observed, $f/\\Ab$ is\nknown and $x$ is the unknown variable of interest.

Finding the true $x$ that gave us our observations $y$ involves inverting a\nfunction/matrix which can be costly time-wise and in the matrix case often\nimpossible. Instead, methods such as gradient descent are often involved, a\ntechnique common in machine learning and optimization.

\n\nIn this post, I will try to provide calculus-level intuition for gradient\ndescent. I will also introduce and show the heavy-ball acceleration method for\ngradient descent^{2} and provide a physical interpretation.

In doing this, I will interchangeably use the words derivative (aka\n$\\deriv$) and gradient (aka $\\grad$). The gradient is just the\nhigh-dimensional version of the derivative; all the intuition for the\nderivative applies to the gradient.

\n\nIn the problem setup it’s assumed we know the derivative/gradient of this\nfunction, a fair assumption. The derivative yields information about (a) the\ndirection the function increases and (b) how fast it increases.

\n\nFor example, we might be given a function, say $f(x) = x^2$. At the point\n$x = 3$, the derivative $\\deriv f(x)\\at{x=3} = 2\\cdot 3$ tells us the function\nincreases in the positive $x$ direction at a rate of 6.

The important piece of this is the *direction* the function increases in. Because the derivative points in the direction the function increases in, the negative gradient points in the direction the function *decreases* in. Function minimization is common in optimization, meaning that the negative gradient direction is typically used. If we’re at $x_k$ and we take a step in the negative gradient direction of $f$, our function will get smaller, or

$$\nx_{k+1} = x_k - \\tau \\grad f(x_k)\n$$

\n\nwhere $\\tau$ is some step-size. This will converge to a minima, possibly local.\nWe’re always stepping in the direction of the negative gradient; in every step,\nwe know that the function value gets smaller.

To implement this, we would use the code below. Note that to implement this piece of code in higher dimensions, we would only define `x_hat = np.zeros(N)` and make small changes to our function `grad`.

```python
x_hat = 2
tau = 0.02

for k in range(40):
    # grad is the gradient for some function, not shown
    x_hat = x_hat - tau*grad(x_hat)
```

This makes no guarantees that the minima is global; as soon as the gradient is 0, this process stops. It can’t get past any bumps because it reads the derivative at a single point. In its simplest form, gradient descent makes sense: we know the function gets smaller in a certain direction; we should step in that direction.

\n\nOur cost function may include many smaller bumps, as in the picture below.\nGradient descent will fail here because the gradient goes to 0 before the\nglobal minima is reached.

\n\n\n\nGradient descent seems fragile in this sense. If it runs into any spots where\nthe gradient is 0, it will stop. It can run into a local minima and can’t get\npast it.

\n\nOne way of getting around that is by using some momentum. Instead of focusing\non the gradient at one point, it would be advantageous to include momentum to\novercome temporary setbacks. This is analogous to thinking of as a heavy ball\nis rolling hill and bumps only moderately effect it. The differential equation\nthat governs this motion in a gravity well described by $f$ is

\n\n$$\n\\ddot{x} + a \\dot{x} + b \\grad f(x) = 0\n$$

for positive constants $a > 0$ and $b > 0$, but this continuous form isn’t that useful for computers. The discretization of this equation is given by

\n\n$$\n(x_{k+1}-x_{k-1}) + a(x_{k+1} - x_k) + b\\grad f(x_k) = 0\n$$

\n\nAfter some algebraic manipulations (shown in the appendix) and defining $\\tau := \\frac{b}{1 + a}$ and $\\beta := \\frac{1}{1 + a}$, we can find

\n\n$$\nx_{k+1} = x_k - \\tau \\grad f(x_k) + \\beta \\cdot(x_{k-1} - x_{k})\n$$

\n\nWhen crafting this as a ball rolling down a hill, $a$ is friction and $b$ is\nthe strength of gravity. We would never expect friction to accelerate objects;\nif it did, the ball would never settle and would climb out of any bowl.\nCorrespondingly, when $a < 0$ aka when $\\beta > 1$ this algorithm diverges!

\n\nPhysical intuition has been provided, but does this hold in simulation? While\nsimulating, we should compare with the gradient descent method. The code below\nimplements this ball-accelerated method and is used to produce the video below\nthe code.

```python
def ball_acceleration(x, c):
    """
    :input x: iterates. x[0] is most recent iterate, x[1] is last iterate
    :input c: Array of constants. c[0] is step size, c[1] is momentum constant
    :returns: Array of new iterates.
    """
    # grad is the gradient for some function, not shown
    update = x[0] - c[0]*grad(x[0]) + c[1]*(x[1] - x[0])
    return [update, x[0]]

x_hat = [2, 2]
tau, weight = 0.02, 0.8

for k in range(40):
    x_hat = ball_acceleration(x_hat, [tau, weight])
```

\n\n

\n\nWe can see that this heavy-ball method acts like a ball rolling down a hill\nwith friction. It nearly stops and falls back down into the local minima. It\nsettles at the bottom, near the global minima.

\n\nThis heavy-ball method is not guaranteed to converge to the global minima, even\nthough it does in this example. Typically, the heavy-ball method is used to get\nin the same region as the global minima, and then a normal optimization method\nis used that is guaranteed to converge to the global minima.

\n\nTo show this in higher dimensions, I quickly coded up the same algorithm above\nbut in higher dimensions (as shown in the appendix). Essentially, only made\nsmall changes to the function `grad`

were made. While doing this, I used\n$\\yb = \\Ab \\xb^\\star$ to generate ground truth and plotted the distance from\n$\\xb^\\star$. I made $\\Ab \\in \\R^{50 \\times 35}$ a tall matrix to represent\nan overdetermined system and to find the pseudoinverse. This is a problem\nformulation that where gradient descent methods are used.

I then graphed the convergence rate. In this, I knew ground truth and plotted the $\ell_2$ distance to the true solution for both classic gradient descent and this accelerated ball method.

\n\n\n\nThis only shows that this ball-acceleration method is faster for linear systems\n– it doesn’t have any saddle points like a non-convex function!

- An introduction to gradient descent, an explanation of gradient descent with code examples.
- A mathematical approach to the heavy-ball acceleration method
- Another mathematical approach on Nesterov acceleration.
- The lecture notes for lectures 2010-10-10 and 2012-10-15 in Ben Recht’s class CS 726: Nonlinear optimization I.
- The academic paper that introduced the heavy-ball method, published by the Russian mathematician Polyak in 1964.

```python
M, N = 50, 35
A = np.random.rand(M, N)
x = np.random.rand(N)
y = A @ x

np.random.seed(42)
initial_guess = np.random.rand(N)
x_ball = [initial_guess.copy(), initial_guess.copy()]
x_grad = initial_guess.copy()

c = [1/np.linalg.norm(A)**2, 0.9]

for k in range(int(500)):
    x_grad = x_grad - c[0]*grad(A, x_grad, y)
    x_ball = ball_acceleration(A, x_ball, y, c)
```

The discretization of $\ddot{x} + a\dot{x} + b\grad{f}(x) = 0$ is given by

$$
\align{
(x_{k+1} - x_{k-1}) + a (x_{k+1} - x_k) + b\grad{f}(x_k) = 0\\
}
$$

\n\nSimplifying, we see that (if and only if’s left out for this algebraic\nmanipulation)

\n\n$$\n\\align{\n(1 + a) x_{k+1} &= ax_k + x_{k-1} - b \\grad{f}(x_k)\\\\\n(1 + a) x_{k+1} &= (a + 1)x_k + x_{k-1} - x_k - b \\grad{f}(x_k)\\\\\nx_{k+1} &= x_k + \\frac{1}{1+a} (x_{k-1} - x_k) - \\frac{b}{1+a} \\grad{f}(x_k)\\\\\n}\n$$

\n\n\n\n\n\n\n

\n",
"tags": [],
"image": ""
},
{
"id": "http://stsievert.com/blog/2016/01/06/vim-jekyll-mathjax/",
"url": "http://stsievert.com/blog/2016/01/06/vim-jekyll-mathjax/",
"title": "Vim syntax highlighting for Markdown, Liquid and MathJax",
"date_published": "2016-01-06T06:30:00-06:00",
"date_modified": "2016-01-06T06:30:00-06:00",
"author": {
"name": "Scott Sievert",
"url": "http://stsievert.com"
},
"summary": "

I write my Jekyll blog in Markdown with Vim. I often include LaTeX equations (via MathJax) in my posts and/or Liquid tags like `{% include %}`. MathJax equations and Liquid tags aren’t included in tpope/vim-markdown, which meant that the LaTeX can mess with the syntax highlighting, as illustrated by the image below. I’ll describe a quick fix for this, resulting in the image on the right.

A lot is wrong in the image on the left: LaTeX commands are misinterpreted as\nmisspelled words and entire paragraphs get incorrectly italicized.

To highlight the MathJax equations, I included these lines in my `.vimrc`. I set them up to only run when a Markdown file opens.

````vim
function! MathAndLiquid()
    "" Define certain regions
    " Block math. Look for "$$[anything]$$"
    syn region math start=/\$\$/ end=/\$\$/
    " inline math. Look for "$[not $][anything]$"
    syn match math_block '\$[^$].\{-}\$'

    " Liquid single line. Look for "{%[anything]%}"
    syn match liquid '{%.*%}'
    " Liquid multiline. Look for "{%[anything]%}[anything]{%[anything]%}"
    syn region highlight_block start='{% highlight .*%}' end='{%.*%}'
    " Fenced code blocks, used in GitHub Flavored Markdown (GFM)
    syn region highlight_block start='```' end='```'

    "" Actually highlight those regions.
    hi link math Statement
    hi link liquid Statement
    hi link highlight_block Function
    hi link math_block Function
endfunction

" Call everytime we open a Markdown file
autocmd BufRead,BufNewFile,BufEnter *.md,*.markdown call MathAndLiquid()
````

I put this in my `.vimrc`. I tried to include helpful comments, but I can summarize the steps:

- Look for LaTeX equations, surrounded by `$` or `$$` pairs. For inline math with `$`, we use a regex that only works on a single line. For `$$`, we use vim’s region.
- Look for Liquid tags, wrapped with `{%` and `%}`. This regex is easier.
- We then highlight the regions we define as `math` and `liquid` with `Function` or `Statement`.

When I first wrote this post, I modified my Jekyll’s `_config.yml` file to include “GitHub Flavored Markdown” for my Markdown parser, kramdown. This isn’t necessary for nice highlighting but it’s still nice (mostly for fenced code blocks).

To tell kramdown to use GitHub Flavored Markdown, I included the following lines in my `_config.yml`:

```yaml
markdown: kramdown

kramdown:
  input: GFM
  hard_wrap: false
```

If you use an 80-character line wrap like I do, `hard_wrap: false` is necessary to prevent kramdown from putting `<br>` tags wherever it sees a line break in the file.

The method above worked for me; it provided basic functionality of what I\nneeded. Of course, there are other methods to achieve similar functionality:

- vim-markdown-jekyll, not an alternative, but highlights YAML front matter. I also installed this plugin.
- vim-pandoc-syntax, looks promising (pandoc documents can include Markdown, LaTeX and a bunch of other stuff) but a quick try couldn’t get it to work.
- tpope/vim-liquid, no documentation, but as far as I can tell it only highlights `.liquid` files.
- plasticboy/vim-markdown, provides markdown with LaTeX highlighting too (not just one color) but doesn’t highlight HTML.

How might two people communicate without others even knowing they’re communicating? They could be communicating to harm some entity and are being observed by that entity.^{1} Because of this, they want to send a message that others can’t even detect if it’s present.

\n

\n",
"content_html": "- I’m sure you imagine more situations where other more nefarious people are communicating and *know* they’re being watched. ↩

How might two people communicate without others even knowing they’re communicating? They could be communicating to harm some entity and are being observed by that entity.^{1} Because of this, they want to send a message that others can’t even detect if it’s present.

What method can we use to talk/communicate without others even knowing we’re talking? The first and most obvious approach is to use the least significant bit to encode the message. That is, this method takes an image and hides the message in the least significant of the 8 bits of each pixel. This process is best shown with an image:

\n\n\n\n

\n\nThis hardly changes the value of the pixel color, and it can encode a message (if given enough pixels). The code to do this approach would be
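The original snippet was shown as an image; a minimal NumPy reconstruction of the idea (with a made-up stand-in “image” and helper names of my own choosing) might look like:

```python
import numpy as np


def hide(pixels, bits):
    """Overwrite the least significant bit of the first pixels with message bits."""
    out = pixels.copy()
    out[: len(bits)] = (out[: len(bits)] & ~np.uint8(1)) | bits
    return out


def recover(pixels, n_bits):
    """Read the message back out of the least significant bits."""
    return pixels[:n_bits] & 1


np.random.seed(0)
image = np.random.randint(0, 256, size=64, dtype=np.uint8)   # stand-in image
message = np.random.randint(0, 2, size=32).astype(np.uint8)  # bits to hide

stego = hide(image, message)
assert np.array_equal(recover(stego, len(message)), message)
# each pixel changes by at most 1, invisible to the eye
assert np.abs(stego.astype(int) - image.astype(int)).max() <= 1
```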

\n\n\n\nThis method can even be made secure: both the communicator and receiver can randomly shuffle their image in the same way using a key they both know (i.e., with `np.random.seed(key)`

). This makes it an attractive method, especially if communicating something you’d like to remain secret and know you’re being watched.

This method succeeds, at least by visual inspection. By looking at the images, it’s impossible to tell if a message has been sent.

However, this method fails under more rigorous testing. If we plot a histogram of the results, we can see a strange pattern occurring on the odd/even values:

\n\n\n\nThis method works and withstands uploading to imgur because the images are saved as PNGs, a lossless compression. However, if compressed to JPG to save space, the least significant bit will be corrupted because JPG isn’t a lossless compression.

Instead of the least significant bit method, let’s hide our message along the edges of an image. This is where the values are quickly changing; flipping a bit will be a small change within a much larger change. This means that we’ll only be able to hide a smaller message, since we can only hide near the edges. Earlier, we made no restrictions on where we could hide the message.

\n\nTo do that, we’ll use the wavelet transform. You don’t have to know this in detail, just know that the wavelet transform is:

\n\n- \n
- A unique way of representing an image. It’s possible to back-and-forth between the wavelet transform and image. \n
- The edges of the image are represented by large values in magnitude in the image. We’ll be changing the least significant bit of these large values that characterize the image. \n

To do this, we would find the wavelet coefficients that are larger than some threshold (the threshold would have to be known on both sides). Then we could find the support of a wavelet coefficient and change those values. By the definition of the wavelet transform, this would correspond to changing pixel values that are near edges.

\n\nWhen I first implemented this, I didn’t find the support of each term and instead changed the value of coefficients larger than some threshold. Regardless, plotting the histograms shows us that if we change values near the edges, our message is better hidden in the histogram:

\n\n\n\nBy visual inspection, I can’t tell these two curves apart without knowing the other curve. This is exactly what this method hopes to achieve. It’s impossible to recover the message without knowing the message exactly – if you knew the message exactly, why would you go to the work of recovering the message?

To detect data hiding methods like these, which hide the fact that two parties are communicating, agencies that intercept these communications might try a suite of commonly used methods to decode the message.

\n\n

\n",
"tags": [],
"image": ""
},
{
"id": "http://stsievert.com/blog/2015/12/09/generating-rvs/",
"url": "http://stsievert.com/blog/2015/12/09/generating-rvs/",
"title": "Using the uniform random variable to generate other random variables",
"date_published": "2015-12-09T07:31:00-06:00",
"date_modified": "2015-12-09T07:31:00-06:00",
"author": {
"name": "Scott Sievert",
"url": "http://stsievert.com"
},
"summary": "- \n
- \n
I’m sure you imagine more situations where other more nefarious people are communicating and

\n*know*they’re being watched. ↩ \n

Since computers were invented we have spent a lot of time generating *uniform*\nrandom numbers. A quick search on Google Scholar for “Generating a uniform\nrandom variable” gives 850,000 results. But what if we want to generate\nanother random variable? Maybe a Gaussian random variable or a binomial\nrandom variable? These are both extremely useful.^{1}

\n

\n",
"content_html": "- \n
- \n
I won’t cover this here, but the Gaussian random variable is useful almost everywhere and the binomial random variable can represent performing many tests that can either pass or fail. ↩

\n \n

Since computers were invented we have spent a lot of time generating *uniform*\nrandom numbers. A quick search on Google Scholar for “Generating a uniform\nrandom variable” gives 850,000 results. But what if we want to generate\nanother random variable? Maybe a Gaussian random variable or a binomial\nrandom variable? These are both extremely useful.^{1}

We’ve spent so long focusing on generating uniform random variables that they must\nbe useful. That is, computers try to generate

\n\n$$\nU \\sim \\textrm{uniform}(0, 1)\n$$

\n\nwhich is a random number between 0 and 1, with every value equally likely. This is exactly the variable that has received so much attention: we’ve developed\npseudo-random number generators for it, and it’s what `rand`

implements in\nmany programming languages.

The cumulative distribution function (CDF) defines our random variable: it\ngives the probability that the random variable lies below a certain\nvalue. The cumulative distribution function of\n$X$ is defined to be

\n\n$$\nF_X(x) = \\Pr(X \\le x)\n$$

\n\nThen, let’s define our random variable using the inverse of the CDF we’d like, $F(x)$:

\n\n$$\nX = F^{-1}(U)\n$$

\n\nBecause we defined our random variable as $X = F^{-1}(U)$, we see that our\nestimate of the CDF for $X$, $F_X$ is

\n\n$$\n\\begin{aligned}\nF_X(x) &= \\Pr(X \\le x)\\\\\n&= \\Pr(F^{-1}(U) \\le x)\\\\\n&= \\Pr(U \\le F(x))\\\\\n&= F(x)\n\\end{aligned}\n$$

\n\nwhere in the last step we use that $\\Pr(U \\le t) = t$ for $t \\in [0, 1]$: $U$ is a uniform\nrandom variable between 0 and 1, so its CDF is a line with slope\n1 on that interval.
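That last fact, $\\Pr(U \\le t) = t$, is easy to check numerically (a quick sketch with NumPy):

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.uniform(0, 1, size=100_000)

# the CDF of U is F_U(t) = t on [0, 1]: Pr(U <= t) should be close to t
for t in [0.1, 0.5, 0.9]:
    assert abs(np.mean(u <= t) - t) < 0.01
```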

\n\nWe see that $F_X(x) = F(x)$, exactly as we wanted! This means that we only need\na uniform random number and $F^{-1}$ to generate any random variable. Once\nwe can compute $F^{-1}$ (which may be difficult), we can generate any random\nvariable!

\n\nFor an example, let’s use the exponential random variable. The cumulative\ndistribution function of this random variable is defined to be

\n\n$$\nF_X(x) = (1 - e^{-\\lambda x}) u(x)\n$$

\n\nTo generate this random variable, we would define

\n\n$$\nX = F_X^{-1}(U) = -\\frac{1}{\\lambda}\\log(1 - U)\n$$

\n\nThe following Python generates that random variable, and tests against theory.
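A sketch of such a generator, with a check against theory, might look like this (assuming NumPy; the rate $\\lambda$ is illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
lam = 2.0  # the rate parameter lambda (an illustrative value)

# inverse transform sampling: X = F^{-1}(U) = -log(1 - U) / lambda
u = rng.uniform(0, 1, size=100_000)
x = -np.log(1 - u) / lam

# theory: an exponential(lambda) variable has mean 1/lambda, variance 1/lambda^2
assert abs(x.mean() - 1 / lam) < 0.01
assert abs(x.var() - 1 / lam**2) < 0.02
```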

\n\n\n\nAnd we can view the results with a histogram:

\n\n\n\nA quick check at `np.random`

shows that it generates this random variable the\nsame way.

\n

\n",
"tags": [],
"image": ""
},
{
"id": "http://stsievert.com/blog/2015/12/09/inverse-part-2/",
"url": "http://stsievert.com/blog/2015/12/09/inverse-part-2/",
"title": "Finding sparse solutions to linear systems",
"date_published": "2015-12-09T06:30:00-06:00",
"date_modified": "2015-12-09T06:30:00-06:00",
"author": {
"name": "Scott Sievert",
"url": "http://stsievert.com"
},
"summary": "- \n
- \n
I won’t cover this here, but the Gaussian random variable is useful almost everywhere and the binomial random variable can represent performing many tests that can either pass or fail. ↩

\n \n

*This post is a part 2 of a 3 part series: Part I, Part II, Part III*

We often have fewer measurements than unknowns, which happens all the time in\ngenomics and medical imaging. For example, we might be collecting 8,000 gene\nmeasurements in 300 patients and we’d like to determine which ones are most\nimportant in determining cancer.

\n\nThis means that we typically have an underdetermined system because we’re\ncollecting fewer measurements than unknowns. This is an unfavorable situation –\nthere are infinitely many solutions to this problem. However, in the case of\nbreast cancer, biological intuition might tell us that most of the 8,000 genes\naren’t important and have zero importance in cancer expression.

\n\n\n\nHow do we enforce that most of the variables are 0? This post will try and give\nintuition for the problem formulation and dig into the algorithm to solve the\nposed problem. I’ll use a real-world cancer dataset^{1} to predict which\ngenes are important for cancer expression. It should be noted that we’re more\nconcerned with the *type* of solution we obtain rather than how well it\nperforms.

\n

\n",
"content_html": "- \n
- \n
This data set is detailed in the section titled Predicting Breast Cancer ↩

\n \n

*This post is a part 2 of a 3 part series: Part I, Part II, Part III*

We often have fewer measurements than unknowns, which happens all the time in\ngenomics and medical imaging. For example, we might be collecting 8,000 gene\nmeasurements in 300 patients and we’d like to determine which ones are most\nimportant in determining cancer.

\n\nThis means that we typically have an underdetermined system because we’re\ncollecting fewer measurements than unknowns. This is an unfavorable situation –\nthere are infinitely many solutions to this problem. However, in the case of\nbreast cancer, biological intuition might tell us that most of the 8,000 genes\naren’t important and have zero importance in cancer expression.

\n\n\n\nHow do we enforce that most of the variables are 0? This post will try and give\nintuition for the problem formulation and dig into the algorithm to solve the\nposed problem. I’ll use a real-world cancer dataset^{1} to predict which\ngenes are important for cancer expression. It should be noted that we’re more\nconcerned with the *type* of solution we obtain rather than how well it\nperforms.

This post builds off Part I of this series. In this post, we will only\nchange the type of solution we obtain by changing the regularization\nparameter. In Part I, we saw that we could get an acceptable solution by\nintroducing a regularization parameter, $\\norm{\\xb}_2^2$. In this post,\nwe’ll examine changing that to $\\norm{\\xb}_1$.

\n\nBefore getting started, we’ll have to use norms, as they provide a nice\nsyntax for working with vectors and matrices. We define the $\\ell_p$ norms as

\n\n$$\n\\norm{\\xb}_p = \\parens{\\sum \\abs{x_i}^p}^{1/p}\n$$

\n\nwhich means that $\\ell_2$ norm of $\\xb$ is\n$\\norm{\\xb}_2 := \\sqrt{\\sum_i x_i^2}$\n(meaning $\\norm{\\xb}_2^2 = \\sum_i x_i^2$) and $\\ell_1$ norm of\n$\\xb$ is $\\norm{\\xb}_1 := \\sum_i \\abs{x_i}$. This definition doesn’t\ndefine $\\ell_0$ norms, but we define it to be the number of non-zero terms.
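In NumPy, these norms are one-liners (a quick sketch):

```python
import numpy as np

x = np.array([3.0, 0.0, -4.0, 0.0])

l2 = np.sqrt(np.sum(x**2))  # ell_2 norm: sqrt(9 + 16) = 5
l1 = np.sum(np.abs(x))      # ell_1 norm: 3 + 4 = 7
l0 = np.count_nonzero(x)    # "ell_0 norm": number of non-zero terms = 2

assert l2 == 5.0 and l1 == 7.0 and l0 == 2
# np.linalg.norm computes the same quantities
assert np.linalg.norm(x, 2) == l2 and np.linalg.norm(x, 1) == l1
```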

\n\nThrough external information, we know that most of our solution is 0.\nTherefore, we want to limit the number of non-zeros in our solution.

\n\nThat is, after observing $\\yb$ and $\\Xb$, we’re trying to find a solution\n$\\widehat{\\wb}$ that minimizes the squared error and has a small number of non-zeros. We\ncan do this by adding a penalty on the number of non-zeros:

\n\n$$\n\\align{\n\\widehat{\\wb} &= \\arg \\min_\\wb \\norm{\\yb - \\Xb\\wb}_2^2 + \\lambda\\sum_i\n1_{w_i \\not= 0}\n}\n$$

\n\nbut this problem is exceptionally hard to solve because it is non-convex\nand NP-hard. The best algorithms essentially search over which variables are\nallowed to be non-zero, and there are $2^n$ ways to make that choice. This takes exponential\ntime… how can we find a more efficient method?

\n\nThe closest convex relaxation of the $\\ell_0$ norm is the $\\ell_1$ norm. In our\nnew problem with the $\\ell_1$ norm as defined above, we can make the $\\ell_1$\nnorm small by making many of the terms zero. This means we’re trying to solve

\n\n$$\n\\widehat{\\wb} = \\arg \\min_\\wb \\norm{\\yb - \\Xb\\wb}_2^2 +\n\\lambda\\norm{\\wb}_1\n$$

\n\nThe type of regularization characterizes the signal we get as output.\nIn this problem formulation, we are including the term $\\norm{\\wb}_1$, the\n$\\ell_1$ regularization parameter. This gives a much different result than the\n$\\ell_2$ regularization parameter $\\norm{\\wb}_2^2$. To help see this, I have\ndeveloped an interactive widget that highlights the difference between $\\ell_1$\nand $\\ell_2$ regularization!

\n\n\n\n\n

\n\n\n\n

\nThe type of regularization parameter we use matters a ton – there’s a huge difference in the output when using $\\norm{\\wb}^2_2$ instead of $\\norm{\\wb}_1$.

\n\nWhy? We can think about the physical interpretations. If we’re trying to choose an engine to buy and we normalize for appropriate units, we might use the

\n\n- \n
- $\\ell_2$ norm if we’re trying to use as little gas as possible. This\ncorresponds to using as little energy as possible. This makes sense, because\nenergy typically comes in squares (i.e., kinetic energy is $\\frac{1}{2}mv^2$.\nSee Table 3 at tauday.com for more examples). \n
- $\\ell_0$ or $\\ell_1$ norm if we want to run the engine as little as possible.\nWe don’t care about how much gas we use, just how long it’s running. Because\nthe engine is off most of the time, this corresponds to a sparse solution. \n
- $\\ell_\\infty$ norm, or the maximum element in a vector. This would correspond\nto being limited to how much power we can use. We can have the engine on as\nlong as we want and use as much gas as we want. For example, the state could\nregulate that cars have to be less powerful than some limit. \n

To help provide more intuition, I have provided a 2D example using an\ninteractive widget. In this 2D example, we can think that $\\Xb = [c_1, c_2]$\nand $y$ as a scalar. We’re trying to find $\\wb = [w_1, w_2]^T \\in \\R^2$, but we only have one measurement; there are infinitely many\nsolutions.

\n\n- \n
- \n

\n**All possible solutions**to\n$y = c_1 w_1 + c_2 w_2$ are graphed by the purple line. We know $y$ and are trying to\n estimate $w_1$ and $w_2$ from our knowledge of $c_1$ and $c_2$. \n - \n

\n**The $\\ell_1$ solution vector**and the\n**the $\\ell_2$ solution vector**are in blue\nand green. The norm balls for the solution vectors are also graphed. \n

When we change $c_1$, we see that the solution tends to minimize the distance between the norm ball and the line of all possible solutions. We can see that when we increase $\\lambda$, our estimate gets smaller. This makes sense because we are placing more emphasis on the norm term, which reaches its optimum at the origin.

\n\n\n

\n\n\nWe can think of this optimization for $\\ell_2$ and $\\ell_1$ as minimizing the\ndistance between the norm balls and the line of all possible solutions. We see\nthat the $\\ell_1$ norm tends to give solutions with more zeros in them such as\n$(1, 0)$ or $(0, 1)$. The $\\ell_2$ solution gives more non-zero values off the\naxis which means, by definition, the $\\ell_1$ solution is more *sparse* than\nthe $\\ell_2$ solution.

Now that we know what tool to use we can start to tackle this cancer problem!

\n\nIn a class this semester, Rob Nowak and Laurent Lessard introduced a breast cancer dataset described in a New England Journal of Medicine article.^{2} This dataset measures gene expression levels in cancerous and healthy patients (295 patients in total, roughly 8,000 genes). We’d like to design a predictor for breast cancer based off levels of gene expression.

In this dataset, we observe if someone has cancer or not, indicated by $\\yb$,\nwith each element being $\\pm 1$ indicating the presence of cancer. For these\n295 patients, this dataset also provides the expression levels of 8,000\ngenes, as expressed by $\\Xb$. The $i$th row in $\\Xb$ corresponds to $\\yb_i$\n– it contains the gene expression levels for patient $i$.

\n\nIn this problem, we’d like to determine how important each gene is for cancer.\nWe will assign a weight to each gene, and a weight of 0 means it’s not at all\nimportant. We will assume that the underlying model takes the sign of our\nprediction, or $\\yb = \\sign{\\left(\\Xb\\wb\\right)}$ where\n$\\sign{\\left(\\cdot\\right)}$ is applied element-wise.

\n\nWe will solve this problem formulation we saw above, the LASSO problem\nformulation:

\n\n$$\n\\widehat{\\wb} = \\arg \\min_\\wb \\norm{\\yb - \\Xb\\wb}_2^2 + \\lambda \\norm{\\wb}_1\n$$

\n\nAs we saw above, it will encourage that most $w_i$’s are 0, or have no importance in determining if someone has breast cancer or not. We saw above *why* this formulation made sense, but now let’s see *how* to solve it!

This optimization problem has no closed-form solution, meaning we have to find our solution iteratively. We’ll choose to solve this with an alternating minimization method (a method for biconvex optimization).

\n\nThis method utilizes the fact that both the error term and the regularization term are positive. Given two positive terms, it’s natural to optimize one and then the other to minimize their sum. (If we were minimizing their product, it’d be natural to drive one factor as close to 0 as possible.)

\n\nOne such alternating optimization method is the proximal gradient method, which has rigorous justification in academic papers.^{3}^{4}^{5}^{6} Given suitable loss functions and regularization terms, this method does two steps (typically in a for-loop):

- \n
- Take a step towards the solution that minimizes the error as defined by the loss function. This steps in the negative gradient (scalar case: derivative) direction because the gradient/derivative points in the direction the function
*increases*. \n - Enforce that the solution that minimizes the loss function should also minimize the regularization function. This takes the solution found in (1) and pulls it back toward something the regularizer finds acceptable. \n

These steps can be written as

\n\n$$\n\\bm{z} = \\wb_k - \\tau \\nabla F\\left(\\wb_k\\right)\n$$

\n\n$$\n\\wb_{k+1} = \\arg \\min_\\wb \\norm{\\wb - \\bm{z}}_2^2 + \\tau \\lambda\n\\norm{\\wb}_1\n$$

\n\nwhere $\\nabla F(\\wb_k)$ represents the gradient of $F(\\wb) = \\norm{\\yb - \\Xb\\wb}_2^2$ at the point $\\wb_k$ (which represents the estimate at iteration $k$) and $\\tau$ represents some step size. The equations correspond to steps (1) and (2) above.

\n\nThe derivations are in the appendix and results in

\n\n$$\n\\bm{z} = \\wb_k - \\tau \\cdot 2 \\Xb^T \\left(\\Xb\\wb_k - \\yb\\right)\n$$

\n\n$$\n\\wb_{k+1} = \\textrm{sign}(\\bm{z}) \\cdot \\left(\\abs{\\bm{z}} -\n\\tau\\lambda/2\\right)_+\n$$

\n\nwhere $(\\cdot)_+$ keeps the positive elements of the input and sets the rest to $0$; the whole expression is the soft thresholding operator. All the operations ($\\textrm{sign}$, $(\\cdot)_+$, multiplication, etc.) are done element-wise.

\n\nWe can implement these functions as follows:

\n\n`def proximity(z, threshold):\n \"\"\" The prox operator for L1 regularization/LASSO. Returns\n sign(z) * (abs(z) - threshold)_+\n where (.)_+ is the soft thresholding operator \"\"\"\n x_hat = np.abs(z) - threshold/2\n x_hat[x_hat < 0] = 0\n return x_hat * np.sign(z)\n\ndef gradient(A, x, y):\n \"\"\" Computes the gradient of least squares loss at the point x \"\"\"\n return 2*A.T @ (A@x - y)\n`

We can also implement the alternating minimization. As equations $(1)$ and $(2)$ mention, the output of the gradient step is fed into the proximity operator.

\n\n`# X is a fat matrix, y a label vector, y_i \\in {-1, 1}\nX, y = breast_cancer_data()\n\n# our initial guess; most of the values stay zero\nw_k = np.zeros(X.shape[1])\ntau = 1 / np.linalg.norm(X)**2 # guarantees convergence\nfor k in range(int(1e3)):\n z = w_k - tau * gradient(X, w_k, y)\n w_k = proximity(z, tau*lambda)\n\n# binary classification tends to take the sign\n# (we're only interested in the proprieties of the weights here, not y_hat)\ny_hat = np.sign(A @ w_k)\n`

After wrapping this function in a class, this turns into the basics of `sklearn.linear_model.Lasso`

. I have done this and the source is available on GitHub.

When we find the optimal weights, we don’t want to use *all* the data: we hold out a set of testing data and never see it until we test on it. Using only 80% of the dataset to train our weights, we get the following set of weights!

\n

\n\n\n\n

\nWith this $\\lambda$, this prediction gives an accuracy of 81.36% when predicting on the 20% of the data reserved for testing. If we had a balanced dataset (we don’t), we’d get an accuracy of 50% by flipping a coin. While we don’t have ideal accuracy, we do have a solution with many zeros, which is what we set out to do.

\n\nThe content of this blog post was finding sparse solutions, but how can we improve these results? We are performing binary classification – we’re predicting either “yes” or “no”… but we don’t really use that information. We just naïvely used a least squares loss, which penalizes points that are easy to classify and lie too far on the correct side of the decision boundary!

\n\nThe next post will focus on support vector machines, classifiers that don’t punish predictions for being too correct. They do this by using the hinge loss and logistic loss.

\n\n*For ease, we will drop the bold face math, meaning $x := \\xb, A := \\Ab, y:=\\yb$. Also note that all operators are evaluated element-wise (except matrix\nmultiplication). This applies to $(\\cdot)_+$, element-wise multiplication, and the sign operator.*

This “proof” details the iterative soft thresholding algorithm. This method can\nbe accelerated by the algorithms FISTA^{7} or FASTA^{8} by choosing a\ndifferent step size at each iteration with $\\tau_k$.

For rigorous justification why the proximal gradient method is justified,\nsee academic papers.^{3}^{4}^{5}^{6}

*Given $\\phi(x) = \\norm{Ax - y}_2^2$, the gradient is $2A^T(Ax - y)$.*

*Proof:* We can choose to represent squared error as $(Ax - y)^T (Ax-y)$. Then\nusing intuition from scalar calculus and some gradient identities,


\n\n$$\n\\begin{aligned}\n\\nabla \\phi(x) &= \\frac{d (Ax - y)}{dx} \\cdot 2\\cdot (Ax - y)\\\\\n&= 2 A^T (Ax - y)\n\\end{aligned}\n$$

\n\n*Given the proximity operator for $\\ell_1$ regularization as $\\phi(x) =\n\\norm{y - x}_2^2 + \\lambda\\norm{x}_1$, the optimum solution is given by\n$\\widehat{x} = \\sign(y)\\left(\\abs{y} - \\lambda/2\\right)_+$ where $(\\cdot)_+$\ndenotes the soft thresholding operator.*

*Proof:* The definition of the proximity operator results in a separable\nequation that allows us to write

$$\n\\begin{aligned}\n\\phi(x) &= \\norm{y - x}_2^2 + \\lambda \\norm{x}_1\\\\\n&= \\sum_i (y_i - x_i)^2 + \\lambda \\abs{x_i}\n\\end{aligned}\n$$

\n\nThis equation can be minimized by minimizing each term separately.

\n\n$$\n\\pder{\\phi(x)_i}{x_i} = -2(y_i - x_i) + \\lambda\\pder{\\abs{x_i}}{x_i}\n$$


\n\nThis last term on the end, $\\pder{\\abs{x_i}}{x_i}$ is tricky: at $x_i =\n0$, this term is not differentiable. After using subgradients to skirt around\nthat fact, we can say that $\\pder{\\abs{x_i}}{x_i} = \\sign(x_i) = \\pm 1$\nwhich makes sense when we’re not at the origin.

\n\nThis function is convex, which allows us to set the derivative to 0 and\nfind the global minimum.

\n\n$$\n\\pder{\\phi(x)_i}{x_i} = 0 = -2(y_i - x_i) + \\lambda\\sign(x_i)\n$$

\n\n$y_i > 0$ implies that $x_i > 0$ which allows us to write

\n\n$$\n\\Align{\nx_i &= y_i - \\frac{\\lambda}{2}\\sign(y_i)\\\\\n&= \\sign(y_i) (\\abs{y_i} - \\lambda/2)\n}\n$$

\n\nBut when $\\abs{y_i} < \\lambda/2$ we run into difficulty because $y_i > 0\n\\implies x_i > 0$. To get around this fact, we set all the values where\n$\\abs{y_i} < \\lambda/2$ to 0 and subtract $\\lambda/2$ in magnitude from the rest (this\nis detailed further in a StackExchange post). With this, we\ncan now write

\n\n$$\nx_i = \\sign(y_i) \\cdot (\\abs{y_i} - \\lambda/2)_+\n$$

\n\nwhere $(x)_+ := \\max(x, 0)$.
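We can sanity-check this closed form against a brute-force minimization of $\\phi$ over a fine grid (a sketch with NumPy; the grid and $\\lambda$ are illustrative):

```python
import numpy as np

def prox_l1(y, lam):
    """Closed-form minimizer of (y - x)^2 + lam * |x|: soft thresholding."""
    return np.sign(y) * np.maximum(np.abs(y) - lam / 2, 0)

lam = 1.0
grid = np.linspace(-3, 3, 60001)  # brute-force search over x
for y in [-2.0, -0.3, 0.0, 0.4, 1.7]:
    objective = (y - grid) ** 2 + lam * np.abs(grid)
    brute = grid[np.argmin(objective)]
    assert abs(brute - prox_l1(y, lam)) < 1e-3
```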

\n\n\n

\n",
"tags": [],
"image": ""
},
{
"id": "http://stsievert.com/blog/2015/11/19/inverse-part-1/",
"url": "http://stsievert.com/blog/2015/11/19/inverse-part-1/",
"title": "Least squares and regularization",
"date_published": "2015-11-19T06:30:00-06:00",
"date_modified": "2015-11-19T06:30:00-06:00",
"author": {
"name": "Scott Sievert",
"url": "http://stsievert.com"
},
"summary": "\n\n- \n
- \n
This data set is detailed in the section titled Predicting Breast Cancer ↩

\n \n - \n
One of the trickiest issues is finding real world data to apply your problems too. ↩

\n \n - \n
Wright, S. J., Nowak, R. D., & Figueiredo, M. A. (2009). Sparse reconstruction by separable approximation. Signal Processing, IEEE Transactions on, 57(7), 2479-2493. ↩ ↩

\n^{2}\n - \n
Hale, E. T., Yin, W., & Zhang, Y. (2007). A fixed-point continuation method for l1-regularized minimization with applications to compressed sensing. CAAM TR07-07, Rice University, 43, 44. ↩ ↩

\n^{2}\n - \n
Daubechies, I., Defrise, M., & De Mol, C. (2003). An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. arXiv preprint math/0307152. ↩ ↩

\n^{2}\n - \n
Figueiredo, M. A., Bioucas-Dias, J. M., & Nowak, R. D. (2007). Majorization–minimization algorithms for wavelet-based image restoration. Image Processing, IEEE Transactions on, 16(12), 2980-2991. ↩ ↩

\n^{2}\n - \n
Beck, A., and Teboulle, M.. “A fast iterative shrinkage-thresholding algorithm for linear inverse problems.” SIAM journal on imaging sciences 2.1 (2009): 183-202. ↩

\n \n - \n
Goldstein, T., Christoph S., and Richard B. “A field guide to forward-backward splitting with a FASTA implementation.” arXiv preprint arXiv:1411.3406 (2014).,

\n*website*↩ \n

*This post is part 1 of a 3 part series: Part I, Part II, Part III.*

Imagine that we have a bunch of points that we want to fit a line to:

\n\n\n\nIn the plot above, $y_i = a\\cdot x_i + b + n_i$ where $n_i$ is random\nnoise. We know every point $(x_i, y_i)$ and would like to estimate $a$ and\n$b$ in the presence of noise.

\n\nWhat method can we use to find an estimation for $a$ and $b$? What constraints\ndoes this method have and when does it fail? This is a classic problem in\nsignal processing – we are given some noisy observations and would like to\ndetermine *how* the data was generated.

*This post is part 1 of a 3 part series: Part I, Part II, Part III.*

Imagine that we have a bunch of points that we want to fit a line to:

\n\n\n\nIn the plot above, $y_i = a\\cdot x_i + b + n_i$ where $n_i$ is random\nnoise. We know every point $(x_i, y_i)$ and would like to estimate $a$ and\n$b$ in the presence of noise.

\n\nWhat method can we use to find an estimation for $a$ and $b$? What constraints\ndoes this method have and when does it fail? This is a classic problem in\nsignal processing – we are given some noisy observations and would like to\ndetermine *how* the data was generated.

After we express this problem in a matrix multiplication language, the solution\nbecomes more straightforward. We abstract some of the details away and focus on\nlarger concepts. In doing this I will denote scalars with plain math font,\nvectors with lower-case bold math font and matrices with upper-case bold math\nfont.

\n\nTo represent this with matrix multiplication, we’ll stack our observations\n$y_i$ into a vector $\\yb$. When we define $\\Ab$ and $\\wb$ (our data and\nweights respectively) in certain ways, we can say that $\\yb = \\Ab\\cdot\\wb +\n\\nb$ when $\\nb$ is a vector of noise. In particular, we’ll define $\\Ab$ so\nthat each row is $[x_i,~1]$ and our weights $\\wb$ as $[a, b]^T$.

\n\n$$\n\\Align{\n\\yb &= \\Ab\\cdot\\Matrix{a\\\\b} + \\nb\\\\\n&= \\Ab\\cdot\\wb + \\nb\n}\n$$

\n\nWe are collecting more data than we have unknowns and are fitting a line,\nwhich means that this problem is overdetermined – we have more measurements\nthan unknowns. Our only unknowns are $a$ and $b$ and we have many\nmeasurements.^{1}

Our approximation of $\\wb$ will be called $\\est{\\wb}$. The key behind finding this\napproximation is realizing the objective. Should we minimize the error? Should\nwe say that large errors should be penalized more than small errors?

\n\nWe want to minimize the error, or the distance between our observations and the\nvalues that our approximation would predict. That is, we want to minimize the\ndistance between $\\yb$ and $\\Ab \\cdot \\est{\\wb}$, penalizing large errors more.\nWe can represent this as a concise mathematical equation:

\n\n$$\n\\est{\\wb} = \\arg \\min_\\wb \\norm{\\yb - \\Ab\\cdot\\wb}_2^2\n$$

\n\nThis says that our estimate $\\est{\\wb}$ will be the *arg*ument that *min*imizes the\nstandard error term. After writing this equation down and knowing what to\nGoogle, we can find a closed form solution with the help of the Wikipedia page for least squares!

$$\n\\est{\\wb} = \\left( \\Ab^T \\Ab\\right)^{-1} \\Ab^T \\yb\n$$

\n\nWith this equation we find that $\\est{\\wb} = [\\widehat{a},~\\widehat{b}]^T \\approx\n[0.89,~0.03]$, not that far away from the true solution $\\wb = [1,~0]^T$.
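Reproducing this kind of fit takes only a few lines (a sketch with NumPy; the noise draw is illustrative, so the exact numbers differ from those above):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x = np.linspace(0, 1, n)
y = 1.0 * x + 0.0 + 0.1 * rng.standard_normal(n)  # true a = 1, b = 0

# each row of A is [x_i, 1], so that A @ [a, b] = a*x + b
A = np.column_stack([x, np.ones(n)])
w_hat = np.linalg.solve(A.T @ A, A.T @ y)  # (A^T A)^{-1} A^T y

a_hat, b_hat = w_hat
assert abs(a_hat - 1.0) < 0.2 and abs(b_hat - 0.0) < 0.1
```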

\n\n\n\nI won’t cover on how to *solve* this equation, but know that matrix inversion\nis often expensive time- and memory-wise. Instead, we often rely on methods\nlike gradient descent to find a solution to this equation.

This method works and is (and should be!) widely used in similar use cases,\nbut it’s also old. What other solutions are out there and where does it fail?

\n\nLet’s say we are given noisy observations of a blurry signal. How can we\nreconstruct the values of the original signal from our observations? Similar\nsituations happen all the time in the real world – input data is averaged\ntogether all the time.

\n\nLet’s say that we make $n$ observations of a signal $\\xb$ and put the\nobservations into a vector $\\yb$. We know that the averaging of $k$ values can\nbe represented by a matrix $\\Hb_k$ because blurring is a\nlinear transformation. When we represent noise with standard deviation $\\sigma$ as\n$\\nb_\\sigma$,

\n\n$$\n\\yb = \\Hb_k \\cdot \\xb + \\nb_\\sigma\n$$

\n\nThe naive approach would be to undo the blur with the inverse\nfilter, which is possible because $\\Hb_k$ is a square matrix. It would seem\nnatural just to reverse the operation by making our estimate $\\est{\\xb}$ by

\n\n$$\n\\est{\\xb} = \\Hb_k^{-1}\\cdot\\yb = \\xb + \\Hb_k^{-1}\\cdot\\nb_\\sigma\n$$

\n\nBut the noise ruins this approximation and similar approximation (such as the\nleast squares approximation). In general, $\\Hb^{-1}$ is adding or\nsubtracting many noisy values. Even worse, it is common for $\\Hb^{-1}$ to be\nill-conditioned and include large values. This means that our estimate\n$\\est{\\xb}$ quickly gets a lot of error as (typically) many noisy values are\nmultiplied by large values and then added.

\n\nOf course, in the noise free case this method works *perfectly.* But in noisy\ncases, the estimate does not perform well… and I’ve made an interactive widget\nto help explore that fact!

\n

\n\n\nWe can see that in the noise-free case, this works perfectly, which is to be\nexpected: we know the output $\\yb$ and how the signal became blurred because\nwe also know $\\Hb$.

\n\nHowever, we can see that when we add *any* noise, our estimate is *way* off. This occurs because with noise, we’re approximating by $\\est{\\xb} = \\xb + \\Hb^{-1} \\nb$. This $\\Hb^{-1}\\nb$ will be adding up $k$ noisy values when we average over $k$ pixels. This means our noise grows and the approximation grows worse and worse. Note that the error doesn’t grow in expectation, but its standard deviation/variance grows, meaning that we’re more likely to get larger values.

We know our error comes from $\\Hb^{-1}\\nb$. When we have a moving average\nfilter over $k$ values, this adds up $k$ noisy values. We would expect the\nvalue of our noise to remain constant as expectation is linear. However, the\nvariance of our independent noisy values is

\n\n$$\n\\Align{\n\\var\\left(\\sum_{i=1}^k n_i\\right) &= \\sum_{i=1}^k \\var(n_i)\\\\\n&= k \\cdot\\sigma^2\n}\n$$

\n\nMeaning that even a single estimate $\\widehat{x}_i$ behaves poorly – there’s\nnothing limiting how large it grows!
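A quick numerical confirmation of that growth (a sketch with NumPy):

```python
import numpy as np

rng = np.random.default_rng(1)
k, sigma, trials = 25, 2.0, 200_000

# sum k independent noise values, many times over
sums = rng.normal(0, sigma, size=(trials, k)).sum(axis=1)

# the expectation stays at 0, but the variance grows to k * sigma^2
assert abs(sums.mean()) < 0.1
assert abs(sums.var() - k * sigma**2) < 2.0
```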

\n\nWe might know something about our signal. Why don’t we use that information?\nThis is the basis behind a prior probability. This is a fancy term, but it’s\njust saying that we *know* things about our signal. For example, I *know* the\nweather will not drastically change throughout the day, so I will not bring\nseveral changes of clothes, a winter coat, an umbrella and sunscreen with me.

With the above blurred signal, what facts do we *know*? We know that our signal\ndoes not grow to infinity as we saw when we increased the noise in our\nobservation. So, let’s use this information to our advantage and say that our\nreconstruction should not grow too large! To do this, we’ll use regularization.

Often times, we want to balance *two* objectives and can’t have both. For\nexample, we might want to make a car fast and gas efficient and safe and cheap,\nbut that’s not possible. We can only choose a few of those. Regularization in\nmathematics plays the same game. In the above case, we want to balance the\nerror *and* the energy of the reconstructed signal. We can’t do both, so we’ll\nhave to find a balance between the two.

Quoting the wikipedia page on regularization,

\n\n\n\n\nA theoretical justification for regularization is that it attempts to impose\nOccam’s razor on the solution. From a Bayesian point of view, many\nregularization techniques correspond to imposing certain prior distributions\non model parameters.

\n

In the example above, we want to say the energy of our estimate shouldn’t grow\ntoo large *and* the error should be relatively small. We are balancing the\nerror and the norm (or energy). We can represent the balance of error of the\nestimate and energy in the estimate with an addition. If $\\lambda > 0$ is some\nconstant that determines how important the energy is relative to the error,

$$\n\\Align{\n\\est{\\xb} &= \\arg \\min_\\xb \\norm{\\yb - \\Ab\\xb}_2^2 + \\lambda \\norm{\\xb}_2^2\n}\n$$

\n\nwhich just minimizes the error of what we observe and our estimate *and* the\nenergy in our estimate or $\\sum_i x_i^2$. When $\\lambda \\approx 0$, this\nsolution is the minimum norm least squares solution. When $\\lambda \\gg 1$, the\nsolution converges to $x_i = 0$ for all $i$.
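This regularized problem also has a well-known closed form, $\\est{\\xb} = (\\Ab^T\\Ab + \\lambda \\bm{I})^{-1}\\Ab^T\\yb$, and the shrinkage toward zero is easy to see numerically (a sketch with NumPy; the sizes and $\\lambda$ values are illustrative):

```python
import numpy as np

def ridge(A, y, lam):
    """Closed-form solution of min ||y - A x||_2^2 + lam * ||x||_2^2."""
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ y)

rng = np.random.default_rng(0)
A = rng.standard_normal((30, 10))
y = rng.standard_normal(30)

# larger lambda shrinks the estimate toward zero
norms = [np.linalg.norm(ridge(A, y, lam)) for lam in [0.0, 1.0, 100.0]]
assert norms[0] > norms[1] > norms[2]
```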

This results in a better approximation, one that is much more tolerant to\nnoise. To see that, I have repeated the same example above. Here, we’ll graph\nthe “ground truth” $\\xb$ and our estimation $\\est{\\xb}$. Our noisy observations\nwill be $\\yb$ and involve the noise free observations plus noise of standard\ndeviation $\\sigma$. That is,

\n\n$$\n\\bm{y} = \\bm{Hx} + \\bm{n}_\\sigma\n$$

\n\n$$\n\\bm{\\widehat{x}} = \\arg \\min_\\bm{x}\\norm{\\bm{y}-\\bm{H}\\bm{x}}_2^2 + \\lambda \\norm{\\bm{x}}_2^2\n$$

\n\nAnd here’s another interactive widget!

\n\n\n\n\n

\n\n\nThis is a much more noise-tolerant estimate. Why?

\n\nThis regularization parameter makes sure that each value in our estimate\ndoesn’t grow too large: it restricts the estimate $\\est{\\xb}$ from growing without bound. Increasing the noise\nincreases the variance of our estimate, and this is a method to help limit that.

\n\nAnother way to look at this is that we have two objectives: minimize the error *and* minimize the norm of the estimate $\est{\xb}$. These two minima occur at different points, so we can’t achieve both at the same time.

Of course, we can use different regularization terms. What if we know that only a small number of coefficients are nonzero? What if our signal has a sparse representation in some transform? This is discussed in Part II of this series!

\n\n\n

\n",
"tags": [],
"image": ""
},
{
"id": "http://stsievert.com/blog/2015/09/01/matlab-to-python/",
"url": "http://stsievert.com/blog/2015/09/01/matlab-to-python/",
"title": "Stepping from Matlab to Python",
"date_published": "2015-09-01T07:48:19-05:00",
"date_modified": "2015-09-01T07:48:19-05:00",
"author": {
"name": "Scott Sievert",
"url": "http://stsievert.com"
},
"summary": "- Underdetermined systems pose many challenges because they have infinitely many solutions and are the focus for the rest of the series ↩

It’s not a big leap; it’s one small step. There’s only a little to pick up and\nthere’s not a huge difference in use or functionality. The difference is so\nsmall you can switch and just google any conversion issues you have: they’re so\nsmall you’ll have no trouble finding the appropriate functions/syntax.

\n\nThere is a wrapper package in Python with the aim of providing a Matlab-like\ninterface that is well suited for numerical linear algebra. This package is\ncalled pylab and wraps NumPy, SciPy and matplotlib. When I use pylab,\nthis is how similar my Python and Matlab code is:

\n\n\n\nPython even has a matrix multiplication operator! Python 3.5 introduces `@`, detailed in PEP 465. Python is remarkably well suited for developing numerical algorithms – what else does Python offer?
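For example (a small snippet of my own, assuming NumPy arrays):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
x = np.array([1.0, 1.0])

print(A @ x)   # matrix-vector product, like Matlab's A*x
print(A @ A)   # matrix-matrix product
print(A * A)   # element-wise product -- unlike Matlab's *
```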


*Outline for this post*

- Benefits of Python
- How to switch to Python
  - Anecdote
  - Resources
  - Little hints
  - Details
    - Install
    - Development environment
    - Syntax
    - pylab
    - Style
- Conclusion

It turns out that Python is a *general* programming language built by computer\nengineers that happens to have a scientific stack built on top of it. This\nmeans it’s been optimized to be easy to develop in; the same is not true for\nMatlab, a domain-specific language.

Here’s a quick overview of *some* benefits:

- **Functions are *simple*** (a big deal!). They are *easy* to implement and can be in the same file as a general script (a feature Matlab lacks). This can make your code modular and readable. I mean, in Python functions can even have optional arguments, the scoping just works and functions can be passed as proper variables.
- **Python has many packages**, one for almost every task. A few packages scientists might run into are…
  - NumPy for numerics, Pandas for data munging, Dask for scaling
  - scikit-image for image processing, scikit-learn for machine learning and astropy for astronomy.
  - various other packages, organized by category: vinta/awesome-python, which includes everything from “Debugging tools” to “RESTful API” to “Deep learning”
  - This goes much further; there are packages for almost everything. e.g., have to deal with Excel spreadsheets? There are at least 7 options at PyXLL’s overview
- **Package install is *simple***, via the pip command line tool. Just `pip install python-drawnow` to install after you google “matlab python drawnow” and find python-drawnow!
- **Python *just works*.** Some little goodies that are really nice to have and examples of Pythonic code:
  - optional arguments for functions exist!
  - Swapping variables is *simple*: `a, b = b, a`
  - Comparison is *simple*: `1 < x < 3` works via chaining comparison operators
  - variables in a string are easy with string’s format like `"tau = {0}".format(2*pi)` (and PEP 498 proposes `f"tau = {2*pi}"`)
  - for loops are easier than easy with enumerate and zip
  - list comprehension works; collapse a simple for-loop into one line. `[2*x for x in [0, 1, 2]]` results in `[0, 2, 4]`
- **Documentation in Python is *simple***, via docstrings.
- **Python has an active development community.** Want to change a package you’re using? Make a pull request! Go to a conference or a local Python group!
- **Python is free as in beer and free as in speech.** There aren’t any licensing issues and you can see the source code of every function called.
- `clc; clear all; close all;` found in Matlab is gone forever! And no more semicolon!
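A few of those goodies, collected into one runnable snippet of my own:

```python
from math import pi

# swapping variables and chained comparisons
a, b = 1, 2
a, b = b, a
assert a == 2 and 0 < a < 3

# string formatting (f-strings exist since Python 3.6)
print("tau = {0:.4f}".format(2 * pi))

# enumerate and zip in a for-loop
for i, (name, value) in enumerate(zip(["x", "y"], [3, 4])):
    print(i, name, value)

# list comprehension: collapse a simple for-loop into one line
doubled = [2 * x for x in [0, 1, 2]]
```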

While Matlab is convenient software for linear algebra, it is a language developed by mathematicians. Python has a similar linear algebra interface but is developed by computer scientists, which makes it *easy* to develop in. For a complete discussion of this, see Python, machine learning, and language wars: a highly subjective point of view.

In fact, Python is so wonderful there’s even a relevant XKCD^{1} that illustrates how wonderful Python is as a general programming language! But now let’s get down to business describing *how* to switch over.

*Want to try Python without installing anything?* Go to try.jupyter.org or\nDataJoy to try Python out! (careful – as of 2015-11-3, `@`

is not yet\navailable for these online tools!)

- Install Anaconda with Python 3.5 (and the Anaconda install is easy – it includes everything you’ll need and puts a Launcher.app on your desktop with access to Spyder and IPython notebook!)
- Open up Spyder, which tries to provide a Matlab-like IDE environment (variable explorer, debugger, etc).
- Type `from pylab import *`, which provides a Matlab-like programming environment.

That’s it! A Matlab-like environment exists. Try almost any function from\nMatlab to see if it exists in pylab; it probably does.

\n\nBefore we get started diving into some details, I would like to share what I\nheard Rob Nowak say in personal conversation. He’s a senior professor and one\nof the people that inspired this post. He shared the below on 2016-10-22. He\nsays it only took him two hours to get up to speed and said “there is little\nbarrier to transition”.

\n\n\n\n\nI had a Matlab script that performed Lasso. I wanted to implement this in\nPython, and it only took me two hours.

\n\nFirst, I asked my grad students what to install and they recommend Anaconda.\nThen I googled “python least squares” to get the basics, then Lasso is not much\nfrom this, only some thresholding and shrinkage.

\n\nBut my data was in a .mat file, so I googled “python load mat file” and found some function [note: `scipy.io.loadmat`]
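A sketch of that function (my own demo file and variable names; `savemat` just creates a .mat file so there’s something to read back):

```python
import numpy as np
from scipy.io import savemat, loadmat

# write a .mat file the way Matlab would (hypothetical variable names)
savemat("demo.mat", {"A": np.eye(3), "y": np.arange(3.0)})

data = loadmat("demo.mat")   # dict mapping variable names to arrays
print(data["A"].shape)       # (3, 3)
```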

- I would use NumPy for Matlab users if I have almost any question.
- If I have basic syntax questions, I look at the Python wikibook in either control flow or decision control.^{2}
- If I need something more specific, I google “python matlab <function>”.
- Many different libraries are available for many other scientific use cases!
  - bokeh, seaborn, ggplot or mayavi for plotting
  - theano or caffe for machine learning
  - SymPy for symbolic manipulation
  - Many different scikit libraries! (e.g., scikit-image for image processing, scikit-learn for machine learning)

- `%reset`, an IPython magic, will clear all variables from the IPython kernel and restart the current session.
- If I want the transpose of a matrix, I use `A.T`, not `transpose(A)`.
- I use `A.shape` to get the number of rows and columns of a matrix as `(m, n)`. Not `size(A)`!
- `*` is element-wise multiplication. `@` or `A.dot(x)` is a matrix multiplication.
- Python operates by reference (explanation). This can mean that `y = x` can fail; any changes to `y` *or* `x` will be reflected in both `x` *and* `y`. *This is not the critical bug it seems to be* – I’m just careful around plain copy statements. Why?
  - Any operation (e.g., `y = 1*x`, `y = x + 0`) avoids this, as a reference to the new array `1*x` is passed to `y`.
  - Both ndarray.copy for NumPy and the copy module for everything else exist.
  - Function arguments are passed by assignment (source); if this copy issue happens inside a function, it can’t touch values outside the function.
- Functions can often be chained as methods: `reshape(arange(3*3), (3, 3))` can become `arange(3*3).reshape(3, 3)`. To see what else can provide this function mapping from `b(a())` to `a().b()`, go to IPython’s console and type `x = a() <RETURN> x.<TAB>` and see what pops up.
- In IPython, the output of `Out[n]` is accessible through `_n` at the prompt (i.e., `_7` has the same output as `Out[7]`).
- List comprehension, enumerate and zip make for-loops super easy.

The full scientific stack of software is available in Anaconda. Go to their website and download the Python 3.5 version. This gives you many packages but namely it gives you the main scientific stack (matplotlib, scipy, numpy, pandas). Anaconda includes over 330 libraries and it simplifies the entire installation process to a simple .pkg installer.

\n\nAfter navigating to Anaconda’s downloads page, you’re presented with a choice between Python 2 and Python 3. This might seem like a small difference, but Python 3 is not backwards-compatible with Python 2. For basic stuff as shown it doesn’t matter, but it can play a role when depending on third-party packages. At this point, I recommend installing Python 3 (and others recommend Python 3 as well).
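One visible incompatibility (a two-line example of my own) is integer division:

```python
# division changed between the versions: Python 3 gives a float
print(3 / 2)    # 1.5 in Python 3 (Python 2 printed 1)
print(3 // 2)   # floor division, 1 in both versions
```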

\n\nPython aligns with the Unix philosophy: you can choose a Python prompt and you can choose your editor. I have chosen vim as my editor and the IPython console as my prompt, but for prompts you can choose among the default python, ipython, bpython or ptpython consoles. IPython is (by far) the most mature and the most connected with the scientific community. It has even spawned an offspring, Project Jupyter, that generalizes the IPython notebook/etc to *any* programming language.

If you prefer an integrated development environment or IDE, I would most strongly recommend Spyder as it tries to provide a Matlab-like experience. Spyder comes with the Anaconda package and is available in the Python Launcher if you installed Anaconda (on your Desktop). Though I most recommend Spyder for IDEs, other options exist: rodeo, IDLE, NetBeans and PyCharm.

\n\nI personally use vim and the IPython console, and use IPython’s magic function `%run <file>` at the prompt. This means that the prompt can see the variables inside my script – I can just type `plot(x); show()` *at the prompt* to see what `x` looks like. The same feature is available in Spyder when you click the “Run” button.

When I use the IPython console (which is available under *Interpreters* in Spyder!) I use several IPython-specific things at the IPython prompt:

- `%run <filename>.py` to run and interactively explore my script. This allows me to type `plot(x); show()` at the prompt and it finds the global variable `x` from `<filename>.py`.
- The output of `Out[n]` is accessible through `_n` (i.e., `Out[7]` is available through `_7`).
- `function?` to get help on “function” (which reads `function`’s docstring!)
- `!<shell>` to run shell commands (`cd` and `ls` are also available separately)
- `%timeit <function>` to time a function (by running it multiple times)
- `%cpaste <code>` for copying/pasting code from the web
- `%matplotlib inline` allows for *inline* graphics! Your plot doesn’t open up in a separate window; it’s just an output like anything else! (I show an example image below)

For more detail, look at IPython’s description of different magic functions.

\n\n*Note: By default, Spyder doesn’t do the equivalent of clear all before\nrunning the script (but using %run <filename>.py does!). I recommend running\nthe magic %reset at the prompt or running in a separate Python console.*

For questions on the Python/Matlab conversion (e.g., how do I index an array/matrix), I’ve found the NumPy for Matlab users sheet helpful. For basic syntax questions (e.g., what’s a for-loop look like?) I think this cheat sheet would be useful.

\n\nThere are a lot of syntax errors that are strange for beginners. For example, to concatenate two arrays, `hstack((x, y))` is used. These double parentheses are difficult for a beginner to learn. To get around this, please please please look at NumPy for Matlab users or Google if you’re getting weird errors for something straightforward. If an error persists, I google something general and slowly refine the wording in my search.

This will give you a Matlab-like environment, as that’s pylab’s goal. However, `from anything import *` is *not* recommended. It imports a bunch of stuff into the global namespace. PEP 20, the Zen of Python, says

> Namespaces are one honking great idea – let’s do more of those!

Of course, I still use `from pylab import *` because it’s so darn convenient. I use it *during development* (not when publishing code) and you probably will too. *EDIT:* I don’t use pylab anymore; because I know where each function lives (i.e., `np.linalg.norm` vs `norm`), pylab doesn’t provide much benefit.

Better practice would be to explicitly define where each function is from, like below:
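A minimal sketch of what explicit imports look like (my own choice of functions; the point is the namespace prefix):

```python
import numpy as np
from numpy.linalg import norm

x = np.array([3.0, 4.0])
print(np.sum(x))   # explicitly np.sum, not a bare sum from pylab
print(norm(x))     # explicitly numpy.linalg's norm
```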

\n\nFor more advanced Matlab stuff, I would be uncertain if Python had a certain feature/function. I learned that I could just google “python matlab function_name” and often get the result I wanted. “python matlab drawnow” gives python-drawnow and “python matlab cvx” gives cvxpy^{3} (a convex optimization library). I just run `pip install cvxpy` at the command line and I have the library!

*A complete style guide can be found at python-guide.* This guide details\nPythonic code and how to make your code easily readable and modular.

Python has many features that Matlab doesn’t. Language features like optional\narguments and string formatting have been optimized for years in Python. These\nare incredibly nice features that Matlab doesn’t have.

\n\nThese features are things that make the code Pythonic and easy to read/parse. This includes mixing function definitions and a script in the same file, optional arguments for functions, and some of the nice functions for for-loops (`enumerate()`, `zip()`).

Mixing function definitions and a script as well as optional arguments allows\nme to play with my script and see what effects certain parameters have. It\nbecomes easy and natural to replace parameters with optional arguments, and\nthen it becomes easy and natural to change those parameters.

\n\nI might wrap a function like this:
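For example (a toy function of my own; `simulate` and its parameters are hypothetical):

```python
import numpy as np

# a script's constants promoted to optional arguments
def simulate(n=100, sigma=0.1, seed=0):
    """Noisy sine wave; every parameter can be tweaked from the prompt."""
    rng = np.random.default_rng(seed)
    t = np.linspace(0, 1, n)
    return t, np.sin(2 * np.pi * t) + sigma * rng.standard_normal(n)

t, y = simulate()             # the defaults
t, y = simulate(sigma=0.5)    # play with one parameter at a time
```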

\n\n\n\nAt the end of the day, Python is what I develop signal processing algorithms\nin. I’ve found that it’s as fast as Matlab^{4} and has the numerical\ncomputation ease/syntax that Matlab has. But perhaps the nicest thing is that\nPython can handle any other task really well: default arguments, strings,\nfunctions, system commands, running a Python script from the shell and making\ninteractive widgets are all *easy*.

\n

\n",
"tags": [],
"image": ""
},
{
"id": "http://stsievert.com/blog/2015/04/23/image-sqrt/",
"url": "http://stsievert.com/blog/2015/04/23/image-sqrt/",
"title": "Computer color is only kinda broken",
"date_published": "2015-04-23T09:22:31-05:00",
"date_modified": "2015-04-23T09:22:31-05:00",
"author": {
"name": "Scott Sievert",
"url": "http://stsievert.com"
},
"summary": "- Try `import antigravity` ↩
- And Python syntax and semantics for more advanced concepts like list comprehension ↩
- cvxpy is actively developed by Steve Diamond, a member of Stephen Boyd’s group. It supports both Python 2 and Python 3. ↩
- There’s some interesting stuff with Julia, a language with Matlab-like syntax but C-like speeds. ↩

When we blur red and green, we get this:

\n\n\n\nWhy? We would not expect this brownish color.

\n\n", "content_html": "When we blur red and green, we get this:

\n\n\n\nWhy? We would not expect this brownish color.

\n\n\n\nI’m not an expert, but I should know why. I work in an academic lab that deals with image processing (although that’s not our sole focus). Lab-mates have taken entire classes on image processing. You’d think they would know all the gory details of how images are taken and stored, right? It turns out there’s a surprisingly simple concept that plays an enormous role in image processing and we were all surprised to learn it!

\n\nWe happened to stumble across this concept through the MinutePhysics video “Computer color is broken” (/r/programming discussion).

\n\n\n\nIn short, typical image files do not represent what the camera sees. If the camera sees $x$ photons at a certain pixel, they store $\\sqrt{x}$ at that pixel. That makes sense with how our eyes work and it only takes up half as much space.

\n\n\n\n\n\nThis process works; when we display the color, our monitors display the square of the value found in the file.

\n\nThis is only a problem when we want to modify the image. Almost every computer program lazily computes the average of the raw values found in the image file (e.g., PNG or JPG). This is widespread – programs like Photoshop include the physically incorrect behavior by default (but you can fix it by changing the gamma correction settings). In picture form, they compute

\n\n\n\n\n\nLet’s see this result. As explained in the video, when we’re mixing red and green we would expect to see a yellowish color, as that’s what happens when a camera is out of focus. We would not expect to see a gray brownish color.

\n\nBut before we see the results, let’s write a script to average the two colors. Before seeing the above video, this is how I would have written this script:

\n\n`from pylab import imread, hstack\n\nx = imread('red.png')\ny = imread('green.png')\n\nmiddle = (x + y) / 2\n\n# corresponds to the image on the left\nfinal_brown = hstack((x, middle, y))\n`

Almost by definition, that computes what I want, right? I’m averaging the two pixels and using a well-known formula for averaging. But no: this average does not correspond to the number of photons hitting the camera lens at the two pixel locations, the number I want.

\n\nTo get the physically natural result, we only need to add two lines of code:
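Those two lines square the stored values before averaging and take the square root afterward. A sketch with synthetic arrays standing in for `red.png` and `green.png`:

```python
import numpy as np

# synthetic stand-ins for imread('red.png') / imread('green.png')
x = np.zeros((4, 4, 3)); x[..., 0] = 1.0   # pure red
y = np.zeros((4, 4, 3)); y[..., 1] = 1.0   # pure green

x_sq, y_sq = x**2, y**2                # added line 1: recover photon counts
middle = np.sqrt((x_sq + y_sq) / 2)    # added line 2: average, then re-apply sqrt

final_yellow = np.hstack((x, middle, y))
```

Each middle pixel is `sqrt(1/2) ≈ 0.71` in both the red and green channels, the brighter yellowish blend the video predicts.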

\n\n\n\n\n\nWhen I saw this, I was shocked. How was this not default, especially in programs like Photoshop? Of course, this behavior can be enabled in Photoshop by playing with the gamma correction settings and using the sRGB color space. I assume the method described above is an approximation for the sRGB color space, with $\\gamma = 2$.

\n\nAt first, I was also surprised that Python/Matlab’s `imread` didn’t enable this behavior by default. After thinking about it more, I came to realize that it’d be almost impossible to implement this behavior and it wouldn’t make sense. `imread` is a very raw function that gives access to the raw values in image files. The pixel values could not be of type `ndarray` but would instead have to be wrapped in some `Image` class (and of course PIL has an ImageMath module). How can NumPy dictate that `x + y` is really `sqrt((x**2 + y**2) / 2)` for images but not ndarrays?

At first, I decided to test an image deblurring algorithm^{1} – this algorithm takes in a blurred image and (ideally) produces a sharp image. I saw much more natural results with the method described above.

Something felt off about this; there could be other stuff getting in the way. We were fairly drastically changing the input to a mysterious blind deconvolution algorithm. What if they computed similar results but one image needed more iterations? What if the parameters happened to be better tuned for one image?

\n\nAdditionally, this also felt wrong because I was only using this new and seemingly invented color space. I knew other color spaces existed as an image doesn’t have to be represented in RGB. It can be represented in many other color spaces that emphasize different things. Some color spaces might emphasize how humans naturally see colors and others might try to make addition of two images work and some might try to represent an image but save disk space.

\n\nSo I decided to test blurring in different color spaces. Going into this, I wasn’t really sure what to expect, but after thinking about it (and seeing the results), I realized this tests what’s important in each color space. What does each pixel and its color represent?

\n\nDo they emphasize additive or subtractive color? Do they emphasize the true pigments and act like mixing paint? Do they go for smooth gradients? Are they constrained by the device and instead optimize for image size/etc?

\n\nTo be clear, here are the following color spaces I tested in and what I gleaned^{2} from their descriptions on wiki pages/etc:

| Color space | Design goal |
|---|---|
| HSV | hue saturation; represents how humans see color (meaning pigments?) |
| XYZ | defined to be a link between wavelengths and how humans perceive color |
| RGB | default for images; red, green and blue color planes. When implemented, the storage space of the image was kept in mind. |
| LUV | used extensively in computer graphics to deal with colored lights |
| LAB | designed to approximate human vision; can be used to make accurate color balance corrections |
| HED | not sure, but I found a paper (and on deconvolution!) |
| RGB^{2} | the method described above. I assume this is a rough sRGB approximation |

\nOkay, let’s try mixing the two colors! We should not expect to see the same result in each color space. They all optimize for different things; they might try to mix the pigments or might be constrained by device physics and produce garbage.

`from pylab import imread\nfrom skimage.filters import gaussian_filter\nfrom skimage.color import rgb2hsv, hsv2rgb\n\nI = imread('two_colors.png') # NxMx3 array\nI_hsv = rgb2hsv(I)\n\n# blur the HSV image using skimage\nI_hsv = gaussian_filter(I_hsv, 15, multichannel=True)\n\n# convert back to RGB to show it (plt.imshow takes RGB values)\nI = hsv2rgb(I_hsv)\n`

After wrapping this function and using ipywidgets, I can produce this\ninteractive widget!

\n\n\n\n\n

\n\n\n

\nPlaying with the input colors, we can see that most color blends look pretty similar. Green/orange, purple/green and yellow/orange all seem to be fairly similar in different color spaces. The colors that best show the different results for different color spaces are red/green and blue/yellow.

We see that certain color spaces are constrained by device limitations (RGB, HED). We see that other color spaces emphasize the pigments (HSV) or other elements like additive/subtractive color (LUV, LAB). We see that certain color spaces play nicely with addition and perform a smooth gradient between the two colors (XYZ, RGB$^2$ aka the method described above).

\n\nWhile writing this blog post I learned about color spaces. I learned they’re not only transformations but also try to pull out different elements. Some color spaces try to represent how humans see color (e.g., HSV) and others optimize for device parameters (e.g., RGB optimizes for image size).

\n\n…but at the end of the day I’ll keep this trick in mind.

\n\n\n

\n",
"tags": [],
"image": ""
},
{
"id": "http://stsievert.com/blog/2015/01/31/the-mysterious-eigenvalue/",
"url": "http://stsievert.com/blog/2015/01/31/the-mysterious-eigenvalue/",
"title": "Applying eigenvalues to the Fibonacci problem",
"date_published": "2015-01-31T10:41:26-06:00",
"date_modified": "2015-01-31T10:41:26-06:00",
"author": {
"name": "Scott Sievert",
"url": "http://stsievert.com"
},
"summary": "- Using blind deconvolution. In essence, this tries to undo some system’s effect on an image. In this case, that system is blurring and this algorithm needs an estimate of the point spread function ↩
- And I’m still not confident on all the details ↩

The Fibonacci problem is a well known mathematical problem that models\npopulation growth and was conceived in the 1200s. Leonardo of Pisa aka\nFibonacci decided to use a recursive equation: $x_{n} = x_{n-1} + x_{n-2}$\nwith the seed values $x_0 = 0$ and $x_1 = 1$. Implementing this recursive\nfunction is straightforward:

\n\n`def fib(n):\n if n==0: return 0\n if n==1: return 1\n else: return fib(n-1) + fib(n-2)\n`


Since the Fibonacci sequence was conceived to model population growth, it would seem that there should be a simple equation that grows almost exponentially. Plus, this recursive calling is expensive both in time and memory.^{1} The cost of this function doesn’t seem worthwhile. To see the surprising formula that we end up with, we need to define our Fibonacci problem in a matrix language.^{2}

$$\n\\begin{bmatrix}\nx_{n} \\\\ x_{n-1}\n\\end{bmatrix} =\n\\bm{x}_{n} =\n\\bm{A}\\cdot \\bm{x}_{n-1} =\n\\begin{bmatrix}\n1 & 1 \\\\ 1 & 0\n\\end{bmatrix} \\cdot\n\\begin{bmatrix}\nx_{n-1} \\\\ x_{n-2}\n\\end{bmatrix}\n$$

\n\nGiving each of those matrices and vectors a name, and recognizing that $\bm{x}_{n-1}$ follows the same formula as $\bm{x}_n$, allows us to write

\n\n$$\n\\begin{aligned}\n\\bm{x}_n &= \\bm{A} \\cdot \\bm{x}_{n-1} \\\\\n&= \\bm{A} \\cdot \\bm{A} \\cdots \\bm{A} \\cdot \\bm{x}_0 \\\\\n&= \\bm{A}^n \\cdot \\bm{x}_0\n\\end{aligned}\n$$

\n\nwhere we have used $\\bm{A}^n$ to mean $n$ matrix multiplications.\nThe corresponding implementation looks something like this:

\n\n`import numpy as np\n\ndef fib(n):\n A = np.asmatrix('1 1; 1 0')\n x_0 = np.asmatrix('1; 0')\n x_n = np.linalg.matrix_power(A, n).dot(x_0)\n return x_n[1]\n`

While this isn’t recursive, there are still $n-1$ unnecessary matrix multiplications. These are expensive time-wise and it seems like there should be a simple formula involving $n$. As populations grow exponentially, we would expect this formula to involve scalars raised to the $n$th power. A simple equation like this could be implemented many times faster than the recursive implementation!

\n\nThe trick to do this rests on the mysterious and intimidating\neigenvalues and eigenvectors. These are just a nice way to view the same data but they have a lot of mystery\nbehind them. Most simply, for a matrix $\\bm{A}$ they obey the equation

\n\n$$\n\\bm{A}\\cdot\\bm{x} = \\lambda \\cdot\\bm{x}\n$$

\n\nfor different eigenvalues $\lambda$ and eigenvectors $\bm{x}$. Through the way matrix multiplication is defined, we can represent all of these cases at once. This rests on the fact that the right-multiplied diagonal matrix $\bm{\Lambda}$ just scales each column $\bm{x}_i$ by $\lambda_i$. The column-wise definition of matrix multiplication makes it clear that this represents every case where the equation above occurs.

\n\n$$\n\\bm{A} \\cdot\n\\begin{bmatrix}\n\\bm{x}_1 & \\bm{x}_2\\\\\n\\end{bmatrix}\n=\n\\begin{bmatrix}\n\\bm{x}_1 & \\bm{x}_2\\\\\n\\end{bmatrix}\n\\cdot\n\\begin{bmatrix}\n\\lambda_{1} & 0\\\\\n0 & \\lambda_2\n\\end{bmatrix}\n$$

\n\nOr compacting the vectors $\\bm{x}_i$ into a matrix called $\\bm{X}$ and\nthe diagonal matrix of $\\lambda_i$’s into $\\bm{\\Lambda}$, we find that

\n\n$$\n\\bm{A}\\cdot\\bm{X} = \\bm{X}\\cdot\\bm{\\Lambda}\n$$

\n\nBecause the Fibonacci matrix is diagonalizable

\n\n$$\n\\bm{A} = \\bm{X}\\cdot\\bm{\\Lambda}\\cdot\\bm{X}^{-1}\n$$

\n\nAnd then, because a matrix and its inverse cancel,

\n\n$$\n\\begin{aligned}\n\\bm{A}^n &= \\bm{X}\\cdot\\bm{\\Lambda}\\cdot\\bm{X}^{-1}\n\\cdot\\ldots\\cdot\n\\bm{X}\\cdot\\bm{\\Lambda}\\cdot\\bm{X}^{-1}\\\\\n&= \\bm{X}\\cdot\\bm{\\Lambda}^n\\cdot\\bm{X}^{-1}\n\\end{aligned}\n$$

\n\n$\\bm{\\Lambda}^n$ is a simple computation because $\\bm{\\Lambda}$ is a\ndiagonal matrix: every element is just raised to the $n$th power. That means\nthe expensive matrix multiplication only happens twice now. This is a powerful\nspeed boost and we can calculate the result by substituting for $\\bm{A}^n$

\n\n$$\n\\bm{x}_n = \\bm{X}\\cdot \\bm{\\Lambda}^n \\cdot \\bm{X}^{-1}\\cdot\\bm{x}_0\n$$
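We can check this identity numerically (a small NumPy sketch of my own):

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 0.0]])
lam, X = np.linalg.eig(A)                    # A = X Λ X^{-1}

n = 10
An = X @ np.diag(lam**n) @ np.linalg.inv(X)  # Λ^n is just element-wise powers
print(np.allclose(An, np.linalg.matrix_power(A, n)))
```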

\n\nFor this Fibonacci matrix, we find that\n$\\bm{\\Lambda} = \\textrm{diag}\\left(\\frac{1+\\sqrt{5}}{2}, \\frac{1-\\sqrt{5}}{2}\\right)= \\textrm{diag}\\left(\\lambda_1, \\lambda_2\\right)$.\nWe could define our Fibonacci function to carry out this matrix multiplication,\nbut these matrices are simple: $\\bm{\\Lambda}$ is diagonal and\n$\\bm{x}_0 = \\left[1; 0\\right]$.\nSo, carrying out this fairly simple computation gives

\n\n$$\nx_n = \\frac{1}{\\sqrt{5}}\\left(\\lambda_{_1}^n - \\lambda_{_2}^n\\right) \\approx\n\\frac{1}{\\sqrt{5}} \\cdot 1.618034^n\n$$

\n\nWe would not expect this equation to give an integer. It involves the power of\ntwo irrational numbers, a division by another irrational number and even the\ngolden ratio phi $\\phi \\approx 1.618$! However, it gives exactly the Fibonacci\nnumbers – you can check yourself!
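Here’s a quick check against the naive recursive definition (a script of my own):

```python
from math import sqrt

def fib_recursive(n):
    return n if n < 2 else fib_recursive(n - 1) + fib_recursive(n - 2)

phi = (1 + sqrt(5)) / 2   # lambda_1, the golden ratio
psi = (1 - sqrt(5)) / 2   # lambda_2

# the closed form lands on an integer every time
for n in range(20):
    closed = (phi**n - psi**n) / sqrt(5)
    assert round(closed) == fib_recursive(n)
```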

\n\nThis means we can define our function rather simply:

\n\n`from math import sqrt\n\ndef fib(n):\n lambda1 = (1 + sqrt(5))/2\n lambda2 = (1 - sqrt(5))/2\n return (lambda1**n - lambda2**n) / sqrt(5)\n\ndef fib_approx(n):\n # for practical range, percent error < 10^-6\n return 1.618034**n / sqrt(5)\n`

As one would expect, this implementation is *fast*. We see speedups of roughly $1000$ for $n=25$: milliseconds vs microseconds. This is typical when mathematics is applied to a seemingly straightforward problem. There are often large benefits to making the implementation slightly more cryptic!

I’ve found that mathematics^{3} becomes fascinating, especially in higher level college courses, and can often yield surprising results. I mean, look at this blog post. We went from an expensive recursive equation to a simple and fast equation that only involves scalars. This derivation is one I enjoy and I especially enjoy the simplicity of the final result. This is part of the reason why I’m going to grad school for highly mathematical signal processing. Real world benefits $+$ neat theory $=$ <3.

\n \n

\n",
"tags": [],
"image": ""
},
{
"id": "http://stsievert.com/blog/2014/07/31/common-mathematical-misconceptions/",
"url": "http://stsievert.com/blog/2014/07/31/common-mathematical-misconceptions/",
"title": "Common mathematical misconceptions",
"date_published": "2014-07-31T13:12:26-05:00",
"date_modified": "2014-07-31T13:12:26-05:00",
"author": {
"name": "Scott Sievert",
"url": "http://stsievert.com"
},
"summary": "When I heard course names for higher mathematical classes during high school\nand even part of college, it seemed as if they were teaching something simple\nthat I learned back in middle school. I knew that couldn’t be the case, and\nthree years of college have taught me otherwise.

\n\n", "content_html": "When I heard course names for higher mathematical classes during high school\nand even part of college, it seemed as if they were teaching something simple\nthat I learned back in middle school. I knew that couldn’t be the case, and\nthree years of college have taught me otherwise.

\n\n\n\nGeneralizing to $N$ dimensions is often seen as a pointless mathematical\nexercise because of a language confusion. Space exists in three dimensions; we\njust need three elements or a three dimensional vector to describe any point.\nBecause of this, we say we live in three dimensions. So isn’t generalizing to\n$N$ dimensions pointless?

\n\nLet’s go back to vectors. An $N$ dimensional vector is just one with $N$\ncomponents.^{1} Thinking about a grid or 2D graph, we need two numbers to\ndescribe any point on that grid, $x$ and $y$. That’s a two dimensional vector.\nFor the four dimensions we live in, we need four: $x, y, z, t$.

But wait. Why can’t we define a five dimensional vector that includes my age or\na six dimensional vector that also includes my year in school? Thinking about\nthis in the real world, we have data that has $N$ components all the time. Images\non my computer have $N$ pixels and my GPS records $N$ locations. Almost anything\n*discrete* is an $N$ dimensional vector.

This pops up all the time in image processing, machine learning and almost\nanywhere that uses linear algebra *or a computer.* All the vectors and matrices\nI’ve seen have $N$ components, a *discrete* number. Computers only have a\nfinite number of bits^{2} meaning computers must also be discrete. That\nmeans if we want to do anything fancy with a computer, we need to use linear\nalgebra.

At first glance, linear algebra seems like middle school material. In\nmiddle school you might have seen $y=m\\cdot x+b$ (more on this linear function\nlater). In linear algebra, you might see $\\bm{y}=\\bm{A}\\cdot \\bm{x}+\\bm{b}$ where $\\bm{y}$, $\\bm{x}$, and $\\bm{b}$ are\nvectors and $\\bm{A}$ is a matrix.^{3}

A matrix is nothing more than a collection of vectors, a 2D grid of\nnumbers. For example, if you have $M$ students with $N$ features (age, weight,\nGPA, etc), you can collect them by simply stacking the vectors and make an\n$M\\times N$ matrix.

\n\nWe’ve defined both vectors and matrices, but what about basic operations? How\ndo we define addition and multiplication? Addition works element-wise, but\nmultiplication is a bit more interesting.

\n\nMatrix multiplication is defined rather simply, but that doesn’t give you any\nintuition. You can think of matrix multiplication as linear combinations of the\nrows and columns, but that still doesn’t tell what matrix multiplication\n*means.*

Intuitively, we know that a matrix can transform any vector into any other\nvector because we can choose its entries arbitrarily. But wait. In middle school we\nlearned that a function could transform any number into any other number.\nMatrices are then analogous to functions!

\n\nFor example, let’s say we have a set of inputs $\\bm{x} = [1,~2,~3,~4]^T$ and we want to\nperform the function $y = f(x) = 2\\cdot x + 1$ on each element in our vector. In\nmatrix notation, we can just do

\n\n$$\n\\bm{y} =\n\\begin{bmatrix} 2&0&0&0 \\\\ 0&2&0&0 \\\\ 0&0&2&0 \\\\ 0&0&0&2 \\end{bmatrix}\n\\cdot\n\\begin{bmatrix} 1 \\\\ 2 \\\\ 3 \\\\ 4 \\end{bmatrix}\n+\n\\begin{bmatrix} 1 \\\\ 1 \\\\ 1 \\\\ 1 \\end{bmatrix}\n$$
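This matrix-as-function view is easy to check numerically. A minimal NumPy sketch of the example above (variable names are mine, not from the post):

```python
import numpy as np

A = 2 * np.eye(4)                   # the diagonal matrix above: scales each entry by 2
b = np.ones(4)                      # the vector of ones: adds 1 to each entry
x = np.array([1.0, 2.0, 3.0, 4.0])

y = A @ x + b                       # applies f(x) = 2*x + 1 elementwise
print(y)                            # [3. 5. 7. 9.]
```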

\n\nWith these functions, we can perform even more powerful actions. We can easily\nswap elements or perform different actions on different elements. It gets even\nmore powerful but still relatively simple to an experienced mathematician. We\ncan find when a matrix or function simply scales a vector by finding the\neigenvalues and eigenvectors or use the inverse to find a discrete\nanalog to $f^{-1}(x)$.

\n\nA linear function seems like the most boring function. It’s just a straight\nline, right? It is, but English got confused: “linear” also describes any\nfunction that obeys the superposition principle. Concisely in math, it’s\nwhen $f(\alpha x+\beta y)=\alpha\cdot f(x)+\beta\cdot f(y)$.

\n\nThis means that *a general linear function is not linear.* If $f(x) = mx+b$,\n$f(x+y) = f(x)+f(y)-b$ which doesn’t satisfy the definition of a linear\nfunction. But there are much more interesting functions out there. Integration\nand differentiation respect scalar multiplication and addition, meaning\nthey’re linear.

Since a host of other functions depend on integration and differentiation,\nmany of these functions (but not all!) are also linear.^{4} The\ndiscrete Fourier transform (computed by `fft`

), wavelet transform and a host of other\nfunctions are linear.
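The linearity of the DFT is easy to verify numerically; a quick check with NumPy’s `fft` (a sketch of mine, not from the original post):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(64)
y = rng.standard_normal(64)
a, b = 2.0, -3.0

# superposition: F(a*x + b*y) == a*F(x) + b*F(y)
lhs = np.fft.fft(a * x + b * y)
rhs = a * np.fft.fft(x) + b * np.fft.fft(y)
print(np.allclose(lhs, rhs))  # True
```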

Addition and scalar multiplication are defined element-wise for matrices, so\nany linear function can be represented by a matrix. There are matrices for\ncomputing the `fft`

and wavelet transform. While unused as they require\n$O(n^2)$ operations, they exist. Exploiting specific properties, mathematicians\nare often able to push the number of operations down to $O(n\\log(n))$.

Linear functions are perceived as boring. If you know $\bm{y}$ in $\bm{y}=\bm{Ax}$ for\nsome known^{5} matrix $\bm{A}$, you can easily find $\bm{x}$ by left-multiplying\nwith $\bm{A}^{-1}$. This might be expensive time-wise, but an exact solution is\ncomputable.

Nonlinear functions have a unique property that an exact and closed form\nsolution often can’t be found. This means that no combination of\nelementary function like $\\sin(\\cdot), \\exp(\\cdot), \\sqrt{\\cdot},\\int \\cdot~dx,\n\\frac{d~\\cdot}{dx}$ along with respective operators and the infinitely many\nreal numbers can describe *every* solution.^{6} Instead, those elementary\nfunctions can only describe the equation that needs to be solved.

Even a “simple” problem such as\ndescribing the motion of a pendulum\nis nonlinear and an exact solution can’t be found, even for the most simple\ncase. Getting even more complex,\nthere’s a whole list of\nnonlinear partial differential equations\nthat solve important problems.^{7}

This is why simulation is such a big deal. No closed form solution can be found\nmeaning that you have to find a solution numerically. Supercomputers don’t\nexist for the mild speed boosts or storage requirements that machine\nlearning or “big data”^{8} seems to require. No, supercomputers exist\nbecause with a nonlinear problem, a simulation *has* to be run to get *any*\nresult. Running one of these simulations on my personal computer would take\nyears if not decades.

When you hand this type of problem to an experienced mathematician, they won’t\nlook for a solution. Instead, they’ll look for bounds and guarantees on a\nsolution they work out. If they just came up with a solution, they wouldn’t\neven know how good it is! So, they’ll try to bound the error and look for ways\nto lower that error.

\n\nWhen I started college, I was under the impression that my later career would\ninvolve long and messy exact solutions. In my second and third years, I came to\nrealize that those messy solutions didn’t exist and instead we used extremely\nelegant math to find exact solutions. In the last couple months^{9}, I have come to\nrealize that while simple math might *describe* the problem, there’s no closed\nform solution and the only way to get a solution to “interesting” problems is\nwith a computer.

\n

\n",
"tags": [],
"image": ""
},
{
"id": "http://stsievert.com/blog/2014/07/30/simple-python-parallelism/",
"url": "http://stsievert.com/blog/2014/07/30/simple-python-parallelism/",
"title": "Simple Python Parallelism",
"date_published": "2014-07-30T10:00:40-05:00",
"date_modified": "2014-07-30T10:00:40-05:00",
"author": {
"name": "Scott Sievert",
"url": "http://stsievert.com"
},
"summary": "- \n
- \n
And hence an ndarray or “N dimensional array” in NumPy ↩

\n \n - \n
When computers can have an uncountably infinite number of bits they can be continuous and I’ll eat my hat ↩

\n \n - \n
For the rest of this post, vectors will be described with bold mathematical notation, matrices like vectors but capitalized, and scalars with regular math font. ↩

\n \n - \n
That is if they only depend on the simplest forms of integration and differentiation ↩

\n \n - \n
And matrix completion handles when $A$ isn’t known. This is analogous to finding an unknown scalar function ↩

\n \n - \n
It should be noted that

\n*occasionally* a closed form solution can be found *assuming* certain conditions apply ↩ \n - \n
And these problems are wayyy over my head ↩

\n \n - \n
Yeah it’s a buzz word. For example, satellites that collect 2 billion measurements a day. That’s a lot of data, but analyzing it and using it is not “big data.” ↩

\n \n - \n
Or even as I write this blog post. I’ve known this for a while but this is a new light ↩

\n \n

In the scientific community, executing functions in parallel can mean hours or\neven days less execution time. There’s a whole array of\nPython parallelization toolkits, probably partially due to the\ncompeting standards issue.

\n\n*Update: I’ve found joblib, a library that does the same thing as\nthis post. Another blog post compares with Matlab and R.*

In the scientific community, executing functions in parallel can mean hours or\neven days less execution time. There’s a whole array of\nPython parallelization toolkits, probably partially due to the\ncompeting standards issue.

\n\n`joblib`

, a library that does the same thing as\nthis post. Another blog post compares with Matlab and R.

Python has the infamous global interpreter lock (GIL) which greatly restricts\nparallelism. We could try to step around it ourselves, but a lesson I’ve\nlearned is not to solve a problem others tackled, especially when they do it\n*right.* There are a host of other packages available, all done *right,* each\ntackling a problem similar to this one.

Perhaps the biggest barrier to parallelization is that it can be very\ncomplicated to include, at least for the niche of the scientific community I’m\nin. I spent a while looking at the IPython parallelization framework and\nvarious other packages, but they all gave me mysterious bugs so I decided to\nuse the multiprocessing library for this simple task. Without\nthis Medium article, I would be lost in threads and classes that add\nsignificant speed boosts but seem to be default in basic parallelization\nlibraries.

\n\nI want to emphasize that there are *extremely* large speed gains through these\nother toolboxes and it’s probably worth it to learn those toolboxes if you need\nthose speed gains. The GPU has thousands of cores, while your CPU only has a\nfew. Of course, an easy and simple solution uses the least complex methods,\nmeaning this simple parallelization uses the CPU. There are massive gains for\nusing more complex solutions.

I don’t want to add another package to Python’s list of parallel toolboxes\n(again, competing standards), but let’s define\nsome functions^{1} so we can have a parallel function *easily*.\n`some_function.parallel([1,2,3])`

will execute `some_function`

in parallel with\nthe inputs `[1,2,3]`

.

The function `easy_parallize`

just uses `pool.map`

to execute a bunch of\nstatements in parallel. Using `functools`

, I return a function that only needs\n`sequence`

as an input. It opens and closes a pool each time; certainly not\noptimal but easy. This method of parallelization seems prime for the scientific\ncommunity. It’s short and sweet and doesn’t require extensive modifications to\nthe code.

Let’s test a small example out and see what each part does. If you want to know\nmy thought process through the whole thing, see\nmy Binpress tutorial.\nWe’ll test a complicated and time-consuming function, a\nnot-so-smart way to test if a number is prime. We could use much smarter\nmethods such as using NumPy or even the Sieve of Eratosthenes, but\nlet’s pretend that this is some much longer and more computation-intensive\ncomputation.

\n\n\n\nTesting this parallelization out, we find that the results are *exactly* the\nsame, even with incredibly precise floats. This means that the *exact* same\ncomputation is taking place, just on different cores.

Given the simplicity of including this parallelization, the most important\nquestion to ask is about the speedup of this code. Like all complex subjects,\nthe answer is “it depends.” It will primarily^{2} depend on the number of cores\nyour machine has. In the worst case of one core, your parallelized code will\nstill run, just without the speed benefits.

On “basic” machines, such as this 4-core Macbook Air, I see parallelized results\nthat run about twice as fast. On an 8-core iMac, it’s about 4 times as fast.\nIt seems like my operating system is only dedicating half the cores to this\nprocess, but I don’t want to dig into that magic^{3}.

If you’re running code that takes a long time to run and can be run in\nparallel, you probably should have access to a large supercomputing institute\nlike I should (but don’t. Classic academia). Running code on machines\nthat have 20 cores, we would expect to get results in 5% of the serial time.

\n\nCalling `test_primes.parallel`

instead of `test_primes`

gives us pretty large\nspeed gains for such a simple method. Even more convenient, editing the code to\ndo this was *easy.* Another selling point is that it’s easy to use threads\ninstead of processes. The `multiprocessing.dummy`

module is an exact clone of\n`multiprocessing`

that uses threads instead of processes. Hence if you find\ngains with threads (I didn’t), use `from multiprocessing.dummy import Pool`

.

Too often, a promising example is shown that fails when applied in the real\nworld. To try and combat this, I decided to use a more complicated example,\ncalculating a Mandelbrot set. This can be natively parallelized using\nNumPy, but I decided to simulate a more complex function. Once again, the\nfull code including serial and parallel declarations is\navailable on Github.

\n\nCalculating this Mandelbrot set relies on looping over a two-dimensional grid\n(that is, without using NumPy). To parallelize this function, I essentially\nonly looped over one dimension while parallelizing the other dimension.
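The exact code lives in the linked Github repository; a rough sketch of the row-at-a-time structure (grid bounds, sizes, and names are my own choices):

```python
import numpy as np

def mandel_row(y, n_x=300, maxiter=50):
    # one row of the escape-time image; y is the imaginary coordinate
    row = np.zeros(n_x, dtype=int)
    for i, x in enumerate(np.linspace(-2, 1, n_x)):
        z, c = 0j, complex(x, y)
        for k in range(maxiter):
            z = z * z + c
            if abs(z) > 2:
                break
        row[i] = k
    return row

# the serial version loops over y; handing the same calls to pool.map
# parallelizes the y dimension while each call loops over x internally
image = np.array([mandel_row(y) for y in np.linspace(-1.5, 1.5, 100)])
```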

\n\n\n\nThis edit is relatively simple. While I couldn’t just have a drop in\nreplacement with `mandel.parallel`

, some small edits to take out looping over\nthe `x`

dimension made my function faster. Essentially, I brought the for-loop\nover `x`

outside the function. I see similar results with this code: about 2x\nspeedup on my 4-core and about a 4x speedup on the 8-core iMac.

At the end of the day, we can expect to easily parallelize our code providing\nwe use the right tools. This just involves making *small* edits for fairly\nlarge speed gains (even on this machine!) and huge speed gains on\nsupercomputers.

If you want to see this in more detail, look at the source which is\nwell commented and works. As I mentioned earlier, I have written a\nBinpress tutorial\non how I discovered this function. To be complete, here are\nthe full results:

\n\n\n\n\n

\n",
"tags": [],
"image": ""
},
{
"id": "http://stsievert.com/blog/2014/05/27/fourier-transforms-and-optical-lenses/",
"url": "http://stsievert.com/blog/2014/05/27/fourier-transforms-and-optical-lenses/",
"title": "Fourier transforms and optical lenses",
"date_published": "2014-05-27T08:50:25-05:00",
"date_modified": "2014-05-27T08:50:25-05:00",
"author": {
"name": "Scott Sievert",
"url": "http://stsievert.com"
},
"summary": "- \n
- \n
The full source, including function declarations is available on Github ↩

\n \n - \n
If your input range is small like

\n`range(1, 100)`

, the overhead of creating the parallel function will wipe out any speed benefits. Luckily, if your range is that small the serial version runs quickly anyway, so this shouldn’t matter. ↩ \n - \n
And /u/longjohnboy did. Apparently Intel uses hyperthreading to redefine cores. ↩

\n \n

The Fourier transform and its closely related\ncousin the discrete time Fourier transform (computed by the FFT) is a powerful\nmathematical concept. It breaks an input signal down into its frequency\ncomponents. The best example is lifted from Wikipedia.

\n\n\n\n", "content_html": "The Fourier transform and it’s closely related\ncousin the discrete time Fourier transform (computed by the FFT) is a powerful\nmathematical concept. It breaks an input signal down into it’s frequency\ncomponents. The best example is lifted from Wikipedia.

\n\n\n\n\n\nThe Fourier transform is used in almost every type of analysis. I’ve seen the\ndiscrete Fourier Transform used to detect vehicles smuggling contraband\nacross borders and to separate harmonic overtones from a cello. It can\ntransform convolution into a simple (and fast!) multiplication and\nmultiply incredibly long polynomials. These might seem pointless, but they’re\nuseful with any “nice” system and complicated stability problems,\nrespectively. The Fourier transform is perhaps the most useful abstract\nmathematical concept.
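The convolution-to-multiplication property is easy to demonstrate; a small NumPy sketch (zero-padding so circular convolution matches linear convolution):

```python
import numpy as np

a = [1.0, 2.0, 3.0]
h = [1.0, 1.0]
n = len(a) + len(h) - 1              # pad so circular == linear convolution

# multiply the spectra, then transform back
y_fft = np.real(np.fft.ifft(np.fft.fft(a, n) * np.fft.fft(h, n)))

y_direct = np.convolve(a, h)         # direct convolution for comparison
print(np.allclose(y_fft, y_direct))  # True
```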

\n\nWe can see that it’s very abstract and mathematical just by looking at its\ndefinition:

\n\n$$\nF(f_x) = \\mathbb{F}\\left[{f(x)}\\right] = \\int f(x) \\exp\\left[ -j \\cdot 2\\pi \\cdot f_x \\cdot x \\right] dx\n$$

\n\nThe Fourier transform is so useful, the discrete version is implemented in\nprobably every programming language through `fft`

and there are even dedicated\nchips to perform this efficiently. The last thing we would expect is for this\nabstract and mathematical concept to be implemented by physical devices.

We could easily have some integrated chip call `fft`

, but that’s not\ninteresting. If a physical device with some completely unrelated purpose could\nstill perform a Fourier transform without human intervention (read:\nprogramming), that’d be interesting. For example, an optical lens is shaped\nsolely to produce an image, not to take a Fourier transform… but that’s\nexactly what it does.

Let me repeat: a *lens can take an exact spatial Fourier transform.* This does\nhave some limitations^{1}, mainly that it only works under coherent light. A\ncoherent light source is simply defined as a source that’s not random. Natural\nlight is random as there are many different wavelengths coming in at random\ntimes. Laser light is very coherent – there’s a very precise wavelength and\nevery individual ray is in phase^{2}.

Goodman, the textbook that almost every Fourier optics course uses, says^{3}\nthat the field for a lens of focal length $f$ is

$$\nU_f(u,v) =\n\\frac\n { A \\exp\\left[ j \\frac{k}{2f} (1 - \\frac{d}{f}) (u^2 + v^2) \\right] }\n { j \\lambda f}\n\\cdot\n\\int \\int U_o(x,y) \\exp\\left[ -j \\frac{2\\pi}{\\lambda f} (xu + yv) \\right] dxdy\n$$

\n\nWhen $d=f$, *that’s exactly the definition of a Fourier transform.* Meaning we\ncan expect $U_f(u,v) \propto \mathbb{F}\left[{U_o(x,y)}\right]\big|_{f_x = u/\lambda f} $. Minus\nsome physical scaling, that’s an *exact* Fourier transform.

You may know that a purely real-valued input to a Fourier transform gives a\ncomplex output. We have no way of detecting this complexity or phase. Instead,\nour eyes (and cameras/etc) detect *intensity* or the *magnitude* of\n$U_f(x,y)$. That is, they detect $I_f(x,y) = \left|U_f(x,y)\right| ^2$. This function is\npurely real, although there are some fancy ways^{4} to detect phase as well.

No matter how elegant this math was, I wanted to see it in the real world and\ncompare the computer FFT with the lens Fourier transform. The setup for this\nexperiment was rather involved, and I would like to give a resounding thanks to\nMint Kunkel. Without his help, I *never* could have gotten an image, much less\na decent one.

I was taking an image of a grid, illuminated by an infinite^{5} plane wave\ngenerated by a laser. The computer generated image and discrete Fourier\nTransform are shown below.

We can’t expect the lens Fourier transform to look exactly like this. The first\nissue that jumps to mind is discrete vs continuous time, but that should\nhopefully play a small role. The equipment required to get this image is highly\nspecialized and costs more than I want to know. But even more significantly,\nthe tuning of this equipment is critical and almost impossible to get right.\nPlus, details like grid spacing/size/etc are missing, which is why the two\nimages aren’t exactly identical.

\n\nRegardless, the Fourier transform by lens shows remarkable similarity to the\nFourier transform done on the computer.

\n\n\n\nThis is a real world example of how a lens, a simple object used for\nphotography, performs perhaps the most powerful concept in signal processing.\nThis alone is cool, but it shows itself elsewhere. The transfer function is\njust the pupil function or $H\left(f_x, f_y\right) = P(c f_x, c f_y)$ ($c$ is a\nconstant, only works under coherent light, but has similar effect in incoherent\nlight). If you want to resolve a higher frequency (aka more detail), you need\nyour pupil function to extend further.

\n\n\n\nAnimals have different shaped *pupils* or different *pupil functions* for their\neye. A cat has a very vertical pupil, a zebra’s pupil is horizontal and an\neagle’s pupil is round^{6}. There are different evolutionary reasons why an animal\nneeds to see more detail in the vertical or horizontal directions (ie, jumping\nvs hunting) and this shows itself with their pupils. Animals see more detail in\nthe horizontal or vertical directions if that’s what they care about!

\n

\n",
"tags": [],
"image": ""
},
{
"id": "http://stsievert.com/blog/2014/05/18/speckle-and-lasers/",
"url": "http://stsievert.com/blog/2014/05/18/speckle-and-lasers/",
"title": "Speckle and lasers",
"date_published": "2014-05-18T04:26:40-05:00",
"date_modified": "2014-05-18T04:26:40-05:00",
"author": {
"name": "Scott Sievert",
"url": "http://stsievert.com"
},
"summary": "- \n
- \n
Another limitation: the lens can only accept frequencies as large as $r/\\lambda f$, meaning the input signal must be band-limited. ↩

\n \n - \n
For a detailed explanation of why the laser spots are speckled and random, see my previous blog post. ↩

\n \n - \n
The derivation culminates on page 105. This assumes the paraxial approximation is valid. ↩

\n \n - \n
Horner, J. L., & Gianino, P. D. (1984). Phase-only matched filtering. Applied optics, 23(6), 812-816. ↩

\n \n - \n
Three centimeters is about infinity according to Mint. ↩

\n \n - \n
Images are taken from Google Images. If you are the owner of one of these images and would like it removed, let me know and I’ll remove it. ↩

\n \n

We know that lasers are very accurate instruments and emit a very precise\nwavelength and hence appear in an array of precision applications including\nbloodless surgery, eye surgery and fingerprint\ndetection. So why do we see random light/dark spots when we shine a\nlaser on anything? Shouldn’t it all be the same color since lasers are\ndeterministic (read: not random)? To answer that question, we need to delve\ninto optical theory.

\n\n", "content_html": "We know that lasers are very accurate instruments and emit a very precise\nwavelength and hence appear in an array of precision applications including\nbloodless surgery, eye surgery and fingerprint\ndetection. So why do we see random light/dark spots when we shine a\nlaser on anything? Shouldn’t it all be the same color since lasers are\ndeterministic (read: not random)? To answer that question, we need to delve\ninto optical theory.

\n\n\n\n*Coherent* optical systems are simply defined to be\n*deterministic* systems. That’s a big definition, so let’s break it into\npieces. Coherent systems are where you know the wavelength and phase of every\nray. Lasers are very coherent (one wavelength, same phase) while sunlight is\nnot coherent (many wavelengths, different phases).

Deterministic is just a way of saying everything about the system is known and\nthere’s no randomness. Sunlight is not deterministic because there are many\nrandom processes. Photons are randomly generated and there are many\nwavelengths. Sunspots are one example of the randomness present in sunlight.

\n\nBut if lasers are coherent and deterministic, why do we see speckle (read:\nbright and dark spots) when we see a laser spot? The speckle is random; we\ncan’t predict where every dark spot will be. The randomness of this laser spot\nand the fact that lasers are deterministic throws a helluva question at us. It\nturns out *what* we see the laser on is important, but let’s look at the math\nand physics behind it.

Coherent optical systems have a very special property. Their\nimpulse response (read: reaction to a standardized input)\nin the frequency domain is just the pupil function. For those familiar with\nthe parlance and having $f_x$ be a spatial frequency as opposed to a time\nfrequency,

\n\n$$\nH\\left( f_x, f_y\\right) = P(-\\lambda z x, -\\lambda z y)\n$$

\n\nWhen I saw this derived^{1}, I thought “holy (expletive).” If you just want to only pass high\nfrequency spatial content (read: edges), then all that’s required is to not let\nlight through the center of the lens.

Since this system is linear, we can think of our output as bunch of impulse\nresponses shifted in space and scaled by the corresponding amount. This is the\ndefinition of convolution and only works because this is a\nlinear and space invariant system.

\n\nTo find our impulse response in the space domain, $h\\left( x, y\\right) $, we\nhave to take the Fourier transform (aka FFT) of our pupil. Since our pupil\nfunction is symmetric, the inverse Fourier transform and forward Fourier\ntransform are equivalent.

\n\n\n\n\nThrough the angular plane wave spectrum, this impulse response can be viewed as\na series of plane waves coming in at different angles, shown in the figure below.

\n\n\n\nWhat angles can a wave be thought of as? The frequency content and angles turn\nout to be related, since two plane waves of a constant frequency adding\ntogether can have a change in frequency depending on what angle they’re at,\nwhich makes intuitive sense. Or, our spatial plane wave $U(x,y)$ can be\nrepresented by the Fourier transform:

\n\n$$\n\\textrm{APWS}(\\theta_x, \\theta_y) = \\mathcal{F}\\left\\{ U(x,y) \\right\\}\\rvert_{f_x = \\theta_x/\\lambda}\n$$

\n\nThe wall which the laser is shining on is not smooth and perfectly flat. It’s\nrough, and the distance adds a phase difference between two waves^{2}. Through the\nDrunkard’s Walk and the angular plane wave spectrum, if we could\nobtain every angle, the laser spot wouldn’t have any speckle. Our eyes are finite in\nsize, so we can’t obtain every angle or frequency.

Because the wall gives the wave some random phase, we can represent the spot we\nsee by a 2D convolution with a random phase and the impulse response. This\nconvolution is just saying that every spot gives the same response multiplied\nby some random phase, added together for every point.
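The simulation that produced the post’s figure isn’t preserved here; a minimal NumPy sketch of the same idea, with `y` and `delta` named to match the discussion (the grid size and pupil radius are my guesses):

```python
import numpy as np

rng = np.random.default_rng(0)
n, delta = 256, 0.05                 # grid size and pupil radius (frequency units)

# rough wall: unit-amplitude field with uniformly random phase at every point
field = np.exp(1j * rng.uniform(0, 2 * np.pi, (n, n)))

# finite circular pupil: the eye only collects a limited range of angles/frequencies
fx = np.fft.fftfreq(n)
pupil = (fx[None, :] ** 2 + fx[:, None] ** 2) < delta ** 2

# filter the random phase through the pupil; the intensity we see is speckled
y = np.abs(np.fft.ifft2(np.fft.fft2(field) * pupil)) ** 2
```

Shrinking `delta` (a smaller pupil, like holding a pinhole up to your eye) makes the speckles larger.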

\n\n\n\nThe laser spot `y`

shows some speckle! The speckle varies with how large we\nmake `d`

(really `delta`

but that’s long); if we include more frequencies and\nmore of the impulse response, the dots get smaller. To see this, if you hold a\npinhole up to your eye, the speckles will appear larger.

An intuitive way to think about this involves the impulse response. The impulse\nresponse changes with the distance and so does the phase. Certain areas add\nup to 0 while others add up to 1. There’s a whole probability density function\nthat goes with that, but that goes further into optical and statistical\ntheory.

\n\n**tl;dr:** the roughness of the walls add uncertainty in phase and hence speckle

\n

\n",
"tags": [],
"image": ""
},
{
"id": "http://stsievert.com/blog/2014/05/15/Scientific-Python-tips-and-tricks/",
"url": "http://stsievert.com/blog/2014/05/15/Scientific-Python-tips-and-tricks/",
"title": "Scientific Python tips and tricks",
"date_published": "2014-05-15T15:20:24-05:00",
"date_modified": "2014-05-15T15:20:24-05:00",
"author": {
"name": "Scott Sievert",
"url": "http://stsievert.com"
},
"summary": "- \n
- \n
This does rely on assumptions by Goodman. Refer to Eqn. 6-20 for more detail (and thanks Mint!) ↩

\n \n - \n
reddit commenter delmar15 pointed out that there’s also phase due to the glass it’s shining through (and many other effects). Statistical optics covers that in much more detail. ↩

\n \n

You want to pick up Python. But it’s hard and confusing to pick up a whole new\nframework. You want to try and switch, but it’s too much effort and takes too\nmuch time, so you stick with MATLAB. I essentially grew up on Python, meaning I\ncan guide you to some solid resources and hand over tips and tricks I’ve\nlearned.

\n\n", "content_html": "You want to pick up Python. But it’s hard and confusing to pick up a whole new\nframework. You want to try and switch, but it’s too much effort and takes too\nmuch time, so you stick with MATLAB. I essentially grew up on Python, meaning I\ncan guide you to some solid resources and hand over tips and tricks I’ve\nlearned.

\n\n\n\nThis guide aims to ease that process a bit by showing tips and tricks within\nPython. This guide *is not* a full switch-to-Python guide. There are plenty of\nresources for that, including some wonderful SciPy lectures,\ndetailed guides to the same material, and\nPython for MATLAB users. Those links are all useful,\nand **those links should be looked at.**

For an intro to Python, including types, the scope, functions and optional\nkeywords and syntax (string addition, etc), look at the Python docs.

\n\nBut, that said, I’ll share the most valuable tips and tricks I learned from\nlooking at the resources above. These do not serve as a complete replacement\nfor those resources! I want to emphasize that.

\n\nI recommend you install Anaconda. Essentially, all this amounts to\nis running `bash <downloaded file>`

, but complete instructions can be found on\nAnaconda’s website.

This would be easiest if you’re familiar with the command line. The basics\ninvolve using `cd`

to navigate directories, `bash <command>`

to run files and\n`man <command>`

to find help, but more of the basics can be found\nwith this tutorial.

The land of Python has many interpreters, aligning with the Unix philosophy.\nBut at first, it can seem confusing: you’re presented with the default python\nshell, bpython, IPython’s shell, notebook and QtConsole.

\n\nI most recommend IPython; they seem to be more connected with scientific\ncomputing. But which one of IPython’s shells should you use? They all have\ntheir pros and cons, but the QtConsole wins for plain interpreters.\nSpyder is an alternative (an IDE, meaning I haven’t used it much)\nout there that tries to present a MATLAB-like GUI. I do know\nit’s possible to have IPython’s QtConsole in Spyder.

\n\n*EDIT: Apparently Spyder includes IPython’s QtConsole by default.*

This is what I most highly recommend. It allows you to see plots inline. Let me\nrepeat that: *you can plot inline*. To see what I mean, here’s an example:

I’ve only found one area where it’s lacking. The issue is so small, I won’t\nmention it.

\n\nGreat for *sharing* results. Provides an interface friendly to reading code,\nLaTeX, markdown and images side-by-side. However, it’s not so great to\ndevelop code in.

Normally in Python, you have to run `execfile(filename)`

to run a script. If you\nuse IPython, you have access to `%run`

. I’ve found it most useful for\ninspecting global variables after the script has run. IPython even has other\nuseful tools including `%debug`

(debug *after* an error occurred, acting like it\n*just* occurred),\n`!<shell-command>`

and `function?`

/`function??`

for help on a function. The\ndocs on magics are handy.

I typically have MacVim and IPython’s QtConsole (using\na special applescript\nto open; saves opening up iTerm.app) visible and open\nwith an external monitor to look at documentation. A typical script look like

I can then run this script in IPython's QtConsole with `%run script.py` (using a handy Keyboard Maestro shortcut to switch windows and enter `%run`), and then query the program: I can type `z` in the QtConsole to see what `z` is, or even `plot(z[0,:])`. This is a simple script, but this functionality is priceless in larger and more complicated scripts.

Pylab's goal is to bring a MATLAB-like interface to Python, and it largely succeeds. With pylab, Python can *almost* serve as a drop-in replacement for MATLAB. You call `x = ones(N)` in MATLAB; you can do the same with pylab.

One area where it isn't a drop-in replacement is division. In Python 2, `1/2 == 0` through integer division, while in MATLAB (and the way it should be), `1/2 == 0.5`. In Python, if `int/int --> int` is wanted, you can use `1//2` instead.

To present a nearly drop-in MATLAB interface, use the following code
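A short sketch (the original snippet isn't reproduced here):

```python
from pylab import *  # pulls NumPy and matplotlib names into the global namespace

N = 5
x = ones(N)   # the same call as MATLAB's ones(N)
y = x * 2.0   # element-wise, like MATLAB's .*
```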

This `from pylab import *` is frowned upon. The Zen of Python says

> Namespaces are one honking great idea -- let's do more of those!

meaning that `from package import *` shouldn't be used *with any package.* It's best to use `import pylab as p`, but that's kind of annoying and gets messy in long lines with lots of function calls. I use `from pylab import *`; I'm guessing you'll do the same. If I'm wondering whether a function exists, I try calling it and see what happens; oftentimes, I'm surprised.

Parallelism is a big deal to the scientific community: the code we have takes hours to run, and we want to speed it up. Since for-loops with independent iterations can be sped up a ton by parallelism, there are plenty of tools out there to parallelize code, including IPython's parallelization toolbox.

But this is still slightly confusing and seems like a pain to execute. I recently stumbled across a method to parallelize a function in one line. Basically, all you do is the following:
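A minimal sketch of the idea (my own version using the standard library's thread-backed pool, not the linked post's exact code):

```python
from multiprocessing.dummy import Pool  # thread-backed Pool; multiprocessing.Pool uses processes

def f(x):
    return x ** 2  # stand-in for an expensive, independent loop body

pool = Pool(4)                    # four workers
results = pool.map(f, range(10))  # the parallel "for-loop" in one line
pool.close()
pool.join()
```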

The link above goes into more detail; I'll omit most of it. IPython's parallelization toolkit also includes a `map()` interface.

*UPDATE:* see my blog post on how to parallelize easily

SymPy serves as a replacement for Mathematica (or at least it's a close race). With SymPy, you have access to symbolic variables and can perform almost any operation on them: differentiation, integration, etc. It supports matrices of symbolic variables and functions on them; it seems like a complete library.

Perhaps most attractive, you can even pretty-print expressions in LaTeX or ASCII.
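For instance (a small sketch of my own):

```python
import sympy

x = sympy.symbols("x")
expr = sympy.integrate(x * sympy.sin(x), x)  # symbolic integration
sympy.pprint(expr)        # pretty ASCII/unicode output
print(sympy.latex(expr))  # the same expression as LaTeX source
```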

I haven't used this library much, but I'm sure there are good tutorials out there.

When indexing a two-dimensional NumPy array, you often use something like `array[y, x]` (reversed for good reason!). The first index `y` selects the *row* while the second selects the *column*.

For example,
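a small example (my own numbers):

```python
import numpy as np

x = np.array([[1, 2, 3],
              [4, 5, 6]])
print(x[0, 1])  # row 0, column 1
```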

This makes sense because you'd normally use `x[0][1]` to select the element in the 1st row and 2nd column; `x[0, 1]` does the same thing but drops the unnecessary brackets. Python selects the first object with the first index, and looking at our array, the first object is another array: the first row.

In MATLAB, indexing is 1-based, and perhaps most confusingly, MATLAB's `array(x, y)` is `array[y, x]` in Python. MATLAB also has a feature that lets you select an element with a single linear index into the array, which is useful for the Kronecker product: MATLAB stacks the columns when doing this, which is exactly the convention `kron` relies on. To use this column-major indexing in Python, I use `x.T.flat[i]`.
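To see the column-major convention concretely (a sketch of my own):

```python
import numpy as np

x = np.array([[1, 2],
              [3, 4]])
# MATLAB's linear index x(i) walks down the columns:
# x(1)=1, x(2)=3, x(3)=2, x(4)=4 (1-based there, 0-based here)
col_major = [int(x.T.flat[i]) for i in range(x.size)]
print(col_major)  # [1, 3, 2, 4]
```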

In any Python version <= 3.4, there's no matrix-multiplication operator, unlike MATLAB's `*`. It's easy to multiply arrays element-wise with `*` in Python (`.*` in MATLAB). But coming in Python 3.5 is a new matrix-multiplication operator! The choice of `@` and the rationale behind it are detailed in this PEP.

Until the scientific community slowly progresses to Python 3.5, we'll be stuck on lowly Python 2.7. Instinct would tell you to call `dot(dot(x, y), z)` to compute $X \cdot Y \cdot Z$. But instead, you can use `x.dot(y).dot(z)`. Much easier and much cleaner.
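For instance (a small example of my own):

```python
import numpy as np

x = np.array([[1.0, 2.0],
              [3.0, 4.0]])
y = np.array([[0.0, 1.0],
              [1.0, 0.0]])
z = np.eye(2)

nested = np.dot(np.dot(x, y), z)  # instinctive, but hard to read
chained = x.dot(y).dot(z)         # reads left to right, like the math
```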

This is not really related to the scientific programming process; it applies to any file, whether it's in a programming language or not (a good example: LaTeX files).

Stealing from this list, if you've ever

- made a change to code, realised it was a mistake and wanted to revert back?
- lost code or had a backup that was too old?
- had to maintain multiple versions of a product?
- wanted to see the difference between two (or more) versions of your code?
- wanted to prove that a particular change broke or fixed a piece of code?
- wanted to review the history of some code?
- wanted to submit a change to someone else's code?
- wanted to share your code, or let other people work on your code?
- wanted to see how much work is being done, and where, when and by whom?
- wanted to experiment with a new feature without interfering with working code?

then you need version control. Personally, I can't imagine doing anything significant without source control. Whenever I'm writing a paper or working on almost any programming project, I use `git commit -am "description"` all the time. Source control is perhaps my biggest piece of advice.
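A minimal sketch of that workflow (a hypothetical project; the file names are made up):

```shell
git init paper
cd paper
git config user.name "Your Name"        # only needed if not configured globally
git config user.email "you@example.com"
echo 'draft' > paper.tex
git add paper.tex
git commit -m "start the paper"
echo 'revised' >> paper.tex
git commit -am "describe the change"    # -a stages every tracked change
git log --oneline                       # review the history
```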

Version control is normally a bit of a pain: you have to be familiar with the command line, and (with CVS etc.) it can be an even bigger pain. Git (and its companion GitHub) is widely considered one of the easiest versioning tools to use.

GitHub has a GUI to make version control *simple*. It's simple to commit changes, roll back changes, and even branch to work on different features. It's available for Mac and Windows, and many more GUIs are available.

GitHub even offers private licenses for users in academia, which allows you to have up to five free *private* code repositories online. This makes for easy collaboration and sharing (another plus: access to GitHub Pages). There's a list of useful guides to getting started with Git/GitHub.

(Shameless plug.) MATLAB has a great feature: calling `drawnow` updates a figure after a series of plot commands. I searched high and low for a similar syntax in Python but couldn't find anything except matplotlib's animation frameworks, which didn't jive with the global-scope ease I wanted. After a long and arduous search, I did find `clf()` and `draw()`. This is simple once you know about it, but it's a pain to find.
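A sketch of the `clf()`/`draw()` pattern (my own minimal example, plain matplotlib rather than python-drawnow's API):

```python
import numpy as np
import matplotlib.pyplot as plt

plt.ion()  # interactive mode, so draw() doesn't block
x = np.linspace(0, 2 * np.pi, 100)
for i in range(3):
    plt.clf()                   # clear the previous frame
    plt.plot(x, np.sin(x + i))
    plt.draw()                  # repaint the figure, like MATLAB's drawnow
    plt.pause(0.05)             # give the GUI event loop a moment
```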

So, I created python-drawnow to make this functionality easily accessible. It allows you to view the results of an iterative (aka for-loop) process.

As I stressed in the introduction, this guide is not meant to be a full introduction to Python; there are plenty of other tutorials that cover the basics: syntax, scope, function definitions, etc. And of course, the documentation is also a great place to look (NumPy, SciPy, matplotlib). Failing that, a Google/Stack Overflow search will likely solve your problem. Perhaps the best part: if you find a problem in a package and fix it, you can commit your changes and make them accessible globally!

\n\n", "tags": [], "image": "" }, { "id": "http://stsievert.com/blog/2013/11/14/predicting-the-weather/", "url": "http://stsievert.com/blog/2013/11/14/predicting-the-weather/", "title": "Predicting the weather", "date_published": "2013-11-14T04:44:33-06:00", "date_modified": "2013-11-14T04:44:33-06:00", "author": { "name": "Scott Sievert", "url": "http://stsievert.com" }, "summary": "Let’s say that we’re accurately measuring the temperature in both Madison and Minneapolis, but then our temperature sensor in Minneapolis breaks. We could easily install a new sensor, but we would prefer to estimate the temperature in Minneapolis based on the temperature in Madison.

\n\n", "content_html": "Let’s say that we’re accurately measuring the temperature in both Madison and Minneapolis, but then our temperature sensor in Minneapolis breaks. We could easily install a new sensor, but we would prefer to estimate the temperature in Minneapolis based on the temperature in Madison.

\n\n\n\nFirst, let’s see the temperature difference between the two cities:

*(Figure: the temperature difference between Madison and Minneapolis.)*

Let's say we're collecting the data accurately and are free from the effects of noise. In this problem, we're estimating $X$ (Minneapolis) from $Y$ (Madison). The mean temperature difference is $\E{\left|X-Y\right|} = 4.26^\circ$ ($\E{\cdot}$ is an operator that finds the mean).

We're going to use a linear estimation process. This process only takes in the current data, nothing about the general trend of the seasons: it just says that the temperature in Minneapolis is 80% of the temperature in Madison plus some constant; fairly simple. Regardless, it's still pretty good, since Madison and Minneapolis have fairly similar temperatures. The only thing this estimation requires is some past weather data in Minneapolis to estimate the mean $\E{X}$ and variance ($\propto \E{X^2}$), nothing more.

We want to minimize the *energy* of the error, using the $l_2$ norm. This choice may seem arbitrary, and it kind of is. If this data were sparse (aka lots of zeros), we might want to use the $l_1$ or $l_0$ norms. But if we're trying to minimize cost spent, the $l_0$ or $l_1$ norms don't do as good a job of minimizing the amount of dollars spent.

But doing the math,

$$
\Align{
&\min \E{\left(X-(\alpha Y+\beta)\right)^2} = \\
&\min \E{X^2} + \alpha^2\E{Y^2} + \beta^2 + 2\alpha\beta\E{Y} - 2\alpha\E{XY} - 2\beta\E{X}
}
$$

Since this function is convex (U-shaped) in $\alpha$ and $\beta$, and $\E{\cdot}$ is a linear operator, we can minimize it by taking derivatives with respect to each parameter and setting them to zero.

$$
\frac{d}{d\alpha} = 0 = -2 \E{XY} + 2 \alpha\E{Y^2} + 2\beta \E{Y}
$$

$$
\frac{d}{d\beta} = 0 = -2 \E{X} + 2\beta + 2\alpha\E{Y}
$$

This linear system of equations is described by $Ax = b$, or

$$
\begin{bmatrix} \E{Y^2} & \E{Y} \\ \E{Y} & 1 \end{bmatrix}
\cdot
\begin{bmatrix} \alpha \\ \beta \end{bmatrix}
=
\begin{bmatrix} \E{XY} \\ \E{X} \end{bmatrix}
$$

Solving this linear system of equations by multiplying by $A^{-1}$ gives us

$$
\alpha = 0.929 \\ \beta = 3.14
$$
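The same solve can be sketched in NumPy. The series below are synthetic stand-ins (hypothetical), not the actual weather data:

```python
import numpy as np

rng = np.random.default_rng(0)
# hypothetical stand-ins for the Madison (y) and Minneapolis (x) series
y = 30 + 25 * np.sin(np.linspace(0, 2 * np.pi, 365)) + rng.normal(0, 4, 365)
x = 0.93 * y + 3.1 + rng.normal(0, 4, 365)

E = np.mean  # the expectation operator, estimated by an average

# the 2x2 system A [alpha, beta]^T = b from the derivatives above
A = np.array([[E(y**2), E(y)],
              [E(y),    1.0]])
b = np.array([E(x * y), E(x)])
alpha, beta = np.linalg.solve(A, b)
x_hat = alpha * y + beta  # the linear estimate of Minneapolis from Madison
```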

On average, the temperature in Minneapolis is 92.9% of Madison's, plus 3 degrees. Let's see how good our results are using this $\alpha$ and $\beta$. The temperature difference between the two cities, but now predicting one from the other, is shown below:

*(Figure: the temperature difference when predicting one city from the other.)*

That's *exactly* what we want! It's fairly close to the first graph. While there are areas where it's off, it's pretty dang close. In fact, on average it's within $4.36^\circ$, fairly close to the original average temperature difference of $4.26^\circ$!