In the scientific community, executing functions in parallel can mean hours or even days less execution time. There’s a whole array of Python parallelization toolkits, probably partially due to the competing standards issue.
Python has the infamous global interpreter lock (GIL) which greatly restricts parallelism. We could try to step around it ourselves, but a lesson I’ve learned is not to solve a problem others tackled, especially when they do it right. There are a host of other packages available, and all done right. They all tackle problems similar to their own problem.
Perhaps the biggest barrier to parallelization is that it can be very complicated to include, at least for the niche of the scientific community I’m in. I spent a while looking at the IPython parallelization framework and various other packages, but they all gave me mysterious bugs so I decided to use the multiprocessing library for this simple task. Without this Medium article, I would be lost in threads and classes that add significant speed boosts but seem to be default in basic parallelization libraries.
I want to emphasize that there are extremely large speed gains through these other toolboxes and it’s probably worth it to learn those toolboxes if you need those speed gains. The GPU has thousands of cores, while your CPU only has a few. Of course, an easy and simple solution uses the least complex methods, meaning this simple parallelization uses the CPU. There are massive gains for using more complex solutions.
I don’t want to add another package to Python’s list of parallel toolboxes
(again, competing standards), but let’s define
some functions1 so we can have a parallel function easily.
some_function.parallel([1,2,3]) will execute
some_function in parallel with
easy_parallize just uses
pool.map to execute a bunch of
statements in parallel. Using
functool, I return a function that only needs
sequence as an input. It opens and closes a pool each time; certainly not
optimal but easy. This method of parallelization seems prime for the scientific
community. It’s short and sweet and doesn’t require extensive modifications to
Let’s test a small example out and see what each part does. If you want to know my thought process through the whole thing, see my Binpress tutorial. We’ll test a complicated and time-consuming function, a not-so-smart way to test if a number is prime. We could use much smarter methods such as using NumPy or even the Sieve of Eratosthenes, but let’s pretend that this is some much longer and more computation-intensive computation.
Testing this parallelization out, we find that the results are exactly the same, even with incredibly precise floats. This means that the exact same computation is taking place, just on different cores.
Given the simplicity of including this parallelization, the most important question to ask is about the speedup of this code. Like all complex subjects, the answer is “it depends.” It will primarily2 depend on the number of cores your machine has. One core or worst case means that your parallelized code will still run, just without the speed benefits.
On “basic” machines, such as this 4-core Macbook Air, I see parallelized results that run about twice as fast. On an 8-core iMac, it’s about 4 times as fast. It seems like my operating system is only dedicating half the cores to this process, but I don’t want to dig into that magic3.
If you’re running code that takes a long time to run and can be run in parallel, you probably should have access to a large supercomputing institute like I should (but don’t. Classic academia). Running code on machines that have 20 cores, we would expect to get results in 5% of the serial time.
test_primes.parallel instead of
test_primes gives us pretty large
speed gains for such a simple method. Even more convenient, editing the code to
do this was easy. Another selling point is that it’s easy to use threads
instead of processes. The
multiprocessing.dummy module is an exact clone of
multiprocessing that uses threads instead of processes. Hence if you find
gains with threads (I didn’t), use
from multiprocessing.dummy import Pool.
Too often, some promising example is shown that when applied in the real world fails. To try and combat this, I decided to use a more complicated example, calculating a Mandelbrot set. This can be natively parallelized using NumPy, but I decided to simulate a more complex function. Once again, the full code including serial and parallel declarations is available on Github.
Calculating this Mandelbrot set relies on looping over a two-dimensional grid (that is, without using NumPy). To parallelize this function, I essentially only looped over one dimension while parallelizing the other dimension.
This edit is relatively simple. While I couldn’t just have a drop in
mandel.parallel, some small edits to take out looping over
x dimension made my function faster. Essentially, I brought the for-loop
x outside the function. I see similar results with this code: about 2x
speedup on my 4-core and about a 4x speedup on the 8-core iMac.
At the end of the day, we can expect to easily parallelize our code providing we use the right tools. This just involves making small edits for fairly large speed gains (even on this machine!) and huge speed gains on supercomputers.
If you want to see this in more detail, look at the source which is well commented and works. As I mentioned earlier, I have written a Binpress tutorial on how I discovered this function. To be complete, here are the full results:
If your input range is small like
range(1, 100), the overhead of creating the parallel function will wipe out any speed benefits. Luckily, your range is small enough this shouldn’t matter. ↩
subscribe via RSS