In this article, we will continue the topic of efficiently computing the Euclidean distance, started in the previous blog post about vectorization. This time we’ll focus on the Numba JIT compiler.
What is Numba?
On the project page http://numba.pydata.org/ we can read:
“Numba is an open-source JIT compiler that translates a subset of Python and NumPy code into fast machine code.”
All the code is available on the GitHub page: https://github.com/numba/numba. It is released under the BSD 2-Clause “Simplified” License.
Let’s decompose this definition into its prime factors:
- Just-In-Time (JIT) compiler
A standard (ahead-of-time) compiler compiles the code before execution, as a separate step. In contrast, a JIT compiler compiles the code at run time, when the code is executed. This means the first invocation might be slower (because of the compilation overhead), but subsequent invocations benefit from using the compiled version. The compiler infers the variables’ types from the values passed during execution. If the types change in a subsequent invocation, a new version of the function is compiled. We therefore don’t need to create different functions for different argument types; Numba does that for us automatically. It can be a great solution when we don’t know the variables’ types beforehand, because it saves a lot of the programmer’s time!
- A subset of Python and NumPy code
Unfortunately, Numba cannot compile all Python code. You might be surprised when you run your code for the first time and get errors, even though the same code works in plain Python execution.
- Fast machine code
Numba works well for numerically-oriented Python code and is very good at optimizing loops, even those ‘hidden’ behind NumPy functions. It translates Python code into machine code and can apply LLVM optimization passes along the way.
This means Numba can compile your Python functions to machine code at run time, requires little programming effort, and can greatly improve the code’s performance.
Let’s use the Euclidean distance example from the previous blog post about vectorization. We begin with the same dummy implementation that computes the Euclidean distance between two points using a simple for loop written in plain Python:
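The original snippet isn’t reproduced here, so below is a minimal sketch of such a dummy implementation (the function name is illustrative):

```python
import math

def euclidean_distance(x, y):
    """Plain-Python Euclidean distance between two equal-length sequences."""
    result = 0.0
    for i in range(len(x)):
        # accumulate squared differences element by element
        result += (x[i] - y[i]) ** 2
    return math.sqrt(result)

print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))  # 5.0
```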
To use Numba, we just need to wrap the function with the @numba.jit decorator. The nopython argument explicitly tells Numba to use the faster nopython mode and prevents Numba from falling back on the slower object mode.
And that’s it…
To compare execution times, we’ve benchmarked the functions using the perfplot library on vectors of increasing lengths (from 10¹ to 10⁶ elements). Before the performance test, we called each function once to trigger compilation.
As we can see on the performance chart, the compiled version of the dummy algorithm runs 200-300x faster than the plain one. The performance boost is visible even for small vectors.
And that’s just the beginning 🙂
In the last blog post, we also introduced a few other implementations of the Euclidean distance that used NumPy:
l2_norm uses NumPy functions supported by Numba. The scipy Euclidean distance functions are not suitable for the Numba compiler and won’t be used in the experiments in this post.
Supported NumPy features can be found in the Supported NumPy features documentation.
Anyway, let’s compare the NumPy functions with their Numba equivalents.
In both cases, Numba gives a 5-10x speedup for small vectors, but for large vectors the results converge. Many operations on NumPy arrays are already performed by pre-compiled C code, so there is no performance gain in cases where those operations become the bottleneck.
Numba Parallel Loops
Numba makes it extremely easy to parallelize loops. We just need to add the parallel=True argument to the @numba.jit decorator and replace range with numba.prange in explicit loops.
Let’s see how it impacts the performance:
It’s worth noting that parallelization does not always improve the performance of functions. There are two main reasons:
- For small vector sizes, the overhead of parallelization is bigger than the gain so we can see the decline in performance,
- There is no gain for L2 Norm because parallel=True performs an automatic parallelization which is implemented only for a subset of operations, e.g. explicit for loops and a subset of NumPy functions. All the supported cases are described in the Automatic parallelization documentation.
Let’s now compare how well all the functions perform:
For small vector sizes, the best performance is achieved by any of the three mentioned algorithms compiled with Numba without parallelization. This is useful when we perform lots of operations on small vectors and the types of the vectors don’t change too often. If a type changes, the compilation overhead might outweigh the gain.
For large vectors, we start to see the boost from parallelization. The parallel versions of both the Dummy and Bare NumPy implementations have the same performance, which is ~2x faster than their non-parallel counterparts. But again, automatic parallelization works only for some functions.