High-Performance Computation in Python | Numba

In this article, we continue the topic of efficient computation of the Euclidean distance started in the previous blog post about vectorization. This time we’ll focus on the Numba JIT compiler.

What is Numba?

On the project page http://numba.pydata.org/ we can read:

“Numba is an open-source JIT compiler that translates a subset of Python and NumPy code into fast machine code.”

Let’s decompose this definition into prime factors:

  1. Open-source
    All the code is available on the GitHub page: https://github.com/numba/numba. It is released under the BSD 2-Clause “Simplified” License.
  2. Just-In-Time (JIT) compiler
    A standard compiler compiles code before execution, as a separate step. In contrast, a JIT compiler compiles code at run time, when it is executed. This means the first invocation might be slower (because of the compilation overhead), but subsequent invocations benefit from the compiled version. The compiler infers the variables’ types from the values passed during execution; if the types change in subsequent invocations, a new version of the function is compiled. As a result, we don’t need to create separate functions for different argument types: Numba does that for us automatically (see the sketch after this list). It can be a great solution when we don’t know the variables’ types beforehand, and it saves a lot of the programmer’s time!
  3. A subset of Python and NumPy code
    Unfortunately, Numba cannot compile all Python code. You might be surprised when you run your code for the first time and get errors, even though the same code works under plain Python.
  4. Fast machine code
    Numba works well for numerically-oriented Python code and is very good at optimizing loops, even those ‘hidden’ behind NumPy functions. It translates Python code into machine code and may apply LLVM optimizations along the way.
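
As a minimal illustration of the type specialization mentioned in point 2 (the function and values here are made up for demonstration):

```python
import numba

@numba.jit(nopython=True)
def add(a, b):
    return a + b

add(1, 2)      # first call: an integer version is compiled
add(1.5, 2.5)  # new argument types: a float version is compiled

# One compiled version is kept per argument-type signature:
print(add.signatures)  # e.g. [(int64, int64), (float64, float64)]
```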

This means Numba can compile your Python functions to machine code at run time, requires little programming effort, and can greatly increase the code’s performance.

Basic Usage

Let’s reuse the Euclidean distance example from the previous blog post about vectorization.

We begin with the same dummy implementation that computes the Euclidean distance between two points using a simple for loop written in plain Python:
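
A minimal sketch of that implementation (the exact code lives in the previous post; the function name here is an assumption):

```python
import math

def euclidean_distance_dummy(x, y):
    """Euclidean distance between two points, plain Python loop."""
    result = 0.0
    for i in range(len(x)):
        result += (x[i] - y[i]) ** 2
    return math.sqrt(result)
```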

To use Numba, we just need to wrap the function with the @numba.jit decorator. The nopython=True argument explicitly tells Numba to use the faster nopython mode and prevents it from falling back to the slower object mode.
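
A sketch of the compiled version, assuming the same function body as above:

```python
import math
import numba

@numba.jit(nopython=True)
def euclidean_distance_dummy_numba(x, y):
    """Same loop as above, compiled to machine code by Numba."""
    result = 0.0
    for i in range(len(x)):
        result += (x[i] - y[i]) ** 2
    return math.sqrt(result)
```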

And that’s it…

To compare execution times, we benchmarked the functions with the perfplot library on vectors of increasing lengths (from 10¹ to 10⁶ elements). Before the performance test, we called each compiled function once so the compilation overhead wouldn’t distort the measurements.
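
A benchmark along these lines can be set up with perfplot; the labels and exact setup below are assumptions, not the original benchmark code:

```python
import numpy as np
import perfplot

perfplot.show(
    # Two random vectors of length n for each measured size.
    setup=lambda n: (np.random.rand(n), np.random.rand(n)),
    kernels=[
        lambda xy: euclidean_distance_dummy(*xy),
        lambda xy: euclidean_distance_dummy_numba(*xy),
    ],
    labels=["dummy (plain Python)", "dummy (Numba)"],
    n_range=[10**k for k in range(1, 7)],  # 10^1 .. 10^6 elements
    xlabel="vector length",
    equality_check=None,  # we only compare timings, not outputs
)
```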

Performance comparison between the Python version of the dummy function and the compiled version. We can see a ~5x performance gain for small vectors and a 200-300x gain for large ones.

As the performance chart shows, the compiled version of the dummy algorithm runs 200-300x faster than the plain one. The performance boost is visible even for small vectors.

And that’s just the beginning 🙂

NumPy Functions

In the last blog post, we also introduced a few other implementations of the Euclidean distance that use NumPy:
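
Sketches of those implementations (bare_numpy and l2_norm are named in this post; the einsum and SciPy wrapper names are assumptions):

```python
import numpy as np
from scipy.spatial import distance

def bare_numpy(x, y):
    # Vectorized: difference, square, sum, square root.
    return np.sqrt(np.sum((x - y) ** 2))

def l2_norm(x, y):
    # The Euclidean distance is the L2 norm of the difference vector.
    return np.linalg.norm(x - y)

def numpy_einsum(x, y):
    diff = x - y
    # einsum computes the dot product of the difference with itself.
    return np.sqrt(np.einsum("i,i->", diff, diff))

def scipy_euclidean(x, y):
    return distance.euclidean(x, y)
```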

Unfortunately, only bare_numpy and l2_norm use NumPy functions supported by Numba. Neither the einsum nor the SciPy Euclidean distance implementation is suitable for the Numba compiler, so they won’t be used for the experiments in this post. The full list can be found in the Supported NumPy features documentation.

Anyway, let’s compare the NumPy functions with their Numba-compiled equivalents.
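
Compiling them requires nothing more than the same decorator (a sketch, assuming the function bodies above):

```python
import numba
import numpy as np

@numba.jit(nopython=True)
def bare_numpy_numba(x, y):
    return np.sqrt(np.sum((x - y) ** 2))

@numba.jit(nopython=True)
def l2_norm_numba(x, y):
    return np.linalg.norm(x - y)
```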

Performance comparison between the Python version of the bare_numpy function and the compiled version. We can see a ~10x performance gain for small vectors and no gain for large ones.
Performance comparison between the Python version of the l2_norm function and the compiled version. We can see a ~5x performance gain for small vectors and no gain for large ones.

In both cases, Numba gives a 5-10x speedup for small vectors, but for large vectors the results converge. Many operations on NumPy arrays are already performed by pre-compiled C code, so there is no performance gain in cases where those operations are the bottleneck.

Numba Parallel Loops

Numba makes it extremely easy to parallelize loops. We just need to add the parallel=True argument to the @numba.jit decorator and replace range with numba.prange in explicit for loops, as in the sketch below.
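
A sketch of the parallel version of the dummy function (the scalar accumulation is recognized by Numba as a reduction):

```python
import math
import numba

@numba.jit(nopython=True, parallel=True)
def euclidean_distance_dummy_parallel(x, y):
    result = 0.0
    # prange marks the loop as parallelizable; iterations are
    # distributed across threads and the sum is reduced at the end.
    for i in numba.prange(len(x)):
        result += (x[i] - y[i]) ** 2
    return math.sqrt(result)
```

For the NumPy-based versions, adding parallel=True to the decorator is enough; Numba automatically parallelizes the supported array expressions.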

Let’s see how it impacts the performance:

Performance comparison between the Python version of the dummy function, the compiled version, and the parallel version. We can see a decline in performance for small vectors and a ~2x gain for large ones.
Performance comparison between the Python version of the bare_numpy function, the compiled version, and the parallel version. We can see a decline in performance for small vectors and a ~4x gain for large ones.
Performance comparison between the Python version of the l2_norm function, the compiled version, and the parallel version. We can see a decline in performance for small vectors and no difference for large ones.

It’s worth noting that parallelization does not always improve the performance of functions. There are two main reasons:

  • For small vector sizes, the overhead of parallelization is bigger than the gain, so we see a decline in performance.
  • There is no gain for the l2_norm implementation because parallel=True performs automatic parallelization that is implemented only for a subset of constructs, e.g. explicit for loops and a subset of NumPy functions. All the supported cases are described in the Automatic parallelization documentation.

Conclusion

Let’s now compare how well all the functions perform:

Performance comparison between all the tested functions. We can see that the compiled functions give a ~5x performance gain on small vectors and the parallelized functions give a ~2x gain on large vectors.

For small vector sizes, the best performance is achieved by any of the three mentioned algorithms compiled with Numba without parallelization. This is useful when we perform lots of operations on small vectors and the type of the vectors doesn’t change too often; if the type changes, the compilation overhead might outweigh the gain.

For large vectors, we start to see the boost from parallelization. The parallel versions of both the dummy and bare_numpy implementations have the same performance, which is ~2x faster than the non-parallel versions. But again, automatic parallelization works only for some functions.