## Slide 1

### Slide 1 text

CyCraft Proprietary and Confidential Information

Your Escape Plan From NumPy + Cython

Cheng-Lin Yang, PhD, CyCraft Japan

## Slide 2

### Slide 2 text

\$ whoami

- Cheng-Lin Yang: @clyang
- Taiwanese, living in Taipei
- Working for a cybersecurity company: CyCraft Japan
- Member of the Machine Learning team

## Slide 3

### Slide 3 text

Before we start, one quick question for you. Which code runs faster?

- A. `np.power(x, 8)`
- B. `x ** 8`
- C. `x * x * x * x * x * x * x * x`

## Slide 4

### Slide 4 text

Answer: C, `x * x * x * x * x * x * x * x`
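To check this on your own machine, a minimal timing sketch could look like the following. The array size, value range, and repeat count are assumptions chosen for illustration, and the timings will vary with hardware and NumPy version.

```python
import timeit

import numpy as np

# Assumed test input: one million float64 values near 1.0.
x = np.random.default_rng(0).uniform(0.5, 1.5, size=1_000_000)

candidates = {
    "np.power(x, 8)": lambda: np.power(x, 8),
    "x ** 8": lambda: x ** 8,
    "x * x * ... * x": lambda: x * x * x * x * x * x * x * x,
}

# All three expressions compute the same values up to rounding.
reference = np.power(x, 8)
all_agree = all(np.allclose(f(), reference) for f in candidates.values())

for label, f in candidates.items():
    seconds = timeit.timeit(f, number=10)
    print(f"{label:>16}: {seconds:.4f} s for 10 runs")
```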

## Slide 5

### Slide 5 text

Why not Cython?

## Slide 6

### Slide 6 text

Cython

- Advantages
  - Using third-party C libraries can make code execute faster
  - Can release the GIL
  - Still has the run-time checks for common problems provided by Python
  - Cython syntax is very similar to Python
- Disadvantages
  - You have to manage memory yourself (if `malloc` is used)
  - To get ultimate performance, writing C code with low-level intrinsics CANNOT be avoided (this can be painful)

## Slide 7

### Slide 7 text

You have to write something like this, and it’s painful

## Slide 8

### Slide 8 text

Today’s example

## Slide 9

### Slide 9 text

logsumexp (LSE) - I

- The softmax function is defined as $\sigma(z)_j = \dfrac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}$ for $j = 1, \dots, K$, where $z$ is a $K$-dimensional vector
- logsumexp is a log-sum-exp trick that prevents overflow/underflow during the softmax calculation

## Slide 10

### Slide 10 text

logsumexp (LSE) - II

- However, floating-point underflow will occur during summation. For example: 134217728 1 134217728
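One plausible reading of the numbers above is that adding 1 to 134217728 (2**27) in single precision returns 134217728 unchanged. That reading is an assumption, since the operators are not present in the text. The sketch below demonstrates that effect, and also shows how a naive log-sum-exp loses everything when `exp()` underflows.

```python
import numpy as np

# float32 keeps 24 significant bits, so near 2**27 the spacing between
# representable values is 16 and adding 1 is lost entirely.
big = np.float32(134217728)      # 2**27
small = np.float32(1)
absorbed = big + small
print(absorbed == big)

# A naive log-sum-exp suffers a related loss: exp() of very negative
# inputs underflows to 0.0, so the log of the sum becomes -inf.
x = np.array([-1000.0, -1000.0])
with np.errstate(divide="ignore"):
    naive = np.log(np.sum(np.exp(x)))
print(naive)
```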

## Slide 11

### Slide 11 text

logsumexp (LSE) - III

- The problem can be solved by this simple trick
- Applying it to the previous example:
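Written out, the trick is usually stated as the max-shift identity below. The symbols $x_i$, $n$ and $c$ are notation introduced here for illustration, not taken from the slide.

$$
\operatorname{LSE}(x_1, \dots, x_n)
= \log \sum_{i=1}^{n} e^{x_i}
= c + \log \sum_{i=1}^{n} e^{x_i - c},
\qquad c = \max_i x_i
$$

Because every exponent $x_i - c$ is at most $0$, each term lies in $(0, 1]$ and the largest term is exactly $e^{0} = 1$. The sum therefore stays between $1$ and $n$, so it can neither overflow nor collapse to zero before the logarithm is taken.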

## Slide 12

### Slide 12 text

SciPy has it. Why reinvent the wheel?

- Too many checks drag down performance
- Built for general-purpose usage
- Caveats when improving performance:
  - Assume the input data satisfies the required conditions, so the unnecessary checks can be removed.
  - Verify what you actually need and simplify the code to match your requirements.
  - For example: only 1-D arrays are used in the following scenario

## Slide 13

### Slide 13 text

logsumexp in NumPy

## Slide 14

### Slide 14 text

logsumexp in NumPy

- Based on my scenario, logsumexp can be implemented as follows:
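A minimal sketch of such a 1-D implementation, not necessarily the exact code shown in the talk. The function name `logsumexp_1d` is hypothetical, and the function deliberately skips input validation, in line with the caveats listed earlier.

```python
import numpy as np


def logsumexp_1d(x):
    """Log-sum-exp of a non-empty 1-D float array, with no input checks."""
    c = np.max(x)                        # shift by the maximum for stability
    return c + np.log(np.sum(np.exp(x - c)))


# exp(1000) overflows in float64, but the shifted version is exact here.
x = np.array([1000.0, 1000.0])
result = logsumexp_1d(x)                 # mathematically 1000 + log(2)
print(result)
```

Dropping the checks means the caller is responsible for passing a finite, non-empty 1-D array; for example, an input consisting only of `-inf` would produce `nan` here.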

## Slide 15

### Slide 15 text

NumPy vs. SciPy

- Results:

## Slide 16

### Slide 16 text

Solution 1: CuPy

## Slide 17

### Slide 17 text

CuPy

- https://github.com/cupy/cupy
- Provides a NumPy-compatible ND-array on CUDA
- Utilises GPU power
- Compatible with existing CUDA kernels
- Provides many NumPy-equivalent functions, so you can minimize code refactoring effort
  - Check the differences!
  - https://docs.cupy.dev/en/stable/reference/difference.html
- Moving data between CPU and GPU is expensive!
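Because CuPy mirrors much of the NumPy API, one way to limit refactoring is to write the function against an array module passed in as a parameter. The sketch below is an illustration under that assumption: `logsumexp_xp` and the `xp` parameter are hypothetical names, and the code falls back to NumPy when CuPy cannot be imported.

```python
import numpy as np

try:
    import cupy as cp          # optional: needs a CUDA-capable GPU
except ImportError:
    cp = None


def logsumexp_xp(x, xp):
    """Max-shifted log-sum-exp using whichever array module `xp` is."""
    c = xp.max(x)
    return c + xp.log(xp.sum(xp.exp(x - c)))


x_cpu = np.random.default_rng(0).standard_normal(1_000_000)

if cp is not None:
    x_gpu = cp.asarray(x_cpu)            # host-to-device copy (the expensive part)
    result = float(cp.asnumpy(logsumexp_xp(x_gpu, cp)))
else:
    result = float(logsumexp_xp(x_cpu, np))

print(result)
```

Keeping the data on the GPU across several operations, rather than copying it back and forth for each call, is what lets the transfer cost pay off.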

## Slide 18

### Slide 18 text

logsumexp in CuPy

- Result:

## Slide 19

### Slide 19 text

Solution 2: Numba

## Slide 20

### Slide 20 text

Numba

- http://numba.pydata.org
- Just-In-Time (JIT) approach
  - Translates a subset of Python and NumPy code to machine code
- Utilises both CPU and GPU power
  - Supports OpenMP
- Near-zero code modification
  - Simply put the `@jit` decorator before the function you want to speed up
- Works best with functions, not classes (early support)
- Active development and a large user community

## Slide 21

### Slide 21 text

Numba

- Two modes you need to know
  - nopython mode (`@jit(nopython=True)`, equivalent to `@njit`)
    - Allows you to release Python’s GIL (by passing `nogil=True`)
  - object mode
- `@njit` + OpenMP makes it easy to parallelize computation without the GIL limitation
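A minimal sketch of logsumexp in nopython mode, assuming a non-empty 1-D float64 input. The function name `logsumexp_numba` is hypothetical, and the `try`/`except` fallback is an addition so the example still runs (uncompiled) where Numba is not installed.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    def njit(func):            # fallback: run as plain Python without Numba
        return func


@njit
def logsumexp_numba(x):
    # Explicit loops are fine here: nopython mode compiles them to machine code.
    c = x[0]
    for v in x:
        if v > c:
            c = v
    total = 0.0
    for v in x:
        total += np.exp(v - c)
    return c + np.log(total)


x = np.array([1000.0, 1000.0])
result = logsumexp_numba(x)    # first call includes JIT compilation time
print(result)
```

When benchmarking, call the function once before timing it so that compilation is not counted in the measurement.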

## Slide 22

### Slide 22 text

logsumexp by Numba

- Results:

## Slide 23

### Slide 23 text

Solution 3: Pythran

## Slide 24

### Slide 24 text

Pythran

- https://pythran.readthedocs.io/en/latest/
- Active development and a fast-growing community
- Uses an ahead-of-time compilation approach
  - LLVM + compiler does all the magic!
- Supports a subset of Python and NumPy functions
- Works on Python 2.7 and 3.6/3.7/3.8
- Similar to Numba, you have to mark the function you want to boost, here with a special `#pythran export` comment rather than a decorator
- OpenMP can also be used with Pythran

## Slide 25

### Slide 25 text

logsumexp in Pythran

- First, write the Python code as usual (`pythran_logsumexp.py`).
- Compile it with:
  - `CXX=clang++ pythran -DUSE_XSIMD -march=native -O3 pythran_logsumexp.py`
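A sketch of what `pythran_logsumexp.py` might contain, assuming a 1-D float64 input. The function name `logsumexp` and the exported signature are assumptions; the talk's actual file is not reproduced here. The `#pythran export` comment tells Pythran which function and argument types to compile, and the file remains valid plain Python.

```python
# pythran_logsumexp.py
import numpy as np


#pythran export logsumexp(float64[])
def logsumexp(x):
    # Max-shifted log-sum-exp over a 1-D array.
    c = np.max(x)
    return c + np.log(np.sum(np.exp(x - c)))
```

After compilation, the generated extension module is imported like any other module, for example `from pythran_logsumexp import logsumexp`.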

## Slide 26

### Slide 26 text

logsumexp in Pythran

- Import the just-compiled module and run it!
- Result:

## Slide 27

### Slide 27 text

So, which is better? (in numbers)

## Slide 28

### Slide 28 text

Benchmark

- All benchmarks were run on a bare-metal machine with the following specifications:
  - CPU: Intel(R) Xeon(R) Silver 4116 CPU @ 2.10GHz
  - RAM: 256GB DDR4 with ECC
  - GPU: GeForce GTX 1080 Ti
- Python and library information:
  - Python 3.6.9
  - CUDA 10.2
  - NumPy 1.17.5
  - CuPy 7.8.0
  - Numba 0.51.0
  - Pythran 0.9.6

## Slide 29

### Slide 29 text

Benchmark results, in seconds (lower is better):

| Implementation | Time (s) |
|---|---|
| SciPy | 6.7171 |
| NumPy | 6.2521 |
| Numba JIT | 6.7264 |
| Numba NJIT | 6.0252 |
| Pythran | 5.4564 |
| CuPy | 1.6152 |

## Slide 30

### Slide 30 text

My decision tree: CuPy, Numba and Pythran

## Slide 31

### Slide 31 text

Decision tree (figure), with yes/no branches at each question.

- Questions: Has GPU? Like compiler? Need CUDA kernel? Has CPU computation?
- Outcomes: CuPy, Numba, Pythran

## Slide 32

### Slide 32 text

Three Takeaways

- If you have GPU(s), try CuPy first!
- If you only have a CPU, use Numba first
  - Numba supports more NumPy functions
  - If it works, try Pythran to get more performance
- Each solution supports a different set of NumPy functions.
  - You can easily find out which function doesn’t work (the program stops :P )
  - Check the documentation to see which functions are provided
  - If A doesn’t work, B might work!

## Slide 33

### Slide 33 text

Thank You. Stay Safe.