Slide 1

Slide 1 text

CyCraft Proprietary and Confidential Information
Your Escape Plan From NumPy + Cython
Cheng-Lin Yang, PhD
CyCraft Japan

Slide 2

Slide 2 text

$ whoami
• Cheng-Lin Yang: @clyang
• Taiwanese, living in Taipei
• Working for a cybersecurity company: CyCraft Japan
• Member of the Machine Learning team

Slide 3

Slide 3 text

Before we start, one quick question for you. Which code runs faster?
A. np.power(x, 8)
B. x ** 8
C. x * x * x * x * x * x * x * x

Slide 4

Slide 4 text

Answer: C
x * x * x * x * x * x * x * x
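The answer is easy to check on your own machine. The sketch below times the three candidates with the standard `timeit` module. The array size, the random input and the helper name `benchmark` are assumptions made for illustration, and the ranking can shift with NumPy version, dtype and hardware.

```python
import timeit

import numpy as np

# Assumed input: 200,000 random float64 values (not taken from the slides).
x = np.random.default_rng(0).random(200_000)

candidates = {
    "A. np.power(x, 8)": lambda: np.power(x, 8),
    "B. x ** 8": lambda: x ** 8,
    "C. x * x * ... * x": lambda: x * x * x * x * x * x * x * x,
}


def benchmark(repeat=3, number=10):
    """Return the best per-call time in seconds for each candidate."""
    results = {}
    for label, fn in candidates.items():
        best = min(timeit.repeat(fn, repeat=repeat, number=number))
        results[label] = best / number
    return results


timings = benchmark()
for label, seconds in timings.items():
    print(f"{label:22s} {seconds * 1e3:8.3f} ms per call")
```

All three expressions compute the same values, so only the timing differs.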

Slide 5

Slide 5 text

Why not Cython?

Slide 6

Slide 6 text

Cython
• Advantages
  • Can call third-party C libraries for faster execution
  • Can release the GIL
  • Keeps Python’s run-time checks for common problems
  • Cython syntax is very similar to Python
• Disadvantages
  • You have to manage memory yourself (if malloc is used)
  • To get the best performance, writing C code with low-level intrinsics CANNOT be avoided (this can be painful)

Slide 7

Slide 7 text

You have to write something like this, and it’s painful

Slide 8

Slide 8 text

Today’s example

Slide 9

Slide 9 text

logsumexp (LSE) - I
• The softmax function is defined as:
  $$\sigma(\mathbf{z})_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}} \quad \text{for } j = 1, \dots, K$$
  where $\mathbf{z}$ is a K-dimensional vector
• logsumexp is the log-sum-exp trick, which prevents overflow/underflow during the softmax calculation

Slide 10

Slide 10 text

logsumexp (LSE) - II
• However, floating point underflow will occur during summation. For example: 134217728 1 134217728

Slide 11

Slide 11 text

logsumexp (LSE) - III
• The problem can be solved by this simple trick
• Applying it to the previous example:
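The slide's formula is not present in this transcript. The trick usually meant in this context is the max-shift identity, LSE(x) = m + log(sum(exp(x_i - m))) with m = max(x). Assuming that is what the slide showed, the following sketch contrasts it with the naive formula. The function names and test vectors are illustrative, not the author's.

```python
import numpy as np


def naive_logsumexp(x):
    # Direct translation of log(sum(exp(x))); breaks for large |x|.
    return np.log(np.sum(np.exp(x)))


def shifted_logsumexp(x):
    # Subtract the maximum first so the largest exponent is exp(0) = 1.
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))


big = np.array([1000.0, 1000.0])
small = np.array([-1000.0, -1000.0])

# Silence the expected overflow / log(0) warnings of the naive version.
with np.errstate(over="ignore", divide="ignore"):
    naive_big = naive_logsumexp(big)      # exp overflows to inf
    naive_small = naive_logsumexp(small)  # exp underflows to 0, log(0) = -inf

stable_big = shifted_logsumexp(big)       # 1000 + log(2)
stable_small = shifted_logsumexp(small)   # -1000 + log(2)
```

The shifted version returns finite values in both cases because every exponent it evaluates is at most zero.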

Slide 12

Slide 12 text

SciPy has it. Why reinvent the wheel?
• Too many checks drag down performance
  • SciPy’s version is written for general-purpose use
• Caveats when improving performance:
  • Assume the input data satisfies the required conditions, so the unnecessary checks can be removed.
  • Verify what you actually need and simplify the code to your requirements.
  • For example: only 1-D arrays are used in the following scenario

Slide 13

Slide 13 text

logsumexp in NumPy

Slide 14

Slide 14 text

logsumexp in NumPy
• Based on my scenario, logsumexp can be implemented as follows:
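The slide's code is missing from this transcript. As a minimal sketch of what a check-free, 1-D-only version could look like (not the author's implementation; the name `logsumexp_1d` is hypothetical), with `np.logaddexp.reduce` used as an independent reference:

```python
import numpy as np


def logsumexp_1d(x):
    """log(sum(exp(x))) for a non-empty, finite, 1-D float array.

    Deliberately skips the input checks a general-purpose routine performs.
    """
    m = x.max()
    return m + np.log(np.exp(x - m).sum())


x = np.array([-1000.0, 0.0, 1000.0])
value = logsumexp_1d(x)
reference = np.logaddexp.reduce(x)  # pairwise log(exp(a) + exp(b)) reduction
```

Dropping the checks has a cost: if the maximum is infinite, `x - m` produces NaN, a case a general-purpose implementation guards against.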

Slide 15

Slide 15 text

NumPy vs. SciPy
• Results:

Slide 16

Slide 16 text

Solution 1: CuPy

Slide 17

Slide 17 text

CuPy
• https://github.com/cupy/cupy
• Provides a NumPy-compatible ND-array on CUDA
• Utilises GPU power
• Compatible with existing CUDA kernels
• Provides many NumPy-equivalent functions, so you can minimize code refactoring effort
  • Check the differences!
  • https://docs.cupy.dev/en/stable/reference/difference.html
• Moving data between CPU and GPU is expensive!

Slide 18

Slide 18 text

logsumexp in CuPy
• Result:
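The slide's code is not in this transcript. A sketch of how the same 1-D logsumexp might look with CuPy is shown below. The `xp` alias and the fallback to NumPy when no CUDA device is usable are assumptions added so the example runs anywhere; they are not part of the talk.

```python
import numpy as np

try:
    import cupy as cp

    cp.cuda.runtime.getDeviceCount()  # raises if no usable CUDA device
    xp = cp
except Exception:
    # Assumption: fall back to NumPy so the sketch still runs on CPU-only machines.
    xp = np


def logsumexp_1d(x):
    # Same max-shift formula; every call dispatches to the array module in use.
    m = x.max()
    return m + xp.log(xp.sum(xp.exp(x - m)))


host_data = np.linspace(-5.0, 5.0, 100_000)
device_data = xp.asarray(host_data)        # one host-to-device copy up front
result = float(logsumexp_1d(device_data))  # only a scalar travels back
```

Copying the input once and returning only a scalar keeps the expensive CPU/GPU transfers to a minimum.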

Slide 19

Slide 19 text

Solution 2: Numba

Slide 20

Slide 20 text

Numba
• http://numba.pydata.org
• Just-In-Time (JIT) approach
• Translates a subset of Python and NumPy code to machine code
• Utilises both CPU and GPU power
• Supports OpenMP
• Near-zero code modification
  • Simply put the “@jit” decorator before the function you want to speed up
• Works best with functions, not classes (early support)
• Active development and a large user community

Slide 21

Slide 21 text

Numba
• Two modes you need to know
  • nopython mode (equivalent to @njit)
    • Allows you to get rid of Python’s GIL
  • object mode
• @njit + OpenMP makes it easy to parallelize computation without the GIL limitation

Slide 22

Slide 22 text

logsumexp by Numba
• Results:
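The slide's code is not in this transcript. A sketch of a Numba version, assuming `@njit(parallel=True)` with a `prange` reduction, could look like the following. The pure-Python fallback for machines without Numba is an assumption added for illustration only.

```python
import numpy as np

try:
    from numba import njit, prange
except ImportError:
    # Assumption: no-op stand-ins so the sketch still runs (slowly) without Numba.
    prange = range

    def njit(*args, **kwargs):
        def wrap(fn):
            return fn
        return wrap


@njit(parallel=True)
def logsumexp_1d(x):
    m = np.max(x)
    total = 0.0
    # With Numba, prange splits the loop across threads and
    # treats `total +=` as a parallel reduction.
    for i in prange(x.shape[0]):
        total += np.exp(x[i] - m)
    return m + np.log(total)


data = np.linspace(-5.0, 5.0, 100_000)
value = logsumexp_1d(data)  # first call includes JIT compilation time
```

When timing a JIT-compiled function, call it once beforehand so compilation is not counted in the measurement.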

Slide 23

Slide 23 text

Solution 3: Pythran

Slide 24

Slide 24 text

Pythran
• https://pythran.readthedocs.io/en/latest/
• Active development and a fast-growing community
• Uses an ahead-of-time compilation approach
  • LLVM + compiler does all the magic!
• Supports a subset of Python and NumPy functions
• Works on Python 2.7 and 3.6/7/8
• Similar to Numba, you have to mark the function you want to boost, here with a “#pythran export” comment rather than a decorator
• OpenMP can also be used with Pythran

Slide 25

Slide 25 text

logsumexp in Pythran
• First, write the Python code as usual. (pythran_logsumexp.py)
• Compile it with:
  • CXX=clang++ pythran -DUSE_XSIMD -march=native -O3 pythran_logsumexp.py
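The contents of pythran_logsumexp.py are not in this transcript. A sketch of what such a file could contain is shown below, assuming a single exported function taking a 1-D float64 array; the function name and signature are illustrative.

```python
# pythran_logsumexp.py (illustrative contents, not the author's file)
import numpy as np


#pythran export logsumexp(float64[])
def logsumexp(x):
    # Max-shift formula on a 1-D float64 array.
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))
```

The `#pythran export` line is an ordinary comment to the Python interpreter, so the same file also runs uncompiled, which is handy for checking results before and after compilation.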

Slide 26

Slide 26 text

logsumexp in Pythran
• Import the just-compiled module and run it!
• Result

Slide 27

Slide 27 text

So, which is better? (in numbers)

Slide 28

Slide 28 text

Benchmark
• All benchmarks were run on a bare-metal machine with the following specifications:
  • CPU: Intel(R) Xeon(R) Silver 4116 CPU @ 2.10GHz
  • RAM: 256GB DDR4 with ECC
  • GPU: GeForce GTX 1080 Ti
• Python and library information:
  • Python 3.6.9
  • CUDA 10.2
  • NumPy 1.17.5
  • CuPy 7.8.0
  • Numba 0.51.0
  • Pythran 0.9.6

Slide 29

Slide 29 text

Benchmark results in seconds (lower is better)

Implementation   Seconds
SciPy            6.7171
NumPy            6.2521
Numba JIT        6.7264
Numba NJIT       6.0252
Pythran          5.4564
CuPy             1.6152

Slide 30

Slide 30 text

My decision tree: CuPy, Numba and Pythran

Slide 31

Slide 31 text

Decision tree (figure)
• Questions: Has GPU? Like compiler? Need CUDA kernel? Has CPU computation?
• Outcomes: CuPy, Pythran, Numba

Slide 32

Slide 32 text

Three Takeaways
• If you have GPU(s), try CuPy first!
• If you only have a CPU, use Numba first
  • Numba supports more NumPy functions
  • If it works, try Pythran to get more performance
• Each solution supports a different set of NumPy functions.
  • You can easily find out which function doesn’t work (the program stops :P )
  • Check its documentation to see which functions are provided
  • If A doesn’t work, B might work!

Slide 33

Slide 33 text

Thank You
Stay Safe