
Comparing On-The-Fly Accelerating Packages: Numba, TensorFlow, Dask, etc

Yukio Okuda
September 26, 2018


Accelerating CPU-bound Python code with on-the-fly packages, using multiple cores and GPGPU.
Thread safety and GIL safety are discussed for error-free coding.
Tags: PyConJP Python Numba TensorFlow Dask PyTorch CuPy GIL-Error GIL-Safe multithreading


Transcript

  1. Me = A Programmer over 40 Years = Joyful

     [Figure: CPU clock (GHz) and real cores by year, spanning the one-core, multi-core, and hyper-thread eras]
     GIL@Python: on one core, N threads ≤ 1 thread for CPU-bound code ➡ needed a special vector processor; on multi-core, N threads @ M cores = min(N, M) for CPU-bound code ➡ general, low-price GPGPU.
     Threading history: IBM/SRB, SUN/process, DECα/VMS-thread, POSIX/pthread, Linux/pthread ✈Wiki-Xeon

  2. Story (1/2): How to accelerate CPU-bound code in Python

     Fast execution: compiled languages (CPython extension); no GIL: Cython, PyPy, Jython, IronPython, ..; device-dependent: OpenMP, OpenACC, PyCuda.
     Fast development • Compatibility • Portability ➡ On-The-Fly (OTF) packages.
     PDF in clouds; codes in Appendix ✍; links: ✈GIL Introduction

  3. Story (2/2): Speedups are shown as ✕, but note the diminishing returns: 10✕ = 90%, 20✕ = 95%, 50✕ = 98%, 100✕ = 99% of the time cut

     [Figure: time [sec] vs. shots for the 1000-loop Monte Carlo Π calculation: Python, NumPy, NumPyTf, PythonNumba, CPythonExt, NumPyTf@Gpu, PythonNumba Thread, TfGraphWhile@Gpu, PyTorchCuPy@Gpu]
     Note: very little data transfer; no tuning of the packages.

  4. H/W, S/W: Test bench

     Environments: env-all (Python 3.6), Tf-cpu/gpu (Python 3.5); Conda 5.1.0, VirtualEnv; Mint Linux (Ubuntu 16.04); CPU + GPU; batch python on shell; SSH, NFS.
     CPU: i7-2630QM stepping=5 (Sandy Bridge mobile), Turbo=Off, EIST=Off, SSE SSE2 SSE4.1 SSE4.2 AVX, 2.0 GHz, 4 cores / 8 HT, L1=256K, L2=1M, L3=6M, PCIe II 5 GT/s, DDR3 16G 21.3G/s, swap off, QM77, NF9G (Jetway Co.)
     GPU: GTX-1060 (Pascal GP106), 1.5 GHz, 1280 cores, L2=1.5M (192-bit I/F), PCIe II 5 GT/s, GDDR5 6G 8G/s, CUDA-9, CC-6.1

  5. Speeds of Process and Thread (1/2)

         def add(n):
             a = 0
             for _ in range(n):
                 a += 1

         def series(n):
             add(n); add(n)

         def process(n):
             p1 = Process(target=add, args=(n,)); p1.start()
             p2 = Process(target=add, args=(n,)); p2.start()
             p1.join(); p2.join()

         def thread(n):
             t1 = Thread(target=add, args=(n,)); t1.start()
             t2 = Thread(target=add, args=(n,)); t2.start()
             t1.join(); t2.join()

         # timing harness: f = series / process / thread
         for n in [ .. ]:
             ts = time.monotonic()
             for _ in range(1000):
                 f(n)
             te = time.monotonic()
     Background

  6. Speeds of Process and Thread (2/2)

     ▪ Speed: Thread 1✕ Series ▼ (25% down @ Turbo On); Process 1.8✕ Series
       [Figure: time [sec] vs. shots, Thread / Series / Process]
     ▪ Launch time: Thread ≈ zero; Process ≈ 6 msec each
       [Figure: launch-time comparison, Thread / Series / Process]
     Background

  7. Is Thread Safe? (1/2)

         def add(n):
             global g
             for _ in range(n):
                 g += 1

         def sub(n):
             global g
             for _ in range(n):
                 g -= 1

         g = None

         def a_s(n):                     # run add and sub concurrently
             global g
             g = 0
             t1 = Thread( .. add, .. n)
             t2 = Thread( .. sub, .. n)
             .. .. ..
             return g

         for n in [ .. ]:
             gs = []
             for _ in range(1000):
                 gs.append(a_s(n))
             n0 = not_zero_count(gs)     # count results where g != 0
     Background

  8. Is Thread Safe? (2/2)

     ▪ Time ≥ 8 msec ➡ not thread-safe, for both global and local variables
       [Figure: time [sec] and "not zero in gs" count vs. shots]
     Background

  9. Why not Safe? GIL

     ▪ The GIL activates one thread at a time to avoid object corruption ✈Dabeaz ✈Abhinav Ajitsaria
       • GIL: Global Interpreter Lock
     ▪ Threads are chopped into tslices, and operations can be lost across slice boundaries ✈A. Jesse
       • tslice = 5 msec
       • Errors appear from 8 msec
     ☞ For acceleration, avoid the GIL and Python object access
     ☞ For no errors, finish within one tslice or apply GIL-safe operations (see the sketch below)
     [Diagram: Thread1/Thread2 alternately holding the GIL every tslice]
     Background

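     As a side note (not on the slide): the 5 msec tslice is CPython's thread switch interval, readable with sys.getswitchinterval(). The minimal sketch below just makes the add/sub race from slide 7 runnable to show the lost updates.

         import sys
         import threading

         print(sys.getswitchinterval())   # 0.005 sec by default: the "tslice" above

         g = 0

         def add(n):
             global g
             for _ in range(n):
                 g += 1                   # read-modify-write: not atomic under the GIL

         def sub(n):
             global g
             for _ in range(n):
                 g -= 1

         n = 10**6
         t1 = threading.Thread(target=add, args=(n,))
         t2 = threading.Thread(target=sub, args=(n,))
         t1.start(); t2.start()
         t1.join(); t2.join()
         print(g)                         # often non-zero once the loops outlast one tslice
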
  10. How to Avoid GIL

     ▪ CPython extension:
       1. Copy Python objects to C data
       2. Apply the "Py_BEGIN_ALLOW_THREADS" C macro
       3. Execute C-native code or OTF code
       4. Apply the "Py_END_ALLOW_THREADS" C macro
       5. Copy C data back to Python objects
     [Diagram: Thread1/Thread2; CPython holds the GIL, C-native code runs with the GIL released between copy-in and copy-out]
     Background

  11. Monte Carlo Π Calculation

     H hits in the circle out of N random shots at the enclosing square: π = 4·H/N ✈WikiPi-2 ✈LLNL; Error/π = a·N^b ✈WikiPi-1

     Python:
         import random

         def pin(n):
             h = 0
             for _ in range(n):
                 x = random.random()
                 y = random.random()
                 r2 = x*x + y*y
                 if r2 <= 1.:
                     h += 1
             return 4. * h / n

     C:
         double pin(int n) {
             unsigned int s = time(NULL);
             int h = 0;
             for (int i = 0; i < n; ++i) {
                 double x = (double)((double)rand_r(&s) / (double)RAND_MAX);
                 double y = (double)((double)rand_r(&s) / (double)RAND_MAX);
                 double r2 = x*x + y*y;
                 if (r2 <= 1.) h += 1;
             }
             return 4. * (double)h / (double)n;
         }
     Background

  12. Multi-Threaded Π Calculation

     ▪ Original: pin(n) gets h hits in n shots ➡ 4·h/n
     ▪ m-thread version: pinm(n, m) launches m workers, each taking n/m shots: h1, h2, .., hm (Map); h = sum(h1, h2, .., hm) (Reduce); result 4·h/n (see the sketch below)
     Background

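     A minimal pure-Python sketch of the map/reduce split described above, using concurrent.futures; pinm is the slide's name, the rest is illustrative. This plain-Python version stays GIL-bound, so it only shows the structure; the accelerated variants appear later and in the Appendix.

         from concurrent.futures import ThreadPoolExecutor
         import random

         def pin(n):
             h = 0
             for _ in range(n):
                 x, y = random.random(), random.random()
                 if x*x + y*y <= 1.:
                     h += 1
             return h

         def pinm(n, m):
             cnt = max(n // m, 1)
             with ThreadPoolExecutor(max_workers=m) as tpe:
                 hits = [f.result() for f in [tpe.submit(pin, cnt) for _ in range(m)]]  # Map
             return 4. * sum(hits) / (cnt * m)                                          # Reduce

         print(pinm(10**5, 4))
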
  13. Π @CPython

     ▪ Easy operation (good tools and official documents)
     ▪ Requires re-running setup at each release of CPython

     Usage:
         import cif
         pi = cif.pin(n)
     Build: python setup.py • • • ➡ dist/cif- • • • .egg (compile, link, deploy, test)
     cifmodule.c:
         #include <Python.h>
         static PyObject * pin( • • • ){
             Py_BEGIN_ALLOW_THREADS
             for (int i = 0; i < n; ++i){ • • •
             Py_END_ALLOW_THREADS
             • • •
     setup.py:
         import setuptools
         setup( • • • )
     cif_test.py:
         import unittest
         import cif
     Background

  14. Effects of Threads and Cores

     ☞ Speedup ≈ min(N, M)✕ minus overhead, where N = # of threads, M = # of real cores
     [Figure: relative speed vs. # of threads, measured vs. ideal; real cores on the left, hyper-thread cores on the right]
     Background

  15. Hard to Develop Threading

     ▪ Design issues: out of scope of this talk
     ▪ An issue in this trial: rand_r vs. random_r
       • rand_r: low randomness, ideal speed-up ➡ selected
       • random_r: good randomness, slows down under threading
       • random_r is slower when threaded ✈stackoverflow
       • The C standard gives no clear speed specification for multi-threading ✈open-std
       • 80 stdlib functions are not thread-safe ✈opengroup
         • Not thread-safe: rand, random, drand48, lrand48, mrand48
       • "more standardization for compilers, users, and libraries ..activation of threads" Shameem, p.291, Multi-Core Programming ✈Intel-Press
     ☞ Check the speeds of the official thread-safe functions
     [Figures: Π error vs. # shots for rand_r (≈0.0001) and random_r (≈-0.005); time vs. shots for two threads vs. no thread]
     Background

  16. NumPy Speedup

     ▪ Converting to NumPy: 7.7✕
       • Vectorize: move "for loops" into functions
       • NumPy vector/matrix functions are compiled C code
     ▪ Not only numeric calculation: count_nonzero; less_equal, less, ..; sort, lexsort, ..; where, searchsorted; I/O

     Python:
         import random

         def pin(n):
             h = 0
             for _ in range(n):
                 x = random.random()
                 y = random.random()
                 r2 = x*x + y*y
                 if r2 <= 1.:
                     h += 1
             return 4. * h / n

     NumPy:
         import numpy as np

         def np_pi(n):
             x = np.random.rand(n).astype(np.float64)
             y = np.random.rand(n).astype(np.float64)
             rs = np.add(np.multiply(x, x, dtype=np.float64),
                         np.multiply(y, y, dtype=np.float64), dtype=np.float64)
             ones = np.ones(n, dtype=np.float64)
             lss = np.less_equal(rs, ones)
             hit = np.count_nonzero(lss)
             pi = np.float64(4.) * np.float64(hit) / np.float64(n)
             return pi
     Background

  17. Summary

     ➊ Avoid the GIL to speed up
     ➋ Apply GIL-safe operations for thread safety
     ➌ min(N, M) acceleration
     [Figure: time vs. shots: Python, NumPy, CPythonExt, CPythonExt@Thread]
     Background

  18. Numba (V0.38.0) ✈Official

     • Background
     • Accelerates on: CPU, M-core, CUDA (SSE, AVX, AVX2, AVX-512)
     • @numba.jit: Just-in-Time compile
     ▪ Few user's guides ✈Conda2018Slide
     ▪ An excellent review ✈Matthew Rocklin
     ▪ Supported by Anaconda, Inc. and the Gordon and Betty Moore Foundation
     ▪ GPU version free since end of 2017
     ▪ Requires: mkl, mkl_fft, mkl_random, ncurses, llvmlite
     ▪ CUDA 2.0 or above

  19. Outline

     ▪ Two different systems:
       CPU:
           @numba.jit(• • )
           def pin(n):
               • • •
               return out
           pi = pin(100)
       CUDA:
           @numba.cuda.jit(• • )
           def pin(n, out):
               • • •
               # (no return)
           pin[25, 40](100, pi)
     • CPU: many Python codes supported ✈Official ✈NumPy
       ▼ Language: all except class, try, except, with, yield
       ▼ Function calls: inner, closure, recursive
       ▼ Built-ins: abs() bool complex divmod() enumerate() float int iter() len() min() max() next() print() range round() sorted() type() zip()
       ▼ NumPy: all() any() argmax() argmin() cumprod() cumsum() max() mean() min() nonzero() prod() std() take() var() argsort() astype() copy() flatten() item() itemset() ravel() reshape() sort() sum() transpose() view()
       ▼ Modules: array, cmath, collections, ctypes, enum, math, operator, functools, random, cffi
     • CUDA: kernel codes
       • NumPy: not supported
     Numba

  20. On CPU

     ▪ @numba.jit() compiles/executes compatible Python code to LLVM (sketch below)
       • Applied to the Python π ✍ ➡ 21✕; cf. manual conversion to a CPython extension ✍ ➡ 23✕
         ☞ Comparable speed to manually converted C
       • Applied to the NumPy π ✍ ➡ 1✕
         ☞ NumPy functions are not accelerated; cf. Python to NumPy ➡ 7.7✕
         ☞ Jit ≈ 21/7.7 ≈ 3✕ of the NumPy-function version
       ▼ Numba: Python ➡ LLVM ➡ Python
       ▼ NumPy: (Python ➡ C ➡ Python) ✕ repeat
     Numba

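     A minimal sketch (assuming Numba is installed) of applying @numba.jit to the Python π referenced above; the function name is illustrative.

         import random
         import numba

         @numba.jit()                     # compile the loop via LLVM on first call
         def nba_pin(n):
             h = 0
             for _ in range(n):
                 x = random.random()
                 y = random.random()
                 if x*x + y*y <= 1.:
                     h += 1
             return 4. * h / n

         nba_pin(10)                      # first call pays the compile cost
         print(nba_pin(10**6))
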
  21. Accelerate NumPy Indexing

     ▪ Jit of NumPy indexing ➡ 817✕, actually 100✕ ✈Murillo
       • "for loop" and a function-vector operation on List and NdArray, by native and Jit:

             def for_add(n, vs):
                 for i in range(n):
                     vs[i] += 1

             def np_add(n, vs):
                 a = np.add(vs, 1)

     [Figure: time vs. shots: ForNdArray, ForList, NpAddList, JitForList, NpAddNdArray, JitForNdArray]
     ▼ NdArray indexing is 3.8✕ slower than List ✈stackoverflow
     ▼ Indexing requires setup calculations and branches in the main loop
     ▼ np.add(NdArray) is 100✕ faster than np.add(List)
     Numba

  22. On M-Core

     ▪ All cores working: ➊ set @jit(parallel=True) ➋ change "range" to "numba.prange" (sketch below)
       • Applied to the Python π ➡ 89✕ ➡ 4.4✕ of @jit()
       • No way to control the # of cores
         ▼ Multi-user/multi-process setups need core assignment
     ▪ @jit(nogil=True) + ThreadPoolExecutor gives that control ✍
     [Figure: relative speed vs. # of threads, measured vs. ideal; real cores vs. hyper-thread cores]
     Numba

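     A minimal sketch of the parallel=True / prange recipe above applied to the Python π; it assumes a Numba version with parallel support and relies on Numba's automatic reduction of the scalar hit counter.

         import random
         import numba

         @numba.jit(nopython=True, parallel=True)
         def nba_pin_par(n):
             h = 0
             for _ in numba.prange(n):    # prange distributes iterations over all cores
                 x = random.random()
                 y = random.random()
                 if x*x + y*y <= 1.:
                     h += 1               # scalar reduction handled by Numba
             return 4. * h / n

         print(nba_pin_par(10**7))
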
  23. On CUDA

     ▪ Non-compatible Python code (details are out of scope)
       • CUDA kernel code goes in the definitions
         ▼ Python-like, not the C of PyCuda
       • Insert "[#blocks, #threads]" in the calls, e.g. pin[25, 40](n)
       • Rewriting π ✍ ➡ 1160✕ ➡ 152✕ of NumPy
         ▼ Using the 2nd run; the 1st includes 1.8 sec of compile/load time
     [Figure: time vs. shots: CUDA 1st run vs. CUDA 2nd run; overhead visible]
     Numba

  24. Summary

     ➊ Convert to Nogil functions
     ➋ Accelerate "for/while" loops
     ➌ Improve NumPy indexing
     [Figure: time vs. shots: Python, CPU, Nogil@Thread, Parallel, CUDA]
     Numba

  25. Machine Learning Packages

     • NumPy accelerators • Kernel-less CUDA access • Tensor objects • Poor documents ➡ my own reading?
     ➊ TensorFlow (V1.9) ✈Official: CPU, CUDA, (TPU, ROCm, Phi); own SIMD + (SSE, AVX, AVX2, AVX-512)
     ➋ PyTorch (V0.4.11) ✈Official: CUDA
     ➌ CuPy (V4.1.0) (Chainer) ✈Official: CUDA

  26. Exec Modes

     ▪ TensorFlow (tf): (CPU, CUDA) ✕ (Eager, Graph) = 4
       • Eager: Python is a direct executor of ordinary actions
       • Graph: Python is a macro generator for computing graphs
       • Eager if the first code is tf.enable_eager_execution(), else Graph
       • Two pip packages: CPU and GPU (= GPU + CPU)
         Implicit: the package sets the default device; Explicit: "with tf.device('/cpu:0'):" block
     ▪ PyTorch (torch, pt): [CPU], CUDA = 2 (NN graph)
       • torch.func(.., device=D, ..) with D = device('cuda') or D = device('cpu')
       • Implicit: auto-decided from the operands ➡ fast
       • Explicit-2: torch.func(..).cuda() ➡ slow
     ▪ CuPy (cp): CUDA = 1 (NN graph)
       • CUDA only; use NumPy for CPU
     (sketch below)
     ML Packages

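     A minimal sketch of the three device-selection styles summarized above; it assumes TensorFlow 1.x, PyTorch ≥ 0.4, and CuPy with a CUDA device present, and is illustrative rather than the deck's benchmark code.

         # TensorFlow 1.x: choose Eager vs. Graph, and place ops explicitly
         import tensorflow as tf
         tf.enable_eager_execution()          # must be the first TF call; omit it for Graph mode
         with tf.device('/cpu:0'):            # explicit placement; default comes from the pip package
             x = tf.random_uniform([4], dtype=tf.float64)

         # PyTorch 0.4+: pass a device per call, or move tensors afterwards
         import torch
         dev = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
         y = torch.rand(4, device=dev)        # implicit/explicit per call

         # CuPy: CUDA only; the NumPy-compatible API runs on the GPU
         import cupy as cp
         z = cp.random.rand(4, dtype=cp.float64)
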
  27. CUDA

     ▪ TensorFlow Eager ✍: ➊ np. ➡ tf. ➋ change some function names ➌ add "tf.cast" to some functions ➍ select the CUDA environment
     ▪ PyTorch ✍ / CuPy ✍: ➊ np. ➡ pt. / cp. ➋ change some function names / no change ➌ add "device" options / no ➍ set the global device type / no
     ▪ TensorFlow Graph ✍: ➊ create "tf.placeholder" inputs ➋ run a function with the inputs
     ▪ TensorFlow CPU: execute the same code in the CPU environment
     [Figure: time vs. shots: TfEager, TfGraph, CuPy, PyTorch]
     ML Packages

  28. CPU

     ▪ TensorFlow ✍: 2.4✕, 3.8✕; 8 cores running SIMD?
     ▪ PyTorch ✍: 0.7✕; useful for CUDA-less develop/debug
     [Figure: time vs. shots: PyTorch, NumPy, Tf, TfGraph]
     ▪ TensorFlow Eager is in progress; more functional and faster?
     [Figure: time vs. shots: CondaEnv, VirtualEnv, CondaMkl]
       • V1.5 @ Jan. 2018: contribution version ✈
       • V1.7: moving out of contribution
       • V1.8: SSE, AVX link
       • V1.9 @ Aug.: Conda links Intel MKL ✈Conda (MKL: Math Kernel Library: BLAS, LAPACK, ScaLAPACK, FFT, NN, ..) ✈Intel
       • V?: contribution AutoGraph ✈GitHub
     ML Packages

  29. TensorFlow Graph

     ▪ Advanced computing graph
       • While, Branch, Parallel, Reduce, Scatter, etc. in CUDA
       • Concurrent main-memory accesses from CUDA and CPUs
         ▼ Written with non-portable special control functions, not Python (a macro language)
         ▼ Hard to understand the functions, but contrib.AutoGraph converts "for, if, .." to Graph
       • Slower than PyTorch in the π calculation
       • 1000-iteration While @CUDA ✍ • 10-way Parallel @CUDA ✍
     [Figures: time vs. shots: TfWhile vs. CuPy vs. PyTorch; TfPara vs. CuPy vs. PyTorch]
     ML Packages

  30. Overhead (OH)

     ▪ Negligible OHs for heavy functions such as fft, cv, solvers, etc.
       • TensorFlow: tf.(linalg, math, image, distributions, sets, strings); tf.contrib.(linalg, integrate, image, ffmpeg, signal, timeseries)
       • CuPy: linalg, math, fft
     ▪ Prediction of array-fetching OHs in ordinary cases (RNG: random number generator)
       • NumPy (CPU): np.RNG(n) ➡ xs; xs[0] ➡ x
       ➊ CuPy Array (1/16✕): cp.RNG(n) ➡ xs on CUDA; cp.asnumpy ➡ nd on CPU; nd[0] ➡ x
       ➋ CuPy Scalar: cp.RNG(n) ➡ xs on CUDA; xs[0] ➡ scalar on CUDA; cp.asnumpy ➡ x on CPU
     [Figures: time vs. shots: NumPy, Array, Scalar]
       ▼ Transfer time from CUDA to CPU ▼ Jump caused by cache?
     ML Packages

  31. All Overheads

     ▪ Accelerating a function "r = f(a1, a2)"
       • NumPy (CPU only): a1 ➡ p1, a2 ➡ p2, • • •, r ➡ return
       • Accelerator (CPU + Acc.): a1 ➡ p1 (copy in), a2 ➡ p2 (copy in), • • •, r (copy out) ➡ return
     [Figures: time vs. shots: copy-in (Tf@Cpu, Tf@Gpu, PyTorch, CuPy); copy-out (Tf@Gpu, CuPy, PyTorch, NumPy, Tf@Cpu); minus the NumPy copy; equals net copy (Tf@Gpu, CuPy, PyTorch, Tf@Cpu)]
     ML Packages

  32. Tensor

     ▪ Bridge between CPU and accelerator?
     [Diagram: CPU ⇄ Tensor (copy in / copy out / others / buffer-cache) ⇄ Accelerator]
     • copy in (create a tensor object from an NdArray nd):
         TensorFlow: convert_to_tensor(nd)   PyTorch: tensor(nd)   CuPy: array(nd)
     • copy out (convert a tensor object t_obj to an NdArray):
         TensorFlow: t_obj.numpy()   PyTorch: t_obj.cpu().numpy()   CuPy: asnumpy(t_obj)
     • Others: ▼ neural-network functions ▼ MM-direct: scatter read/write ▼ "if", "while"
     • Buffer/Cache ✈PyTorch ✈DlPack ▼ not stored in CPU memory, cf. NumPy functions ▼ • • •
     (sketch below)
     ML Packages

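     A minimal sketch of the copy-in/copy-out table above, assuming eager execution for TensorFlow 1.x and a CUDA device for CuPy.

         import numpy as np
         import tensorflow as tf
         import torch
         import cupy as cp

         nd = np.random.rand(4).astype(np.float64)

         # copy in: NdArray -> tensor object
         t_tf = tf.convert_to_tensor(nd)
         t_pt = torch.tensor(nd)
         t_cp = cp.array(nd)

         # copy out: tensor object -> NdArray
         back_tf = t_tf.numpy()           # needs eager execution in TF 1.x
         back_pt = t_pt.cpu().numpy()
         back_cp = cp.asnumpy(t_cp)
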
  33. Summary

     ➊ CuPy: NumPy-compatible CUDA
     ☞ TensorFlow: CPU-SIMD/CUDA/.., application modules
     ☞ PyTorch: debugging on CPU
     ☞ Consider the copy-in/out overhead
     [Figure: time vs. shots: NumPy, Tf@Cpu, Tf@Gpu, CuPy@Gpu, TfWhile@Gpu, PyTorch@Gpu]
     ML Packages

  34. Dask (V0.18.0) ✈Official

     • Background • "Delayed": a simple graph for threading
     ▪ PyData's answer to columnar (Col.) processing without the main-memory (MM) limit: Hadoop ➡ Arrow ➡ 7 systems + Pandas ✈Official ✈McKinney; HANA (SAP), RevolutionR (MS) ✈Official
     ▪ Conda + DARPA, NSF, Gordon Moore Found., HHMI
     ▪ Expands NumPy, Pandas, Scikit-Learn
     ▪ Parallel computing: • Process: Futures • Thread: Delayed

  35. Graph for Multi-Threading

     ▪ "delayed" defines the nodes of the parallel computation ✍

         cnt = int(n / m)
         ps = []
         for _ in range(m):                      # m threads
             p = dask.delayed(get_pi)(cnt)
             ps.append(p)
         mn = dask.delayed(np.mean)(ps)
         pi = mn.compute()                       # execute the graph

     • mn.visualize() at m=3 shows the graph ① ② ③
     ▪ Applied to all the get_pi functions with m=3
     Dask

  36. The Results at 3 Threads

     ▪ NumPy shows little improvement [NumPy vs. Dask]
       • Nogil ufuncs ✈HP affect acceleration
         ▼ Short intervals of "add, multiply, less_equal"
     ▪ No-GIL functions show good improvement [CPython vs. Dask; NogilPy vs. Dask]
       • ThreadPoolExecutor showed: ▼ 3✕ at CPython ▼ 3✕ at NogilPy
     ▪ The others show no improvement [Python, Python@Jit, TfCpu, CuPy, each vs. Dask]; CuPy may have nogil functions
     Dask

  37. Delayed vs ThreadPoolExecutor

     ▪ NogilPy with ThreadPool shows lower launch cost and higher speed
     [Figures: time vs. shots for T1-T7 threads, Delayed vs. ThreadPool; relative slope and relative speed vs. # threads, measured vs. ideal]
     Dask

  38. Summary

     ➊ No guide about GIL safety: it only prohibits "+=, -=", without reasoning
     ➋ Large overheads for the π calculation
     ▪ A tool for Dask's own components?
     ▪ Too early to evaluate
     ➊ NumPy has Nogil functions
     ➋ CuPy may have Nogil functions
       • PyTorch froze • TensorFlow@CPU gave a segmentation fault
     [Figure: time vs. shots: NumPy, NumPy@Thread, NogilPy@Thread]
     Dask

  39. Threading and Nogil

     • ThreadPoolExecutor
     ➊ Confirm the Nogil-ness of CuPy
     ➋ GIL-Safe prediction
     ➌ Nogil-forced NumPy

  40. NumPy vs CuPy

     ▪ NumPy partial-Nogil, CuPy full-Nogil?
     [Figures: time vs. shots for T1-T8 threads, NumPy and CuPy; relative speed and relative slope vs. # threads, measured vs. ideal]
     Threading and Nogil

  41. Confirm CuPy

     ▪ Error/π = a·N^b ✈WikiPi-1
     [Figure: absolute relative error vs. N [shots], 1 loop: CuPy@T8 data vs. the NumPy error curve]
     • CuPy at 8 threads ▼ thread-safe RNG ▼ parallel execution in CUDA
     • NumPy at 8 threads ▼ GIL error caused by:

           h = 0
           for v in lss:
               if v == 1:
                   h = h + 1        # note: not +=

     Threading and Nogil

  42. GIL-Safe Prediction

     ▪ Almost impossible to predict GIL safety: local functions show safe or not non-deterministically ✍

         def rng_count(n):
             x = np.random.rand(n)
             ones = np.ones(n)
             c = np.count_nonzero(ones)
             return c                   # expect n == c

         def count(n):
             ones = np.ones(n)
             c = np.count_nonzero(ones)
             return c                   # expect n == c

     • count: 14 errors; no error @T2,3,4 on the test bench; no error on Intel Atom ✍
     • rng_count: no error
     ☞ Apply forced-Nogil functions
     [Figures: time and error counts vs. N, 1 loop: count per thread count; rng_count vs. count]
     Threading and Nogil

  43. Numba JIT Options

     ▪ Set nopython=True for a nogil guarantee?
       • Local objects are stored in heap storage whose accesses should be protected by mutexes.
       • In practice the heap-storage accesses are controlled by GIL block intervals, not by per-access mutexes.
       Guaranteed: @jit(nogil=True, nopython=True) (sketch below)
       Non-guaranteed: @jit(nogil=True, nopython=False)
     [Diagram: Thread-1/2/3; variables and namespaces vs. LLVM objects; releasing and catching the GIL at the entry to the object manager over the Python heap storage (Obj-1 .. Obj-n, all accesses)]
     Threading and Nogil

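     A minimal sketch of the guaranteed combination, mirroring the Appendix code: the π kernel compiled with nogil=True, nopython=True and driven by a ThreadPoolExecutor.

         import random
         import numba
         from concurrent.futures import ThreadPoolExecutor

         @numba.jit(nogil=True, nopython=True)    # nopython keeps the compiled body off the Python heap
         def pin_nogil(n):
             h = 0
             for _ in range(n):
                 x = random.random()
                 y = random.random()
                 if x*x + y*y <= 1.:
                     h += 1
             return 4. * h / n

         tpe = ThreadPoolExecutor(max_workers=4)
         futs = [tpe.submit(pin_nogil, 10**6 // 4) for _ in range(4)]
         print(sum(f.result() for f in futs) / 4)
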
  44. Nogil NumPy by Numba

     ▪ Some NumPy functions require rewriting for guaranteed Nogil (sketch below):

         h = count_nonzero(lss)      # original
                                     # rewritten:
         h = 0
         for v in lss:
             if v == 1:
                 h = h + 1

     • Rewriting alone slows down to 0.02✕ • Numba speeds it up 1.6✕ • 6 threads speed it up 3.2✕ ➡ 5✕ of the original
     [Figures: time vs. shots: rewritten vs. original vs. threaded; relative speed vs. # threads, measured vs. ideal]
     Threading and Nogil

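     A minimal sketch of what the count_nonzero rewrite might look like when only the counting kernel is compiled; the function names are illustrative, not the author's, and the random arrays are still generated by plain NumPy outside the jitted code.

         import numpy as np
         import numba

         @numba.jit(nogil=True, nopython=True)
         def hits(xs, ys):
             # count_nonzero(xs*xs + ys*ys <= 1.) rewritten as an explicit loop
             h = 0
             for i in range(xs.shape[0]):
                 if xs[i]*xs[i] + ys[i]*ys[i] <= 1.:
                     h = h + 1
             return h

         def np_pi_nogil(n):
             xs = np.random.rand(n)
             ys = np.random.rand(n)
             return 4. * hits(xs, ys) / n

         print(np_pi_nogil(10**6))
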
  45. Summary

     ➊ Apply Nogil functions for thread safety
       ▪ Set nopython=True together with nogil=True in numba.jit
     ➋ Almost impossible to predict GIL safety
     ➌ CuPy parallel execution in CUDA?
     [Figure: time vs. shots: NumPy, NogilNumPy]
     Threading and Nogil

  46. Conclusion

     Execution Time Confirmation (ETC) on run-time signatures showed:
     ➊ Ideal threading acceleration = min(N, M)
     ➋ A comparison of On-The-Fly packages: • Numba • TensorFlow • PyTorch • CuPy • Dask
     ➌ Basic issues and solutions: • GIL • Nogil • GIL-Safe • Threading • Graph • NumPy indexing • Copy overhead
     Enjoy On-The-Fly in your own way ✍

  47. Appendix: MIT License

     Copyright (c) 2018 Yukio Okuda

     Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

     The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

     THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
     Appendix

  48. Numba

     Nogil:
         import numba
         import random
         from concurrent.futures import ThreadPoolExecutor

         @numba.jit(nogil=True, nopython=True)
         def nba_pi_nogil(n):
             hit = 0
             for _ in range(n):
                 x = random.random()
                 y = random.random()
                 r = x*x + y*y
                 if r <= 1.:
                     hit += 1
             return 4. * hit / n

         tpe = ThreadPoolExecutor(max_workers=12)

         def nba_pi_nogil_tp_nm(n, m):
             global tpe
             cnt = int(n / m)
             if cnt < 1:
                 cnt = 1
             ans = []
             for i in range(m):
                 ans.append(tpe.submit(nba_pi_nogil, cnt))
             pi = 0.
             for f in ans:
                 pi += f.result()
             return pi / m

         print('Test', nba_pi_nogil_tp_nm(10**5, 4))

     CUDA:
         import numba
         import numpy as np
         from numba.cuda.random import xoroshiro128p_uniform_float64
         from numba.cuda.random import create_xoroshiro128p_states

         @numba.cuda.jit()
         def nba_cuda(n, pi, rng):
             thread_id = numba.cuda.grid(1)
             hit = 0
             for _ in range(n):
                 x = xoroshiro128p_uniform_float64(rng, thread_id)
                 y = xoroshiro128p_uniform_float64(rng, thread_id)
                 r = x*x + y*y
                 if r <= 1.:
                     hit += 1
             pi[thread_id] = 4. * hit / n

         def nba_cuda_rec(n):
             threads_per_block = 25
             blocks = 40
             rng_states = create_xoroshiro128p_states(threads_per_block * blocks, seed=1)
             pis = np.ones(threads_per_block * blocks, dtype=np.float64)
             nba_cuda[blocks, threads_per_block](n, pis, rng_states)
             return pis.mean()

         print('Test', nba_cuda_rec(10**5))
     Appendix

  49. ML: TfEager, PyTorch, CuPy

     TensorFlow-CPU/CUDA-Eager:
         import tensorflow as tf
         tf.contrib.eager.enable_eager_execution()
         # tf.enable_eager_execution()

         def tf_pi_n(n):
             xs = tf.random_uniform(shape=[n], minval=0., maxval=1., dtype=tf.float64)
             ys = tf.random_uniform(shape=[n], minval=0., maxval=1., dtype=tf.float64)
             rs = tf.add(tf.multiply(xs, xs), tf.multiply(ys, ys))
             ones = tf.ones([n], dtype=tf.float64)
             lss = tf.less_equal(rs, ones)
             hit = tf.count_nonzero(lss)
             pi = tf.divide(tf.multiply(tf.cast(4., tf.float64), tf.cast(hit, tf.float64)),
                            tf.cast(n, tf.float64))
             return pi.numpy()

         print('Test', tf_pi_n(10**5))

     CuPy-CUDA:
         import cupy as cp
         import numpy as np

         def cp_pi_gpu(n):
             x = cp.random.rand(n, dtype=cp.float64)
             y = cp.random.rand(n, dtype=cp.float64)
             rs = cp.add(cp.multiply(x, x, dtype=np.float64),
                         cp.multiply(y, y, dtype=np.float64), dtype=np.float64)
             ones = cp.ones(n, dtype=cp.float64)
             lss = cp.less_equal(rs, ones)
             hit = cp.count_nonzero(lss)
             return 4. * hit / n           # (truncated on the slide; completed as in the other versions)

     PyTorch-CPU:
         import torch
         torch.set_default_dtype(torch.float64)

         def pt_pi_cpu(n):
             x = torch.rand(n, dtype=torch.float64)
             y = torch.rand(n, dtype=torch.float64)
             rs = torch.add(torch.mul(x, x), torch.mul(y, y))
             ones = torch.ones(n, dtype=torch.float64)
             lss = torch.le(rs, ones)
             hit = torch.nonzero(lss).size()[0]
             pi = 4. * hit / n
             return pi

         print('Test', pt_pi_cpu(10**5))

     PyTorch-CUDA:
         import torch
         torch.set_default_dtype(torch.float64)
         DEVICE = torch.device('cuda')

         def pt_pi_gpu_all(n):
             x = torch.rand(n, device=DEVICE)
             y = torch.rand(n, device=DEVICE)
             rs = torch.add(torch.mul(x, x), torch.mul(y, y))
             ones = torch.ones(n, device=DEVICE)
             lss = torch.le(rs, ones)
             hit = torch.nonzero(lss).size()[0]
             return 4. * hit / n

         print('Test', pt_pi_gpu_all(10**5))
     Appendix

  50. ML: TfGraph

     TensorFlow-Simple Graph:
         import tensorflow as tf

         def tf_pi_n(n):
             xs = tf.random_uniform(shape=[n], minval=0., maxval=1., dtype=tf.float64)
             ys = tf.random_uniform(shape=[n], minval=0., maxval=1., dtype=tf.float64)
             rs = tf.add(tf.multiply(xs, xs), tf.multiply(ys, ys))
             ones = tf.ones([n], dtype=tf.float64)
             lss = tf.less_equal(rs, ones)
             hit = tf.count_nonzero(lss)
             pi = tf.divide(tf.multiply(tf.cast(4., tf.float64), tf.cast(hit, tf.float64)),
                            tf.cast(n, tf.float64))
             return pi

         tf_n = tf.placeholder(tf.int32, [], name='n')
         tf_graph = tf_pi_n(tf_n)
         session = tf.Session()
         session.run(tf.global_variables_initializer())

         def get_pi(n):
             pi = session.run(tf_graph, feed_dict={tf_n: n})
             return pi

         if __name__ == "__main__":
             print('Test', get_pi(10**5))

     TensorFlow-While Graph:
         import tensorflow as tf
         from tf_graph_simple import tf_pi_n

         def tf_graph_pi_n_while_sub(i, n, pis):
             pis = tf.add(pis, tf_pi_n(n))
             return pis

         def tf_graph_pi_n_while(n, loop):
             i = tf.constant(0)
             pis = tf.constant(0., dtype=tf.float64)
             i, pis = tf.while_loop(
                 lambda i, pis: tf.less(i, loop),
                 lambda i, pis: (tf.add(i, 1), tf_graph_pi_n_while_sub(i, n, pis)),
                 [i, pis])
             pi = tf.divide(pis, tf.cast(loop, tf.float64))
             return pi

         tf_n = tf.placeholder(tf.int32, [], name='n')
         tf_loop = tf.placeholder(tf.int32, [], name='loop')
         tf_graph_while = tf_graph_pi_n_while(tf_n, tf_loop)
         session = tf.Session()
         session.run(tf.global_variables_initializer())

         def get_pi(n):
             pi = session.run(tf_graph_while, feed_dict={tf_n: n, tf_loop: 1000})
             return pi

         print('Test', get_pi(10**5))
     Appendix

  51. ML: TfGraph, Dask

     TensorFlow-Parallel Graph:
         import tensorflow as tf

         M = 10
         m = tf.placeholder(tf.int32, [], name='m')
         n = tf.placeholder(tf.int32, [], name='n')
         step = tf.cast(tf.divide(n, m), dtype=tf.int32)
         hit = tf.zeros([], dtype=tf.int64, name='hit')
         for _ in range(M):
             xs = tf.random_uniform(shape=[step], minval=0., maxval=1., dtype=tf.float64)
             ys = tf.random_uniform(shape=[step], minval=0., maxval=1., dtype=tf.float64)
             rs = tf.add(tf.multiply(xs, xs), tf.multiply(ys, ys))
             ones = tf.ones([step], dtype=tf.float64)
             lss = tf.less_equal(rs, ones)
             hit = tf.add(hit, tf.count_nonzero(lss, dtype=tf.int64))
         pi = tf.divide(tf.multiply(tf.cast(4., tf.float64), tf.cast(hit, tf.float64)),
                        tf.cast(n, tf.float64))
         ans = pi
         session = tf.Session()
         session.run(tf.global_variables_initializer())

         def get_pi(in_n, in_m):
             pi = session.run(ans, feed_dict={n: in_n, m: in_m})
             return pi

         print('Test', get_pi(10**5, 10))

     Dask-Numba:
         import numpy as np
         import random
         import dask
         import numba

         @numba.jit(nogil=True)
         def get_pi(n):
             hit = 0
             for _ in range(n):
                 x = random.random()
                 y = random.random()
                 r = x*x + y*y
                 if r <= 1.:
                     hit += 1
             return 4. * hit / n

         def dsk_nba_pi_nogil(n, m, v=False):
             cnt = int(n / m)
             ps = []
             for _ in range(m):
                 p = dask.delayed(get_pi)(cnt)
                 ps.append(p)
             mn = dask.delayed(np.mean)(ps)
             if v:
                 mn.visualize(optimize_graph=True)
                 pi = 0
             else:
                 pi = mn.compute()
             return pi

         # visualize() requires python-graphviz and the Graphviz utility;
         # it generates ./mydask.png
         # dsk_nba_pi_nogil(10**5, 3, v=True)
         print('Test', dsk_nba_pi_nogil(10**5, 3))
     Appendix

  52. Miscellaneous

     GIL-Safe:
         import numpy as np
         from concurrent.futures import ThreadPoolExecutor

         tpe = ThreadPoolExecutor(max_workers=25)

         def rng_count(n):
             x = np.random.rand(n).astype(np.float64)
             ones = np.ones(n, dtype=np.float64)
             c = np.count_nonzero(ones)
             return c

         def count(n):
             ones = np.ones(n, dtype=np.float64)
             c = np.count_nonzero(ones)
             return c

         def tpe_pi_nm_min(n, m, f):
             global tpe
             ts = []
             for i in range(m):
                 ts.append(tpe.submit(f, n))
             pis = []
             for t in ts:
                 pis.append(t.result())
             return min(pis)

         for n in (7*10**6, 8*10**6, 9*10**6, 10**7):
             c = tpe_pi_nm_min(n, 9, count)
             print("count:", n == c, n, c)
             c = tpe_pi_nm_min(n, 9, rng_count)
             print("rng_count:", n == c, n, c)

     GIL-Safe-Note:
         Results of print depend on the executing machine.
         Benchmark machine:
             count:     False 7000000  34302
             rng_count: True  7000000  7000000
             count:     False 8000000  10750
             rng_count: True  8000000  8000000
             count:     False 9000000  525822
             rng_count: True  9000000  9000000
             count:     False 10000000 455166
             rng_count: True  10000000 10000000
         Intel-Atom N3150 @ 1.60GHz, 4 cores, no Hyper-Thread, stepping=3: all True!!
     Appendix
