Accelerating CPU-bound Python code with on-the-fly (OTF) packages on multi-core CPUs and GPGPUs.
Thread safety and GIL safety are discussed for error-free coding.
Tags: PyConJP Python Numba TensorFlow Dask PyTorch CuPy GIL-Error GIL-Safe multithreading
Fast Execution
• Compiled languages: CPython extensions
• No GIL: Cython, PyPy, Jython, IronPython, ..
• Device-dependent: OpenMP, OpenACC, PyCuda
Fast Development • Compatibility • Portability
• On-The-Fly (OTF) packages
PDF in clouds. Codes in Appendix: ✍ Links: ✈GIL Introduction
PyConJP2018/9 Y. Okuda
1000-loop Monte Carlo π calculation. Note: very little data transfer; no tuning inside the packages.
[Figure: Time [sec] vs. Shots (lower is better) for Python, NumPy, NumPyTf, PythonNumba, CPythonExt, NumPyTf@Gpu, PythonNumba-Thread, TfGraphWhile@Gpu, PyTorchCuPy@Gpu]
• Thread: 1× of Series ▼ (25% down @ TurboOn) • Process: 1.8× of Series
▪ Launch time: • Thread ≈ zero • Process ≈ 6 msec each
[Figures AddSerProThrHigh / AddSerProThrLow: Time [sec] vs. Shots for Thread, Series, Process]
Background
A GIL-error demonstration: two threads increment and decrement a shared global counter; any non-zero final value is an error.

    from threading import Thread

    g = None

    def add(n):
        global g
        for _ in range(n):
            g += 1

    def sub(n):
        global g
        for _ in range(n):
            g -= 1

    def add_sub(n):
        global g
        g = 0
        t1 = Thread(target=add, args=(n,))
        t2 = Thread(target=sub, args=(n,))
        t1.start(); t2.start()
        t1.join(); t2.join()
        return g

    for n in [ .. ]:
        gs = []
        for _ in range(1000):
            gs.append(add_sub(n))
        n0 = count_nonzero(gs)   # non-zero results = errors
The GIL exists to avoid object corruption ✈Dabeaz ✈Abhinav Ajitsaria
• GIL: Global Interpreter Lock
▪ Threads are chopped into tslice intervals and can lose code state ✈A. Jesse
• tslice = 5 msec • Errors appear from 8 msec
☞ For acceleration: avoid the GIL and Python object access
☞ For no errors: finish within tslice, or apply GIL-safe operations
[Diagram: Thread1 and Thread2 alternately holding the GIL, one tslice at a time]
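The 5 msec tslice above is exposed by CPython as the thread "switch interval". A minimal sketch (stdlib only; the API names are standard CPython):

```python
import sys

# CPython hands the GIL to another thread roughly every switch
# interval; its default corresponds to the 5 msec tslice above.
default_tslice = sys.getswitchinterval()  # 0.005 sec on stock CPython 3
print(default_tslice)

# The slice can be lengthened so that a short critical section is
# more likely to finish before the GIL moves to another thread.
sys.setswitchinterval(0.05)
print(sys.getswitchinterval())
sys.setswitchinterval(default_tslice)  # restore the default
```

Lengthening the interval only lowers the probability of a switch mid-operation; it is not a correctness guarantee.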
Count the hits H in the circle out of N random shots at a square: π ≈ 4·H/N ✈WikiPi-2 ✈LLNL. Error/π = a·N^b ✈WikiPi-1

Python:

    import random

    def pi_n(n):
        h = 0
        for _ in range(n):
            x = random.random()
            y = random.random()
            r2 = x*x + y*y
            if r2 <= 1.:
                h += 1
        return 4. * h / n

C:

    double pi_n(int n) {
        unsigned int s = time(NULL);
        int h = 0;
        for (int i = 0; i < n; ++i) {
            double x = (double)rand_r(&s) / (double)RAND_MAX;
            double y = (double)rand_r(&s) / (double)RAND_MAX;
            double r2 = x*x + y*y;
            if (r2 <= 1.) h += 1;
        }
        return 4. * (double)h / (double)n;
    }
Throw n shots ➡ 4·h/n
▪ m-thread version pi_n_m(n, m):
• Map: launch m workers; worker i counts h_i over n/m shots
• Reduce: h = sum(h1, h2, .., hm) ➡ result 4·h/n
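The Map/Reduce split above can be sketched with the stdlib alone. In pure Python the GIL serializes the workers, so this shows the structure, not a speed-up; the later Numba/Dask versions parallelize the same shape for real:

```python
import random
from concurrent.futures import ThreadPoolExecutor

def hits(n):
    # Map step: count hits inside the quarter circle for n shots
    h = 0
    for _ in range(n):
        x, y = random.random(), random.random()
        if x * x + y * y <= 1.:
            h += 1
    return h

def pi_m(n, m):
    # Launch m workers on n/m shots each, then Reduce by summing
    with ThreadPoolExecutor(max_workers=m) as tpe:
        futures = [tpe.submit(hits, n // m) for _ in range(m)]
        h = sum(f.result() for f in futures)
    return 4. * h / ((n // m) * m)

print(pi_m(10**5, 4))  # ≈ 3.14
```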
(Out of the scope of this talk.)
▪ An issue in this trial: rand_r vs. random_r
• rand_r: low randomness, ideal speed-up ➡ selected
• random_r: good randomness, but slows down under threading ✈stackoverflow
• The C standard gives no clear speed specification for multi-threading ✈open-std
• 80 stdlib functions are not thread-safe ✈opengroup, among them rand, random, drand48, lrand48, mrand48
• "more standardization for compilers, users, and libraries .. activation of threads" Shameem, p. 291, Multi-Core Programming ✈Intel-Press
☞ Check the speed of the officially thread-safe functions
[Figures: π error vs. # shots for rand_r (0.0001) and random_r (-0.005); Time [sec] vs. Shots for Two-Thread vs. No-Thread]
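In Python the analogous fix is one generator per thread: independent `random.Random()` instances share no state, unlike the module-level functions. A sketch (the per-thread seeding scheme here is illustrative, not from the slides):

```python
import random
import threading

def hits_local(n, out, i):
    # Each thread owns a private generator, so there is no
    # contention on shared RNG state across threads.
    rng = random.Random(i)  # per-thread seed, illustrative only
    h = 0
    for _ in range(n):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.:
            h += 1
    out[i] = h

outs = [0, 0]
ts = [threading.Thread(target=hits_local, args=(50_000, outs, i))
      for i in range(2)]
for t in ts:
    t.start()
for t in ts:
    t.join()
print(4. * sum(outs) / 100_000)  # ≈ 3.14
```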
Move "for loops" into functions
• NumPy vector/matrix functions are compiled C code
▪ Not only numeric calculation:
• count_nonzero • less_equal, less, .. • sort, lexsort, .. • where, searchsorted • I/O

Python:

    import random

    def pi_n(n):
        h = 0
        for _ in range(n):
            x = random.random()
            y = random.random()
            r2 = x*x + y*y
            if r2 <= 1.:
                h += 1
        return 4. * h / n

NumPy:

    import numpy as np

    def np_pi(n):
        x = np.random.rand(n).astype(np.float64)
        y = np.random.rand(n).astype(np.float64)
        rs = np.add(np.multiply(x, x, dtype=np.float64),
                    np.multiply(y, y, dtype=np.float64),
                    dtype=np.float64)
        ones = np.ones(n, dtype=np.float64)
        lss = np.less_equal(rs, ones)
        hit = np.count_nonzero(lss)
        pi = np.float64(4.) * np.float64(hit) / np.float64(n)
        return pi
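A quick way to see the effect of moving the loop into compiled NumPy code is to time a condensed version of the vectorized estimator (a sketch; the exact speed-up is machine-dependent, so only the π estimate is checked):

```python
import timeit
import numpy as np

def np_pi(n):
    # Whole-array operations: the per-shot loop runs in C, not Python
    x = np.random.rand(n)
    y = np.random.rand(n)
    hit = np.count_nonzero(x * x + y * y <= 1.)
    return 4. * hit / n

t = timeit.timeit(lambda: np_pi(10**5), number=10)
print('10 runs of np_pi(1e5):', t, 'sec')
print(np_pi(10**6))  # ≈ 3.14
```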
(SIMD: AVX, AVX2, AVX-512)
• @numba.jit: just-in-time compilation
▪ Few user guides ✈Conda2018Slide
▪ An excellent review ✈Matthew Rocklin
▪ Supported by Anaconda, Inc. and the Gordon and Betty Moore Foundation
▪ GPU version free since the end of 2017
▪ Requires: mkl, mkl_fft, mkl_random, ncurses, llvmlite
▪ CUDA compute capability 2.0 or above
, actual 100× ✈Murillo
• "for loop" in a function vs. vector operations, on List and NdArray, native and JIT:

    def for_add(n, vs):
        for i in range(n):
            vs[i] += 1

    def np_add(n, vs):
        a = np.add(vs, 1)

[Figure: Time [sec] vs. Shots for ForNdArray, ForList, NpAddList, JitForList, NpAddNdArray, JitForNdArray]
▼ NdArray indexing is 3.8× slower than List ✈stackoverflow
▼ Indexing requires setup calculations and branches in the main loops
▼ np.add(NdArray) is 100× faster than np.add(List)
Numba
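The List-vs-NdArray gap above comes from conversion cost: np.add accepts a Python list, but has to build an ndarray from it on every call, which is what makes NpAddList slow. A small check (results only; timings are machine-dependent):

```python
import numpy as np

vs_list = list(range(10))
vs_arr = np.arange(10)

# Same ufunc, two input types: the list is converted to an
# ndarray internally on each call, the ndarray is used as-is.
out_from_list = np.add(vs_list, 1)
out_from_arr = np.add(vs_arr, 1)

print(out_from_list.tolist())                     # [1, 2, ..., 10]
print(np.array_equal(out_from_list, out_from_arr))  # True
```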
(Details out of scope.)
• CUDA kernel code written as plain definitions ▼ Python-like, not the C of PyCuda
• Insert "[#blocks, #threads]" in calls ▼ e.g. pi_n[25, 40](n)
• Rewriting π ✍ ➡ 1160× ➡ 152× of NumPy
▼ Use the 2nd run; the 1st includes 1.8 sec of compile/load time
[Figure: Time [sec] vs. Shots, CUDA 1st run vs. 2nd run] Overhead ➡
• Eager: Python directly executes ordinary operations
• Graph: Python acts as a macro generator for computing graphs
• Eager if the first code line is tf.enable_eager_execution(), else Graph
• Two pip packages: CPU, GPU (= GPU + CPU)
  Implicit: the package sets the default device
  Explicit: a "with tf.device('/cpu:0'):" block
▪ PyTorch-torch-pt: [CPU], CUDA = 2 device types (NN graph)
• torch.func(.., device=D, ..) with D = device('cuda') or D = device('cpu')
• Implicit: auto-decided from the operands ➡ fast
• Explicit-2: torch.func(..).cuda() ➡ slow
▪ CuPy-cp: CUDA = 1 device type (NN graph)
• Only CUDA; use NumPy for the CPU
ML Packages
➋ Change some function names ➌ Add "tf.cast" to some functions ➍ Select the env for CUDA
▪ PyTorch ✍ / CuPy ✍
➊ np. ➡ pt. / cp. ➋ Change some function names / no change ➌ Add "device" options / no ➍ Set the global device type / no
▪ TensorFlow Graph ✍
➊ Create "tf.placeholder" inputs ➋ Run a function with the inputs
▪ TensorFlow CPU
• Execute the same code in a CPU-only env
[Figure: Time [sec] vs. Shots for TfEager, TfGraph, CuPy, PyTorch]
SIMD?
▪ PyTorch ✍: 0.7×, usable for CUDA-less develop/debug
[Figure: Time [sec] vs. Shots for PyTorch, NumPy, Tf, TfGraph]
▪ TensorFlow: Eager is in progress; more functional and faster?
[Figure: Time [sec] vs. Shots for CondaEnv, VirtualEnv, CondaMkl]
• V1.5 @ Jan. 2018: contribution version ✈
• V1.7: moving out of contribution
• V1.8: SSE, AVX linked
• V1.9 @ Aug.: Conda links Intel MKL ✈Conda. MKL: Math Kernel Library (BLAS, LAPACK, ScaLAPACK, FFT, NN, ..) ✈Intel
• V?: contribution AutoGraph ✈GitHub
Scatter, etc. in CUDA
• Concurrent main-memory accesses from CUDA devices and CPUs
▼ Written with non-portable special control functions, not Python: a macro language
▼ The functions are hard to understand, but contrib.AutoGraph converts "for, if, .." to Graph
• Slower than PyTorch in the π calculation
• 1000 While @ CUDA ✍ • 10 Parallel @ CUDA ✍
[Figures: Time [sec] vs. Shots, TfWhile vs. CuPy vs. PyTorch; TfPara vs. CuPy vs. PyTorch]
fft, cv, solvers, etc.
• TensorFlow: tf.(linalg, math, image, distributions, sets, strings); tf.contrib.(linalg, integrate, image, ffmpeg, signal, timeseries)
• CuPy: linalg, math, fft
▪ Predicting array-access overheads in ordinary cases (RNG: random number generator):
• NumPy: np.RNG(n) ➡ xs; xs[0] ➡ x, all on the CPU
➊ CuPy Array, 1/16×: cp.RNG(n) ➡ xs on CUDA; cp.asnumpy ➡ nd on the CPU; nd[0] ➡ x
➋ CuPy Scalar: cp.RNG(n) ➡ xs; xs[0] ➡ scalar on CUDA; cp.asnumpy ➡ x on the CPU
[Figures: Time [sec] vs. Shots, NumPy vs. Array vs. Scalar; Array vs. Scalar]
▼ Transfer time from CUDA to the CPU ▼ Jump caused by the cache?
Copy-in/copy-out overheads of a call def f(p1, p2) ➡ r:
• NumPy: the arguments a1 = p1, a2 = p2 and the return r all stay on the CPU
• Accelerator: a1 = p1, a2 = p2 are copied in to the device; the result r is copied out on return
[Figures: Time [sec] vs. Shots. Copy-in: Tf@Cpu, Tf@Gpu, PyTorch, CuPy. Copy-out: Tf@Gpu, CuPy, PyTorch, NumPy, Tf@Cpu. NumPy-copy: NumPy. Modified copy: Tf@Gpu, CuPy, PyTorch, Tf@Cpu]
Application modules:
☞ PyTorch allows debugging on the CPU
☞ Consider copy-in/out overhead
[Figure: Time [sec] vs. Shots for NumPy, Tf@Cpu, Tf@Gpu, CuPy@Gpu, TfWhile@Gpu, PyTorch@Gpu]
# m: thread count • mn.visualize() at m = 3

    cnt = int(n / m)
    ps = []
    for _ in range(m):
        p = dask.delayed(get_pi)(cnt)   # ① delay each get_pi
        ps.append(p)
    mn = dask.delayed(np.mean)(ps)      # ② delay the reduction
    pi = mn.compute()                   # ③ execute

▪ Applies to all the get_pi functions with m = 3
Dask
[Figure: NumPy vs. Dask]
• NumPy ufuncs are nogil ✈HP, which helps acceleration
▼ Short no-GIL intervals in "add, multiply, less_equal"
▪ No-GIL functions show clear improvement
[Figures: CPython vs. Dask; NogilPy vs. Dask]
• ThreadPoolExecutor showed: ▼ 3× at CPython ▼ 3× at NogilPy
▪ The others show no improvement; CuPy may have nogil functions
[Figures: Python vs. Dask; Python@Jit vs. Dask; TfCpu vs. Dask; CuPy vs. Dask]
Higher speed: • Delayed • ThreadPool
[Figures: Time [sec] vs. Shots at thread counts T1-T7, for Delayed and ThreadPool; relative slope and relative speed vs. threads, each with an ideal line]
"-=" without reasoning
➋ Large overheads for the π calculation
▪ A tool for Dask components?
▪ Too early to evaluate
➊ NumPy has nogil functions ➋ CuPy may have nogil functions
• PyTorch: freeze • TensorFlow@CPU: segmentation fault
[Figure: Time [sec] vs. Shots for NumPy, NumPy@Thread, NogilPy@Thread]
• NumPy • CuPy
[Figures: Time [sec] vs. Shots at thread counts T1-T7; relative speed and relative slope vs. threads, each with an ideal line]
Threading and Nogil
[Figure: absolute relative error vs. N <shots>, 1 loop: CuPy@T8 data vs. NumPy error]
• CuPy at 8 threads ▼ thread-safe RNG ▼ parallel execution in CUDA
• NumPy at 8 threads ▼ GIL error caused by

    h = 0
    for v in lss:
        if v == 1:
            h = h + 1   # not +=
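The `h = h + 1` failure mode is the classic read-modify-write race: the read of h and the write back can land in different tslices, so another thread's update gets overwritten. A pure-Python sketch of the same pattern (whether lost updates actually appear depends on the interpreter version and timing):

```python
import threading

h = 0

def bump(n):
    global h
    for _ in range(n):
        # read h, add 1, write back: not atomic across tslices
        h = h + 1

ts = [threading.Thread(target=bump, args=(100_000,)) for _ in range(8)]
for t in ts:
    t.start()
for t in ts:
    t.join()

# With no lost updates h == 800_000; under GIL switches it can be less.
print(h)
```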
Which functions are safe or not? Non-deterministic.

    # def rng_count(n) ✍ includes the first line; def count(n) omits it
    x = np.random.rand(n)
    ones = np.ones(n)
    c = np.count_nonzero(ones)
    return c            # expect n == c

• count: 14 errors; no error @ T2, 3, 4 on the test bench; no error on Intel Atom ✍
• rng_count: no error
☞ Apply forced nogil functions
[Figures: Time [sec] and errors vs. N, 1 loop, at thread counts T2-T8 for Count; Rng_Count vs. Count]
?
• Local objects are stored in a heap whose accesses should be protected by mutexes.
• The accesses to that heap are controlled by GIL block intervals, not by a mutex per access.
Guaranteed: @jit(nogil=True, nopython=True)
Non-guaranteed: @jit(nogil=True, nopython=False)
[Diagram: Thread-1/2/3, each with its own variables and namespaces; LLVM objects run after releasing the GIL; all accesses to the Python heap's object manager (Obj-1 .. Obj-n) pass through the GIL entry, which must be caught again]
rewriting
• Guaranteed nogil: replace h = count_nonzero(lss) with

    h = 0
    for v in lss:
        if v == 1:
            h = h + 1

• The rewriting slows execution down to 0.02×
• Numba speeds it up 1.6×
• 6 threads speed it up 3.2× ➡ 5× of the original
[Figures: Time [sec] vs. Shots for Rewritten, Original, Thread; relative speed vs. threads with an ideal line]
with nogil=True in numba.jit
➋ Almost impossible to predict GIL safety
➌ CuPy: parallel execution in CUDA?
[Figure: Time [sec] vs. Shots for NumPy vs. NogilNumPy]
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Numba-Nogil ThreadPool:

    import random
    from concurrent.futures import ThreadPoolExecutor
    import numba

    @numba.jit(nogil=True, nopython=True)
    def nba_pi_nogil(n):
        hit = 0
        for _ in range(n):
            x = random.random()
            y = random.random()
            r = x*x + y*y
            if r <= 1.:
                hit += 1
        return 4. * hit / n

    tpe = ThreadPoolExecutor(max_workers=12)

    def nba_pi_nogil_tp_n_m(n, m):
        global tpe
        cnt = int(n / m)
        if cnt < 1:
            cnt = 1
        ans = []
        for i in range(m):
            ans.append(tpe.submit(nba_pi_nogil, cnt))
        pi = 0.
        for f in ans:
            pi += f.result()
        return pi / m

    print('Test', nba_pi_nogil_tp_n_m(10**5, 4))

CUDA:

    import numba
    import numpy as np
    from numba.cuda.random import xoroshiro128p_uniform_float64
    from numba.cuda.random import create_xoroshiro128p_states

    @numba.cuda.jit()
    def nba_cuda(n, pi, rng):
        thread_id = numba.cuda.grid(1)
        hit = 0
        for _ in range(n):
            x = xoroshiro128p_uniform_float64(rng, thread_id)
            y = xoroshiro128p_uniform_float64(rng, thread_id)
            r = x*x + y*y
            if r <= 1.:
                hit += 1
        pi[thread_id] = 4. * hit / n

    def nba_cuda_rec(n):
        threads_per_block = 25
        blocks = 40
        rng_states = create_xoroshiro128p_states(
            threads_per_block * blocks, seed=1)
        pis = np.ones(threads_per_block * blocks, dtype=np.float64)
        nba_cuda[blocks, threads_per_block](n, pis, rng_states)
        return pis.mean()

    print('Test', nba_cuda_rec(10**5))

Appendix
TensorFlow-Eager:

    import tensorflow as tf
    tf.contrib.eager.enable_eager_execution()
    # tf.enable_eager_execution()

    def tf_pi_n(n):
        xs = tf.random_uniform(shape=[n], minval=0., maxval=1.,
                               dtype=tf.float64)
        ys = tf.random_uniform(shape=[n], minval=0., maxval=1.,
                               dtype=tf.float64)
        rs = tf.add(tf.multiply(xs, xs), tf.multiply(ys, ys))
        ones = tf.ones([n], dtype=tf.float64)
        lss = tf.less_equal(rs, ones)
        hit = tf.count_nonzero(lss)
        pi = tf.divide(tf.multiply(tf.cast(4., tf.float64),
                                   tf.cast(hit, tf.float64)),
                       tf.cast(n, tf.float64))
        return pi.numpy()

    print('Test', tf_pi_n(10**5))

CuPy-CUDA:

    import cupy as cp
    import numpy as np

    def cp_pi_gpu(n):
        x = cp.random.rand(n, dtype=cp.float64)
        y = cp.random.rand(n, dtype=cp.float64)
        rs = cp.add(cp.multiply(x, x, dtype=np.float64),
                    cp.multiply(y, y, dtype=np.float64),
                    dtype=np.float64)
        ones = cp.ones(n, dtype=cp.float64)
        lss = cp.less_equal(rs, ones)
        hit = cp.count_nonzero(lss)
        return 4. * float(hit) / n  # tail cut off in the slide; return reconstructed from the parallel listings

PyTorch-CPU:

    import torch
    torch.set_default_dtype(torch.float64)

    def pt_pi_cpu(n):
        x = torch.rand(n, dtype=torch.float64)
        y = torch.rand(n, dtype=torch.float64)
        rs = torch.add(torch.mul(x, x), torch.mul(y, y))
        ones = torch.ones(n, dtype=torch.float64)
        lss = torch.le(rs, ones)
        hit = torch.nonzero(lss).size()[0]
        pi = 4. * hit / n
        return pi

    print('Test', pt_pi_cpu(10**5))

PyTorch-CUDA:

    import torch
    torch.set_default_dtype(torch.float64)
    DEVICE = torch.device('cuda')

    def pt_pi_gpu_all(n):
        x = torch.rand(n, device=DEVICE)
        y = torch.rand(n, device=DEVICE)
        rs = torch.add(torch.mul(x, x), torch.mul(y, y))
        ones = torch.ones(n, device=DEVICE)
        lss = torch.le(rs, ones)
        hit = torch.nonzero(lss).size()[0]
        return 4. * hit / n

    print('Test', pt_pi_gpu_all(10**5))
TensorFlow-Graph:

    import tensorflow as tf

    def tf_pi_n(n):
        xs = tf.random_uniform(shape=[n], minval=0., maxval=1.,
                               dtype=tf.float64)
        ys = tf.random_uniform(shape=[n], minval=0., maxval=1.,
                               dtype=tf.float64)
        rs = tf.add(tf.multiply(xs, xs), tf.multiply(ys, ys))
        ones = tf.ones([n], dtype=tf.float64)
        lss = tf.less_equal(rs, ones)
        hit = tf.count_nonzero(lss)
        pi = tf.divide(tf.multiply(tf.cast(4., tf.float64),
                                   tf.cast(hit, tf.float64)),
                       tf.cast(n, tf.float64))
        return pi

    tf_n = tf.placeholder(tf.int32, [], name='n')
    tf_graph = tf_pi_n(tf_n)
    session = tf.Session()
    session.run(tf.global_variables_initializer())

    def get_pi(n):
        pi = session.run(tf_graph, feed_dict={tf_n: n})
        return pi

    if __name__ == "__main__":
        print('Test', get_pi(10**5))

TensorFlow-While Graph:

    import tensorflow as tf
    from tf_graph_simple import tf_pi_n

    def tf_graph_pi_n_while_sub(i, n, pis):
        pis = tf.add(pis, tf_pi_n(n))
        return pis

    def tf_graph_pi_n_while(n, loop):
        i = tf.constant(0)
        pis = tf.constant(0., dtype=tf.float64)
        i, pis = tf.while_loop(
            lambda i, pis: tf.less(i, loop),
            lambda i, pis: (tf.add(i, 1),
                            tf_graph_pi_n_while_sub(i, n, pis)),
            [i, pis])
        pi = tf.divide(pis, tf.cast(loop, tf.float64))
        return pi

    tf_n = tf.placeholder(tf.int32, [], name='n')
    tf_loop = tf.placeholder(tf.int32, [], name='loop')
    tf_graph_while = tf_graph_pi_n_while(tf_n, tf_loop)
    session = tf.Session()
    session.run(tf.global_variables_initializer())

    def get_pi(n):
        pi = session.run(tf_graph_while,
                         feed_dict={tf_n: n, tf_loop: 1000})
        return pi

    print('Test', get_pi(10**5))
TensorFlow-Parallel Graph:

    import tensorflow as tf

    M = 10
    m = tf.placeholder(tf.int32, [], name='m')
    n = tf.placeholder(tf.int32, [], name='n')
    step = tf.cast(tf.divide(n, m), dtype=tf.int32)
    hit = tf.zeros([], dtype=tf.int64, name='hit')
    for _ in range(M):
        xs = tf.random_uniform(shape=[step], minval=0., maxval=1.,
                               dtype=tf.float64)
        ys = tf.random_uniform(shape=[step], minval=0., maxval=1.,
                               dtype=tf.float64)
        rs = tf.add(tf.multiply(xs, xs), tf.multiply(ys, ys))
        ones = tf.ones([step], dtype=tf.float64)
        lss = tf.less_equal(rs, ones)
        hit = tf.add(hit, tf.count_nonzero(lss, dtype=tf.int64))
    pi = tf.divide(tf.multiply(tf.cast(4., tf.float64),
                               tf.cast(hit, tf.float64)),
                   tf.cast(n, tf.float64))
    ans = pi
    session = tf.Session()
    session.run(tf.global_variables_initializer())

    def get_pi(in_n, in_m):
        pi = session.run(ans, feed_dict={n: in_n, m: in_m})
        return pi

    print('Test', get_pi(10**5, 10))

Dask-Numba:

    import numpy as np
    import random
    import dask
    import numba

    @numba.jit(nogil=True)
    def get_pi(n):
        hit = 0
        for _ in range(n):
            x = random.random()
            y = random.random()
            r = x*x + y*y
            if r <= 1.:
                hit += 1
        return 4. * hit / n

    def dsk_nba_pi_nogil(n, m, v=False):
        cnt = int(n / m)
        ps = []
        for _ in range(m):
            p = dask.delayed(get_pi)(cnt)
            ps.append(p)
        mn = dask.delayed(np.mean)(ps)
        if v:
            mn.visualize(optimize_graph=True)  # generates ./mydask.png
            pi = 0
        else:
            pi = mn.compute()
        return pi

    # visualize() requires python-graphviz and the Graphviz utility
    # dsk_nba_pi_nogil(10**5, 3, v=True)
    print('Test', dsk_nba_pi_nogil(10**5, 3))
GIL-Safe test:

    from concurrent.futures import ThreadPoolExecutor
    import numpy as np

    tpe = ThreadPoolExecutor(max_workers=25)

    def rng_count(n):
        x = np.random.rand(n).astype(np.float64)
        ones = np.ones(n, dtype=np.float64)
        c = np.count_nonzero(ones)
        return c

    def count(n):
        ones = np.ones(n, dtype=np.float64)
        c = np.count_nonzero(ones)
        return c

    def tpe_pi_nm_min(n, m, f):
        global tpe
        ts = []
        for i in range(m):
            ts.append(tpe.submit(f, n))
        pis = []
        for t in ts:
            pis.append(t.result())
        return min(pis)

    for n in (7*10**6, 8*10**6, 9*10**6, 10**7):
        c = tpe_pi_nm_min(n, 9, count)
        print("count:", n == c, n, c)
        c = tpe_pi_nm_min(n, 9, rng_count)
        print("rng_count:", n == c, n, c)

GIL-Safe note: the printed results depend on the executing machine.

Benchmark machine:

    count:     False 7000000 34302
    rng_count: True  7000000 7000000
    count:     False 8000000 10750
    rng_count: True  8000000 8000000
    count:     False 9000000 525822
    rng_count: True  9000000 9000000
    count:     False 10000000 455166
    rng_count: True  10000000 10000000

Intel Atom N3150 @ 1.60 GHz, 4 cores, no Hyper-Threading, stepping=3: all True!