Slide 1

Slide 1 text

TFUG-Tokyo#7-1 2018/1/29 Y. Okuda

Slide 2

Slide 2 text

Fortran, PL/I, RPG: 360/IBM
Bliss, C, X11: VAX/DEC
(ADA, Objective-C)
C++, Octave, Tcl: Sparc/Sun
CERN-Root: PC/Intel
JavaScript, Python: /Nvidia?
DL (NN/AI)

Slide 3

Slide 3 text

❶ DL ❷ Python
■ Python GIL ➡
● CPU ➡ GHz ➡
● GPU ➡ GP-GPU ➡ CUDA
● Spark✈, TPU, Phi✈, Nervana✈, Tegra✈, ARM, QUALCOMM
■ Python: ➡
■ :
● PyCUDA, OpenCL, PySpark
● Pandas ➡ Dask, Intel-Python
● List: Numba
● NumPy: TensorFlow (TF), PyTorch, CuPy
▼ TF: ➡

Slide 4

Slide 4 text

SG: Static Graph (@session.run)
EG: Eager, not Graph

Slide 5

Slide 5 text

S/W
● env-1, env-2, ... env-n
● Python 3.5
● Conda 4.3.30
● Mint Linux (Ubuntu 16.04)
● CPU + GPU
● SSH, NFS
H/W
● CPU: i7-2630QM (Sandy Bridge '12), 2.4 GHz, 4 cores, 8 threads, L1=256K, L2=1M, L3=6M, PCIe II 5GT/s, DDR3 16G 21.3G/s, QM77, NF9G (Jetway)
● GPU: GTX-1060 (Pascal GP-106), 1.5 GHz, 1280 cores, L2=1.5M (192b I/F), PCIe II 5GT/s, DDR5 6G 8G/s, CUDA-8 CC-6.1

Slide 6

Slide 6 text

Env
    $ conda create -n XXX python=3.5
    $ source activate XXX
    $ pip install YYYYY.whl
CPU GPU
1.1 TensorFlow-SG 1.4
1.2 TensorFlow-SG 1.4
1.3 TensorFlow-EG 1.5-dev20171127
1.4 TensorFlow-EG 1.5-dev20171127
1.5 TensorFlow-EG/SG 1.5 AVX Python-2.7
2.1 PyTorch 0.2.0_4
3.1 CuPy 2.2.0
4.1 Numba 0.36.1
5.1 Intel Python

Slide 7

Slide 7 text

■ [0, 1): shot, hit
[Figure: unit square with radius r, axes from 0 to 1]
π = 4 · hit/shot
■ NumPy
● SIMD
■ np. ➡ tf. torch. cp. (CuPy)
    import numpy as np

    def get(size, shot, hit):
        # Draw `size` random points in the unit square.
        x = np.random.rand(size)
        y = np.random.rand(size)
        # Distance of each point from the origin.
        r = np.sqrt(np.add(np.multiply(x, x), np.multiply(y, y)))
        one = np.array([1.] * size)
        # Count the points that fall inside the quarter circle.
        lss = np.less_equal(r, one)
        hit_one = np.count_nonzero(lss)
        hit += hit_one
        shot += size
        return shot, hit

Slide 8

Slide 8 text

    for size in [10**2, ..., 10**6]:
        shot, hit = 0, 0
        shot, hit = get(size, shot, hit)
        pi = 4 * hit / shot
        rec(pi, shot, size)
[Figure: elapsed time [s] vs. Size (Np time); absolute error, log scale, vs. Size (Np error)]
    for size in [10**2, ..., 10**6]:
        for _ in range(1000):
            shot, hit = 0, 0
            shot, hit = get(size, shot, hit)
        pi = 4 * hit / shot
        rec(pi, shot, size)
[Figure: processing time [s] vs. Size; NP, TF]
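The loop above can be tried without any framework. The following is a minimal sketch, assuming powers of ten for the elided batch sizes and a plain list in place of `rec`; the `run` helper and its record layout are hypothetical and not taken from the slides.

```python
import time

import numpy as np


def get(size, shot, hit):
    """Add `size` random points in the unit square to the running totals."""
    x = np.random.rand(size)
    y = np.random.rand(size)
    r = np.sqrt(x * x + y * y)
    hit += int(np.count_nonzero(r <= 1.0))
    shot += size
    return shot, hit


def run(sizes=(10**2, 10**3, 10**4, 10**5, 10**6)):
    """Return (size, pi estimate, absolute error, seconds) per batch size.

    The default sizes are an assumption; the slide elides the middle values.
    """
    records = []
    for size in sizes:
        start = time.perf_counter()
        shot, hit = get(size, 0, 0)
        elapsed = time.perf_counter() - start
        pi = 4 * hit / shot
        records.append((size, pi, abs(pi - np.pi), elapsed))
    return records


if __name__ == "__main__":
    for size, pi, err, sec in run():
        print(f"size={size:>8d}  pi={pi:.5f}  |err|={err:.2e}  time={sec:.2e}s")
```

The absolute error should shrink roughly with the square root of the batch size, while the elapsed time grows with it.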

Slide 9

Slide 9 text

TF
CPU 4 6 !!
CPU GPU 24 76
While
[Figure: processing time [s]; series Np, EgCpu, SgCpu, SgGpu, SgWhileGpu]

Slide 10

Slide 10 text

NumPy
■ np. ➡ tf. torch.
NumPy: np.xxx()
    x = np.random.rand(size).astype(np.float32)
    y = np.random.rand(size).astype(np.float32)
    rs = np.sqrt(np.add(np.multiply(x, x), np.multiply(y, y)))
    ones = np.array([1.] * size, dtype=np.float32)
    lss = np.less_equal(rs, ones)
    hit_one = np.count_nonzero(lss)
1
TF-Eg: tf.xxx()
    x = tf.random_uniform(shape=[size], minval=0., maxval=1., dtype=tf.float32)
    y = tf.random_uniform(shape=[size], minval=0., maxval=1., dtype=tf.float32)
    rs = tf.sqrt(tf.add(tf.multiply(x, x), tf.multiply(y, y)))
    ones = tf.ones([size], dtype=tf.float32)
    lss = tf.less_equal(rs, ones)
    hit_one = tf.count_nonzero(lss, dtype=tf.int64)
8!
PyTorch: torch.xxx()
    x = tf.random_uniform(shape=[size], minval=0., maxval=1., dtype=tf.float32)
    y = tf.random_uniform(shape=[size], minval=0., maxval=1., dtype=tf.float32)
    rs = tf.sqrt(tf.add(tf.multiply(x, x), tf.multiply(y, y)))
    ones = tf.ones([size], dtype=tf.float32)
    lss = tf.less_equal(rs, ones)
    hit_one = tf.count_nonzero(lss, dtype=tf.int64)
3+

Slide 11

Slide 11 text

Np EG Pt CPU
■ PyTorch (Pt)
■ EG
[Figure: processing time [s]; series Np, PtCpu, EgCpu]
■ EG
● 0.5M
● float32 = 4 byte
● 2
● 4 x 3 x 0.5 = 6M = L3
■ AVX
[Figure: processing time [s]; series EgCpu, EgAvx]
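The arithmetic 4 x 3 x 0.5 = 6M = L3 can be checked directly. This is a small sketch, assuming the factor 3 counts the float32 arrays that are live at once and taking the 6M L3 size from the hardware slide; the helper name `working_set_bytes` is hypothetical.

```python
FLOAT32_BYTES = 4          # float32 = 4 byte
LIVE_ARRAYS = 3            # assumption: the "3" in 4 x 3 x 0.5
L3_BYTES = 6 * 2**20       # L3 = 6M on the i7-2630QM


def working_set_bytes(size, arrays=LIVE_ARRAYS, itemsize=FLOAT32_BYTES):
    """Bytes touched when `arrays` arrays of `size` elements are in use."""
    return size * arrays * itemsize


for size in (10**5, 5 * 10**5, 10**6):
    ws = working_set_bytes(size)
    print(f"size={size:>8d}  working set={ws:>10d} B  fits in L3: {ws <= L3_BYTES}")
```

Under these assumptions the working set reaches the L3 size at about 0.5M elements, and larger batches no longer fit in the cache.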

Slide 12

Slide 12 text

CPU
■ N3150 4.3
(Braswell '15 Celeron/Atom) 1.6-2 GHz, 4 cores, L1=224K, L2=2M (1M x 2)
[Figure: processing time [s]; series Np, EgCpu]
AVX 4.2 3.7 4.6 2
1.6-2.4 GHz, 4 cores, 8 threads, L1=256K, L2=1M, L3=6M
[Figure: processing time [s]; series Np, EgCpu, EgAvx]

Slide 13

Slide 13 text

EG,SG
Eager
    def gen(a, loop):
        for _ in range(loop):
            a = tf.add(a, 1)
        return a

    a = tf.zeros(··)
    a = gen(a, 3)
Static Graph
    N = 3

    def gen(a):
        for _ in range(N):
            a = tf.add(a, 1)
        return a

    ap = tf.placeholder(··)
    ag = gen(ap)
    sess = tf.Session()
    sess.run(tf.··initializer())
    a = np.zeros(··)
    a = sess.run(ag, feed_dict={ap: a})
tf.Tensor, tf.Variable, tf.TensorArray
▼ tf.Variable
▼ tfe.Variable
tf.xxx

Slide 14

Slide 14 text

EG,SG
    for _ in range(N):
        a = tf.add(a, )
EG
[Diagram: CPU; a, a; +, +; a1]
DDR ➡ M-Con. ➡ CPU ➡ PCIe ➡ GPU-M-Bus ➡ DDR
SG
[Diagram: CPU; a, a; +1, +2, ..., +N; a1, a2, ..., aN; N, N]

Slide 15

Slide 15 text

EG,SG GPU
■ EG GPU
● : =9, (8 + 8) =..
[Figure: processing time [s]; series EgCpu, EgGpu]
■ SG GPU 3.8
● : = , 0, (2, 2)
[Figure: processing time [s]; series SgCpu, SgGpu]

Slide 16

Slide 16 text

SG2
■ Run
    ap = tf.placeholder(...)
    aa = tf.add(ap, 1)
    sess = tf.Session()
    sess.run(tf.global_variables_initializer())
    a = np.zeros([10])
    for _ in range(N):
        a = sess.run(aa, feed_dict={ap: a})
[Diagram: CPU; a, ap; +, +; aa]
■ While loop
    ap = tf.placeholder(...)
    i, a = tf.while_loop(
        lambda i, a: tf.less(i, N),
        lambda i, a: (tf.add(i, 1), tf.add(a, 1)),
        [0, ap])
    sess = tf.Session()
    sess.run(tf.global_variables_initializer())
    a = sess.run(a, feed_dict={ap: ..})
[Diagram: CPU; a0, ap; + WL, + WL; a, i, a]
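The result computed by the While loop form can be illustrated without TensorFlow. The sketch below uses a hypothetical pure-Python `while_loop` helper that only imitates the calling convention (condition, body, loop variables); it says nothing about how the real `tf.while_loop` executes inside a graph.

```python
import numpy as np


def while_loop(cond, body, loop_vars):
    """Hypothetical stand-in: repeat `body` while `cond` holds, return the final loop variables."""
    loop_vars = tuple(loop_vars)
    while cond(*loop_vars):
        loop_vars = tuple(body(*loop_vars))
    return loop_vars


N = 3
i, a = while_loop(
    lambda i, a: i < N,            # same role as tf.less(i, N)
    lambda i, a: (i + 1, a + 1),   # same role as (tf.add(i, 1), tf.add(a, 1))
    (0, np.zeros([10])),
)
print(i, a)
```

The point of the While loop form on the slide is that all N iterations run inside one `sess.run` call, whereas the Run form crosses the Python boundary on every iteration.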

Slide 17

Slide 17 text

SG While loop
■ GPU 3.2
[Figure: processing time [s]; series SgGpu, SgWhileGpu]
■ CPU 1.
[Figure: processing time [s]; series SgCpu, SgWhileCpu]

Slide 18

Slide 18 text

EG While loop
■ SG (SG )
● GPU 0.9 1.3
[Figure: processing time [s]; series EgGpu, EgWhileGpu]
● CPU 0.9 1.2
[Figure: processing time [s]; series EgCpu, EgWhileCpu]

Slide 19

Slide 19 text


Slide 20

Slide 20 text

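(1/3)
■ EG: $t_{eg} \approx t_{sgw} + N(t_{ed} + t_{eu})$
● tf.xxx(..) , EG/SG
● EG, N: $t_{eg}$
[Diagram: CPU; a, ap; +, +; aa; $t_{ed}$, $t_{ec}$, $t_{eu}$]
$t_{eg} = N(t_{ec} + t_{ed} + t_{eu}) = N t_{ec} + N(t_{ed} + t_{eu})$
● SG-While loop: $t_{sgw}$
[Diagram: CPU; a0, ap; + WL, + WL; a, i, a; $t_{wd}$, $t_{wc} = N t_{ec}$, $t_{wu}$]
$t_{sgw} = t_{wc} + t_{wd} + t_{wu} \approx t_{wc}$
Substituting $t_{wc} = N t_{ec}$ and $t_{sgw} \approx t_{wc}$ into the expression for $t_{eg}$ gives the estimate in the first line.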
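The relation between the two timings can be checked numerically. The sketch below encodes the two formulas from this slide; the function names and all numeric values are illustrative assumptions, not measurements from the talk.

```python
import math


def eager_time(n, t_ec, t_ed, t_eu):
    """t_eg = N (t_ec + t_ed + t_eu): every op pays compute plus both transfers."""
    return n * (t_ec + t_ed + t_eu)


def while_loop_time(n, t_ec, t_wd, t_wu):
    """t_sgw = t_wc + t_wd + t_wu with t_wc = N t_ec: the transfers are paid once."""
    t_wc = n * t_ec
    return t_wc + t_wd + t_wu


# Illustrative values only (seconds).
n, t_ec, t_ed, t_eu = 8, 1.0e-3, 2.0e-4, 2.0e-4
t_wd, t_wu = t_ed, t_eu

exact = eager_time(n, t_ec, t_ed, t_eu)
approx = while_loop_time(n, t_ec, t_wd, t_wu) + n * (t_ed + t_eu)

# The approximation differs from the exact value only by the one-off t_wd + t_wu.
print(exact, approx, approx - exact)
```

The gap between the two values is the single pair of transfers in the While loop form, which is why the approximation holds when those are small compared with the compute time.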

Slide 21

Slide 21 text

(2/3)
■ 1000: $t_{DU} = 1000(t_{ed} + t_{eu})$
    a = tf.placeholder(...)

    def func(ain):
        return tf.add(ain, 1.)

    ans_get = func(a)

    for _ in range(1000):
        buf = np.empty(size)
        ans = sess.run(ans_get, feed_dict={a: buf})
:np.empty
■ ➡
[Figure: processing time [s], GPU average]
[Figure: processing time [s], CPU average; $t_{ed}$, $t_{eu}$]
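The measurement loop on this slide has a simple shape: call a trivial op many times and average. The sketch below mirrors that shape with a NumPy stand-in for the graph op, so it measures NumPy call overhead rather than the transfer times on the slide; `mean_call_time` is a hypothetical helper, and with TensorFlow the `op` argument would wrap the `sess.run` call instead.

```python
import time

import numpy as np


def func(ain):
    """Stand-in for the graph op tf.add(ain, 1.)."""
    return np.add(ain, 1.0)


def mean_call_time(op, size, repeats=1000):
    """Average wall-clock seconds per call of `op` on a fresh buffer of `size` elements."""
    start = time.perf_counter()
    for _ in range(repeats):
        buf = np.empty(size, dtype=np.float32)  # contents are irrelevant, as on the slide
        op(buf)
    return (time.perf_counter() - start) / repeats


if __name__ == "__main__":
    for size in (10**2, 10**4, 10**6):
        print(f"size={size:>8d}  mean call time={mean_call_time(func, size):.3e}s")
```

Because the op itself is nearly free, the average is dominated by per-call overhead, which is the quantity the slide isolates.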

Slide 22

Slide 22 text

(3/3)
■ $t_{eg} = t_{sgw} + 8\,t_{DU}$ ➡ GPU , AVX ➡
● EG,SG
[Figure: processing time [s], GPU estimate; series $t_{eg}$, $t_{sgw}$]
[Figure: processing time [s], CPU estimate; series $t_{egavx}$, $t_{sgw}$]

Slide 23

Slide 23 text

EG
■ GPU (AVX) ➡ Phi/Intel Wiki
[Figure: processing time [s]; series Eg: Gpu, Cpu, Avx]
☞ GPU
☞ CPU
☞ N NumPy N
Xeon: 4 22 (L3=20 60MB) Wiki
Atom: 2 16 (L3=2 16MB) Wiki
ARM/DynamIQ: 1 8
GPU
▼ + ➡ GPU
▼ / + ➡ GPU
: Google+Intel for Phi✈ Xeon (22 ), AVX2
: PyTorch, Chainer:

Slide 24

Slide 24 text

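SG
■ : feed_dict={x: xs, y: ys... }
GPU 25.8 ➡ 1.2
AVX 1.6
[Figure: processing time [s], random-number transfer; series Np, SgGpu, EgAvx, Sg without transfer]
[Figure: processing time [s], random-number transfer components; series Np, Sg random-number transfer, Sg random-number transfer $t_{DU}$, random-number generation Sg]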

Slide 25

Slide 25 text

(1/2):
: a
n1 = fc1(a) ➡ n2 = fc2(a)
[Diagram: a; fc1, fc2; n1, n2]
SIMD
n = fc(a)
n[0:10] = fc(a[0:10])
n[10:20] = fc(a[10:20])
a[i] = fu(m)
tf.while_loop
tf.cond
Eager
Session.run
tf.Variable
tfe.Variable
tf.TensorArray
tf.assign_xxx()
tf.scatter_xxx()
tf.control_dependencies
· ·
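The SIMD lines above say that an element-wise function can be applied to independent slices. A minimal NumPy sketch of that property follows; the function `fc` here is a hypothetical element-wise example, not the one from the talk.

```python
import numpy as np


def fc(a):
    """Hypothetical element-wise function: each output depends only on its own input."""
    return np.sqrt(a) + 1.0


a = np.arange(20, dtype=np.float32)

# Whole-array evaluation: n = fc(a)
n_whole = fc(a)

# Slice-by-slice evaluation: n[0:10] = fc(a[0:10]), n[10:20] = fc(a[10:20])
n_chunked = np.empty_like(a)
n_chunked[0:10] = fc(a[0:10])
n_chunked[10:20] = fc(a[10:20])

print(np.array_equal(n_whole, n_chunked))
```

An update such as `a[i] = fu(m)` does not have this property in general, because the order of the writes matters.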

Slide 26

Slide 26 text

(2/2):
TF: H/W, OS, .. ➡ ,
▼
▼ DL
Help:
HW OS: Linux, Windows, Esxi, Android, iOS, Web
H/W: CUDA, Phi/Intel, AVX, Intel-Python, GFX (AMD), ?/ARM
TF: Eager, Run-SIMD, I/O
DL, Model Based, Data Analysis

Slide 27

Slide 27 text

■ , TF
● EG,SG
▼ GPU
▼ tf.xxx()
▼ CPU EG )
● EG
▼ , ( , Lib )
▼ Python
▼ EG SG
● SG
▼ ❙ ❙ ❙
▼
▼
■ : , Variable, DL ..

Slide 28

Slide 28 text

Q/A or

Slide 29

Slide 29 text


Slide 30

Slide 30 text

EG for ...
    # -*- coding: utf-8 -*-
    import tensorflow as tf
    import tensorflow.contrib.eager as tfe
    tfe.enable_eager_execution()

    def gen(a, loop):
        for _ in range(loop):
            a = tf.add(a, 1)
        return a

    a = tf.zeros([10])
    a = gen(a, 3)
    print(a)

SG for ...
    # -*- coding: utf-8 -*-
    import tensorflow as tf
    import numpy as np

    N = 3

    def gen(a):
        for _ in range(N):
            a = tf.add(a, 1)
        return a

    ap = tf.placeholder(tf.int32, None, name='a')
    ag = gen(ap)
    sess = tf.Session()
    sess.run(tf.global_variables_initializer())
    a = np.zeros([10])
    a = sess.run(ag, feed_dict={ap: a})
    print(a)

Run
    # -*- coding: utf-8 -*-
    import tensorflow as tf
    import numpy as np

    ap = tf.placeholder(tf.int32, None, name='ap')
    aa = tf.add(ap, 1)
    sess = tf.Session()
    sess.run(tf.global_variables_initializer())
    a = np.zeros([10])
    for _ in range(3):
        a = sess.run(aa, feed_dict={ap: a})
    print(a)

While loop
    # -*- coding: utf-8 -*-
    import tensorflow as tf
    import numpy as np

    ap = tf.placeholder(tf.int32, None, name='a')
    i, a = tf.while_loop(
        lambda i, a: tf.less(i, 3),
        lambda i, a: (tf.add(i, 1), tf.add(a, 1)),
        [0, ap])
    sess = tf.Session()
    sess.run(tf.global_variables_initializer())
    a = sess.run(a, feed_dict={ap: np.zeros([10])})
    print(a)