Improve Red Chainer and Numo::NArray performance

RubyKaigi 2018 LT

NAITOH Jun

May 31, 2018

Transcript

  1. Improve Red Chainer and Numo::NArray performance

    NAITOH Jun (@naitoh), RubyKaigi 2018 Lightning Talks, 31 May 2018
  2. self.introduce

    • Red Data Tools project / Red Chainer (Ruby port of Chainer 2). My task there is porting Chainer to Ruby, and I am currently doing it with py2rb.py.
    • redmine.tokyo study group staff.
    • My gem is RBPDF (an HTML to PDF generation library). It is used for Redmine's PDF export.
    • My Python pip package is py2rb (a code translator from Python to Ruby that uses the AST). I talked about it at RejectKaigi 2017.
    [Diagram: Chainer → Red Chainer. py2rb.py translates with the Python AST (Abstract Syntax Tree): numpy → Numo::NArray, python code → ruby code.]
  3. Problem of Red Chainer

    Red Chainer is about 3x slower than Chainer 2 (CPU ver.). *1

    MNIST epoch (CPU ver.)            Time
    Chainer *2                        26.0 sec
    Red Chainer *3                    83.68 sec   (x3.21 slower)
    Red Chainer, improved by my PR    48.32 sec   (faster)

    *1: 2 cores on CentOS7 (x86_64)
    *2: python 2.7 / numpy 2.7
    *3: Ruby 2.5.1 / Numo::NArray 0.9.1.2, OpenBLAS by loading Linalg
  4. The difference between Chainer and Red Chainer.

    Default: numpy float64 (fp64), Numo::NArray DFloat (fp64). Same.
    • The default for both numpy and Numo::NArray is fp64 (double-precision floating point).

    In use: Chainer uses numpy float32 (fp32), Red Chainer uses Numo::NArray DFloat (fp64). They differ. I want to change Red Chainer to SFloat (fp32), which is faster.
    • When casting Numo::NArray objects, going through a Numo::NArray class method yields DFloat (fp64). Going through the instance's own class keeps its dtype: DFloat (fp64) or SFloat (fp32).

    Before: Numo::NArray[…] builds log_p, so it is processed with Numo::NArray's default, DFloat.
    After: log_yd.class[…] builds log_p from the class of the SFloat object, so it keeps the accuracy as SFloat.
  5. Numo::NArray Benchmark

    • numpy's broadcast cases (b) and (c) are fast, but Numo::NArray's broadcast is not: its (b) and (c) take the same time as (a).
    • Division in Numo::NArray is 3-4 times slower than in numpy.
    • With my PR, the broadcast cases (b) and (c) and division become faster.

    [python 2.7 + numpy, loop=10000]
    x = np.ones([1000,784], dtype=np.float32)
    y = np.ones([1000,784], dtype=np.float32)
    z = np.ones([1000,1], dtype=np.float32)

    [ruby 2.5.1 + Numo::NArray 0.9.1.2, loop=10000]
    x = Numo::SFloat.ones([1000,784])
    y = Numo::SFloat.ones([1000,784])
    z = Numo::SFloat.ones([1000,1])

    Cases: (a) x op y (same shape), (b) x op z (broadcast), (c) x op 1.0 (scalar).
    The numpy form "x += y" is written "x.inplace + y" in Numo::NArray.

    Time [sec]      numpy     Numo::NArray   Numo::NArray (improved by my PR)
    (a) x += y      5.1500     7.131974      6.480560
    (b) x += z      4.1600     7.193444      3.696468
    (c) x += 1.0    2.2600     6.813332      3.070180
    (a) x -= y      5.1200     7.172393      6.197381
    (b) x -= z      4.0700     7.689382      3.499792
    (c) x -= 1.0    2.1000     7.161590      3.383020
    (a) x *= y      5.1400     7.312234      6.105243
    (b) x *= z      3.9200     7.630061      3.560802
    (c) x *= 1.0    2.0500     7.427569      3.301580
    (a) x /= y      5.6300    21.111190      6.609833
    (b) x /= z      6.4200    20.350092      5.425440
    (c) x /= 1.0    5.1400    20.636417      5.449255
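  6. How to optimize?

    • ext/numo/narray/gen/tmpl/binary.c
    [Code listing: the element loop of c_iter(na_loop_t *const lp) for the different dtype case, which case (c), broadcast with a scalar value, also goes through. Annotations per element: the op, the loop check and increment of i, and one pointer move (p += s) for each of the three operands.]
    [Diagram: x op y, x op z and x op scalar, with x as the left member.]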
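The per-element cost annotated on this slide can be sketched in plain C. This is a minimal illustration, not the Numo::NArray template itself: the function name `add_strided`, the fixed `float` element type and the `+` operation are assumptions made for the example. It shows why a scalar broadcast costs as much as a full array operand in the generic loop: the scalar is simply an operand with a stride of 0, and its pointer is still "moved" on every element.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the generated loop: out = a + b on float data.
 * Each operand is a byte pointer plus a byte stride. A stride of 0 keeps
 * re-reading one element, which is how a scalar is broadcast here. */
static void add_strided(size_t n,
                        const char *p1, ptrdiff_t s1,
                        const char *p2, ptrdiff_t s2,
                        char *p3, ptrdiff_t s3)
{
    assert(p1 != NULL && p2 != NULL && p3 != NULL);
    for (size_t i = 0; i < n; i++) {   /* loop check and i++ */
        /* the op */
        *(float *)p3 = *(const float *)p1 + *(const float *)p2;
        p1 += s1;                      /* pointer move */
        p2 += s2;                      /* pointer move (adds 0 for a scalar) */
        p3 += s3;                      /* pointer move */
    }
}
```

Passing the same buffer as the first operand and the output gives the in-place form, and passing `&k` with a stride of 0 as the second operand gives the scalar broadcast.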
  7. Optimization processing by type of variable.

    • ext/numo/narray/gen/tmpl/binary.c
    Same dtype case (cases (a), (b)): in-place case optimized (less memory).
    [Code listing, before: a single loop over i that calls check_intdivzero on the right operand and stores m_name(left[i], right[i]) into the indexed output.]
    [Code listing, after: the loop is split on whether the left operand and the output are the same pointer. In the in-place case the result is written back through the left operand's pointer and the output pointer is not used. Otherwise the loop is the same as before.]
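The same-dtype fast path on this slide can be sketched as follows. The names (`div_same_dtype`, `p1`..`p3`, `s1`..`s3`), the fixed `float` type and the explicit stride test are assumptions for illustration; the real template is generated per dtype and calls `check_intdivzero` for integer types, which this float-only sketch omits.

```c
#include <assert.h>
#include <stddef.h>

/* Sketch: out = a / b on float data, with a fast path for contiguous
 * operands and a dedicated in-place loop. */
static void div_same_dtype(size_t n,
                           char *p1, ptrdiff_t s1,
                           const char *p2, ptrdiff_t s2,
                           char *p3, ptrdiff_t s3)
{
    const ptrdiff_t w = (ptrdiff_t)sizeof(float);
    size_t i;

    assert(p1 != NULL && p2 != NULL && p3 != NULL);
    if (s1 == w && s2 == w && s3 == w) {
        /* Contiguous data: index the arrays instead of moving pointers. */
        float *a = (float *)p1;
        const float *b = (const float *)p2;
        if (p1 == p3) {
            /* In-place case: the output pointer is never used. */
            for (i = 0; i < n; i++) a[i] = a[i] / b[i];
        } else {
            float *out = (float *)p3;
            for (i = 0; i < n; i++) out[i] = a[i] / b[i];
        }
        return;
    }
    /* Generic fallback: three pointer moves per element. */
    for (i = 0; i < n; i++) {
        *(float *)p3 = *(float *)p1 / *(const float *)p2;
        p1 += s1;
        p2 += s2;
        p3 += s3;
    }
}
```

Splitting the loops this way keeps the per-element work to one load pair, one divide and one store in the common contiguous case, which is also the form a compiler can vectorize most easily.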
  8. Different dtype case: broadcasting case optimized.

    • ext/numo/narray/gen/tmpl/binary.c
    [Code listing, before: one loop for every case. Per element it calls check_intdivzero on the right operand, calls m_name once, then moves the left, right and output pointers by their strides.]
    [Code listing, after:]
    • Broadcast with a scalar value (case (c), the right operand's stride is 0): check_intdivzero is called once, before the loop, and the right operand's pointer is never moved. If the left operand and the output both have a stride of sizeof(dtype), they are indexed directly, with a separate in-place loop that does not use the output pointer (less memory). Otherwise only the left and output pointers are moved.
    • Other cases: an in-place loop that moves only the left and right pointers, and the unchanged loop that moves all three.
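The scalar-broadcast branch described on this slide can be sketched in C as below. As before, this is an illustration under assumptions rather than the generated code: `mul_broadcast` and the `float` element type are made up for the example, and the integer-only `check_intdivzero` call is left out. The point it shows is that once the right operand is known to have stride 0, its value can be read once and its pointer dropped from the loop.

```c
#include <assert.h>
#include <stddef.h>

/* Sketch: out = a * b on float data, where b may be a broadcast scalar. */
static void mul_broadcast(size_t n,
                          char *p1, ptrdiff_t s1,
                          const char *p2, ptrdiff_t s2,
                          char *p3, ptrdiff_t s3)
{
    const ptrdiff_t w = (ptrdiff_t)sizeof(float);
    size_t i;

    assert(p1 != NULL && p2 != NULL && p3 != NULL);
    if (n == 0) return;
    if (s2 == 0) {
        /* Broadcasting from a scalar value: read it once, never move p2. */
        const float k = *(const float *)p2;
        if (s1 == w && s3 == w) {
            float *a = (float *)p1;
            float *out = (float *)p3;   /* out == a in the in-place case */
            for (i = 0; i < n; i++) out[i] = a[i] * k;
        } else {
            for (i = 0; i < n; i++) {
                *(float *)p3 = *(float *)p1 * k;
                p1 += s1;               /* only two pointer moves remain */
                p3 += s3;
            }
        }
        return;
    }
    /* Other cases: the generic three-pointer loop. */
    for (i = 0; i < n; i++) {
        *(float *)p3 = *(float *)p1 * *(const float *)p2;
        p1 += s1;
        p2 += s2;
        p3 += s3;
    }
}
```

The sketch folds the in-place and out-of-place contiguous loops into one, since indexing `out[i]` and `a[i]` is correct for both; the slide keeps them as separate loops.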
  9. Future: Remaining bottleneck

    1. "x.dot(w.transpose)" is slow on Numo::NArray (dot on a Numo::NArray view is slow, #95). → Workaround: x.dot(w.dup.transpose). (But I want this fixed in Numo::NArray itself.)
       [Diagram: Numo::SFloat (view) → Numo::SFloat]
    2. Numo::NMath.sqrt is slow. → SIMD calculation could address this.

    MNIST epoch    use DFloat   use SFloat, my PR   Future (after the above correction)
    Chainer        26.0 sec     ←                   ← (26.0 sec)
    Red Chainer    83.68 sec    48.32 sec           30.29 sec
    slower         x3.21        x1.85               x1.16 (16% slower)

    2 cores on CentOS7 (x86_64)
    python 2.7 / numpy 2.7
    Ruby 2.5.1 / Numo::NArray 0.9.1.2, OpenBLAS by loading Linalg
  10. Reference

    • py2rb.py
      • https://github.com/naitoh/py2rb
      • http://naitoh.hatenablog.com/entry/2018/01/27/012333
    • Red Data Tools project (https://red-data-tools.github.io/)
    • Red Chainer (https://github.com/red-data-tools/red-chainer/)
    • Lots of memcpy are issued on a.dot(b.transpose)
      • https://github.com/ruby-numo/numo-narray/issues/95