Improve Red Chainer and Numo::NArray performance

RubyKaigi 2018 LT

NAITOH Jun

May 31, 2018

Transcript

  1. Improve
    Red Chainer and Numo::NArray
    performance
    NAITOH Jun (@naitoh)
    RubyKaigi 2018 Lightning Talks
    31 May 2018

  2. Red Chainer / Chainer
    • Red Data Tools project / Red Chainer
    (Ruby port of Chainer 2)

    • My task is porting it to Ruby,
    currently using py2rb.py.
    self.introduce
    • redmine.tokyo study group staff.
    • My gem is RBPDF
    (an HTML-to-PDF generation
    library). It is used in Redmine's
    PDF export.
    • My Python pip package is py2rb (a code translator from Python
    to Ruby using the AST; presented at RejectKaigi 2017).
    [Diagram: py2rb.py translates with the Python AST (Abstract Syntax Tree):
    python code → ruby code, numpy → Numo::NArray]

  3. Problem of Red Chainer
    Red Chainer is about 3x slower than
    Chainer 2 (CPU ver.).

    CPU ver. *1    mnist epoch
    Chainer        26.0 sec *2
    Red Chainer    83.68 sec *3 (x3.21 slower)

    Improved by my PR: 48.32 sec (about 1.7x faster than 83.68 sec)

    *1: 2 cores on CentOS7 (x86_64)
    *2: python 2.7 / numpy 2.7
    *3: Ruby 2.5.1 / Numo::NArray 0.9.1.2, OpenBLAS loaded via Linalg
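
The ratios quoted on this slide can be re-derived from the three timings. The following is a small sketch in plain Ruby; the constant and method names are made up for illustration, and the ratios are truncated (not rounded) to two decimals because that appears to be how the slide reports them.

```ruby
# Timings in seconds as quoted on the slide (mnist run, CPU).
CHAINER_SEC        = 26.0
RED_CHAINER_SEC    = 83.68  # Red Chainer before the pull request
RED_CHAINER_PR_SEC = 48.32  # Red Chainer after the pull request

# Ratio of two timings, truncated to two decimals.
def ratio(numerator, denominator)
  (numerator / denominator).floor(2)
end

before = ratio(RED_CHAINER_SEC, CHAINER_SEC)         # slowdown vs. Chainer, before the PR
after  = ratio(RED_CHAINER_PR_SEC, CHAINER_SEC)      # slowdown vs. Chainer, after the PR
gain   = ratio(RED_CHAINER_SEC, RED_CHAINER_PR_SEC)  # speedup from the PR alone

puts format("before PR: x%.2f slower than Chainer", before)
puts format("after PR:  x%.2f slower than Chainer", after)
puts format("PR speedup: x%.2f", gain)
```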

  4. The difference between
    Chainer and Red Chainer.
    default
    numpy           float64 (fp64)
    Numo::NArray    DFloat (fp64)
    → Same
    • By default, both numpy and Numo::NArray
    compute in fp64 (double-precision
    floating point).
    use
    Chainer        numpy float32 (fp32)
    Red Chainer    Numo::NArray DFloat (fp64)
    → differ
    → SFloat (fp32): I want to change Red Chainer to SFloat (faster).
    for Class Method       NArray (fp64)
    for Instance Method    DFloat (fp64) / SFloat (fp32)
    Before: Numo::NArray[] (log_p) is processed with Numo::NArray's default DFloat.
    After: log_yd.class[] (log_p) keeps accuracy as SFloat (SFloat object).
    • When Numo::NArray objects are cast with class methods, the result is treated as DFloat.

  5. Numo::NArray Benchmark
    • numpy's broadcast cases (b) and (c) are fast,
    but Numo::NArray's broadcast is not.
    • Division in Numo::NArray is 3-4 times slower than in numpy.
    [Diagram: operands of case (a) x and y, case (b) x and z, case (c) x and a scalar]

    [python 2.7 + numpy loop=10000]
    x = np.ones([1000,784], dtype=np.float32)
    y = np.ones([1000,784], dtype=np.float32)
    z = np.ones([1000,1], dtype=np.float32)
                    [sec]
    (a) x += y      5.1500
    (b) x += z      4.1600
    (c) x += 1.0    2.2600
    (a) x -= y      5.1200
    (b) x -= z      4.0700
    (c) x -= 1.0    2.1000
    (a) x *= y      5.1400
    (b) x *= z      3.9200
    (c) x *= 1.0    2.0500
    (a) x /= y      5.6300
    (b) x /= z      6.4200
    (c) x /= 1.0    5.1400

    [ruby 2.5.1 + Numo::NArray 0.9.1.2 loop=10000]
    x = Numo::SFloat.ones([1000,784])
    y = Numo::SFloat.ones([1000,784])
    z = Numo::SFloat.ones([1000,1])
                           before [sec]    improved by my PR [sec]
    (a) x.inplace + y      7.131974        6.480560
    (b) x.inplace + z      7.193444        3.696468
    (c) x.inplace + 1.0    6.813332        3.070180
    (a) x.inplace - y      7.172393        6.197381
    (b) x.inplace - z      7.689382        3.499792
    (c) x.inplace - 1.0    7.161590        3.383020
    (a) x.inplace * y      7.312234        6.105243
    (b) x.inplace * z      7.630061        3.560802
    (c) x.inplace * 1.0    7.427569        3.301580
    (a) x.inplace / y      21.111190       6.609833
    (b) x.inplace / z      20.350092       5.425440
    (c) x.inplace / 1.0    20.636417       5.449255
    Before: (a), (b) and (c) take about the same time, and division is slower.
    After my PR: the broadcast cases (b) and (c) are fast, and division is faster.

  6. • ext/numo/narray/gen/tmpl/binary.c
    <%=c_iter%>(na_loop_t *const lp)
    {
        size_t   i, n;
        char    *p1, *p2, *p3;
        ssize_t  s1, s2, s3;
        INIT_COUNTER(lp, n);
        INIT_PTR(lp, 0, p1, s1);
        INIT_PTR(lp, 1, p2, s2);
        INIT_PTR(lp, 2, p3, s3);
        //<% if need_align %>
        if (is_aligned(p1,sizeof(dtype)) &&
            is_aligned(p2,sizeof(dtype)) &&
            is_aligned(p3,sizeof(dtype)) ) {
            if (s1 == sizeof(dtype) &&
                s2 == sizeof(dtype) &&
                s3 == sizeof(dtype) ) {
                for (i=0; i<n; i++) {
                    check_intdivzero(((dtype*)p2)[i]);
                    ((dtype*)p3)[i] = m_<%=name%>(((dtype*)p1)[i],((dtype*)p2)[i]);
                }
                return;
            }
            if (is_aligned_step(s1,sizeof(dtype)) &&
                is_aligned_step(s2,sizeof(dtype)) &&
                is_aligned_step(s3,sizeof(dtype)) ) {
                //<% end %>
                for (i=0; i<n; i++) {
                    check_intdivzero(*(dtype*)p2);
                    *(dtype*)p3 = m_<%=name%>(*(dtype*)p1,*(dtype*)p2);
                    p1 += s1;
                    p2 += s2;
                    p3 += s3;
                }
                return;
                //<% if need_align %>
            }
        }
    Common function for every combination of s1, s2, s3.
    • Same dtype case (cases (a), (b)): Simple Loop.
    op / loop: i<n (Range Check), i++ (Loop Count), p3 = p1 op p2 (Operation)
    • Different dtype case, and case (c) Broadcast with scalar value:
    op / loop: i<n (Check), i++, p3 = p1 op p2, p1 += s1 (Pointer move),
    p2 += s2 (Pointer move), p3 += s3 (Pointer move)
    • Broadcast / scalar: the operation count per loop is higher. How to optimize?
    [Diagram labels: x (left member), y, z, scalar]

  7. Optimization processing by
    type of variable.
    • ext/numo/narray/gen/tmpl/binary.c
    Same dtype case (cases (a), (b)): s1, s2, s3
    Before:
        for (i=0; i<n; i++) {
            check_intdivzero(((dtype*)p2)[i]);
            ((dtype*)p3)[i] = m_<%=name%>(((dtype*)p1)[i],((dtype*)p2)[i]);
        }
    After:
        if (p1 == p3) { // inplace case
            for (i=0; i<n; i++) {
                check_intdivzero(((dtype*)p2)[i]);
                ((dtype*)p1)[i] = m_<%=name%>(((dtype*)p1)[i],((dtype*)p2)[i]);
            }
        } else {
            for (i=0; i<n; i++) {
                check_intdivzero(((dtype*)p2)[i]);
                ((dtype*)p3)[i] = m_<%=name%>(((dtype*)p1)[i],((dtype*)p2)[i]);
            }
        }
    Inplace case optimized: less memory, p3 is not used (p1 == p3, inplace case).

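  8. • ext/numo/narray/gen/tmpl/binary.c
    Different dtype case: Broadcasting case optimized
    Before:
        for (i=0; i<n; i++) {
            check_intdivzero(*(dtype*)p2);
            *(dtype*)p3 = m_<%=name%>(*(dtype*)p1,*(dtype*)p2);
            p1 += s1;
            p2 += s2;
            p3 += s3;
        }
    After:
        if (s2 == 0) { // Broadcasting from scalar value.
            check_intdivzero(*(dtype*)p2);
            if (s1 == sizeof(dtype) &&
                s3 == sizeof(dtype) ) {
                if (p1 == p3) { // inplace case
                    for (i=0; i<n; i++) {
                        ((dtype*)p1)[i] = m_<%=name%>(((dtype*)p1)[i],*(dtype*)p2);
                    }
                } else {
                    for (i=0; i<n; i++) {
                        ((dtype*)p3)[i] = m_<%=name%>(((dtype*)p1)[i],*(dtype*)p2);
                    }
                }
            } else {
                for (i=0; i<n; i++) {
                    *(dtype*)p3 = m_<%=name%>(*(dtype*)p1,*(dtype*)p2);
                    p1 += s1;
                    p3 += s3;
                }
            }
        } else {
            if (p1 == p3) { // inplace case
                for (i=0; i<n; i++) {
                    check_intdivzero(*(dtype*)p2);
                    *(dtype*)p1 = m_<%=name%>(*(dtype*)p1,*(dtype*)p2);
                    p1 += s1;
                    p2 += s2;
                }
            } else {
                for (i=0; i<n; i++) {
                    check_intdivzero(*(dtype*)p2);
                    *(dtype*)p3 = m_<%=name%>(*(dtype*)p1,*(dtype*)p2);
                    p1 += s1;
                    p2 += s2;
                    p3 += s3;
    Before → After:
    • Broadcast with scalar value, case (c) (s2 == 0):
    Same dtype case (s1, s3): fewer op per loop, less memory.
    Different dtype case: fewer op per loop, no change of p2 (s2 == 0).
    • Other case (s2 != 0), inplace case: fewer op per loop, less memory, p3 is not used (p1 == p3).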
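
The branch structure on slides 6 to 8 can be modelled in a few lines of plain Ruby. This is only a sketch of the control flow, not the C template itself: arrays stand in for the C buffers, strides are counted in elements instead of bytes, the alignment and zero-division checks are left out, and the method name `binary_op` and the returned path symbols are invented for illustration.

```ruby
# Ruby model of the dispatch: pick a loop shape from the strides.
# p1, p2 are inputs, p3 is the output (p3 may be the same object as p1
# for the in-place case); s1, s2, s3 are element strides; n is the count.
def binary_op(p1, s1, p2, s2, p3, s3, n)
  if s2.zero?
    scalar = p2[0] # broadcast from a scalar: read it once, outside the loop
    if s1 == 1 && s3 == 1
      n.times { |i| p3[i] = yield(p1[i], scalar) }
      :scalar_contiguous
    else
      i1 = i3 = 0
      n.times do
        p3[i3] = yield(p1[i1], scalar)
        i1 += s1
        i3 += s3 # the index into p2 never moves
      end
      :scalar_strided
    end
  elsif s1 == 1 && s2 == 1 && s3 == 1
    n.times { |i| p3[i] = yield(p1[i], p2[i]) } # simple loop, no index updates
    :contiguous
  else
    i1 = i2 = i3 = 0
    n.times do # general loop: three index updates per element
      p3[i3] = yield(p1[i1], p2[i2])
      i1 += s1
      i2 += s2
      i3 += s3
    end
    :generic
  end
end

x = [1.0, 2.0, 3.0, 4.0]
path = binary_op(x, 1, [10.0], 0, x, 1, x.size) { |a, b| a + b } # in-place x + 10.0
p path # :scalar_contiguous
p x    # [11.0, 12.0, 13.0, 14.0]
```

In C the gain comes from dropping pointer updates and loads inside the hot loop; the Ruby model only shows which loop shape each stride combination selects.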

  9. Future: Remaining bottleneck
    1. “x.dot(w.transpose)” is slow on Numo::NArray (dot on a Numo::NArray view is
    slow). #95
    Operand types: Numo::SFloat and Numo::SFloat (view).

    → Workaround: x.dot(w.dup.transpose). (But I want this fixed in Numo::NArray itself.)

    2. Numo::NMath.sqrt is slow.

    → This could be addressed with SIMD computation.

    mnist epoch    use DFloat    use SFloat (my PR)    Future (after the above correction)
    Chainer        26.0 sec      ←                     ← (26.0 sec)
    Red Chainer    83.68 sec     48.32 sec             30.29 sec
    slower         x3.21         x1.85                 x1.16 (16% slower)

    2 cores on CentOS7 (x86_64)
    python 2.7 / numpy 2.7
    Ruby 2.5.1 / Numo::NArray 0.9.1.2, OpenBLAS loaded via Linalg

  10. Reference
    • py2rb.py

    • https://github.com/naitoh/py2rb

    • http://naitoh.hatenablog.com/entry/2018/01/27/012333

    • Red Data Tools project (https://red-data-tools.github.io/)

    • Red Chainer (https://github.com/red-data-tools/red-chainer/)

    • Lots of memcpy are issued on a.dot(b.transpose)

    • https://github.com/ruby-numo/numo-narray/issues/95
