Python and Apache Arrow

Sinhrks
December 08, 2018

@ Apache Arrow Tokyo Meetup 2018
https://speee.connpass.com/event/103514/

Transcript

  1. Python and Apache Arrow
    Masaaki Horikoshi @ ARISE analytics

  2. Self-introduction
    • Masaaki Horikoshi
    • ARISE analytics Inc.
    • Data analysis and the like
    • A member of core developers (currently easing back in):
    • GitHub: https://github.com/sinhrks

  3. Today's topics
    The Python data analysis ecosystem
    Challenges in data processing
    What I expect from Arrow

  4. Today's topics
    The Python data analysis ecosystem
    Challenges in data processing
    What I expect from Arrow

  5. Current state of the Python data analysis ecosystem
    [Diagram: packages arranged by category]
    Categories: Visualization, Big Data, IO, Machine Learning, Data Handling, Other Programming Languages
    Packages: Bokeh, matplotlib, TensorFlow, PyTables, SQLAlchemy, Ibis, PySpark, pandas, rpy2, Scikit-learn, NumPy, Dask

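  6. NumPy ndarray
    • (As a rule) an array of a single type stored in contiguous memory
    • Metadata controls how the physical representation is interpreted
    [Figure: the physical representation 00000001 00000010 00000011 00000100 and its logical representation; a base array and a view produced by Slice]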
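
The base/view relationship on this slide can be sketched in a few lines. This is a minimal illustration assuming NumPy is installed; the array contents mirror the four bytes in the figure, and the variable names are illustrative rather than taken from the talk.

```python
import numpy as np

# One contiguous buffer of four 1-byte integers:
# 00000001 00000010 00000011 00000100
base = np.array([1, 2, 3, 4], dtype=np.uint8)

# Basic slicing creates a view: new metadata, same underlying buffer.
view = base[1:3]

print(base.tobytes())        # raw physical representation
print(view.base is base)     # the view refers back to its base array
print(base.strides, view.shape)

# Writing through the view changes the base, because no data was copied.
view[0] = 20
print(base)
```

Only the metadata (shape, strides, offset into the buffer) differs between `base` and `view`, which is why the write is visible through both.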

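  7. pandas DataFrame
    • Developed by Wes McKinney, who is also an Arrow developer (since around 2010)
    • Tabular data in which each column can have its own type
    • Internally, data is managed in column-oriented blocks
    • Each block is backed by a NumPy ndarray
    [Figure: a DataFrame with Index and Columns of mixed data types, stored internally as IntBlock, FloatBlock, and ObjectBlock; columns may be consolidated per type]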
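
A small sketch of the per-column typing described above, assuming pandas and NumPy are installed. It uses only public API (`dtypes`, `to_numpy`) and does not inspect the internal blocks; the column names are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "i": [1, 2, 3],          # integer column
    "f": [0.1, 0.2, 0.3],    # float column
    "s": ["a", "b", "c"],    # string column
})

# Each column carries its own dtype.
print(df.dtypes)

# A single column can be handed out as a NumPy ndarray.
col = df["f"].to_numpy()
print(type(col), col.dtype)
```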

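  8. Dask DataFrame
    • Runs pandas operations in parallel and distributed fashion
    • Dynamically schedules and executes a computation graph
    [Figure: Blocked Algorithm. A Dask DataFrame is made up of several pandas DataFrames; Sum is applied to each one, the partial results are combined with Concat, and Sum is applied once more]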
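
The blocked algorithm in the figure can be imitated with plain pandas. This sketch does not use Dask itself; it only mimics the Sum, Concat, Sum pattern on hand-made partitions, and `chunk_size` is a hypothetical partition size chosen for the example.

```python
import pandas as pd

df = pd.DataFrame({"x": range(10)})

# Split the frame into row-wise partitions
# (Dask would hold these as separate pandas DataFrames).
chunk_size = 4  # hypothetical partition size
partitions = [df.iloc[i:i + chunk_size] for i in range(0, len(df), chunk_size)]

# Step 1: Sum each partition independently (these could run in parallel).
partial_sums = [part["x"].sum() for part in partitions]

# Step 2: Concat the partial results, then Sum once more.
total = pd.Series(partial_sums).sum()
print(total)
```

Because each partial sum depends only on its own partition, a scheduler is free to run them on different threads, processes, or machines.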

  9. Today's topics
    The Python data analysis ecosystem
    Challenges in data processing
    What I expect from Arrow

  10. Challenges in data processing
    • NumPy is a very mature numerical computing package
    • It also works well as data storage
    • However, use cases for which it is not necessarily optimal are starting to appear

  11. Limitations of missing values
    • NumPy:
    • No consistent missing value
    • Some data types (float, datetime, timedelta) have a value equivalent to missing (NaN, NaT)
    • In general, handled with masked arrays
    • pandas:
    • Inherits the limitations of NumPy's missing-value support
    • Inserting a missing value causes unintended type conversion (table at right)
    • Checking whether anything is missing requires scanning the values
    NEP 12: Missing Data Functionality in NumPy
    https://www.numpy.org/neps/nep-0012-missing-data.html

    Original      NA insertion result
    int           float
    float         float
    bool          float
    datetime      datetime
    timedelta     timedelta
    object        object
    categorical   categorical
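
The first row of the table (int becomes float) and the masked-array workaround can be reproduced as follows. This is a sketch assuming pandas and NumPy with their default NumPy-backed dtypes; `reindex` is just one convenient way to introduce a missing value.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3])
print(s.dtype)

# Reindexing adds a label with no value. There is no integer NaN,
# so the whole column is upcast to float to hold NaN.
widened = s.reindex([0, 1, 2, 3])
print(widened.dtype)

# Detecting missing values means scanning the data.
has_na = widened.isna().any()

# NumPy's general-purpose answer: keep a separate boolean mask.
masked = np.ma.masked_array([1, 2, 3, 0], mask=[False, False, False, True])
print(masked.dtype, masked.sum())  # dtype is preserved, masked slot is skipped
```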

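  12. Copy and View
    • It is hard to (fully) predict the behavior of NumPy and pandas
    • Defensive copies are needed to guard against unintended overwrites of data
    There are basic rules, but depending on the OS/CPU and the dtype the result can still be unintended
    https://scipy-cookbook.readthedocs.io/items/ViewsVsCopies.html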
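
One of the basic rules can be checked directly in NumPy: basic slicing returns a view, while integer-array indexing returns a copy. The sketch below assumes NumPy is installed and covers only this one rule, not the platform-dependent cases the slide warns about.

```python
import numpy as np

a = np.arange(6)

sliced = a[1:4]          # basic slicing: a view onto a's buffer
picked = a[[1, 2, 3]]    # integer-array indexing: an independent copy

print(np.shares_memory(a, sliced))
print(np.shares_memory(a, picked))

# A defensive copy makes the intent explicit.
safe = a[1:4].copy()
safe[0] = 99     # does not touch a

sliced[0] = -1   # silently modifies a
print(a)
```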

  13. Object ndarray
    • Slow
    • The actual values may not reside in a contiguous physical region
    • Computation runs in the CPython interpreter (GIL, method name resolution, …)
    • pandas treats strings as Object dtype (to support missing values)
    [Figure: ndarray (Primitive) consists of PyObject_HEAD, data, nd, …; ndarray (Object) consists of PyObject_HEAD, data, nd, … and its data refers to separate PyObject instances, each with its own PyObject_HEAD, …]
    Why Python is Slow: Looking Under the Hood
    https://jakevdp.github.io/blog/2014/05/09/why-python-is-slow/
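
The difference between the two layouts in the figure shows up in what an element access returns. This is a small sketch assuming NumPy is installed; the arrays are illustrative.

```python
import numpy as np

primitive = np.array([1, 2, 3], dtype=np.int64)
boxed = np.array([1, 2, 3], dtype=object)

# A primitive array stores raw 64-bit integers inline.
print(type(primitive[0]))

# An object array stores references to ordinary Python objects.
print(type(boxed[0]))

# Because any Python object fits, None can stand in for a missing string.
text = np.array(["a", None, "c"], dtype=object)
print(text[1] is None)
```

Every operation on `boxed` or `text` has to go through the Python objects one by one, which is the source of the slowness listed above.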

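  14. Expression evaluation
    • Expressions are evaluated one step at a time
    • This incurs overhead, such as allocating space for intermediate results
    Depending on the operation and the platform, evaluation is optimized so that no temporary allocation is needed
    https://github.com/numpy/numpy/pull/7997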
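
The intermediate allocations can be made visible by writing the same expression two ways. This sketch assumes NumPy is installed; the manual `out=` version is an illustration of the idea and is not the automatic optimization from the linked pull request.

```python
import numpy as np

a = np.ones(1_000_000)
b = np.full(1_000_000, 2.0)
c = np.full(1_000_000, 3.0)

# Evaluated step by step: (a * b) produces an intermediate array,
# then (+ c) produces the final result.
result = a * b + c

# The same computation reusing one preallocated buffer via out=.
buf = np.empty_like(a)
np.multiply(a, b, out=buf)
np.add(buf, c, out=buf)

print(np.array_equal(result, buf))
```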

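  15. Expression evaluation
    • Expressions are evaluated one step at a time
    • Unless you can picture the data and what each operation does internally, it is hard to write efficient code
    • Separate libraries exist for query optimization
    • NumExpr: Fast numerical array expression evaluator
    • Numba: NumPy aware dynamic Python compiler using LLVM
    df[df['Year'] >= 2018].groupby('Store').sum()['price']
    Dask also optimizes certain operations at the computation-graph level
    http://docs.dask.org/en/latest/optimize.html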
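
Since pandas evaluates each step as written, the order of operations in the slide's expression matters. The sketch below assumes pandas is installed; the sample data and the extra `qty` column are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Year": [2017, 2018, 2018, 2019],
    "Store": ["A", "A", "B", "B"],
    "price": [100, 200, 300, 400],
    "qty": [1, 2, 3, 4],
})

recent = df[df["Year"] >= 2018]

# As on the slide: aggregate every remaining column per store,
# then keep only price.
all_then_pick = recent.groupby("Store").sum()["price"]

# Selecting the column first means only price is aggregated.
pick_then_sum = recent.groupby("Store")["price"].sum()

print(pick_then_sum)
```

A query optimizer could perform this kind of column pruning automatically; in eager pandas code it is up to the author.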

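  16. Appending rows
    • Slow
    • For every column, memory is reallocated and values are copied
    • Appending rows one at a time in a loop means…
    [Figure: a row is added (+) to a DataFrame with Index and Columns]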
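
A common way to avoid the per-row reallocation is to collect rows first and build the frame once. This is a sketch assuming pandas is installed; `pd.concat` stands in for whatever append call the loop would use, and the row contents are illustrative.

```python
import pandas as pd

rows = [{"a": i, "b": i * 2} for i in range(5)]

# Row-by-row growth: every iteration builds a brand-new frame
# and copies all existing values into it.
grown = pd.DataFrame(rows[:1])
for row in rows[1:]:
    grown = pd.concat([grown, pd.DataFrame([row])], ignore_index=True)

# Collect first, construct once: a single allocation per column.
built = pd.DataFrame(rows)

print(grown.equals(built))
```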

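  17. Applying a function to rows
    [Figure: a DataFrame with Index and Columns]
    df.apply(lambda row: row['a'] + row['b'], axis=1)
    • Slow
    • Each row is sliced out one at a time to create a Series (single-column data)
    • Every Series creation incurs memory allocation and type inference
    • Extracting a value from the Series requires looking up the position from the label
    • The type of the overall result is inferred from the per-row results
    • The result of the operation is non-deterministic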
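
For the specific function on this slide, a column-wise expression gives the same values without creating a Series per row. This sketch assumes pandas is installed and uses a tiny made-up frame.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# Row-wise: one Series is built per row and the lambda runs in Python.
row_wise = df.apply(lambda row: row["a"] + row["b"], axis=1)

# Column-wise: a single operation over the underlying arrays.
vectorized = df["a"] + df["b"]

print(row_wise.tolist(), vectorized.tolist())
```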

  18. Data IO
    • External packages
    • Slow when implemented in Python
    • In-house implementations
    • Painful to maintain

    magic = (b"\x00\x00\x00\x00\x00\x00\x00\x00" +
             b"\x00\x00\x00\x00\xc2\xea\x81\x60" +
             b"\xb3\x14\x11\xcf\xbd\x92\x08\x00" +
             b"\x09\xc7\x31\x8c\x18\x1f\x10\x11")

    class SASIndex(object):
        row_size_index = 0
        column_size_index = 1
        subheader_counts_index = 2
        column_text_index = 3
        column_name_index = 4

    subheader_signature_to_index = {
        b"\xF7\xF7\xF7\xF7": SASIndex.row_size_index,
        b"\x00\x00\x00\x00\xF7\xF7\xF7\xF7": SASIndex.row_size_index,
        # … (the excerpt on the slide ends here)
    }

    https://github.com/pandas-dev/pandas/tree/master/pandas/io/sas

  19. Data IO
    • The standard package (+ official subpackages) supports the formats listed at:
    https://pandas.pydata.org/pandas-docs/stable/io.html

  20. Data IO (between packages)
    • The Python data analysis ecosystem has developed around NumPy
    • Scikit-learn assumes NumPy ndarray as its input
    • When a pandas DataFrame is passed in, it is converted to a NumPy ndarray internally
    • NumPy defines the API (Array interface)
    NumPy: The Array Interface
    https://docs.scipy.org/doc/numpy-1.15.1/reference/arrays.interface.html
    [Figure: pandas DataFrame is converted to NumPy ndarray through the Array Interface, then passed on to computation]
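
The conversion in the figure can be observed from the outside with `np.asarray`. This is a sketch assuming pandas and NumPy are installed; whether a given consumer library calls exactly this function internally is an assumption, and the frames are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0], "y": [3.0, 4.0]})

# A consumer that expects an ndarray can convert whatever it receives.
arr = np.asarray(df)
print(type(arr), arr.dtype, arr.shape)

# With mixed column types, the single resulting array falls back to object.
mixed = pd.DataFrame({"x": [1.0, 2.0], "label": ["a", "b"]})
mixed_arr = np.asarray(mixed)
print(mixed_arr.dtype)
```

The second case shows the cost of funnelling a table through one homogeneous array: the per-column types are lost.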

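  21. Parallel processing
    • Global Interpreter Lock (GIL)
    • Multiple threads cannot execute simultaneously in the CPython interpreter
    • It can be released explicitly in Cython
    • Each package has to handle this operation by operation
    • PyObject cannot be used after the GIL is released
    • Separate libraries exist for parallel and distributed processing
    • Dask: A flexible library for parallel computing in Python
    • Serialization switches between several methods, such as pickle, depending on the case
    Understanding the Python GIL
    http://www.dabeaz.com/GIL/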
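
When work is moved to another process to sidestep the GIL, the data has to be serialized. The sketch below shows only the pickle round trip within a single process, assuming NumPy is installed; it illustrates that the receiver ends up with a copy rather than shared memory.

```python
import pickle

import numpy as np

original = np.arange(5)

# What gets sent to a worker process is a byte string...
payload = pickle.dumps(original, protocol=pickle.HIGHEST_PROTOCOL)

# ...and the worker rebuilds its own array from those bytes.
restored = pickle.loads(payload)

print(np.array_equal(original, restored))
print(np.shares_memory(original, restored))
```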

  22. Today's topics
    The Python data analysis ecosystem
    Challenges in data processing
    What I expect from Arrow

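  23. Features I expect from Arrow
    • Consistent support for missing values
    • Predictable behavior
    • Query optimization (Gandiva?)
    • Flexible data appending (Chunk?)
    • An abstract representation of rows (RowSet?)
    • Fast IO
    • Data sharing in parallel processing (Plasma?)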

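  24. The future Arrow brings
    • Will data analysis packages become tools that process the representation Arrow defines?
    • Will packages split into those that operate on Arrow and those that handle IO with the outside world?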

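  25. What the Python side should keep working on
    • That said, it is probably not possible to do all processing in Arrow
    • Consistency with the Python standard library and de facto standard packages
    • String processing, date processing, and so on
    • Object Array
    • Depending on what you want to do, will Arrow's representation need to be processed in Python?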

  26. ExtensionArray
    • An interface for mapping arbitrary classes onto an internal representation that is easy to handle (such as Arrow)
    • The API is being developed as pandas ExtensionArray
    Extension Arrays for pandas
    https://tomaugspurger.github.io/pandas-extension-arrays.html
    [Figure: IP Address values such as '192.168.1.1' and '2001:0db8:85a3…' pass through ExtensionArray on Set and Get; internally they are held as hi: [0, …] and lo: [3232235777, …]]
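
The hi/lo mapping in the figure can be sketched with the standard library and NumPy. This is not a pandas ExtensionArray implementation; `pack` and `unpack` are hypothetical helper names, and the IPv6 address below is an illustrative one rather than the truncated address on the slide.

```python
import ipaddress

import numpy as np

MASK64 = (1 << 64) - 1


def pack(addresses):
    """Split each address (as a 128-bit integer) into two uint64 columns."""
    values = [int(ipaddress.ip_address(a)) for a in addresses]
    hi = np.array([v >> 64 for v in values], dtype=np.uint64)
    lo = np.array([v & MASK64 for v in values], dtype=np.uint64)
    return hi, lo


def unpack(hi, lo):
    """Rebuild address strings from the two columns.

    Simplifying assumption: values below 2**32 are treated as IPv4,
    which is how ipaddress.ip_address interprets small integers.
    """
    return [str(ipaddress.ip_address((int(h) << 64) | int(l)))
            for h, l in zip(hi, lo)]


hi, lo = pack(["192.168.1.1", "2001:db8::1"])
print(hi, lo)
print(unpack(hi, lo))
```

Storing addresses as two fixed-width integer arrays keeps the data contiguous and avoids the Object dtype, which is the motivation for this kind of mapping.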

  27. Arrow Integration
    • pandas wraps PyArrow via ExtensionArray
    • The test code already contains a Bool type implementation that uses PyArrow
    https://github.com/pandas-dev/pandas/blob/master/pandas/tests/extension/arrow/bool.py

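  28. GPU DataFrame
    • As data sizes balloon, demand for analysis on GPUs is growing
    • Processing needs to be dispatched to GPU or CPU according to the characteristics of the data (columns)
    • Should each column be stored on the device best suited to it?
    • Is bus transfer of metadata and indexers a challenge?
    [Figure: a DataFrame with Index and Columns whose blocks sit on different devices: Int on CPU, Object on CPU, Int on GPU, Float on GPU]
    RAPIDS: GPU-Accelerated Data Analytics & Machine Learning
    https://developer.nvidia.com/rapids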

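  29. Summary
    • Arrow looks likely to greatly improve the challenges in data processing
    • Developing data processing packages is fun
    • Beyond the core there is a wide range of work: strings, datetimes, IO, visualization, and more
    • Start with something simple
    • Documentation and error message fixes
    • Areas that do not use internal APIs, such as visualization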