Python and Apache Arrow

Sinhrks
December 08, 2018

@ Apache Arrow Tokyo Meetup 2018
https://speee.connpass.com/event/103514/

Transcript

  1. Python and Apache Arrow • Masaaki Horikoshi @ ARISE analytics

  2. About me • Masaaki Horikoshi • ARISE analytics Inc. • Data analysis and related work • A member of core developers (currently easing back in) • GitHub: https://github.com/sinhrks
  3. Today's topics • The Python data analysis ecosystem • Challenges in data processing • What I expect from Arrow

  4. Today's topics • The Python data analysis ecosystem • Challenges in data processing • What I expect from Arrow

  5. The current state of the Python data analysis ecosystem

    Diagram categories: Visualization, Big Data, IO, Machine Learning, Other Programming Languages, Data Handling. Packages shown: Bokeh, matplotlib, TensorFlow, PyTables, SQLAlchemy, Ibis, PySpark, pandas, rpy, Scikit-learn, NumPy, Dask.
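  6. NumPy ndarray • An array in contiguous memory that (in principle) has a single dtype • Metadata controls how the physical representation is to be interpreted

    Figure labels: physical representation (00000001 00000010 00000011 00000100), logical representation, base, view, Slice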
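
The relationship between physical bytes and metadata on this slide can be sketched as follows. This is a minimal illustration that assumes a four-byte `uint8` buffer matching the figure; the variable names are mine, not the speaker's.

```python
import numpy as np

# Four bytes in one contiguous buffer: 00000001 00000010 00000011 00000100.
base = np.array([1, 2, 3, 4], dtype=np.uint8)

# Basic slicing creates a view: new metadata, same physical buffer.
view = base[::2]

print(view.base is base)            # the view records which array owns the memory
print(base.strides, view.strides)   # the step in bytes differs, the bytes do not

# The same bytes can be read through different metadata (here, a wider dtype).
as_u16 = base.view(np.uint16)

# Writing through the view changes the base, because nothing was copied.
view[0] = 99
```

Only `shape`, `strides`, and `dtype` change between `base`, `view`, and `as_u16`; the buffer is shared by all three.
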
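  7. pandas DataFrame • Developed by Wes McKinney, who is also an Arrow developer (from around 2010) • Tabular data in which each column can have its own dtype • Internally, data is managed in column-oriented blocks • Each block is backed by a NumPy ndarray

    Figure labels: Column, Index, Mixed data types; Columns, Index, IntBlock, FloatBlock, ObjectBlock; "Columns may be consolidated per types"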
  8. Dask DataFrame • Runs pandas operations in parallel and distributed fashion • Dynamically schedules and executes a computation graph

    Figure labels: Blocked Algorithm; Dask DataFrame, pandas DataFrame; Sum, Sum, Concat, Sum
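
The Sum / Concat / Sum pattern in the figure can be imitated with pandas alone. The sketch below is not Dask's implementation; it only assumes that a blocked sum reduces each block, gathers the partial results, and reduces again. The block size of 4 is arbitrary.

```python
import pandas as pd

df = pd.DataFrame({"x": range(10), "y": range(10, 20)})

# Split the frame into row-wise blocks, each one an ordinary pandas DataFrame.
blocks = [df.iloc[i:i + 4] for i in range(0, len(df), 4)]

# Sum: each block is reduced independently, so these calls could run in parallel.
partial_sums = [block.sum() for block in blocks]

# Concat: gather the per-block results into one table (one row per block).
combined = pd.concat(partial_sums, axis=1).T

# Sum: reduce the partial results to get the final answer.
total = combined.sum()
print(total)
```

Because summation is associative, `total` matches `df.sum()` regardless of how the rows are split.
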
  9. Today's topics • The Python data analysis ecosystem • Challenges in data processing • What I expect from Arrow

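  10. Challenges in data processing • NumPy is a very mature numerical computing package • It also works well as data storage • However, use cases for which it is not necessarily the best fit are becoming apparent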

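  11. Limitations around missing values • NumPy: • There is no consistent missing value • Some dtypes (float, datetime, timedelta) have a value that stands in for missing (NaN, NaT) • In general, masked arrays are used • pandas: • Inherits NumPy's limitations in missing-value support • Inserting a missing value causes unintended type conversion (table at right) • Checking whether anything is missing requires scanning the values

    NEP 12: Missing Data Functionality in NumPy https://www.numpy.org/neps/nep-0012-missing-data.html

    Original → NA insertion result: int → float • float → float • bool → float • datetime → datetime • timedelta → timedelta • object → object • categorical → categorical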
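
A short sketch of the points above, assuming a recent NumPy and pandas. It uses `reindex` as one convenient way to introduce a missing value; the slide does not say which operation the speaker had in mind.

```python
import numpy as np
import pandas as pd

# NumPy: floats and datetimes have a stand-in for "missing", integers do not.
floats = np.array([1.0, np.nan])
dates = np.array(["2018-12-08", "NaT"], dtype="datetime64[D]")

# NumPy's general-purpose answer is a separate boolean mask.
masked = np.ma.masked_array([1, 2, 3], mask=[False, True, False])

# pandas: introducing a missing value into an int column changes its dtype.
ints = pd.Series([1, 2, 3])
with_na = ints.reindex([0, 1, 2, 3])   # label 3 does not exist, so it becomes NaN

print(ints.dtype, "->", with_na.dtype)

# Finding out whether anything is missing means scanning the values.
has_na = with_na.isna().any()
```
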
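  12. Copy and View • It is hard to (fully) predict the behavior of NumPy and pandas • Defensive copies are needed to guard against unintended overwrites of data • There are basic rules, but the outcome can still be unexpected depending on OS/CPU and dtype https://scipy-cookbook.readthedocs.io/items/ViewsVsCopies.html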
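
The most basic rule (slicing gives a view, integer-array indexing gives a copy) can be sketched like this. It is a minimal NumPy-only illustration and does not cover the pandas or platform-dependent cases the slide alludes to.

```python
import numpy as np

a = np.arange(6)

sliced = a[1:4]         # basic slicing: a view onto a's memory
picked = a[[1, 2, 3]]   # integer-array indexing: a copy, despite equal values

sliced[0] = 100         # also changes a[1]
picked[1] = -1          # leaves a untouched

print(np.shares_memory(a, sliced), np.shares_memory(a, picked))

# A defensive copy makes the intent explicit, at the cost of extra memory.
safe = a[1:4].copy()
safe[0] = 0
```
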

  13. Object ndarray • Slow • The actual values may not sit in a contiguous region of memory • Computation runs on the CPython interpreter (GIL, method name resolution, …) • pandas handles strings as Object dtype (to support missing values)

    Figure labels: ndarray (Primitive): PyObject_HEAD, data, nd, …; ndarray (Object): PyObject_HEAD, data, nd, …, whose elements are PyObject entries, each leading to a separate PyObject_HEAD …

    Why Python is Slow: Looking Under the Hood https://jakevdp.github.io/blog/2014/05/09/why-python-is-slow/
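
The difference between the two layouts in the figure is visible from Python. The following is a small sketch; the array contents are arbitrary examples of mine.

```python
import numpy as np

primitive = np.arange(3, dtype=np.int64)
boxed = primitive.astype(object)

# The primitive array stores 8-byte integers inline in its buffer.
print(primitive.dtype, primitive.itemsize)

# The object array stores pointers; each element is a separate Python int.
print(boxed.dtype, type(boxed[0]))

# Strings in an object array are ordinary Python str objects, and a missing
# value can sit beside them only because any Python object is allowed.
words = np.array(["a", None, "c"], dtype=object)
```

Every operation on `boxed` or `words` has to go through each Python object in turn, which is where the interpreter overhead listed on the slide comes from.
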
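  14. Expression evaluation • Expressions are evaluated sequentially, one operation at a time • This incurs overhead such as allocating memory for intermediate results • For some operations and platforms, evaluation is optimized so that no temporary allocation is needed https://github.com/numpy/numpy/pull/7997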
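
As a rough sketch of what sequential evaluation means for memory, the example below compares a chained expression with explicit reuse of one output buffer. Whether NumPy elides the temporary in the first form depends on the platform and version, as the slide notes, so the comments describe the general case only.

```python
import numpy as np

a = np.ones(1_000_000)
b = np.ones(1_000_000)
c = np.ones(1_000_000)

# Evaluated step by step: (a + b) is computed first as an intermediate array,
# and the final result generally needs its own allocation as well.
naive = a + b + c

# Spelling the steps out lets a single buffer be reused for both additions.
result = np.empty_like(a)
np.add(a, b, out=result)
np.add(result, c, out=result)
```
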

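  15. Expression evaluation • Expressions are evaluated sequentially • Writing efficient code is hard unless you can picture the data and what each operation actually does • Separate libraries exist for query optimization • NumExpr: Fast numerical array expression evaluator • Numba: NumPy aware dynamic Python compiler using LLVM

    df[df['Year'] >= 2018].groupby('Store').sum()['price']

    Dask also optimizes specific operations at the computation-graph level http://docs.dask.org/en/latest/optimize.html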
  16. Appending rows • Slow • For every column, memory is reallocated and values are copied • If rows are appended one at a time in a loop … …

    Figure labels: Columns, Index; Index, Columns; +
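
The cost of growing a frame row by row can be sketched as below. The example uses `pd.concat` for the slow path, which is an assumption on my part; the slide does not show which call the speaker used.

```python
import pandas as pd

rows = [{"a": i, "b": i * 2} for i in range(100)]

# Row-at-a-time: every concat builds a new frame and copies all earlier rows.
slow = pd.DataFrame([rows[0]])
for row in rows[1:]:
    slow = pd.concat([slow, pd.DataFrame([row])], ignore_index=True)

# Collect first, build once: memory for each column is allocated a single time.
fast = pd.DataFrame(rows)
```

Both frames hold the same data, but the loop copies on the order of n² values in total while the single constructor call copies n.
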
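  17. Applying a function to rows

    df.apply(lambda row: row['a'] + row['b'], axis=1)

    • Slow • Each row is sliced out and turned into a Series (single-column data) • Every Series creation allocates memory and infers the dtype • Reading a value from the Series requires a lookup from label to position • The dtype of the whole result is inferred from the per-row results • The result is non-deterministic

    Figure labels: Columns, Index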
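
A hedged comparison of the row-wise call from the slide with its column-wise equivalent. The column names `a` and `b` follow the slide; the frame contents are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"a": range(1000), "b": range(1000)})

# Row-wise: one Series is built per row, and row["a"] is a label lookup.
by_row = df.apply(lambda row: row["a"] + row["b"], axis=1)

# Column-wise: a single addition over the two underlying arrays.
by_column = df["a"] + df["b"]

# The shape of apply's result depends on what the function returns:
# returning a Series per row yields a DataFrame instead of a Series.
as_frame = df.head(3).apply(lambda row: row * 2, axis=1)
```

The two results agree, but the column-wise form avoids the per-row Series construction and type inference listed above.
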
  18. Data IO • External packages • Slow when implemented in Python • In-house implementations • Painful to maintain

    magic = (b"\x00\x00\x00\x00\x00\x00\x00\x00" +
             b"\x00\x00\x00\x00\xc2\xea\x81\x60" +
             b"\xb3\x14\x11\xcf\xbd\x92\x08\x00" +
             b"\x09\xc7\x31\x8c\x18\x1f\x10\x11")

    class SASIndex(object):
        row_size_index = 0
        column_size_index = 1
        subheader_counts_index = 2
        column_text_index = 3
        column_name_index = 4
        …

    subheader_signature_to_index = {
        b"\xF7\xF7\xF7\xF7": SASIndex.row_size_index,
        b"\x00\x00\x00\x00\xF7\xF7\xF7\xF7": SASIndex.row_size_index,
        …

    https://github.com/pandas-dev/pandas/tree/master/pandas/io/sas
  19. Data IO • The following are supported out of the box (plus official subpackages) https://pandas.pydata.org/pandas-docs/stable/io.html

  20. Data IO (between packages) • The Python data analysis ecosystem has grown up around NumPy • Scikit-learn expects NumPy ndarrays as input • When a pandas DataFrame is passed in, it is converted internally to a NumPy ndarray • NumPy defines the API (Array interface)

    NumPy: The Array Interface https://docs.scipy.org/doc/numpy-1.15.1/reference/arrays.interface.html

    Figure: pandas DataFrame → (converted to a NumPy ndarray via the Array Interface) → NumPy ndarray → computation
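
To make the hand-off concrete, here is a minimal sketch of a container that is not an ndarray but can be consumed as one through `__array_interface__`. The `Column` class is hypothetical and is not how pandas is implemented.

```python
import numpy as np


class Column:
    """Hypothetical container that is not an ndarray but can be read as one."""

    def __init__(self, values):
        self._data = np.asarray(values, dtype=np.float64)

    @property
    def __array_interface__(self):
        # Reuse the dict NumPy already publishes for the backing array:
        # shape, typestr, and the address of the data buffer.
        return self._data.__array_interface__


col = Column([1.0, 2.0, 3.0])

# A consumer that expects ndarrays only needs to call np.asarray on its input.
arr = np.asarray(col)
print(type(arr), arr.dtype, arr.sum())
```

A library on the receiving side never needs to know about `Column`; it only relies on the protocol NumPy defines.
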
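  21. Parallel processing • Global Interpreter Lock (GIL) • On the CPython interpreter, multiple threads cannot execute Python code at the same time • The GIL can be released explicitly in Cython • Each package has to handle this operation by operation • PyObject cannot be handled after the release • Separate libraries exist for parallel and distributed processing • Dask: A flexible library for parallel computing in Python • Serialization uses one of several methods, such as pickle, depending on the case

    Understanding the Python GIL http://www.dabeaz.com/GIL/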
  22. Today's topics • The Python data analysis ecosystem • Challenges in data processing • What I expect from Arrow

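  23. Features I expect from Arrow • Consistent missing-value support • Predictable behavior • Query optimization (Gandiva?) • Flexible data appends (Chunk?) • An abstract representation of rows (RowSet?) • Fast IO • Data sharing in parallel processing (Plasma?)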
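  24. The future Arrow brings • Will data analysis packages become things that process the representation defined by Arrow? • Will packages split into those that operate on Arrow data and those that handle IO with the outside world?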

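  25. What Python still has to take care of • That said, probably not all processing can be done in Arrow • Consistency with the Python standard library and de facto standard packages • String handling, date handling, and so on • Object Array • Depending on the goal, will Arrow's representation need to be processed on the Python side?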
  26. ExtensionArray • An interface for mapping arbitrary classes onto an internal representation that is easy to work with (such as Arrow) • The API is being developed as pandas ExtensionArray

    Extension Arrays for pandas https://tomaugspurger.github.io/pandas-extension-arrays.html

    Figure: IP Address values such as '192.168.1.1' and '2001:0db8:85a3…' go through Set and Get on an IP Address ExtensionArray, which stores them as hi: [0, …], lo: [3232235777, …]
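
The hi/lo storage in the figure can be sketched with the standard library alone. The helper names `to_hi_lo` and `from_hi_lo` and the full IPv6 example address are my own assumptions; this shows only the internal representation, not the pandas ExtensionArray API itself.

```python
import ipaddress

MASK64 = (1 << 64) - 1


def to_hi_lo(address):
    """Split an IPv4 or IPv6 address into two unsigned 64-bit integers."""
    value = int(ipaddress.ip_address(address))
    return value >> 64, value & MASK64


def from_hi_lo(hi, lo):
    """Rebuild the address object from its two 64-bit halves."""
    value = (hi << 64) | lo
    # Assumption for this sketch: a value that fits in 32 bits is IPv4.
    if value < (1 << 32):
        return ipaddress.IPv4Address(value)
    return ipaddress.IPv6Address(value)


# The IPv6 address below is an illustrative documentation-range value.
addresses = ["192.168.1.1", "2001:db8:85a3::8a2e:370:7334"]
hi, lo = zip(*(to_hi_lo(a) for a in addresses))
print(hi[0], lo[0])
```

Storing two integer columns instead of Python objects is what lets such an array avoid the Object ndarray costs described earlier.
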
  27. Arrow Integration • pandas wraps PyArrow by way of ExtensionArray • The test code already contains a Bool type implementation built on PyArrow https://github.com/pandas-dev/pandas/blob/master/pandas/tests/extension/arrow/bool.py

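  28. GPU DataFrame • As data sizes balloon, demand for analysis on GPUs is growing • Processing needs to be dispatched to GPU or CPU according to the characteristics of the data (columns) • Should each column be stored on the device that suits it best? • Is bus transfer of metadata and indexers the challenge?

    Figure labels: Columns, Index; Int (CPU), Object (CPU), Int (GPU), Float (GPU)

    RAPIDS: GPU-Accelerated Data Analytics & Machine Learning https://developer.nvidia.com/rapids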
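  29. Summary • Arrow looks set to greatly improve the challenges in data processing • Developing data processing packages is fun • Beyond the core, there is a wide range of work: strings, datetimes, IO, visualization, and more • Start with something simple • Fixing documentation and error messages • Areas that do not use internal APIs, such as visualization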