Python and Apache Arrow

Sinhrks
December 08, 2018

@Apache Arrow Tokyo Meetup 2018
https://speee.connpass.com/event/103514/

Transcript

  1. 2.

    Self-introduction
    • Masaaki Horikoshi
    • ARISE analytics Inc.
    • Data analysis and the like
    • A member of core developers (currently getting back up to speed)
    • GitHub: https://github.com/sinhrks
  2. 5.

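    Current state of the Python data analysis ecosystem
    [Figure: ecosystem map. Packages shown: Bokeh, matplotlib, TensorFlow, PyTables, SQLAlchemy, Ibis, PySpark, pandas, rpy2, Scikit-learn, NumPy, Dask. Categories shown: Visualization, Big Data, IO, Machine Learning, Other Programming Languages, Data Handling]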
  3. 7.

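    pandas DataFrame
    • Developed by Wes McKinney, who is also an Arrow developer (since around 2010)
    • Tabular data in which each column can have its own type
    • Internally, the data is managed in column-oriented blocks
    • Each block is backed by a NumPy ndarray
    [Figure: a DataFrame with Columns, Index, and mixed data types is stored as an IntBlock, a FloatBlock, and an ObjectBlock; columns may be consolidated per type]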
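As a rough illustration of the per-column typing described on this slide, the sketch below builds a small DataFrame and checks that each column keeps its own dtype and that a column's values come back as a NumPy ndarray. The column names and values are made up for the example, and it uses only public pandas API rather than inspecting the internal block manager.

```python
import numpy as np
import pandas as pd

# Hypothetical example data: three columns with three different dtypes.
df = pd.DataFrame({
    "count": np.array([1, 2, 3], dtype="int64"),
    "ratio": np.array([0.1, 0.2, 0.3], dtype="float64"),
    "flag": np.array([True, False, True], dtype="bool"),
})

# Each column carries its own dtype.
column_dtypes = {name: str(dtype) for name, dtype in df.dtypes.items()}
print(column_dtypes)

# The values of a single column are exposed as a NumPy ndarray.
count_values = df["count"].to_numpy()
print(type(count_values), count_values.dtype)
```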
  4. 11.

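    Constraints on missing values
    • NumPy:
    • There is no consistent missing-value representation
    • Some data types (float, datetime, timedelta) have a value that stands in for a missing value (NaN, NaT)
    • In general, missing data is handled with masked arrays
    • pandas:
    • Inherits the constraints of NumPy's missing-value handling
    • Inserting a missing value can trigger an unintended type conversion (see the table)
    • Checking whether anything is missing requires scanning the values
    NEP 12 — Missing Data Functionality in NumPy https://www.numpy.org/neps/nep-0012-missing-data.html

    Original      NA insertion result
    int           float
    float         float
    bool          float
    datetime      datetime
    timedelta     timedelta
    object        object
    categorical   categorical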
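The type conversion on NA insertion can be reproduced with a short sketch. The example below is an illustration with made-up values, not code from the talk: it inserts a missing entry into an integer Series and a datetime Series via `reindex`, then shows a NumPy masked array as the general-purpose alternative.

```python
import numpy as np
import pandas as pd

ints = pd.Series([1, 2, 3])                 # dtype: int64
ints_with_gap = ints.reindex([0, 1, 2, 3])  # label 3 does not exist, so NaN is inserted
print(ints.dtype, "->", ints_with_gap.dtype)

dates = pd.Series(pd.to_datetime(["2018-12-08", "2018-12-09"]))
dates_with_gap = dates.reindex([0, 1, 2])   # the missing entry becomes NaT, dtype is kept
print(dates.dtype, "->", dates_with_gap.dtype)

# Finding out whether anything is missing means scanning the values.
has_missing = bool(ints_with_gap.isna().any())

# NumPy's general-purpose answer: a separate boolean mask next to the data.
masked = np.ma.masked_array([1, 2, 3], mask=[False, True, False])
print(masked.dtype, masked.sum())
```

The integer column silently becomes float because NaN is a floating-point value, while the masked array keeps its integer dtype at the cost of carrying a second array.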
  5. 13.

    Object ndarray
    • Slow
    • The underlying objects may not sit in a contiguous memory region
    • Computation runs on the CPython interpreter (GIL, method name resolution, …)
    • pandas handles strings as the object type (to support missing values)
    [Figure: a primitive ndarray (PyObject_HEAD, data, nd, …) points to a contiguous buffer of values; an object ndarray (PyObject_HEAD, data, nd, …) points to an array of PyObject pointers, each referring to a separate PyObject (PyObject_HEAD, …)]
    Why Python is Slow: Looking Under the Hood https://jakevdp.github.io/blog/2014/05/09/why-python-is-slow/
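To make the difference between a primitive ndarray and an object ndarray concrete, here is a minimal sketch (the array contents are arbitrary). It only shows what the elements are, not a benchmark: the primitive array yields NumPy scalars read from one buffer, while the object array holds references to ordinary Python `int` objects.

```python
import numpy as np

primitive = np.arange(5, dtype=np.int64)  # one contiguous buffer of 8-byte integers
boxed = primitive.astype(object)          # array of references to Python int objects

# Elements of the object array are plain Python objects.
print(type(primitive[0]), type(boxed[0]))

# Both give the same answer, but the object array adds Python objects one by one.
print(primitive.sum(), boxed.sum())
```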
  6. 15.

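    Expression evaluation
    • Expressions are evaluated eagerly, one step at a time
    • Writing efficient code is hard unless you can picture the data and what each operation does
    • Separate libraries exist for query optimization
    • NumExpr: Fast numerical array expression evaluator
    • Numba: NumPy aware dynamic Python compiler using LLVM
    df[df['Year'] >= 2018].groupby('Store').sum()['price']
    Dask also optimizes certain operations at the computation-graph level http://docs.dask.org/en/latest/optimize.html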
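The expression on this slide can be used to show what eager evaluation costs. In the sketch below (the data and the extra `qty` column are invented for the example), the first form sums every column and only then picks `price`; the second form selects `price` before aggregating, which is the kind of rewrite a query optimizer could apply automatically but which the user has to do by hand in pandas.

```python
import pandas as pd

# Hypothetical sales data for the example.
df = pd.DataFrame({
    "Year": [2017, 2018, 2018, 2019],
    "Store": ["A", "A", "B", "B"],
    "price": [100, 200, 300, 400],
    "qty": [1, 2, 3, 4],
})

# As written on the slide: every remaining column is summed, then one is kept.
eager = df[df["Year"] >= 2018].groupby("Store").sum()["price"]

# Hand-optimized: select the needed column first, then aggregate only that.
narrowed = df[df["Year"] >= 2018].groupby("Store")["price"].sum()

print(eager.to_dict())
print(narrowed.to_dict())
```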
  7. 17.

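    Applying a function to rows
    df.apply(lambda row: row['a'] + row['b'], axis=1)
    • Slow
    • Each row is sliced out one at a time into a Series (one-dimensional data)
    • Every Series creation involves memory allocation and type inference
    • Taking a value out of a Series requires a lookup from label to position
    • The type of the overall result is inferred from the per-row results
    • The outcome is not deterministic
    [Figure: a DataFrame with Columns and Index, processed one row at a time]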
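For comparison, the row-wise call from this slide can be set beside a column-wise version. This is a small sketch with made-up data; both produce the same values, but the second operates on whole columns and avoids building a Series per row.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# Row-wise: one Series is created per row and the lambda is called for each.
row_wise = df.apply(lambda row: row["a"] + row["b"], axis=1)

# Column-wise: a single vectorized addition over the two columns.
vectorized = df["a"] + df["b"]

print(row_wise.tolist())
print(vectorized.tolist())
```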
  8. 18.

    Data IO
    • External packages
    • Slow when implemented in Python
    • In-house implementations
    • Maintenance is painful

    magic = (b"\x00\x00\x00\x00\x00\x00\x00\x00" +
             b"\x00\x00\x00\x00\xc2\xea\x81\x60" +
             b"\xb3\x14\x11\xcf\xbd\x92\x08\x00" +
             b"\x09\xc7\x31\x8c\x18\x1f\x10\x11")

    class SASIndex(object):
        row_size_index = 0
        column_size_index = 1
        subheader_counts_index = 2
        column_text_index = 3
        column_name_index = 4
        …

    subheader_signature_to_index = {
        b"\xF7\xF7\xF7\xF7": SASIndex.row_size_index,
        b"\x00\x00\x00\x00\xF7\xF7\xF7\xF7": SASIndex.row_size_index,
        …

    https://github.com/pandas-dev/pandas/tree/master/pandas/io/sas
  9. 20.

    Data IO (between packages)
    • The Python data analysis ecosystem has grown up around NumPy
    • Scikit-learn expects a NumPy ndarray as input
    • When a pandas DataFrame is passed in, it is converted to a NumPy ndarray internally
    • NumPy defines the API (Array interface)
    NumPy: The Array Interface https://docs.scipy.org/doc/numpy-1.15.1/reference/arrays.interface.html
    [Figure: pandas DataFrame -> NumPy ndarray (converted via the Array Interface) -> computation]
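The conversion step can be sketched as follows. This is an illustration with invented data, assuming that a consumer library calls something like `np.asarray` on its input; pandas supports this through its `__array__` method. Note how the mixed int and float columns end up in one float64 array.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": [0.5, 1.5, 2.5]})

# What a NumPy-based consumer would typically do with its input.
features = np.asarray(df)

print(type(features), features.dtype, features.shape)
print(hasattr(df, "__array__"))
```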
  10. 21.

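    Parallel processing
    • Global Interpreter Lock (GIL)
    • The CPython interpreter cannot run multiple threads at the same time
    • The GIL can be released explicitly in Cython
    • Each package has to handle this operation by operation
    • PyObject cannot be handled after the GIL is released
    • Separate libraries exist for parallel and distributed processing
    • Dask: A flexible library for parallel computing in Python
    • Serialization switches among several formats, such as pickle
    Understanding the Python GIL http://www.dabeaz.com/GIL/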
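A thread-based split over NumPy data might look like the sketch below. It is only an illustration: the helper name `chunk_sum`, the chunk count, and the data are assumptions, and whether the threads actually overlap depends on whether the called routine releases the GIL (many NumPy operations on non-object arrays are understood to do so, but this sketch does not measure it).

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

data = np.arange(1_000_000, dtype=np.float64)
chunks = np.array_split(data, 4)  # four roughly equal pieces


def chunk_sum(chunk):
    """Hypothetical worker: reduce one chunk to a single Python float."""
    return float(np.sum(chunk))


# Each chunk is reduced in its own thread; results come back in input order.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_sums = list(pool.map(chunk_sum, chunks))

total = sum(partial_sums)
print(total)
```

With object arrays the same code would still run, but each addition would need the interpreter and therefore the GIL.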
  11. 23.

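    Features expected from Arrow
    • Consistent support for missing values
    • Predictable behavior
    • Query optimization (Gandiva?)
    • Flexible appending of data (Chunk?)
    • An abstract representation of rows (RowSet?)
    • Fast IO
    • Data sharing across parallel processing (Plasma?)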
  12. 26.

    ExtensionArray
    • An interface for mapping an arbitrary class onto an internal representation that is easy to work with (such as Arrow)
    • The API is being put in place as pandas ExtensionArray
    Extension Arrays for pandas https://tomaugspurger.github.io/pandas-extension-arrays.html
    [Figure: IP Address ExtensionArray. Set and Get map IP Address values such as '192.168.1.1' and '2001:0db8:85a3…' to and from the internal representation hi: [0, …], lo: [3232235777, …]]
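The hi/lo representation in the figure can be reproduced with the standard library alone. The sketch below is a simplified illustration, not the ExtensionArray implementation: `split_hi_lo` and `join_hi_lo` are hypothetical helper names, and the IPv6 address is an example chosen here rather than the truncated one on the slide. Each address is treated as a 128-bit integer and split into two 64-bit halves.

```python
import ipaddress

MASK_64 = 0xFFFFFFFFFFFFFFFF


def split_hi_lo(address):
    """Hypothetical helper: return the (hi, lo) 64-bit halves of an IP address."""
    value = int(ipaddress.ip_address(address))
    return value >> 64, value & MASK_64


def join_hi_lo(hi, lo):
    """Hypothetical helper: rebuild an address object from its two halves.

    Caveat: integers below 2**32 are interpreted as IPv4 by ip_address().
    """
    return ipaddress.ip_address((hi << 64) | lo)


print(split_hi_lo("192.168.1.1"))  # IPv4 fits entirely in the low half
print(split_hi_lo("2001:db8::1"))  # example IPv6 address, uses both halves
print(join_hi_lo(0, 3232235777))
```

Storing the two halves in separate integer arrays is what lets the values live in fixed-width, contiguous memory instead of an object array.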
  13. 28.

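    GPU DataFrame
    • As data sizes grow, so does the demand for analysis on GPUs
    • Processing needs to be dispatched to the GPU or CPU depending on the characteristics of the data (columns)
    • Should each column be stored on the device best suited to it?
    • Is bus transfer of metadata and indexers the bottleneck?
    [Figure: a DataFrame with Columns and Index whose columns are placed on different devices: Int (CPU), Object (CPU), Int (GPU), Float (GPU)]
    RAPIDS: GPU-Accelerated Data Analytics & Machine Learning https://developer.nvidia.com/rapids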