Slide 1

Slide 1 text

Python and Apache Arrow
Masaaki Horikoshi @ ARISE analytics

Slide 2

Slide 2 text

About me
• Masaaki Horikoshi
• ARISE analytics, Inc.
• Data analysis and the like
• A member of core developers (currently easing back in):
• GitHub: https://github.com/sinhrks

Slide 3

Slide 3 text

Today's topics
• The Python data analysis ecosystem
• Challenges in data processing
• What I expect from Arrow

Slide 4

Slide 4 text

Today's topics
• The Python data analysis ecosystem
• Challenges in data processing
• What I expect from Arrow

Slide 5

Slide 5 text

Current state of the Python data analysis ecosystem
[Figure: ecosystem map. Categories: Visualization, Big Data, IO, Machine Learning, Other Programming Languages, Data Handling. Packages: Bokeh, matplotlib, TensorFlow, PyTables, SQLAlchemy, Ibis, PySpark, pandas, rpy2, Scikit-learn, NumPy, Dask]

Slide 6

Slide 6 text

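NumPy ndarray
• (As a rule) an array of a single type stored in contiguous memory
• Metadata controls how the physical representation is interpreted
[Figure: the physical representation 00000001 00000010 00000011 00000100, with a base array and a view obtained by Slice as two logical representations of it]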
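
The relationship between physical and logical representation can be sketched with a small array. This is a minimal illustration, assuming a 4-byte buffer like the one in the figure; the variable names are chosen here for illustration only.

```python
import numpy as np

# Four bytes in one contiguous buffer: the physical representation.
base = np.array([1, 2, 3, 4], dtype=np.uint8)

# Basic slicing builds new metadata (shape, strides, offset) over the same buffer.
view = base[::2]

# Writing through the view changes the base because the memory is shared.
view[0] = 9

print(view.base is base)             # the view remembers which array owns the data
print(np.shares_memory(base, view))  # both arrays point at the same bytes
print(base.strides, view.strides)    # only the metadata differs
```

Only the strides and shape differ between `base` and `view`; no element is copied by the slice.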

Slide 7

Slide 7 text

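pandas DataFrame
• Developed by Wes McKinney, who is also an Arrow developer (since around 2010)
• Tabular data that can hold a different type in each column
• Internal data is managed in column-oriented blocks
• Each block is backed by a NumPy ndarray
[Figure: a DataFrame with Index and Columns of mixed data types, stored internally as IntBlock, FloatBlock and ObjectBlock; columns may be consolidated per type]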
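
A small sketch of the per-column typing described above, using only public pandas API. The internal blocks themselves are private and are not inspected here; the column names are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "count": [1, 2, 3],          # integer column
    "ratio": [0.1, 0.2, 0.3],    # float column
    "mixed": ["a", 1, None],     # arbitrary Python objects
})

# Each column carries its own dtype ("i" = integer, "f" = float, "O" = object).
kinds = {name: dtype.kind for name, dtype in df.dtypes.items()}

# Each column can be handed out as a NumPy ndarray.
column_arrays = {name: df[name].to_numpy() for name in df.columns}

print(kinds)
print({name: type(arr).__name__ for name, arr in column_arrays.items()})
```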

Slide 8

Slide 8 text

Dask DataFrame
• Runs pandas operations in parallel and in a distributed fashion
• Dynamically schedules and executes a computation graph
[Figure: Blocked Algorithm. Labels: Dask DataFrame, pandas DataFrame, Sum, Concat]
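
The blocked-algorithm idea in the figure can be imitated with plain pandas. This sketch does not use Dask's API and runs sequentially; it only shows that a sum over the whole table equals the sum of per-block sums. The block size of 4 is an arbitrary assumption.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": np.arange(10), "y": np.arange(10) * 2.0})

# Split the table into row-wise blocks; each block is an ordinary pandas DataFrame.
blocks = [df.iloc[start:start + 4] for start in range(0, len(df), 4)]

# Sum: each block is reduced independently (this step could run in parallel).
partial_sums = [block.sum() for block in blocks]

# Concat, then Sum again: combine the partial results into the final answer.
total = pd.concat(partial_sums, axis=1).T.sum()

print(total)
```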

Slide 9

Slide 9 text

Today's topics
• The Python data analysis ecosystem
• Challenges in data processing
• What I expect from Arrow

Slide 10

Slide 10 text

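Challenges in data processing
• NumPy is a very mature numerical computing package
• It also works well as data storage
• However, use cases where it is not necessarily the best fit are becoming apparent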

Slide 11

Slide 11 text

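Limitations of missing values
• NumPy:
  • There is no consistent missing value
  • Some data types (float, datetime, timedelta) have a value that plays the role of a missing value (NaN, NaT)
  • In general, masked arrays are used instead
• pandas:
  • Inherits the limitations of NumPy's missing-value handling
  • Inserting a missing value triggers unintended type conversion (see table)
  • The values must be scanned to find out whether anything is missing

    Original      NA insertion result
    int           float
    float         float
    bool          float
    datetime      datetime
    timedelta     timedelta
    object        object
    categorical   categorical

NEP 12: Missing Data Functionality in NumPy
https://www.numpy.org/neps/nep-0012-missing-data.html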
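
A minimal sketch of the int to float conversion from the table and of the masked-array alternative. It assumes default NumPy-backed dtypes (not pandas' newer nullable dtypes), and `reindex` is just one convenient way to introduce a missing value.

```python
import numpy as np
import pandas as pd

ints = pd.Series([1, 2, 3])

# Label 3 does not exist, so reindex inserts a missing value there.
with_na = ints.reindex([0, 1, 2, 3])
print(ints.dtype, "->", with_na.dtype)   # the integer column is converted to float

# Finding out whether anything is missing requires looking at the values.
has_na = bool(with_na.isna().any())

# NumPy's general answer: keep the data as-is and carry a separate mask.
masked = np.ma.masked_array([1, 2, 3], mask=[False, True, False])
print(masked.dtype, masked.sum())        # stays integer; the masked element is skipped
```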

Slide 12

Slide 12 text

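Copy and View
• It is hard to (fully) predict the behavior of NumPy and pandas
• Defensive copies are needed to guard against unintended overwrites of data
There are basic rules, but depending on the OS/CPU and the dtype, the result can still be unexpected
https://scipy-cookbook.readthedocs.io/items/ViewsVsCopies.html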
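
The basic NumPy rule (basic slicing returns a view, integer-list indexing returns a copy) and a defensive copy can be sketched as follows. This covers only the simple NumPy case; pandas behavior is more involved and is not shown.

```python
import numpy as np

a = np.arange(6)

sliced = a[1:4]        # basic slicing: a view onto a's memory
picked = a[[1, 2, 3]]  # indexing with a list: an independent copy

print(np.shares_memory(a, sliced))   # the slice shares a's buffer
print(np.shares_memory(a, picked))   # the fancy-indexed result does not

# Defensive copy: modifications can no longer leak back into `a`.
safe = a[1:4].copy()
safe[0] = -1
print(a.tolist())
```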

Slide 13

Slide 13 text

Object ndarray
• Slow
• The underlying data may not sit in one contiguous physical region
• Computation runs on the CPython interpreter (GIL, method name resolution, …)
• pandas treats strings as Object dtype (in order to support missing values)
[Figure: ndarray (Primitive) holds PyObject_HEAD, data, nd, … with the values stored inline; ndarray (Object) holds PyObject_HEAD, data, nd, … with pointers to separate PyObject instances, each with its own PyObject_HEAD]
Why Python is Slow: Looking Under the Hood
https://jakevdp.github.io/blog/2014/05/09/why-python-is-slow/
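
A small sketch contrasting a primitive array with an object array holding the same numbers. It only inspects dtypes and element types; no timing is claimed.

```python
import numpy as np

prim = np.array([1, 2, 3], dtype=np.int64)
obj = np.array([1, 2, 3], dtype=object)

# Primitive array: the 8-byte integers live directly in the array's buffer.
print(prim.itemsize, type(prim[0]))

# Object array: the buffer holds pointers; each element is a full Python object.
print(obj.itemsize, type(obj[0]))

# Arithmetic on the object array calls each element's Python-level method
# and yields another object array.
incremented = obj + 1
print(incremented.dtype)
```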

Slide 14

Slide 14 text

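Expression evaluation
• Expressions are evaluated eagerly, one operation at a time
• This incurs overhead, such as allocating memory for intermediate results
Depending on the operation and the platform, NumPy optimizes so that the temporary allocation is not needed
https://github.com/numpy/numpy/pull/7997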
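
The intermediate result in a chained expression can be made visible, and avoided by hand, with the `out=` argument of NumPy ufuncs. A minimal sketch; the array size is arbitrary.

```python
import numpy as np

a = np.ones(1000)
b = np.full(1000, 2.0)
c = np.full(1000, 3.0)

# Evaluated step by step: a * b is computed first, then c is added to it.
result = a * b + c

# The same computation with one explicitly managed output buffer.
out = np.empty_like(a)
np.multiply(a, b, out=out)   # out = a * b
np.add(out, c, out=out)      # out = out + c, reusing the same memory

print(np.array_equal(result, out))
```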

Slide 15

Slide 15 text

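Expression evaluation
• Expressions are evaluated eagerly, one operation at a time
• Unless you can picture the data and what the operations do internally, it is hard to write efficient code
• Separate libraries exist for query optimization
  • NumExpr: Fast numerical array expression evaluator
  • Numba: NumPy aware dynamic Python compiler using LLVM

    df[df['Year'] >= 2018].groupby('Store').sum()['price']

Dask also optimizes specific operations at the computation graph level
http://docs.dask.org/en/latest/optimize.html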
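
Because pandas evaluates the slide's expression step by step, the order of operations matters: summing every column and then picking `price` does work that selecting `price` first avoids. A sketch on a small made-up table (the `qty` column and all values are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "Year": [2017, 2018, 2018, 2019],
    "Store": ["A", "A", "B", "B"],
    "price": [100, 200, 300, 400],
    "qty": [1, 2, 3, 4],
})

recent = df[df["Year"] >= 2018]

# As written on the slide: aggregate all columns, then keep only one.
all_then_pick = recent.groupby("Store").sum()["price"]

# Hand-optimized: select the column first, then aggregate only that column.
pick_then_sum = recent.groupby("Store")["price"].sum()

print(pick_then_sum)
```

Both forms return the same values; a query optimizer would be expected to do this kind of rewrite automatically.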

Slide 16

Slide 16 text

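Appending rows
• Slow
• For every column, memory is reallocated and the values are copied
• So if you append rows one at a time in a loop…
[Figure: a DataFrame (Index, Columns) plus one new row]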
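
A sketch of the slow pattern and a common alternative: collect the rows in a plain Python list and build the DataFrame once. The row contents are made up for illustration.

```python
import pandas as pd

rows = [{"a": i, "b": i * 0.5} for i in range(5)]

# Slow pattern: every iteration builds a new DataFrame and copies all existing rows.
grown = pd.DataFrame([rows[0]])
for row in rows[1:]:
    grown = pd.concat([grown, pd.DataFrame([row])], ignore_index=True)

# Alternative: accumulate in a list, allocate the columns a single time.
built = pd.DataFrame(rows)

print(grown.equals(built))
```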

Slide 17

Slide 17 text

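Applying a function to rows

    df.apply(lambda row: row['a'] + row['b'], axis=1)

• Slow
• Each row is sliced out and turned into a Series (a single column of data)
• Every Series creation incurs memory allocation and type inference
• Taking a value out of the Series requires looking up the position from the label
• The dtype of the overall result is inferred from the per-row results
• The result is non-deterministic
[Figure: a DataFrame with Index and Columns, processed row by row]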
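
For the specific function on this slide, the row-wise `apply` can be replaced by a column-wise expression that never builds a per-row Series. A minimal sketch with made-up values:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# Row-wise: one temporary Series per row, label lookups inside the lambda.
by_row = df.apply(lambda row: row["a"] + row["b"], axis=1)

# Column-wise: a single vectorized addition over whole columns.
by_column = df["a"] + df["b"]

print(by_row.tolist(), by_column.tolist())
```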

Slide 18

Slide 18 text

Data IO
• External packages
  • Slow if implemented in Python
• In-house implementation
  • Painful to maintain

    magic = (b"\x00\x00\x00\x00\x00\x00\x00\x00" +
             b"\x00\x00\x00\x00\xc2\xea\x81\x60" +
             b"\xb3\x14\x11\xcf\xbd\x92\x08\x00" +
             b"\x09\xc7\x31\x8c\x18\x1f\x10\x11")

    class SASIndex(object):
        row_size_index = 0
        column_size_index = 1
        subheader_counts_index = 2
        column_text_index = 3
        column_name_index = 4
        …

    subheader_signature_to_index = {
        b"\xF7\xF7\xF7\xF7": SASIndex.row_size_index,
        b"\x00\x00\x00\x00\xF7\xF7\xF7\xF7": SASIndex.row_size_index,
        …

https://github.com/pandas-dev/pandas/tree/master/pandas/io/sas

Slide 19

Slide 19 text

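Data IO
• The following are supported out of the box (plus official subpackages)
[Figure: list of supported IO formats; not recoverable from the extracted text]
https://pandas.pydata.org/pandas-docs/stable/io.html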
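
As one concrete instance of the built-in IO, here is a CSV round trip through an in-memory buffer. CSV is only one of the formats pandas supports, and the table contents are made up.

```python
import io

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

# Write to an in-memory text buffer instead of a file on disk.
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Rewind and read the same text back into a new DataFrame.
buffer.seek(0)
restored = pd.read_csv(buffer)

print(restored)
```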

Slide 20

Slide 20 text

Data IO (between packages)
• The Python data analysis ecosystem has grown up around NumPy
• Scikit-learn expects a NumPy ndarray as input
• When a pandas DataFrame is passed in, it is converted to a NumPy ndarray internally
• NumPy defines the API (Array interface)
NumPy: The Array Interface
https://docs.scipy.org/doc/numpy-1.15.1/reference/arrays.interface.html
[Figure: pandas DataFrame → converted to a NumPy ndarray via the Array Interface → NumPy ndarray → computation]
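
A sketch of how a non-NumPy container can take part in this exchange by exposing `__array_interface__`. The `Readings` class is hypothetical and simply delegates to an ndarray it holds; real packages typically describe their own buffers here.

```python
import numpy as np


class Readings:
    """Hypothetical container that is not an ndarray but exposes the array interface."""

    def __init__(self, values):
        self._data = np.asarray(values, dtype=np.float64)

    @property
    def __array_interface__(self):
        # Describe our buffer (shape, typestr, data pointer) to any NumPy consumer.
        return self._data.__array_interface__


readings = Readings([20.5, 21.0, 19.5])

# NumPy functions accept the object and convert it without knowing its class.
as_array = np.asarray(readings)
mean = float(np.mean(readings))

print(type(as_array).__name__, as_array.dtype, mean)
```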

Slide 21

Slide 21 text

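Parallel processing
• Global Interpreter Lock (GIL)
  • Multiple threads cannot run at the same time on the CPython interpreter
  • It can be released explicitly in Cython
    • Each package has to handle this operation by operation
    • PyObject cannot be touched after the release
• Separate libraries exist for parallel and distributed processing
  • Dask: A flexible library for parallel computing in Python
  • For serialization, several methods such as pickle are used depending on the case
Understanding the Python GIL
http://www.dabeaz.com/GIL/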
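
Moving data between processes means serializing it. A minimal pickle round trip shows the consequence: the receiver gets an equal but separate copy, not shared memory. This sketch stays inside one process and only illustrates the serialization step.

```python
import pickle

import numpy as np

arr = np.arange(5)

# Serialize to bytes, as would happen before sending data to another process.
payload = pickle.dumps(arr)

# Deserialize: the values are equal, but the memory is a fresh copy.
restored = pickle.loads(payload)

print(np.array_equal(arr, restored), np.shares_memory(arr, restored))
```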

Slide 22

Slide 22 text

Today's topics
• The Python data analysis ecosystem
• Challenges in data processing
• What I expect from Arrow

Slide 23

Slide 23 text

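Features I expect from Arrow
• Consistent support for missing values
• Predictable behavior
• Query optimization (Gandiva?)
• Flexible appending of data (Chunk?)
• An abstract representation of rows (RowSet?)
• Fast IO
• Data sharing in parallel processing (Plasma?)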

Slide 24

Slide 24 text

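The future Arrow brings
• Will data analysis packages become tools that process the representation defined by Arrow?
• Will they split into packages that operate on Arrow and packages that handle IO with the outside world?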

Slide 25

Slide 25 text

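What Python still has to work on
• That said, surely not every operation can be done in Arrow
• Consistency with the Python standard library and de facto standard packages
  • String processing, date processing, and so on
  • Object Array
• Depending on what you want to do, will Arrow's representation need to be processed in Python?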

Slide 26

Slide 26 text

ExtensionArray
• An interface for mapping an arbitrary class onto an internal representation that is easy to handle (such as Arrow)
• The API is being developed as pandas ExtensionArray
Extension Arrays for pandas
https://tomaugspurger.github.io/pandas-extension-arrays.html
[Figure: IP Address values such as '192.168.1.1' and '2001:0db8:85a3…' pass through Set/Get on an IP Address ExtensionArray, which stores them internally as hi: [0, …], lo: [3232235777, …]]
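
The hi/lo representation in the figure can be reproduced with the standard library alone. This sketch shows only the mapping between an address and two 64-bit integers, not the pandas ExtensionArray interface itself; the function names and the IPv6 sample address are chosen here for illustration.

```python
import ipaddress

MASK_64 = (1 << 64) - 1


def to_hi_lo(address):
    """Split an IPv4/IPv6 address into (high 64 bits, low 64 bits)."""
    value = int(ipaddress.ip_address(address))
    return value >> 64, value & MASK_64


def from_hi_lo(hi, lo):
    """Rebuild the textual address; integers below 2**32 are read as IPv4."""
    return str(ipaddress.ip_address((hi << 64) | lo))


# "Set": store the addresses as two integer columns.
addresses = ["192.168.1.1", "2001:db8::1"]
hi, lo = zip(*(to_hi_lo(a) for a in addresses))
print(hi, lo)

# "Get": convert the integers back into address strings.
print([from_hi_lo(h, l) for h, l in zip(hi, lo)])
```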

Slide 27

Slide 27 text

Arrow Integration
• In pandas, PyArrow is wrapped via ExtensionArray
• The test code already contains a Bool type implementation built on PyArrow
https://github.com/pandas-dev/pandas/blob/master/pandas/tests/extension/arrow/bool.py

Slide 28

Slide 28 text

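GPU DataFrame
• As data sizes balloon, the need for analysis on GPUs is growing
• Processing has to be dispatched to the GPU or CPU according to the characteristics of the data (columns)
  • Should each column be stored on the device best suited to it?
  • Is the bus transfer of metadata and indexers a problem?
[Figure: a DataFrame (Index, Columns) whose columns live on different devices: Int on CPU, Object on CPU, Int on GPU, Float on GPU]
RAPIDS: GPU-Accelerated Data Analytics & Machine Learning
https://developer.nvidia.com/rapids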

Slide 29

Slide 29 text

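Summary
• Arrow looks likely to greatly improve the challenges in data processing
• Developing data processing packages is fun
  • Beyond the core, there is a wide range of work: strings, datetimes, IO, visualization, and more
• Start with something simple
  • Documentation and error message fixes
  • Areas that do not use internal APIs, such as visualization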