Python and Apache Arrow
Sinhrks
December 08, 2018
@ Apache Arrow Tokyo Meetup 2018
https://speee.connpass.com/event/103514/
Transcript
Python and Apache Arrow
Masaaki Horikoshi @ ARISE analytics

Self-introduction
• Masaaki Horikoshi
• ARISE analytics, Inc.
• Data analysis and related work
• A member of core developers (currently easing back in)
• GitHub: https://github.com/sinhrks

Today's topics
• The Python data analysis ecosystem
• Challenges in data processing
• What to expect from Arrow
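
The current state of the Python data analysis ecosystem
[Diagram: packages arranged by area]
• Areas: Visualization, Big Data, IO, Machine Learning, Other Programming Languages, Data Handling
• Packages: Bokeh, matplotlib, TensorFlow, PyTables, SQLAlchemy, Ibis, PySpark, pandas, rpy, Scikit-learn, NumPy, Dask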
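
NumPy ndarray
• An array in contiguous memory that (in principle) holds a single type
• Metadata controls how the physical representation is interpreted
[Diagram: physical representation 00000001 00000010 00000011 00000100; logical representation as a base array and as a view created by a slice]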
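
The relationship between one physical buffer and several logical arrays can be sketched as follows. This is a minimal illustration, assuming four int8 values to match the bit pattern on the slide; the variable names are chosen for this example only.

```python
import numpy as np

# Four int8 values stored contiguously (1, 2, 3, 4), as in the slide's bit pattern.
base = np.array([1, 2, 3, 4], dtype=np.int8)

# Basic slicing creates a view: new metadata (shape, strides) over the same buffer.
view = base[::2]

print(base.strides, view.strides)  # step in bytes between consecutive elements
print(view.base is base)           # the view records which array owns the memory

# Writing through the view also changes the base array.
view[0] = 10
print(base.tobytes())              # raw physical bytes after the write
```

Only the metadata differs between `base` and `view`; no element data is copied when the slice is taken.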
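
pandas DataFrame
• Developed by Wes McKinney, an Arrow developer (since around 2008)
• Tabular data that can hold a different type in each column
• Internally, data is managed in column-oriented blocks
• Each block is backed by a NumPy ndarray
[Diagram: a DataFrame with an Index and Columns of mixed data types, stored internally as IntBlock, FloatBlock, and ObjectBlock; columns may be consolidated per type]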
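
Dask DataFrame
• Runs pandas operations in parallel and distributed fashion
• Dynamically schedules and executes a computation graph
[Diagram: Blocked Algorithm. A Dask DataFrame composed of pandas DataFrames, with Sum, Concat, and Sum nodes in the computation graph]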
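
The idea of a blocked algorithm can be sketched with plain pandas, without Dask itself. This is only an illustration under assumptions: the partition size of 4 and the column name `x` are made up for the example, and a real Dask graph would schedule the per-partition steps in parallel.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": np.arange(10)})

# Split the frame into row-wise partitions; each one is an ordinary pandas DataFrame.
partitions = [df.iloc[i:i + 4] for i in range(0, len(df), 4)]

# Step 1: compute a partial sum per partition (independent, so parallelizable).
partials = [part["x"].sum() for part in partitions]

# Step 2: gather the partial results and reduce them once more.
total = pd.Series(partials).sum()

print(len(partitions), total)
```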
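
Challenges in data processing
• NumPy is a very high-performance computation package
• It also works well as data storage
• However, use cases where it is not necessarily optimal are becoming apparent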
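
Constraints on missing values
• NumPy:
  • No consistent missing value
  • Only some data types (float, datetime, timedelta) have a value equivalent to missing (NaN, NaT)
  • Generally handled with masked arrays
• pandas:
  • Inherits the constraints of NumPy's missing-value support
  • Inserting a missing value triggers unintended type conversion (see table)
  • Checking whether missing values exist requires scanning the values
NEP 12: Missing Data Functionality in NumPy
https://www.numpy.org/neps/nep-0012-missing-data.html

Original → NA insertion result
int → float
float → float
bool → float
datetime → datetime
timedelta → timedelta
object → object
categorical → categorical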
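
The int-to-float conversion and the masked-array workaround can be reproduced in a few lines. This is a minimal sketch; the sample values and the use of `reindex` to introduce a missing entry are choices made for this example.

```python
import numpy as np
import pandas as pd

ints = pd.Series([1, 2, 3])

# Label 3 does not exist, so reindexing inserts NaN, which forces a float dtype.
with_na = ints.reindex([0, 1, 2, 3])
print(ints.dtype, "->", with_na.dtype)

# Finding out whether anything is missing means scanning the values.
has_na = bool(with_na.isna().any())
print(has_na)

# NumPy's general answer: keep a boolean mask next to the data.
masked = np.ma.masked_array([1, 2, 3], mask=[False, True, False])
print(masked.dtype, masked.sum())  # the masked element is skipped in the sum
```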
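
Copy and View
• The behavior of NumPy and pandas is hard to predict (completely)
• Defensive copies are needed to guard against unintended overwrites of data
Basic rules exist, but results can still be unintended depending on the OS/CPU type
https://scipy-cookbook.readthedocs.io/items/ViewsVsCopies.html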
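
One of the basic rules, shown for NumPy only: basic slicing returns a view, while indexing with an integer list returns a copy. The array contents below are arbitrary example values.

```python
import numpy as np

a = np.arange(6)

sliced = a[1:4]        # basic slicing: a view onto a's memory
picked = a[[1, 2, 3]]  # integer-array indexing: an independent copy

sliced[0] = 100        # also modifies a[1]
picked[1] = -1         # leaves a untouched

# A defensive copy makes the intent explicit, whatever the indexing style.
safe = a[1:4].copy()
safe[0] = 0

print(a)
```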

Object ndarray
• Slow
• The actual data may not sit in a contiguous physical region
• Computation runs on the CPython interpreter (GIL, method name resolution, …)
• pandas handles strings as the Object type (to support missing values)
[Diagram: a primitive ndarray (PyObject_HEAD, data, nd, …) whose data buffer holds the values directly, versus an object ndarray (PyObject_HEAD, data, nd, …) whose data buffer points to separate PyObject instances (PyObject_HEAD, …)]
Why Python is Slow: Looking Under the Hood
https://jakevdp.github.io/blog/2014/05/09/why-python-is-slow/
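
The difference between the two layouts is visible from Python. A small sketch with made-up values:

```python
import numpy as np

primitive = np.array([1, 2, 3], dtype=np.int64)
boxed = np.array([1, 2, 3], dtype=object)

print(primitive.itemsize)  # bytes per value stored directly in the buffer
print(boxed.itemsize)      # bytes per pointer; the values live elsewhere

# Elements of an object array are ordinary Python objects,
# so arithmetic on them goes through the interpreter.
print(type(primitive[0]), type(boxed[0]))

# An object array can mix strings with a missing marker such as None.
strings = np.array(["a", None, "c"], dtype=object)
print(strings)
```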
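
Expression evaluation
• Expressions are evaluated step by step
• This incurs overhead such as allocating memory for intermediate results
Depending on the operation and platform, evaluation is optimized so that no temporary allocation is needed
https://github.com/numpy/numpy/pull/7997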
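
To make the intermediate result concrete, here is a sketch that computes the same expression twice: once as written, and once with a single preallocated buffer through the `out=` argument of NumPy ufuncs. Array sizes and values are arbitrary.

```python
import numpy as np

a = np.ones(1_000)
b = np.full(1_000, 2.0)
c = np.full(1_000, 3.0)

# Evaluated step by step: (a + b) is materialized before the multiplication.
result = (a + b) * c

# Same computation, writing every step into one preallocated buffer.
out = np.empty_like(a)
np.add(a, b, out=out)
np.multiply(out, c, out=out)

print(np.array_equal(result, out))
```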
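
Expression evaluation
• Expressions are evaluated step by step
• Writing efficient code is hard without a picture of what the data and the operations look like internally
• Separate libraries exist for query optimization
  • NumExpr: Fast numerical array expression evaluator
  • Numba: NumPy aware dynamic Python compiler using LLVM
df[df['Year'] >= 2018].groupby('Store').sum()['price']
Dask optimizes specific operations at the computation-graph level
http://docs.dask.org/en/latest/optimize.html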
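
Because pandas runs each step as written, the order of operations is left to the author. The sketch below runs the slide's query on made-up data and compares it with a variant that selects the column before aggregating, so that only `price` is summed. The sample rows are an assumption for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Year": [2017, 2018, 2018, 2019],
    "Store": ["A", "A", "B", "B"],
    "price": [100, 200, 300, 400],
})

# As on the slide: every numeric column is summed, then one column is picked.
sum_then_pick = df[df["Year"] >= 2018].groupby("Store").sum()["price"]

# Hand-optimized order: pick the column first, then sum only that column.
pick_then_sum = df[df["Year"] >= 2018].groupby("Store")["price"].sum()

print(pick_then_sum)
```

A query optimizer could perform this rewrite automatically; with eager evaluation it has to be done by hand.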
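
Appending rows
• Slow
• Each column needs memory reallocation and a copy of its values
• Appending rows one at a time in a loop …
[Diagram: a DataFrame (Index, Columns) plus a new row (Index, Columns)]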
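
A sketch of the pattern and a common alternative, using `pd.concat` for the row-by-row version. The row contents are made up for this example.

```python
import pandas as pd

rows = [{"a": i, "b": i * 2} for i in range(5)]

# Growing the frame one row at a time: every concat builds a new frame,
# so the existing column data is copied on each iteration.
grown = pd.DataFrame([rows[0]])
for row in rows[1:]:
    grown = pd.concat([grown, pd.DataFrame([row])], ignore_index=True)

# Collecting the rows first and building the frame once avoids the repeated copies.
built_once = pd.DataFrame(rows)

print(grown.equals(built_once))
```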
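
Applying a function to rows
df.apply(lambda row: row['a'] + row['b'], axis=1)
• Slow
• Each row is sliced out to create a Series (one-dimensional data)
• Every Series creation allocates memory and infers the type
• Getting a value out of the Series requires looking up the position from the label
• The type of the overall result is inferred from the per-row results
• The result is non-deterministic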
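
For the specific function on the slide, the per-row overhead can be avoided by operating on whole columns. A minimal sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# Row-wise: one Series is built per row and the lambda runs in the interpreter.
row_wise = df.apply(lambda row: row["a"] + row["b"], axis=1)

# Column-wise: a single vectorized addition over the underlying arrays.
vectorized = df["a"] + df["b"]

print(row_wise.tolist(), vectorized.tolist())
```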

Data IO
• External packages
  • Slow when implemented in Python
• In-house implementations
  • Painful to maintain

    magic = (b"\x00\x00\x00\x00\x00\x00\x00\x00" +
             b"\x00\x00\x00\x00\xc2\xea\x81\x60" +
             b"\xb3\x14\x11\xcf\xbd\x92\x08\x00" +
             b"\x09\xc7\x31\x8c\x18\x1f\x10\x11")

    class SASIndex(object):
        row_size_index = 0
        column_size_index = 1
        subheader_counts_index = 2
        column_text_index = 3
        column_name_index = 4
        …

    subheader_signature_to_index = {
        b"\xF7\xF7\xF7\xF7": SASIndex.row_size_index,
        b"\x00\x00\x00\x00\xF7\xF7\xF7\xF7": SASIndex.row_size_index,
        …

https://github.com/pandas-dev/pandas/tree/master/pandas/io/sas

Data IO
• The following are supported as standard (plus official subpackages)
https://pandas.pydata.org/pandas-docs/stable/io.html

Data IO (between packages)
• The Python data analysis ecosystem has grown around NumPy
• Scikit-learn expects NumPy ndarray as input
• When a pandas DataFrame is passed in, it is converted internally to a NumPy ndarray
• NumPy defines the API (Array interface)
NumPy: The Array Interface
https://docs.scipy.org/doc/numpy-1.15.1/reference/arrays.interface.html
[Diagram: pandas DataFrame → converted to NumPy ndarray via the Array Interface → computation]
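
The conversion step can be illustrated with the `__array__` hook, a related conversion protocol that `np.asarray` consults; note that this sketch does not use the `__array_interface__` attribute described on the linked page. The `Temperatures` class is hypothetical and exists only for this example.

```python
import numpy as np
import pandas as pd


class Temperatures:
    """Hypothetical container used only for this illustration."""

    def __init__(self, values):
        self._values = list(values)

    def __array__(self, dtype=None, copy=None):
        # NumPy calls this hook when asked to convert the object to an ndarray.
        return np.array(self._values, dtype=dtype)


# Any object that implements the hook can be handed to NumPy-based code.
arr = np.asarray(Temperatures([20.5, 21.0]))

# A DataFrame converts the same way; mixed int/float columns become one float array.
frame_as_array = np.asarray(pd.DataFrame({"x": [1, 2], "y": [3.0, 4.0]}))

print(arr, frame_as_array.dtype, frame_as_array.shape)
```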
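
Parallel processing
• Global Interpreter Lock (GIL)
  • Multiple threads cannot execute simultaneously on the CPython interpreter
  • It can be released explicitly in Cython
    • Each package has to handle this per operation
    • PyObject cannot be touched after the release
• Separate libraries exist for parallel and distributed processing
  • Dask: A flexible library for parallel computing in Python
  • Serialization switches among several methods such as pickle
Understanding the Python GIL
http://www.dabeaz.com/GIL/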
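
Why serialization matters for process-based parallelism can be sketched with `pickle` alone. No worker process is started here; the round trip below only stands in for what would be sent between processes.

```python
import pickle

import numpy as np

data = np.arange(5)

# Bytes that could be shipped to another process.
payload = pickle.dumps(data)

# The receiving side rebuilds its own array from those bytes.
received = pickle.loads(payload)

print(np.array_equal(data, received))
print(np.shares_memory(data, received))  # the rebuilt array is a separate copy
```

Each transfer therefore costs a serialization pass and a full copy of the data.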
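
Features expected from Arrow
• Consistent missing-value support
• Predictable behavior
• Query optimization (Gandiva?)
• Flexible data appending (Chunk?)
• An abstract representation of rows (RowSet?)
• Fast IO
• Data sharing in parallel processing (Plasma?)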
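
The future Arrow brings
• Will data analysis packages become things that process the representation defined by Arrow?
• Will packages split into those that operate on Arrow and those that handle IO with the outside world?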
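
What Python still has to work on
• That said, not all processing can be done in Arrow
• Consistency with the Python standard library and de facto standard packages
  • String processing, datetime processing, and so on
• Object Array
• Depending on the task, will the Arrow representation need to be processed in Python?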

ExtensionArray
• An interface for mapping an arbitrary class to an internal representation that is easy to handle (such as Arrow)
• pandas is building out this API as ExtensionArray
Extension Arrays for pandas
https://tomaugspurger.github.io/pandas-extension-arrays.html
[Diagram: IP Address values '192.168.1.1', '2001:0db8:85a3…' are Set into and Get from an IP Address ExtensionArray, which stores them as hi: [0, …], lo: [3232235777, …]]
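
The hi/lo representation in the diagram can be sketched with the standard library: each address is treated as a 128-bit integer and split into two 64-bit halves. The helper `to_hi_lo` and the IPv6 address `2001:db8::1` are hypothetical, introduced only for this example; the slide's own IPv6 address is truncated and is not reproduced here.

```python
import ipaddress

MASK_64 = 0xFFFFFFFFFFFFFFFF


def to_hi_lo(address):
    """Split an IPv4 or IPv6 address into (hi, lo) unsigned 64-bit integers."""
    value = int(ipaddress.ip_address(address))
    return value >> 64, value & MASK_64


addresses = ["192.168.1.1", "2001:db8::1"]
pairs = [to_hi_lo(a) for a in addresses]

# Column-oriented storage: one list for the high halves, one for the low halves.
hi = [p[0] for p in pairs]
lo = [p[1] for p in pairs]

print(hi, lo)
```

Two fixed-width integer columns like these are easy to store in NumPy or Arrow buffers, which is the point of the mapping.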

Arrow Integration
• pandas wraps PyArrow via ExtensionArray
• The test code already contains a Bool type implemented with PyArrow
https://github.com/pandas-dev/pandas/blob/master/pandas/tests/extension/arrow/bool.py
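
GPU DataFrame
• Growing data sizes are increasing the need for analysis on GPUs
• Processing needs to be dispatched to GPU or CPU depending on the characteristics of the data (columns)
  • Should each column be stored on the device best suited to it?
  • Is bus transfer of metadata and indexers the challenge?
[Diagram: a DataFrame (Index, Columns) whose columns live on different devices: Int on CPU, Object on CPU, Int on GPU, Float on GPU]
RAPIDS: GPU-Accelerated Data Analytics & Machine Learning
https://developer.nvidia.com/rapids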
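
Summary
• Arrow looks set to greatly improve the challenges in data processing
• Developing data processing packages is fun
  • Beyond the core, there is all sorts of work: strings, datetimes, IO, visualization, and more
• Start with something simple
  • Fixes to documentation and error messages
  • Areas that do not use internal APIs, such as visualization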