Python and Apache Arrow
Sinhrks
December 08, 2018
@ Apache Arrow Tokyo Meetup 2018
https://speee.connpass.com/event/103514/
Transcript
Python and Apache Arrow
Masaaki Horikoshi @ ARISE analytics
Self-introduction
• Masaaki Horikoshi
• ARISE analytics Inc.
• Data analysis and the like
• A member of core developers (currently easing back in)
• GitHub: https://github.com/sinhrks
Today's topics
• The Python data analysis ecosystem
• Challenges in data processing
• What I expect from Arrow
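Current state of the Python data analysis ecosystem
[Diagram] Libraries shown: Bokeh, matplotlib, TensorFlow, PyTables, SQLAlchemy, Ibis, PySpark, pandas, rpy2, scikit-learn, NumPy, Dask
[Diagram] Categories shown: Visualization, Big Data, IO, Machine Learning, Other Programming Languages, Data Handling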
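NumPy ndarray
• (As a rule) an array of a single type laid out in contiguous memory
• Metadata controls how the physical representation should be interpreted
[Diagram: the bytes 00000001 00000010 00000011 00000100 as the physical representation; a base array and a view created by a slice as logical representations]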
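The relationship between the physical buffer and the metadata can be sketched as follows. This is a minimal illustration, assuming four one-byte integers that match the bit pattern on the slide; the variable names are chosen for this example only.

```python
import numpy as np

# Four one-byte integers stored contiguously (same bytes as the slide's bit pattern).
base = np.array([1, 2, 3, 4], dtype=np.uint8)

# Slicing creates a view: new metadata (shape, strides) over the same buffer.
view = base[::2]

print(base.dtype, base.shape, base.strides)  # metadata of the base array
print(view.shape, view.strides)              # metadata of the view
print(view.base is base)                     # the view keeps a reference to its base
print(np.shares_memory(base, view))          # both refer to the same memory
```

Only the metadata differs between `base` and `view`; no element is copied when the slice is taken.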
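pandas DataFrame
• Developed by Wes McKinney, who is also an Arrow developer (since around 2010)
• Tabular data in which each column can have a different type
• Internally, the data is managed in column-oriented blocks
• Each block is backed by a NumPy ndarray
[Diagram: a DataFrame with an Index and Columns of mixed data types is stored internally as IntBlock, FloatBlock and ObjectBlock; columns may be consolidated per type]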
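Dask DataFrame
• Runs pandas operations in parallel and distributed fashion
• Dynamically schedules and executes a computation graph
[Diagram: Blocked Algorithm. A Dask DataFrame consists of several pandas DataFrames; Sum is applied to each one, the results are combined with Concat, and a final Sum is applied]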
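The blocked algorithm in the diagram can be imitated with plain pandas. The sketch below does not use Dask itself; it assumes a block size of 4 and uses a list of row-wise slices as a stand-in for Dask partitions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": np.arange(10)})

# Split the frame into row-wise blocks (a stand-in for Dask partitions).
blocks = [df.iloc[i:i + 4] for i in range(0, len(df), 4)]

# Step 1: reduce each block independently (this part could run in parallel).
partial_sums = [block["x"].sum() for block in blocks]

# Step 2: combine the partial results into the final answer.
total = sum(partial_sums)

print(partial_sums, total)
```

Because each partial sum depends only on its own block, a scheduler is free to run step 1 on separate threads, processes or machines.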
Today's topics
• The Python data analysis ecosystem
• Challenges in data processing
• What I expect from Arrow
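Challenges in data processing
• NumPy is a very capable numerical computing package
• It works well as data storage
• However, use cases for which it is not necessarily optimal are becoming visible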
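Missing-value constraints
• NumPy:
  • No consistent missing-value representation
  • Only some data types (float, datetime, timedelta) have a value equivalent to missing (NaN, NaT)
  • Generally handled with masked arrays
• pandas:
  • Inherits the limitations of NumPy's missing-value support
  • Inserting a missing value causes unintended type conversion (see the table)
  • Checking whether missing values exist requires scanning the values
NEP 12 — Missing Data Functionality in NumPy: https://www.numpy.org/neps/nep-0012-missing-data.html
Original → NA insertion result
int → float
float → float
bool → float
datetime → datetime
timedelta → timedelta
object → object
categorical → categorical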
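The int row of the table can be reproduced with a short experiment. This is a sketch that assumes pandas' default NumPy-backed dtypes and uses `reindex` as one of several ways to introduce a missing value.

```python
import pandas as pd

s = pd.Series([1, 2, 3])
print(s.dtype)  # integer dtype before any missing value exists

# Reindexing to a label that does not exist inserts a missing value;
# the integer column is converted to float so that it can hold NaN.
s_with_na = s.reindex([0, 1, 2, 3])
print(s_with_na.dtype)

# datetime keeps its type because it has a dedicated missing marker (NaT).
d = pd.Series(pd.to_datetime(["2018-12-08"])).reindex([0, 1])
print(d.dtype)

# Detecting missing values means scanning the data.
print(s_with_na.isna().any(), d.isna().sum())
```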
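Copy and View
• It is hard to (fully) predict the behavior of NumPy and pandas
• Defensive copies are needed to guard against unintended overwriting of data
• There are basic rules, but the result can still be unexpected depending on the OS and CPU type
https://scipy-cookbook.readthedocs.io/items/ViewsVsCopies.html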
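The basic NumPy rule can be sketched as follows: basic slicing returns a view, while indexing with an integer list returns a copy. The array contents here are arbitrary example values.

```python
import numpy as np

a = np.arange(6)

sliced = a[1:4]       # basic slicing returns a view
fancy = a[[1, 2, 3]]  # integer-array ("fancy") indexing returns a copy

sliced[0] = 100       # writes through to `a`
fancy[1] = -1         # does not affect `a`

# A defensive copy makes the intent explicit regardless of how the data was selected.
safe = a[1:4].copy()
safe[0] = 0

print(a)
```

After these assignments only the write through `sliced` is visible in `a`.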
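Object ndarray
• Slow
• The underlying objects may not be located in a contiguous physical region
• Computation is carried out in the CPython interpreter (GIL, method name resolution, …)
• pandas handles strings as Object dtype (to support missing values)
[Diagram: ndarray (Primitive) consists of PyObject_HEAD, data, nd, … with the values stored in place; ndarray (Object) consists of PyObject_HEAD, data, nd, … with the data holding references to separate PyObject instances, each with its own PyObject_HEAD]
Why Python is Slow: Looking Under the Hood: https://jakevdp.github.io/blog/2014/05/09/why-python-is-slow/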
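Expression evaluation
• Expressions are evaluated eagerly, one operation at a time
• This incurs overhead such as allocating memory for intermediate results
• Depending on the operation and platform, NumPy optimizes the evaluation so that the temporary allocation is not needed
https://github.com/numpy/numpy/pull/7997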
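The intermediate result can be made visible by writing the same computation with an explicit output buffer. This is a sketch using the standard `out=` argument of NumPy ufuncs; the expression `a * 2 + b` is an example chosen here, not one from the slides.

```python
import numpy as np

a = np.arange(5, dtype=np.float64)
b = np.ones(5)

# Eager evaluation: `a * 2` is computed first as an intermediate array,
# and only then is `+ b` applied to it.
result = a * 2 + b

# The same computation with one explicitly allocated, reused buffer.
out = np.empty_like(a)
np.multiply(a, 2, out=out)
np.add(out, b, out=out)

print(result, out)
```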
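Expression evaluation
• Expressions are evaluated eagerly, one operation at a time
• Writing efficient code is hard unless you can picture the data and what each operation does internally
• Separate libraries exist for query optimization
  • NumExpr: Fast numerical array expression evaluator
  • Numba: NumPy aware dynamic Python compiler using LLVM
df[df['Year'] >= 2018].groupby('Store').sum()['price']
• Dask optimizes certain operations at the computation-graph level
http://docs.dask.org/en/latest/optimize.html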
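Since pandas does not reorder the steps of the expression above, the user has to do it by hand. The sketch below uses made-up data (the `qty` column and all values are assumptions for illustration) and compares the expression as written with a version that selects the column before aggregating.

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({
    "Year": [2017, 2018, 2018, 2019],
    "Store": ["A", "A", "B", "B"],
    "price": [100, 200, 300, 400],
    "qty": [1, 2, 3, 4],
})

recent = df[df["Year"] >= 2018]

# As written on the slide: aggregate every column, then keep only one.
eager = recent.groupby("Store").sum()["price"]

# Hand-optimized: select the column first so only `price` is aggregated.
selected = recent.groupby("Store")["price"].sum()

print(selected)
```

Both forms return the same Series; the second simply avoids summing columns that are thrown away afterwards.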
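Appending rows
• Slow
• For every column, memory is reallocated and the values are copied
• If you append rows one at a time in a loop…
[Diagram: a DataFrame (Index, Columns) plus a single-row DataFrame (Index, Columns)]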
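The cost described above can be avoided by collecting the rows first and building the DataFrame once. This is a sketch with made-up rows; `pd.concat` is used for the slow pattern because it is the current way to append in pandas.

```python
import pandas as pd

rows = [{"a": i, "b": i * 2} for i in range(5)]

# Slow pattern: each concat allocates new column buffers and copies
# every value accumulated so far.
slow = pd.DataFrame([rows[0]])
for row in rows[1:]:
    slow = pd.concat([slow, pd.DataFrame([row])], ignore_index=True)

# Usually preferable: gather plain Python records, then allocate once.
fast = pd.DataFrame(rows)

print(fast)
```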
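Applying a function to each row
df.apply(lambda row: row['a'] + row['b'], axis=1)
• Slow
• Each row is sliced out one at a time to create a Series (data consisting of a single column)
• Every time a Series is created, memory is allocated and the type is inferred
• Taking a value out of the Series requires looking up the position from the label
• The type of the overall result is inferred from the per-row results
• The resulting type is non-deterministic
[Diagram: a DataFrame with Index and Columns]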
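For the expression on this slide, a column-wise (vectorized) equivalent exists and skips the per-row Series creation entirely. A minimal sketch with made-up values:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# Row-wise: builds one Series per row and looks up 'a' and 'b' by label each time.
row_wise = df.apply(lambda row: row["a"] + row["b"], axis=1)

# Column-wise: a single operation over the two underlying arrays.
vectorized = df["a"] + df["b"]

print(vectorized)
```

Not every row function has such an equivalent, but when one exists it is usually the better choice.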
Data IO
• External packages
• Slow if implemented in Python
• In-house implementations
• Painful to maintain

magic = (b"\x00\x00\x00\x00\x00\x00\x00\x00" +
         b"\x00\x00\x00\x00\xc2\xea\x81\x60" +
         b"\xb3\x14\x11\xcf\xbd\x92\x08\x00" +
         b"\x09\xc7\x31\x8c\x18\x1f\x10\x11")

class SASIndex(object):
    row_size_index = 0
    column_size_index = 1
    subheader_counts_index = 2
    column_text_index = 3
    column_name_index = 4
    …

subheader_signature_to_index = {
    b"\xF7\xF7\xF7\xF7": SASIndex.row_size_index,
    b"\x00\x00\x00\x00\xF7\xF7\xF7\xF7": SASIndex.row_size_index,
    …

https://github.com/pandas-dev/pandas/tree/master/pandas/io/sas
Data IO
• The standard distribution (plus official subpackages) supports the following
https://pandas.pydata.org/pandas-docs/stable/io.html
Data IO (between packages)
• The Python data analysis ecosystem has grown up around NumPy
• Scikit-learn expects NumPy ndarray as input
• When a pandas DataFrame is passed in, it is converted to a NumPy ndarray internally
• NumPy defines the API (Array interface)
NumPy: The Array Interface: https://docs.scipy.org/doc/numpy-1.15.1/reference/arrays.interface.html
[Diagram: pandas DataFrame → converted to NumPy ndarray through the Array Interface → computation]
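The conversion step in the diagram can be observed directly. This sketch assumes the DataFrame is converted through its `__array__` method, which is the hook `np.asarray` calls; the resulting ndarray in turn exposes the `__array_interface__` dictionary described on the linked page.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0], "y": [3.0, 4.0]})

# DataFrame provides __array__, which NumPy calls to obtain an ndarray.
print(hasattr(df, "__array__"))
arr = np.asarray(df)
print(type(arr), arr.dtype, arr.shape)

# The ndarray describes its own buffer through the array interface.
print(sorted(arr.__array_interface__))

# Downstream computation then works on the plain ndarray.
col_means = arr.mean(axis=0)
print(col_means)
```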
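Parallel processing
• Global Interpreter Lock (GIL)
  • Multiple threads cannot run simultaneously in the CPython interpreter
  • Cython can release it explicitly
  • Each package has to handle this operation by operation
  • PyObject cannot be handled after the GIL is released
• Separate libraries exist for parallel and distributed processing
  • Dask: A flexible library for parallel computing in Python
  • Serialization methods (pickle and others) are chosen depending on the case
Understanding the Python GIL: http://www.dabeaz.com/GIL/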
Today's topics
• The Python data analysis ecosystem
• Challenges in data processing
• What I expect from Arrow
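Features I expect from Arrow
• Consistent support for missing values
• Predictable behavior
• Query optimization (Gandiva?)
• Flexible appending of data (Chunk?)
• An abstract representation of rows (RowSet?)
• Fast IO
• Data sharing in parallel processing (Plasma?)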
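The future Arrow may bring
• Will data analysis packages become things that process the representation defined by Arrow?
• Will packages split into those that operate on Arrow data and those that handle IO with the outside world?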
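What Python still has to work hard at
• That said, not every operation can be done in Arrow
• Consistency with the Python standard library and de facto standard packages
• String processing, [unreadable] processing, and so on
• Object Array
• Depending on what you want to do, will Arrow's representation need to be processed in Python?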
ExtensionArray
• An interface for mapping an arbitrary class onto an internal representation that is easy to work with (such as Arrow)
• pandas is building out this API as ExtensionArray
Extension Arrays for pandas: https://tomaugspurger.github.io/pandas-extension-arrays.html
[Diagram: IP Address values ('192.168.1.1', '2001:0db8:85a3…') are Set into and Get from an IP Address ExtensionArray, which stores them internally as hi: [0, …], lo: [3232235777, …]]
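The hi/lo storage in the diagram can be sketched with the standard library alone. This is not the ExtensionArray implementation from the linked post; `to_hi_lo` and `from_hi_lo` are hypothetical helper names, and the IPv6 address used is an example chosen here rather than the truncated one on the slide.

```python
import ipaddress

MASK64 = (1 << 64) - 1


def to_hi_lo(address):
    """Split an IPv4/IPv6 address into two unsigned 64-bit integers."""
    value = int(ipaddress.ip_address(address))
    return value >> 64, value & MASK64


def from_hi_lo(hi, lo):
    """Rebuild the address string from its two 64-bit halves."""
    value = (hi << 64) | lo
    # Assumption of this sketch: values that fit in 32 bits are shown as IPv4.
    if value < (1 << 32):
        return str(ipaddress.IPv4Address(value))
    return str(ipaddress.IPv6Address(value))


addresses = ["192.168.1.1", "2001:db8::1"]
pairs = [to_hi_lo(a) for a in addresses]
hi = [p[0] for p in pairs]
lo = [p[1] for p in pairs]

print(hi, lo)
print([from_hi_lo(h, l) for h, l in zip(hi, lo)])
```

Storing two fixed-width integer columns instead of Python objects is what makes the internal representation compact and fast to compare, while users still set and get ordinary address values.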
Arrow Integration
• pandas wraps PyArrow by way of ExtensionArray
• The test code already contains an implementation of a Bool type that uses PyArrow
https://github.com/pandas-dev/pandas/blob/master/pandas/tests/extension/arrow/bool.py
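GPU DataFrame
• As data sizes balloon, the need for analysis on GPUs is growing
• Processing needs to be dispatched to GPU or CPU according to the characteristics of the data (columns)
• Should each column be stored on the device best suited to it?
• Is transferring metadata and indexers over the bus a challenge?
[Diagram: a DataFrame (Index, Columns) whose columns live on different devices: Int on CPU, Object on CPU, Int on GPU, Float on GPU]
RAPIDS: GPU-Accelerated Data Analytics & Machine Learning: https://developer.nvidia.com/rapids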
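Summary
• Arrow looks set to greatly improve the challenges in data processing
• Developing data processing packages is fun
  • Besides the core, there are many kinds of work: strings, datetimes, IO, visualization, and more
• Start with something simple
  • Fixing documentation and error messages
  • Areas such as visualization that do not use internal APIs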