Slide 1

Slide 1 text

Dask: A Lightweight Parallel Computing Framework

Slide 2

Slide 2 text

Self-Introduction
• Data Analyst
• OSS activities:
  • PyData Development Team (pandas committer)
  • Blaze Development Team (Dask committer)
• GitHub: https://github.com/sinhrks

Slide 3

Slide 3 text

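Dask
• A lightweight parallel and distributed framework (primarily parallelism within a single node)
• Provides data structures that expose a subset of the NumPy, PyToolz, and pandas APIs

Submodule        Base package
dask.array       NumPy ndarray
dask.bag         PyToolz (operations on list, set, dict)
dask.dataframe   pandas DataFrame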

Slide 4

Slide 4 text

DataFrame
• pandas.DataFrame: labeled two-dimensional data
• Dask.DataFrame: splits a pandas.DataFrame into partitions and processes them

Figure labels: pandas DataFrame, Dask DataFrame

Slide 5

Slide 5 text

Dask DataFrame

Create a pandas DataFrame with 10 rows and 3 columns:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'X': np.arange(10),
                       'Y': np.arange(10, 20),
                       'Z': np.arange(20, 30)},
                      index=list('abcdefghij'))
    df

Split the data internally into 2 partitions and create a Dask DataFrame:

    import dask.dataframe as dd

    ddf = dd.from_pandas(df, 2)
    ddf
    # Out: dd.DataFrame

Slide 6

Slide 6 text

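Computation in Dask

Add 1 to every element of ddf. The actual computation is not executed yet:

    ddf + 1
    # Out: dd.DataFrame

Execute the computation:

    (ddf + 1).compute()

Figure labels: df, ddf, compute, ddf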

Slide 7

Slide 7 text

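Blocked Algorithm (addition)

    (ddf + 1).compute()

Steps shown in the figure:
• pandas DataFrame before processing
• Convert to a Dask DataFrame
• Run the computation on each partition
• Concatenate the results (Concat)
• pandas DataFrame after processing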
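The split, compute, and concatenate steps can be imitated with plain pandas. This is only a conceptual sketch: the two `iloc` slices stand in for Dask partitions, and the names `blocks` and `partial_results` are made up for illustration. Dask itself schedules these steps as a task graph.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"X": np.arange(10),
                   "Y": np.arange(10, 20),
                   "Z": np.arange(20, 30)},
                  index=list("abcdefghij"))

# Split the rows into two blocks (stand-ins for Dask partitions).
blocks = [df.iloc[:5], df.iloc[5:]]

# Run the same computation on every block independently.
partial_results = [block + 1 for block in blocks]

# Concatenate the per-block results back into a single pandas DataFrame.
result = pd.concat(partial_results)
```

Since element-wise addition never looks across rows, each block can be processed without knowing about the others, which is what makes it easy to parallelize.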

Slide 8

Slide 8 text

Blocked Algorithm (sum)

    ddf.sum().compute()

Steps shown in the figure:
• sum (1st pass, per partition)
• concat
• sum (2nd pass)

Figure labels: Sum, Sum, Concat, Sum
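The sum, concat, sum pattern can be reproduced by hand with pandas. This is a sketch under the assumption that two row slices behave like the two partitions; the variable names are illustrative and this is not Dask's internal code.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"X": np.arange(10),
                   "Y": np.arange(10, 20),
                   "Z": np.arange(20, 30)},
                  index=list("abcdefghij"))
blocks = [df.iloc[:5], df.iloc[5:]]

# First sum: reduce each block to its per-column totals.
block_sums = [block.sum() for block in blocks]

# Concat: stack the per-block totals into a small intermediate frame
# with one row per block.
stacked = pd.concat(block_sums, axis=1).T

# Second sum: reduce the intermediate frame to the final per-column totals.
total = stacked.sum()
```

The intermediate frame has one row per block, so the second pass is cheap no matter how large the original data is.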

Slide 9

Slide 9 text

Blocked Algorithm (sum)

    ddf.sum().visualize()

Task graph labels:
• Dask DataFrame before processing
• sum (1st pass)
• concat
• sum (2nd pass)

Slide 10

Slide 10 text

Blocked Algorithm (mean)

    ddf.mean().visualize()

Task graph labels:
• sum
• count
• mean = sum / count
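The relation mean = sum / count can be checked with a hand-rolled blocked version in pandas. In this sketch the blocks are deliberately unequal in size (an illustrative choice) to show why sums and counts are combined rather than per-block means.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"X": np.arange(10),
                   "Y": np.arange(10, 20),
                   "Z": np.arange(20, 30)},
                  index=list("abcdefghij"))

# Unequal blocks: 3 rows and 7 rows.
blocks = [df.iloc[:3], df.iloc[3:]]

# Per-block sums and counts, stacked into one row per block.
block_sums = pd.concat([b.sum() for b in blocks], axis=1).T
block_counts = pd.concat([b.count() for b in blocks], axis=1).T

# Combine across blocks, then divide: mean = sum / count.
mean = block_sums.sum() / block_counts.sum()

# Averaging the per-block means instead would weight both blocks equally,
# which is wrong when the blocks have different sizes.
naive = pd.concat([b.mean() for b in blocks], axis=1).T.mean()
```

Here `mean` matches `df.mean()`, while `naive` does not.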

Slide 11

Slide 11 text

Blocked Algorithm (summary statistics)

    ddf.describe().visualize()

Figure label: Result

Slide 12

Slide 12 text

Features of Dask DataFrame
• Arithmetic / comparison operations
• Statistics
• Label-based data selection
• Grouping / aggregation
• Concatenation / joining (merge, join, concat…)

Slide 13

Slide 13 text

Performance Comparison
• AWS EC2: c4.2xlarge (vCPU: 8, memory: 15 GiB)

Create a pandas DataFrame with 100 million rows and 2 columns:

    n = 100000000
    df = pd.DataFrame({'a': np.random.randint(1, 100, n),
                       'b': np.random.randn(n)})
    df

Split the data internally into 5 partitions and create a Dask DataFrame:

    ddf = dd.from_pandas(df, 5)
    ddf
    # Out: dd.DataFrame

Slide 14

Slide 14 text

Performance Comparison

pandas:

    %timeit df.describe()
    1 loops, best of 3: 25.3 s per loop

Dask:

    %timeit ddf.describe().compute()
    1 loops, best of 3: 3.87 s per loop

Slide 15

Slide 15 text

Results
• Note: percentiles are computed with an approximation algorithm, so the values may differ from pandas (red boxes)

pandas:

    df.describe()

Dask:

    ddf.describe().compute()
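One way to see why percentiles are harder to compute block by block than sums is the toy example below. It is not the approximation algorithm Dask uses (the slides do not describe it). It only shows that a naive combination of per-block medians does not reproduce the exact median, so an exact answer needs more than small per-block summaries. The two arrays are made-up data.

```python
import numpy as np

# Two blocks of made-up data, standing in for two partitions.
block_a = np.array([1, 2, 3, 4, 100])
block_b = np.array([5, 6, 7, 8, 9])

# Exact median over all of the data.
exact = np.median(np.concatenate([block_a, block_b]))

# Naive blocked attempt: average the per-block medians.
naive = np.mean([np.median(block_a), np.median(block_b)])

print(exact, naive)  # 5.5 5.0
```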

Slide 16

Slide 16 text

Results

    ddf.describe().visualize()

Slide 17

Slide 17 text

Summary
• Dask: a lightweight parallel and distributed framework
• Provides a subset of the NumPy, PyToolz, and pandas APIs
• Users can run parallel computations through the NumPy / PyToolz / pandas APIs