Sinhrks
October 10, 2015
# PyConJP 2015: Dask: A Lightweight Parallel Computing Framework (Lightning talks)

## Transcript

1. Dask

A lightweight parallel computing framework

2. Self-introduction
• Data Analyst

• OSS activities:

• PyData Development Team (pandas committer)

• Blaze Development Team (Dask committer)

• GitHub: https://github.com/sinhrks

3. Dask
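• A lightweight parallel and distributed framework (mainly parallelism within a single node)

• Provides data structures that expose (a subset of) the NumPy, PyToolz, and pandas APIs

| Submodule | Base package |
| --- | --- |
| dask.array | NumPy ndarray |
| dask.bag | PyToolz (operations on list, set, dict) |
| dask.dataframe | pandas DataFrame |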
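
As a rough illustration of the submodules listed on this slide, the sketch below runs one small computation through `dask.array` and one through `dask.bag`. It assumes a recent Dask release; the variable names and the `scheduler="synchronous"` argument are choices made for this example and do not come from the talk.

```python
import dask.array as da
import dask.bag as db

# dask.array mirrors a subset of the NumPy ndarray API.
# chunks=5 splits the 10 elements into two blocks.
x = da.arange(10, chunks=5)
total = (x + 1).sum().compute()

# dask.bag operates on generic Python collections.
bag = db.from_sequence([1, 2, 3, 4], npartitions=2)
# The synchronous scheduler keeps this example in a single process.
doubled_sum = bag.map(lambda v: v * 2).sum().compute(scheduler="synchronous")

print(total, doubled_sum)
```

In both cases the code reads like its NumPy or plain-Python counterpart; only the final `compute()` call is specific to Dask.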

4. DataFrame
• pandas.DataFrame: labeled two-dimensional data

• Dask.DataFrame: splits a pandas.DataFrame into pieces and processes them

(Figure labels: pandas.DataFrame, Dask.DataFrame)

5. Dask DataFrame
import numpy as np
import pandas as pd

# Create a pandas.DataFrame with 10 rows and 3 columns
df = pd.DataFrame({'X': np.arange(10),
                   'Y': np.arange(10, 20),
                   'Z': np.arange(20, 30)},
                  index=list('abcdefghij'))
df

import dask.dataframe as dd

# Split the data internally into 2 partitions and create a Dask.DataFrame
ddf = dd.from_pandas(df, 2)
ddf
# -> dd.DataFrame

6. Computation in Dask
# Add 1 to every element. The actual computation is not executed yet.
ddf + 1
# -> dd.DataFrame

# Execute the computation.
(ddf + 1).compute()

(Figure labels: df, ddf, compute)

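7. Blocked Algorithm (addition)

(ddf + 1).compute()
(Figure: the pandas.DataFrame before processing is converted to a Dask.DataFrame; the computation runs on each piece of the split data; the results are combined (Concat) into the pandas.DataFrame after processing.)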
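
The blocked addition on this slide can be mimicked by hand with plain pandas. The sketch below is only an illustration of the idea (split, compute per block, concatenate); it is not Dask's internal implementation, and the fixed split at row 5 is an assumption made for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'X': np.arange(10),
                   'Y': np.arange(10, 20),
                   'Z': np.arange(20, 30)},
                  index=list('abcdefghij'))

# Split into 2 blocks by row position.
blocks = [df.iloc[:5], df.iloc[5:]]

# Apply the element-wise operation to each block independently.
added = [block + 1 for block in blocks]

# Concatenate the per-block results back into one pandas.DataFrame.
result = pd.concat(added)

print(result.equals(df + 1))
```

Because addition is element-wise, each block can be processed without looking at any other block, which is what makes it easy to run in parallel.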

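8. Blocked Algorithm (sum)
ddf.sum().compute()
(Figure: Sum is applied to each piece of the split data (first sum), the partial results are combined (Concat), and Sum is applied once more (second sum).)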
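
The two-stage sum can likewise be written out with plain pandas. This is a hand-rolled sketch of the sum, concat, sum pattern under the assumption of two fixed blocks; it is not Dask's own code.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'X': np.arange(10),
                   'Y': np.arange(10, 20),
                   'Z': np.arange(20, 30)},
                  index=list('abcdefghij'))
blocks = [df.iloc[:5], df.iloc[5:]]

# First sum: one Series of column totals per block.
partial_sums = [block.sum() for block in blocks]

# Concat: put the per-block totals side by side (one column per block).
stacked = pd.concat(partial_sums, axis=1)

# Second sum: add the per-block totals together.
total = stacked.sum(axis=1)

print(total.equals(df.sum()))
```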

9. Blocked Algorithm (sum)
ddf.sum().visualize()
(Task graph labels: sum (two passes), concat, Dask.DataFrame before processing)

10. Blocked Algorithm (mean)
ddf.mean().visualize()
(Task graph labels: sum, count; mean = sum / count)
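
The mean = sum / count decomposition matters because averaging the per-block means is wrong when blocks have different sizes. The sketch below uses plain pandas with deliberately uneven blocks (an assumption made for the example) to show both results.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'X': np.arange(10),
                   'Y': np.arange(10, 20),
                   'Z': np.arange(20, 30)},
                  index=list('abcdefghij'))

# Uneven blocks: 3 rows and 7 rows.
blocks = [df.iloc[:3], df.iloc[3:]]

# Per-block sums and counts, combined across blocks.
total_sum = pd.concat([b.sum() for b in blocks], axis=1).sum(axis=1)
total_count = pd.concat([b.count() for b in blocks], axis=1).sum(axis=1)
mean = total_sum / total_count

# Naive alternative: average of the per-block means (incorrect here).
naive = pd.concat([b.mean() for b in blocks], axis=1).mean(axis=1)

print(np.allclose(mean, df.mean()))    # matches pandas
print(np.allclose(naive, df.mean()))   # does not match for uneven blocks
```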

11. Blocked Algorithm (summary statistics)
ddf.describe().visualize()
(Task graph label: result)

12. Dask DataFrame features
• Arithmetic / comparison operations

• Statistics

• Data selection by label

• Grouping / aggregation

• Concatenation / joining (merge, join, concat…)

13. Performance comparison
• AWS EC2: c4.2xlarge (vCPU: 8, memory: 15 GiB)
# Create a pandas.DataFrame with 100 million rows and 2 columns
n = 100000000
df = pd.DataFrame({'a': np.random.randint(1, 100, n),
                   'b': np.random.randn(n)})
df
# Split the data internally into 5 partitions and create a Dask.DataFrame
ddf = dd.from_pandas(df, 5)
ddf
# -> dd.DataFrame<..., 80000000, 99999999)>

14. Performance comparison
# pandas
%timeit df.describe()
1 loops, best of 3: 25.3 s per loop
# Dask
%timeit ddf.describe().compute()
1 loops, best of 3: 3.87 s per loop

15. Results
• Note: percentile uses an approximation algorithm, so the values may differ (red boxes)
pandas: df.describe()
Dask: ddf.describe().compute()

16. Results
ddf.describe().visualize()

17. Summary
• Dask: a lightweight parallel and distributed framework

• Provides a subset of the NumPy, PyToolz, and pandas APIs

• Users can run parallel computations through the NumPy / PyToolz / pandas APIs