Sinhrks
June 28, 2017
@PyData Tokyo #13 Lightning Talk

## Transcript

Distributed Machine Learning

Masaaki Horikoshi @ ARISE analytics

2. Self-introduction
• OSS activities:

• GitHub: https://github.com/sinhrks

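• A flexible framework for parallel and out-of-core processing

• Provides data structures compatible with (a subset of) NumPy and pandas

• Tasks are represented as a dynamic computation graph and executed in parallel by a scheduler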
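
The idea of tasks forming a computation graph that a scheduler executes can be sketched in plain Python. The dict-of-tuples layout below follows Dask's documented graph convention (a key maps to a literal or to a tuple whose first element is a callable). The `run` helper is a hypothetical toy, not Dask's scheduler; it simply runs every task whose inputs are ready on a thread pool.

```python
from concurrent.futures import ThreadPoolExecutor
from operator import add, mul

# A task graph: each key maps to a literal or to (callable, *argument_keys).
graph = {
    "x": 1,
    "y": 2,
    "a": (add, "x", "y"),   # a = x + y
    "b": (mul, "x", "y"),   # b = x * y
    "c": (add, "a", "b"),   # c = a + b, depends on both a and b
}

def run(graph, target, max_workers=4):
    """Toy scheduler: repeatedly run every task whose inputs are ready."""
    results = {k: v for k, v in graph.items() if not isinstance(v, tuple)}
    pending = {k: v for k, v in graph.items() if isinstance(v, tuple)}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while target not in results:
            ready = [k for k, (_, *deps) in pending.items()
                     if all(d in results for d in deps)]
            if not ready:
                raise ValueError("cycle or missing key in graph")
            # Tasks in the same wave are independent, so submit them together.
            futures = {k: pool.submit(pending[k][0],
                                      *(results[d] for d in pending[k][1:]))
                       for k in ready}
            for k, fut in futures.items():
                results[k] = fut.result()
                del pending[k]
    return results[target]

result = run(graph, "c")
```

Here `a` and `b` have no dependency on each other, so they are submitted in the same wave; `c` runs only after both have finished.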

Airflow

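• Composed of multiple pandas DataFrames

• Processing is parallelized over each of the DataFrames produced by splitting along the row axis
(Figure: a pandas DataFrame next to a Dask DataFrame; labels: partition, division)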

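5. Create a pandas DataFrame with 10 rows and 3 columns
import numpy as np
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({'X': np.arange(10),
                   'Y': np.arange(10, 20),
                   'Z': np.arange(20, 30)},
                  index=list('abcdefghij'))
df
# Split the DataFrame into 2 partitions
ddf = dd.from_pandas(df, 2)
ddf
(Figure: the Dask DataFrame drawn as 2 partitions bounded by 3 divisions)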
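
How partitions and divisions relate can be illustrated without Dask. The sketch below assumes the Dask convention that divisions hold the first index value of each partition plus the last index value of the final partition. `split_index` and `partition_of` are hypothetical helpers, and the exact split that `dd.from_pandas` chooses may differ.

```python
from bisect import bisect_right

def split_index(index, npartitions):
    """Split a sorted index into roughly equal partitions and derive divisions."""
    size = -(-len(index) // npartitions)  # ceiling division
    parts = [index[i:i + size] for i in range(0, len(index), size)]
    divisions = [p[0] for p in parts] + [parts[-1][-1]]
    return parts, divisions

def partition_of(divisions, key):
    """Return the position of the partition whose index range contains key."""
    if not divisions[0] <= key <= divisions[-1]:
        raise KeyError(key)
    # The last division is an inclusive upper bound, so clamp to the last partition.
    return min(bisect_right(divisions, key) - 1, len(divisions) - 2)

parts, divisions = split_index(list("abcdefghij"), 2)
```

Because the divisions are sorted, a label lookup only needs a binary search over them to find the single partition to read, rather than a scan of every partition.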

6. Blocked Algorithm (sum)
ddf.sum().compute()
(Figure: Sum per partition, Concat of the partial sums, then Sum over the whole)
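
The blocked sum can be sketched in plain Python: sum each partition independently, collect the partial results, then sum those. Lists stand in for partitions and a thread pool stands in for the scheduler; this illustrates the pattern, not how Dask implements `sum`.

```python
from concurrent.futures import ThreadPoolExecutor

def blocked_sum(partitions):
    """Sum per partition in parallel, gather the partials, then sum them."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(sum, partitions))  # per-partition Sum, then Concat
    return sum(partials)                            # Sum over the whole

# Column X of the slide's DataFrame (0..9), split into 2 partitions.
partitions = [list(range(0, 5)), list(range(5, 10))]
total = blocked_sum(partitions)
```

The same shape works for any reduction whose partial results can be combined, such as count, min, or max.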

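• Computation run by the scheduler can be distributed across multiple nodes

• Low latency: the overhead per task is about 1 ms

• Data sharing between Workers: data is transferred directly between Workers

• Complex scheduling: arbitrary computation graphs can be executed

• Locality: data transfer between Workers is avoided where possible
(Figure: a Distributed Client, a Distributed Scheduler, and two Distributed Workers)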
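
The locality point can be illustrated with a toy placement rule: run a task on the worker that already holds the most input bytes, so the least data has to move. The function, worker names, and sizes below are hypothetical, and a real scheduler weighs more factors than this.

```python
def pick_worker(task_inputs, holdings, sizes):
    """Choose the worker holding the most bytes of a task's inputs.

    task_inputs: keys the task needs
    holdings:    worker name -> set of keys stored on that worker
    sizes:       key -> size in bytes
    """
    def local_bytes(worker):
        return sum(sizes[k] for k in task_inputs if k in holdings[worker])
    # sorted() makes ties deterministic: the alphabetically first worker wins.
    return max(sorted(holdings), key=local_bytes)

holdings = {"worker-1": {"x"}, "worker-2": {"y", "z"}}
sizes = {"x": 1000, "y": 10, "z": 20}

chosen = pick_worker(["x", "y"], holdings, sizes)
# Bytes that still have to be fetched from other workers.
to_transfer = sum(sizes[k] for k in ["x", "y"] if k not in holdings[chosen])
```

Placing the task on `worker-1` means only the small input `y` crosses the network, instead of the large input `x`.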

8. Parallel processing in Scikit-Learn
• The “n_jobs” argument sets the number of parallel jobs

• Uses joblib internally

• joblib is developed mainly by Scikit-Learn committers

from sklearn.model_selection import GridSearchCV
# pipe (the estimator) and param_grid are assumed to be defined earlier
grid = GridSearchCV(pipe, cv=3, n_jobs=12,
                    param_grid=param_grid)
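
What `n_jobs` parallelizes can be sketched without scikit-learn: each combination of a parameter setting and a cross-validation fold is an independent fit, so the work fans out over a pool. The parameter grid and the `fit_and_score` function here are hypothetical stand-ins, and a thread pool replaces joblib.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

param_grid = {"C": [0.1, 1.0, 10.0], "gamma": [0.01, 0.1]}  # hypothetical grid
cv = 3

def fit_and_score(params, fold):
    """Stand-in for fitting a model on one fold; returns a made-up score."""
    return -abs(params["C"] - 1.0) - abs(params["gamma"] - 0.1)

# Expand the grid into one dict per parameter combination.
candidates = [dict(zip(param_grid, values))
              for values in product(*param_grid.values())]
# One independent task per (candidate, fold) pair.
tasks = [(params, fold) for params in candidates for fold in range(cv)]

with ThreadPoolExecutor(max_workers=4) as pool:  # plays the role of n_jobs=4
    scores = list(pool.map(lambda t: fit_and_score(*t), tasks))

# Average the fold scores of each candidate and keep the best one.
mean_scores = [sum(scores[i * cv:(i + 1) * cv]) / cv
               for i in range(len(candidates))]
best_params = candidates[mean_scores.index(max(mean_scores))]
```

With 6 candidates and 3 folds there are 18 independent fits, which is why a larger `n_jobs` (or a cluster) can shorten a grid search.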

9. Distributed joblib
• Pluggable API (0.10.0 and later)

• A with block can change the default backend of joblib.Parallel

• Caveats

• Use the joblib bundled with scikit-learn (sklearn.externals.joblib)

• Some computations cannot be distributed

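import distributed.joblib  # registers the 'dask.distributed' joblib backend
from sklearn.externals.joblib import parallel_backend
# Assumption: the with line is absent from the transcript; the scheduler address is a placeholder
with parallel_backend('dask.distributed', scheduler_host='localhost:8786'):
    grid.fit(digits.data, digits.target)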
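
The mechanism of swapping a default backend inside a `with` block can be sketched with a small context manager. This is a toy registry written for illustration; the names `BACKENDS`, `use_backend`, and `parallel_map` are hypothetical and do not mirror joblib's internals.

```python
from concurrent.futures import ThreadPoolExecutor
from contextlib import contextmanager

def serial_backend(func, items):
    return [func(x) for x in items]

def threaded_backend(func, items):
    with ThreadPoolExecutor() as pool:
        return list(pool.map(func, items))

BACKENDS = {"serial": serial_backend, "threads": threaded_backend}
_default = ["serial"]  # one-element list so the context manager can mutate it

@contextmanager
def use_backend(name):
    """Temporarily change the default backend; restore it on exit."""
    if name not in BACKENDS:
        raise KeyError(name)
    previous = _default[0]
    _default[0] = name
    try:
        yield
    finally:
        _default[0] = previous

def parallel_map(func, items):
    """Callers never name a backend; they get whatever is the current default."""
    return BACKENDS[_default[0]](func, items)

with use_backend("threads"):
    inside = _default[0]
    squares = parallel_map(lambda x: x * x, range(5))
after = _default[0]
```

The point of this design is that library code calling `parallel_map` needs no changes: the caller decides where the work runs by wrapping the call in a `with` block.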