Slide 1

Slide 1 text

Distributed Machine Learning with Dask Distributed
Masaaki Horikoshi @ ARISE analytics

Slide 2

Slide 2 text

Self-introduction
• OSS activities:
• GitHub: https://github.com/sinhrks

Slide 3

Slide 3 text

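What is Dask?
• A flexible framework for parallel and out-of-core processing
• Provides data structures compatible with (a subset of) NumPy and pandas
• Tasks are expressed as a dynamic computation graph and executed in parallel by a scheduler
• Packages that use Dask (partial list): Airflow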
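The "tasks as a computation graph" idea above can be sketched in a few lines. This is a toy, single-threaded evaluator written for illustration, not Dask's scheduler: the graph is a plain dict whose values are either literals or task tuples of the form (function, arg, ...), and the `get` helper and the key names are assumptions made for this example. A real scheduler would additionally run independent tasks in parallel.

```python
from operator import add, mul

# A task graph as a plain dict: each key maps either to a literal value
# or to a task tuple (function, arg1, arg2, ...), where args may be other keys.
graph = {
    "x": 1,
    "y": 2,
    "z": (add, "x", "y"),   # z = x + y
    "w": (mul, "z", 10),    # w = z * 10
}


def get(graph, key, cache=None):
    """Evaluate `key` by recursively evaluating the tasks it depends on."""
    if cache is None:
        cache = {}
    if key in cache:
        return cache[key]
    task = graph[key]
    if isinstance(task, tuple) and callable(task[0]):
        func, *args = task
        # Arguments that name another key are resolved first; others are literals.
        resolved = [
            get(graph, a, cache) if isinstance(a, str) and a in graph else a
            for a in args
        ]
        result = func(*resolved)
    else:
        result = task
    cache[key] = result
    return result


result = get(graph, "w")
print(result)  # 30
```

Because "z" and any other key depend only on their listed arguments, a scheduler is free to choose the execution order, which is what makes parallel execution possible.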

Slide 4

Slide 4 text

Dask DataFrame
• Composed of multiple pandas DataFrames
• Processing is parallelized across the row-wise (vertically split) DataFrames
[Figure labels: pandas DataFrame, Dask DataFrame, partition, division, division]

Slide 5

Slide 5 text

# Create a pandas DataFrame with 10 rows and 3 columns
import numpy as np
import pandas as pd

df = pd.DataFrame({'X': np.arange(10),
                   'Y': np.arange(10, 20),
                   'Z': np.arange(20, 30)},
                  index=list('abcdefghij'))
df

# Dask DataFrame: split the pandas DataFrame into 2 partitions
import dask.dataframe as dd

ddf = dd.from_pandas(df, npartitions=2)
ddf

[Figure labels: partition, partition, division, division, division]
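To make the relationship between partitions and divisions concrete, here is a pure-Python sketch of splitting the sorted index 'a'..'j' into two contiguous partitions. The helper `split_sorted_index` is hypothetical and only illustrates the idea that divisions are the first index label of each partition plus the last label overall; it is not how Dask computes them internally.

```python
import math


def split_sorted_index(index, npartitions):
    """Split a sorted index into contiguous partitions and report the divisions."""
    chunk = math.ceil(len(index) / npartitions)
    partitions = [index[i:i + chunk] for i in range(0, len(index), chunk)]
    # Divisions: the first label of every partition, plus the last label overall.
    divisions = [p[0] for p in partitions] + [partitions[-1][-1]]
    return partitions, divisions


partitions, divisions = split_sorted_index(list("abcdefghij"), 2)
print(partitions)  # [['a', 'b', 'c', 'd', 'e'], ['f', 'g', 'h', 'i', 'j']]
print(divisions)   # ['a', 'f', 'j']
```

Two partitions therefore come with three division labels, which matches the figure labels on this slide.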

Slide 6

Slide 6 text

Blocked Algorithm (sum)
ddf.sum().compute()
[Figure: Sum (per partition), Sum (per partition) -> Concat (concatenate) -> Sum (overall)]
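The three stages in the figure (sum per partition, concatenate, sum overall) can be sketched with the standard library alone. This is an illustration of the blocked pattern under simplifying assumptions, not Dask's implementation: the data mirrors the 10-row example on Slide 5 but is held as plain lists, and the helper names are made up for this example.

```python
from concurrent.futures import ThreadPoolExecutor

# Same values as the 10-row example on Slide 5, held as plain column lists.
columns = {
    "X": list(range(10)),
    "Y": list(range(10, 20)),
    "Z": list(range(20, 30)),
}


def make_partitions(columns, npartitions):
    """Split every column row-wise into `npartitions` contiguous blocks."""
    nrows = len(next(iter(columns.values())))
    size = -(-nrows // npartitions)  # ceiling division
    return [
        {name: values[i:i + size] for name, values in columns.items()}
        for i in range(0, nrows, size)
    ]


def partial_sum(partition):
    """Step 1: column sums for a single partition."""
    return {name: sum(values) for name, values in partition.items()}


def blocked_sum(columns, npartitions=2):
    partitions = make_partitions(columns, npartitions)
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(partial_sum, partitions))  # per-partition sums
    # Step 2: concatenate the partial results column by column.
    concatenated = {name: [p[name] for p in partials] for name in columns}
    # Step 3: sum the concatenated partial results to get the overall totals.
    return {name: sum(values) for name, values in concatenated.items()}


totals = blocked_sum(columns)
print(totals)  # {'X': 45, 'Y': 145, 'Z': 245}
```

Only the small per-partition results are combined in the final step, so each partition can be processed independently of the others.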

Slide 7

Slide 7 text

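Dask Distributed
• Computation run by the scheduler can be distributed across multiple nodes
• Low latency: per-task overhead is about 1 ms
• Data sharing between workers: data is transferred directly between workers
• Complex scheduling: arbitrary computation graphs can be executed
• Locality: data transfer between workers is avoided as far as possible
[Figure labels: Distributed Worker, Distributed Worker, Distributed Scheduler, Distributed Client]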

Slide 8

Slide 8 text

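Parallel processing in Scikit-Learn
• The "n_jobs" argument specifies the number of parallel jobs
• Uses joblib internally
• Developed mainly by Scikit-Learn committers
• Parallelism within a single node (threading, multiprocessing)

from sklearn.model_selection import GridSearchCV

# pipe (an estimator or Pipeline) and param_grid are assumed to be defined beforehand
grid = GridSearchCV(pipe, cv=3, n_jobs=12, param_grid=param_grid)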

Slide 9

Slide 9 text

Distributed joblib
• Pluggable API (0.10.0 and later)
• A with block can change the default backend of joblib.Parallel
• Caveats
  • Use the joblib bundled with scikit-learn (sklearn.externals.joblib)
  • Some cases cannot be distributed
    • Those where threading / multiprocessing is explicitly specified as the backend

import distributed.joblib  # registers the 'dask.distributed' backend
from sklearn.externals.joblib import parallel_backend

with parallel_backend('dask.distributed', scheduler_host='scheduler-addr:8786'):
    grid.fit(digits.data, digits.target)

Slide 10

Slide 10 text

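dask-searchcv
• A Dask-compatible version of Scikit-Learn's hyperparameter search:
  • Supports GridSearchCV and RandomizedSearchCV
  • Same API as Scikit-Learn
  • Accepts Dask Array and DataFrame as input
  • Avoids repeatedly fitting the same estimator with the same parameters
    • Useful for Pipeline processing
Note: part of a package previously published as dklearn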
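Why avoiding repeated fits matters for pipelines can be shown with a small counting experiment. The sketch below is a standard-library illustration, not dask-searchcv's implementation: `fit_step`, `search`, and the two-step "scale" then "clf" pipeline are hypothetical stand-ins, and cross-validation folds are ignored. A step is reused whenever the same step with the same parameters sees the same input.

```python
from itertools import product

fit_calls = []  # records every (step, params) that actually gets fitted


def fit_step(step, params, upstream):
    """Stand-in for fitting one pipeline step; `upstream` identifies its input."""
    fit_calls.append((step, params))
    return (step, params, upstream)  # a token representing the fitted result


def search(grid, use_cache):
    """Fit a two-step pipeline (scale -> clf) for every parameter combination."""
    fit_calls.clear()
    cache = {}
    for scale_p, clf_p in product(grid["scale"], grid["clf"]):
        upstream = "data"
        for step, params in (("scale", scale_p), ("clf", clf_p)):
            key = (step, params, upstream)  # same step, same params, same input
            if use_cache and key in cache:
                upstream = cache[key]  # reuse the earlier result, no refit
                continue
            upstream = cache[key] = fit_step(step, params, upstream)
    return len(fit_calls)


grid = {"scale": (True, False), "clf": (0.1, 1.0, 10.0)}
naive_fits = search(grid, use_cache=False)
cached_fits = search(grid, use_cache=True)
print(naive_fits, cached_fits)  # 12 8
```

With 2 x 3 parameter combinations, the naive search fits the first step six times, while the cached search fits it only once per distinct setting; the saving grows with the number of shared upstream steps.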