Distributed Machine Learning with Dask Distributed

Sinhrks
June 28, 2017

@PyData Tokyo #13 Lightning Talk
https://pydatatokyo.connpass.com/event/58954/

Transcript

  1. Distributed Machine Learning with Dask Distributed
     Masaaki Horikoshi @ ARISE analytics

  2. About me
     • OSS activities:
     • GitHub: https://github.com/sinhrks

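  3. What is Dask?
     • A flexible framework for parallel and out-of-core processing
     • Provides data structures compatible with NumPy and pandas (a subset of their APIs)
     • Tasks are expressed as a dynamic computation graph and executed in parallel by a scheduler
     • Packages that use Dask (partial list): Airflow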
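To make the "computation graph" bullet concrete, here is a minimal sketch in plain Python. Dask describes a graph as a dict that maps keys to tasks, where a task is a tuple of a callable followed by its arguments. The `get` function below is a hypothetical, single-threaded evaluator written for illustration only. It is not Dask's scheduler, which would also run independent tasks in parallel.

```python
from operator import add, mul


def inc(x):
    """Return x + 1."""
    return x + 1


def get(graph, key, cache=None):
    """Evaluate `key` in a dict-based task graph (toy, single-threaded)."""
    if cache is None:
        cache = {}
    if key in cache:
        return cache[key]
    task = graph[key]
    if isinstance(task, tuple) and task and callable(task[0]):
        func, *args = task
        # Arguments that name another key are dependencies; resolve them first.
        resolved = [
            get(graph, a, cache) if isinstance(a, str) and a in graph else a
            for a in args
        ]
        result = func(*resolved)
    else:
        result = task  # a literal value
    cache[key] = result
    return result


graph = {
    "x": 1,
    "y": (inc, "x"),       # y = inc(x)
    "z": (add, "y", 10),   # z = y + 10
    "w": (mul, "y", "z"),  # w = y * z; reuses the cached y
}

print(get(graph, "w"))  # 24
```

Because each key is computed once and cached, a shared intermediate such as `y` is not recomputed when several tasks depend on it.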
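  4. Dask DataFrame
     • Made up of multiple pandas DataFrames
     • Processing is parallelized over the DataFrames produced by splitting along the rows
     [Figure: a pandas DataFrame next to a Dask DataFrame, labeled with "partition" and "division"]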
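  5. Creating a Dask DataFrame

         import numpy as np
         import pandas as pd

         # Create a pandas DataFrame with 10 rows and 3 columns
         df = pd.DataFrame({'X': np.arange(10),
                            'Y': np.arange(10, 20),
                            'Z': np.arange(20, 30)},
                           index=list('abcdefghij'))
         df

         import dask.dataframe as dd

         # Convert it to a Dask DataFrame with 2 partitions
         ddf = dd.from_pandas(df, npartitions=2)
         ddf

     [Figure: the resulting Dask DataFrame, drawn as 2 partitions bounded by 3 divisions]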
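The relationship between partitions and divisions in the figure (2 partitions, 3 divisions) can be sketched without Dask. Divisions are the index values at the partition boundaries, so there is always one more division than there are partitions. The helpers `split_index` and `partition_of` below are hypothetical names for illustration. The even split is an assumption and is not necessarily how Dask chooses its boundaries.

```python
from bisect import bisect_right
from math import ceil


def split_index(index, npartitions):
    """Split a sorted index into contiguous chunks and derive divisions."""
    size = ceil(len(index) / npartitions)
    partitions = [index[i:i + size] for i in range(0, len(index), size)]
    # One division per partition start, plus the last label of the final partition.
    divisions = [p[0] for p in partitions] + [partitions[-1][-1]]
    return partitions, divisions


def partition_of(label, divisions):
    """Return the position of the partition that would hold `label`."""
    n = len(divisions) - 1
    return min(bisect_right(divisions, label) - 1, n - 1)


index = list("abcdefghij")
partitions, divisions = split_index(index, 2)

print(divisions)                     # ['a', 'f', 'j']
print(partition_of("g", divisions))  # 1
```

Knowing the divisions lets a label lookup go straight to one partition instead of scanning all of them.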
  6. Blocked Algorithm (sum)

         ddf.sum().compute()

     [Figure: Sum (per partition) → Concat → Sum (overall)]
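The three stages in the figure can be reproduced by hand with pandas alone. This is a sketch of the idea, not Dask's implementation. The two `iloc` slices are stand-ins for the partitions, and the per-partition sums run sequentially here, whereas a scheduler could run them in parallel.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'X': np.arange(10),
                   'Y': np.arange(10, 20),
                   'Z': np.arange(20, 30)},
                  index=list('abcdefghij'))

# Stand-ins for the two partitions of the Dask DataFrame.
partitions = [df.iloc[:5], df.iloc[5:]]

# Step 1: Sum within each partition (independent of each other).
partial_sums = [part.sum() for part in partitions]

# Step 2: Concat the partial results, one column per partition.
combined = pd.concat(partial_sums, axis=1)

# Step 3: Sum across partitions to get the overall column totals.
total = combined.sum(axis=1)

print(total)
```

The result matches `df.sum()`, but only the small per-partition sums ever need to be held together in one place.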
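  7. Dask Distributed
     • Lets the scheduler distribute computation across multiple nodes
     • Low latency: per-task overhead is about 1 ms
     • Data sharing between Workers: data is transferred directly between Workers
     • Complex scheduling: can execute arbitrary computation graphs
     • Locality: avoids data transfer between Workers wherever possible
     [Figure: a Distributed Client, a Distributed Scheduler, and two Distributed Workers]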
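A minimal sketch of the Client, Scheduler, and Worker roles, assuming the `distributed` package is installed. To stay self-contained it starts an in-process cluster with `processes=False`. The `square` and `run_demo` functions are hypothetical names used only for this example.

```python
from dask.distributed import Client


def square(x):
    """Return x squared."""
    return x * x


def run_demo():
    # processes=False runs the scheduler and workers as threads in this
    # process, so no separate cluster is needed. With a real cluster one
    # would pass the scheduler address instead, e.g. Client('scheduler-addr:8786').
    with Client(processes=False) as client:
        # submit() sends one task to the scheduler and returns a future.
        futures = [client.submit(square, i) for i in range(5)]
        # This task depends on the futures above, so its inputs are
        # already held by the workers rather than by the client.
        total = client.submit(sum, futures)
        return total.result()


if __name__ == "__main__":
    print(run_demo())  # 30
```

Only the final `result()` call pulls data back to the client; the intermediate squares stay on the workers.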
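  8. Parallel processing in Scikit-Learn
     • The "n_jobs" argument sets the number of parallel jobs
     • Uses joblib internally
       • Developed mainly by Scikit-Learn committers
       • Parallelism within a single node (threading, multiprocessing)

         from sklearn.model_selection import GridSearchCV

         # pipe (the estimator) and param_grid are not shown on the slide
         grid = GridSearchCV(pipe, cv=3, n_jobs=12, param_grid=param_grid)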
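The slide's snippet leaves `pipe` and `param_grid` undefined. The sketch below fills them in so the search can actually run. The pipeline, the grid values, and `n_jobs=2` are assumptions chosen to keep the example small, not the ones used in the talk. The digits dataset is the one that appears later in the deck.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

digits = load_digits()

# Hypothetical pipeline and grid, chosen only to keep the example small.
pipe = Pipeline([('scale', StandardScaler()), ('svc', SVC())])
param_grid = {'svc__C': [1, 10], 'svc__gamma': [0.001, 0.01]}

# n_jobs=2 asks joblib to run up to 2 fits at once on this machine.
grid = GridSearchCV(pipe, param_grid=param_grid, cv=3, n_jobs=2)
grid.fit(digits.data, digits.target)

print(grid.best_params_)
```

With 4 parameter combinations and 3 folds, this schedules 12 independent fits, which is the unit of work that joblib parallelizes.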
  9. Distributed joblib
     • Pluggable API (0.10.0 and later)
     • A with block can change the default backend of joblib.Parallel
     • Caveats
       • Use the joblib bundled with scikit-learn (sklearn.externals.joblib)
       • Some workloads cannot be distributed
         • Those that explicitly specify threading / multiprocessing as the backend

         import distributed.joblib  # registers the 'dask.distributed' backend
         from sklearn.externals.joblib import parallel_backend

         with parallel_backend('dask.distributed', scheduler_host='scheduler-addr:8786'):
             grid.fit(digits.data, digits.target)
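The backend switch itself can be tried without a cluster. This sketch assumes the standalone `joblib` package rather than the copy bundled with scikit-learn, and it selects the built-in `'threading'` backend in place of the distributed one. The `worker_name` function is a hypothetical helper for illustration.

```python
import threading

from joblib import Parallel, delayed, parallel_backend


def worker_name(i):
    """Return the task number and the name of the thread that ran it."""
    return i, threading.current_thread().name


# Inside the with block, Parallel() picks up the backend named here instead
# of its default. 'threading' is built in; a distributed backend would be
# selected the same way once it has been registered.
with parallel_backend('threading', n_jobs=2):
    results = Parallel()(delayed(worker_name)(i) for i in range(4))

print(results)
```

Code that calls `Parallel` without naming a backend follows the context, which is why library code that hard-codes `backend='threading'` or `backend='multiprocessing'` cannot be redirected this way.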
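  10. dask-searchcv
     • A Dask-compatible version of Scikit-Learn's hyperparameter search:
       • Supports GridSearchCV and RandomizedSearchCV
       • Shares its API with Scikit-Learn
       • Accepts Dask Array and DataFrame objects as input
       • Avoids refitting the same estimator with the same parameters
         • Useful for Pipeline processing
     Note: part of the package formerly published as dklearn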
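Why avoiding repeated fits matters for a Pipeline can be shown with a toy count in plain Python. This is not how dask-searchcv is implemented. The `fit_step` function and both grids are hypothetical and only count how often the first pipeline step would be fitted.

```python
from itertools import product

fit_counts = {"naive": 0, "cached": 0}


def fit_step(name, params, mode):
    """Stand-in for an expensive fit; only counts how often it is called."""
    fit_counts[mode] += 1
    return (name, tuple(sorted(params.items())))


scaler_grid = [{"with_mean": True}, {"with_mean": False}]
clf_grid = [{"C": 0.1}, {"C": 1.0}, {"C": 10.0}]

# Naive search: refit the first pipeline step for every combination.
for scaler_params, clf_params in product(scaler_grid, clf_grid):
    fit_step("scaler", scaler_params, "naive")

# Shared search: fit each distinct (step, params) pair once and reuse it.
cache = {}
for scaler_params, clf_params in product(scaler_grid, clf_grid):
    key = ("scaler", tuple(sorted(scaler_params.items())))
    if key not in cache:
        cache[key] = fit_step("scaler", scaler_params, "cached")

print(fit_counts)  # {'naive': 6, 'cached': 2}
```

The saving grows with the size of the downstream grid, since every extra classifier setting would otherwise trigger another fit of the unchanged first step.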