Distributed Machine Learning with Dask Distributed

Sinhrks
June 28, 2017

@PyData Tokyo #13 Lightning Talk
https://pydatatokyo.connpass.com/event/58954/

Transcript

  1. Distributed Machine Learning with Dask Distributed

    Masaaki Horikoshi @ ARISE analytics

  2. Self-introduction
    • OSS activities:

    • GitHub: https://github.com/sinhrks

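  3. What is Dask?
    • A flexible framework for parallel and out-of-core processing

    • Provides data structures compatible with (a subset of) NumPy and pandas

    • Tasks are represented as a dynamic computation graph and executed in parallel by a scheduler

    • Packages that use Dask (partial list):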
    Airflow
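
    To make the "computation graph executed by a scheduler" bullet concrete, here is a minimal sketch using only the standard library. The dict-of-tuples layout imitates the style of graph Dask works with, but `toy_get`, `is_task`, and `resolve` are hypothetical helpers written for this illustration, not Dask functions, and the real scheduler is considerably more sophisticated.

```python
from concurrent.futures import ThreadPoolExecutor
from operator import add, mul

# Each key maps to a literal value or to a task tuple (function, arg1, arg2, ...).
# A string argument that names another key is a dependency on that key.
graph = {
    "x": 1,
    "y": 2,
    "sum": (add, "x", "y"),
    "prod": (mul, "x", "y"),        # independent of "sum", so both can run together
    "total": (add, "sum", "prod"),  # must wait for "sum" and "prod"
}


def is_task(value):
    """A task is a non-empty tuple whose first element is callable."""
    return isinstance(value, tuple) and bool(value) and callable(value[0])


def resolve(arg, graph, results):
    """Replace a key reference with its computed value; pass literals through."""
    if isinstance(arg, str) and arg in graph:
        return results[arg]
    return arg


def toy_get(graph, target, max_workers=4):
    """Hypothetical scheduler: run every ready task in parallel, round by round."""
    results = {k: v for k, v in graph.items() if not is_task(v)}
    pending = {k: v for k, v in graph.items() if is_task(v)}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while target not in results:
            # A task is ready once all of its dependencies have results.
            ready = [
                key for key, task in pending.items()
                if all(arg in results for arg in task[1:]
                       if isinstance(arg, str) and arg in graph)
            ]
            if not ready:
                raise ValueError("unknown target, cycle, or missing dependency")
            futures = {}
            for key in ready:
                func, *args = pending.pop(key)
                futures[key] = pool.submit(
                    func, *(resolve(a, graph, results) for a in args))
            for key, future in futures.items():
                results[key] = future.result()
    return results[target]


total = toy_get(graph, "total")
```

    For simplicity the sketch runs every ready task, including ones the target does not need, and waits for a whole round to finish before starting the next.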

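  4. Dask DataFrame
    • Composed of multiple pandas DataFrames

    • Processing is parallelized across the DataFrames produced by splitting along the rows

    [Figure: a pandas DataFrame beside a Dask DataFrame, labeled with "partition" and "division"]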

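  5. Create a pandas DataFrame with 10 rows and 3 columns:

    import numpy as np  # required for np.arange below
    import pandas as pd
    df = pd.DataFrame({'X': np.arange(10),
                       'Y': np.arange(10, 20),
                       'Z': np.arange(20, 30)},
                      index=list('abcdefghij'))
    df

    Convert it to a Dask DataFrame with 2 partitions:

    import dask.dataframe as dd
    ddf = dd.from_pandas(df, npartitions=2)
    ddf

    [Figure: the resulting Dask DataFrame, with 2 partitions and 3 divisions]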
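
    The relationship between partitions and divisions can be sketched without Dask. The helper below, `toy_partition`, is a hypothetical illustration that splits a sorted index into contiguous chunks and records the boundary labels; it assumes an even, size-based split, and the boundaries Dask actually chooses for a given input may differ.

```python
def toy_partition(index, npartitions):
    """Split a sorted index into contiguous partitions and record the boundaries."""
    size = max(1, -(-len(index) // npartitions))  # ceiling division
    partitions = [index[i:i + size] for i in range(0, len(index), size)]
    # One boundary per partition start, plus the last label overall,
    # so n partitions are described by n + 1 divisions.
    divisions = [part[0] for part in partitions] + [partitions[-1][-1]]
    return partitions, divisions


# Same index labels as the DataFrame on this slide.
partitions, divisions = toy_partition(list("abcdefghij"), 2)
```

    Under this assumption, 2 partitions are bounded by 3 divisions, which matches the counts shown in the slide's figure.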

  6. Blocked Algorithm (sum)
    ddf.sum().compute()

    [Figure: Sum (per partition), then Concat (concatenate), then Sum (over the whole)]
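
    The Sum, Concat, Sum flow above can be imitated with plain Python. This is a rough sketch of the blocked-algorithm idea only: `split_rows`, `column_sums`, and `blocked_sum` are hypothetical names, the rows are a stand-in for the DataFrame from the previous slide, and a thread pool stands in for the scheduler.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the 10-row table: X = 0..9, Y = 10..19, Z = 20..29.
rows = [{"X": i, "Y": 10 + i, "Z": 20 + i} for i in range(10)]


def split_rows(rows, npartitions):
    """Cut the rows into contiguous partitions of roughly equal size."""
    size = max(1, -(-len(rows) // npartitions))  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def column_sums(records):
    """Sum every column over a list of row-like dicts."""
    totals = {}
    for record in records:
        for column, value in record.items():
            totals[column] = totals.get(column, 0) + value
    return totals


def blocked_sum(rows, npartitions=2):
    partitions = split_rows(rows, npartitions)
    with ThreadPoolExecutor() as pool:
        # Step 1 (Sum): each partition is reduced independently, in parallel.
        partials = list(pool.map(column_sums, partitions))
    # Step 2 (Concat): `partials` is the small table of per-partition results.
    # Step 3 (Sum): reduce that small table to the final answer.
    return column_sums(partials)


result = blocked_sum(rows)
```

    The design point is that only the small per-partition results need to be brought together, never the full data.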

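  7. Dask Distributed
    • Lets the scheduler distribute computation across multiple nodes

    • Low latency: overhead is about 1 ms per task

    • Data sharing between Workers: data is transferred directly between Workers

    • Complex scheduling: can execute arbitrary computation graphs

    • Locality: avoids data transfer between Workers as much as possible

    [Figure: Distributed Client, Distributed Scheduler, and two Distributed Workers]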
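
    As a rough illustration of the locality bullet, the sketch below picks the Worker that would have to fetch the fewest bytes to run a task. Everything here is an assumption made for the example: `pick_worker`, the `who_has` and `nbytes` tables, and the worker and key names are hypothetical, and a real scheduler would also have to weigh other factors such as how busy each Worker is.

```python
def pick_worker(task_inputs, who_has, nbytes, workers):
    """Return the worker that needs the least data transferred to it."""
    def bytes_to_transfer(worker):
        # Only inputs the worker does not already hold must be moved.
        return sum(nbytes[key] for key in task_inputs
                   if worker not in who_has[key])
    return min(workers, key=bytes_to_transfer)


# Hypothetical cluster state: which worker holds which piece of data.
workers = ["worker-a", "worker-b"]
who_has = {
    "part-0": {"worker-a"},
    "part-1": {"worker-b"},
    "part-2": {"worker-b"},
}
nbytes = {"part-0": 800, "part-1": 500, "part-2": 500}

# worker-a would fetch 1000 bytes, worker-b only 800.
chosen = pick_worker(["part-0", "part-1", "part-2"], who_has, nbytes, workers)
```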

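  8. Parallel processing in Scikit-Learn
    • The "n_jobs" argument sets the number of parallel jobs

    • Uses joblib internally

    • Developed mainly by Scikit-Learn committers

    • Parallelism within a single node (threading, multiprocessing)

    from sklearn.model_selection import GridSearchCV
    # pipe and param_grid are assumed to be defined beforehand
    grid = GridSearchCV(pipe, cv=3, n_jobs=12,
                        param_grid=param_grid)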
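
    What `n_jobs` controls can be sketched with the standard library alone: every parameter combination is an independent job, and `n_jobs` caps how many run at once. The names `expand_grid`, `toy_score`, and `toy_grid_search` are hypothetical, the scoring function is a made-up stand-in for fitting and validating a model, and this is not how Scikit-Learn or joblib is implemented.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def expand_grid(param_grid):
    """Turn {'name': [values]} into a list of every parameter combination."""
    names = sorted(param_grid)
    return [dict(zip(names, values))
            for values in product(*(param_grid[name] for name in names))]


def toy_score(params):
    """Made-up objective that peaks at C=1.0, gamma=0.1."""
    return -((params["C"] - 1.0) ** 2) - (params["gamma"] - 0.1) ** 2


def toy_grid_search(param_grid, score, n_jobs=1):
    candidates = expand_grid(param_grid)
    # n_jobs bounds the number of candidates evaluated at the same time.
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        scores = list(pool.map(score, candidates))
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], scores[best]


toy_param_grid = {"C": [0.1, 1.0, 10.0], "gamma": [0.01, 0.1]}
best_params, best_score = toy_grid_search(toy_param_grid, toy_score, n_jobs=4)
```

    With cross-validation, as in `cv=3` above, each candidate would be evaluated once per split; the sketch scores each candidate a single time to stay short.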

  9. Distributed joblib
    • Pluggable API (0.10.0-)

    • The default backend of joblib.Parallel can be changed inside a with block

    • Caveats

        • Use the joblib bundled with scikit-learn (sklearn.externals.joblib)

        • Some work cannot be distributed

            • Code that explicitly specifies threading / multiprocessing as the backend

    # importing distributed.joblib registers the 'dask.distributed' backend
    import distributed.joblib
    from sklearn.externals.joblib import parallel_backend
    with parallel_backend('dask.distributed', scheduler_host='scheduler-addr:8786'):
        grid.fit(digits.data, digits.target)
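
    The mechanism behind "change the default backend inside a with block", and the caveat about explicitly specified backends, can be sketched as follows. This is a standard-library imitation under stated assumptions: `toy_parallel_backend`, `toy_parallel`, `resolve_backend`, and the backend names are hypothetical and do not reflect joblib's internals.

```python
from concurrent.futures import ThreadPoolExecutor
from contextlib import contextmanager


def run_serial(func, items):
    return [func(item) for item in items]


def run_threads(func, items):
    with ThreadPoolExecutor() as pool:
        return list(pool.map(func, items))


# Hypothetical backend registry and process-wide default.
BACKENDS = {"serial": run_serial, "threads": run_threads}
_state = {"default": "serial"}


@contextmanager
def toy_parallel_backend(name):
    """Temporarily swap the default backend; restore it on exit."""
    if name not in BACKENDS:
        raise ValueError(f"unknown backend: {name}")
    previous = _state["default"]
    _state["default"] = name
    try:
        yield
    finally:
        _state["default"] = previous


def resolve_backend(explicit=None):
    """An explicitly requested backend always wins over the default."""
    return explicit or _state["default"]


def toy_parallel(func, items, backend=None):
    return BACKENDS[resolve_backend(backend)](func, items)


with toy_parallel_backend("threads"):
    inside = resolve_backend()          # picks up the swapped default
    pinned = resolve_backend("serial")  # explicit choice is not overridden
    squares = toy_parallel(lambda x: x * x, [1, 2, 3])
after = resolve_backend()               # default restored after the block
```

    The `pinned` case mirrors the caveat: a call that names its backend explicitly is unaffected by the surrounding block, so it would not be redirected to the cluster.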

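  10. dask-searchcv
    • A Dask-compatible version of Scikit-Learn's hyperparameter search:

    • Supports GridSearchCV and RandomizedSearchCV

    • Same API as Scikit-Learn

    • Accepts Dask Arrays and DataFrames as input

    • Avoids repeatedly fitting the same estimator with the same parameters

    • Useful for Pipeline processing

    ※ Part of a package previously published as dklearn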
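
    Why avoiding repeated fits matters for Pipelines can be shown with a small count. In the sketch below, a cache keyed on (step, parameters, upstream result) lets candidates that share the same early steps reuse them. The step names, `fit_step`, and `fit_pipeline` are hypothetical, nothing is actually fitted, and the cache only imitates the effect; it is not a description of how dask-searchcv is implemented.

```python
from itertools import product

fit_calls = []  # records every (step, params) that was really "fitted"


def fit_step(name, params, upstream_key):
    """Stand-in for an expensive fit; returns a token for the fitted step."""
    fit_calls.append((name, params))
    return (name, params, upstream_key)


def fit_pipeline(steps, candidate, cache):
    """Fit the steps in order, reusing any identical, already fitted prefix."""
    upstream = "data"
    for name in steps:
        key = (name, candidate[name], upstream)
        if key not in cache:
            cache[key] = fit_step(name, candidate[name], upstream)
        upstream = key
    return cache[upstream]


steps = ["scale", "pca", "svc"]
# 2 x 3 = 6 candidates; parameters are tuples so they can be used as cache keys.
candidates = [
    {"scale": (), "pca": (("n_components", n),), "svc": (("C", c),)}
    for n, c in product([2, 5], [0.1, 1.0, 10.0])
]

cache = {}
for candidate in candidates:
    fit_pipeline(steps, candidate, cache)

naive_fits = len(candidates) * len(steps)  # every step refitted per candidate
shared_fits = len(fit_calls)               # fits actually performed with sharing
```

    In this example the shared version performs 9 fits (1 for scale, 2 for pca, 6 for svc) instead of 18.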
