Lock in $30 Savings on PRO—Offer Ends Soon! ⏳
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Dask Distributedによる分散機械学習
Search
Sinhrks
June 28, 2017
4
1.5k
Dask Distributedによる分散機械学習
@PyData Tokyo #13 Lightning Talk
https://pydatatokyo.connpass.com/event/58954/
Sinhrks
June 28, 2017
Tweet
Share
More Decks by Sinhrks
See All by Sinhrks
daskperiment: Reproducibility for Humans
sinhrks
1
410
PythonとApache Arrow
sinhrks
6
2k
大規模データの機械学習におけるDaskの活用
sinhrks
10
3.2k
機械学習と解釈可能性
sinhrks
7
5.7k
LIME
sinhrks
2
1.4k
データ分析言語R 1年の振り返り
sinhrks
5
2.5k
pandasでのOSS活動事例と最初の一歩
sinhrks
2
19k
Data processing using pandas and Dask
sinhrks
1
270
pandasでのOSS活動事例
sinhrks
0
790
Featured
See All Featured
Documentation Writing (for coders)
carmenintech
76
5.1k
Cheating the UX When There Is Nothing More to Optimize - PixelPioneers
stephaniewalter
285
14k
The Web Performance Landscape in 2024 [PerfNow 2024]
tammyeverts
11
950
Leading Effective Engineering Teams in the AI Era
addyosmani
8
1.2k
How Fast Is Fast Enough? [PerfNow 2025]
tammyeverts
3
350
Responsive Adventures: Dirty Tricks From The Dark Corners of Front-End
smashingmag
253
22k
Six Lessons from altMBA
skipperchong
29
4.1k
BBQ
matthewcrist
89
9.9k
Learning to Love Humans: Emotional Interface Design
aarron
274
41k
Speed Design
sergeychernyshev
33
1.3k
A better future with KSS
kneath
239
18k
Building Applications with DynamoDB
mza
96
6.8k
Transcript
Dask DistributedʹΑΔ ࢄػցֶश Masaaki Horikoshi @ ARISE analytics
ࣗݾհ • OSS׆ಈ: • GitHub: https://github.com/sinhrks
Daskͱ • ॊೈͳฒྻɾOut of CoreॲཧϑϨʔϜϫʔΫ • NumPy, pandasޓ(αϒηοτ)ͷσʔλߏΛఏڙ • λεΫಈతͳܭࢉάϥϑͱͯ͠දݱ͞Εɺεέδϡʔ
ϥʹΑͬͯฒྻ࣮ߦ • DaskΛར༻͢Δύοέʔδ(Ұ෦): Airflow
Dask DataFrame • ෳͷpandas DataFramesʹΑΓߏ • ॎʹׂ͞ΕͨDataFrame͝ͱʹॲཧΛฒྻԽ QBOEBT%BUB'SBNF %BTL%BUB'SBNF QBSUJUJPO
EJWJTJPO EJWJTJPO
import pandas as pd df = pd.DataFrame({'X': np.arange(10), 'Y': np.arange(10,
20), 'Z': np.arange(20, 30)}, index=list('abcdefghij')) df import dask.dataframe as dd ddf = dd.from_pandas(df, 2) ddf ߦྻͷ QBOEBT%BUB'SBNFΛ࡞ Dask DataFrame QBSUJUJPO QBSUJUJPO EJWJTJPO EJWJTJPO EJWJTJPO
Blocked Algorithm (߹ܭ) ddf.sum().compute() 4VN 4VN $PODBU 4VN ߹ܭ શମ
࿈݁ ߹ܭ QBSUJUJPO͝ͱ
Dask Distributed • εέδϡʔϥͰͷܭࢉ࣮ߦΛෳϊʔυͰࢄͰ͖Δ • ϨΠςϯγ: λεΫຖͷΦʔόʔϔου1msఔ • WorkerؒͰͷσʔλڞ༗: σʔλసૹWorkerؒͰ࣮ࢪ
• ෳࡶͳεέδϡʔϦϯά: ҙͷܭࢉάϥϑΛ࣮ߦՄ • ہॴੑ: WorkerؒͷσʔλసૹΛͳΔ͘ߦΘͳ͍ %JTUSJCVUFE 8PSLFS %JTUSJCVUFE 8PSLFS %JTUSJCVUFE 4DIFEVMFS %JTUSJCVUFE $MJFOU
Scikit-Learnͷฒྻॲཧ • “n_jobs” ҾͰฒྻ࣮ߦΛࢦఆ • ෦తʹjoblibΛར༻ • Scikit-Learnίϛολத৺ʹ։ൃ • ϊʔυฒྻ
(threading, multiprocessing) from sklearn.model_selection import GridSearchCV grid = GridSearchCV(pipe, cv=3, n_jobs=12, param_grid=param_grid)
Distributed joblib • ϓϥΨϒϧAPI (0.10.0-) • with ϒϩοΫͰ joblib.Parallel ͷطఆόοΫΤϯυΛมߋՄ
• ҙ • scikit-learnʹόϯυϧ͞Ε͍ͯΔjoblibΛ͏ (sklearn.externals.joblib) • ࢄͰ͖ͳ͍߹͋Δ • backendͱͯ͠threading / multiprocessing͕໌ࣔ͞Ε͍ͯΔͷ import distributed.joblib from sklearn.externals.joblib import parallel_backend with parallel_backend('dask.distributed', scheduler_host=‘scheduler-addr:8786’): grid.fit(digits.data, digits.target)
dask-searchcv • Scikit-LearnͷϋΠύʔύϥϝʔλαʔνΛ Dask ޓʹͨ͠ͷ: • GridSearchCVͱRandomizedSearchCVΛαϙʔτ • APIScikit-Learnͱڞ௨ •
Dask Array DataFrameΛೖྗͱͯͤ͠Δ • ಉҰɺಉύϥϝʔλͷֶशثͷ܁Γฦ࣮͠ߦΛආ͚Δ • PipelineॲཧͰ༗༻ ※աڈʹ dklearn ͱͯ͠ެ։͞Ε͍ͯͨύοέʔδͷҰ෦