Upgrade to Pro
— share decks privately, control downloads, hide ads and more …
Speaker Deck
Features
Speaker Deck
PRO
Sign in
Sign up for free
Search
Search
Dask Distributedによる分散機械学習
Search
Sinhrks
June 28, 2017
1.6k
4
Share
Dask Distributedによる分散機械学習
@PyData Tokyo #13 Lightning Talk
https://pydatatokyo.connpass.com/event/58954/
Sinhrks
June 28, 2017
More Decks by Sinhrks
See All by Sinhrks
daskperiment: Reproducibility for Humans
sinhrks
1
440
PythonとApache Arrow
sinhrks
6
2k
大規模データの機械学習におけるDaskの活用
sinhrks
10
3.3k
機械学習と解釈可能性
sinhrks
7
5.8k
LIME
sinhrks
2
1.5k
データ分析言語R 1年の振り返り
sinhrks
5
2.6k
pandasでのOSS活動事例と最初の一歩
sinhrks
2
20k
Data processing using pandas and Dask
sinhrks
1
300
pandasでのOSS活動事例
sinhrks
0
830
Featured
See All Featured
The AI Revolution Will Not Be Monopolized: How open-source beats economies of scale, even for LLMs
inesmontani
PRO
3
3.5k
Building the Perfect Custom Keyboard
takai
2
780
Information Architects: The Missing Link in Design Systems
soysaucechin
0
950
GraphQLとの向き合い方2022年版
quramy
50
15k
Marketing Yourself as an Engineer | Alaka | Gurzu
gurzu
0
210
Exploring the Power of Turbo Streams & Action Cable | RailsConf2023
kevinliebholz
37
6.5k
A brief & incomplete history of UX Design for the World Wide Web: 1989–2019
jct
2
390
The Mindset for Success: Future Career Progression
greggifford
PRO
0
350
Taking LLMs out of the black box: A practical guide to human-in-the-loop distillation
inesmontani
PRO
3
2.2k
Impact Scores and Hybrid Strategies: The future of link building
tamaranovitovic
0
300
A better future with KSS
kneath
240
18k
DBのスキルで生き残る技術 - AI時代におけるテーブル設計の勘所
soudai
PRO
65
55k
Transcript
Dask DistributedʹΑΔ ࢄػցֶश Masaaki Horikoshi @ ARISE analytics
ࣗݾհ • OSS׆ಈ: • GitHub: https://github.com/sinhrks
Daskͱ • ॊೈͳฒྻɾOut of CoreॲཧϑϨʔϜϫʔΫ • NumPy, pandasޓ(αϒηοτ)ͷσʔλߏΛఏڙ • λεΫಈతͳܭࢉάϥϑͱͯ͠දݱ͞Εɺεέδϡʔ
ϥʹΑͬͯฒྻ࣮ߦ • DaskΛར༻͢Δύοέʔδ(Ұ෦): Airflow
Dask DataFrame • ෳͷpandas DataFramesʹΑΓߏ • ॎʹׂ͞ΕͨDataFrame͝ͱʹॲཧΛฒྻԽ QBOEBT%BUB'SBNF %BTL%BUB'SBNF QBSUJUJPO
EJWJTJPO EJWJTJPO
import pandas as pd df = pd.DataFrame({'X': np.arange(10), 'Y': np.arange(10,
20), 'Z': np.arange(20, 30)}, index=list('abcdefghij')) df import dask.dataframe as dd ddf = dd.from_pandas(df, 2) ddf ߦྻͷ QBOEBT%BUB'SBNFΛ࡞ Dask DataFrame QBSUJUJPO QBSUJUJPO EJWJTJPO EJWJTJPO EJWJTJPO
Blocked Algorithm (߹ܭ) ddf.sum().compute() 4VN 4VN $PODBU 4VN ߹ܭ શମ
࿈݁ ߹ܭ QBSUJUJPO͝ͱ
Dask Distributed • εέδϡʔϥͰͷܭࢉ࣮ߦΛෳϊʔυͰࢄͰ͖Δ • ϨΠςϯγ: λεΫຖͷΦʔόʔϔου1msఔ • WorkerؒͰͷσʔλڞ༗: σʔλసૹWorkerؒͰ࣮ࢪ
• ෳࡶͳεέδϡʔϦϯά: ҙͷܭࢉάϥϑΛ࣮ߦՄ • ہॴੑ: WorkerؒͷσʔλసૹΛͳΔ͘ߦΘͳ͍ %JTUSJCVUFE 8PSLFS %JTUSJCVUFE 8PSLFS %JTUSJCVUFE 4DIFEVMFS %JTUSJCVUFE $MJFOU
Scikit-Learnͷฒྻॲཧ • “n_jobs” ҾͰฒྻ࣮ߦΛࢦఆ • ෦తʹjoblibΛར༻ • Scikit-Learnίϛολத৺ʹ։ൃ • ϊʔυฒྻ
(threading, multiprocessing) from sklearn.model_selection import GridSearchCV grid = GridSearchCV(pipe, cv=3, n_jobs=12, param_grid=param_grid)
Distributed joblib • ϓϥΨϒϧAPI (0.10.0-) • with ϒϩοΫͰ joblib.Parallel ͷطఆόοΫΤϯυΛมߋՄ
• ҙ • scikit-learnʹόϯυϧ͞Ε͍ͯΔjoblibΛ͏ (sklearn.externals.joblib) • ࢄͰ͖ͳ͍߹͋Δ • backendͱͯ͠threading / multiprocessing͕໌ࣔ͞Ε͍ͯΔͷ import distributed.joblib from sklearn.externals.joblib import parallel_backend with parallel_backend('dask.distributed', scheduler_host=‘scheduler-addr:8786’): grid.fit(digits.data, digits.target)
dask-searchcv • Scikit-LearnͷϋΠύʔύϥϝʔλαʔνΛ Dask ޓʹͨ͠ͷ: • GridSearchCVͱRandomizedSearchCVΛαϙʔτ • APIScikit-Learnͱڞ௨ •
Dask Array DataFrameΛೖྗͱͯͤ͠Δ • ಉҰɺಉύϥϝʔλͷֶशثͷ܁Γฦ࣮͠ߦΛආ͚Δ • PipelineॲཧͰ༗༻ ※աڈʹ dklearn ͱͯ͠ެ։͞Ε͍ͯͨύοέʔδͷҰ෦