Slide 1

Slide 1 text

Using Dask for Machine Learning on Large-Scale Data
Masaaki Horikoshi @ ARISE analytics

Slide 2

Slide 2 text

About me
• OSS activities, etc.:
• GitHub: https://github.com/sinhrks

Slide 3

Slide 3 text

Contents
• What is Dask
• Using Dask for machine learning
  • Dask ML
  • Distributed

Slide 4

Slide 4 text

Python for Big Data
• When you process large-scale data in Python...
  • Computation runs on a single thread (as a rule), so processing is slow
  • The data does not fit in physical memory

Slide 5

Slide 5 text

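Parallel processing and out-of-core processing
(Figure panels: parallel processing, out-of-core processing)
• Parallel processing: process multiple tasks in parallel
• Out-of-core processing: process data that does not fit in memory piece by piece, sequentially
• The two are not mutually exclusive (out-of-core processing can also run in parallel)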
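The difference between the two ideas can be sketched in plain Python, without Dask. This is only an illustration under stated assumptions: the helper names (`iter_chunks`, `out_of_core_sum`, `parallel_chunked_sum`) are hypothetical, and a `range` stands in for data too large to load at once.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def iter_chunks(values, chunk_size):
    """Yield successive lists of at most chunk_size items from any iterable."""
    iterator = iter(values)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def out_of_core_sum(values, chunk_size):
    """Out-of-core style: only one chunk is held in memory at a time."""
    total = 0
    for chunk in iter_chunks(values, chunk_size):
        total += sum(chunk)
    return total


def parallel_chunked_sum(values, chunk_size, workers=4):
    """Parallel style: each chunk is summed by a worker thread.

    Executor.map submits every chunk up front, so this variant shows the
    parallel structure only; a real out-of-core scheduler would also bound
    how many chunks are in flight.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(sum, iter_chunks(values, chunk_size)))


data = range(100_000)  # stands in for a dataset that is read lazily
sequential_total = out_of_core_sum(data, chunk_size=1_000)
parallel_total = parallel_chunked_sum(data, chunk_size=1_000)
```

Both functions return the same total; they differ only in how the chunks are scheduled.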

Slide 6

Slide 6 text

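What is Dask
• A flexible package for parallel and out-of-core processing
• Aimed mainly at data processing and numerical computation
• Provides data structures that implement a subset of NumPy and pandas
• Represents processing as a computation graph, then schedules and executes it dynamically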

Slide 7

Slide 7 text

(Incomplete) list of OSS that uses Dask
• (TFLearn) Deep learning library featuring a higher-level API for TensorFlow.
• Airflow (Distributed Scheduler): A platform to author, schedule and monitor workflows.
• Image Processing SciKit.
• N-D labeled arrays and datasets in Python.
• Executes end-to-end data science and analytics pipelines entirely on GPUs.

Slide 8

Slide 8 text

Dask data structures (API)
Dask API | Base class | Default scheduler
Dask Array | NumPy ndarray | threading
Dask DataFrame | pandas DataFrame | threading
Dask Bag | PyToolz (list, set, dict) | multiprocessing (cannot release the GIL)
Dask Delayed | Arbitrary functions | threading

Slide 9

Slide 9 text

Dask Array
• Composed of multiple NumPy ndarrays
• Internally split into chunks along each axis
(Figure labels: NumPy ndarray, Dask Array, chunk, chunksize)

Slide 10

Slide 10 text

Dask Array
Create a 10x10 ndarray:
    import numpy as np
    x = np.ones((10, 10))
    x
    array([[ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
           [ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
           [ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
           [ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
           [ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
           [ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
           [ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
           [ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
           [ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
           [ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]])
Create a 10x10 Dask Array (internally split into four 5x5 chunks):
    import dask.array as da
    dx = da.ones((10, 10), chunks=(5, 5))
    dx
    dask.array

Slide 11

Slide 11 text

Dask Array
Create a 10x10 Dask Array (internally split into four 5x5 chunks):
    import dask.array as da
    dx = da.ones((10, 10), chunks=(5, 5))
    dx
    dask.array
Display the internal computation graph (each leaf node corresponds to one chunk):
    dx.visualize()

Slide 12

Slide 12 text

Dask Array
Sum the values along axis=0:
    dy = dx.sum(axis=0)
    dy
    dask.array
    dy.visualize()
    dy.compute()
    array([10., 10., 10., 10., 10., 10., 10., 10., 10., 10.])
(Graph labels: Sum, Sum)

Slide 13

Slide 13 text

Dask Array
Sum all of the values:
    dy2 = dx.sum()
    dy2
    dask.array
    dy2.visualize()
    dy2.compute()
    100.0

Slide 14

Slide 14 text

Dask Array
• A Dask Array can have an arbitrary shape and chunk size
• When an operation works across chunks, it is better to make the chunk sizes match
    da.ones((30, 20, 1, 15), chunks=(3, 7, 1, 2))
    dask.array
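To make the effect of a chunk size concrete, the following plain-Python sketch computes how a shape is divided into blocks when the chunk size does not divide an axis evenly. The function `chunk_layout` is a hypothetical helper written for this illustration, not part of Dask; it assumes the usual convention that the last block on an axis holds the remainder.

```python
from math import prod


def chunk_layout(shape, chunks):
    """Return the block lengths along each axis for a shape and chunk size."""
    layout = []
    for dim, size in zip(shape, chunks):
        full, remainder = divmod(dim, size)
        # Full-size blocks first, then one smaller block for any remainder.
        layout.append((size,) * full + ((remainder,) if remainder else ()))
    return tuple(layout)


# Same shape and chunk size as the slide above.
layout = chunk_layout((30, 20, 1, 15), (3, 7, 1, 2))
blocks_per_axis = tuple(len(axis) for axis in layout)
total_blocks = prod(blocks_per_axis)
```

Here the second axis (length 20, chunk size 7) ends with a smaller block of length 6, which is the kind of mismatch that makes operations across chunks more expensive.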

Slide 15

Slide 15 text

Dask DataFrame
• Composed of multiple pandas DataFrames
• Internally split into partitions along the index (row labels)
(Figure labels: pandas DataFrame, Dask DataFrame, partition, division, division)

Slide 16

Slide 16 text

Dask DataFrame
Create a pandas DataFrame (df), then build a Dask DataFrame from it:
    import dask.dataframe as dd
    ddf = dd.from_pandas(df, 2)
    ddf
(Figure labels: partition, partition, division, division, division)

Slide 17

Slide 17 text

Blocked Algorithm (Mean)
    ddf.mean().visualize()
(Graph labels: Mean = Sum / Count; Sum (over partitions); Count (over partitions); Sum (per partition); Count (per partition))
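The blocked mean in the graph above can be sketched in plain Python: each partition contributes a sum and a count, and the division happens only once at the end. The names `partition_stats` and `blocked_mean` are hypothetical and exist only for this illustration.

```python
def partition_stats(partition):
    """Per-partition step: return (sum, count) for one block of values."""
    return sum(partition), len(partition)


def blocked_mean(partitions):
    """Combine per-partition sums and counts, then divide once at the end."""
    stats = [partition_stats(p) for p in partitions]
    total = sum(s for s, _ in stats)
    count = sum(c for _, c in stats)
    return total / count


partitions = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]  # unequal partition sizes
mean = blocked_mean(partitions)

# Averaging the per-partition means gives a different (wrong) answer
# when the partitions have different sizes.
naive = sum(sum(p) / len(p) for p in partitions) / len(partitions)
```

Carrying the count alongside the sum is what keeps the result correct when partitions are unequal.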

Slide 18

Slide 18 text

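How a computation graph is executed
• Analyze the dependencies in the computation graph
• Remove tasks that are not needed for the final result
• Merge identical tasks
• Analyze the order of computation, both statically and dynamically
• Optimize the computation graph
• Inline specific operations
• Assign consecutive tasks to the same worker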
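Two of the steps above, dependency analysis and removing tasks that the final result does not need, can be sketched on a toy graph. The graph is a dict whose values are either literals or tuples of the form (function, arguments...), where string arguments that are keys of the dict are references to other tasks. The functions `dependencies`, `cull`, and `execute` are a simplified illustration written for this note, not Dask's implementation.

```python
from operator import add, mul


def is_task(value):
    """A task is a tuple whose first element is callable."""
    return isinstance(value, tuple) and bool(value) and callable(value[0])


def dependencies(graph, key):
    """Keys that the task stored under `key` reads from."""
    task = graph[key]
    if not is_task(task):
        return []
    return [arg for arg in task[1:] if isinstance(arg, str) and arg in graph]


def cull(graph, target):
    """Keep only the tasks reachable from `target`."""
    needed, stack = set(), [target]
    while stack:
        key = stack.pop()
        if key not in needed:
            needed.add(key)
            stack.extend(dependencies(graph, key))
    return {key: value for key, value in graph.items() if key in needed}


def execute(graph, key, cache=None):
    """Evaluate `key` recursively, computing each task at most once."""
    cache = {} if cache is None else cache
    if key not in cache:
        task = graph[key]
        if is_task(task):
            func, *args = task
            values = [
                execute(graph, a, cache) if isinstance(a, str) and a in graph else a
                for a in args
            ]
            cache[key] = func(*values)
        else:
            cache[key] = task
    return cache[key]


graph = {
    "x": 1,
    "y": 2,
    "z": (add, "x", "y"),
    "w": (mul, "z", 10),
    "unused": (mul, "y", 100),  # not needed to compute "w"
}
culled = cull(graph, "w")
result = execute(culled, "w")
```

After culling, the task "unused" is gone and only the four tasks that "w" depends on are executed.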

Slide 19

Slide 19 text

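Dask Internals
• Every Dask operation is represented as a computation graph
• Dask also implements complex algorithms such as linear algebra (dask.array.linalg)
• Users can build computation graphs for their own data structures and functions (dask.delayed)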
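As a small illustration of the last point, the sketch below builds a computation graph from ordinary functions with `dask.delayed`. The functions `load`, `total`, and `combine` are hypothetical stand-ins invented for this example; only `delayed` and `.compute()` come from Dask.

```python
from dask import delayed


@delayed
def load(i):
    # Hypothetical stand-in for reading one piece of a dataset.
    return list(range(i * 3, i * 3 + 3))


@delayed
def total(values):
    return sum(values)


@delayed
def combine(parts):
    return sum(parts)


partials = [total(load(i)) for i in range(4)]  # builds the graph; nothing runs yet
result = combine(partials)                     # still lazy
value = result.compute()                       # executes the whole graph
```

Calling a delayed function only records a task, so the four `load`/`total` branches are independent and can be scheduled in parallel when `compute()` is called.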

Slide 20

Slide 20 text

Linear Algebra
Function | Description
linalg.cholesky | Returns the Cholesky decomposition of a Hermitian positive-definite matrix A
linalg.inv | Compute the inverse of a matrix with LU decomposition and forward / backward substitutions
linalg.lstsq | Return the least-squares solution to a linear matrix equation using QR decomposition
linalg.lu | Compute the LU decomposition of a matrix
linalg.norm | Matrix or vector norm
linalg.qr | Compute the QR factorization of a matrix
linalg.solve | Solve the equation a x = b for x
linalg.solve_triangular | Solve the equation a x = b for x, assuming a is a triangular matrix
linalg.svd | Compute the singular value decomposition of a matrix
linalg.svd_compressed | Randomly compressed rank-k thin Singular Value Decomposition
linalg.sfqr | Direct Short-and-Fat QR
linalg.tsqr | Direct Tall-and-Skinny QR algorithm

Slide 21

Slide 21 text

Example: LU decomposition
• LU decomposition
• An LU decomposition can be split into computations on individual chunks (blocks)
A = L x U

Slide 22

Slide 22 text

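Blocked LU Decomposition
• Diagonal block
• Row direction (i < j)
• Column direction (i < j)
(The block equations on this slide, each introduced by ∴, are not recoverable from the extracted text)
* LU: function that computes an LU decomposition
* Solve: function that solves a linear equation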
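Since the slide's own equations did not survive extraction, the following is a standard textbook derivation of block LU given for orientation, not a reconstruction of the slide. It assumes no pivoting, uses my own index convention (block row i, block column j), and writes LU(M) for a function returning the LU factors of M and Solve(M, B) for a function returning X with M X = B.

```latex
% Block LU without pivoting: A = LU, L block lower triangular, U block upper triangular.
\begin{aligned}
A_{ij} &= \sum_{k=1}^{\min(i,j)} L_{ik} U_{kj} \\[4pt]
\text{diagonal } (i = j):\quad
  L_{ii} U_{ii} &= A_{ii} - \sum_{k<i} L_{ik} U_{ki}
  \;\Rightarrow\;
  (L_{ii}, U_{ii}) = \mathrm{LU}\Bigl(A_{ii} - \sum_{k<i} L_{ik} U_{ki}\Bigr) \\[4pt]
\text{upper } (i < j):\quad
  L_{ii} U_{ij} &= A_{ij} - \sum_{k<i} L_{ik} U_{kj}
  \;\Rightarrow\;
  U_{ij} = \mathrm{Solve}\Bigl(L_{ii},\; A_{ij} - \sum_{k<i} L_{ik} U_{kj}\Bigr) \\[4pt]
\text{lower } (i > j):\quad
  L_{ij} U_{jj} &= A_{ij} - \sum_{k<j} L_{ik} U_{kj}
  \;\Rightarrow\;
  L_{ij} = \mathrm{Solve}\Bigl(U_{jj}^{\top},\; \bigl(A_{ij} - \sum_{k<j} L_{ik} U_{kj}\bigr)^{\top}\Bigr)^{\top}
\end{aligned}
```

Each block on the right-hand side depends only on blocks with smaller indices, which is why the decomposition can be expressed as a graph of per-chunk tasks.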

Slide 23

Slide 23 text

Blocked LU Decomposition
    arr = da.random.random((9, 9), chunks=(3, 3))
    arr
    dask.array
    from dask import compute
    t, l, u = da.linalg.lu(arr)
    t, l, u = compute(t, l, u)
Merge the three computation graphs and execute them together

Slide 24

Slide 24 text

Blocked LU Decomposition
    from dask import visualize
    visualize(t, l, u)

Slide 25

Slide 25 text

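Overview of processing with Dask
Domains: matrix computation | tabular data processing | machine learning
• Sequential processing: NumPy, pandas, scikit-learn
• Parallel processing (within a node) and out-of-core processing (within a node): Dask, Dask ML
• Distributed processing (across nodes): Dask Distributed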

Slide 26

Slide 26 text

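Parallel processing in scikit-learn
• The "n_jobs" argument sets the number of parallel jobs
• Internally uses joblib
• Developed mainly by scikit-learn committers
• Parallelism within a node (threading, multiprocessing)
• Cannot do out-of-core processing or distributed processing across nodes
    from sklearn.model_selection import GridSearchCV
    grid = GridSearchCV(pipe, cv=3, n_jobs=12, param_grid=param_grid)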

Slide 27

Slide 27 text

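Overview of processing with Dask
Domains: matrix computation | tabular data processing | machine learning
• Sequential processing: NumPy, pandas, scikit-learn
• Parallel processing (within a node) and out-of-core processing (within a node): Dask, Dask ML
• Distributed processing (across nodes): Dask Distributed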

Slide 28

Slide 28 text

Do you really need all of the data?
• Isn't sampling enough?
http://scikit-learn.org/stable/auto_examples/model_selection/plot_learning_curve.html

Slide 29

Slide 29 text

Dask ML
• A package for using Dask in machine learning
• Runs machine learning on Dask data structures
• Uses Dask to make machine learning algorithms parallel and out-of-core

Slide 30

Slide 30 text

Dask ML
• Provides subpackages for machine learning
• scikit-learn-compatible estimators that support Dask:
  • Preprocessing
  • Model Selection
    • Cross validation
    • Hyper parameter search
  • GLM
  • Clustering
• Wrappers that make scikit-learn parallel and out-of-core:
  • Incremental
  • ParallelPostFit
• Support for packages other than scikit-learn:
  • XGBoost
  • TensorFlow

Slide 31

Slide 31 text

Dask ML
• scikit-learn-compatible estimators can handle Dask data structures
(Figure: with scikit-learn, a Dask Array is converted to a NumPy ndarray through the array interface before the computation; with Dask ML, the computation runs while keeping the Dask data structure)

Slide 32

Slide 32 text

k-means
https://en.wikipedia.org/wiki/K-means_clustering

Slide 33

Slide 33 text

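k-means (Dask ML)
(Figure: the dataset is split into chunks)
• Sampling: choose the initial centroids from the dataset
• Compute the distance from each centroid and assign records to clusters (run per chunk)
• Aggregate the records of each cluster and update the centroids (run per chunk, then aggregate; graph labels: Sum, Sum, Count)
• By default, centroids are initialized with k-means|| from Scalable K-means++ (Bahmani et al.)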
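One iteration of the per-chunk assignment and aggregation can be sketched in plain Python. This is an illustration under stated assumptions, not Dask ML's implementation: the helper names are hypothetical, the initial centroids are given by hand rather than by k-means||, and the chunks are processed in a simple loop where a scheduler could run them independently.

```python
def nearest(point, centroids):
    """Index of the centroid closest to `point` (squared Euclidean distance)."""
    distances = [
        sum((p - c) ** 2 for p, c in zip(point, centroid)) for centroid in centroids
    ]
    return distances.index(min(distances))


def chunk_stats(chunk, centroids):
    """Per-chunk step: per-cluster coordinate sums and record counts."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for point in chunk:
        k = nearest(point, centroids)
        counts[k] += 1
        for d, value in enumerate(point):
            sums[k][d] += value
    return sums, counts


def update_centroids(chunks, centroids):
    """Aggregate the per-chunk sums and counts, then recompute each centroid."""
    dim = len(centroids[0])
    total_sums = [[0.0] * dim for _ in centroids]
    total_counts = [0] * len(centroids)
    for chunk in chunks:  # each chunk only needs the current centroids
        sums, counts = chunk_stats(chunk, centroids)
        for k in range(len(centroids)):
            total_counts[k] += counts[k]
            for d in range(dim):
                total_sums[k][d] += sums[k][d]
    return [
        [s / total_counts[k] for s in total_sums[k]]
        if total_counts[k]
        else list(centroids[k])  # keep an empty cluster's centroid unchanged
        for k in range(len(centroids))
    ]


chunks = [
    [(0.0, 0.0), (0.0, 2.0)],
    [(10.0, 10.0), (2.0, 0.0)],
    [(10.0, 12.0), (2.0, 2.0)],
]
new_centroids = update_centroids(chunks, centroids=[(1.0, 0.0), (9.0, 9.0)])
```

Only the small per-cluster sums and counts cross chunk boundaries, so the full dataset never has to be in one place.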

Slide 34

Slide 34 text

Incremental
• Applies partial_fit to the chunks in order
(Figure: calling fit on Incremental with a Dask Array calls partial_fit on each chunk in turn)
• partial_fit runs sequentially; it is not processed in parallel
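The sequential behaviour described above can be sketched with a toy estimator in plain Python. `RunningMean` and `incremental_fit` are hypothetical names invented for this illustration; they mimic the scikit-learn `partial_fit` convention but are not Dask ML code.

```python
class RunningMean:
    """Toy estimator with a scikit-learn-style partial_fit (illustration only)."""

    def __init__(self):
        self.count_ = 0
        self.mean_ = 0.0

    def partial_fit(self, values):
        # Update the running mean with one chunk of data.
        for value in values:
            self.count_ += 1
            self.mean_ += (value - self.mean_) / self.count_
        return self


def incremental_fit(estimator, chunks):
    """Feed the chunks to partial_fit one after another, in order."""
    for chunk in chunks:  # each call depends on the state left by the previous one
        estimator.partial_fit(chunk)
    return estimator


chunks = [[1.0, 2.0], [3.0, 4.0], [5.0]]
model = incremental_fit(RunningMean(), chunks)
```

Because every call updates the state left by the previous one, the chunks cannot be fitted in parallel; what this pattern buys is bounded memory use, not speed.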

Slide 35

Slide 35 text

Incremental
• Wrap the estimator with Incremental
• For classification problems, pass the classes through the classes argument
    from sklearn.linear_model import SGDClassifier
    from dask_ml.wrappers import Incremental
    clf = SGDClassifier(loss='log', penalty='l2', tol=1e-3)
    inc = Incremental(clf, scoring='accuracy')
    inc.fit(X_train, y_train, classes=classes)
• To train for multiple epochs, use Incremental.partial_fit
    for _ in range(10):
        inc.partial_fit(X_train, y_train, classes=classes)
        print('Score:', inc.score(X_test, y_test))

Slide 36

Slide 36 text

ParallelPostFit
• Applies the transform and predict methods of an already fitted estimator in parallel
(Figure: calling predict on ParallelPostFit with a Dask Array runs predict on each chunk)

Slide 37

Slide 37 text

ParallelPostFit
• Wrap the estimator with ParallelPostFit
• Nothing special happens at fit time (same processing as scikit-learn)
• Run predict in parallel on larger data
    from sklearn.linear_model import LogisticRegressionCV
    from dask_ml.wrappers import ParallelPostFit
    clf = ParallelPostFit(LogisticRegressionCV(cv=3))
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_large)
    y_pred
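The idea behind this wrapper, predicting each chunk independently and concatenating the results, can be sketched in plain Python. `ThresholdClassifier` and `parallel_predict` are hypothetical names for this illustration and are not part of Dask ML; a thread pool stands in for the Dask scheduler.

```python
from concurrent.futures import ThreadPoolExecutor


class ThresholdClassifier:
    """Toy already-fitted model: predicts 1 when a value exceeds the threshold."""

    def __init__(self, threshold):
        self.threshold = threshold

    def predict(self, values):
        return [int(v > self.threshold) for v in values]


def parallel_predict(model, chunks, workers=4):
    """Call predict on each chunk independently and concatenate in chunk order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_chunk = pool.map(model.predict, chunks)  # map preserves input order
        return [label for labels in per_chunk for label in labels]


model = ThresholdClassifier(threshold=0.5)
chunks = [[0.1, 0.9], [0.4, 0.6], [0.7]]
predictions = parallel_predict(model, chunks)
```

Prediction needs only the fitted parameters and one chunk at a time, which is why it parallelizes even when fitting does not.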

Slide 38

Slide 38 text

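Overview of processing with Dask
Domains: matrix computation | tabular data processing | machine learning
• Sequential processing: NumPy, pandas, scikit-learn
• Parallel processing (within a node) and out-of-core processing (within a node): Dask, Dask ML
• Distributed processing (across nodes): Dask Distributed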

Slide 39

Slide 39 text

Dask Distributed
• Dask itself supports the following two kinds of scheduler:
  • threading
  • multiprocessing
• The Dask Distributed package provides a scheduler that distributes work across nodes

Slide 40

Slide 40 text

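Dask Distributed
• Computation run by the scheduler can be distributed across multiple nodes
• Low latency: the overhead per task is about 1 ms
• Data sharing between workers: data is transferred directly between workers
• Complex scheduling: arbitrary computation graphs can be executed
• Locality: data transfer between workers is avoided as far as possible
(Figure labels: Distributed Client, Distributed Scheduler, Distributed Worker, Distributed Worker)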

Slide 41

Slide 41 text

Distributed joblib
• Pluggable API
  • A with block can change the default backend of joblib.Parallel
• Caveats
  • Use the joblib bundled with scikit-learn (sklearn.externals.joblib)
  • Some operations cannot be distributed
    • Those that explicitly specify threading / multiprocessing as the backend
    import distributed.joblib
    from sklearn.externals.joblib import parallel_backend
    with parallel_backend('dask.distributed', scheduler_host='scheduler-addr:8786'):
        grid.fit(digits.data, digits.target)

Slide 42

Slide 42 text

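Summary
• What is Dask
  • Provides data structures compatible with NumPy and pandas
• Using Dask for machine learning
  • Dask ML: lets machine learning work on Dask data structures
  • Distributed: lets scikit-learn processing be distributed across nodes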