
Using Dask for Machine Learning on Large-Scale Data

Sinhrks
October 20, 2018

@PyData.Tokyo One Day Conference 2018/10/20

Transcript

  1. Using Dask for Machine Learning on Large-Scale Data
    Masaaki Horikoshi @ ARISE analytics

  2. Self-Introduction
    • OSS activities, etc.:
    • GitHub: https://github.com/sinhrks

  3. Contents
    • What is Dask?
    • Using Dask for machine learning
      • Dask ML
      • Distributed

  4. Python for Big Data
    • When you process large-scale data in Python…
      • Computation runs (as a rule) in a single thread, so processing is slow
      • The data does not fit in physical memory

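  5. Parallel Processing and Out-of-core Processing
    • Parallel processing: process multiple tasks in parallel
    • Out-of-core processing: sequentially process data that does not fit in memory
    • The two are not mutually exclusive (out-of-core processing can also run in parallel)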

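  6. What is Dask?
    • A flexible package for parallel and out-of-core processing
    • Aimed mainly at data processing and numerical computation
    • Provides data structures that implement a subset of NumPy and pandas
    • Represents processing as a computation graph, then schedules and executes it dynamically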

  7. (Incomplete) List of OSS That Use Dask
    • (TFLearn) Deep learning library featuring a higher-level API for TensorFlow.
    • (Airflow, Distributed Scheduler) A platform to author, schedule and monitor workflows.
    • Image Processing SciKit.
    • N-D labeled arrays and datasets in Python.
    • Executes end-to-end data science and analytics pipelines entirely on GPUs.

  8. Dask Data Structures (API)
    Dask API         Base Class                  Default Scheduler
    Dask Array       NumPy ndarray               threading
    Dask DataFrame   pandas DataFrame            threading
    Dask Bag         PyToolz (list, set, dict)   multiprocessing (cannot release GIL)
    Dask Delayed     Arbitrary functions         threading

  9. Dask Array
    • Composed of multiple NumPy nd-arrays
    • Internally split into chunks along each axis
    Diagram labels: NumPy ndarray, Dask Array, Chunk, chunksize

  10. Dask Array
    # Create a 10x10 ndarray
    import numpy as np
    x = np.ones((10, 10))
    x
    array([[ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
           [ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
           [ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
           [ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
           [ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
           [ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
           [ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
           [ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
           [ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
           [ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]])
    # Create a 10x10 Dask Array; internally it is split into four 5x5 chunks
    import dask.array as da
    dx = da.ones((10, 10), chunks=(5, 5))
    dx
    dask.array<...>

  11. Dask Array
    # Create a 10x10 Dask Array; internally it is split into four 5x5 chunks
    import dask.array as da
    dx = da.ones((10, 10), chunks=(5, 5))
    dx
    dask.array<...>
    # Display the internal computation graph; each node (leaf) corresponds to a chunk
    dx.visualize()

  12. Dask Array
    # Sum the values along axis 0
    dy = dx.sum(axis=0)
    dy
    dask.array<..., chunksize=(5,)>
    dy.visualize()
    Graph labels: Sum, Sum
    dy.compute()
    array([10., 10., 10., 10., 10., 10., 10., 10., 10., 10.])

  13. Dask Array
    # Sum all values
    dy2 = dx.sum()
    dy2
    dask.array<..., chunksize=()>
    dy2.visualize()
    dy2.compute()
    100.0

  14. Dask Array
    • A Dask Array can have an arbitrary shape and chunk size
    • When an operation spans multiple chunks, it is better to make the chunk sizes match
    da.ones((30, 20, 1, 15), chunks=(3, 7, 1, 2))
    dask.array<..., chunksize=(3, 7, 1, 2)>

  15. Dask DataFrame
    • Composed of multiple pandas DataFrames
    • Internally split into partitions along the index (row labels)
    Diagram labels: pandas DataFrame, Dask DataFrame, partition, division, division

  16. Dask DataFrame
    # Create a pandas DataFrame (rows x columns), then split it into 2 partitions
    import dask.dataframe as dd
    ddf = dd.from_pandas(df, 2)
    ddf
    Diagram labels: partition, partition, division, division, division

  17. Blocked Algorithm (Mean)
    ddf.mean().visualize()
    Graph stages:
    • Sum (per partition), Count (per partition)
    • Sum (over partitions), Count (over partitions)
    • Mean = Sum / Count
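    The stages of the blocked mean can be sketched in plain Python. This is an illustration of the idea only, not Dask's implementation; the function name blocked_mean and the sample partitions are made up for the example.

```python
def blocked_mean(partitions):
    """Mean of all values, computed from per-partition sums and counts."""
    # Stage 1: reduce each partition independently (these could run in parallel).
    partial_sums = [sum(part) for part in partitions]
    partial_counts = [len(part) for part in partitions]
    # Stage 2: combine the small per-partition results.
    total = sum(partial_sums)
    count = sum(partial_counts)
    # Stage 3: Mean = Sum / Count
    return total / count


# Hypothetical data: three partitions of unequal size.
partitions = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
result = blocked_mean(partitions)
print(result)  # 3.5
```

    Only the per-partition sums and counts cross partition boundaries, so no single step needs the whole dataset in memory.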

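  18. How a Computation Graph Is Executed
    • Analyze the dependencies in the computation graph
      • Remove tasks that the final result does not need
      • Merge identical tasks
    • Analyze the computation order, statically and dynamically
    • Optimize the computation graph
      • Inline specific operations
      • Assign consecutive tasks to the same worker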
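    Removing tasks that the final result does not need can be sketched on a small graph written in the dict style that Dask documents (a key maps to a literal or to a tuple of a function and its arguments). The helpers dependencies, cull and execute below are simplified stand-ins written for this example, not Dask's own functions.

```python
from operator import add, mul

# Task graph: key -> literal value, or (function, *arguments).
# String arguments that are keys of the graph refer to other tasks.
graph = {
    "a": 1,
    "b": 2,
    "c": (add, "a", "b"),
    "d": (mul, "c", 10),
    "unused": (mul, "a", 100),  # not needed to compute "d"
}


def dependencies(graph, key):
    """Keys that the task stored under `key` reads from."""
    task = graph[key]
    if isinstance(task, tuple):
        return [arg for arg in task[1:] if isinstance(arg, str) and arg in graph]
    return []


def cull(graph, target):
    """Keep only the tasks reachable from `target`."""
    needed, stack = {}, [target]
    while stack:
        key = stack.pop()
        if key in needed:
            continue
        needed[key] = graph[key]
        stack.extend(dependencies(graph, key))
    return needed


def execute(graph, key, cache=None):
    """Evaluate `key` recursively, computing every task at most once."""
    cache = {} if cache is None else cache
    if key not in cache:
        task = graph[key]
        if isinstance(task, tuple):
            func, *args = task
            values = [
                execute(graph, arg, cache)
                if isinstance(arg, str) and arg in graph
                else arg
                for arg in args
            ]
            cache[key] = func(*values)
        else:
            cache[key] = task
    return cache[key]


culled = cull(graph, "d")
result = execute(culled, "d")
print(sorted(culled), result)  # ['a', 'b', 'c', 'd'] 30
```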

  19. Dask Internals
    • Every Dask operation is represented as a computation graph
    • Dask also implements complex algorithms such as linear algebra (dask.array.linalg)
    • Users can build their own data structures and functions as computation graphs (dask.delayed)
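    As a minimal sketch of the dask.delayed route, the functions below (load, total and combine are hypothetical names chosen for this example) are wrapped so that calling them builds a graph instead of running immediately. Nothing executes until compute() is called.

```python
from dask import delayed


@delayed
def load(i):
    # Stand-in for reading one block of data.
    return list(range(i * 3, i * 3 + 3))


@delayed
def total(values):
    return sum(values)


@delayed
def combine(parts):
    return sum(parts)


# Building the graph: no function body has run yet.
partials = [total(load(i)) for i in range(3)]
graph = combine(partials)

# Execute the whole graph with the threaded scheduler.
result = graph.compute(scheduler="threads")
print(result)  # 36
```

    graph.visualize() would draw the same kind of graph shown for Dask Array, provided graphviz is installed.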

  20. Linear Algebra
    Function                  Description
    linalg.cholesky           Returns the Cholesky decomposition, A = L L* or A = U* U of a Hermitian positive-definite matrix A.
    linalg.inv                Compute the inverse of a matrix with LU decomposition and forward / backward substitutions.
    linalg.lstsq              Return the least-squares solution to a linear matrix equation using QR decomposition.
    linalg.lu                 Compute the lu decomposition of a matrix.
    linalg.norm               Matrix or vector norm.
    linalg.qr                 Compute the qr factorization of a matrix.
    linalg.solve              Solve the equation a x = b for x.
    linalg.solve_triangular   Solve the equation a x = b for x, assuming a is a triangular matrix.
    linalg.svd                Compute the singular value decomposition of a matrix.
    linalg.svd_compressed     Randomly compressed rank-k thin Singular Value Decomposition.
    linalg.sfqr               Direct Short-and-Fat QR.
    linalg.tsqr               Direct Tall-and-Skinny QR algorithm.

  21. Example: LU decomposition
    • LU decomposition: A = L x U
    • LU decomposition can be split into computations on individual chunks (blocks)

  22. Blocked LU Decomposition
    • Diagonal Block
    • Row-direction (i < j)
    • Columns direction (i < j)

    * LU: Function to solve LU decomposition
    * Solve: Function to solve equation
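    The block-wise scheme can be sketched with NumPy for a matrix cut into 2 x 2 blocks. This is a simplified illustration under stated assumptions: it skips pivoting (so it expects a well-conditioned input such as a diagonally dominant matrix), and lu_nopivot and blocked_lu_2x2 are names invented for the example, not Dask functions.

```python
import numpy as np


def lu_nopivot(a):
    """LU without pivoting: returns unit-lower-triangular L and upper-triangular U."""
    n = a.shape[0]
    l = np.eye(n)
    u = a.astype(float).copy()
    for k in range(n - 1):
        factors = u[k + 1:, k] / u[k, k]
        l[k + 1:, k] = factors
        u[k + 1:, :] -= np.outer(factors, u[k, :])
    return l, u


def blocked_lu_2x2(a, split):
    """LU of `a` computed block by block, with the matrix cut at index `split`."""
    a11, a12 = a[:split, :split], a[:split, split:]
    a21, a22 = a[split:, :split], a[split:, split:]

    l11, u11 = lu_nopivot(a11)               # diagonal block: LU
    u12 = np.linalg.solve(l11, a12)          # row direction: solve L11 U12 = A12
    l21 = np.linalg.solve(u11.T, a21.T).T    # column direction: solve L21 U11 = A21
    l22, u22 = lu_nopivot(a22 - l21 @ u12)   # next diagonal block, after the update

    n = a.shape[0]
    l, u = np.zeros((n, n)), np.zeros((n, n))
    l[:split, :split], l[split:, :split], l[split:, split:] = l11, l21, l22
    u[:split, :split], u[:split, split:], u[split:, split:] = u11, u12, u22
    return l, u


# Hypothetical input: random values plus a dominant diagonal, so no pivoting is needed.
rng = np.random.default_rng(0)
a = rng.random((6, 6)) + 6 * np.eye(6)
l, u = blocked_lu_2x2(a, split=3)
print(np.allclose(l @ u, a))
```

    Each block step uses only an LU call or a solve call, which is why the work maps naturally onto chunks.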

  23. Blocked LU Decomposition
    arr = da.random.random((9, 9), chunks=(3, 3))
    arr
    dask.array<..., chunksize=(3, 3)>
    from dask import compute
    t, l, u = da.linalg.lu(arr)
    # Merge the three computation graphs and execute them together
    t, l, u = compute(t, l, u)

  24. Blocked LU Decomposition
    from dask import visualize
    visualize(t, l, u)


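  25. The Big Picture of Processing with Dask
    [Diagram]
    • Processing modes: sequential processing; parallel processing (within a node); out-of-core processing (within a node); distributed processing (across nodes)
    • Domains: matrix computation; tabular data processing; machine learning
    • Libraries: NumPy, pandas, scikit-learn; Dask, Dask ML; Dask Distributed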

  26. Parallel Processing in scikit-learn
    • The "n_jobs" argument sets the number of parallel jobs
    • Uses joblib internally
      • Developed mainly by scikit-learn committers
      • Parallelism within a node (threading, multiprocessing)
      • Cannot do out-of-core processing or distributed processing across nodes
    from sklearn.model_selection import GridSearchCV
    grid = GridSearchCV(pipe, cv=3, n_jobs=12,
                        param_grid=param_grid)

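  27. The Big Picture of Processing with Dask
    [Diagram]
    • Processing modes: sequential processing; parallel processing (within a node); out-of-core processing (within a node); distributed processing (across nodes)
    • Domains: matrix computation; tabular data processing; machine learning
    • Libraries: NumPy, pandas, scikit-learn; Dask, Dask ML; Dask Distributed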

  28. Do You Really Need All the Data?
    • Wouldn't sampling be enough?
    http://scikit-learn.org/stable/auto_examples/model_selection/plot_learning_curve.html

  29. Dask ML
    • A package for using Dask in machine learning
      • Runs machine learning on Dask data structures
      • Uses Dask to make machine learning algorithms parallel and out-of-core

  30. Dask ML
    • Provides subpackages for machine learning
    • scikit-learn-compatible estimators that support Dask:
      • Preprocessing
      • Model Selection
        • Cross validation
        • Hyper parameter search
      • GLM
      • Clustering
    • Wrappers that make scikit-learn parallel and out-of-core:
      • Incremental
      • ParallelPostFit
    • Support for packages other than scikit-learn:
      • XGBoost
      • TensorFlow

  31. Dask ML
    • scikit-learn-compatible estimators can handle Dask data structures
    [Diagram]
    • scikit-learn: a Dask Array is converted to a NumPy ndarray through the array interface before the computation, and the result is a NumPy ndarray
    • Dask ML: the computation runs while keeping the Dask data structure (Dask Array in, Dask Array out)

  32. k-means
    https://en.wikipedia.org/wiki/K-means_clustering

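  33. k-means (Dask ML)
    Input: Chunk, Chunk, Chunk, …
    • Sampling: determine the initial centroids from the dataset
    • Compute the distance from each centroid and assign each record to a cluster (runs per chunk)
    • Aggregate the records of each cluster (Sum, Count) and update the centroids (runs per chunk, then aggregates)
    • By default, centroids are initialized with k-means|| from Scalable K-Means++ (Bahmani et al.)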
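    The per-chunk assignment and the Sum / Count aggregation can be sketched in plain Python on one-dimensional data. This is only an illustration of the chunked update; the function names, the sample data, and the fixed starting centroids are assumptions for the example, and it does not implement the k-means|| initialization.

```python
def assign_and_reduce(chunk, centroids):
    """Per-chunk step: assign each point to its nearest centroid,
    then return per-cluster sums and counts for this chunk only."""
    sums = [0.0] * len(centroids)
    counts = [0] * len(centroids)
    for x in chunk:
        nearest = min(range(len(centroids)), key=lambda k: abs(x - centroids[k]))
        sums[nearest] += x
        counts[nearest] += 1
    return sums, counts


def update_centroids(chunks, centroids):
    """Aggregate the per-chunk sums and counts, then recompute each centroid."""
    partials = [assign_and_reduce(chunk, centroids) for chunk in chunks]
    updated = []
    for k, old in enumerate(centroids):
        total = sum(sums[k] for sums, _ in partials)
        count = sum(counts[k] for _, counts in partials)
        # Keep the old centroid if no point was assigned to this cluster.
        updated.append(total / count if count else old)
    return updated


# Hypothetical 1-D data in three chunks, with fixed starting centroids.
chunks = [[1.0, 2.0, 9.0], [10.0, 3.0], [11.0]]
centroids = [0.0, 5.0]
for _ in range(5):
    centroids = update_centroids(chunks, centroids)
print(centroids)  # [2.0, 10.0]
```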

  34. Incremental
    • Applies partial_fit to the chunks in order
    [Diagram] fit on a Dask Array: Incremental calls partial_fit on Chunk, Chunk, …
    • partial_fit runs sequentially
    • It is not processed in parallel
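    The sequential partial_fit pattern can be sketched without any library. RunningMeanEstimator is a hypothetical toy estimator and fit_incrementally is a made-up helper; neither is the dask_ml API, they only show the call order.

```python
class RunningMeanEstimator:
    """Hypothetical toy estimator: predicts the mean of every target seen so far."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def partial_fit(self, X, y):
        # Update the model state from one chunk only.
        self.total += sum(y)
        self.count += len(y)
        return self

    def predict(self, X):
        mean = self.total / self.count
        return [mean for _ in X]


def fit_incrementally(estimator, chunks):
    """Call partial_fit once per chunk, strictly one after another."""
    for X_chunk, y_chunk in chunks:
        estimator.partial_fit(X_chunk, y_chunk)
    return estimator


# Hypothetical data: three (X, y) chunks.
chunks = [
    ([[0], [1]], [1.0, 3.0]),
    ([[2]], [5.0]),
    ([[3], [4]], [7.0, 9.0]),
]
model = fit_incrementally(RunningMeanEstimator(), chunks)
prediction = model.predict([[10]])
print(prediction)  # [5.0]
```

    Each call depends on the state left by the previous one, which is why this step cannot run in parallel even though only one chunk is in memory at a time.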

  35. Incremental
    • Wrap the estimator with Incremental
    • For classification problems, pass the classes through the classes argument
    • To train for multiple epochs, use Incremental.partial_fit
    from sklearn.linear_model import SGDClassifier
    from dask_ml.wrappers import Incremental
    clf = SGDClassifier(loss='log', penalty='l2', tol=1e-3)
    inc = Incremental(clf, scoring='accuracy')
    inc.fit(X_train, y_train, classes=classes)

    for _ in range(10):
        inc.partial_fit(X_train, y_train, classes=classes)
        print('Score:', inc.score(X_test, y_test))

  36. ParallelPostFit
    • Applies the transform or predict of an already trained estimator in parallel
    [Diagram] predict on a Dask Array: ParallelPostFit calls predict on Chunk, Chunk, …

  37. ParallelPostFit
    • Wrap the estimator with ParallelPostFit
    • Nothing extra happens at fit time (same processing as scikit-learn)
    • predict then runs in parallel on larger data
    from sklearn.linear_model import LogisticRegressionCV
    from dask_ml.wrappers import ParallelPostFit
    clf = ParallelPostFit(LogisticRegressionCV(cv=3))
    clf.fit(X_train, y_train)

    y_pred = clf.predict(X_large)
    y_pred
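    The idea of predicting chunk by chunk in parallel can be sketched with the standard library. ThresholdModel is a hypothetical stand-in for a trained estimator and parallel_predict is a made-up helper; this is not how dask_ml implements ParallelPostFit.

```python
from concurrent.futures import ThreadPoolExecutor


class ThresholdModel:
    """Hypothetical already-fitted model: label 1 when a value exceeds the threshold."""

    def __init__(self, threshold):
        self.threshold = threshold

    def predict(self, X):
        return [1 if x > self.threshold else 0 for x in X]


def parallel_predict(model, chunks, max_workers=4):
    """Run predict on each chunk independently, then concatenate in chunk order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_chunk = list(pool.map(model.predict, chunks))  # map keeps input order
    return [label for chunk in per_chunk for label in chunk]


# Hypothetical data: five values split into three chunks.
chunks = [[0.1, 0.9], [0.4, 0.6], [0.7]]
labels = parallel_predict(ThresholdModel(0.5), chunks)
print(labels)  # [0, 1, 0, 1, 1]
```

    Prediction does not change the model, so the chunks are independent of each other, unlike the partial_fit calls that Incremental makes.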

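  38. The Big Picture of Processing with Dask
    [Diagram]
    • Processing modes: sequential processing; parallel processing (within a node); out-of-core processing (within a node); distributed processing (across nodes)
    • Domains: matrix computation; tabular data processing; machine learning
    • Libraries: NumPy, pandas, scikit-learn; Dask, Dask ML; Dask Distributed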

  39. Dask Distributed
    • Dask itself supports the following two kinds of scheduler:
      • threading
      • multiprocessing
    • The Dask Distributed package provides a scheduler that distributes work across nodes

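  40. Dask Distributed
    • Computation run by the scheduler can be distributed across multiple nodes
      • Low latency: the overhead per task is about 1 ms
      • Data sharing between workers: data is transferred directly between workers
      • Complex scheduling: arbitrary computation graphs can be executed
      • Locality: data transfer between workers is avoided as far as possible
    Diagram labels: Distributed Client, Distributed Scheduler, Distributed Worker, Distributed Worker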

  41. Distributed joblib
    • Pluggable API
      • The default backend of joblib.Parallel can be changed inside a with block
    • Caveats
      • Use the joblib bundled with scikit-learn (sklearn.externals.joblib)
      • Some processing cannot be distributed
        • Cases where threading / multiprocessing is specified explicitly as the backend
    import distributed.joblib
    from sklearn.externals.joblib import parallel_backend
    with parallel_backend('dask.distributed', scheduler_host='scheduler-addr:8786'):
        grid.fit(digits.data, digits.target)

  42. Summary
    • What is Dask?
      • Provides data structures compatible with NumPy and pandas
    • Using Dask for machine learning
      • Dask ML: machine learning can work on Dask data structures
      • Distributed: scikit-learn processing can be distributed across nodes