
PyConJP 2015: Dask: Lightweight Parallel Computing Framework (Lightning talks)

Sinhrks
October 10, 2015

Transcript

  1. Dask

    Lightweight parallel computing framework

  2. About me
    • Data Analyst

    • OSS activities:

    • PyData Development Team (pandas committer)

    • Blaze Development Team (Dask committer)

    • GitHub: https://github.com/sinhrks

  3. Dask
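    • Lightweight parallel and distributed framework (mainly intra-node parallelism)

    • Provides data structures that expose (a subset of) the NumPy, PyToolz, and pandas APIs

    Submodule        Base package
    dask.array       NumPy ndarray
    dask.bag         PyToolz (operations on list, set, dict)
    dask.dataframe   pandas DataFrame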

  4. DataFrame
    • pandas.DataFrame: labeled two-dimensional data

    • Dask.DataFrame: splits a pandas.DataFrame into partitions and processes them

    [Figure: a pandas.DataFrame next to a Dask.DataFrame]

  5. Dask DataFrame
    import numpy as np
    import pandas as pd

    # Create a pandas.DataFrame with 10 rows and 3 columns
    df = pd.DataFrame({'X': np.arange(10),
                       'Y': np.arange(10, 20),
                       'Z': np.arange(20, 30)},
                      index=list('abcdefghij'))
    df

    import dask.dataframe as dd

    # Split the data internally into 2 partitions and create a Dask.DataFrame
    ddf = dd.from_pandas(df, npartitions=2)
    ddf
    # dd.DataFrame<...> (repr truncated in the transcript)

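  6. Computation with Dask
    # Add 1 to every element. The actual computation is not executed yet
    ddf + 1
    # dd.DataFrame<...> (repr truncated in the transcript)

    # Execute the computation
    (ddf + 1).compute()

    [Figure labels: df, ddf, compute]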
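
The deferred execution described in slide 6 can be illustrated with a toy sketch. The `Lazy` class and the `load` function below are hypothetical names invented for this illustration; they are not part of Dask and do not reflect how Dask is implemented. The sketch only shows the idea that `+` records work and `compute()` runs it.

```python
class Lazy:
    """Toy stand-in for a deferred computation (hypothetical, not Dask's code)."""

    def __init__(self, func, *args):
        self.func = func
        self.args = args

    def __add__(self, other):
        # Record the addition instead of performing it.
        return Lazy(lambda a, b: a + b, self, other)

    def compute(self):
        # Evaluate any deferred inputs first, then apply the recorded function.
        args = [a.compute() if isinstance(a, Lazy) else a for a in self.args]
        return self.func(*args)


calls = []  # records when the underlying work actually runs


def load():
    calls.append('load')
    return 41


x = Lazy(load)
y = x + 1                  # builds a description of the work only
ran_before_compute = list(calls)
value = y.compute()        # the work runs here
print(ran_before_compute, value, calls)
```

Nothing is loaded or added until `compute()` is called, which mirrors the slide's point that `ddf + 1` alone does not run the calculation.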

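  7. Blocked Algorithm (addition)
    (ddf + 1).compute()

    [Figure: the pandas.DataFrame before processing is converted to a Dask.DataFrame; the computation runs on each partition; the results are combined (Concat) into the pandas.DataFrame after processing]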
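
The blocked addition in slide 7 can be sketched with pandas alone. This is a minimal illustration under stated assumptions, not Dask's implementation: `split_rows` is a hypothetical helper, and the partitions are processed sequentially rather than in parallel.

```python
import numpy as np
import pandas as pd

# Same 10-row, 3-column frame as in slide 5.
df = pd.DataFrame({'X': np.arange(10),
                   'Y': np.arange(10, 20),
                   'Z': np.arange(20, 30)},
                  index=list('abcdefghij'))


def split_rows(frame, n):
    """Hypothetical helper: split a frame into at most n row-wise partitions."""
    size = -(-len(frame) // n)  # ceiling division
    return [frame.iloc[i:i + size] for i in range(0, len(frame), size)]


partitions = split_rows(df, 2)             # split the data into partitions
added = [part + 1 for part in partitions]  # run the computation on each partition
result = pd.concat(added)                  # combine the partial results

print(result.equals(df + 1))
```

Because addition is element-wise, each partition can be processed independently and the concatenated result matches `df + 1` computed on the whole frame.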

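  8. Blocked Algorithm (sum)
    ddf.sum().compute()

    [Figure: each partition is summed (first sum), the partial results are concatenated (concat), and the concatenated result is summed again (second sum)]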
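
The sum, concat, sum pipeline in slide 8 can be reproduced by hand with pandas. This is a sketch only: the two partitions are sliced manually here, which is an assumption made for illustration and not how Dask chooses its divisions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'X': np.arange(10),
                   'Y': np.arange(10, 20),
                   'Z': np.arange(20, 30)},
                  index=list('abcdefghij'))

# Two hand-made partitions of five rows each (assumed split).
partitions = [df.iloc[:5], df.iloc[5:]]

# First sum: one Series of column totals per partition.
partial_sums = [part.sum() for part in partitions]

# Concat: stack the partial results, one row per partition.
combined = pd.concat(partial_sums, axis=1).T

# Second sum: add up the partial totals.
total = combined.sum()

print(total.equals(df.sum()))
```

The second pass only touches one row per partition, so the expensive work stays inside the partitions.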

  9. Blocked Algorithm (sum)
    ddf.sum().visualize()

    [Figure: task graph with labels for the two sum passes, concat, and the Dask.DataFrame before processing]

  10. Blocked Algorithm (mean)
    ddf.mean().visualize()

    [Figure: task graph computing sum and count; mean = sum / count]
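
The relation mean = sum / count from slide 10 can be checked with a small pandas sketch. The unequal partition sizes below are an assumption chosen to show why per-partition sums and counts are combined, rather than averaging per-partition means.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'X': np.arange(10),
                   'Y': np.arange(10, 20),
                   'Z': np.arange(20, 30)},
                  index=list('abcdefghij'))

# Deliberately unequal partitions (3 rows and 7 rows), chosen for illustration.
partitions = [df.iloc[:3], df.iloc[3:]]

# Per-partition sums and counts, combined across partitions.
sums = pd.concat([p.sum() for p in partitions], axis=1).sum(axis=1)
counts = pd.concat([p.count() for p in partitions], axis=1).sum(axis=1)

mean = sums / counts

# Averaging the partition means directly would weight both partitions equally.
naive = pd.concat([p.mean() for p in partitions], axis=1).mean(axis=1)

print(mean)
print(naive)
```

With unequal partitions the naive average of means differs from the true mean, while sum divided by count stays exact.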

  11. Blocked Algorithm (summary statistics)
    ddf.describe().visualize()

    [Figure: task graph ending in the result]

  12. Dask DataFrame features
    • Arithmetic / comparison operations

    • Statistics

    • Label-based data selection

    • Grouping / aggregation

    • Concatenation / joining (merge, join, concat…)

  13. Performance comparison
    • AWS EC2: c4.2xlarge (vCPU: 8, memory: 15 GiB)

    # Create a pandas.DataFrame with 100 million rows and 2 columns
    n = 100000000
    df = pd.DataFrame({'a': np.random.randint(1, 100, n),
                       'b': np.random.randn(n)})
    df

    # Split the data internally into 5 partitions and create a Dask.DataFrame
    ddf = dd.from_pandas(df, npartitions=5)
    ddf
    # dd.DataFrame<..., 80000000, 99999999)> (repr truncated in the transcript)

  14. Performance comparison
    # pandas
    %timeit df.describe()
    1 loops, best of 3: 25.3 s per loop

    # Dask
    %timeit ddf.describe().compute()
    1 loops, best of 3: 3.87 s per loop

  15. Results
    • Note: percentiles are computed with an approximation algorithm, so values may differ (red boxes)

    [Figure: output of df.describe() (pandas) next to ddf.describe().compute() (Dask)]
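
Why a partition-wise percentile can differ from the exact one is easy to see with a deliberately naive stand-in. The sketch below takes the median of per-partition medians; this is not the approximation Dask uses, only an assumed simple scheme that shows the kind of discrepancy the note on slide 15 refers to.

```python
from statistics import median

# Three hand-made partitions of the values 1..9 (assumed data).
partitions = [[1, 2, 9], [3, 4, 5], [6, 7, 8]]

# Exact median over all values.
exact = median(x for part in partitions for x in part)

# Naive approximation: median of the per-partition medians.
approx = median(median(part) for part in partitions)

print(exact, approx)
```

The approximation only sees one summary value per partition, so it can miss the true median when values are unevenly spread across partitions.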

  16. Results
    ddf.describe().visualize()

  17. Summary
    • Dask: lightweight parallel and distributed framework

    • Provides a subset of the NumPy, PyToolz, and pandas APIs

    • Users can run parallel computations through the NumPy / PyToolz / pandas APIs