PyConJP 2015: Dask: A Lightweight Parallel Computing Framework (Lightning talks)

Sinhrks
October 10, 2015

Transcript

  1. Dask: A Lightweight Parallel Computing Framework

  2. Self-introduction
    • Data Analyst
    • OSS activities:
      • PyData Development Team (pandas committer)
      • Blaze Development Team (Dask committer)
    • GitHub: https://github.com/sinhrks
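  3. Dask
    • A lightweight parallel and distributed framework (mainly intra-node parallelism)
    • Provides data structures that expose (a subset of) the NumPy, PyToolz, and pandas APIs

    Submodule         Base package
    dask.array        NumPy ndarray
    dask.bag          PyToolz (operations on list, set, dict)
    dask.dataframe    pandas.DataFrame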
  4. DataFrame
    • pandas.DataFrame: labeled two-dimensional data
    • Dask.DataFrame: splits a pandas.DataFrame into partitions and processes them

    [Figure labels: pandas.DataFrame, Dask.DataFrame]
  5. Dask DataFrame

        import numpy as np
        import pandas as pd
        import dask.dataframe as dd

        # Create a pandas.DataFrame with 10 rows and 3 columns
        df = pd.DataFrame({'X': np.arange(10),
                           'Y': np.arange(10, 20),
                           'Z': np.arange(20, 30)},
                          index=list('abcdefghij'))
        df

        # Split the data internally into 2 partitions and create a Dask.DataFrame
        ddf = dd.from_pandas(df, 2)
        ddf
        dd.DataFrame<from_pandas-…, divisions=('a', 'f', 'j')>
  6. Computation in Dask

        # Add 1 to the whole ddf. The actual computation is not executed yet.
        ddf + 1
        dd.DataFrame<elemwise-…, divisions=('a', 'f', 'j')>

        # Execute the computation.
        (ddf + 1).compute()

    [Figure labels: df, ddf, compute()]
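The deferred execution on this slide can be imitated in plain Python. The sketch below is a toy and not Dask's implementation: the `Lazy` class and its `pending` attribute are hypothetical names, and it only records `+` operations on a list until `compute()` is called.

```python
class Lazy:
    """Toy stand-in for a lazy collection (hypothetical, not a Dask class)."""

    def __init__(self, data, pending=()):
        self.data = data          # the original values, never modified
        self.pending = pending    # functions recorded but not yet applied

    def __add__(self, other):
        # Record the addition instead of performing it.
        return Lazy(self.data, self.pending + ((lambda x: x + other),))

    def compute(self):
        # Apply every recorded function, in order, to each element.
        result = list(self.data)
        for func in self.pending:
            result = [func(x) for x in result]
        return result


lazy = Lazy([0, 1, 2]) + 1   # nothing is calculated here
values = lazy.compute()      # the addition runs only now
```

As with `ddf + 1`, the expression returns a new object that describes the work, and the values appear only after `compute()`.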
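  7. Blocked Algorithm (addition)

        (ddf + 1).compute()

    [Figure annotations: pandas.DataFrame before processing; converted to a Dask.DataFrame; computation executed on the partitioned data; results concatenated (Concat); pandas.DataFrame after processing]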
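The split, compute, and concatenate steps of the blocked addition can be sketched with plain Python lists. This is an illustration only, assuming simple contiguous partitions; `split` and `blocked_add` are hypothetical helper names, and Dask's real partitioning works on pandas objects and index divisions.

```python
def split(rows, npartitions):
    """Split a list into npartitions contiguous blocks of near-equal size."""
    size, extra = divmod(len(rows), npartitions)
    blocks, start = [], 0
    for i in range(npartitions):
        stop = start + size + (1 if i < extra else 0)
        blocks.append(rows[start:stop])
        start = stop
    return blocks


def blocked_add(rows, value, npartitions=2):
    blocks = split(rows, npartitions)                          # partition the input
    added = [[x + value for x in block] for block in blocks]   # per-block work
    return [x for block in added for x in block]               # concat the results


result = blocked_add(list(range(10)), 1)
```

Each block is processed without looking at the others, which is what allows the per-block step to run in parallel.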
  8. Blocked Algorithm (sum)

        ddf.sum().compute()

    [Figure annotations: Sum on each partition (first sum); Concat; Sum (second sum)]
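The two-stage sum can be written out by hand. The sketch below uses plain lists that are assumed to be already split into two partitions; it shows the idea only and is not how Dask stores or schedules the data.

```python
blocks = [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]   # data already split into 2 partitions

partial_sums = [sum(block) for block in blocks]  # first sum: one value per partition
total = sum(partial_sums)                        # concat the partials, then second sum
```

The result equals the sum over the unsplit data because addition can be regrouped freely.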
  9. Blocked Algorithm (sum)

        ddf.sum().visualize()

    [Figure annotations: Dask.DataFrame before processing; first sum; concat; second sum]
  10. Blocked Algorithm (mean)

        ddf.mean().visualize()

    [Figure annotations: sum; count; mean = sum / count]
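The mean is built from per-partition sums and counts rather than per-partition means. A minimal sketch with plain lists, using deliberately unequal partitions (an assumption made for the example) to show why:

```python
blocks = [[0, 1, 2, 3, 4, 5], [6, 7, 8, 9]]   # unequal partitions on purpose

sums = [sum(block) for block in blocks]
counts = [len(block) for block in blocks]
mean = sum(sums) / sum(counts)                # mean = sum / count

# Averaging the per-partition means would weight the partitions incorrectly.
mean_of_means = sum(s / c for s, c in zip(sums, counts)) / len(blocks)
```

Here `mean` is 4.5, the true mean of 0 to 9, while `mean_of_means` is 5.0.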

  11. Blocked Algorithm (summary statistics)

        ddf.describe().visualize()

    [Figure annotation: result]

  12. Dask DataFrame features
    • Arithmetic and comparison operations
    • Statistics
    • Label-based data selection
    • Grouping and aggregation
    • Concatenation and joins (merge, join, concat…)
  13. Performance comparison
    • AWS EC2: c4.2xlarge (vCPU: 8, memory: 15 GiB)

        # Create a pandas.DataFrame with 100 million rows and 2 columns
        n = 100000000
        df = pd.DataFrame({'a': np.random.randint(1, 100, n),
                           'b': np.random.randn(n)})
        df

        # Split the data internally into 5 partitions and create a Dask.DataFrame
        ddf = dd.from_pandas(df, 5)
        ddf
        dd.DataFrame<from_pandas-…, divisions=(0, 20000000, 40000000, 60000000, 80000000, 99999999)>
  14. Performance comparison

        # pandas
        %timeit df.describe()
        1 loops, best of 3: 25.3 s per loop

        # Dask
        %timeit ddf.describe().compute()
        1 loops, best of 3: 3.87 s per loop
  15. Results
    • Note: percentiles are computed with an approximate algorithm, so the values may differ (red box)

        df.describe()              # pandas
        ddf.describe().compute()   # Dask
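Percentiles, unlike sums, cannot be combined exactly from per-partition results, which is one reason an approximation is needed. The sketch below shows this with the median (the 50th percentile) on made-up data. It is not the approximation algorithm Dask uses, which the slides do not describe.

```python
from statistics import median

blocks = [[1, 2, 100], [3, 4, 200], [5, 300, 400]]   # made-up partitions

exact = median(x for block in blocks for x in block)   # median of all 9 values
merged = median(median(block) for block in blocks)     # median of partition medians
```

The exact median is 5, but combining the partition medians (2, 4, 300) gives 4.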
  16. Results

        ddf.describe().visualize()

  17. Summary
    • Dask: a lightweight parallel and distributed framework
    • Provides a subset of the NumPy, PyToolz, and pandas APIs
    • Users can run parallel computations through the NumPy / PyToolz / pandas APIs