Data processing using pandas and Dask

PyCon night Tokyo 2017
https://eventdots.jp/event/617886

Sinhrks

May 26, 2017

Transcript

  1. Data processing using pandas and Dask

    Masaaki Horikoshi @ ARISE analytics
  2. Self Introduction

    • OSS Contributions:
    • A member of the core developers of:
    • GitHub: https://github.com/sinhrks
  3. Goal

    • Understand the fundamentals of:
        • How pandas handles internal data efficiently.
        • How Dask makes it easy to parallelize data processing.
  4. PyData Stacks

    • Cooperates with various scientific packages.

    Packages in the diagram: Bokeh, matplotlib, Scikit-learn, Statsmodels, NumPy, PyTables, SQLAlchemy, Ibis, SciPy, PySpark, Dask, Jupyter, pandas, rpy2
    Categories in the diagram: User Interface, Visualization, Big Data, IO, Computation, Machine Learning, Statistics, Other Programming Languages
  5. pandas Efficient Labeled Data Structure

  6. What is pandas?

    • “pandas” provides high-performance, easy-to-use data structures for data analysis.
    • Its central data structure, the DataFrame, is known from the data frame in the R language.
    • Author: Wes McKinney
    • License: BSD
    • Etymology: PANel DAta System
    • GitHub: 9700↑⭐
  7. DataFrame

    Read a CSV file:

        import pandas as pd

        df = pd.read_csv('adult.csv')
        df

    Adult Dataset taken from the UCI ML Repository:
    Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
  8. DataFrame

    Select:

        df[['age', 'marital-status']]

    Group by, select, aggregate:

        df.groupby('income')['hours-per-week'].mean()
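
The select and group-by / aggregate pattern on this slide can be tried without the Adult dataset. The following is a minimal sketch, assuming a small hand-made DataFrame: the values are invented for illustration and only the column names follow the slide.

```python
import pandas as pd

# Hypothetical miniature stand-in for adult.csv (values are made up).
df = pd.DataFrame({
    'age': [39, 50, 38, 53],
    'marital-status': ['Never-married', 'Married', 'Divorced', 'Married'],
    'hours-per-week': [40, 60, 40, 50],
    'income': ['<=50K', '>50K', '<=50K', '>50K'],
})

# Select: a list of column labels returns a DataFrame with those columns.
selected = df[['age', 'marital-status']]

# Group by 'income', select one column, then aggregate with mean().
mean_hours = df.groupby('income')['hours-per-week'].mean()
print(mean_hours)
```

The result of the group-by is a Series indexed by the distinct values of `income`, so each group's mean can be looked up by label.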

  9. Why pandas?

    • Capable of handling real-world (messy) data.
    • Provides intuitive data structures.
    • Batteries included for data processing.
  10. NumPy

    • N-dimensional array (nd-array) and matrix
    • Index (location) based access
    • Single data type

        import numpy as np

        arr = np.array([1, 2, 3, 4], dtype=np.int64)
        arr
        # array([1, 2, 3, 4])

    Note: a NumPy “Structured Array” can contain multiple dtypes.
  11. pandas

    • In addition to NumPy capabilities:
        • Label-based access (Index / Columns)
        • Mixed data types
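
To make the contrast between the two slides concrete, here is a small sketch. The column names, row labels and values are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

# NumPy: one dtype for the whole array, elements addressed by position.
arr = np.array([1, 2, 3, 4], dtype=np.int64)
second = arr[1]                      # location-based access

# pandas: each column keeps its own dtype; rows and columns carry labels.
df = pd.DataFrame({'n': [1, 2, 3], 'x': [0.5, 1.5, 2.5], 's': ['a', 'b', 'c']},
                  index=['r1', 'r2', 'r3'])
by_label = df.loc['r2', 'x']         # label-based access (Index / Columns)
by_position = df.iloc[1, 1]          # location-based access still works
print(df.dtypes)                     # one dtype per column
```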
  12. Data Structures

    • Defined per dimension.
    • pandas focuses on the 2D data structure.

    Diagram: Series (1D), DataFrame (2D), Panel (3D, deprecated). Colorized cells are labels.
  13. pandas Functionality

    • Vectorized computations
    • Group by (split-apply-combine)
    • Reshaping (merge, join, concat…)
    • Various I/O support (SQL, Excel, CSV/TSV…)
    • Flexible time series handling
    • Plotting
    • Please refer to the official documentation for details.
  14. DataFrame Internals

    • Consists of type-based “Blocks”.

    Diagram: Columns and Index over an IntBlock, a FloatBlock and an ObjectBlock. Columns may be consolidated per type.
  15. Internal Implementations

    • pandas internally uses:
        • NumPy: basic indexing, basic statistics…
        • SciPy: interpolation, advanced statistics…
        • Cython: advanced indexing, hash table, group-by, join, datetime ops…
  16. Cython (Language)

    • A superset of Python that additionally supports C functions and C types. It can be compiled to C code.

        def ismember(ndarray arr, set values):
            # Type definitions
            cdef:
                Py_ssize_t i, n
                ndarray[uint8_t] result
                object val

            n = len(arr)
            result = np.empty(n, dtype=np.uint8)
            for i in range(n):
                # Get the input's i-th value
                val = util.get_value_at(arr, i)
                result[i] = val in values
            return result.view(np.bool_)

        ismember(np.array([1, 2, 3, 4]), set([2, 3]))
        # array([False, True, True, False], dtype=bool)

    The returned bool array indicates which array elements are included in the set.
  17. Release the GIL

    • For better parallelism (pandas 0.17.0 and later).

        def duplicated_int64(ndarray[int64_t, ndim=1] values, object keep='first'):
            cdef:
                int ret = 0, value, k
                Py_ssize_t i, n = len(values)
                kh_int64_t * table = kh_init_int64()
                ndarray[uint8_t, ndim=1, cast=True] out = np.empty(n, dtype='bool')

            kh_resize_int64(table, min(n, _SIZE_HINT_LIMIT))
            …
            else:
                # Release the GIL
                with nogil:
                    for i from 0 <= i < n:
                        value = values[i]
                        k = kh_get_int64(table, value)
                        if k != table.n_buckets:
                            out[table.vals[k]] = 1
                            out[i] = 1
                        else:
                            k = kh_put_int64(table, value, &ret)
                            table.keys[k] = value
                            table.vals[k] = i
                            out[i] = 0
            kh_destroy_int64(table)
            return out
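
For readers unfamiliar with khash, the branch shown on this slide can be paraphrased in plain Python. This is only a sketch: a dict stands in for the `kh_int64_t` table, the function name `duplicated_all` is hypothetical, and none of the speed or GIL behaviour of the Cython version carries over. The branch marks every occurrence of a repeated value, which appears to correspond to `keep=False` in the public pandas API.

```python
def duplicated_all(values):
    """Mark every occurrence of a value that appears more than once."""
    table = {}                      # value -> position of its first occurrence
    out = [False] * len(values)
    for i, value in enumerate(values):
        if value in table:
            out[table[value]] = True   # flag the first occurrence as well
            out[i] = True
        else:
            table[value] = i
            out[i] = False
    return out


flags = duplicated_all([1, 2, 1, 3, 2, 1])
print(flags)
```

Because the loop body touches only C-level integers and arrays in the Cython original, it can run inside `with nogil:`; the Python paraphrase manipulates Python objects and therefore cannot.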
  18. GIL (Global Interpreter Lock)

    • It prevents CPython from running Python code in multiple threads at the same time.
    • The GIL is released on I/O.
    • The GIL can be released using Cython.
    • Scientific packages are working to release the GIL.
        • NumPy, SciPy, Scikit-learn, pandas…
    • Python objects cannot be used while the GIL is released.
        • If the target is object dtype, the GIL cannot be released.
  19. Further Reading

    • “A look inside pandas design and development” by Wes McKinney
    • “pandas internals” by Masaaki Horikoshi
  20. pandas for Big Data

    • You may face 2 issues:
        • pandas performs computations using a single thread.
            • Users have to parallelize by themselves.
        • pandas cannot handle data which exceeds physical memory.
            • Users have to write their own logic using the pandas “chunk” functionality.
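
The hand-written chunk logic mentioned on this slide might look like the sketch below. It assumes the `chunksize` argument of `pd.read_csv`, and uses a tiny in-memory CSV (hypothetical data) in place of a file that would not fit in memory.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for a file too large for memory.
csv = io.StringIO("key,value\n" + "\n".join(f"{i % 3},{i}" for i in range(10)))

total = 0
count = 0
# chunksize makes read_csv return an iterator of DataFrames (here 4 rows each).
for chunk in pd.read_csv(csv, chunksize=4):
    total += chunk['value'].sum()
    count += len(chunk)

mean = total / count   # combine the per-chunk partial results by hand
print(mean)
```

Every aggregation needs its own combine step written this way, which is the bookkeeping that Dask automates.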
  21. Dask Flexible Parallel Computation Framework

  22. What is Dask?

    • Dask is a flexible parallel computation framework for numeric operations which offers:
        • Data structures like nd-array and DataFrame which extend common interfaces like NumPy and pandas.
        • A dynamic task graph and scheduling optimized for computation.
    • Author: Matthew Rocklin
    • License: BSD
    • GitHub: 1500↑⭐
  23. (Incomplete) List of OSS that use Dask

    • (TFLearn) Deep learning library featuring a higher-level API for TensorFlow.
    • Airflow (Distributed Scheduler): A platform to author, schedule and monitor workflows.
    • Image Processing SciKit.
    • N-D labeled arrays and datasets in Python.
    • An interface to query data on different storage systems.
    • Datashader: A graphics pipeline system for creating meaningful representations of large datasets quickly and flexibly.
  24. Dask Data Structures

    • Dask provides the following data structures.

    Data Structure    Base Class                   Default Scheduler
    Dask Array        NumPy ndarray                threading
    Dask DataFrame    pandas DataFrame             threading
    Dask Bag          PyToolz (list, set, dict)    multiprocessing (cannot release GIL)
  25. Dask Array

    • Consists of multiple NumPy nd-arrays split along the axes.

    Diagram: a NumPy ndarray next to a Dask Array divided into chunks of a given chunksize.
  26. Dask Array

    Create a 10x10 ndarray:

        import numpy as np

        x = np.ones((10, 10))
        x
        # array([[ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
        #        [ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
        #        [ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
        #        [ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
        #        [ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
        #        [ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
        #        [ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
        #        [ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
        #        [ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
        #        [ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]])

    Create a 10x10 Dask Array, specifying a 5x5 chunksize:

        import dask.array as da

        dx = da.ones((10, 10), chunks=(5, 5))
        dx
        # dask.array<wrapped, shape=(10, 10), dtype=float64, chunksize=(5, 5)>
  27. Dask Array

    Create a 10x10 Dask Array, specifying a 5x5 chunksize:

        import dask.array as da

        dx = da.ones((10, 10), chunks=(5, 5))
        dx
        # dask.array<wrapped, shape=(10, 10), dtype=float64, chunksize=(5, 5)>

    Visualize the internal graph using Graphviz. Each node corresponds to a chunk:

        dx.visualize()
  28. Dask Array

    Sum of array elements along axis 0:

        dy = dx.sum(axis=0)
        dy
        # dask.array<sum-aggregate, shape=(10,), dtype=float64, chunksize=(5,)>

        dy.visualize()

        dy.compute()
        # array([ 10., 10., 10., 10., 10., 10., 10., 10., 10., 10.])
  29. Dask Array

    Sum of array elements:

        dy2 = dx.sum()
        dy2
        # dask.array<sum-aggregate, shape=(), dtype=float64, chunksize=()>

        dy2.visualize()

        dy2.compute()
        # 100.0
  30. Dask Array

    • A Dask Array can have an arbitrary shape and chunk size.
    • Using the same chunk size along each axis is recommended, because computations are performed per chunk.

        da.ones((30, 20, 1, 15), chunks=(3, 7, 1, 2))
        # dask.array<wrapped, shape=(30, 20, 1, 15), dtype=float64, chunksize=(3, 7, 1, 2)>
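
How a chunk size translates into actual blocks can be inspected directly. A short sketch, assuming Dask is installed; the shapes are arbitrary examples.

```python
import dask.array as da

dx = da.ones((10, 10), chunks=(5, 5))
print(dx.chunks)      # block sizes along each axis
print(dx.numblocks)   # number of blocks along each axis

# A chunk size that does not divide the axis leaves a smaller trailing chunk.
uneven = da.ones((10,), chunks=4)
print(uneven.chunks)

# Rechunking changes the block layout but not the values.
total = dx.rechunk((10, 2)).sum().compute()
print(total)
```

Uneven trailing chunks work, but they make per-chunk workloads unequal, which is one reason for the recommendation above.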
  31. Dask DataFrame

    • Consists of multiple pandas DataFrames split along the index (row labels).

    Diagram: a pandas DataFrame next to a Dask DataFrame made of partitions, separated at divisions.
  32. Dask DataFrame

    Create a 10x3 pandas DataFrame:

        import numpy as np
        import pandas as pd

        df = pd.DataFrame({'X': np.arange(10),
                           'Y': np.arange(10, 20),
                           'Z': np.arange(20, 30)},
                          index=list('abcdefghij'))
        df
  33. Dask DataFrame

        import dask.dataframe as dd

        ddf = dd.from_pandas(df, 2)
        ddf

    Diagram: the result has 2 partitions, bounded by 3 divisions.
  34. Dask DataFrame

    Add 1 to all elements. Computation is not performed at this point:

        ddf + 1

    Trigger computation:

        (ddf + 1).compute()

    Diagram: df, ddf and the result of (ddf + 1).compute().
  35. Blocked Algorithm (Addition)

        (ddf + 1).compute()

    Diagram: Dask DataFrame → perform computation per partition (each partition is a pandas DataFrame) → Concat → Result (pandas DataFrame)
  36. Blocked Algorithm (Total)

        ddf.sum().compute()

    Diagram: Sum per partition → Concatenate → Sum over partitions
  37. Blocked Algorithm (Total)

        ddf.sum().visualize()

    Graph: Dask DataFrame → Sum per partition → Concatenate → Sum over partitions

  38. Blocked Algorithm (Mean)

        ddf.mean().visualize()

    Graph: Sum per partition and Count per partition → Sum over partitions and Count over partitions → Mean = Sum / Count
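
The blocked mean can be written out by hand to see why it parallelizes. This is a plain-Python sketch of the idea, not Dask's implementation; the two partitions are hypothetical.

```python
# Two hypothetical partitions of one column of numbers.
partitions = [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]

# Step 1: reduce each partition independently (these could run in parallel).
partial_sums = [sum(p) for p in partitions]
partial_counts = [len(p) for p in partitions]

# Step 2: combine the small per-partition results.
total_sum = sum(partial_sums)
total_count = sum(partial_counts)

# Step 3: Mean = Sum / Count.
mean = total_sum / total_count
print(mean)
```

Averaging the per-partition means directly would only be correct when all partitions have the same length, which is why sum and count are carried separately.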
  39. Blocked Algorithm (Descriptive Statistics)

        ddf.describe().visualize()

  40. Dask DataFrame Functionality

    • Vectorized parallel computations
    • Group by (split-apply-combine)
    • Reshaping (merge, join, concat…)
    • Various I/O support (SQL, CSV/TSV…)
    • Flexible time series handling
    • Please refer to the official documentation for details.
  41. Dask Internals

    • All Dask computations are expressed as a Dask Graph.
    • The Dask Graph is flexible enough to implement more complex algorithms such as linear algebra.
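
In its simplest form a Dask Graph is a plain dict of tasks. The sketch below builds one by hand and runs it with the threaded scheduler; the functions `inc` and `add` and the key names are illustrative choices, not something Dask requires.

```python
from dask.threaded import get


def inc(x):
    return x + 1


def add(x, y):
    return x + y


# Keys name results; a tuple (function, arg1, ...) is a task, and an argument
# that matches another key refers to that task's result.
dsk = {
    'x': (inc, 1),
    'y': (inc, 5),
    'total': (add, 'x', 'y'),
}

result = get(dsk, 'total')   # run the graph with the threaded scheduler
print(result)
```

Because `'x'` and `'y'` do not depend on each other, a scheduler is free to run them in parallel before `'total'`.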
  42. Linear Algebra

    • Dask Array implements:

    Function                   Description
    linalg.cholesky            Returns the Cholesky decomposition, A = L L* or A = U* U, of a Hermitian positive-definite matrix A
    linalg.inv                 Compute the inverse of a matrix with LU decomposition and forward / backward substitutions
    linalg.lstsq               Return the least-squares solution to a linear matrix equation using QR decomposition
    linalg.lu                  Compute the LU decomposition of a matrix
    linalg.qr                  Compute the QR factorization of a matrix
    linalg.solve               Solve the equation a x = b for x
    linalg.solve_triangular    Solve the equation a x = b for x, assuming a is a triangular matrix
    linalg.svd                 Compute the singular value decomposition of a matrix
    linalg.svd_compressed      Randomly compressed rank-k thin Singular Value Decomposition
    linalg.tsqr                Direct Tall-and-Skinny QR algorithm
  43. Example: LU decomposition

    • LU decomposition: A = L x U
    • LU decomposition can be split into computations per block.
  44. Blocked LU Decomposition

    • Diagonal block (i = j)
    • Row direction (i < j)
    • Column direction (i > j)

    * LU: Function to solve LU decomposition
    * Solve: Function to solve equation
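
The equations on this slide did not survive in the transcript. As a rough reconstruction of the idea, the following is the textbook block LU derivation for a 2x2 block partition without pivoting, where Solve(M, B) denotes the X satisfying M X = B. It is a sketch under those assumptions and may differ in notation and detail from the original slide (Dask's own implementation also returns a permutation matrix).

```latex
\begin{aligned}
\begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}
&=
\begin{pmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{pmatrix}
\begin{pmatrix} U_{11} & U_{12} \\ 0 & U_{22} \end{pmatrix} \\[4pt]
A_{11} = L_{11} U_{11} \;&\therefore\; (L_{11}, U_{11}) = \mathrm{LU}(A_{11}) \\
A_{12} = L_{11} U_{12} \;&\therefore\; U_{12} = \mathrm{Solve}(L_{11}, A_{12}) \\
A_{21} = L_{21} U_{11} \;&\therefore\; L_{21} = \mathrm{Solve}(U_{11}^{T}, A_{21}^{T})^{T} \\
A_{22} = L_{21} U_{12} + L_{22} U_{22} \;&\therefore\; (L_{22}, U_{22}) = \mathrm{LU}(A_{22} - L_{21} U_{12})
\end{aligned}
```

The diagonal blocks need an LU call, while the off-diagonal blocks in the row and column directions only need a triangular solve, so each block becomes one task in the graph.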
  45. Blocked LU Decomposition

        import dask.array as da

        arr = da.random.random((9, 9), chunks=(3, 3))
        arr
        # dask.array<da.random.random_sample, shape=(9, 9), dtype=float64, chunksize=(3, 3)>

        from dask import compute

        t, l, u = da.linalg.lu(arr)
        t, l, u = compute(t, l, u)
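
One way to sanity-check the result is to multiply the factors back together. This sketch assumes SciPy is installed (Dask's LU relies on it) and that the three outputs are a permutation matrix, a lower-triangular factor and an upper-triangular factor, named `p`, `l`, `u` here instead of `t`, `l`, `u` as on the slide.

```python
import numpy as np
import dask
import dask.array as da

arr = da.random.random((9, 9), chunks=(3, 3))
p, l, u = da.linalg.lu(arr)          # lazy: three Dask Arrays

# Evaluate the input and the factors in one call so they share one run.
a_np, p_np, l_np, u_np = dask.compute(arr, p, l, u)

reconstructed = p_np @ l_np @ u_np
print(np.allclose(reconstructed, a_np))
```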
  46. Blocked LU Decomposition

        from dask import visualize

        visualize(t, l, u)

  47. Dask Internals

    • All Dask data structures are represented by a Dask Graph.
    • Dask Array and DataFrame operations update their Dask Graph.
    • Dask also offers an API to make your own algorithms parallel.
  48. Dask Delayed

    • Assume the following simple computation.
    • x and y can be computed in parallel.

        def inc(x):
            return x + 1

        def add(x, y):
            return x + y

        x = inc(1)
        y = inc(5)
        total = add(x, y)
        total
        # 8
  49. Dask Delayed

    Using the @delayed decorator makes the wrapped function lazy:

        from dask import delayed

        @delayed
        def inc(x):
            return x + 1

        @delayed
        def add(x, y):
            return x + y

    Delayed functions can be chained, and output a Delayed instance (not evaluated at this point):

        x = inc(1)
        y = inc(5)
        total = add(x, y)
        total
        # Delayed('add-b43be476-ffc7-48d7-a8ec-0f95df821e64')

    Trigger computation:

        total.compute()
        # 8
  50. Dask Delayed

    • A chain of Delayed functions is represented with a Dask Graph.

        total.visualize()
  51. Dask Distributed

    • Dask itself offers 2 types of schedulers which work on a single node:
        • threading
        • multiprocessing
    • The Dask Distributed package offers a distributed scheduler, which distributes tasks over multiple nodes.
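
Choosing between the single-node schedulers might look like the sketch below. It assumes a recent Dask release, where the scheduler is selected with the `scheduler=` keyword of `compute()`; releases from around the time of this talk used a different keyword. The array size is an arbitrary example.

```python
import dask.array as da

dx = da.ones((1000, 1000), chunks=(250, 250))
total = dx.sum()

# Same graph, different single-node schedulers.
with_threads = total.compute(scheduler='threads')        # thread pool
single_thread = total.compute(scheduler='synchronous')   # no parallelism, easy to debug
# scheduler='processes' would select the multiprocessing scheduler instead.
print(with_threads, single_thread)
```

The graph is unchanged between the two calls; only the executor that walks it differs, which is what makes it possible to swap in the distributed scheduler later.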
  52. Dask Distributed

    • A centrally managed, distributed, dynamic task scheduler.
        • Low latency: Each task suffers about 1ms of overhead.
        • Peer-to-peer data sharing: Workers communicate with each other to share data.
        • Complex scheduling: Supports complex workflows.
        • Data locality: Scheduling algorithms cleverly execute computations where data lives.

    Diagram components: Distributed Client, Distributed Scheduler, Distributed Workers.
  53. Demo

    • EC2 m4.xlarge
        • vCPU: 4
        • Memory: 16GiB
    • 4 Workers with the Dask Distributed scheduler
  54. Comparison with Spark

    • Dask works well when you:
        • Want to scale an existing NumPy or pandas project.
        • Need parallel / out-of-core processing on a single node.
        • Want to prototype a complex algorithm interactively.
        • Don’t have Big Data infrastructure you can use freely.
    • Spark works well when you:
        • Need to scale to a large number of clusters.
        • Have workflow requirements that fit the Spark API (typical ETL or SQL-like ops).
        • Need enterprise support.
  55. References

    • Official Documentation
        • http://dask.pydata.org/en/stable/
    • Dask Tutorial (includes more practical examples)
        • https://github.com/dask/dask-tutorial
    • Matthew Rocklin’s Blog Posts
        • http://matthewrocklin.com/blog/
  56. Conclusions

    • Understand the fundamentals of:
        • How pandas handles internal data efficiently.
        • How Dask makes it easy to parallelize data processing. It provides:
            • Data structures like nd-array and DataFrame which extend common interfaces like NumPy and pandas.
            • A dynamic task graph and scheduling optimized for computation.
  57. Interested?

    • Let’s start contributing!
        • pandas
            • https://github.com/pandas-dev/pandas
        • Dask
            • https://github.com/dask/dask