
Data processing using pandas and Dask


PyCon night Tokyo 2017
https://eventdots.jp/event/617886

Sinhrks

May 26, 2017

Transcript

  1. Self Introduction • OSS Contributions: a member of the core developer
     teams of several projects (shown as logos on the original slide).
     • GitHub: https://github.com/sinhrks
  2. Goal • Understand the fundamentals of: • How pandas handles

    internal data efficiently. • Dask to parallelize data processing easily.
  3. PyData Stacks • Cooperative with various scientific packages.
     (Slide diagram of the PyData stack: Jupyter, Bokeh, matplotlib,
     Scikit-learn, Statsmodels, NumPy, SciPy, PyTables, SQLAlchemy, Ibis,
     PySpark, Dask, rpy2 and pandas, grouped into User Interface,
     Visualization, Big Data, I/O, Computation, Machine Learning,
     Statistics, and bridges to other programming languages.)
  4. What is pandas? • "pandas" provides high-performance, easy-to-use data
     structures for data analysis. • Its central structure is known as a
     "DataFrame", as in the R language. • Author: Wes McKinney
     • License: BSD • Etymology: PANel DAta System • GitHub: 9,700+ stars
  5. Reading a CSV file into a DataFrame:

        import pandas as pd
        df = pd.read_csv('adult.csv')
        df

     (The Adult dataset is taken from the UCI ML Repository: Lichman, M.
     (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml].
     Irvine, CA: University of California, School of Information and
     Computer Science.)
  6. Why pandas? • Capable of handling real-world (messy) data. • Provides
     intuitive data structures. • Batteries included for data processing.
  7. NumPy • N-dimensional array (ndarray) and matrix • Index (location)
     based access • Single data type

        import numpy as np
        arr = np.array([1, 2, 3, 4], dtype=np.int64)
        arr
        array([1, 2, 3, 4])

     (Annotations: access is by location; a NumPy "Structured Array" can
     contain multiple dtypes.)
  8. pandas • In addition to NumPy capabilities: • Label based
     access (Index / Columns) • Mixed data types
     (Slide annotations: Columns, Index, mixed data types.)
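    A minimal sketch of these two points (not from the slides; the data and
    labels are made up):

        import pandas as pd

        # Hypothetical frame mixing int, float and string columns
        df = pd.DataFrame({'A': [1, 2, 3],
                           'B': [0.1, 0.2, 0.3],
                           'C': ['x', 'y', 'z']},
                          index=['r1', 'r2', 'r3'])

        print(df.loc['r2', 'B'])   # label-based access -> 0.2
        print(df.dtypes)           # each column keeps its own dtype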
  9. Data Structures • Defined per dimension: Series (1D), DataFrame (2D),
     Panel (3D; deprecated in 0.20). • pandas focuses on the 2D data
     structure. (In the slide figure, colorized cells are labels.)
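    As a small sketch of the first two (my own example, not from the deck):

        import pandas as pd

        s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])   # 1D, labeled
        df = pd.DataFrame({'x': s, 'y': s * 10})          # 2D: index plus columns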
  10. pandas Functionality • Vectorized computations • Group by (split-apply-combine) •

    Reshaping (merge, join, concat…) • Various I/O support (SQL, Excel, CSV/TSV…) • Flexible time series handling • Plotting • Please refer to the official documentation for details.
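    For instance, a minimal split-apply-combine sketch (hypothetical data):

        import pandas as pd

        # Mean of 'value' per 'key': split by key, apply mean, combine results
        df = pd.DataFrame({'key': ['a', 'b', 'a', 'b'],
                           'value': [1, 2, 3, 4]})
        print(df.groupby('key')['value'].mean())
        # key
        # a    2.0
        # b    3.0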
  11. DataFrame Internals • A DataFrame consists of type-based "Block"s.
      (Slide figure: Columns and Index along the edges; an IntBlock, a
      FloatBlock and an ObjectBlock behind them. Columns may be
      consolidated per dtype.)
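    One way to peek at this is the private `_data` attribute (an
    implementation detail; it was later renamed `_mgr` and its repr varies
    across pandas versions), shown here as a sketch only:

        import pandas as pd

        df = pd.DataFrame({'A': [1, 2], 'B': [3, 4],    # two int64 columns -> one block
                           'C': [0.5, 1.5]})            # one float64 column -> another
        # Private attribute: repr shows the BlockManager and its per-dtype blocks
        print(df._data)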
  12. Internal Implementations • pandas internally uses: • NumPy • Basic

    indexing, basic statistics… • SciPy • Interpolation, advanced statistics… • Cython: • Advanced indexing, hash table, group-by, join, datetime ops…
  13. Cython (Language) • A superset of Python that additionally supports
      C functions and C types. Can be compiled to C code.

        def ismember(ndarray arr, set values):
            cdef:
                Py_ssize_t i, n
                ndarray[uint8_t] result
                object val

            n = len(arr)
            result = np.empty(n, dtype=np.uint8)
            for i in range(n):
                val = util.get_value_at(arr, i)
                result[i] = val in values
            return result.view(np.bool_)

        ismember(np.array([1, 2, 3, 4]), set([2, 3]))
        array([False, True, True, False], dtype=bool)

      (Annotations: cdef type definitions; returns a bool array indicating
      which array elements are included in the set; util.get_value_at gets
      the input's i-th value.)
  14. Release the GIL • For better parallelism (pandas 0.17.0-).

        def duplicated_int64(ndarray[int64_t, ndim=1] values, object keep='first'):
            cdef:
                int ret = 0, value, k
                Py_ssize_t i, n = len(values)
                kh_int64_t *table = kh_init_int64()
                ndarray[uint8_t, ndim=1, cast=True] out = np.empty(n, dtype='bool')

            kh_resize_int64(table, min(n, _SIZE_HINT_LIMIT))
            …
            else:
                with nogil:
                    for i from 0 <= i < n:
                        value = values[i]
                        k = kh_get_int64(table, value)
                        if k != table.n_buckets:
                            out[table.vals[k]] = 1
                            out[i] = 1
                        else:
                            k = kh_put_int64(table, value, &ret)
                            table.keys[k] = value
                            table.vals[k] = i
                            out[i] = 0
            kh_destroy_int64(table)
            return out

      (The `with nogil:` block releases the GIL.)
  15. GIL (Global Interpreter Lock) • It prevents CPython from running
      multiple threads simultaneously. • The GIL is released on I/O.
      • The GIL can be released using Cython. • Scientific packages are
      working to release the GIL: NumPy, SciPy, Scikit-learn, pandas…
      • Python objects cannot be used after the GIL is released. • So if
      the target data is object dtype, the GIL cannot be released.
  16. Further Reading • “A look inside pandas design and development”

    by Wes McKinney • “pandas internals” by Masaaki Horikoshi
  17. pandas for Big Data • You may face two issues: • pandas performs
      computations using a single thread; users have to parallelize by
      themselves. • pandas cannot handle data which exceeds physical
      memory; users have to write chunk-by-chunk logic themselves, e.g.
      with read_csv's chunking option (a sketch follows).
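    A sketch of such manual out-of-core processing ('adult.csv' and the
    aggregation are placeholders):

        import pandas as pd

        total_rows = 0
        # read_csv(chunksize=...) yields DataFrames of at most that many rows
        for chunk in pd.read_csv('adult.csv', chunksize=100000):
            total_rows += len(chunk)   # process each piece, keep only the aggregate
        print(total_rows)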
  18. What is Dask? • Dask is a flexible parallel computation
      framework for numeric operations which offers: • Data structures like
      nd-array and DataFrame which extend common interfaces like NumPy and
      pandas. • A dynamic task graph and scheduling optimized for
      computation. • Author: Matthew Rocklin • License: BSD
      • GitHub: 1,500+ stars
  19. (Incomplete) list of OSS that uses Dask • TFLearn: a deep learning
      library featuring a higher-level API for TensorFlow. • Airflow: a
      platform to author, schedule and monitor workflows (uses the
      distributed scheduler). • scikit-image: Image Processing SciKit.
      • xarray: N-D labeled arrays and datasets in Python. • An interface
      to query data on different storage systems. • Datashader: a graphics
      pipeline system for creating meaningful representations of large
      datasets quickly and flexibly.
  20. Dask Data Structures • Dask provides the following data structures:

        Data Structure   | Base Class                 | Default Scheduler
        Dask Array       | NumPy ndarray              | threading
        Dask DataFrame   | pandas DataFrame           | threading
        Dask Bag         | PyToolz (list, set, dict)  | multiprocessing (cannot release the GIL)
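    Dask Bag gets no code example later in the deck, so here is a minimal
    sketch of my own (hypothetical data):

        import dask.bag as db

        # Parallel map/filter over a plain Python sequence, split into 2 partitions
        b = db.from_sequence(range(10), npartitions=2)
        print(b.map(lambda i: i * 2).filter(lambda i: i > 5).compute())
        # [6, 8, 10, 12, 14, 16, 18]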
  21. Dask Array • Consists of multiple NumPy ndarrays split along each
      axis. (Slide figure: a NumPy ndarray vs. a Dask Array divided into
      chunks of a given chunk size.)
  22. Dask Array

        import numpy as np
        x = np.ones((10, 10))
        x
        array([[ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
               [ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
               [ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
               [ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
               [ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
               [ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
               [ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
               [ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
               [ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
               [ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]])

        import dask.array as da
        dx = da.ones((10, 10), chunks=(5, 5))
        dx
        dask.array<wrapped, shape=(10, 10), dtype=float64, chunksize=(5, 5)>

      (Annotations: create a 10x10 ndarray; create a 10x10 Dask Array,
      specifying a 5x5 chunk size.)
  23. Dask Array

        import dask.array as da
        dx = da.ones((10, 10), chunks=(5, 5))
        dx
        dask.array<wrapped, shape=(10, 10), dtype=float64, chunksize=(5, 5)>

        dx.visualize()

      (Annotations: create a 10x10 Dask Array specifying a 5x5 chunk size;
      visualize the internal graph using Graphviz. Each node corresponds to
      a chunk.)
  24. Dask Array

        dy = dx.sum(axis=0)
        dy
        dask.array<sum-aggregate, shape=(10,), dtype=float64, chunksize=(5,)>

        dy.visualize()
        dy.compute()
        array([ 10., 10., 10., 10., 10., 10., 10., 10., 10., 10.])

      (Annotations: sum of the array elements along axis 0; the graph shows
      per-chunk Sum nodes feeding the aggregate.)
  25. Dask Array • A Dask Array can have an arbitrary shape and chunk
      size. • It is recommended to use the same chunk size along each axis,
      because computations are performed per chunk.

        da.ones((30, 20, 1, 15), chunks=(3, 7, 1, 2))
        dask.array<wrapped, shape=(30, 20, 1, 15), dtype=float64, chunksize=(3, 7, 1, 2)>
  26. Dask DataFrame • Consists of multiple pandas DataFrames split along
      the index (row labels). (Slide figure: a pandas DataFrame vs. a Dask
      DataFrame split into partitions separated by divisions.)
  27. Dask DataFrame

        import numpy as np
        import pandas as pd
        df = pd.DataFrame({'X': np.arange(10),
                           'Y': np.arange(10, 20),
                           'Z': np.arange(20, 30)},
                          index=list('abcdefghij'))
        df

      (Creates a 10-row x 3-column pandas DataFrame.)
  28. Dask DataFrame

        import dask.dataframe as dd
        ddf = dd.from_pandas(df, 2)
        ddf

      (The frame is split into two partitions, bounded by three divisions.)
  29. Dask DataFrame

        ddf + 1
        (ddf + 1).compute()

      (Annotations: add 1 to all elements; the computation is not performed
      when `ddf + 1` is built. Calling .compute() triggers it.)
  30. Blocked Algorithm (Addition) • (ddf + 1).compute() performs the
      computation per partition on each underlying pandas DataFrame, then
      concatenates the partial results into the resulting pandas DataFrame.
  31. Dask DataFrame Functionality • Vectorized parallel computations
      • Group by (split-apply-combine) • Reshaping (merge, join, concat…)
      • Various I/O support (SQL, CSV/TSV…) • Flexible time series handling
      • Please refer to the official documentation for details.
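    The group-by mirrors the pandas API; a minimal sketch with made-up
    data:

        import pandas as pd
        import dask.dataframe as dd

        # Same split-apply-combine as pandas, computed per partition in parallel
        pdf = pd.DataFrame({'key': ['a', 'b'] * 50, 'value': range(100)})
        ddf = dd.from_pandas(pdf, npartitions=4)
        print(ddf.groupby('key')['value'].mean().compute())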
  32. Dask Internals • All Dask computations are expressed as a Dask
      Graph. • The Dask Graph is flexible enough to implement more complex
      algorithms such as linear algebra.
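    A hand-built graph in the plain-dict format Dask uses internally (an
    example in the style of the Dask documentation, not from the slides):

        from dask.threaded import get

        def inc(i):
            return i + 1

        def add(a, b):
            return a + b

        # Keys name results; values are literal data or task tuples
        # of the form (function, arg1, arg2, ...).
        dsk = {'x': 1,
               'y': (inc, 'x'),
               'z': (add, 'y', 10)}
        print(get(dsk, 'z'))   # 12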
  33. Linear Algebra • Dask Array implements:

        Function                | Description
        linalg.cholesky         | Returns the Cholesky decomposition, A = L L* or A = U* U, of a Hermitian positive-definite matrix A
        linalg.inv              | Compute the inverse of a matrix with LU decomposition and forward/backward substitutions
        linalg.lstsq            | Return the least-squares solution to a linear matrix equation using QR decomposition
        linalg.lu               | Compute the LU decomposition of a matrix
        linalg.qr               | Compute the QR factorization of a matrix
        linalg.solve            | Solve the equation a x = b for x
        linalg.solve_triangular | Solve the equation a x = b for x, assuming a is a triangular matrix
        linalg.svd              | Compute the singular value decomposition of a matrix
        linalg.svd_compressed   | Randomly compressed rank-k thin Singular Value Decomposition
        linalg.tsqr             | Direct Tall-and-Skinny QR algorithm
  34. Example: LU Decomposition • LU decomposition factors a matrix A into
      a lower triangular matrix L and an upper triangular matrix U:
      A = L x U. • The decomposition can be split into computations per
      block.
  35. Blocked LU Decomposition • Diagonal blocks • Row direction (i < j)
      • Column direction (i > j)
      (* LU: function to perform an LU decomposition; Solve: function to
      solve a linear equation.)
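    For reference, a sketch of the standard blocked LU relations in my own
    notation (the slide's formulas are not recoverable from the transcript):

        % Partition A into blocks A_{ij}. Since A = LU with L lower and U upper
        % block-triangular, each block satisfies A_{ij} = \sum_{k \le \min(i,j)} L_{ik} U_{kj}.
        \begin{align*}
        L_{ii} U_{ii} &= A_{ii} - \sum_{k<i} L_{ik} U_{ki}
            && \text{diagonal block: factor with LU} \\
        U_{ij} &= L_{ii}^{-1} \Bigl(A_{ij} - \sum_{k<i} L_{ik} U_{kj}\Bigr)
            && \text{row direction } (i<j)\text{: Solve with } L_{ii} \\
        L_{ij} &= \Bigl(A_{ij} - \sum_{k<j} L_{ik} U_{kj}\Bigr) U_{jj}^{-1}
            && \text{column direction } (i>j)\text{: Solve with } U_{jj}
        \end{align*}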
  36. Blocked LU Decomposition

        arr = da.random.random((9, 9), chunks=(3, 3))
        arr
        dask.array<da.random.random_sample, shape=(9, 9), dtype=float64, chunksize=(3, 3)>

        from dask import compute
        t, l, u = da.linalg.lu(arr)
        t, l, u = compute(t, l, u)
  37. Dask Internals • All Dask data structures are represented by a Dask
      Graph. • Dask Array and DataFrame operations update their Dask
      Graph. • Dask also offers an API to parallelize your own algorithms.
  38. Dask Delayed • Assume the following simple computation; x and y can
      be computed in parallel.

        def inc(x):
            return x + 1

        def add(x, y):
            return x + y

        x = inc(1)
        y = inc(5)
        total = add(x, y)
        total
        8
  39. Dask Delayed

        from dask import delayed

        @delayed
        def inc(x):
            return x + 1

        @delayed
        def add(x, y):
            return x + y

        x = inc(1)
        y = inc(5)
        total = add(x, y)
        total
        Delayed('add-b43be476-ffc7-48d7-a8ec-0f95df821e64')

        total.compute()
        8

      (Annotations: the @delayed decorator makes the wrapped function lazy;
      delayed functions can be chained and output Delayed instances, which
      are not evaluated at this point; .compute() triggers the computation.)
  40. Dask Distributed • Dask itself offers two schedulers that work on a
      single node: • threading • multiprocessing (a selection sketch
      follows). • The Dask Distributed package offers a distributed
      scheduler, which distributes tasks over multiple nodes.
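    A sketch of choosing the single-node scheduler per call. Note the
    `scheduler` keyword is the current spelling; releases contemporary with
    this talk passed get=dask.threaded.get / dask.multiprocessing.get
    instead:

        import dask.array as da

        x = da.ones((10, 10), chunks=(5, 5))
        print(x.sum().compute(scheduler='threads'))     # thread pool (default for arrays)
        print(x.sum().compute(scheduler='processes'))   # process pool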
  41. Dask Distributed • A centrally managed, distributed, dynamic task
      scheduler. • Low latency: each task suffers about 1 ms of overhead.
      • Peer-to-peer data sharing: workers communicate with each other to
      share data. • Complex scheduling: supports complex workflows.
      • Data locality: scheduling algorithms cleverly execute computations
      where data lives. (Slide figure: a Distributed Client, a Distributed
      Scheduler, and Distributed Workers.)
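    A minimal sketch of connecting to it (the scheduler address is a
    placeholder):

        from dask.distributed import Client
        import dask.array as da

        # Connect to a running scheduler; Client() with no address would
        # instead start a local cluster (scheduler plus workers) in-process.
        client = Client('tcp://scheduler-host:8786')

        x = da.ones((10, 10), chunks=(5, 5))
        print(x.sum().compute())   # tasks now run on the distributed workers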
  42. Demo • EC2 m4.xlarge • vCPU: 4 • Memory: 16 GiB • 4 workers with the
      Dask Distributed scheduler
  43. Comparison with Spark • Dask works well when: • You want to scale an
      existing NumPy or pandas project. • You need parallel / out-of-core
      processing on a single node. • You prototype complex algorithms
      interactively. • You don't have Big Data infrastructure you can use
      freely. • Spark works well when: • You need to scale to a large
      number of cluster nodes. • Your workflow requirements fit the Spark
      API (typical ETL or SQL-like operations). • You need enterprise
      support.
  44. References • Official Document • http://dask.pydata.org/en/stable/ • Dask Tutorial (includes

    more practical examples) • https://github.com/dask/dask-tutorial • Matthew Rocklin’s Blog Post • http://matthewrocklin.com/blog/
  45. Conclusions • Understand the fundamentals of: • How pandas handles
      internal data efficiently. • How Dask parallelizes data processing
      easily. It provides: • Data structures like nd-array and DataFrame
      which extend common interfaces like NumPy and pandas. • A dynamic
      task graph and scheduling optimized for computation.