Data processing using pandas and Dask

Slide 1

Slide 1 text

Data processing using pandas and Dask Masaaki Horikoshi @ ARISE analytics

Slide 2

Slide 2 text

Self Introduction
• OSS contributions:
• Core developer of:
• GitHub: https://github.com/sinhrks

Slide 3

Slide 3 text

Goal
• Understand the fundamentals of:
  • How pandas handles internal data efficiently.
  • Dask to parallelize data processing easily.

Slide 4

Slide 4 text

PyData Stacks
• pandas cooperates with various scientific packages.
[Diagram: packages grouped by role]
• Packages shown: Bokeh, matplotlib, Scikit-learn, Statsmodels, NumPy, PyTables, SQLAlchemy, Ibis, SciPy, PySpark, Dask, Jupyter, pandas, rpy2
• Roles shown: User Interface, Visualization, Big Data, IO, Computation, Machine Learning, Statistics, Other programming languages

Slide 5

Slide 5 text

pandas Efficient Labeled Data Structure

Slide 6

Slide 6 text

What is pandas?
• “pandas” provides high-performance, easy-to-use data structures for data analysis.
  • Its main structure is comparable to the “DataFrame” of the R language.
• Author: Wes McKinney
• License: BSD
• Etymology: PANel DAta System
• GitHub: more than 9,700 stars

Slide 7

Slide 7 text

DataFrame
    import pandas as pd
    df = pd.read_csv('adult.csv')   # Read a CSV file
    df
Adult Dataset taken from the UCI ML Repository: Lichman, M. UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science.

Slide 8

Slide 8 text

DataFrame
    df[['age', 'marital-status']]                   # Select
    df.groupby('income')['hours-per-week'].mean()   # Group by, Select, Aggregate
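The select and group-by operations above can be tried without the Adult dataset. The following is a minimal sketch on a small hand-made DataFrame; the rows and values are invented for illustration and only the column names are borrowed from the slide.

```python
import pandas as pd

# Hypothetical miniature stand-in for adult.csv (all values are invented).
df = pd.DataFrame({
    'age': [39, 50, 38, 53],
    'marital-status': ['Never-married', 'Married', 'Divorced', 'Married'],
    'hours-per-week': [40, 13, 40, 60],
    'income': ['<=50K', '<=50K', '<=50K', '>50K'],
})

# Select: pick a subset of columns by label.
selected = df[['age', 'marital-status']]

# Group by -> select -> aggregate: mean working hours per income group.
mean_hours = df.groupby('income')['hours-per-week'].mean()
print(mean_hours)
```

The group-by result is a Series indexed by the group keys, so each group's value can be looked up by label.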

Slide 9

Slide 9 text

Why pandas?
• Capable of handling real-world (messy) data.
• Provides intuitive data structures.
• Batteries included for data processing.

Slide 10

Slide 10 text

NumPy
• N-dimensional array (nd-array) and matrix
• Index (location) based access
• Single data type
    import numpy as np
    arr = np.array([1, 2, 3, 4], dtype=np.int64)
    arr
    array([1, 2, 3, 4])
Note: a NumPy “Structured Array” can contain multiple dtypes.

Slide 11

Slide 11 text

pandas
• In addition to NumPy capabilities:
  • Label based access (Index / Columns)
  • Mixed data types
[Figure labels: Columns, Index, Mixed data types]
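The contrast between location-based and label-based access can be sketched as follows. The column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# NumPy: one dtype per array, access by integer location.
arr = np.array([1, 2, 3, 4], dtype=np.int64)
second = arr[1]

# pandas: each column keeps its own dtype, access by label.
df = pd.DataFrame(
    {'count': [1, 2, 3], 'ratio': [0.5, 1.5, 2.5], 'name': ['x', 'y', 'z']},
    index=['a', 'b', 'c'],
)
by_label = df.loc['b', 'ratio']   # row label 'b', column label 'ratio'
by_position = df.iloc[1, 1]       # the same cell by integer location
print(df.dtypes)
```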

Slide 12

Slide 12 text

Data Structures
• Defined per dimension.
• pandas focuses on the 2D data structure.
[Figure: Series (1D), DataFrame (2D), Panel (3D, deprecated); colorized cells are labels]

Slide 13

Slide 13 text

pandas Functionality
• Vectorized computations
• Group by (split-apply-combine)
• Reshaping (merge, join, concat…)
• Various I/O support (SQL, Excel, CSV/TSV…)
• Flexible time series handling
• Plotting
• Please refer to the official documentation for details.

Slide 14

Slide 14 text

DataFrame Internals
• A DataFrame consists of type-based “Blocks”.
[Figure: Index and Columns over IntBlock, FloatBlock, ObjectBlock, …; columns may be consolidated per type]

Slide 15

Slide 15 text

Internal Implementations
• pandas internally uses:
  • NumPy: basic indexing, basic statistics…
  • SciPy: interpolation, advanced statistics…
  • Cython: advanced indexing, hash table, group-by, join, datetime ops…

Slide 16

Slide 16 text

Cython (Language)
• A superset of Python that additionally supports C functions and C types. It can be compiled to C code.
    def ismember(ndarray arr, set values):
        # Type definitions
        cdef:
            Py_ssize_t i, n
            ndarray[uint8_t] result
            object val
        n = len(arr)
        result = np.empty(n, dtype=np.uint8)
        for i in range(n):
            val = util.get_value_at(arr, i)   # Get the input's i-th value
            result[i] = val in values
        return result.view(np.bool_)

    ismember(np.array([1, 2, 3, 4]), set([2, 3]))
    array([False, True, True, False], dtype=bool)
The function returns a bool array indicating whether each array element is included in the set.

Slide 17

Slide 17 text

Release the GIL
• For better parallelism (pandas 0.17.0 and later).
    def duplicated_int64(ndarray[int64_t, ndim=1] values, object keep='first'):
        cdef:
            int ret = 0, value, k
            Py_ssize_t i, n = len(values)
            kh_int64_t * table = kh_init_int64()
            ndarray[uint8_t, ndim=1, cast=True] out = np.empty(n, dtype='bool')
        kh_resize_int64(table, min(n, _SIZE_HINT_LIMIT))
        …
        else:
            with nogil:   # Release the GIL
                for i from 0 <= i < n:
                    value = values[i]
                    k = kh_get_int64(table, value)
                    if k != table.n_buckets:
                        out[table.vals[k]] = 1
                        out[i] = 1
                    else:
                        k = kh_put_int64(table, value, &ret)
                        table.keys[k] = value
                        table.vals[k] = i
                        out[i] = 0
        kh_destroy_int64(table)
        return out

Slide 18

Slide 18 text

GIL (Global Interpreter Lock)
• It prevents CPython from executing Python code in multiple threads at the same time.
• The GIL is released during I/O.
• The GIL can be released using Cython.
• Scientific packages are working to release the GIL:
  • NumPy, SciPy, Scikit-learn, pandas…
• Python objects cannot be used while the GIL is released.
  • If the target is object dtype, the GIL cannot be released.
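As a rough illustration of why releasing the GIL matters, the sketch below hands independent NumPy reductions to a thread pool. Whether a given call actually runs without the GIL depends on the library version and the dtype, so this shows the usage pattern and is not a benchmark.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

data = np.arange(1_000_000, dtype=np.float64)
chunks = np.array_split(data, 4)   # four independent pieces of work

# Each task spends its time inside NumPy's compiled code rather than in
# Python bytecode, which is the situation where threads can overlap.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_sums = list(pool.map(np.sum, chunks))

total = float(sum(partial_sums))
print(total)
```

With an object-dtype array the same pattern would still give the right answer, but per the slide the threads could not run the reduction concurrently.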

Slide 19

Slide 19 text

Further Reading • “A look inside pandas design and development” by Wes McKinney • “pandas internals” by Masaaki Horikoshi

Slide 20

Slide 20 text

pandas for Big Data
• Users may face 2 issues:
  • pandas performs computations using a single thread.
    • Users have to parallelize by themselves.
  • pandas cannot handle data which exceeds physical memory.
    • Users have to write their own logic using pandas' “chunk” functionality.
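The hand-written chunk logic mentioned above might look like the following sketch. It assumes the `chunksize` option of `pd.read_csv` is the “chunk” functionality meant here, and it uses an in-memory string with invented contents in place of a large file.

```python
import io

import pandas as pd

# Stand-in for a file too large to load at once (contents are invented).
csv_data = io.StringIO("key,value\na,1\nb,2\na,3\nb,4\na,5\n")

total = 0
n_rows = 0
# With chunksize set, read_csv yields DataFrames of at most that many rows.
for chunk in pd.read_csv(csv_data, chunksize=2):
    total += chunk['value'].sum()
    n_rows += len(chunk)

mean_value = total / n_rows
print(mean_value)
```

Even for a simple mean, the user has to carry the running sum and row count across chunks by hand, which is the bookkeeping Dask takes over.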

Slide 21

Slide 21 text

Dask Flexible Parallel Computation Framework

Slide 22

Slide 22 text

What is Dask?
• Dask is a flexible parallel computation framework for numeric operations which offers:
  • Data structures like nd-array and DataFrame that extend common interfaces like NumPy and pandas.
  • A dynamic task graph and its scheduling, optimized for computation.
• Author: Matthew Rocklin
• License: BSD
• GitHub: more than 1,500 stars

Slide 23

Slide 23 text

(Incomplete) List of OSS projects that use Dask
• TFLearn: Deep learning library featuring a higher-level API for TensorFlow.
• Airflow (Distributed Scheduler): A platform to author, schedule and monitor workflows.
• Image Processing SciKit.
• N-D labeled arrays and datasets in Python.
• An interface to query data on different storage systems.
• Datashader: A graphics pipeline system for creating meaningful representations of large datasets quickly and flexibly.

Slide 24

Slide 24 text

Dask Data Structures
• Dask provides the following data structures.
    Data Structure    Base Class                    Default Scheduler
    Dask Array        NumPy ndarray                 threading
    Dask DataFrame    pandas DataFrame              threading
    Dask Bag          PyToolz (list, set, dict)     multiprocessing (cannot release the GIL)

Slide 25

Slide 25 text

Dask Array
• Consists of multiple NumPy nd-arrays split along its axes.
[Figure labels: NumPy ndarray, Dask Array, Chunk, chunksize]

Slide 26

Slide 26 text

Dask Array
    import numpy as np
    x = np.ones((10, 10))   # Create a 10x10 ndarray
    x
    array([[ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
           [ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
           [ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
           [ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
           [ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
           [ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
           [ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
           [ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
           [ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
           [ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]])

    import dask.array as da
    dx = da.ones((10, 10), chunks=(5, 5))   # Create a 10x10 Dask Array, specifying a 5x5 chunk size
    dx
    dask.array

Slide 27

Slide 27 text

Dask Array
    import dask.array as da
    dx = da.ones((10, 10), chunks=(5, 5))   # Create a 10x10 Dask Array, specifying a 5x5 chunk size
    dx
    dask.array
    dx.visualize()   # Visualize the internal graph using Graphviz
Each node corresponds to one chunk.

Slide 28

Slide 28 text

Dask Array
    dy = dx.sum(axis=0)   # Sum of array elements along axis 0
    dy
    dask.array
    dy.visualize()
    dy.compute()
    array([ 10., 10., 10., 10., 10., 10., 10., 10., 10., 10.])
[Figure: task graph with two Sum stages]

Slide 29

Slide 29 text

Dask Array
    dy2 = dx.sum()   # Sum of array elements
    dy2
    dask.array
    dy2.visualize()
    dy2.compute()
    100.0
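What the two Sum stages do can be imitated with plain NumPy. This is only a conceptual sketch of a chunked reduction under the 5x5 chunking used above; it does not reproduce Dask's actual task graph or scheduling.

```python
import numpy as np

x = np.ones((10, 10))

# Split the 10x10 array into four 5x5 chunks, as chunks=(5, 5) would.
chunks = [[x[i:i + 5, j:j + 5] for j in (0, 5)] for i in (0, 5)]

# Stage 1: sum each chunk along axis 0 independently (parallelizable).
partial = [[c.sum(axis=0) for c in row] for row in chunks]

# Stage 2: add the partial results of chunks stacked in the same column
# block, then concatenate the column blocks.
blocked_sum = np.concatenate([partial[0][j] + partial[1][j] for j in (0, 1)])

print(blocked_sum)
```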

Slide 30

Slide 30 text

Dask Array
• A Dask Array can have an arbitrary shape and chunk size.
• Using the same chunk size along each axis is recommended, because computations are performed per chunk.
    da.ones((30, 20, 1, 15), chunks=(3, 7, 1, 2))
    dask.array
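How a shape is cut into chunks can be inspected directly. The sketch below assumes Dask is installed and uses a smaller 2D shape than the slide for readability.

```python
import dask.array as da

x = da.ones((30, 20), chunks=(3, 7))

# chunks lists the block sizes per axis; the last block of an axis is
# smaller when the shape is not a multiple of the chunk size.
print(x.chunks)
print(x.numblocks)

total = x.sum().compute()   # computation happens only here
print(total)
```

Here axis 1 has length 20 and a chunk size of 7, so its last block holds only 6 columns; such uneven blocks are allowed.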

Slide 31

Slide 31 text

Dask DataFrame
• Consists of multiple pandas DataFrames split along the index (row labels).
[Figure labels: pandas DataFrame, Dask DataFrame, partition, division]

Slide 32

Slide 32 text

Dask DataFrame
    import numpy as np
    import pandas as pd
    # Create a 10x3 pandas DataFrame
    df = pd.DataFrame({'X': np.arange(10),
                       'Y': np.arange(10, 20),
                       'Z': np.arange(20, 30)},
                      index=list('abcdefghij'))
    df

Slide 33

Slide 33 text

Dask DataFrame
    import dask.dataframe as dd
    ddf = dd.from_pandas(df, 2)
    ddf
[Figure: 2 partitions bounded by 3 divisions]

Slide 34

Slide 34 text

Dask DataFrame
    ddf + 1               # Add 1 to all elements; computation is not performed at this point
    (ddf + 1).compute()   # Trigger computation
[Figure: df, ddf + 1, and the result of (ddf + 1).compute()]

Slide 35

Slide 35 text

Blocked Algorithm (Addition)
    (ddf + 1).compute()
[Figure: Dask DataFrame -> perform computation per partition (pandas DataFrame) -> Concat -> Result (pandas DataFrame)]

Slide 36

Slide 36 text

Blocked Algorithm (Total)
    ddf.sum().compute()
[Figure: Sum (per partition) -> Concat (concatenate) -> Sum (over partitions)]

Slide 37

Slide 37 text

Blocked Algorithm (Total)
    ddf.sum().visualize()
[Figure: Dask DataFrame -> Sum (per partition) -> Concatenate -> Sum (over partitions)]

Slide 38

Slide 38 text

Blocked Algorithm (Mean)
    ddf.mean().visualize()
[Figure: Sum (per partition) and Count (per partition) -> Sum (over partitions) and Count (over partitions) -> Mean = Sum / Count]
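The blocked mean can be reproduced by hand with pandas alone. In this sketch the two partitions are made by slicing at row 5, which is an assumption for illustration and not necessarily where Dask would place the division.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'X': np.arange(10), 'Y': np.arange(10, 20)},
                  index=list('abcdefghij'))

# Two row-wise partitions, mimicking a Dask DataFrame with 2 partitions.
partitions = [df.iloc[:5], df.iloc[5:]]

# Per-partition step: each partition yields its own sum and count.
sums = [p.sum() for p in partitions]
counts = [p.count() for p in partitions]

# Combine step: concatenate the partial results, reduce, then divide.
total_sum = pd.concat(sums, axis=1).sum(axis=1)
total_count = pd.concat(counts, axis=1).sum(axis=1)
blocked_mean = total_sum / total_count

print(blocked_mean)
```

Averaging the per-partition means directly would be wrong whenever partitions differ in size, which is why sum and count are carried separately until the final division.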

Slide 39

Slide 39 text

Blocked Algorithm (Descriptive Statistics)
    ddf.describe().visualize()
[Figure: task graph ending in Result]

Slide 40

Slide 40 text

Dask DataFrame Functionality
• Vectorized parallel computations
• Group by (split-apply-combine)
• Reshaping (merge, join, concat…)
• Various I/O support (SQL, CSV/TSV…)
• Flexible time series handling
• Please refer to the official documentation for details.

Slide 41

Slide 41 text

Dask Internals
• All Dask computations are expressed as a Dask Graph.
• The Dask Graph is flexible enough to implement more complex algorithms such as linear algebra.
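To make the idea of a graph concrete, here is a toy sketch in pure Python. The dict-of-tasks layout follows the style of Dask's documented graph specification, but `toy_get` is a hypothetical evaluator written for this example: it evaluates keys one by one, whereas a real scheduler also runs independent tasks in parallel.

```python
from operator import add


def inc(x):
    return x + 1


# A key maps either to a literal value or to a task tuple
# (function, argument, ...). String arguments that are keys of the graph
# refer to the results of other tasks.
graph = {
    'a': 1,
    'b': 5,
    'x': (inc, 'a'),
    'y': (inc, 'b'),
    'total': (add, 'x', 'y'),
}


def toy_get(dsk, key):
    """Evaluate one key recursively (hypothetical helper, no scheduling)."""
    task = dsk[key]
    if isinstance(task, tuple) and callable(task[0]):
        func, *args = task
        resolved = [
            toy_get(dsk, arg) if isinstance(arg, str) and arg in dsk else arg
            for arg in args
        ]
        return func(*resolved)
    return task


result = toy_get(graph, 'total')
print(result)
```

Because 'x' and 'y' do not depend on each other, a scheduler is free to compute them at the same time.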

Slide 42

Slide 42 text

Linear Algebra
• Dask Array implements:
    Function                  Description
    linalg.cholesky           Returns the Cholesky decomposition of a Hermitian positive-definite matrix A.
    linalg.inv                Compute the inverse of a matrix with LU decomposition and forward / backward substitutions.
    linalg.lstsq              Return the least-squares solution to a linear matrix equation using QR decomposition.
    linalg.lu                 Compute the LU decomposition of a matrix.
    linalg.qr                 Compute the QR factorization of a matrix.
    linalg.solve              Solve the equation a x = b for x.
    linalg.solve_triangular   Solve the equation a x = b for x, assuming a is a triangular matrix.
    linalg.svd                Compute the singular value decomposition of a matrix.
    linalg.svd_compressed     Randomly compressed rank-k thin Singular Value Decomposition.
    linalg.tsqr               Direct Tall-and-Skinny QR algorithm.

Slide 43

Slide 43 text

Example: LU decomposition
• LU decomposition: A = L x U
• LU decomposition can be split into computations per block.

Slide 44

Slide 44 text

Blocked LU Decomposition
• Diagonal block
• Row direction (i < j)
• Column direction (i < j)
[Block equations were not recovered from the slide]
* LU: function that computes an LU decomposition
* Solve: function that solves a linear equation
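Since the slide's block equations are not available here, the following NumPy sketch shows one standard way to carry out the three kinds of steps on a 2x2 grid of blocks. It is an illustration under stated assumptions and not Dask's implementation: `lu_nopivot` is a hypothetical helper, pivoting is skipped (the test matrix is made diagonally dominant so that this is safe), and `np.linalg.solve` stands in for a triangular solver.

```python
import numpy as np


def lu_nopivot(a):
    """Doolittle LU without pivoting: returns unit-lower L and upper U."""
    n = a.shape[0]
    l = np.eye(n)
    u = a.astype(float)
    for k in range(n - 1):
        for i in range(k + 1, n):
            l[i, k] = u[i, k] / u[k, k]
            u[i, k:] -= l[i, k] * u[k, k:]
    return l, u


rng = np.random.default_rng(0)
# Diagonally dominant matrix, so the factorization exists without pivoting.
a = rng.random((6, 6)) + 6 * np.eye(6)
a11, a12, a21, a22 = a[:3, :3], a[:3, 3:], a[3:, :3], a[3:, 3:]

# Diagonal block: factorize directly (the "LU" function).
l11, u11 = lu_nopivot(a11)
# Row direction: solve L11 @ U12 = A12 for U12 (the "Solve" function).
u12 = np.linalg.solve(l11, a12)
# Column direction: solve L21 @ U11 = A21 for L21.
l21 = np.linalg.solve(u11.T, a21.T).T
# Next diagonal block: factorize what remains after removing L21 @ U12.
l22, u22 = lu_nopivot(a22 - l21 @ u12)

zeros = np.zeros((3, 3))
l = np.block([[l11, zeros], [l21, l22]])
u = np.block([[u11, u12], [zeros, u22]])
print(np.allclose(l @ u, a))
```

Each step only touches blocks, so the off-diagonal solves within one stage are independent of each other and can be scheduled as separate tasks.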

Slide 45

Slide 45 text

Blocked LU Decomposition
    arr = da.random.random((9, 9), chunks=(3, 3))
    arr
    dask.array

    from dask import compute
    t, l, u = da.linalg.lu(arr)
    t, l, u = compute(t, l, u)

Slide 46

Slide 46 text

Blocked LU Decomposition
    from dask import visualize
    visualize(t, l, u)

Slide 47

Slide 47 text

Dask Internals
• All Dask data structures are represented by a Dask Graph.
  • Dask Array and DataFrame operations update their Dask Graph.
• Dask also offers an API to make your own algorithms parallel.

Slide 48

Slide 48 text

Dask Delayed
• Assume the following simple computation.
  • x and y can be computed in parallel.
    def inc(x):
        return x + 1

    def add(x, y):
        return x + y

    x = inc(1)
    y = inc(5)
    total = add(x, y)
    total
    8
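For comparison, the same dependency structure can be written by hand with the standard library's thread pool. This sketch is not how Dask executes the computation; it only makes explicit which calls are independent and which one has to wait.

```python
from concurrent.futures import ThreadPoolExecutor


def inc(x):
    return x + 1


def add(x, y):
    return x + y


with ThreadPoolExecutor(max_workers=2) as pool:
    # The two inc calls do not depend on each other, so both are submitted
    # before either result is requested.
    future_x = pool.submit(inc, 1)
    future_y = pool.submit(inc, 5)
    # add needs both results, so it waits for them.
    total = add(future_x.result(), future_y.result())

print(total)
```

Here the programmer decides the submission order manually; with a task graph that ordering is derived from the dependencies.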

Slide 49

Slide 49 text

Dask Delayed
    from dask import delayed

    # Using the @delayed decorator makes the wrapped function lazy.
    @delayed
    def inc(x):
        return x + 1

    @delayed
    def add(x, y):
        return x + y

    # Delayed functions can be chained; the output is a Delayed instance
    # (not evaluated at this point).
    x = inc(1)
    y = inc(5)
    total = add(x, y)
    total
    Delayed('add-b43be476-ffc7-48d7-a8ec-0f95df821e64')

    total.compute()   # Trigger computation
    8

Slide 50

Slide 50 text

Dask Delayed
• A chain of Delayed functions is represented with a Dask Graph.
    total.visualize()

Slide 51

Slide 51 text

Dask Distributed
• Dask itself offers 2 types of schedulers that work on a single node:
  • threading
  • multiprocessing
• The Dask Distributed package offers a distributed scheduler, which distributes tasks over multiple nodes.

Slide 52

Slide 52 text

Dask Distributed
• A centrally managed, distributed, dynamic task scheduler.
  • Low latency: Each task suffers about 1ms of overhead.
  • Peer-to-peer data sharing: Workers communicate with each other to share data.
  • Complex scheduling: Supports complex workflows.
  • Data locality: Scheduling algorithms cleverly execute computations where data lives.
[Figure labels: Distributed Client, Distributed Scheduler, Distributed Worker (x2)]

Slide 53

Slide 53 text

Demo • EC2 m4.xlarge • vCPU: 4 • Memory 16GiB • 4 Workers with Dask Distributed scheduler

Slide 54

Slide 54 text

Comparison with Spark
• Dask works well when you:
  • Want to scale an existing NumPy or pandas project.
  • Need parallel / out-of-core processing on a single node.
  • Prototype complex algorithms interactively.
  • Don’t have Big Data infrastructure you can use freely.
• Spark works well when you:
  • Need to scale out to large clusters.
  • Have workflow requirements that fit the Spark API (typical ETL or SQL-like ops).
  • Need enterprise support.

Slide 55

Slide 55 text

References
• Official documentation
  • http://dask.pydata.org/en/stable/
• Dask Tutorial (includes more practical examples)
  • https://github.com/dask/dask-tutorial
• Matthew Rocklin’s blog posts
  • http://matthewrocklin.com/blog/

Slide 56

Slide 56 text

Conclusions
• Understand the fundamentals of:
  • How pandas handles internal data efficiently.
  • Dask to parallelize data processing easily. It provides:
    • Data structures like nd-array and DataFrame that extend common interfaces like NumPy and pandas.
    • A dynamic task graph and its scheduling, optimized for computation.

Slide 57

Slide 57 text

Interested?
• Let’s start contributing!
  • pandas
    • https://github.com/pandas-dev/pandas
  • Dask
    • https://github.com/dask/dask