Data processing using pandas and Dask

Masaaki Horikoshi @ ARISE analytics

Self Introduction
• OSS Contributions:
• A member of the core developers of:
• GitHub: https://github.com/sinhrks

Goal
• Understand the fundamentals of:
  • How pandas handles internal data efficiently.
  • How Dask makes it easy to parallelize data processing.

PyData Stacks
• Cooperates with various scientific packages.
(Figure categories: User Interface, Visualization, Big Data, I/O, Computation, Machine Learning, Statistics, Other programming languages)
(Figure packages: Bokeh, matplotlib, scikit-learn, statsmodels, NumPy, PyTables, SQLAlchemy, Ibis, SciPy, PySpark, Dask, Jupyter, pandas, rpy2)

pandas: Efficient Labeled Data Structure

What is pandas?
• “pandas” provides high-performance, easy-to-use data structures for data analysis.
• Its main structure is comparable to what is known as the “DataFrame” in the R language.
• Author: Wes McKinney
• License: BSD
• Etymology: PANel DAta System
• GitHub: 9700↑⭐

DataFrame
    import pandas as pd
    df = pd.read_csv('adult.csv')  # Read the CSV file
    df
* Adult Dataset taken from the UCI ML Repository.
  Lichman, M. UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

DataFrame
    df[['age', 'marital-status']]                  # Select
    df.groupby('income')['hours-per-week'].mean()  # Group by, select, aggregate
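The select and group-by steps above can be tried without the Adult dataset. The following is a minimal sketch that assumes a tiny, made-up frame with the same column names; the values are hypothetical and only serve to show the shape of the result.

```python
import pandas as pd

# Hypothetical miniature stand-in for the Adult dataset (values are made up).
df = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "marital-status": ["Never-married", "Married", "Divorced", "Married"],
    "hours-per-week": [40, 13, 40, 60],
    "income": ["<=50K", "<=50K", ">50K", ">50K"],
})

# Select: keep only two columns.
selected = df[["age", "marital-status"]]

# Group by -> select -> aggregate: mean working hours per income class.
mean_hours = df.groupby("income")["hours-per-week"].mean()
print(mean_hours)
```

The result of the aggregation is a Series indexed by the group keys, so a single group can be looked up by label, for example `mean_hours[">50K"]`.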

Why pandas?
• Handles real-world (messy) data.
• Provides intuitive data structures.
• Batteries included for data processing.

NumPy
• N-dimensional array (nd-array) and matrix
• Index (location) based access
• Single data type *
    import numpy as np
    arr = np.array([1, 2, 3, 4], dtype=np.int64)
    arr
    # array([1, 2, 3, 4])
(Figure label: location)
* NumPy “Structured Array” can contain multiple dtypes.
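To make the two points above concrete, here is a small sketch contrasting a regular single-dtype array with a structured array. The field names and values are made up for illustration.

```python
import numpy as np

# A regular ndarray holds a single dtype and is accessed by location.
arr = np.array([1, 2, 3, 4], dtype=np.int64)
first = arr[0]      # location-based access to one element
middle = arr[1:3]   # location-based slicing

# A structured array declares one dtype per named field (hypothetical fields).
people = np.array(
    [("alice", 30, 55.5), ("bob", 25, 70.0)],
    dtype=[("name", "U10"), ("age", "i8"), ("weight", "f8")],
)
ages = people["age"]  # access a whole field by its name
```

Even with a structured array, rows are still addressed by location; only the fields carry names.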

pandas
• In addition to NumPy capabilities:
  • Label based access (Index / Columns)
  • Mixed data types
(Figure labels: Columns, Index, Mixed data types)
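As a rough illustration of label-based access and mixed data types, the sketch below builds a small frame with made-up row and column labels and reads the same cell by label and by location.

```python
import pandas as pd

# Hypothetical frame: integer, float and string columns with string row labels.
df = pd.DataFrame(
    {"X": [1, 2, 3], "Y": [1.5, 2.5, 3.5], "Z": ["a", "b", "c"]},
    index=["r1", "r2", "r3"],
)

by_label = df.loc["r2", "Y"]    # label-based access (Index / Columns)
by_position = df.iloc[1, 1]     # location-based access, as in NumPy

# Each column keeps its own dtype; 'i' is integer and 'f' is floating point.
kinds = {col: dtype.kind for col, dtype in df.dtypes.items()}
```

Both lookups return the same value; `.loc` simply resolves the labels to locations first.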

Data Structures
• Defined per dimension.
• pandas focuses on the 2D data structure.
(Figure: Series (1D), DataFrame (2D), Panel (3D; deprecated). Colorized cells are labels.)

pandas Functionality
• Vectorized computations
• Group by (split-apply-combine)
• Reshaping (merge, join, concat…)
• Various I/O support (SQL, Excel, CSV/TSV…)
• Flexible time series handling
• Plotting
• Please refer to the official documentation for details.

DataFrame Internals
• A DataFrame consists of type-based “Blocks”.
(Figure labels: Columns, Index, …, IntBlock, FloatBlock, ObjectBlock. Columns may be consolidated per type.)
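The idea of consolidating columns per type can be pictured with public API only. The sketch below does not touch the internal block structures; it merely groups column names by dtype, which is an assumption-level analogy for how columns of the same type may end up stored together.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with interleaved integer and float columns.
df = pd.DataFrame({
    "a": np.array([1, 2], dtype="int64"),
    "b": np.array([0.1, 0.2], dtype="float64"),
    "c": np.array([3, 4], dtype="int64"),
    "d": np.array([0.3, 0.4], dtype="float64"),
})

# Group column names by dtype (conceptual view only, not the internal layout).
columns_by_dtype = {}
for column, dtype in df.dtypes.items():
    columns_by_dtype.setdefault(str(dtype), []).append(column)

print(columns_by_dtype)
```

Storing same-typed columns together is what lets type-specific, vectorized routines run over several columns at once.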

Internal Implementations
• pandas internally uses:
  • NumPy: basic indexing, basic statistics…
  • SciPy: interpolation, advanced statistics…
  • Cython: advanced indexing, hash table, group-by, join, datetime ops…

Cython (Language)
• A superset of Python that additionally supports C functions and C types. It can be compiled to C code.
    def ismember(ndarray arr, set values):
        cdef:
            # Type definitions
            Py_ssize_t i, n
            ndarray[uint8_t] result
            object val
        n = len(arr)
        result = np.empty(n, dtype=np.uint8)
        for i in range(n):
            # Get the input's i-th value
            val = util.get_value_at(arr, i)
            result[i] = val in values
        # Return a bool array indicating which array elements are included in the set
        return result.view(np.bool_)

    ismember(np.array([1, 2, 3, 4]), set([2, 3]))
    # array([False, True, True, False], dtype=bool)
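For readers without a Cython toolchain, the loop above can be approximated in plain Python. This is only a sketch of the same logic: `ismember_py` is a hypothetical name, and it runs at interpreter speed rather than C speed. NumPy's `np.isin` is shown next to it as the vectorized alternative.

```python
import numpy as np

def ismember_py(arr, values):
    """Pure-Python counterpart of the Cython loop: one membership test per element."""
    result = np.empty(len(arr), dtype=np.bool_)
    # tolist() yields plain Python scalars, which hash like the set members.
    for i, val in enumerate(arr.tolist()):
        result[i] = val in values
    return result

arr = np.array([1, 2, 3, 4])
looped = ismember_py(arr, {2, 3})
vectorized = np.isin(arr, [2, 3])  # NumPy's built-in membership test
```

The typed Cython version keeps the same structure but lets the loop counter and the result buffer live in C, which is where the speed-up comes from.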

Release the GIL
• For better parallelism (pandas 0.17.0 and later).
    def duplicated_int64(ndarray[int64_t, ndim=1] values, object keep='first'):
        cdef:
            int ret = 0, value, k
            Py_ssize_t i, n = len(values)
            kh_int64_t * table = kh_init_int64()
            ndarray[uint8_t, ndim=1, cast=True] out = np.empty(n, dtype='bool')

        kh_resize_int64(table, min(n, _SIZE_HINT_LIMIT))
        …
        else:
            with nogil:  # Release the GIL
                for i from 0 <= i < n:
                    value = values[i]
                    k = kh_get_int64(table, value)
                    if k != table.n_buckets:
                        out[table.vals[k]] = 1
                        out[i] = 1
                    else:
                        k = kh_put_int64(table, value, &ret)
                        table.keys[k] = value
                        table.vals[k] = i
                        out[i] = 0
        kh_destroy_int64(table)
        return out

GIL (Global Interpreter Lock)
• It prevents CPython from executing Python code in multiple threads at the same time.
• The GIL is released during I/O.
• The GIL can be released using Cython.
• Scientific packages are working to release the GIL:
  • NumPy, SciPy, scikit-learn, pandas…
• Python objects cannot be used after the GIL is released.
• If the target is object dtype, the GIL cannot be released.

Further Reading
• “A look inside pandas design and development” by Wes McKinney
• “pandas internals” by Masaaki Horikoshi

pandas for Big Data
• You may face 2 issues:
  • pandas performs computations using a single thread.
    • Users have to parallelize by themselves.
  • pandas cannot handle data which exceeds physical memory.
    • Users have to write their own logic using the pandas “chunk” functionality.
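The chunk-based logic mentioned above might look like the following sketch. It assumes the `chunksize` option of `pd.read_csv`, and uses a small in-memory CSV with made-up values as a stand-in for a file that would not fit in memory.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for a file too large for memory.
csv_data = io.StringIO("key,value\na,1\nb,2\na,3\nb,4\na,5\n")

total = 0
row_count = 0
# With chunksize, read_csv yields DataFrames of at most 2 rows each.
for chunk in pd.read_csv(csv_data, chunksize=2):
    total += chunk["value"].sum()   # partial result for this chunk
    row_count += len(chunk)

mean_value = total / row_count      # combine the partial results
```

The user has to keep the running state (`total`, `row_count`) and decide how partial results combine, which is the bookkeeping that Dask takes over.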

Dask: Flexible Parallel Computation Framework

What is Dask?
• Dask is a flexible parallel computation framework for numeric operations which offers:
  • Data structures like nd-array and DataFrame which extend common interfaces like NumPy and pandas.
  • A dynamic task graph and its scheduling, optimized for computation.
• Author: Matthew Rocklin
• License: BSD
• GitHub: 1500↑⭐

(Incomplete) List of OSS Using Dask
• TFLearn: deep learning library featuring a higher-level API for TensorFlow.
• Airflow (Distributed Scheduler): a platform to author, schedule and monitor workflows.
• Image Processing SciKit.
• N-D labeled arrays and datasets in Python.
• An interface to query data on different storage systems.
• Datashader: a graphics pipeline system for creating meaningful representations of large datasets quickly and flexibly.

Dask Data Structures
• Dask provides the following data structures.

Data Structure    Base Class                  Default Scheduler
Dask Array        NumPy ndarray               threading
Dask DataFrame    pandas DataFrame            threading
Dask Bag          PyToolz (list, set, dict)   multiprocessing (cannot release GIL)

Dask Array
• Consists of multiple NumPy nd-arrays split along each axis.
(Figure labels: NumPy ndarray, Dask Array, Chunk, chunksize)

Dask Array
    import numpy as np
    x = np.ones((10, 10))  # Create a 10x10 ndarray
    x
    # array([[ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
    #        [ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
    #        [ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
    #        [ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
    #        [ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
    #        [ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
    #        [ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
    #        [ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
    #        [ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
    #        [ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]])

    import dask.array as da
    dx = da.ones((10, 10), chunks=(5, 5))  # Create a 10x10 Dask Array, specifying a 5x5 chunk size
    dx
    # dask.array<wrapped, shape=(10, 10), dtype=float64, chunksize=(5, 5)>

Dask Array
    import dask.array as da
    dx = da.ones((10, 10), chunks=(5, 5))  # Create a 10x10 Dask Array, specifying a 5x5 chunk size
    dx
    # dask.array<wrapped, shape=(10, 10), dtype=float64, chunksize=(5, 5)>
    dx.visualize()  # Visualize the internal graph using Graphviz
(Figure: each node corresponds to a chunk.)

Dask Array
    dy = dx.sum(axis=0)  # Sum of array elements along axis 0
    dy
    # dask.array<sum-aggregate, shape=(10,), dtype=float64, chunksize=(5,)>
    dy.visualize()
    dy.compute()
    # array([ 10., 10., 10., 10., 10., 10., 10., 10., 10., 10.])
(Figure labels: Sum, Sum)

Dask Array
    dy2 = dx.sum()  # Sum of all array elements
    dy2
    # dask.array<sum-aggregate, shape=(), dtype=float64, chunksize=()>
    dy2.visualize()
    dy2.compute()
    # 100.0

Dask Array
• A Dask Array can have arbitrary shape and chunk size.
• It is recommended to use a uniform chunk size along each axis, because computations are performed per chunk.
    da.ones((30, 20, 1, 15), chunks=(3, 7, 1, 2))
    # dask.array<wrapped, shape=(30, 20, 1, 15), dtype=float64, chunksize=(3, 7, 1, 2)>

Dask DataFrame
• Consists of multiple pandas DataFrames split along the index (row labels).
(Figure labels: pandas DataFrame, Dask DataFrame, partition, division, division)

Dask DataFrame
    import numpy as np
    import pandas as pd
    # Create a 10x3 pandas DataFrame
    df = pd.DataFrame({'X': np.arange(10),
                       'Y': np.arange(10, 20),
                       'Z': np.arange(20, 30)},
                      index=list('abcdefghij'))
    df

Dask DataFrame
    import dask.dataframe as dd
    ddf = dd.from_pandas(df, 2)  # Split into 2 partitions
    ddf
(Figure labels: partition, partition, division, division, division)

Dask DataFrame
    ddf + 1              # Add 1 to all elements; computation is not performed at this point
    (ddf + 1).compute()  # Trigger computation
(Figure labels: df, ddf + 1, (ddf + 1).compute())

Blocked Algorithm (Addition)
    (ddf + 1).compute()
(Figure: Dask DataFrame → perform computation per partition (pandas DataFrame) → Concat → Result (pandas DataFrame))

Blocked Algorithm (Total)
    ddf.sum().compute()
(Figure: Sum per partition → Concatenate → Sum over partitions)

Blocked Algorithm (Total)
    ddf.sum().visualize()
(Figure: Dask DataFrame → Sum per partition → Concatenate → Sum over partitions)

Blocked Algorithm (Mean)
    ddf.mean().visualize()
(Figure: Sum per partition → Sum over partitions; Count per partition → Count over partitions; Mean = Sum / Count)
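The blocked mean can be imitated with plain pandas to see why it only needs a sum and a count per partition. This is a sketch of the idea, not Dask's implementation; the two `iloc` slices are a hypothetical stand-in for Dask DataFrame partitions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"X": np.arange(10), "Y": np.arange(10, 20)},
                  index=list("abcdefghij"))

# Split the rows into two partitions (stand-in for Dask DataFrame partitions).
partitions = [df.iloc[:5], df.iloc[5:]]

# Per-partition step: each partition only reports its own sum and count.
partial_sums = [part.sum() for part in partitions]
partial_counts = [part.count() for part in partitions]

# Combine step: concatenate the partial results, then reduce over partitions.
total_sum = pd.concat(partial_sums, axis=1).sum(axis=1)
total_count = pd.concat(partial_counts, axis=1).sum(axis=1)
blocked_mean = total_sum / total_count  # Mean = Sum / Count
```

Averaging the per-partition means directly would be wrong when partitions have different sizes, which is why the sum and the count are carried separately.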

Blocked Algorithm (Descriptive Statistics)
    ddf.describe().visualize()
(Figure label: Result)

Dask DataFrame Functionality
• Vectorized parallel computations
• Group by (split-apply-combine)
• Reshaping (merge, join, concat…)
• Various I/O support (SQL, CSV/TSV…)
• Flexible time series handling
• Please refer to the official documentation for details.

Dask Internals
• All Dask computations are expressed as a Dask Graph.
• The Dask Graph is flexible enough to implement more complex algorithms such as linear algebra.
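To give a feel for what a task graph is, the sketch below writes one as a plain dictionary that maps keys to tasks, where a task is a tuple of a function and its arguments. The evaluator `toy_get` is a hypothetical, sequential helper written for this example; it is not Dask's scheduler and performs no parallelism or optimization.

```python
def inc(x):
    return x + 1

def add(x, y):
    return x + y

# A task graph as a plain dict: key -> literal value or (function, *arguments).
graph = {
    "x": (inc, 1),
    "y": (inc, 5),
    "total": (add, "x", "y"),
}

def toy_get(graph, key, cache=None):
    """Hypothetical minimal evaluator: resolve dependencies recursively."""
    cache = {} if cache is None else cache
    if key in cache:
        return cache[key]
    task = graph[key]
    if isinstance(task, tuple) and callable(task[0]):
        func, *args = task
        # String arguments that name another key are dependencies.
        resolved = [
            toy_get(graph, a, cache) if isinstance(a, str) and a in graph else a
            for a in args
        ]
        value = func(*resolved)
    else:
        value = task
    cache[key] = value
    return value

result = toy_get(graph, "total")
```

Because "x" and "y" do not depend on each other, a real scheduler is free to run them at the same time; the graph makes that independence explicit.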

Linear Algebra
• Dask Array implements:

Function                 Description
linalg.cholesky          Returns the Cholesky decomposition, A = L L* or A = U* U, of a Hermitian positive-definite matrix A.
linalg.inv               Compute the inverse of a matrix with LU decomposition and forward / backward substitutions.
linalg.lstsq             Return the least-squares solution to a linear matrix equation using QR decomposition.
linalg.lu                Compute the LU decomposition of a matrix.
linalg.qr                Compute the QR factorization of a matrix.
linalg.solve             Solve the equation a x = b for x.
linalg.solve_triangular  Solve the equation a x = b for x, assuming a is a triangular matrix.
linalg.svd               Compute the singular value decomposition of a matrix.
linalg.svd_compressed    Randomly compressed rank-k thin Singular Value Decomposition.
linalg.tsqr              Direct Tall-and-Skinny QR algorithm.

Example: LU decomposition
• LU decomposition: A = L × U
• LU decomposition can be split into computations per block.

Blocked LU Decomposition
• Diagonal block
• Row direction (i < j)
• Column direction (i < j)
* LU: function that computes the LU decomposition
* Solve: function that solves a linear equation
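One way the three cases above could be written down is sketched below with NumPy only. It is a simplified illustration and makes several assumptions: there is no pivoting (so the test matrix is made diagonally dominant), the matrix size is a multiple of the block size, and the helper names `lu_nopivot` and `blocked_lu` are hypothetical. It is not the Dask implementation.

```python
import numpy as np

def lu_nopivot(a):
    """'LU' step: Doolittle LU of one small block without pivoting (a = l @ u)."""
    n = a.shape[0]
    l = np.eye(n)
    u = a.astype(float).copy()
    for k in range(n - 1):
        for i in range(k + 1, n):
            l[i, k] = u[i, k] / u[k, k]
            u[i, k:] -= l[i, k] * u[k, k:]
    return l, np.triu(u)

def blocked_lu(a, block_size):
    """Blocked LU without pivoting; assumes the size is a multiple of block_size."""
    n = a.shape[0]
    assert n % block_size == 0
    nblocks = n // block_size
    L = np.zeros((n, n))
    U = np.zeros((n, n))

    def blk(i):
        return slice(i * block_size, (i + 1) * block_size)

    for i in range(nblocks):
        done = slice(0, i * block_size)  # blocks already factorized
        # Diagonal block: LU of A_ii minus contributions of earlier blocks.
        s = a[blk(i), blk(i)] - L[blk(i), done] @ U[done, blk(i)]
        L[blk(i), blk(i)], U[blk(i), blk(i)] = lu_nopivot(s)
        for j in range(i + 1, nblocks):
            # Row direction (i < j): solve L_ii U_ij = A_ij - sum_k L_ik U_kj.
            r = a[blk(i), blk(j)] - L[blk(i), done] @ U[done, blk(j)]
            U[blk(i), blk(j)] = np.linalg.solve(L[blk(i), blk(i)], r)
            # Column direction (i < j): solve L_ji U_ii = A_ji - sum_k L_jk U_ki.
            c = a[blk(j), blk(i)] - L[blk(j), done] @ U[done, blk(i)]
            L[blk(j), blk(i)] = np.linalg.solve(U[blk(i), blk(i)].T, c.T).T
    return L, U

rng = np.random.default_rng(0)
a = rng.random((6, 6)) + 6 * np.eye(6)  # diagonally dominant, so no pivoting needed
l, u = blocked_lu(a, block_size=3)
```

Within one step of the outer loop, the row-direction and column-direction solves are independent of each other, which is what a task scheduler can exploit.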

Blocked LU Decomposition
    import dask.array as da
    arr = da.random.random((9, 9), chunks=(3, 3))
    arr
    # dask.array<da.random.random_sample, shape=(9, 9), dtype=float64, chunksize=(3, 3)>

    from dask import compute
    t, l, u = da.linalg.lu(arr)
    t, l, u = compute(t, l, u)

Blocked LU Decomposition
    from dask import visualize
    visualize(t, l, u)

Dask Internals
• All Dask data structures are represented by a Dask Graph.
• Dask Array and DataFrame operations update their Dask Graph.
• Dask also offers an API to make your own algorithms parallel.

Dask Delayed
• Assume the following simple computation.
• x and y can be computed in parallel.
    def inc(x):
        return x + 1

    def add(x, y):
        return x + y

    x = inc(1)
    y = inc(5)
    total = add(x, y)
    total
    # 8

Dask Delayed
    from dask import delayed

    # Using the @delayed decorator makes the wrapped function lazy
    @delayed
    def inc(x):
        return x + 1

    @delayed
    def add(x, y):
        return x + y

    # Delayed functions can be chained, and output a Delayed instance (not evaluated at this point)
    x = inc(1)
    y = inc(5)
    total = add(x, y)
    total
    # Delayed('add-b43be476-ffc7-48d7-a8ec-0f95df821e64')

    total.compute()  # Trigger computation
    # 8

Dask Delayed
• A chain of Delayed functions is represented as a Dask Graph.
    total.visualize()

Dask Distributed
• Dask itself offers 2 types of schedulers which work on a single node:
  • threading
  • multiprocessing
• The Dask Distributed package offers a distributed scheduler, which distributes tasks over multiple nodes.
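Choosing a single-node scheduler might look like the sketch below. It assumes a reasonably recent Dask release in which `compute` accepts a `scheduler` keyword; releases contemporary with this deck may have selected the scheduler differently. The "synchronous" option is included only as a single-threaded baseline for comparison.

```python
from dask import delayed

@delayed
def inc(x):
    return x + 1

@delayed
def add(x, y):
    return x + y

# Build the lazy graph; nothing is computed yet.
total = add(inc(1), inc(5))

# Run the same graph on the thread-based scheduler ...
threaded = total.compute(scheduler="threads")
# ... and on the single-threaded scheduler, which is handy for debugging.
sequential = total.compute(scheduler="synchronous")
```

The graph itself does not change between the two calls; only the component that executes it does.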

Dask Distributed
• A centrally managed, distributed, dynamic task scheduler.
  • Low latency: each task suffers about 1ms of overhead.
  • Peer-to-peer data sharing: workers communicate with each other to share data.
  • Complex scheduling: supports complex workflows.
  • Data locality: scheduling algorithms cleverly execute computations where data lives.
(Figure labels: Distributed Client, Distributed Scheduler, Distributed Worker, Distributed Worker)

Demo
• EC2 m4.xlarge
  • vCPU: 4
  • Memory: 16GiB
• 4 Workers with Dask Distributed scheduler

Comparison with Spark
• Dask works well when:
  • You want to scale an existing NumPy or pandas project.
  • You need parallel / out-of-core processing on a single node.
  • You want to prototype complex algorithms interactively.
  • You don’t have Big Data infrastructure you can use freely.
• Spark works well when:
  • You need to scale out to a large cluster.
  • Your workflow requirements fit the Spark API (typical ETL or SQL-like ops).
  • You need enterprise support.

References
• Official Documentation
  • http://dask.pydata.org/en/stable/
• Dask Tutorial (includes more practical examples)
  • https://github.com/dask/dask-tutorial
• Matthew Rocklin’s Blog Posts
  • http://matthewrocklin.com/blog/

Conclusions
• Understand the fundamentals of:
  • How pandas handles internal data efficiently.
  • How Dask makes it easy to parallelize data processing. It provides:
    • Data structures like nd-array and DataFrame which extend common interfaces like NumPy and pandas.
    • A dynamic task graph and its scheduling, optimized for computation.

Interested?
• Let’s start contributing!
  • pandas: https://github.com/pandas-dev/pandas
  • Dask: https://github.com/dask/dask
