Data processing using pandas and Dask

PyCon night Tokyo 2017
https://eventdots.jp/event/617886

Sinhrks

May 26, 2017

Transcript

  1. Data processing using pandas and Dask

    Masaaki Horikoshi @ ARISE analytics

  2. Self Introduction
    • OSS Contributions:

    • A member of core developers of:

    • GitHub: https://github.com/sinhrks

  3. Goal
    • Understand the fundamentals of:

    • How pandas handles internal data efficiently.

    • Using Dask to parallelize data processing easily.

  4. PyData Stacks
    • Cooperates with various scientific packages.
    [Diagram: PyData packages grouped by category]
    Categories: User Interface, Visualization, Big Data, IO, Computation, Machine Learning, Statistics, Other programming languages
    Packages: Bokeh, matplotlib, scikit-learn, Statsmodels, NumPy, PyTables, SQLAlchemy, Ibis, SciPy, PySpark, Dask, Jupyter, pandas, rpy2

  5. pandas

    Efficient Labeled Data Structure

  6. What is pandas?
    • “pandas” provides high-performance, easy-to-use
    data structures for data analysis.

    • Its main data structure corresponds to what the R language calls a “data frame”.

    • Author: Wes McKinney

    • License: BSD

    • Etymology: PANel DAta System

    • GitHub: 9700↑⭐

  7. DataFrame
    import pandas as pd
    df = pd.read_csv('adult.csv')  # Read csv file
    df
    * Adult Dataset taken from UCI ML Repository:
      Lichman, M. UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science.

  8. DataFrame
    df[['age', 'marital-status']]  # Select
    df.groupby('income')['hours-per-week'].mean()  # Group by, Select, Aggregate
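The two operations on this slide can be tried without the adult.csv file. The sketch below uses a tiny hand-made frame: the four rows are invented for illustration, and only the column names follow the slide.

```python
import pandas as pd

# Hypothetical sample rows; only the column names come from the slide.
df = pd.DataFrame({
    'age': [25, 38, 52, 41],
    'marital-status': ['Never-married', 'Married', 'Married', 'Divorced'],
    'income': ['<=50K', '>50K', '>50K', '<=50K'],
    'hours-per-week': [40, 50, 60, 20],
})

# Select: passing a list of labels returns a DataFrame with those columns.
selected = df[['age', 'marital-status']]

# Group by 'income', select one column, then aggregate with mean.
mean_hours = df.groupby('income')['hours-per-week'].mean()
print(mean_hours)
```

The result of the group-by is a Series indexed by the group keys, so individual group means can be looked up by label.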

  9. Why pandas?
    • Capable of handling real-world (messy) data.

    • Provides intuitive data structures.

    • Batteries included for data processing.

  10. NumPy
    • N-dimensional array (nd-array) and matrix

    • Index (location) based access

    • Single data type
    import numpy as np
    arr = np.array([1, 2, 3, 4], dtype=np.int64)
    arr
    array([1, 2, 3, 4])
    [Figure label: location]
    * NumPy “Structured Array” can contain multiple dtypes.

  11. pandas
    • In addition to NumPy capabilities:

    • Label based access (Index / Columns)

    • Mixed data types
    [Figure labels: Columns, Index, Mixed data types]
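A minimal sketch of the difference between location-based and label-based access, and of per-column data types. The labels and column names below are made up for the example.

```python
import numpy as np
import pandas as pd

arr = np.array([1, 2, 3, 4], dtype=np.int64)
third_by_position = arr[2]            # NumPy: access by location

s = pd.Series(arr, index=['a', 'b', 'c', 'd'])
third_by_label = s['c']               # pandas: access by index label

# Each column keeps its own dtype (hypothetical column names).
df = pd.DataFrame({'count': [1, 2], 'ratio': [0.5, 1.5], 'name': ['x', 'y']},
                  index=['r1', 'r2'])
cell = df.loc['r2', 'ratio']          # label-based access via Index / Columns
print(df.dtypes)
```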

  12. Data Structures
    • Defined per dimension.

    • pandas focuses on the 2D data structure.
    Series: 1D
    DataFrame: 2D
    Panel: 3D (deprecated)
    * Colorized cells are labels.

  13. pandas Functionality
    • Vectorized computations

    • Group by (split-apply-combine)

    • Reshaping (merge, join, concat…)

    • Various I/O support (SQL, Excel, CSV/TSV…)

    • Flexible time series handling

    • Plotting

    • Please refer to the official documentation for details.

  14. DataFrame Internals
    • Consists of type-based “Blocks”.
    [Figure labels: Columns, Index, …, IntBlock, FloatBlock, ObjectBlock]
    Columns may be consolidated per type.

  15. Internal Implementations
    • pandas internally uses:

    • NumPy

    • Basic indexing, basic statistics…

    • SciPy

    • Interpolation, advanced statistics…

    • Cython:

    • Advanced indexing, hash table, group-by, join,
    datetime ops…

  16. Cython (Language)
    • A superset of Python that additionally supports C functions
    and C types. It can be compiled to C code.
    def ismember(ndarray arr, set values):
        # Type definitions
        cdef:
            Py_ssize_t i, n
            ndarray[uint8_t] result
            object val
        n = len(arr)
        result = np.empty(n, dtype=np.uint8)
        for i in range(n):
            # Get input's i-th value
            val = util.get_value_at(arr, i)
            result[i] = val in values
        # Return a bool array indicating whether each array element is included in the set
        return result.view(np.bool_)
    ismember(np.array([1, 2, 3, 4]), set([2, 3]))
    array([False, True, True, False], dtype=bool)
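For comparison, roughly the same membership test can be written without Cython. This is not the pandas implementation, only a sketch of what the function above returns: `ismember_py` is a hypothetical pure-Python mirror of the loop (no C type declarations), and `np.isin` is NumPy's own vectorized membership test.

```python
import numpy as np

def ismember_py(arr, values):
    """Pure-Python mirror of the slide's loop (hypothetical helper, no C types)."""
    result = np.empty(len(arr), dtype=np.uint8)
    for i in range(len(arr)):
        result[i] = arr[i] in values
    return result.view(np.bool_)

arr = np.array([1, 2, 3, 4])
values = {2, 3}

looped = ismember_py(arr, values)
# np.isin expects an array-like, so the set is converted to a list first.
vectorized = np.isin(arr, list(values))
print(looped, vectorized)
```

The Cython version keeps the same loop but declares the index and the result buffer as C types, which is where the speed-up over the pure-Python loop comes from.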

  17. Release the GIL
    • For better parallelism (pandas 0.17.0 and later).
    def duplicated_int64(ndarray[int64_t, ndim=1] values, object keep='first'):
        cdef:
            int ret = 0, value, k
            Py_ssize_t i, n = len(values)
            kh_int64_t * table = kh_init_int64()
            ndarray[uint8_t, ndim=1, cast=True] out = np.empty(n, dtype='bool')
        kh_resize_int64(table, min(n, _SIZE_HINT_LIMIT))
        # (preceding branches omitted on the slide)
        else:
            # Release the GIL
            with nogil:
                for i from 0 <= i < n:
                    value = values[i]
                    k = kh_get_int64(table, value)
                    if k != table.n_buckets:
                        out[table.vals[k]] = 1
                        out[i] = 1
                    else:
                        k = kh_put_int64(table, value, &ret)
                        table.keys[k] = value
                        table.vals[k] = i
                        out[i] = 0
        kh_destroy_int64(table)
        return out

  18. GIL (Global Interpreter Lock)
    • It prevents CPython from running multiple threads in parallel: only one thread executes Python bytecode at a time.

    • GIL is released on I/O.

    • GIL can be released using Cython.

    • Scientific packages are working to release GIL.

    • NumPy, SciPy, Scikit-learn, pandas…

    • Python objects cannot be used while the GIL is released.

    • If the target is object dtype, the GIL cannot be
    released.
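A small sketch of why releasing the GIL matters for thread-based parallelism. It assumes, as is commonly the case, that the NumPy reduction used here releases the GIL while it runs on a numeric (non-object) array; whether the threads really overlap depends on the operation and on the NumPy build, so the example only demonstrates the pattern, not a speed-up.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

data = np.arange(1_000_000, dtype=np.float64)
chunks = np.array_split(data, 4)      # 4 independent pieces of work

# Each thread reduces one chunk; the per-chunk results are combined afterwards.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_sums = list(pool.map(np.sum, chunks))

total = sum(partial_sums)
print(total)
```

With an object-dtype array the same code would still give a correct result, but the threads would take turns holding the GIL.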

  19. Further Reading
    • “A look inside pandas design and development”
    by Wes McKinney

    • “pandas internals” by Masaaki Horikoshi

  20. pandas for Big Data
    • May face 2 issues:

    • pandas performs computations using a single
    thread.

    • Users have to parallelize by themselves.

    • pandas cannot handle data which exceeds
    physical memory.

    • Users have to write logic using pandas
    “chunk” functionality (for example, the chunksize option of read_csv).
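As an illustration of the hand-written chunk logic mentioned above, here is a minimal sketch that computes a mean over a CSV without loading it at once. The in-memory CSV text and the column name `value` are stand-ins for a large file on disk.

```python
import io
import pandas as pd

# Stand-in for a large CSV file: one column, values 0..9.
csv_text = "value\n" + "\n".join(str(i) for i in range(10))

total = 0
rows = 0
# With chunksize, read_csv yields DataFrames of at most 4 rows each.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    total += chunk['value'].sum()
    rows += len(chunk)

mean = total / rows
print(mean)
```

The user has to keep the running sum and count by hand, which is the bookkeeping that Dask automates.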

  21. Dask

    Flexible Parallel Computation Framework

  22. What is Dask?
    • Dask is a flexible parallel computation framework for numeric
    operations which offers:

    • Data structures like nd-array and DataFrame which
    extend common interfaces like NumPy and pandas.

    • A dynamic task graph and its scheduling, optimized for
    computation.

    • Author: Matthew Rocklin

    • License: BSD

    • GitHub: 1500↑⭐

  23. (Incomplete) List of OSS That Use Dask
    • TFLearn: Deep learning library featuring a
    higher-level API for TensorFlow.

    • Airflow (Distributed Scheduler): A platform to author,
    schedule and monitor workflows.

    • scikit-image: Image Processing SciKit.

    • xarray: N-D labeled arrays and datasets in Python.

    • Blaze: An interface to query data on different
    storage systems.

    • Datashader: A graphics pipeline system for creating
    meaningful representations of large datasets
    quickly and flexibly.

  24. Dask Data Structures
    • Dask provides the following data structures.
    Data Structure    Base Class                   Default Scheduler
    Dask Array        NumPy ndarray                threading
    Dask DataFrame    pandas DataFrame             threading
    Dask Bag          PyToolz (list, set, dict)    multiprocessing (cannot release GIL)

  25. Dask Array
    • Consists of multiple NumPy nd-arrays split
    along axes.
    [Figure labels: NumPy ndarray, Dask Array, Chunk, chunksize]

  26. Dask Array
    import numpy as np
    x = np.ones((10, 10))  # Create 10x10 ndarray
    x
    array([[ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
           [ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
           [ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
           [ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
           [ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
           [ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
           [ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
           [ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
           [ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
           [ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]])
    import dask.array as da
    dx = da.ones((10, 10), chunks=(5, 5))  # Create 10x10 Dask Array specifying 5x5 chunk size
    dx
    dask.array<..., chunksize=(5, 5)>

  27. Dask Array
    import dask.array as da
    dx = da.ones((10, 10), chunks=(5, 5))  # Create 10x10 Dask Array specifying 5x5 chunk size
    dx
    dask.array<..., chunksize=(5, 5)>
    dx.visualize()  # Visualize internal graph using Graphviz
    Each node corresponds to a chunk.

  28. Dask Array
    dy = dx.sum(axis=0)  # Sum of array elements along axis 0
    dy
    dask.array<..., chunksize=(5,)>
    dy.visualize()
    dy.compute()
    array([ 10., 10., 10., 10., 10., 10., 10., 10., 10., 10.])
    [Graph labels: Sum, Sum]

  29. Dask Array
    dy2 = dx.sum()  # Sum of array elements
    dy2
    dask.array<..., chunksize=()>
    dy2.visualize()
    dy2.compute()
    100.0
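The Dask Array steps from the last few slides, collected into one runnable sketch (assuming Dask and NumPy are installed). `numblocks` reports how many chunks exist along each axis.

```python
import dask.array as da

dx = da.ones((10, 10), chunks=(5, 5))   # 2 x 2 grid of 5x5 chunks
col_sum = dx.sum(axis=0)                # lazy: only a task graph so far
total = dx.sum()                        # lazy as well

print(dx.numblocks)
print(col_sum.compute())                # runs the graph, returns a NumPy array
print(total.compute())
```

Nothing is computed until `compute()` is called; until then `col_sum` and `total` are just descriptions of the work.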

  30. Dask Array
    • Dask Array can have arbitrary shape and chunk
    size.

    • Using the same chunk size along each axis is
    recommended, because computations are
    performed per chunk.
    da.ones((30, 20, 1, 15), chunks=(3, 7, 1, 2))
    dask.array<..., chunksize=(3, 7, 1, 2)>
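To see what a requested chunk size turns into, the `chunks` attribute can be inspected. In this sketch the axes of length 20 and 15 are not multiples of their chunk sizes, so the last chunk along those axes is smaller.

```python
import dask.array as da

x = da.ones((30, 20, 1, 15), chunks=(3, 7, 1, 2))

# chunks holds the actual chunk lengths per axis.
print(x.chunks)
# numblocks is the number of chunks per axis.
print(x.numblocks)
```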

  31. Dask DataFrame
    • Consists of multiple pandas DataFrames split
    along index (row labels).
    [Figure labels: pandas DataFrame, Dask DataFrame, partition, division, division]

  32. Dask DataFrame
    import numpy as np
    import pandas as pd
    # Create 10x3 pandas DataFrame
    df = pd.DataFrame({'X': np.arange(10),
                       'Y': np.arange(10, 20),
                       'Z': np.arange(20, 30)},
                      index=list('abcdefghij'))
    df

  33. Dask DataFrame
    import dask.dataframe as dd
    ddf = dd.from_pandas(df, 2)  # split into 2 partitions
    ddf
    [Figure labels: partition, partition, division, division, division]

  34. Dask DataFrame
    ddf + 1  # Add 1 to all elements. Computation is not performed at this point.
    (ddf + 1).compute()  # Trigger computation
    [Figure labels: df, ddf, compute]

  35. Blocked Algorithm (Addition)


    (ddf + 1).compute()
    [Figure labels: pandas DataFrame, Dask DataFrame, Perform computation per partition, Concat, pandas DataFrame (Result)]

  36. Blocked Algorithm (Total)
    ddf.sum().compute()
    [Figure steps: Sum per partition, Concatenate, Sum over partitions]

  37. Blocked Algorithm (Total)
    ddf.sum().visualize()
    [Graph labels: Dask DataFrame, Sum per partition, Concatenate, Sum over partitions]

  38. Blocked Algorithm (Mean)
    ddf.mean().visualize()
    [Graph labels: Sum per partition, Count per partition, Sum over partitions, Count over partitions, Mean = Sum / Count]
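The blocked mean can be reproduced by hand with plain pandas. This is only a sketch of the idea, not Dask's code: the two `iloc` slices stand in for the two partitions, and the variable names are chosen for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'X': np.arange(10),
                   'Y': np.arange(10, 20),
                   'Z': np.arange(20, 30)},
                  index=list('abcdefghij'))

# Stand-ins for the two partitions of the Dask DataFrame.
partitions = [df.iloc[:5], df.iloc[5:]]

# Step 1: sum and count per partition.
# Step 2: concatenate the partial results, then sum them over partitions.
sums = pd.concat([p.sum() for p in partitions], axis=1).sum(axis=1)
counts = pd.concat([p.count() for p in partitions], axis=1).sum(axis=1)

# Step 3: Mean = Sum / Count.
mean = sums / counts
print(mean)
```

The mean cannot be computed by averaging per-partition means when partitions differ in size, which is why the sum and the count are carried separately.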

  39. Blocked Algorithm (Descriptive Statistics)
    ddf.describe().visualize()
    [Graph label: Result]

  40. Dask DataFrame Functionality
    • Vectorized parallel computations

    • Group by (split-apply-combine)

    • Reshaping (merge, join, concat…)

    • Various I/O support (SQL, CSV/TSV…)

    • Flexible time series handling

    • Please refer to the official documentation for details.

  41. Dask Internals
    • All Dask computations are expressed as a Dask
    Graph.

    • The Dask Graph is flexible enough to implement
    more complex algorithms such as linear
    algebra.

  42. Linear Algebra
    • Dask Array implements:
    Function                   Description
    linalg.cholesky            Returns the Cholesky decomposition, A = L L* or A = U* U, of a Hermitian positive-definite matrix A.
    linalg.inv                 Compute the inverse of a matrix with LU decomposition and forward / backward substitutions.
    linalg.lstsq               Return the least-squares solution to a linear matrix equation using QR decomposition.
    linalg.lu                  Compute the lu decomposition of a matrix.
    linalg.qr                  Compute the qr factorization of a matrix.
    linalg.solve               Solve the equation a x = b for x.
    linalg.solve_triangular    Solve the equation a x = b for x, assuming a is a triangular matrix.
    linalg.svd                 Compute the singular value decomposition of a matrix.
    linalg.svd_compressed      Randomly compressed rank-k thin Singular Value Decomposition.
    linalg.tsqr                Direct Tall-and-Skinny QR algorithm.

  43. Example: LU decomposition
    • LU decomposition: A = L x U

    • LU decomposition can be split into computations
    per block.

  44. Blocked LU Decomposition
    • Diagonal block

    • Row direction (i < j)

    • Column direction (i < j)

    * LU: Function to compute the LU decomposition

    * Solve: Function to solve an equation
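The three block rules can be sketched with plain NumPy. This is an illustration under simplifying assumptions, not Dask's implementation: it skips pivoting (Dask's `lu` also returns a permutation), it assumes a square matrix whose size is a multiple of the block size, and `lu_nopivot` and `blocked_lu` are hypothetical helper names.

```python
import numpy as np

def lu_nopivot(a):
    """Unpivoted LU of one small block (hypothetical helper)."""
    n = a.shape[0]
    lower = np.eye(n)
    upper = a.astype(float).copy()
    for k in range(n - 1):
        lower[k + 1:, k] = upper[k + 1:, k] / upper[k, k]
        upper[k + 1:, k:] -= np.outer(lower[k + 1:, k], upper[k, k:])
    return lower, np.triu(upper)

def blocked_lu(a, bs):
    """Blocked LU without pivoting; assumes a is square and bs divides its size."""
    n = a.shape[0]
    nb = n // bs
    L = np.zeros((n, n))
    U = np.zeros((n, n))

    def blk(i):
        return slice(i * bs, (i + 1) * bs)

    for i in range(nb):
        done = slice(0, i * bs)  # rows/columns of the blocks already factorized
        # Diagonal block: LU of A_ii minus the contributions of earlier blocks.
        s = a[blk(i), blk(i)] - L[blk(i), done] @ U[done, blk(i)]
        L[blk(i), blk(i)], U[blk(i), blk(i)] = lu_nopivot(s)
        for j in range(i + 1, nb):
            # Row direction: solve L_ii U_ij = A_ij - sum_k L_ik U_kj for U_ij.
            r = a[blk(i), blk(j)] - L[blk(i), done] @ U[done, blk(j)]
            U[blk(i), blk(j)] = np.linalg.solve(L[blk(i), blk(i)], r)
            # Column direction: solve L_ji U_ii = A_ji - sum_k L_jk U_ki for L_ji.
            c = a[blk(j), blk(i)] - L[blk(j), done] @ U[done, blk(i)]
            L[blk(j), blk(i)] = np.linalg.solve(U[blk(i), blk(i)].T, c.T).T
    return L, U

rng = np.random.default_rng(0)
a = rng.random((9, 9)) + 9 * np.eye(9)   # diagonally dominant, so skipping pivoting is safe
L, U = blocked_lu(a, 3)
print(np.allclose(L @ U, a))
```

Within one step `i`, the row-direction and column-direction blocks depend only on the diagonal block and on earlier steps, so they are independent of each other; that independence is what a task scheduler can exploit.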

  45. Blocked LU Decomposition
    import dask.array as da
    from dask import compute
    arr = da.random.random((9, 9), chunks=(3, 3))
    arr
    dask.array<..., dtype=float64, chunksize=(3, 3)>
    t, l, u = da.linalg.lu(arr)
    t, l, u = compute(t, l, u)

  46. Blocked LU Decomposition
    from dask import visualize
    visualize(t, l, u)

  47. Dask Internals
    • All Dask data structures are represented by a Dask
    Graph.

    • Dask Array and DataFrame operations
    update their Dask Graph.

    • Dask also offers an API to make your own algorithms
    parallel.

  48. Dask Delayed
    • Assume the following simple computation.

    • x and y can be computed in parallel.
    def inc(x):
        return x + 1
    def add(x, y):
        return x + y
    x = inc(1)
    y = inc(5)
    total = add(x, y)
    total
    8

  49. Dask Delayed
    from dask import delayed
    # Using the @delayed decorator makes the wrapped function lazy
    @delayed
    def inc(x):
        return x + 1
    @delayed
    def add(x, y):
        return x + y
    # Delayed functions can be chained, and output a Delayed
    # instance (not evaluated at this point)
    x = inc(1)
    y = inc(5)
    total = add(x, y)
    total
    Delayed('add-b43be476-ffc7-48d7-a8ec-0f95df821e64')
    total.compute()  # Trigger computation
    8

  50. Dask Delayed
    • A chain of Delayed functions is represented with
    a Dask Graph.
    total.visualize()
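The same computation can also be written directly as a Dask graph, which is a plain dictionary. In this sketch the keys 'x', 'y' and 'total' are names picked for the example (Delayed generates its own keys), and the threaded scheduler's `get` function runs the graph.

```python
from dask.threaded import get

def inc(x):
    return x + 1

def add(x, y):
    return x + y

# Keys name the results; a tuple is a task: (function, *arguments).
# Arguments that are keys of the graph refer to other tasks' results.
dsk = {
    'x': (inc, 1),
    'y': (inc, 5),
    'total': (add, 'x', 'y'),
}

result = get(dsk, 'total')
print(result)
```

Because 'x' and 'y' do not depend on each other, the scheduler is free to run them in parallel before 'total'.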

  51. Dask Distributed
    • Dask itself offers 2 types of schedulers which
    work on a single node:

    • threading

    • multiprocessing

    • Dask Distributed package offers a distributed
    scheduler, which distributes tasks over multiple
    nodes.

  52. Dask Distributed
    • A centrally managed, distributed, dynamic task scheduler.

    • Low latency: Each task suffers about 1ms of overhead.

    • Peer-to-peer data sharing: Workers communicate with each
    other to share data.

    • Complex Scheduling: Supports complex workflows.

    • Data Locality: Scheduling algorithms cleverly execute
    computations where data lives.
    [Figure labels: Distributed Client, Distributed Scheduler, Distributed Worker, Distributed Worker]

  53. Demo
    • EC2 m4.xlarge

    • vCPU: 4

    • Memory: 16 GiB

    • 4 Workers with Dask Distributed scheduler

  54. Comparison with Spark
    • Dask works well when:

    • You want to scale an existing NumPy or pandas project.

    • You need parallel / out-of-core processing on a single node.

    • You prototype complex algorithms interactively.

    • You don’t have Big Data infrastructure you can use freely.

    • Spark works well when:

    • You need to scale out to a large number of nodes.

    • Your workflow requirements fit the Spark API (typical ETL or SQL-like ops).

    • You need enterprise support.

  55. References
    • Official Documentation

    • http://dask.pydata.org/en/stable/

    • Dask Tutorial (includes more practical examples)

    • https://github.com/dask/dask-tutorial

    • Matthew Rocklin’s Blog Post

    • http://matthewrocklin.com/blog/

  56. Conclusions
    • Understand the fundamentals of:

    • How pandas handles internal data efficiently.

    • Using Dask to parallelize data processing easily. It
    provides:

    • Data structures like nd-array and DataFrame which
    extend common interfaces like NumPy and pandas.

    • A dynamic task graph and its scheduling, optimized for
    computation.

  57. Interested?
    • Let’s start contributing!

    • pandas

    • https://github.com/pandas-dev/pandas

    • Dask

    • https://github.com/dask/dask