PyConJP 2015: pandas internals

Sinhrks
October 11, 2015

Transcript

  1. pandas internals

  2. Introduction
    • Data Analyst

    • OSS Contributions:

    • PyData Development Team (pandas)

    • Blaze Development Team (Dask)

    • GitHub: https://github.com/sinhrks

  3. Goal
    • Understand:

    • How pandas handles internal data efficiently.

    • Some basic rules to get the most out of
    pandas.

  4. Agenda
    • What is pandas?

    • pandas internals

    • Tips for performance

  5. What is pandas?

  6. What is pandas?
    • “pandas” provides high-performance, easy-to-use
    data structures for data analysis.

    • Its main data structure, the “DataFrame”, corresponds to the data frame in the R language.

    • Author: Wes McKinney

    • License: BSD

    • Etymology: PANel DAta System

    • GitHub: 5,000+ stars

  7. DataFrame
    import pandas as pd
    # Read CSV file
    df = pd.read_csv('adult.csv')
    df
    Adult Dataset taken from the UCI ML Repository:
    Lichman, M. UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science.

  8. DataFrame
    # Select
    df[['age', 'marital-status']]
    # Group by, select, aggregate
    df.groupby('income')['hours-per-week'].mean()

  9. Why pandas?
    • Capable of handling real-world (messy) data

    • Provides intuitive data structures

    • Batteries included for data wrangling

  10. NumPy
    • N-dimensional array (ndarray) and matrix

    • Index (location) based access

    • Single data type
    import numpy as np
    arr = np.array([1, 2, 3, 4], dtype=np.int64)
    arr
    array([1, 2, 3, 4])
    Figure: elements are addressed by location.

  11. pandas
    • In addition to NumPy capabilities:

    • Label based access (Index / Columns)

    • Mixed data types
    Figure labels: Columns, Index, Mixed data types
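
To make the contrast with NumPy concrete, here is a minimal sketch of location-based versus label-based access. The labels, column names, and values are made up for illustration and do not come from the talk.

```python
import numpy as np
import pandas as pd

# NumPy: location-based access, single dtype
arr = np.array([10, 20, 30], dtype=np.int64)
second = arr[1]

# pandas: label-based access and one dtype per column
# (the labels 'a', 'b', 'c' and the column names are illustrative)
df = pd.DataFrame({'count': [10, 20, 30],
                   'ratio': [0.1, 0.2, 0.3],
                   'name': ['x', 'y', 'z']},
                  index=['a', 'b', 'c'])

by_label = df.loc['b', 'count']   # access by index and column labels
by_location = df.iloc[1, 0]       # location-based access still works
column_kinds = {col: dtype.kind for col, dtype in df.dtypes.items()}
```

`column_kinds` should report a different kind for each column, which is the "mixed data types" point of the slide.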

  12. Data Structures
    • Defined per dimension.
    Series: 1D
    DataFrame: 2D
    Panel: 3D
    Colorized cells are labels.

  13. Functionality
    • Vectorized computations

    • Group by (split-apply-combine)

    • Reshaping (merge, join, concat…)

    • Various I/O support (SQL, Excel, …)

    • Flexible time series handling

    • Plotting

    • Please refer to the official documentation for details.

  14. PyData Stacks
    • Cooperative with various scientific packages.
    Figure (packages): Bokeh, matplotlib, scikit-learn, Statsmodels, NumPy, PyTables, SQLAlchemy, Ibis, SciPy, PySpark, Blaze, Jupyter, pandas, rpy2
    Figure (categories): User Interface, Visualization, Big Data, I/O, Computation, Machine Learning, Statistics, Other programming languages

  15. • Is functionality everything?

    • Performance

    • What developers do.

    • What users can do.

  16. pandas internals

  17. pandas internals
    • This section introduces some typical techniques used
    inside pandas.

    • It aims to clarify the basics rather than
    explain algorithmic details.

    • These techniques should help you achieve better
    performance in your own programs.

  18. DataFrame Internals
    • Consists of type-based “Blocks”.
    Figure labels: Columns, Index, …, IntBlock, FloatBlock, ObjectBlock
    Columns may be consolidated per type.

  19. Cython (Language)
    • A superset of Python that additionally supports C functions
    and C types. It can be compiled to C code.
    def ismember(ndarray arr, set values):
        # Type definitions
        cdef:
            Py_ssize_t i, n
            ndarray[uint8_t] result
            object val
        n = len(arr)
        result = np.empty(n, dtype=np.uint8)
        for i in range(n):
            # Get the i-th value of “arr”
            val = util.get_value_at(arr, i)
            result[i] = val in values
        # Return a bool array indicating whether each element of
        # “arr” is included in the “values” set
        return result.view(np.bool_)
    ismember(np.array([1, 2, 3, 4]), set([2, 3]))
    array([False, True, True, False], dtype=bool)
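
For comparison, the same membership test can be sketched without Cython. The snippet below uses NumPy's `np.isin` and a plain Python loop; it only reproduces the result for this input and is not what pandas runs internally.

```python
import numpy as np

arr = np.array([1, 2, 3, 4])
values = {2, 3}

# np.isin expects an array-like, so the set is converted to a list first
mask = np.isin(arr, list(values))

# Plain-Python equivalent of the typed loop on the slide
mask_loop = np.array([v in values for v in arr.tolist()], dtype=bool)
```

The Cython version runs the same loop, but with C-typed loop variables and a typed result buffer.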

  20. Cython
    • Performance critical functions are written in Cython or C.

    • Cited from “Cython: A Guide for Python Programmers” by
    Kurt W. Smith.

    • NOTE: C and Fortran are not included in the table.
    Figure: lines of Cython per project (Sage, pandas, SciPy, scikit-learn, NumPy)

  21. Code Example: Reindex
    df = pd.DataFrame({'X': [1, 2, 3],
                       'Y': [4, 5, 6],
                       'Z': [True, False, True]},
                      index=['a', 'b', 'c'])
    df
    # Reordered by the given index
    df.reindex(['b', 'a', 'c'])

  22. Code Example: Reindex
    • Pseudo code (logic is simplified):
    def reindex(self, given_index):
        # No need to reindex
        if [given_index is equal to self.index]:
            return self.copy()
        # No need to reindex
        if len(given_index) == 0:
            return [empty]
        # Optimized logic for each condition.
        # Uniqueness (“is_unique”) is a cached property that
        # typically divides the logic (see a later slide).
        if self.index.is_unique:
            return [unique index logic]
        else:
            return [non-unique index logic]

  23. Code Example: Reindex
    • Step by step:
    # Get the internal ndarray
    df.values
    array([[1, 4, True],
           [2, 5, False],
           [3, 6, True]], dtype=object)
    # Get the mapping between the given labels and the current index (next slide)
    df.index.get_indexer(['b', 'a', 'c'])
    array([1, 0, 2])
    # Reorder by location
    np.take(df.values, [1, 0, 2], axis=0)
    array([[2, 5, False],
           [1, 4, True],
           [3, 6, True]], dtype=object)
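
The three steps can be checked end to end with public calls only. This is a small sketch using the frame from the earlier slide; it shows that the manual steps and `reindex` agree on this input, not how `reindex` is implemented.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'X': [1, 2, 3],
                   'Y': [4, 5, 6],
                   'Z': [True, False, True]},
                  index=['a', 'b', 'c'])
labels = ['b', 'a', 'c']

# 1. Map the requested labels to integer locations in the current index
indexer = df.index.get_indexer(labels)

# 2. Reorder the rows of the underlying ndarray by location
taken = np.take(df.values, indexer, axis=0)

# The public method arrives at the same values
reindexed = df.reindex(labels)
```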

  24. Code Example: get_indexer
    • Utilizes Cython and a C hash table (klib/khash.h)
    # Type-optimized Cython internals, named “Engine”
    cdef class Int64Engine(IndexEngine):
        cdef initialize(self):

            # Cython wrapper for khash.h
            self.mapping = _hash.Int64HashTable(…)
        def get_indexer(self, values):

            return self.mapping.lookup(values)
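
The idea behind the engine can be sketched in pure Python with a `dict` standing in for the C hash table. `DictEngine` is a hypothetical name for illustration; the real engine is typed Cython on top of khash.

```python
class DictEngine:
    """Toy stand-in for a hash-table-backed index engine (hypothetical)."""

    def __init__(self, labels):
        # Build the label -> location mapping once.
        # With duplicate labels, the last location wins in this toy version.
        self.mapping = {label: loc for loc, label in enumerate(labels)}

    def get_indexer(self, values):
        # Missing labels map to -1, the convention pandas also uses
        return [self.mapping.get(v, -1) for v in values]


engine = DictEngine(['a', 'b', 'c'])
locations = engine.get_indexer(['b', 'a', 'c', 'z'])
```

Building the mapping costs one pass over the labels; each lookup afterwards is a constant-time hash probe on average.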

  25. Code Example: Cache
    • Index is immutable and can cache some
    computations.
    class Index(…):
        # Cythonized memoize decorator
        @cache_readonly(…)
        def is_unique(self):
            # Use the type-optimized Engine
            return self._engine.is_unique
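
A similar effect can be sketched with the standard library's `functools.cached_property` (Python 3.8+). `ToyIndex` is a hypothetical class for illustration; pandas uses its own `cache_readonly` decorator rather than this one.

```python
from functools import cached_property


class ToyIndex:
    """Immutable label container; a hypothetical stand-in, not pandas' Index."""

    def __init__(self, labels):
        self._labels = tuple(labels)   # a tuple cannot be modified afterwards
        self.computations = 0          # counts how often the scan really runs

    @cached_property
    def is_unique(self):
        self.computations += 1
        return len(set(self._labels)) == len(self._labels)


idx = ToyIndex(['a', 'b', 'c'])
first = idx.is_unique
second = idx.is_unique   # served from the cache, no second scan
```

Caching is only safe because the labels never change after construction, which is why immutability and caching appear together on the slide.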

  26. Code Example: Type-Optimized Cython
    • A Python script generates Cython functions for
    each type using templates.
    left_join_template = """
    def left_join_indexer_%(name)s(ndarray[%(c_type)s] left,
                                  ndarray[%(c_type)s] right):
        '''
        Two-pass algorithm for monotonic indexes. Handles many-to-one merges
        '''
        cdef:
            Py_ssize_t i, j, k, nright, nleft, count
            %(c_type)s lval, rval
            ndarray[int64_t] lindexer, rindexer
            ndarray[%(c_type)s] result

    Note: the %(name)s and %(c_type)s placeholders are filled in for each type.
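
The same template technique can be tried in plain Python. In this sketch the template, the type table, and the generated function names are all hypothetical, and the generated source is executed directly instead of being written out for Cython compilation.

```python
# Hypothetical template: %(name)s and %(py_type)s are filled in per type
sum_template = """
def total_%(name)s(values):
    result = %(py_type)s()
    for v in values:
        result += %(py_type)s(v)
    return result
"""

type_table = [('int64', 'int'), ('float64', 'float')]

namespace = {}
for name, py_type in type_table:
    source = sum_template % {'name': name, 'py_type': py_type}
    # A real generator would write the source to a file for compilation
    exec(source, namespace)

total_int64 = namespace['total_int64']
total_float64 = namespace['total_float64']
```

One template yields one specialized function per entry in the type table, so adding a type means adding a row rather than copying a function.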

  27. Release the GIL
    • For parallelism (pandas 0.17.0 and later).
    def duplicated_int64(ndarray[int64_t, ndim=1] values, object keep='first'):
        cdef:
            int ret = 0, value, k
            Py_ssize_t i, n = len(values)
            kh_int64_t * table = kh_init_int64()
            ndarray[uint8_t, ndim=1, cast=True] out = np.empty(n, dtype='bool')
        kh_resize_int64(table, min(n, _SIZE_HINT_LIMIT))

        # (preceding branches are omitted on the slide)
        else:
            # Release the GIL
            with nogil:
                # Use khash directly (the wrapper class cannot be used without the GIL)
                for i from 0 <= i < n:
                    value = values[i]
                    k = kh_get_int64(table, value)
                    if k != table.n_buckets:
                        out[table.vals[k]] = 1
                        out[i] = 1
                    else:
                        k = kh_put_int64(table, value, &ret)
                        table.keys[k] = value
                        table.vals[k] = i
                        out[i] = 0
        kh_destroy_int64(table)
        return out

  28. Further Reading
    • “A look inside pandas design and development”
    by Wes McKinney

  29. Performance Testing
    • airspeed velocity

    • pandas now has 650+ benchmarks
    Comparison between two commits (all benchmarks):
    before     after      ratio  benchmark
    [5049b5]   [53ac28]
    293.20ns   290.10ns   0.99   attrs_caching.getattr_dataframe_index.time_getattr_dataframe_index
    3.13μs     3.08μs     0.98   attrs_caching.setattr_dataframe_index.time_setattr_dataframe_index
    7.45ms     7.23ms     0.97   binary_ops.frame_add.time_frame_add
    4.14ms     4.09ms     0.99   binary_ops.frame_add_no_ne.time_frame_add_no_ne
    4.28ms     4.40ms     1.03   binary_ops.frame_add_st.time_frame_add_st
    21.67ms    21.58ms    1.00   binary_ops.frame_float_div.time_frame_float_div
    5.74ms     5.84ms     1.02   binary_ops.frame_float_div_by_zero.time_frame_float_div_by_zero
    17.90ms    17.81ms    0.99   binary_ops.frame_float_floor_by_zero.time_frame_float_floor_by_zero
    10.49ms    9.97ms     0.95   binary_ops.frame_float_mod.time_frame_float_mod
    5.95ms     6.14ms     1.03   binary_ops.frame_int_div_by_zero.time_frame_int_div_by_zero
    10.64ms    10.64ms    1.00   binary_ops.frame_int_mod.time_frame_int_mod
    7.26ms     7.31ms     1.01   binary_ops.frame_mult.time_frame_mult
    4.14ms     4.10ms     0.99   binary_ops.frame_mult_no_ne.time_frame_mult_no_ne

  30. Performance Testing
    • airspeed velocity

    • Tracks performance changes over time

  31. Tips for performance

  32. Tips for performance
    • This section introduces basic rules that apply to most
    cases from a performance point of view.

    • Some functions are designed for user convenience
    rather than performance.

    • Environment

    • AWS EC2: c4.2xlarge (vCPU: 8, Memory: 15 GiB)

    • Python 3.5.0

    • DISCLAIMER: Performance depends mostly on the actual data
    and operations. Be sure to profile to confirm the effect.

  33. 1. Installation
    • Link NumPy to linear algebra libraries.

    • BLAS/ATLAS, LAPACK

    • Install pandas optional dependencies:
    Bottleneck: A collection of fast NumPy array functions.
    Numexpr: A fast numerical expression evaluator.

  34. 1. Installation
    • Confirm NumPy links
    import numpy.distutils.system_info as sysinfo
    sysinfo.get_info('atlas')
    {'define_macros': [('ATLAS_INFO', '"\\"3.8.4\\""')],
     'include_dirs': ['/home/ec2-user/miniconda/include'],
     'language': 'f77',
     'libraries': ['lapack', 'f77blas', 'cblas', 'atlas'],
     'library_dirs': ['/home/ec2-user/miniconda/lib']}
    sysinfo.get_info('lapack')
    {'language': 'f77',
     'libraries': ['openblas'],
     'library_dirs': ['/home/ec2-user/miniconda/lib']}

  35. 1. Installation
    • Confirm pandas-related environment.
    pd.show_versions()
    INSTALLED VERSIONS
    ------------------
    python: 3.5.0.final.0

    pandas: 0.17.0
    numpy: 1.10.0
    bottleneck: 1.0.0
    numexpr: 2.4.4

  36. 2. Built-in Functions / Methods
    • Check the API docs before writing your own
    user-defined functions (UDFs).

    • Some functions may be faster than NumPy
    depending on conditions.

    • Example: Uniquify
    # Remove duplicates
    np.unique([1, 2, 2, 3, 2, 4])
    array([1, 2, 3, 4])

  37. 2. Built-in Functions / Methods
    # Generate sample data
    np.random.seed(71)
    values = np.random.randint(1, 1000, 1000000)
    values
    array([108, 942, 12, ..., 308, 897, 40])
    # NumPy
    %timeit np.unique(values)
    10 loops, best of 3: 42.2 ms per loop
    # pandas
    %timeit pd.unique(values)
    100 loops, best of 3: 7.1 ms per loop
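
The two functions are not drop-in replacements for each other: `np.unique` returns sorted values, while `pd.unique` returns them in order of first appearance. The sketch below shows that difference and how to time both outside IPython; the array is smaller than on the slide, and no timings are quoted because they depend on the machine and library versions.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.RandomState(71)
values = rng.randint(1, 1000, 100000)   # smaller than the slide's 1,000,000

np_result = np.unique(values)   # sorted unique values
pd_result = pd.unique(values)   # unique values in order of first appearance

# Time a few repetitions of each call
np_time = timeit.timeit(lambda: np.unique(values), number=5)
pd_time = timeit.timeit(lambda: pd.unique(values), number=5)
print('np.unique: %.4f s, pd.unique: %.4f s' % (np_time, pd_time))
```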

  38. 2. Built-in Functions / Methods
    • In particular, avoid “apply”.

    • Example: String concatenation
    Concatenate the strings of the “b” and “c” columns.

  39. 2. Built-in Functions / Methods
    # Preparing random data
    import pandas.util.testing as tm
    chars1 = tm.rands_array(5, 100)
    chars2 = tm.rands_array(5, 10000)
    n = 1000000
    df = pd.DataFrame({'a': np.random.randn(n),
                       'b': tm.choice(chars1, size=n),
                       'c': tm.choice(chars2, size=n)})
    df
    # apply
    def f1(s):
        return s['b'] + s['c']
    %timeit df.apply(f1, axis=1)
    1 loops, best of 3: 14.3 s per loop
    # Vectorized
    %timeit df['b'] + df['c']
    10 loops, best of 3: 92.5 ms per loop
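
Before swapping `apply` for a vectorized expression, it is worth confirming that both give the same result. The sketch below does that on a small frame built with NumPy only; the column contents are illustrative and much smaller than the slide's data.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
letters = np.array(list('abcde'))
n = 1000   # kept small; the slide uses 1,000,000 rows
df = pd.DataFrame({'b': rng.choice(letters, size=n),
                   'c': rng.choice(letters, size=n)})

# Row-wise: calls a Python function once per row
by_apply = df.apply(lambda row: row['b'] + row['c'], axis=1)

# Vectorized: one operation over whole columns
vectorized = df['b'] + df['c']
```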

  40. 3. Repeated Ops
    • pandas methods basically return a copy (not a
    view).

    • Use a single vectorized operation to avoid
    repeated copies.
    # Add 1 to column “a”
    df['a'] + 1

  41. 3. Repeated Ops
    • Example: Arithmetic
    # Copied 2 times
    %timeit df['a'] + 2 - 1
    1000 loops, best of 3: 1.03 ms per loop
    # Copied 1 time
    %timeit df['a'] + 1
    1000 loops, best of 3: 475 µs per loop

  42. 4. Data Types
    • Avoid using the “object” dtype.

    • Note that “str” is regarded as “object” dtype.

    • Example: Group-by → mean
    # Group by the “b” column, which has 100 unique values, then calculate the mean
    df.groupby('b').mean()

  43. 4. Data Types
    • “object” dtype:
    %timeit df.groupby('b').mean()
    10 loops, best of 3: 59.7 ms per loop

    • Convert the grouping column to “Categorical”:
    df['b'] = df['b'].astype('category')
    %timeit df.groupby('b').mean()
    100 loops, best of 3: 17.2 ms per loop

  44. 4. Data Types
    • What is “Categorical”?
    # Create a “Categorical”
    c = pd.Categorical(list('ababcabaca'))
    c
    [a, b, a, b, c, a, b, a, c, a]
    Categories (3, object): [a, b, c]
    # categories: the distinct values a, b, c, each at a fixed location
    c.categories
    Index(['a', 'b', 'c'], dtype='object')
    # codes: the location of each element's category; processing uses “codes” (fast)
    c.codes
    array([0, 1, 0, 1, 2, 0, 1, 0, 2, 0], dtype=int8)
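
To see why integer codes help, here is a sketch that computes group means directly from the codes with `np.bincount` and compares them with `groupby`. It illustrates the idea only and is not a claim about how pandas implements group-by; the `values` array is made up.

```python
import numpy as np
import pandas as pd

labels = np.array(list('ababcabaca'))
values = np.arange(10, dtype=np.float64)

cat = pd.Categorical(labels)
codes = cat.codes   # small integers, one per element

# Group sums and counts computed directly on the integer codes
n_groups = len(cat.categories)
sums = np.bincount(codes, weights=values, minlength=n_groups)
counts = np.bincount(codes, minlength=n_groups)
means_from_codes = dict(zip(cat.categories, sums / counts))

# Reference result from the public API
means_from_groupby = pd.Series(values).groupby(labels).mean().to_dict()
```

Working on codes replaces repeated string hashing and comparison with integer indexing into small arrays.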

  45. 5. Index
    • Index is immutable and caches some calculated
    results.

    • An Index works best when it is sorted, unique and without
    missing values (NaN).

    • Example: Left-outer join by Index
    left.join(right)
    Figure: the “left” and “right” frames are joined on their Index.

  46. 5. Index
    # Left: 1M rows, 2 columns
    np.random.seed(71)
    df_left = pd.DataFrame({'a': np.random.randn(n),
                            'b': np.random.randn(n)})
    # Right: 10K rows, 1 column
    n_right = 10000
    df_right = pd.DataFrame({'c': np.random.randint(1, 100,
                                                    n_right)})
    # Join by sorted unique Index
    %timeit df_left.join(df_right)
    100 loops, best of 3: 6.88 ms per loop
    # Shuffle by random sampling
    df_right_shuffled = df_right.sample(n=len(df_right))
    # Join by non-sorted unique Index
    %timeit df_left.join(df_right_shuffled)
    100 loops, best of 3: 18.7 ms per loop
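
Whether an Index is sorted and unique can be checked cheaply before a join. The sketch below uses tiny illustrative frames and a reversed (rather than randomly shuffled) Index so the outcome is deterministic. Sorting first is one possible remedy, and whether it pays off depends on the data.

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({'a': np.arange(6)})
right = pd.DataFrame({'c': np.arange(6) * 10})
right_reversed = right.iloc[::-1]   # same labels, no longer sorted

# Cheap checks on the Index before joining
was_sorted = right_reversed.index.is_monotonic_increasing
was_unique = right_reversed.index.is_unique

# Sorting the Index once puts it back into sorted, unique form
right_sorted = right_reversed.sort_index()
joined = left.join(right_sorted)
```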

  47. 6. I/O
    • No single solution (read/write, types…)
    • Cited from “Efficiently Store Pandas DataFrames” by
    Matthew Rocklin

  48. 6. I/O
    • Parsing datetimes is likely to be a bottleneck when loading data.

    • ISO-8601
    iso_8601_fmt = '2011-{0:02d}-{1:02d} 00:00:00'
    values = [iso_8601_fmt.format(m, d) for m, d in zip([1, 2], [3, 4])]
    values
    ['2011-01-03 00:00:00', '2011-02-04 00:00:00']
    pd.to_datetime(values)
    DatetimeIndex(['2011-01-03', '2011-02-04'], dtype='datetime64[ns]',
    freq=None, tz=None)

  49. 6. I/O
    • Flexible format (parsed by dateutil)
    mdy_fmt = '{0:02d}/{1:02d}/2011'
    values = [mdy_fmt.format(m, d) for m, d in zip([1, 2], [3, 4])]
    values
    ['01/03/2011', '02/04/2011']
    pd.to_datetime(values)
    DatetimeIndex(['2011-01-03', '2011-02-04'], dtype='datetime64[ns]',
    freq=None, tz=None)

  50. 6. I/O
    # Prepare 10K random combinations of month and day
    N = 10000
    months = np.random.randint(1, 12, N)
    days = np.random.randint(1, 28, N)
    # ISO 8601
    dates = [iso_8601_fmt.format(m, d) for m, d in zip(months, days)]
    %timeit pd.to_datetime(dates)
    100 loops, best of 3: 2.26 ms per loop
    # Flexible month/day/year format (parsed by dateutil)
    dates = [mdy_fmt.format(m, d) for m, d in zip(months, days)]
    %timeit pd.to_datetime(dates)
    1 loops, best of 3: 805 ms per loop
    # Providing the “format” keyword may improve the performance
    %timeit pd.to_datetime(dates, format='%m/%d/%Y')
    10 loops, best of 3: 26.1 ms per loop
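
A quick way to confirm that an explicit `format` parses to the same timestamps as the ISO 8601 strings is sketched below, using the two sample dates from the earlier slides. How much time it saves depends on the pandas version and the data, so no timings are quoted.

```python
import pandas as pd

mdy_dates = ['01/03/2011', '02/04/2011']
iso_dates = ['2011-01-03 00:00:00', '2011-02-04 00:00:00']

# An explicit format avoids guessing the layout of each string
parsed_mdy = pd.to_datetime(mdy_dates, format='%m/%d/%Y')
parsed_iso = pd.to_datetime(iso_dates)
```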

  51. Limitations
    • With pandas, users cannot control:

    • Internal data alignment & consolidation

    • Internal data copy triggered by NumPy

    • Alternatives:

    • Use NumPy directly

    • Use a language that gives lower-level control

  52. Other Tools
    • Computation time

    • Cython

    • Numba

    • Parallelize

    • Dask

    • PySpark

  53. Conclusions
    • Understand:

    • How pandas handles internal data efficiently.

    • Some basic rules to get the most out of
    pandas.

  54. Interested?
    • Let’s start contributing!

    • https://github.com/pydata/pandas
