
PyConJP 2015: pandas internals

Sinhrks
October 11, 2015



Transcript

  1. Introduction
    • Data Analyst
    • OSS Contributions:
      • PyData Development Team (pandas)
      • Blaze Development Team (Dask)
    • GitHub: https://github.com/sinhrks
  2. Goal
    • Understand:
      • How pandas handles internal data efficiently.
      • Some basic rules to get the most out of pandas.
  3. What is pandas?
    • “pandas” provides high-performance, easy-to-use data structures for data analysis.
    • Similar to what is known as “DataFrame” in the R language.
    • Author: Wes McKinney
    • License: BSD
    • Etymology: PANel DAta System
    • GitHub: 5000↑ ⭐️
  4. DataFrame
    • Read a CSV file:

        import pandas as pd
        df = pd.read_csv('adult.csv')
        df

    • Adult Dataset taken from the UCI ML Repository: Lichman, M. UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
  5. Why pandas?
    • Capable of handling real-world (messy) data
    • Provides intuitive data structures
    • Batteries included for data wrangling
  6. NumPy
    • N-dimensional array (ndarray) and matrix
    • Index (location) based access
    • Single data type

        arr = np.array([1, 2, 3, 4], dtype=np.int64)
        arr
        array([1, 2, 3, 4])

    • (Figure: each element is addressed by its location.)
  7. pandas
    • In addition to NumPy capabilities:
      • Label based access (Index / Columns)
      • Mixed data types
    • (Figure labels: Columns, Index, Mixed data types)
  8. Functionality
    • Vectorized computations
    • Group by (split-apply-combine)
    • Reshaping (merge, join, concat…)
    • Various I/O support (SQL, Excel, …)
    • Flexible time series handling
    • Plotting
    • Please refer to the official documentation for details.
  9. PyData Stacks
    • Cooperative with various scientific packages.
    • Figure labels:
      • Roles: User Interface, Visualization, Big Data, IO, Computation, Machine Learning, Statistics, Other Programming languages
      • Packages: Jupyter, Bokeh, matplotlib, PySpark, Blaze, Ibis, PyTables, SQLAlchemy, NumPy, SciPy, Scikit-learn, Statsmodels, rpy2, pandas
  10. pandas internals
    • Introduces some typical techniques used internally in pandas.
    • Aims to clarify the basics rather than explain algorithmic details.
    • Should help you achieve better performance in your own programs.
  11. DataFrame Internals
    • Consists of type-based “Blocks”.
    • (Figure labels: Columns, Index, …, IntBlock, FloatBlock, ObjectBlock)
    • Columns may be consolidated per type.
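The blocks themselves are internal objects, so the sketch below only shows how type-based storage surfaces through the public API. The frame and its column names are illustrative and not taken from the slides.

```python
import pandas as pd

df = pd.DataFrame({'X': [1, 2, 3],
                   'Y': [4.0, 5.0, 6.0],
                   'Z': ['a', 'b', 'c']})

# Each column keeps its own dtype.
dtypes = df.dtypes

# Columns sharing one dtype convert to a homogeneous ndarray.
ints = df[['X']].to_numpy()

# Mixing dtypes forces a common type; with strings present it is object.
mixed = df.to_numpy()

print(ints.dtype, mixed.dtype)
```

Selecting same-typed columns before converting to NumPy avoids the fallback to the object dtype.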
  12. Cython (Language)
    • A superset of Python that additionally supports C functions and C types. Can be compiled to C code.

        def ismember(ndarray arr, set values):
            # Type definitions
            cdef:
                Py_ssize_t i, n
                ndarray[uint8_t] result
                object val

            n = len(arr)
            result = np.empty(n, dtype=np.uint8)
            for i in range(n):
                # Get the i-th value of “arr”
                val = util.get_value_at(arr, i)
                result[i] = val in values
            # Return a bool array indicating whether each “arr” value is included in the “values” set
            return result.view(np.bool_)

        ismember(np.array([1, 2, 3, 4]), set([2, 3]))
        array([False, True, True, False], dtype=bool)
  13. Cython
    • Performance critical functions are written in Cython or C.
    • Table: lines of Cython per project, listing Sage, pandas, SciPy, Scikit-learn and NumPy.
    • Cited from “Cython: A guide for Python programmers” by Kurt W. Smith
    • NOTE: C and Fortran are not included in the table.
  14. Code Example: Reindex

        df = pd.DataFrame({'X': [1, 2, 3], 'Y': [4, 5, 6], 'Z': [True, False, True]},
                          index=['a', 'b', 'c'])
        df

        # Reordered by the given index ['b', 'a', 'c']
        df.reindex(['b', 'a', 'c'])
  15. Code Example: Reindex
    • Pseudo code (logic is simplified):

        def reindex(self, given_index):
            if [given_index is equal to self.index]:
                return self.copy()                # no need to reindex
            if len(given_index) == 0:
                return [empty]                    # no need to reindex
            if self.index.is_unique:
                return [unique index logic]       # optimized logic for each condition
            else:
                return [non-unique index logic]

    • Uniqueness typically divides the logic. “is_unique” is a cached property (later slide).
  16. Code Example: Reindex
    • Step by step:

        # Get the internal ndarray
        df.values
        array([[1, 4, True],
               [2, 5, False],
               [3, 6, True]], dtype=object)

        # Get the mapping between the given labels and the current index (next slide)
        df.index.get_indexer(['b', 'a', 'c'])
        array([1, 0, 2])

        # Reorder by location
        np.take(df.values, [1, 0, 2], axis=0)
        array([[2, 5, False],
               [1, 4, True],
               [3, 6, True]], dtype=object)
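The step-by-step reindex can be checked end to end with public calls only. This is a minimal sketch on a smaller all-integer frame (an assumption made to keep the comparison simple), not the actual pandas code path.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'X': [1, 2, 3], 'Y': [4, 5, 6]},
                  index=['a', 'b', 'c'])

# Map the requested labels to integer positions in the current index.
indexer = df.index.get_indexer(['b', 'a', 'c'])

# Reorder the underlying values by position.
reordered = np.take(df.to_numpy(), indexer, axis=0)

# The public method yields the same values.
expected = df.reindex(['b', 'a', 'c']).to_numpy()

print(indexer.tolist())     # [1, 0, 2]
print(reordered.tolist())   # [[2, 5], [1, 4], [3, 6]]
```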
  17. Code Example: get_indexer
    • Utilize Cython and C hash table (klib/khash.h)

        # Type-optimized Cython internal named “Engine”
        cdef class Int64Engine(IndexEngine):

            cdef initialize(self):
                …
                # Cython wrapper for khash.h
                self.mapping = _hash.Int64HashTable(…)

            def get_indexer(self, values):
                …
                return self.mapping.lookup(values)
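The idea behind the hash-table lookup can be sketched in pure Python, with a dict standing in for the khash table. The helper names `build_mapping` and `get_indexer` below are hypothetical. Returning -1 for a missing label follows the convention of the public `Index.get_indexer`.

```python
def build_mapping(index_labels):
    """Map each label to its position (assumes the labels are unique)."""
    return {label: pos for pos, label in enumerate(index_labels)}


def get_indexer(mapping, targets):
    """Return the position of each target label, or -1 if it is missing."""
    return [mapping.get(label, -1) for label in targets]


mapping = build_mapping(['a', 'b', 'c'])
positions = get_indexer(mapping, ['b', 'a', 'c', 'z'])
print(positions)  # [1, 0, 2, -1]
```

The mapping is built once and each lookup is a constant-time hash probe on average, which is why the engine keeps it around for reuse.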
  18. Code Example: Cache
    • Index is immutable and can cache some computations.

        class Index(…):

            @cache_readonly(…)                    # Cythonized memoize decorator
            def is_unique(self):
                return self._engine.is_unique     # use the type-optimized Engine
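A comparable caching pattern exists in the standard library. The sketch below uses `functools.cached_property` on a hypothetical `ToyIndex` class; it illustrates why immutability makes caching safe and is not the pandas `cache_readonly` implementation.

```python
from functools import cached_property


class ToyIndex:
    """Hypothetical immutable index; not the pandas class."""

    def __init__(self, labels):
        self._labels = tuple(labels)   # tuple: the contents cannot change
        self.computations = 0          # counts how often is_unique is computed

    @cached_property
    def is_unique(self):
        self.computations += 1
        return len(set(self._labels)) == len(self._labels)


idx = ToyIndex(['a', 'b', 'c'])
first = idx.is_unique
second = idx.is_unique   # served from the cache, no recomputation
print(first, second, idx.computations)  # True True 1
```

Because the labels never change after construction, the cached answer can never become stale.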
  19. Code Example: Type-Optimized Cython
    • Python script to generate Cython functions for each type using templates.
    • “%(name)s” and “%(c_type)s” are filled in for each type.

        left_join_template = """
        def left_join_indexer_%(name)s(ndarray[%(c_type)s] left,
                                       ndarray[%(c_type)s] right):
            '''
            Two-pass algorithm for monotonic indexes. Handles many-to-one merges
            '''
            cdef:
                Py_ssize_t i, j, k, nright, nleft, count
                %(c_type)s lval, rval
                ndarray[int64_t] lindexer, rindexer
                ndarray[%(c_type)s] result
            …
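The templating technique can be tried in plain Python. In this sketch the template, the `sum_*` function names and the type list are all made up for illustration, and the generated source is executed directly, whereas a generator for Cython would write it to a source file for compilation.

```python
template = '''
def sum_%(name)s(values):
    """Sum a sequence, coercing each item to %(py_type)s."""
    total = %(py_type)s(0)
    for v in values:
        total += %(py_type)s(v)
    return total
'''

generated = {}
for name, py_type in [('int64', 'int'), ('float64', 'float')]:
    # Fill the placeholders for this type, then define the function.
    source = template % {'name': name, 'py_type': py_type}
    exec(source, generated)

print(generated['sum_int64']([1, 2, 3]))     # 6
print(generated['sum_float64']([1, 2, 3]))   # 6.0
```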
  20. Release the GIL
    • For parallelism (pandas 0.17.0 and later).
    • Use khash directly (the wrapper class cannot be used without the GIL).

        def duplicated_int64(ndarray[int64_t, ndim=1] values, object keep='first'):
            cdef:
                int ret = 0, value, k
                Py_ssize_t i, n = len(values)
                kh_int64_t * table = kh_init_int64()
                ndarray[uint8_t, ndim=1, cast=True] out = np.empty(n, dtype='bool')

            kh_resize_int64(table, min(n, _SIZE_HINT_LIMIT))
            …
            else:
                # Release the GIL
                with nogil:
                    for i from 0 <= i < n:
                        value = values[i]
                        k = kh_get_int64(table, value)
                        if k != table.n_buckets:
                            out[table.vals[k]] = 1
                            out[i] = 1
                        else:
                            k = kh_put_int64(table, value, &ret)
                            table.keys[k] = value
                            table.vals[k] = i
                            out[i] = 0
            kh_destroy_int64(table)
            return out
  21. Performance Testing
    • airspeed velocity
    • pandas now has 650↑ benchmarks
    • Comparison between commits:

        All benchmarks:

            before     after      ratio
           [5049b5 ]  [53ac28 ]
            293.20ns   290.10ns   0.99  attrs_caching.getattr_dataframe_index.time_getattr_dataframe_index
              3.13μs     3.08μs   0.98  attrs_caching.setattr_dataframe_index.time_setattr_dataframe_index
              7.45ms     7.23ms   0.97  binary_ops.frame_add.time_frame_add
              4.14ms     4.09ms   0.99  binary_ops.frame_add_no_ne.time_frame_add_no_ne
              4.28ms     4.40ms   1.03  binary_ops.frame_add_st.time_frame_add_st
             21.67ms    21.58ms   1.00  binary_ops.frame_float_div.time_frame_float_div
              5.74ms     5.84ms   1.02  binary_ops.frame_float_div_by_zero.time_frame_float_div_by_zero
             17.90ms    17.81ms   0.99  binary_ops.frame_float_floor_by_zero.time_frame_float_floor_by_zero
             10.49ms     9.97ms   0.95  binary_ops.frame_float_mod.time_frame_float_mod
              5.95ms     6.14ms   1.03  binary_ops.frame_int_div_by_zero.time_frame_int_div_by_zero
             10.64ms    10.64ms   1.00  binary_ops.frame_int_mod.time_frame_int_mod
              7.26ms     7.31ms   1.01  binary_ops.frame_mult.time_frame_mult
              4.14ms     4.10ms   0.99  binary_ops.frame_mult_no_ne.time_frame_mult_no_ne
  22. Tips for performance
    • Introduces basic rules that apply to most cases from a performance point of view.
    • Some functions are designed for user convenience rather than performance.
    • Environment
      • AWS EC2: c4.2xlarge (vCPU: 8, Memory: 15 GiB)
      • Python 3.5.0
    • DISCLAIMER: Performance mostly depends on the actual data and operations. Be sure to profile to confirm the effect.
  23. 1. Installation
    • Link NumPy to linear algebra libraries.
      • BLAS/ATLAS, LAPACK
    • Install pandas optional dependencies:
      • Bottleneck: A collection of fast NumPy array functions.
      • Numexpr: A fast numerical expression evaluator.
  24. 1. Installation
    • Confirm NumPy links

        import numpy.distutils.system_info as sysinfo

        sysinfo.get_info('atlas')
        {'define_macros': [('ATLAS_INFO', '"\\"3.8.4\\""')],
         'include_dirs': ['/home/ec2-user/miniconda/include'],
         'language': 'f77',
         'libraries': ['lapack', 'f77blas', 'cblas', 'atlas'],
         'library_dirs': ['/home/ec2-user/miniconda/lib']}

        sysinfo.get_info('lapack')
        {'language': 'f77',
         'libraries': ['openblas'],
         'library_dirs': ['/home/ec2-user/miniconda/lib']}
  25. 1. Installation
    • Confirm pandas-related environment.

        pd.show_versions()

        INSTALLED VERSIONS
        ------------------
        python: 3.5.0.final.0
        …
        pandas: 0.17.0
        numpy: 1.10.0
        bottleneck: 1.0.0
        numexpr: 2.4.4
        …
  26. 2. Built-in Functions / Methods
    • Check the API documentation before writing your own user-defined functions (UDFs).
    • Some pandas functions may be faster than their NumPy counterparts, depending on conditions.
    • Example: Uniquify (remove duplicates)

        np.unique([1, 2, 2, 3, 2, 4])
        array([1, 2, 3, 4])
  27. 2. Built-in Functions / Methods
    • Generate sample data:

        np.random.seed(71)
        values = np.random.randint(1, 1000, 1000000)
        values
        array([108, 942, 12, ..., 308, 897, 40])

    • NumPy:

        %timeit np.unique(values)
        10 loops, best of 3: 42.2 ms per loop

    • pandas:

        %timeit pd.unique(values)
        100 loops, best of 3: 7.1 ms per loop
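The two functions also differ in what they return, which is worth knowing before swapping one for the other. A small sketch with made-up input values:

```python
import numpy as np
import pandas as pd

values = np.array([3, 1, 3, 2, 1])

sorted_unique = np.unique(values)   # result comes back sorted
seen_order = pd.unique(values)      # hash-based: order of first appearance

print(sorted_unique.tolist())  # [1, 2, 3]
print(seen_order.tolist())     # [3, 1, 2]
```

If sorted output is required, the pandas result needs an explicit sort, which reduces the speed advantage.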
  28. 2. Built-in Functions / Methods
    • In particular, avoid “apply”.
    • Example: String concatenation
      • Concatenate the strings of the “b” and “c” columns.
  29. 2. Built-in Functions / Methods
    • Preparing random data:

        import pandas.util.testing as tm

        chars1 = tm.rands_array(5, 100)
        chars2 = tm.rands_array(5, 10000)
        n = 1000000
        df = pd.DataFrame({'a': np.random.randn(n),
                           'b': tm.choice(chars1, size=n),
                           'c': tm.choice(chars2, size=n)})
        df

    • apply:

        def f1(s):
            return s['b'] + s['c']

        %timeit df.apply(f1, axis=1)
        1 loops, best of 3: 14.3 s per loop

    • Vectorized:

        %timeit df['b'] + df['c']
        10 loops, best of 3: 92.5 ms per loop
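Both approaches give the same result, so the vectorized form is a drop-in replacement here. A minimal sketch on a tiny hand-made frame (the values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'b': ['ab', 'cd', 'ef'],
                   'c': ['gh', 'ij', 'kl']})

# Row-wise: calls a Python function once per row.
by_apply = df.apply(lambda row: row['b'] + row['c'], axis=1)

# Vectorized: one operation over the whole columns.
vectorized = df['b'] + df['c']

print(vectorized.tolist())  # ['abgh', 'cdij', 'efkl']
```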
  30. 3. Repeated Ops
    • pandas methods basically return a copy (not a view).
    • Use a single vectorized operation to avoid repeated copies.

        # Add a value to column “a”
        df['a'] + 1
  31. 3. Repeated Ops
    • Example: Arithmetic

        # Two operations: the data is copied twice
        %timeit df['a'] + 2 - 1
        1000 loops, best of 3: 1.03 ms per loop

        # One operation: the data is copied once
        %timeit df['a'] + 1
        1000 loops, best of 3: 475 µs per loop
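When the constants are known up front, grouping them lets Python combine the scalars before touching the column. A small sketch under that assumption, with an illustrative frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.arange(5, dtype=np.float64)})

two_steps = df['a'] + 2 - 1     # builds an intermediate Series for df['a'] + 2
one_step = df['a'] + (2 - 1)    # scalars combine first, one pass over the data

print(two_steps.equals(one_step))  # True
```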
  32. 4. Data Types
    • Avoid using the “object” dtype.
    • Note that “str” is regarded as “object” dtype.
    • Example: Group-by → mean
      • Group by the “b” column, which has 100 unique values, then calculate the mean.

        df.groupby('b').mean()
  33. 4. Data Types
    • “object” dtype:

        %timeit df.groupby('b').mean()
        10 loops, best of 3: 59.7 ms per loop

    • Convert the grouping column to “Categorical”:

        df['b'] = df['b'].astype('category')
        %timeit df.groupby('b').mean()
        100 loops, best of 3: 17.2 ms per loop
  34. 4. Data Types
    • Categorical, what?

        # Create a “Categorical”
        c = pd.Categorical(list('ababcabaca'))
        c
        [a, b, a, b, c, a, b, a, c, a]
        Categories (3, object): [a, b, c]

        c.categories
        Index(['a', 'b', 'c'], dtype='object')

        c.codes
        array([0, 1, 0, 1, 2, 0, 1, 0, 2, 0], dtype=int8)

    • Each code is the location of a value in “categories”.
    • Data is processed by “codes” (fast).
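The relationship between codes and categories can be verified directly. This sketch rebuilds the original values from the codes and counts each category with plain integer operations. The use of `np.bincount` is an illustration of why integer codes are cheap to process, not a claim about how pandas groups internally.

```python
import numpy as np
import pandas as pd

c = pd.Categorical(list('ababcabaca'))

# Integer codes index into the categories.
restored = c.categories.to_numpy()[c.codes]

# Counting per category only needs the small integer codes.
counts = np.bincount(c.codes, minlength=len(c.categories))

print(restored.tolist() == list('ababcabaca'))  # True
print(counts.tolist())                          # [5, 3, 2]
```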
  35. 5. Index
    • Index is immutable and caches some calculated results.
    • Better to be sorted, unique and without missing values (NaN).
    • Example: Left-outer join by Index

        left.join(right)
  36. 5. Index
    • Left: 1M rows, 2 columns. Right: 10K rows, 1 column.

        np.random.seed(71)
        df_left = pd.DataFrame({'a': np.random.randn(n),
                                'b': np.random.randn(n)})
        n_right = 10000
        df_right = pd.DataFrame({'c': np.random.randint(1, 100, n_right)})

    • Join by sorted, unique Index:

        %timeit df_left.join(df_right)
        100 loops, best of 3: 6.88 ms per loop

    • Shuffle by random sampling, then join by non-sorted, unique Index:

        df_right_shuffled = df_right.sample(n=len(df_right))
        %timeit df_left.join(df_right_shuffled)
        100 loops, best of 3: 18.7 ms per loop
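Whether an index is in the favorable state can be checked cheaply before joining. A minimal sketch with a small illustrative frame; the `random_state` value is arbitrary.

```python
import numpy as np
import pandas as pd

right = pd.DataFrame({'c': np.arange(5)})
shuffled = right.sample(frac=1, random_state=0)   # same rows, random order

# Properties of the index that can be inspected before a join.
print(right.index.is_monotonic_increasing, right.index.is_unique)

# Sorting restores a monotonic index, which helps before repeated joins.
restored = shuffled.sort_index()
print(restored.index.is_monotonic_increasing)     # True
```

Sorting has its own cost, so it pays off mainly when the same frame is joined more than once.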
  37. 6. I/O
    • No single solution (read/write, types…)
    • Cited from “Efficiently Store Pandas DataFrames” by Matthew Rocklin
    • (Figure from the cited post; surviving label: protocol)
  38. 6. I/O
    • Datetime parsing is likely to be a bottleneck when loading data.
    • ISO-8601

        iso_8601_fmt = '2011-{0:02d}-{1:02d} 00:00:00'
        values = [iso_8601_fmt.format(m, d) for m, d in zip([1, 2], [3, 4])]
        values
        ['2011-01-03 00:00:00', '2011-02-04 00:00:00']

        pd.to_datetime(values)
        DatetimeIndex(['2011-01-03', '2011-02-04'], dtype='datetime64[ns]', freq=None, tz=None)
  39. 6. I/O
    • Flexible format (parsed by dateutil)

        mdy_fmt = '{0:02d}/{1:02d}/2011'
        values = [mdy_fmt.format(m, d) for m, d in zip([1, 2], [3, 4])]
        values
        ['01/03/2011', '02/04/2011']

        pd.to_datetime(values)
        DatetimeIndex(['2011-01-03', '2011-02-04'], dtype='datetime64[ns]', freq=None, tz=None)
  40. 6. I/O
    • Prepare 10K random combinations of month and day:

        N = 10000
        months = np.random.randint(1, 12, N)
        days = np.random.randint(1, 28, N)

    • ISO-8601:

        dates = [iso_8601_fmt.format(m, d) for m, d in zip(months, days)]
        %timeit pd.to_datetime(dates)
        100 loops, best of 3: 2.26 ms per loop

    • Format like MM/DD/YYYY:

        dates = [mdy_fmt.format(m, d) for m, d in zip(months, days)]
        %timeit pd.to_datetime(dates)
        1 loops, best of 3: 805 ms per loop

    • Providing the “format” keyword may improve the performance:

        %timeit pd.to_datetime(dates, format='%m/%d/%Y')
        10 loops, best of 3: 26.1 ms per loop
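A quick way to confirm that an explicit format parses as intended before applying it to a large column. The sample strings below are the two values from the earlier month/day/year slide.

```python
import pandas as pd

dates = ['01/03/2011', '02/04/2011']

# An explicit format tells the parser the field order up front.
parsed = pd.to_datetime(dates, format='%m/%d/%Y')

print(parsed[0])   # 2011-01-03 00:00:00
print(parsed[1])   # 2011-02-04 00:00:00
```

Stating the format also removes the month/day ambiguity in strings such as 01/03/2011.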
  41. Limitations
    • pandas cannot control:
      • Internal data alignment & consolidation
      • Internal data copy triggered by NumPy
    • Alternatives:
      • Use NumPy directly
      • Use languages that can control lower levels
  42. Conclusions
    • Understand:
      • How pandas handles internal data efficiently.
      • Some basic rules to get the most out of pandas.