Slide 1

Slide 1 text

pandas internals

Slide 2

Slide 2 text

Introduction • Data Analyst • OSS Contributions: • PyData Development Team (pandas) • Blaze Development Team (Dask) • GitHub: https://github.com/sinhrks

Slide 3

Slide 3 text

Goal • Understand: • How pandas handles internal data efficiently. • Some basic rules to get the most out of pandas.

Slide 4

Slide 4 text

Agenda • What is pandas? • pandas internals • Tips for performance

Slide 5

Slide 5 text

What is pandas?

Slide 6

Slide 6 text

What is pandas? • “pandas” provides high-performance, easy-to-use data structures for data analysis. • Its core structure, the DataFrame, corresponds to the data frame of the R language. • Author: Wes McKinney • License: BSD • Etymology: PANel DAta System • GitHub: 5,000+ stars

Slide 7

Slide 7 text

DataFrame
• Read a CSV file:
    import pandas as pd
    df = pd.read_csv('adult.csv')
    df
• "Adult" dataset taken from the UCI ML Repository: Lichman, M. UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science.

Slide 8

Slide 8 text

DataFrame
• Select:
    df[['age', 'marital-status']]
• Group by → Select → Aggregate:
    df.groupby('income')['hours-per-week'].mean()
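The select and group-by operations above can be tried on a small frame. This is a minimal sketch: the four rows below are invented stand-ins for adult.csv and are not taken from the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for adult.csv; the values are invented for illustration.
df = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "marital-status": ["Never-married", "Married", "Divorced", "Married"],
    "hours-per-week": [40, 13, 40, 60],
    "income": ["<=50K", "<=50K", "<=50K", ">50K"],
})

# Select: a list of labels returns a DataFrame with only those columns.
selected = df[["age", "marital-status"]]

# Group by -> select -> aggregate: mean working hours per income group.
mean_hours = df.groupby("income")["hours-per-week"].mean()
print(mean_hours)
```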

Slide 9

Slide 9 text

Why pandas? • Handles real-world (messy) data • Provides intuitive data structures • Batteries included for data wrangling

Slide 10

Slide 10 text

NumPy
• N-dimensional array (ndarray) and matrix
• Index (location) based access
• Single data type
    import numpy as np
    arr = np.array([1, 2, 3, 4], dtype=np.int64)
    arr
    array([1, 2, 3, 4])

Slide 11

Slide 11 text

pandas • In addition to NumPy capabilities: • Label based access (Index / Columns) • Mixed data types (Figure: a DataFrame with its Columns and Index labels marked, holding columns of mixed data types.)
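A minimal sketch of the difference between location-based and label-based access. The frame, its labels ("r1", "x", and so on) and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

arr = np.array([1, 2, 3, 4], dtype=np.int64)
# NumPy: elements are addressed by integer location.
second = arr[1]

df = pd.DataFrame(
    {"x": [1, 2, 3], "y": [0.5, 1.5, 2.5], "z": ["a", "b", "c"]},
    index=["r1", "r2", "r3"],
)
# pandas: elements can be addressed by row label and column label.
value = df.loc["r2", "y"]

# Mixed data types: each column keeps its own dtype.
kinds = {name: dtype.kind for name, dtype in df.dtypes.items()}
print(df.dtypes)
```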

Slide 12

Slide 12 text

Data Structures • Defined per dimension: Series (1D), DataFrame (2D), Panel (3D). • Colorized cells in the figure are labels.

Slide 13

Slide 13 text

Functionality • Vectorized computations • Group by (split-apply-combine) • Reshaping (merge, join, concat…) • Various I/O support (SQL, Excel, …) • Flexible time series handling • Plotting • Please refer to the official documentation for details.

Slide 14

Slide 14 text

PyData Stacks • Cooperative with various scientific packages. • Packages shown: Bokeh, matplotlib, Scikit-learn, Statsmodels, NumPy, PyTables, SQLAlchemy, Ibis, SciPy, PySpark, Blaze, Jupyter, pandas, rpy2 • Areas shown: User Interface, Visualization, Big Data, I/O, Computation, Machine Learning, Statistics, Other programming languages

Slide 15

Slide 15 text

• Is functionality everything? • Performance • What developers do. • What users can do.

Slide 16

Slide 16 text

pandas internals

Slide 17

Slide 17 text

pandas internals • Introduces some typical techniques that pandas uses internally. • Aims to clarify the basics rather than explain algorithms in detail. • Should help you achieve better performance in your own programs.

Slide 18

Slide 18 text

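DataFrame Internals • Consists of type-based “Blocks” (IntBlock, FloatBlock, ObjectBlock, …). • Columns may be consolidated per type. (Figure: the Columns and Index of a DataFrame mapped onto an IntBlock, a FloatBlock and an ObjectBlock.)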

Slide 19

Slide 19 text

Cython (Language)
• A superset of Python that additionally supports C functions and C types. Can be compiled to C code.
    def ismember(ndarray arr, set values):
        cdef:
            # Type definitions
            Py_ssize_t i, n
            ndarray[uint8_t] result
            object val
        n = len(arr)
        result = np.empty(n, dtype=np.uint8)
        for i in range(n):
            # Get the i-th value of "arr"
            val = util.get_value_at(arr, i)
            result[i] = val in values
        return result.view(np.bool_)

    ismember(np.array([1, 2, 3, 4]), set([2, 3]))
    array([False, True, True, False], dtype=bool)
• Returns a bool array indicating whether each element of “arr” is included in the “values” set.

Slide 20

Slide 20 text

Cython • Performance-critical functions are written in Cython or C. • Cited from “Cython: A Guide for Python Programmers” by Kurt W. Smith • NOTE: C and Fortran are not included in the table. (Table: lines of Cython in Sage, pandas, SciPy, Scikit-learn and NumPy.)

Slide 21

Slide 21 text

Code Example: Reindex
    df = pd.DataFrame({'X': [1, 2, 3],
                       'Y': [4, 5, 6],
                       'Z': [True, False, True]},
                      index=['a', 'b', 'c'])
    df
    df.reindex(['b', 'a', 'c'])
• Rows are reordered by the given index.

Slide 22

Slide 22 text

Code Example: Reindex
• Pseudo code (logic is simplified):
    def reindex(self, given_index):
        if [given_index is equal to self.index]:
            # No need to reindex
            return self.copy()
        if len(given_index) == 0:
            # No need to reindex
            return [empty]
        # Optimized logic for each condition
        if self.index.is_unique:
            return [unique index logic]
        else:
            return [non-unique index logic]
• Uniqueness typically divides the logic. “is_unique” is a cached property (later slide).

Slide 23

Slide 23 text

Code Example: Reindex
• Step by step:
    # Get the internal ndarray
    df.values
    array([[1, 4, True], [2, 5, False], [3, 6, True]], dtype=object)

    # Get the mapping between the given labels and the current index (next slide)
    df.index.get_indexer(['b', 'a', 'c'])
    array([1, 0, 2])

    # Reorder by location
    np.take(df.values, [1, 0, 2], axis=0)
    array([[2, 5, False], [1, 4, True], [3, 6, True]], dtype=object)
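The three steps above can be checked against `reindex` itself. This is only a sketch of the idea, not the code path pandas actually runs; it uses integer-only columns so that the rebuilt frame keeps the same dtype.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"X": [1, 2, 3], "Y": [4, 5, 6]}, index=["a", "b", "c"])
new_labels = ["b", "a", "c"]

# 1. Map the requested labels to integer locations in the current index.
locs = df.index.get_indexer(new_labels)

# 2. Reorder the underlying ndarray by location.
reordered = np.take(df.to_numpy(), locs, axis=0)

# 3. Wrap the reordered values with the new labels.
manual = pd.DataFrame(reordered, index=new_labels, columns=df.columns)

print(manual)
print(manual.equals(df.reindex(new_labels)))
```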

Slide 24

Slide 24 text

Code Example: get_indexer
• Utilizes Cython and a C hash table (klib/khash.h).
    # Type-optimized Cython internal, named "Engine"
    cdef class Int64Engine(IndexEngine):

        cdef initialize(self):
            …
            # Cython wrapper for khash.h
            self.mapping = _hash.Int64HashTable(…)

        def get_indexer(self, values):
            …
            return self.mapping.lookup(values)
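The hash-table lookup can be pictured in plain Python. In this sketch a `dict` stands in for the khash table, and `build_mapping` and `get_indexer` are hypothetical helper names, not pandas functions. It assumes unique labels and follows the convention of returning -1 for labels that are not found.

```python
def build_mapping(index_values):
    """Map each label to its position (assumes the labels are unique)."""
    return {label: pos for pos, label in enumerate(index_values)}


def get_indexer(mapping, targets):
    """Return the position of each target label, or -1 when it is missing."""
    return [mapping.get(label, -1) for label in targets]


mapping = build_mapping(["a", "b", "c"])   # built once, reused for every lookup
result = get_indexer(mapping, ["b", "a", "z"])
print(result)  # [1, 0, -1]
```

Building the mapping costs one pass over the index. Each lookup afterwards is a constant-time hash probe, which is why the table is worth keeping around.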

Slide 25

Slide 25 text

Code Example: Cache
• Index is immutable and can cache some computations.
    class Index(…):

        # Cythonized memoize decorator
        @cache_readonly(…)
        def is_unique(self):
            # Use the type-optimized Engine
            return self._engine.is_unique
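The caching idea can be sketched with the standard library. Here `functools.cached_property` stands in for pandas' `cache_readonly`, and `TinyIndex` is a hypothetical class, not pandas' `Index`. Caching is only safe because the stored values never change.

```python
from functools import cached_property


class TinyIndex:
    """Hypothetical immutable index used only to illustrate caching."""

    def __init__(self, values):
        self._values = tuple(values)  # immutable storage
        self.scans = 0                # counts how often uniqueness is computed

    @cached_property
    def is_unique(self):
        self.scans += 1
        return len(set(self._values)) == len(self._values)


idx = TinyIndex(["a", "b", "c"])
first = idx.is_unique    # computed here
second = idx.is_unique   # served from the cache
print(first, second, idx.scans)  # True True 1
```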

Slide 26

Slide 26 text

Code Example: Type-Optimized Cython
• A Python script generates Cython functions for each type using templates.
    left_join_template = """
    def left_join_indexer_%(name)s(ndarray[%(c_type)s] left,
                                   ndarray[%(c_type)s] right):
        '''
        Two-pass algorithm for monotonic indexes. Handles many-to-one merges
        '''
        cdef:
            Py_ssize_t i, j, k, nright, nleft, count
            %(c_type)s lval, rval
            ndarray[int64_t] lindexer, rindexer
            ndarray[%(c_type)s] result
        …
• %(name)s and %(c_type)s are filled in for each type.
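The template technique can be tried in plain Python. This sketch generates Python functions rather than Cython ones, and the names `take_%(name)s` and the type list are invented for illustration. The `%`-substitution step is the same one the template above relies on.

```python
# One source template; %(name)s and %(c_type)s are placeholders.
template = """
def take_%(name)s(values, indexer):
    '''Pick items of a %(c_type)s array by location.'''
    return [values[i] for i in indexer]
"""

# Hypothetical list of (function suffix, C type name) pairs.
types = [("int64", "int64_t"), ("float64", "float64_t")]

# Fill the template once per type and join the results into one module source.
source = "".join(template % {"name": n, "c_type": c} for n, c in types)

# Execute the generated source to obtain the specialized functions.
namespace = {}
exec(source, namespace)

print(sorted(k for k in namespace if k.startswith("take_")))
print(namespace["take_float64"]([1.0, 2.0, 3.0], [2, 0]))  # [3.0, 1.0]
```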

Slide 27

Slide 27 text

Release the GIL
• For parallelism (pandas 0.17.0 and later).
    def duplicated_int64(ndarray[int64_t, ndim=1] values, object keep='first'):
        cdef:
            int ret = 0, value, k
            Py_ssize_t i, n = len(values)
            kh_int64_t * table = kh_init_int64()
            ndarray[uint8_t, ndim=1, cast=True] out = np.empty(n, dtype='bool')

        kh_resize_int64(table, min(n, _SIZE_HINT_LIMIT))
        …
        else:
            # Release the GIL
            with nogil:
                for i from 0 <= i < n:
                    value = values[i]
                    # Use khash directly (the wrapper class cannot be used without the GIL)
                    k = kh_get_int64(table, value)
                    if k != table.n_buckets:
                        out[table.vals[k]] = 1
                        out[i] = 1
                    else:
                        k = kh_put_int64(table, value, &ret)
                        table.keys[k] = value
                        table.vals[k] = i
                        out[i] = 0
        kh_destroy_int64(table)
        return out

Slide 28

Slide 28 text

Further Reading • “A look inside pandas design and development” by Wes McKinney

Slide 29

Slide 29 text

Performance Testing
• airspeed velocity
• pandas now has 650+ benchmarks
• Comparison between commits:
    All benchmarks:

        before     after      ratio
        [5049b5 ]  [53ac28 ]
        293.20ns   290.10ns   0.99  attrs_caching.getattr_dataframe_index.time_getattr_dataframe_index
        3.13μs     3.08μs     0.98  attrs_caching.setattr_dataframe_index.time_setattr_dataframe_index
        7.45ms     7.23ms     0.97  binary_ops.frame_add.time_frame_add
        4.14ms     4.09ms     0.99  binary_ops.frame_add_no_ne.time_frame_add_no_ne
        4.28ms     4.40ms     1.03  binary_ops.frame_add_st.time_frame_add_st
        21.67ms    21.58ms    1.00  binary_ops.frame_float_div.time_frame_float_div
        5.74ms     5.84ms     1.02  binary_ops.frame_float_div_by_zero.time_frame_float_div_by_zero
        17.90ms    17.81ms    0.99  binary_ops.frame_float_floor_by_zero.time_frame_float_floor_by_zero
        10.49ms    9.97ms     0.95  binary_ops.frame_float_mod.time_frame_float_mod
        5.95ms     6.14ms     1.03  binary_ops.frame_int_div_by_zero.time_frame_int_div_by_zero
        10.64ms    10.64ms    1.00  binary_ops.frame_int_mod.time_frame_int_mod
        7.26ms     7.31ms     1.01  binary_ops.frame_mult.time_frame_mult
        4.14ms     4.09ms     0.99  binary_ops.frame_mult_no_ne.time_frame_mult_no_ne

Slide 30

Slide 30 text

Performance Testing • airspeed velocity • Shows how benchmark results change over time

Slide 31

Slide 31 text

Tips for performance

Slide 32

Slide 32 text

Tips for performance • Introduces basic rules that apply to most cases from a performance point of view. • Some functions are designed for user convenience rather than performance. • Environment • AWS EC2: c4.2xlarge (vCPU: 8, Memory: 15 GiB) • Python 3.5.0 • DISCLAIMER: Performance depends mostly on the actual data and operations. Be sure to profile to confirm the effect.

Slide 33

Slide 33 text

1. Installation • Link NumPy to linear algebra libraries. • BLAS/ATLAS, LAPACK • Install pandas optional dependencies: • Bottleneck: A collection of fast NumPy array functions. • Numexpr: A fast numerical expression evaluator.

Slide 34

Slide 34 text

1. Installation
• Confirm NumPy links
    import numpy.distutils.system_info as sysinfo

    sysinfo.get_info('atlas')
    {'define_macros': [('ATLAS_INFO', '"\\"3.8.4\\""')],
     'include_dirs': ['/home/ec2-user/miniconda/include'],
     'language': 'f77',
     'libraries': ['lapack', 'f77blas', 'cblas', 'atlas'],
     'library_dirs': ['/home/ec2-user/miniconda/lib']}

    sysinfo.get_info('lapack')
    {'language': 'f77',
     'libraries': ['openblas'],
     'library_dirs': ['/home/ec2-user/miniconda/lib']}

Slide 35

Slide 35 text

1. Installation
• Confirm pandas-related environment.
    pd.show_versions()

    INSTALLED VERSIONS
    ------------------
    python: 3.5.0.final.0
    …
    pandas: 0.17.0
    numpy: 1.10.0
    bottleneck: 1.0.0
    numexpr: 2.4.4
    …

Slide 36

Slide 36 text

2. Built-in Functions / Methods
• Check the API docs before writing a user-defined function (UDF) yourself.
• Some pandas functions may be faster than their NumPy counterparts, depending on conditions.
• Example: Uniquify (remove duplicates)
    np.unique([1, 2, 2, 3, 2, 4])
    array([1, 2, 3, 4])

Slide 37

Slide 37 text

2. Built-in Functions / Methods
    # Generate sample data
    np.random.seed(71)
    values = np.random.randint(1, 1000, 1000000)
    values
    array([108, 942, 12, ..., 308, 897, 40])

    # NumPy
    %timeit np.unique(values)
    10 loops, best of 3: 42.2 ms per loop

    # pandas
    %timeit pd.unique(values)
    100 loops, best of 3: 7.1 ms per loop
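The two functions are not drop-in replacements for each other. `np.unique` returns sorted values, while `pd.unique` is hash-based and returns values in order of first appearance. A small sketch with made-up input:

```python
import numpy as np
import pandas as pd

values = np.array([3, 1, 3, 2, 1])

sorted_uniques = np.unique(values)    # sorted result
ordered_uniques = pd.unique(values)   # order of first appearance, no sort

print(sorted_uniques)   # [1 2 3]
print(ordered_uniques)  # [3 1 2]
```

If sorted output is required, sort the `pd.unique` result afterwards and re-measure, since the extra step changes the comparison.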

Slide 38

Slide 38 text

2. Built-in Functions / Methods • Especially, avoid “apply”. • Example: String concatenation (concatenate the strings of columns “b” and “c”)

Slide 39

Slide 39 text

2. Built-in Functions / Methods
    # Preparing random data
    import pandas.util.testing as tm
    chars1 = tm.rands_array(5, 100)
    chars2 = tm.rands_array(5, 10000)
    n = 1000000
    df = pd.DataFrame({'a': np.random.randn(n),
                       'b': tm.choice(chars1, size=n),
                       'c': tm.choice(chars2, size=n)})
    df

    # apply
    def f1(s):
        return s['b'] + s['c']

    %timeit df.apply(f1, axis=1)
    1 loops, best of 3: 14.3 s per loop

    # Vectorized
    %timeit df['b'] + df['c']
    10 loops, best of 3: 92.5 ms per loop
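The two approaches return the same values, which can be confirmed on a smaller frame. This sketch builds the random strings with NumPy's `default_rng` rather than `pandas.util.testing`; the short strings and the reduced row count are assumptions chosen so that it runs quickly.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(71)
n = 1000  # small on purpose; the slides use 1,000,000 rows
df = pd.DataFrame({
    "b": rng.choice(["ab", "cd", "ef"], size=n),
    "c": rng.choice(["uv", "wx", "yz"], size=n),
})


def f1(row):
    return row["b"] + row["c"]


slow = df.apply(f1, axis=1)  # calls f1 once per row, in Python
fast = df["b"] + df["c"]     # one vectorized operation over whole columns

print(slow.tolist() == fast.tolist())
```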

Slide 40

Slide 40 text

3. Repeated Ops
• pandas methods basically return a copy (not a view).
• Use a single vectorized operation to avoid repeated copies.
• Example: add 1 to column “a”
    df['a'] + 1

Slide 41

Slide 41 text

3. Repeated Ops
• Example: Arithmetic
    # Copied 2 times
    %timeit df['a'] + 2 - 1
    1000 loops, best of 3: 1.03 ms per loop

    # Copied 1 time
    %timeit df['a'] + 1
    1000 loops, best of 3: 475 µs per loop
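One way to apply this rule is to fold the constants before touching the column, so that only one intermediate Series is built. A minimal sketch with made-up data:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(5, dtype="float64"))

two_steps = s + 2 - 1    # builds (s + 2) first, then a second Series for the subtraction
one_step = s + (2 - 1)   # the constants are combined first, so one Series is built

print(two_steps.equals(one_step))
```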

Slide 42

Slide 42 text

4. Data Types
• Avoid using the “object” dtype.
• Note that “str” is regarded as “object” dtype.
• Example: Group-by → mean. Group by column “b”, which has 100 unique values, then calculate the mean.
    df.groupby('b').mean()

Slide 43

Slide 43 text

4. Data Types
• “object” dtype:
    %timeit df.groupby('b').mean()
    10 loops, best of 3: 59.7 ms per loop
• Convert the grouping column to “Categorical”:
    df['b'] = df['b'].astype('category')
    %timeit df.groupby('b').mean()
    100 loops, best of 3: 17.2 ms per loop

Slide 44

Slide 44 text

4. Data Types
• Categorical, what?
    # Create a "Categorical"
    c = pd.Categorical(list('ababcabaca'))
    c
    [a, b, a, b, c, a, b, a, c, a]
    Categories (3, object): [a, b, c]

    c.categories
    Index(['a', 'b', 'c'], dtype='object')

    c.codes
    array([0, 1, 0, 1, 2, 0, 1, 0, 2, 0], dtype=int8)
• Each code is the location of a value in “categories” (a → 0, b → 1, c → 2).
• Operations are processed by “codes” (fast).
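The relationship between “categories” and “codes” can be verified by rebuilding the original values from the two parts. The sketch also checks that grouping by the categorical column gives the same means as grouping by the original strings. The small frame and its column “a” are assumptions for illustration.

```python
import pandas as pd

letters = list("ababcabaca")
c = pd.Categorical(letters)

# Each code is a location in c.categories, so take() rebuilds the values.
rebuilt = list(c.categories.take(c.codes))

df = pd.DataFrame({"b": letters, "a": range(10)})
by_object = df.groupby("b")["a"].mean()

df["b"] = df["b"].astype("category")
# observed=True groups only by categories that actually occur in the data.
by_category = df.groupby("b", observed=True)["a"].mean()

print(rebuilt == letters)
print(by_category)
```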

Slide 45

Slide 45 text

5. Index
• Index is immutable and caches some calculated results.
• Better to be sorted, unique and without missing values (NaN).
• Example: Left-outer join by Index
    left.join(right)

Slide 46

Slide 46 text

5. Index
    np.random.seed(71)
    # Left: 1M rows, 2 columns
    df_left = pd.DataFrame({'a': np.random.randn(n),
                            'b': np.random.randn(n)})
    # Right: 10K rows, 1 column
    n_right = 10000
    df_right = pd.DataFrame({'c': np.random.randint(1, 100, n_right)})

    # Join by sorted, unique Index
    %timeit df_left.join(df_right)
    100 loops, best of 3: 6.88 ms per loop

    # Shuffle by random sampling
    df_right_shuffled = df_right.sample(n=len(df_right))

    # Non-sorted, unique Index
    %timeit df_left.join(df_right_shuffled)
    100 loops, best of 3: 18.7 ms per loop
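Before joining, the state of an Index can be inspected through its properties and repaired with `sort_index`. This is a minimal sketch on tiny made-up frames. Whether sorting pays off depends on the data size and on how often the frame is reused, so it is worth timing on real data.

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({"a": np.arange(6)})
right = pd.DataFrame({"c": np.arange(4) * 10})

# Shuffle the rows; the index labels travel with them.
shuffled = right.sample(frac=1, random_state=71)
print(shuffled.index.is_unique, shuffled.index.is_monotonic_increasing)

# Sorting restores a sorted, unique index before the join.
restored = shuffled.sort_index()
joined = left.join(restored)
print(joined)
```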

Slide 47

Slide 47 text

6. I/O • No single solution (read/write, types…) • Cited from “Efficiently Store Pandas DataFrames” by Matthew Rocklin

Slide 48

Slide 48 text

6. I/O
• Parsing datetimes is likely to be a bottleneck (on load).
• ISO-8601
    iso_8601_fmt = '2011-{0:02d}-{1:02d} 00:00:00'
    values = [iso_8601_fmt.format(m, d) for m, d in zip([1, 2], [3, 4])]
    values
    ['2011-01-03 00:00:00', '2011-02-04 00:00:00']

    pd.to_datetime(values)
    DatetimeIndex(['2011-01-03', '2011-02-04'], dtype='datetime64[ns]', freq=None, tz=None)

Slide 49

Slide 49 text

6. I/O
• Flexible format (parsed by dateutil)
    mdy_fmt = '{0:02d}/{1:02d}/2011'
    values = [mdy_fmt.format(m, d) for m, d in zip([1, 2], [3, 4])]
    values
    ['01/03/2011', '02/04/2011']

    pd.to_datetime(values)
    DatetimeIndex(['2011-01-03', '2011-02-04'], dtype='datetime64[ns]', freq=None, tz=None)

Slide 50

Slide 50 text

6. I/O
    # Prepare 10K random combinations of month and day
    N = 10000
    months = np.random.randint(1, 12, N)
    days = np.random.randint(1, 28, N)

    # ISO-8601
    dates = [iso_8601_fmt.format(m, d) for m, d in zip(months, days)]
    %timeit pd.to_datetime(dates)
    100 loops, best of 3: 2.26 ms per loop

    # Flexible format (mdy_fmt)
    dates = [mdy_fmt.format(m, d) for m, d in zip(months, days)]
    %timeit pd.to_datetime(dates)
    1 loops, best of 3: 805 ms per loop

    # Providing the "format" keyword may improve the performance
    %timeit pd.to_datetime(dates, format='%m/%d/%Y')
    10 loops, best of 3: 26.1 ms per loop
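Besides speed, an explicit `format` also removes ambiguity: a string such as "01/03/2011" can be read month-first or day-first. A small sketch with made-up dates follows. The timings above come from pandas 0.17.0 and may differ in other versions.

```python
import pandas as pd

dates = ["01/03/2011", "02/04/2011"]

# Month first: 3 January and 4 February.
month_first = pd.to_datetime(dates, format="%m/%d/%Y")

# Day first: the same strings read as 1 March and 2 April.
day_first = pd.to_datetime(dates, format="%d/%m/%Y")

print(month_first)
print(day_first)
```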

Slide 51

Slide 51 text

Limitations • pandas cannot control: • Internal data alignment & consolidation • Internal data copies triggered by NumPy • Alternatives: • Use NumPy directly • Use languages that can control lower levels

Slide 52

Slide 52 text

Other Tools • Computation time • Cython • Numba • Parallelize • Dask • PySpark

Slide 53

Slide 53 text

Conclusions • Understand: • How pandas handles internal data efficiently. • Some basic rules to get the most out of pandas.

Slide 54

Slide 54 text

Interested? • Let’s start contributing! • https://github.com/pydata/pandas