PyConJP 2015: pandas internals

pandas internals

Introduction
• Data Analyst
• OSS Contributions:
  • PyData Development Team (pandas)
  • Blaze Development Team (Dask)
• GitHub: https://github.com/sinhrks

Goal
• Understand:
  • How pandas handles internal data efficiently.
  • Some basic rules to get the most out of pandas.

Agenda
• What is pandas?
• pandas internals
• Tips for performance

What is pandas?

What is pandas?
• “pandas” provides high-performance, easy-to-use data structures for data analysis.
• Its core structure, the DataFrame, corresponds to the data.frame of the R language.
• Author: Wes McKinney
• License: BSD
• Etymology: PANel DAta System
• GitHub: 5,000+ stars

DataFrame
    import pandas as pd
    df = pd.read_csv('adult.csv')   # Read csv file
    df
• Adult Data Set taken from the UCI ML Repository:
  Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

DataFrame
    df[['age', 'marital-status']]                    # Select
    df.groupby('income')['hours-per-week'].mean()    # Group by, Select, Aggregate

Why pandas?
• Handles real-world (messy) data
• Provides intuitive data structures
• Batteries included for data wrangling

NumPy
• N-dimensional array (ndarray) and matrix
• Index (location) based access
• Single data type
    arr = np.array([1, 2, 3, 4], dtype=np.int64)
    arr
    array([1, 2, 3, 4])

pandas
• In addition to NumPy capabilities:
  • Label based access (Index / Columns)
  • Mixed data types
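A minimal sketch of the two additions listed above. The frame, its column names, and its row labels are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical frame: one integer column, one float column, string row labels.
df = pd.DataFrame(
    {"X": [1, 2, 3], "Y": [0.5, 1.5, 2.5]},
    index=["r1", "r2", "r3"],
)

# Label-based access: row label "r2", column label "Y".
by_label = df.loc["r2", "Y"]

# Location-based access, as with a plain ndarray: second row, second column.
by_location = df.iloc[1, 1]

# Each column keeps its own dtype.
dtypes = {name: str(dtype) for name, dtype in df.dtypes.items()}
print(by_label, by_location, dtypes)
```

Both lookups reach the same cell; only the addressing differs.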

Data Structures
• Defined per dimension:
  • Series (1D)
  • DataFrame (2D)
  • Panel (3D)
• In the figure, colorized cells are labels.

Functionality
• Vectorized computations
• Group by (split-apply-combine)
• Reshaping (merge, join, concat…)
• Various I/O support (SQL, Excel, …)
• Flexible time series handling
• Plotting
• Please refer to the official documentation for details.

PyData Stacks
• Cooperative with various scientific packages.
• Packages in the figure: Bokeh, matplotlib, scikit-learn, statsmodels, NumPy, PyTables, SQLAlchemy, Ibis, SciPy, PySpark, Blaze, Jupyter, pandas, rpy2
• Categories in the figure: User Interface, Visualization, Big Data, I/O, Computation, Machine Learning, Statistics, Other Programming Languages

• Is functionality everything?
• Performance
  • What developers do.
  • What users can do.

pandas internals

pandas internals
• Introduces some typical techniques used internally in pandas.
• Aims to clarify the basics rather than explain algorithmic details.
• Expected to help you achieve better performance in your own programs.

DataFrame Internals
• Consists of type-based “Blocks”.
• Columns may be consolidated per type.
• Figure labels: Columns, Index, IntBlock, FloatBlock, ObjectBlock, …
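The idea of consolidating columns per type can be pictured with public API only. This is a conceptual sketch under that assumption: it groups column names by dtype and does not inspect the actual internal blocks. The column names are hypothetical.

```python
import pandas as pd

# Hypothetical frame with interleaved integer and float columns.
df = pd.DataFrame({"a": [1, 2], "b": [1.0, 2.0], "c": [3, 4], "d": [0.5, 0.25]})

# Group column names by dtype, mimicking how same-typed columns
# could be stored together in one block.
columns_by_dtype = {}
for name, dtype in df.dtypes.items():
    columns_by_dtype.setdefault(str(dtype), []).append(name)

print(columns_by_dtype)
```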

Cython (Language)
• A superset of Python that additionally supports C functions and C types. It can be compiled to C code.

    def ismember(ndarray arr, set values):
        cdef:
            # Type definitions
            Py_ssize_t i, n
            ndarray[uint8_t] result
            object val

        n = len(arr)
        result = np.empty(n, dtype=np.uint8)
        for i in range(n):
            # Get the i-th value of "arr"
            val = util.get_value_at(arr, i)
            result[i] = val in values
        # Return a bool array indicating whether each element of "arr"
        # is included in the "values" set
        return result.view(np.bool_)

    ismember(np.array([1, 2, 3, 4]), set([2, 3]))
    array([False, True, True, False], dtype=bool)
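For comparison, the same membership test is available without Cython through NumPy's public `np.isin`. This is an equivalent at the user level, not the pandas internal shown above.

```python
import numpy as np

arr = np.array([1, 2, 3, 4])
values = [2, 3]  # np.isin expects an array-like, so a list is used instead of a set

# Element-wise test of whether each item of arr appears in values.
mask = np.isin(arr, values)
print(mask)
```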

Cython
• Performance-critical functions are written in Cython or C.
• Chart: lines of Cython in Sage, pandas, SciPy, scikit-learn, and NumPy. Cited from “Cython: A Guide for Python Programmers” by Kurt W. Smith.
• NOTE: C and Fortran are not included in the table.

Code Example: Reindex
    df = pd.DataFrame({'X': [1, 2, 3],
                       'Y': [4, 5, 6],
                       'Z': [True, False, True]},
                      index=['a', 'b', 'c'])
    df
    df.reindex(['b', 'a', 'c'])    # Reordered by given index ['b', 'a', 'c']

Code Example: Reindex
• Pseudo code (logic is simplified):
    def reindex(self, given_index):
        if [given_index is equal to self.index]:
            return self.copy()                 # No need to reindex
        if len(given_index) == 0:
            return [empty]                     # No need to reindex
        # Optimized logics for each condition
        if self.index.is_unique:
            return [unique index logic]
        else:
            return [non-unique index logic]
• Uniqueness (“is_unique”) is a cached property and typically divides the logic (see a later slide).

Code Example: Reindex
• Step by step:
    df.values                                  # Get internal ndarray
    array([[1, 4, True],
           [2, 5, False],
           [3, 6, True]], dtype=object)
    df.index.get_indexer(['b', 'a', 'c'])      # Get mapping between given labels and current index (next slide)
    array([1, 0, 2])
    np.take(df.values, [1, 0, 2], axis=0)      # Reorder by location
    array([[2, 5, False],
           [1, 4, True],
           [3, 6, True]], dtype=object)
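The three steps can be run end to end and checked against `reindex` itself. This sketch uses a reduced, hypothetical frame with integer columns only, so the intermediate array has a numeric dtype.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"X": [1, 2, 3], "Y": [4, 5, 6]}, index=["a", "b", "c"])
new_labels = ["b", "a", "c"]

# Step 1: map the requested labels to positions in the current index.
indexer = df.index.get_indexer(new_labels)

# Step 2: reorder the underlying array by those positions.
reordered = np.take(df.to_numpy(), indexer, axis=0)

# Check: the manual result holds the same values as DataFrame.reindex.
matches = np.array_equal(reordered, df.reindex(new_labels).to_numpy())
print(indexer, matches)
```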

Code Example: get_indexer
• Utilizes Cython and a C hash table (klib/khash.h).
    # Type-optimized Cython internal named "Engine"
    cdef class Int64Engine(IndexEngine):
        cdef initialize(self):
            …
            self.mapping = _hash.Int64HashTable(…)    # Cython wrapper for khash.h
        def get_indexer(self, values):
            …
            return self.mapping.lookup(values)
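The lookup idea can be sketched in pure Python, with a dict standing in for the C hash table. The helper names `build_mapping` and `lookup` are hypothetical, and this is not the pandas implementation. Returning -1 for a missing label follows the convention of `Index.get_indexer`.

```python
def build_mapping(labels):
    """Build a label -> position table once (assumes unique labels)."""
    return {label: position for position, label in enumerate(labels)}


def lookup(mapping, targets):
    """Return the position of each target label, or -1 if it is absent."""
    return [mapping.get(target, -1) for target in targets]


mapping = build_mapping(["a", "b", "c"])
positions = lookup(mapping, ["b", "a", "c", "z"])
print(positions)
```

Building the table costs one pass over the index; each later lookup is a constant-time hash probe rather than a scan.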

Code Example: Cache
• Index is immutable and can cache some computations.
    class Index(…):
        @cache_readonly(…)                     # Cythonized memoize decorator
        def is_unique(self):
            return self._engine.is_unique      # Use type-optimized Engine
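The same caching pattern can be illustrated with the standard library's `functools.cached_property`, used here as a stand-in for the Cython decorator. `TinyIndex` is a hypothetical class written only for this sketch.

```python
from functools import cached_property


class TinyIndex:
    """Hypothetical immutable index that counts how often uniqueness is computed."""

    def __init__(self, labels):
        self._labels = tuple(labels)  # immutable storage makes caching safe
        self.computations = 0

    @cached_property
    def is_unique(self):
        # Runs only on first access; the result is stored on the instance.
        self.computations += 1
        return len(set(self._labels)) == len(self._labels)


idx = TinyIndex(["a", "b", "c"])
first = idx.is_unique
second = idx.is_unique
print(first, second, idx.computations)
```

Caching is safe here only because the labels never change after construction, which is the same reason the slide gives for Index.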

Code Example: Type-Optimized Cython
• A Python script generates Cython functions for each type using templates.
    left_join_template = """
    def left_join_indexer_%(name)s(ndarray[%(c_type)s] left,
                                   ndarray[%(c_type)s] right):
        '''
        Two-pass algorithm for monotonic indexes. Handles many-to-one merges
        '''
        cdef:
            Py_ssize_t i, j, k, nright, nleft, count
            %(c_type)s lval, rval
            ndarray[int64_t] lindexer, rindexer
            ndarray[%(c_type)s] result
        …
• %(name)s and %(c_type)s are filled in for each type.
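The templating technique itself can be tried in plain Python. In this sketch the template, the function names, and the type list are hypothetical; it generates Python functions rather than Cython ones, and uses `exec` only to keep the example self-contained.

```python
# Source template with %(name)s and %(py_type)s placeholders, one function per type.
template = '''
def total_%(name)s(values):
    """Sum values after converting each one to %(py_type)s."""
    result = %(py_type)s(0)
    for value in values:
        result += %(py_type)s(value)
    return result
'''

generated = {}
for name, py_type in [("int64", "int"), ("float64", "float")]:
    source = template % {"name": name, "py_type": py_type}
    exec(source, generated)  # define the generated function in the namespace

int_total = generated["total_int64"]([1, 2, 3])
float_total = generated["total_float64"]([1, 2])
print(int_total, float_total)
```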

Release the GIL
• For parallelism (pandas 0.17.0 and later).
    def duplicated_int64(ndarray[int64_t, ndim=1] values, object keep='first'):
        cdef:
            int ret = 0, value, k
            Py_ssize_t i, n = len(values)
            kh_int64_t * table = kh_init_int64()
            ndarray[uint8_t, ndim=1, cast=True] out = np.empty(n, dtype='bool')

        kh_resize_int64(table, min(n, _SIZE_HINT_LIMIT))
        …
        else:
            with nogil:    # Release the GIL
                for i from 0 <= i < n:
                    value = values[i]
                    # Use khash directly (unable to use wrapper class without GIL)
                    k = kh_get_int64(table, value)
                    if k != table.n_buckets:
                        out[table.vals[k]] = 1
                        out[i] = 1
                    else:
                        k = kh_put_int64(table, value, &ret)
                        table.keys[k] = value
                        table.vals[k] = i
                        out[i] = 0
        kh_destroy_int64(table)
        return out
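The visible branch marks every occurrence of a repeated value. A pure-Python sketch of that logic, with a dict in place of khash, may make the loop easier to follow. The function name is hypothetical, and nothing here releases the GIL.

```python
def mark_all_duplicates(values):
    """Flag every element whose value occurs more than once."""
    first_position = {}           # value -> position where it was first seen
    out = [False] * len(values)
    for i, value in enumerate(values):
        if value in first_position:
            # Seen before: flag both the first occurrence and this one.
            out[first_position[value]] = True
            out[i] = True
        else:
            first_position[value] = i
    return out


flags = mark_all_duplicates([1, 2, 1, 3, 2])
print(flags)
```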

Further Reading
• “A look inside pandas design and development” by Wes McKinney

Performance Testing
• airspeed velocity
• pandas now has 650+ benchmarks
• Comparison between commits:

    All benchmarks:

        before     after       ratio
      [5049b5  ] [53ac28  ]
        293.20ns   290.10ns     0.99  attrs_caching.getattr_dataframe_index.time_getattr_dataframe_index
          3.13μs     3.08μs     0.98  attrs_caching.setattr_dataframe_index.time_setattr_dataframe_index
          7.45ms     7.23ms     0.97  binary_ops.frame_add.time_frame_add
          4.14ms     4.09ms     0.99  binary_ops.frame_add_no_ne.time_frame_add_no_ne
          4.28ms     4.40ms     1.03  binary_ops.frame_add_st.time_frame_add_st
         21.67ms    21.58ms     1.00  binary_ops.frame_float_div.time_frame_float_div
          5.74ms     5.84ms     1.02  binary_ops.frame_float_div_by_zero.time_frame_float_div_by_zero
         17.90ms    17.81ms     0.99  binary_ops.frame_float_floor_by_zero.time_frame_float_floor_by_zero
         10.49ms     9.97ms     0.95  binary_ops.frame_float_mod.time_frame_float_mod
          5.95ms     6.14ms     1.03  binary_ops.frame_int_div_by_zero.time_frame_int_div_by_zero
         10.64ms    10.64ms     1.00  binary_ops.frame_int_mod.time_frame_int_mod
          7.26ms     7.31ms     1.01  binary_ops.frame_mult.time_frame_mult
          4.14ms     4.10ms     0.99  binary_ops.frame_mult_no_ne.time_frame_mult_no_ne

Performance Testing
• airspeed velocity
• Changes with the passage of time

Tips for performance

Tips for performance
• Introduces basic rules that apply to most cases from a performance point of view.
• Some functions are designed for the user's convenience rather than for performance.
• Environment
  • AWS EC2: c4.2xlarge (vCPU: 8, Memory: 15 GiB)
  • Python 3.5.0
• DISCLAIMER: Performance depends heavily on the actual data and operations. Be sure to profile to confirm the effectiveness.

1. Installation
• Link NumPy to linear algebra libraries.
  • BLAS/ATLAS, LAPACK
• Install pandas optional dependencies:
  • Bottleneck: A collection of fast NumPy array functions.
  • Numexpr: A fast numerical expression evaluator.

1. Installation
• Confirm NumPy links
    import numpy.distutils.system_info as sysinfo
    sysinfo.get_info('atlas')
    {'define_macros': [('ATLAS_INFO', '"\\"3.8.4\\""')],
     'include_dirs': ['/home/ec2-user/miniconda/include'],
     'language': 'f77',
     'libraries': ['lapack', 'f77blas', 'cblas', 'atlas'],
     'library_dirs': ['/home/ec2-user/miniconda/lib']}
    sysinfo.get_info('lapack')
    {'language': 'f77',
     'libraries': ['openblas'],
     'library_dirs': ['/home/ec2-user/miniconda/lib']}
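`numpy.distutils` has since been deprecated and is not available on newer Python versions, so the calls above may fail in a current environment. As an alternative, NumPy's public `np.show_config()` prints the BLAS/LAPACK libraries the installed build was linked against. The output format varies by NumPy version, so none is shown here.

```python
import numpy as np

# Print build-time configuration, including the linked BLAS/LAPACK libraries.
np.show_config()
```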

1. Installation
• Confirm pandas-related environment.
    pd.show_versions()
    INSTALLED VERSIONS
    ------------------
    python: 3.5.0.final.0
    …
    pandas: 0.17.0
    numpy: 1.10.0
    bottleneck: 1.0.0
    numexpr: 2.4.4
    …

2. Built-in Functions / Methods
• Check the API docs before writing a user-defined function (UDF) yourself.
• Some pandas functions may be faster than their NumPy counterparts, depending on conditions.
• Example: Uniquify
    np.unique([1, 2, 2, 3, 2, 4])    # Remove duplicates
    array([1, 2, 3, 4])

2. Built-in Functions / Methods
    # Generate sample data
    np.random.seed(71)
    values = np.random.randint(1, 1000, 1000000)
    values
    array([108, 942, 12, ..., 308, 897, 40])

    # NumPy
    %timeit np.unique(values)
    10 loops, best of 3: 42.2 ms per loop

    # pandas
    %timeit pd.unique(values)
    100 loops, best of 3: 7.1 ms per loop
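The two functions are not drop-in replacements: `np.unique` returns sorted values, while `pd.unique` is hash-table based and returns values in order of first appearance. A small check on hypothetical data:

```python
import numpy as np
import pandas as pd

values = np.array([3, 1, 3, 2, 1])

sorted_unique = np.unique(values)       # sorted result
appearance_unique = pd.unique(values)   # order of first appearance, no sort

print(sorted_unique, appearance_unique)
```

If sorted output is required, the pandas result needs an explicit sort, which should be included when comparing timings.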

2. Built-in Functions / Methods
• In particular, avoid “apply”.
• Example: String concatenation (concatenate the strings of the “b” and “c” columns)

2. Built-in Functions / Methods
    # Preparing random data
    import pandas.util.testing as tm
    chars1 = tm.rands_array(5, 100)
    chars2 = tm.rands_array(5, 10000)
    n = 1000000
    df = pd.DataFrame({'a': np.random.randn(n),
                       'b': tm.choice(chars1, size=n),
                       'c': tm.choice(chars2, size=n)})
    df

    # apply
    def f1(s):
        return s['b'] + s['c']
    %timeit df.apply(f1, axis=1)
    1 loops, best of 3: 14.3 s per loop

    # Vectorized
    %timeit df['b'] + df['c']
    10 loops, best of 3: 92.5 ms per loop
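A small sketch confirming that both forms give the same result, so the vectorized one can replace the row-wise `apply` directly. The three-row frame is hypothetical and no timings are claimed for it.

```python
import pandas as pd

df = pd.DataFrame({"b": ["ab", "cd", "ef"], "c": ["x", "y", "z"]})


def concat_row(row):
    # Called once per row by apply(axis=1).
    return row["b"] + row["c"]


row_wise = df.apply(concat_row, axis=1)
vectorized = df["b"] + df["c"]   # one operation over whole columns

same = row_wise.tolist() == vectorized.tolist()
print(vectorized.tolist(), same)
```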

3. Repeated Ops
• pandas methods generally return a copy (not a view).
• Use a single vectorized operation to avoid repeated copies.
    df['a'] + 1    # Add 1 to column “a”

3. Repeated Ops
• Example: Arithmetic
    %timeit df['a'] + 2 - 1    # Copied 2 times
    1000 loops, best of 3: 1.03 ms per loop
    %timeit df['a'] + 1        # Copied 1 time
    1000 loops, best of 3: 475 µs per loop
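A minimal sketch of the rewrite on a hypothetical integer Series: the chained form builds an intermediate result for `s + 2` before subtracting, while the folded form produces the answer in one operation. Integers are used so the two results compare exactly.

```python
import pandas as pd

s = pd.Series([1, 2, 3])

two_steps = s + 2 - 1   # intermediate Series for s + 2, then a second one
one_step = s + 1        # constants folded by hand: a single operation

print(two_steps.tolist(), one_step.tolist())
```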

4. Data Types
• Avoid using the “object” dtype.
• Note that “str” is regarded as “object” dtype.
• Example: Group-by → mean
    # Group by “b” column, which has 100 unique values, then calculate the mean
    df.groupby('b').mean()

4. Data Types
• Object dtype.
• Convert the grouping column to “Categorical”.
    # “object” dtype
    %timeit df.groupby('b').mean()
    10 loops, best of 3: 59.7 ms per loop

    # “Categorical”
    df['b'] = df['b'].astype('category')
    %timeit df.groupby('b').mean()
    100 loops, best of 3: 17.2 ms per loop

4. Data Types
• Categorical, what?
    c = pd.Categorical(list('ababcabaca'))    # Create “Categorical”
    c
    [a, b, a, b, c, a, b, a, c, a]
    Categories (3, object): [a, b, c]
    c.categories
    Index(['a', 'b', 'c'], dtype='object')
    c.codes
    array([0, 1, 0, 1, 2, 0, 1, 0, 2, 0], dtype=int8)
• Each entry of “codes” is the location of the value in “categories” (a → 0, b → 1, c → 2).
• Operations are processed on “codes” (fast).
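To see why integer codes are enough, the sketch below rebuilds the original values from `categories` and `codes`, and counts each category with an integer-only operation. It illustrates the representation and is not the group-by implementation.

```python
import numpy as np
import pandas as pd

c = pd.Categorical(list("ababcabaca"))

# Each code is a position in categories, so indexing restores the values.
rebuilt = np.asarray(c.categories)[c.codes]

# Counting works on the small integer codes, never touching the strings.
counts = np.bincount(c.codes)

print(rebuilt.tolist(), counts.tolist())
```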

5. Index
• Index is immutable and caches some calculated results.
• It is better for an Index to be sorted, unique, and without missing values (NaN).
• Example: Left-outer join by Index
    left.join(right)

5. Index
    np.random.seed(71)
    # Left: 1M rows, 2 columns
    df_left = pd.DataFrame({'a': np.random.randn(n),
                            'b': np.random.randn(n)})
    # Right: 10K rows, 1 column
    n_right = 10000
    df_right = pd.DataFrame({'c': np.random.randint(1, 100, n_right)})

    # Join by sorted, unique Index
    %timeit df_left.join(df_right)
    100 loops, best of 3: 6.88 ms per loop

    # Shuffle by random sampling
    df_right_shuffled = df_right.sample(n=len(df_right))

    # Non-sorted, unique Index
    %timeit df_left.join(df_right_shuffled)
    100 loops, best of 3: 18.7 ms per loop
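Before a join, the two properties mentioned above can be checked through public Index attributes, and an unsorted Index can be fixed with `sort_index`. A sketch on a hypothetical three-row frame:

```python
import pandas as pd

right = pd.DataFrame({"c": [30, 10, 20]}, index=[2, 0, 1])

# Unique but not sorted.
before = (right.index.is_unique, right.index.is_monotonic_increasing)

# Sorting the Index once up front gives later joins the sorted, unique case.
right_sorted = right.sort_index()
after = (right_sorted.index.is_unique, right_sorted.index.is_monotonic_increasing)

print(before, after)
```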

6. I/O
• No single solution (read/write, types…)
• Chart cited from “Efficiently Store Pandas DataFrames” by Matthew Rocklin

6. I/O
• Parsing datetimes is likely to be a bottleneck when loading.
• ISO-8601
    iso_8601_fmt = '2011-{0:02d}-{1:02d} 00:00:00'
    values = [iso_8601_fmt.format(m, d) for m, d in zip([1, 2], [3, 4])]
    values
    ['2011-01-03 00:00:00', '2011-02-04 00:00:00']
    pd.to_datetime(values)
    DatetimeIndex(['2011-01-03', '2011-02-04'], dtype='datetime64[ns]', freq=None, tz=None)

6. I/O
• Flexible format (parsed by dateutil)
    mdy_fmt = '{0:02d}/{1:02d}/2011'
    values = [mdy_fmt.format(m, d) for m, d in zip([1, 2], [3, 4])]
    values
    ['01/03/2011', '02/04/2011']
    pd.to_datetime(values)
    DatetimeIndex(['2011-01-03', '2011-02-04'], dtype='datetime64[ns]', freq=None, tz=None)

6. I/O
    # Prepare 10K random combinations of month and day
    N = 10000
    months = np.random.randint(1, 12, N)
    days = np.random.randint(1, 28, N)

    # ISO-8601
    dates = [iso_8601_fmt.format(m, d) for m, d in zip(months, days)]
    %timeit pd.to_datetime(dates)
    100 loops, best of 3: 2.26 ms per loop

    # Flexible format (month/day/year)
    dates = [mdy_fmt.format(m, d) for m, d in zip(months, days)]
    %timeit pd.to_datetime(dates)
    1 loops, best of 3: 805 ms per loop

    %timeit pd.to_datetime(dates, format='%m/%d/%Y')
    10 loops, best of 3: 26.1 ms per loop
• Providing the “format” keyword may improve the performance.
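A self-contained check that an explicit `format` parses month/day/year strings to the expected dates. The two input strings are the ones from the earlier slide; no timing is claimed for current pandas versions.

```python
import pandas as pd

dates = ["01/03/2011", "02/04/2011"]

# An explicit format removes the need to infer the layout of each string.
parsed = pd.to_datetime(dates, format="%m/%d/%Y")

iso_strings = parsed.strftime("%Y-%m-%d").tolist()
print(iso_strings)
```

An explicit format also removes the month/day ambiguity of strings such as 01/03/2011.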

Limitations
• pandas cannot control:
  • Internal data alignment & consolidation
  • Internal data copies triggered by NumPy
• Alternatives:
  • Use NumPy directly
  • Use a language that gives lower-level control

Other Tools
• Computation time
  • Cython
  • Numba
• Parallelize
  • Dask
  • PySpark

Conclusions
• Understand:
  • How pandas handles internal data efficiently.
  • Some basic rules to get the most out of pandas.

Interested?
• Let’s start contributing!
• https://github.com/pydata/pandas
