
Sprinting Pandas at ODSC 2020

ianozsvald
September 19, 2020


Sprinting Pandas
Sometimes our Python Pandas code feels slow, and sometimes we can't fit enough data into RAM. Based on recent updates for the 2nd edition of Ian's High Performance Python book and on his public training classes, come and learn how to fit more data into RAM (reducing your need for other technologies like Spark), how to quickly compile for significant speedups, how to run in parallel, and which libraries you're missing that unlock additional performance benefits. You'll leave with new techniques to make your DataFrames smaller and many ideas for processing your data faster.
This talk is inspired by Ian's work updating his O'Reilly book High Performance Python for its 2nd edition in 2020. Over 10+ years of evolution the Pandas DataFrame library has gained a huge amount of functionality and is used by millions of Pythonistas, but the most obvious way to solve a task isn't always the fastest or the most RAM-efficient. This talk will help any Pandas user (beginner or beyond) process more data faster, making them more effective at their jobs.

See related talks at https://ianozsvald.com


Transcript

  1.  Introductions – Ian Ozsvald, Interim Chief Data Scientist, 19+ years of experience. Team coaching & public courses – I'm sharing from my Higher Performance Python course and the 2nd edition of the book! By [ian]@ianozsvald[.com] Ian Ozsvald
  2.  Today's goal – Pandas: saving RAM to fit in more data; calculating faster by dropping to NumPy; advice for "being highly performant"; and: has Covid-19 affected UK company registrations?
  3.  Categoricals – over 10x speed-up (on this data)!
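The categorical saving is easy to reproduce on toy data (the column values below are illustrative, not from the talk's dataset): a low-cardinality string column stored as `category` keeps one small lookup table plus compact integer codes.

```python
import numpy as np
import pandas as pd

# A low-cardinality string column: a million rows, three unique values
rng = np.random.default_rng(0)
ser = pd.Series(rng.choice(["ACTIVE", "DISSOLVED", "LIQUIDATION"], size=1_000_000))

cat = ser.astype("category")  # int8 codes + a 3-entry lookup table

obj_bytes = ser.memory_usage(deep=True)
cat_bytes = cat.memory_usage(deep=True)
print(f"object: {obj_bytes:,} bytes, category: {cat_bytes:,} bytes")
```

Comparisons and groupbys then operate on the small integer codes rather than Python string objects, which is where the speed-up on the slide comes from.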
  4.  Make choices to save RAM – even including the index (previously we ignored it) we still save circa 50% RAM, so you can fit in more rows of data.
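A minimal sketch of those dtype choices, assuming columns whose value ranges permit smaller types (always check your data's range first):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "count": np.arange(100_000, dtype="int64"),
    "price": np.linspace(0.0, 1.0, 100_000),  # float64 by default
})

before = df.memory_usage(deep=True).sum()

# Downcast where the value range allows it
df["count"] = pd.to_numeric(df["count"], downcast="integer")  # -> int32 here
df["price"] = df["price"].astype("float32")  # half the bytes, less precision

after = df.memory_usage(deep=True).sum()
print(f"{before:,} -> {after:,} bytes")
```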
  5.  Drop to NumPy if you know you can. Caveat – Pandas' mean is not np.mean; the fair comparison is np.nanmean, which is slower – see my blog or my PyData Amsterdam 2020 talk for details.
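The caveat can be seen directly on toy data: `Series.mean` skips NaN by default, so once you drop to the underlying array the like-for-like NumPy call is `np.nanmean`, not the faster `ndarray.mean`.

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.random.default_rng(1).standard_normal(1_000_000))
arr = ser.to_numpy()  # the underlying float64 array

pandas_mean = ser.mean()  # NaN-aware by default (skipna=True)
fair = np.nanmean(arr)    # the fair NumPy comparison
fast = arr.mean()         # fastest, but wrong if NaNs are present

print(pandas_mean, fair, fast)
```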
  6.  NumPy vs Pandas overhead (ser.sum()) – 25 files, 83 functions. Very few NumPy calls! Thanks! https://github.com/ianozsvald/callgraph_james_powell
  7.  Overhead with ser.values.sum() – 18 files, 51 functions. Many fewer Pandas calls (but still a lot!)
  8.  Is Pandas unnecessarily slow? No! https://github.com/pandas-dev/pandas/issues/34773 – the truth is a bit complicated!
  9.  Being highly performant – install the optional (but great!) Pandas dependencies bottleneck and numexpr; investigate https://github.com/ianozsvald/dtype_diet and my ipython_memory_usage (on PyPI/Conda). https://pandas.pydata.org/pandas-docs/stable/user_guide/enhancingperf.html
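As a quick illustration of why numexpr is worth installing: `DataFrame.eval` evaluates expressions through numexpr when it is available (avoiding intermediate temporary arrays) and falls back to the plain Python engine when it is not, giving the same answer either way. Toy data below:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.default_rng(2).random((100_000, 2)),
                  columns=["a", "b"])

# With numexpr installed this runs in one vectorised pass;
# without it, pandas falls back to the Python engine.
result = df.eval("a * b + 1")

print(result.head())
```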
  10. Pure Python is "slow" and expressive – a deliberately poor function; pretend this is clever but slow!
  11. Parallelise with Dask for multi-core – make plain-Python code multi-core. Note I had to drop the text index column due to the speed hit. Data-copy costs can overwhelm any benefits, so (always) profile & time.
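A single-process sketch of the pattern Dask parallelises: split the DataFrame into partitions, apply the function to each partition, and concatenate. `dask.dataframe` runs each per-partition call on a separate core; here the partitions run serially just to show the shape of the computation, and `slow_feature` is a hypothetical stand-in for your expensive function.

```python
import numpy as np
import pandas as pd

def slow_feature(part: pd.DataFrame) -> pd.Series:
    # Stand-in for an expensive per-row computation
    return part["a"] * 2 + part["b"]

df = pd.DataFrame(np.random.default_rng(3).random((10_000, 2)),
                  columns=["a", "b"])

# Split into 4 row-wise partitions (Dask would do this for us)
n_parts = 4
bounds = np.linspace(0, len(df), n_parts + 1, dtype=int)
parts = [df.iloc[s:e] for s, e in zip(bounds[:-1], bounds[1:])]

# Apply per partition, then stitch the results back together
result = pd.concat([slow_feature(p) for p in parts])
```

As the slide warns, the cost of copying data to worker processes can overwhelm the benefit, so profile before and after.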
  12. Being highly performant – mistakes slow us down (PAY ATTENTION!): try nullable Int64 & boolean, and the forthcoming Float64; write tests (unit & end-to-end); the docs save time. Lots more material & my newsletter on my blog IanOzsvald.com.
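The nullable dtypes mentioned above avoid the classic silent int-to-float upcast when missing values appear:

```python
import pandas as pd

# Classic behaviour: an integer column with a missing value becomes float64
classic = pd.Series([1, 2, None])
print(classic.dtype)  # float64 - the ints were upcast

# The nullable extension dtype keeps integers and tracks the missing value
nullable = pd.Series([1, 2, None], dtype="Int64")
print(nullable.dtype)         # Int64
print(nullable.isna().sum())  # 1
```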
  13. Vaex / Modin – Vaex: memory-mapped & lazy computation, with a new RAM-efficient string dtype. Modin sits on Pandas with a new "algebra" for DataFrames – a drop-in replacement, easy to try. See talks on my blog.
  14. When to try Vaex – you have a huge dataset on a single hard drive; memory-mapped files (HDF5) are best; NumPy types and simpler Pandas-like functions. Investment: a similar but different API to Pandas. https://github.com/vaexio/vaex/issues/968
  15. When to try Dask – you want Pandas but ran out of RAM on one machine; you want multi-machine cluster scalability; you want multi-core support for operations like groupby on parallelisable datasets. Investment: a quick start, then a learning curve.
  16. When to try Modin – you want all of Pandas; you have lots of RAM and many CPUs; you're doing groupby operations on many columns. Investment: easy to try. https://github.com/modin-project/modin/issues/1390
  17. Covid-19's effect on the UK economy? A sharp decline in corporate registrations after lockdown – then an apparent surge (perhaps just backed-up paperwork?). Will the recovery "last"? It's all open data – you can do similar things!
  18. Summary – make it right, then make it fast; think about being performant; see my blog for my classes. I'd love a postcard if you learned something new!