Sprinting Pandas (live in London)
@IanOzsvald – ianozsvald.com
Ian Ozsvald
London Python October 2020
Slide 2
Interim Chief Data Scientist
19+ years experience
Team coaching & public courses
– I’m sharing from my Higher Performance Python course
Introductions
By [ian]@ianozsvald[.com] Ian Ozsvald
2nd Edition!
Slide 3
Pandas
– Saving RAM to fit in more data
– Calculating faster by dropping to NumPy
Advice for “being highly performant”
Has Covid-19 affected UK company registrations?
Today’s goal
Slide 4
Strings are expensive and slow
Slide 5
Categoricals are cheap and fast!
Circa 1% of previous memory cost
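A minimal sketch of the string versus categorical memory comparison. The company-type labels and row count are made up for illustration (this is not the Companies House data from the talk), so the exact saving will differ from the figure on the slide:

```python
import pandas as pd

# Hypothetical labels: a few distinct strings repeated many times
labels = [
    "Private Limited Company",
    "Charitable Incorporated Organisation",
    "Limited Partnership",
]
ser_str = pd.Series(labels * 10_000, dtype=object)  # plain Python strings
ser_cat = ser_str.astype("category")  # small integer codes + one copy of each label

# deep=True counts the string payloads, not just the pointers to them
bytes_str = ser_str.memory_usage(deep=True, index=False)
bytes_cat = ser_cat.memory_usage(deep=True, index=False)
print(f"strings: {bytes_str:,} bytes, categorical: {bytes_cat:,} bytes")
```

The saving depends on cardinality: the fewer distinct values relative to the number of rows, the bigger the win.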
Slide 6
Categoricals
“.cat” accessor
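A short sketch of what the “.cat” accessor exposes on a categorical Series. The labels are hypothetical; the point is that the Series stores integer codes that index into a small table of categories:

```python
import pandas as pd

ser = pd.Series(["ltd", "plc", "ltd", "llp"], dtype="category")

categories = list(ser.cat.categories)  # the distinct labels, stored once
codes = ser.cat.codes.tolist()         # one small integer per row

# Renaming touches only the category table, not every row
renamed = ser.cat.rename_categories({"ltd": "Private Limited"})
print(categories, codes, list(renamed))
```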
Slide 7
Categoricals – over 10x speed up (on this data)!
Slide 8
Categoricals – index queries faster!
Circa 500x speed-up!
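A sketch of querying through a categorical index. The column names (company_type, n_filings) and values are hypothetical, and no timing is claimed here; measure on your own data to see what speed-up you get:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "company_type": ["ltd", "plc", "ltd", "llp"],
        "n_filings": [1, 2, 3, 4],
    }
)

# Convert the label column to a categorical, then use it as the index
df_cat = df.astype({"company_type": "category"}).set_index("company_type")

# Label lookups now go through a CategoricalIndex
ltd_filings = df_cat.loc["ltd", "n_filings"].tolist()
print(type(df_cat.index).__name__, ltd_filings)
```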
Slide 9
float64 is default and a bit expensive
Slide 10
float32 “half-price” and a bit faster
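A minimal sketch of the float64 to float32 trade, using random numbers as stand-in data. Each value drops from 8 bytes to 4, at the price of roughly 7 significant decimal digits instead of roughly 16, so check that the reduced precision is acceptable for your columns:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
ser64 = pd.Series(rng.random(1_000_000))  # float64 is the default
ser32 = ser64.astype("float32")

bytes64 = ser64.memory_usage(index=False)
bytes32 = ser32.memory_usage(index=False)

# Worst-case rounding error introduced by the downcast on this data
max_abs_error = float((ser64 - ser32.astype("float64")).abs().max())
print(bytes64, bytes32, max_abs_error)
```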
Slide 11
Make choices to save RAM
Including the index (previously we ignored it), we still save circa 50% RAM, so you can fit in more rows of data
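A sketch of measuring a whole DataFrame, index included, before and after changing dtypes. The frame is synthetic and the column names are hypothetical, so the percentage saved will not match the slide; the index is left untouched and still costs 8 bytes per row in both versions:

```python
import numpy as np
import pandas as pd

n_rows = 100_000
df = pd.DataFrame(
    {
        "company_type": np.array(["ltd", "plc", "llp", "cio"] * (n_rows // 4), dtype=object),
        "turnover": np.linspace(0.0, 1.0, n_rows),
    }
)
df.index = pd.Index(np.arange(n_rows, dtype="int64"))  # a materialised int64 index

df_small = df.astype({"company_type": "category", "turnover": "float32"})

# index=True includes the index; deep=True includes string payloads
bytes_before = int(df.memory_usage(index=True, deep=True).sum())
bytes_after = int(df_small.memory_usage(index=True, deep=True).sum())
print(f"before: {bytes_before:,} bytes, after: {bytes_after:,} bytes")
```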
Slide 12
“dtype_diet” gives you advice
Slide 13
Drop to NumPy if you know you can
Caveat – Pandas mean is not np.mean: Pandas skips NaN by default, so the fair comparison is to np.nanmean,
which is slower – see my blog or PyData Amsterdam 2020 talk for details
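A small sketch of why the caveat matters, using a toy Series with one missing value. Pandas skips NaN by default, plain np.mean does not, and np.nanmean matches the Pandas result:

```python
import numpy as np
import pandas as pd

ser = pd.Series([1.0, 2.0, np.nan, 4.0])
arr = ser.to_numpy()

pandas_mean = ser.mean()         # skipna=True by default
numpy_mean = np.mean(arr)        # NaN propagates into the result
numpy_nanmean = np.nanmean(arr)  # ignores NaN, like Pandas

print(pandas_mean, numpy_mean, numpy_nanmean)
```

Only drop to np.mean when you know the data has no missing values; otherwise the faster call silently gives a different answer.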
Slide 14
NumPy vs Pandas overhead (ser.sum())
25 files, 83 functions
Very few NumPy calls!
Thanks!
https://github.com/ianozsvald/callgraph_james_powell
Slide 15
Overhead...
Slide 16
Overhead with ser.values.sum()
18 files, 51 functions
Many fewer Pandas calls (but still a lot!)
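A rough timing harness for comparing ser.sum() with ser.values.sum(), assuming a synthetic float64 Series with no missing values. No numbers are quoted here because the gap depends on array size, the Pandas version and whether bottleneck is installed:

```python
import timeit

import numpy as np
import pandas as pd

ser = pd.Series(np.arange(1_000_000, dtype="float64"))

# Same answer either way on NaN-free data
total_pandas = ser.sum()
total_numpy = ser.values.sum()

# Time 20 calls of each; the Pandas call goes through more Python layers
seconds_pandas = timeit.timeit(ser.sum, number=20)
seconds_numpy = timeit.timeit(lambda: ser.values.sum(), number=20)
print(f"ser.sum(): {seconds_pandas:.4f}s, ser.values.sum(): {seconds_numpy:.4f}s")
```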
Slide 17
Is Pandas unnecessarily slow – NO!
https://github.com/pandas-dev/pandas/issues/34773 - the truth is a bit complicated!
Slide 18
Install optional (but great!) Pandas dependencies
– bottleneck
– numexpr
Investigate https://github.com/ianozsvald/dtype_diet
Investigate my ipython_memory_usage (PyPI/Conda)
Being highly performant
https://pandas.pydata.org/pandas-docs/stable/user_guide/enhancingperf.html
Slide 19
Pure Python is “slow” and expressive
Deliberately poor function – pretend this is clever but slow!
Slide 20
Compile to Numba judiciously
Near 10x speed-up!
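A sketch of compiling a loop-heavy function with Numba. The function count_above_py is a hypothetical stand-in, not the function shown in the talk, and the no-op fallback decorator is an assumption added only so the example still runs where Numba is not installed. No speed-up figure is claimed; time it on your own function:

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Assumption for this sketch: without Numba, fall back to plain Python
    def njit(func):
        return func


def count_above_py(values, threshold):
    """Deliberately loop-heavy: count the values above a threshold."""
    count = 0
    for v in values:
        if v > threshold:
            count += 1
    return count


# Same source, compiled on first call when Numba is available
count_above_numba = njit(count_above_py)

values = np.arange(1_000, dtype=np.float64)
result_py = count_above_py(values, 499.5)
result_numba = count_above_numba(values, 499.5)
print(result_py, result_numba)
```

Pass NumPy arrays (for example ser.to_numpy()) rather than a Pandas Series, and exclude the first call from any timing because it includes compilation.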
Slide 21
Parallelise with Dask for multi-core
Make plain-Python code multi-core
Note I had to drop text index column due to speed-hit
Data copy cost can overwhelm any benefits so (always) profile & time
Slide 22
Mistakes slow us down (PAY ATTENTION!)
– Try nullable Int64 & boolean, forthcoming Float64
– Write tests (unit & end-to-end)
– Lots more material & my newsletter on my blog
IanOzsvald.com
– Time saving docs:
Being highly performant
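A short sketch of the nullable Int64 and boolean dtypes mentioned in the list above, on toy data. With the default dtypes a missing value forces integers to float64; the nullable dtypes keep integers as integers and mark the gap as missing:

```python
import pandas as pd

default_ints = pd.Series([1, 2, None])                  # silently becomes float64
nullable_ints = pd.Series([1, 2, None], dtype="Int64")  # capital "I": nullable integer
nullable_flags = pd.Series([True, None, False], dtype="boolean")

n_missing = int(nullable_ints.isna().sum())
total = int(nullable_ints.sum())  # missing values are skipped
print(default_ints.dtype, nullable_ints.dtype, nullable_flags.dtype, n_missing, total)
```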
Slide 23
Memory mapped & lazy computation
– New string dtype (RAM efficient)
Modin sits on Pandas, new “algebra” for dfs
– Drop in replacement, easy to try
Vaex / Modin
See talks on my blog:
Slide 24
You have a huge dataset on a single hard drive
Memory mapped files (HDF5) are best
NumPy types and simpler Pandas-like functions
Investment – similar but different API to Pandas
When to try Vaex
https://github.com/vaexio/vaex/issues/968
Slide 25
You want Pandas but ran out of RAM on 1 machine
You want multi-machine cluster scalability
You want multi-core support for operations like groupby on parallelisable datasets
Investment – quick start then a learning curve
When to try Dask
Slide 26
You want all of Pandas
You have lots of RAM and many CPUs
You’re doing groupby operations on many columns
Investment – easy to try
When to try Modin
https://github.com/modin-project/modin/issues/1390
Slide 27
Covid-19’s effect on the UK economy?
Sharp decline in corporate registration after Lockdown – then apparent surge (perhaps just backed-up paperwork?).
Will the recovery “last”?
All open data, you can do similar things!
Slide 28
Make it right then make it fast
Think about being performant
See blog for my classes
I’d love a postcard if you learned something new!
Summary
Slide 29
Be faster by learning new approaches