Making Pandas Fly (live from London)
@IanOzsvald – ianozsvald.com
Ian Ozsvald
EuroPython 2020
Slide 2
Introductions
Interim Chief Data Scientist
19+ years experience
Team coaching & public courses
– I’m sharing from my Higher Performance Python course
2nd Edition!
By [ian]@ianozsvald[.com] Ian Ozsvald
Slide 3
Thank the organisers!
All volunteers – go say thank you in #lobby
They’ve put in a huge amount of volunteered work for us!
Slide 4
Today’s goal
Pandas
– Saving RAM to fit in more data
– Calculating faster by dropping to NumPy
Advice for “being highly performant”
Has Covid 19 affected UK Company Registrations?
Slide 5
Strings are expensive and slow
Slide 6
Categoricals are cheap and fast!
Circa 1% of previous memory cost
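The memory saving can be checked with a minimal sketch like the one below. The column name and values are hypothetical stand-ins (not the Companies House data used in the talk), and the exact ratio depends on how many distinct strings the column holds.

```python
import pandas as pd

# Hypothetical low-cardinality string column, repeated many times.
statuses = ["Active", "Dissolved", "Liquidation"]
ser_str = pd.Series(statuses * 100_000, dtype="object", name="company_status")

# Same values as a categorical: small integer codes plus a lookup table.
ser_cat = ser_str.astype("category")

# deep=True also counts the Python string objects, not just the pointers.
bytes_str = ser_str.memory_usage(deep=True)
bytes_cat = ser_cat.memory_usage(deep=True)

print(f"object strings: {bytes_str:,} bytes")
print(f"categorical:    {bytes_cat:,} bytes")
print(f"categorical uses {bytes_cat / bytes_str:.1%} of the string cost")
```

The saving shrinks as the number of distinct values grows, so a column of mostly unique strings gains little from this conversion.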
Slide 7
Categoricals
“.cat” accessor
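A short sketch of what the “.cat” accessor exposes, using made-up values. It shows the lookup table, the per-row integer codes, and a rename that only touches the lookup table.

```python
import pandas as pd

# Hypothetical values; categories are inferred and sorted.
ser = pd.Series(["Active", "Dissolved", "Active", "Liquidation"], dtype="category")

print(ser.cat.categories)  # the unique labels
print(ser.cat.codes)       # one small integer per row

# Renaming changes the label in the lookup table, not every row.
renamed = ser.cat.rename_categories({"Liquidation": "In liquidation"})
print(renamed)
```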
Slide 8
Categoricals – over 10x speed up (on this data)!
Slide 9
Categoricals – index queries faster!
Circa 500x speed-up!
Slide 10
float64 is default and a bit expensive
Slide 11
float32 “half-price” and a bit faster
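The “half-price” claim is easy to verify. This sketch uses random data as an assumption (not the talk's dataset) and also reports the rounding error introduced, since float32 keeps roughly 7 significant digits.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
ser64 = pd.Series(rng.random(1_000_000))  # float64 by default
ser32 = ser64.astype("float32")

# index=False so only the values are counted.
bytes64 = ser64.memory_usage(index=False)
bytes32 = ser32.memory_usage(index=False)

# Largest difference between the original and the down-cast values.
max_abs_error = (ser64 - ser32).abs().max()

print(f"float64: {bytes64:,} bytes, float32: {bytes32:,} bytes")
print(f"max absolute rounding error: {max_abs_error:.2e}")
```

Whether that loss of precision is acceptable depends on the data, so it is worth checking before down-casting.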
Slide 12
Make choices to save RAM
Including the index (previously we ignored it) we still save circa 50% RAM so you can fit in more rows of data
Slide 13
“dtype_diet” gives you advice
Slide 14
Drop to NumPy if you know you can
Caveat – Pandas mean is not np mean: Pandas skips NaN by default, so the fair comparison is to np nanmean,
which is slower – see my blog or PyDataAmsterdam 2020 talk for details
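The caveat can be seen on a tiny example. This is a sketch with made-up values, showing why a plain NumPy mean is only a safe substitute when you know the data has no NaN.

```python
import numpy as np
import pandas as pd

ser = pd.Series([1.0, 2.0, np.nan, 4.0])
arr = ser.to_numpy()

pandas_mean = ser.mean()          # skips NaN by default (skipna=True)
numpy_mean = arr.mean()           # NaN propagates into the result
numpy_nanmean = np.nanmean(arr)   # skips NaN, like Pandas

print(pandas_mean, numpy_mean, numpy_nanmean)
```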
Slide 15
NumPy vs Pandas overhead (ser.sum())
25 files, 83 functions
Very few NumPy calls!
Thanks!
Slide 16
Overhead...
Slide 17
Overhead with ser.values.sum()
18 files, 51 functions
Many fewer Pandas calls (but still a lot!)
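A rough way to see the overhead yourself is to time both calls. This sketch uses synthetic data and timeit; the numbers will vary by machine, Pandas version, and whether bottleneck is installed, so treat it as a starting point and not as a benchmark.

```python
import timeit

import numpy as np
import pandas as pd

ser = pd.Series(np.arange(200_000, dtype="float64"))

total_pandas = ser.sum()         # goes through the Pandas machinery
total_numpy = ser.values.sum()   # calls sum on the underlying NumPy array

t_pandas = timeit.timeit(ser.sum, number=100)
t_numpy = timeit.timeit(lambda: ser.values.sum(), number=100)

print(f"ser.sum():        {t_pandas:.4f} s for 100 calls")
print(f"ser.values.sum(): {t_numpy:.4f} s for 100 calls")
```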
Slide 18
Is Pandas unnecessarily slow – NO!
https://github.com/pandas-dev/pandas/issues/34773 - the truth is a bit complicated!
Slide 19
Being highly performant
Install optional (but great!) Pandas dependencies
– bottleneck
– numexpr
Investigate https://github.com/ianozsvald/dtype_diet
Investigate my ipython_memory_usage (PyPI/Conda)
https://pandas.pydata.org/pandas-docs/stable/user_guide/enhancingperf.html
Slide 20
Pure Python is “slow” and expressive
Deliberately poor function – pretend this is clever but slow!
Slide 21
Compile with Numba judiciously
Near 10x speed-up!
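A minimal sketch of the idea, not the function from the talk: slow_sum_of_squares is a hypothetical stand-in for a loop-heavy pure-Python function. The try/except fallback is an assumption added so the sketch still runs (uncompiled) where Numba is not installed.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Assumption: fall back to the plain function if Numba is missing.
    def njit(func):
        return func


def slow_sum_of_squares(values):
    """Hypothetical loop-heavy function, slow in pure Python."""
    total = 0.0
    for v in values:
        total += v * v
    return total


# Same source code, compiled to machine code on first call.
fast_sum_of_squares = njit(slow_sum_of_squares)

values = np.arange(10_000, dtype=np.float64)
result_slow = slow_sum_of_squares(values)
result_fast = fast_sum_of_squares(values)
print(result_slow, result_fast)
```

With a DataFrame you would pass the underlying array, for example df["col"].to_numpy(), since Numba works on NumPy arrays and not on Pandas objects. The first call includes compilation time, so time the second call onwards.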
Slide 22
Parallelise with Dask for multi-core
Make plain-Python code multi-core
Note: I had to drop the text index column due to the speed hit
Data copy cost can overwhelm any benefits so (always) profile & time
Slide 23
Being highly performant
Mistakes slow us down (PAY ATTENTION!)
– Try nullable Int64 & boolean, forthcoming Float64
– Write tests (unit & end-to-end)
– Lots more material & my newsletter on my blog IanOzsvald.com
– Time saving docs:
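The nullable Int64 and boolean dtypes named on this slide can be tried with a sketch like this one (made-up values). It contrasts them with the default behaviour, where a missing value silently turns an integer column into float64.

```python
import pandas as pd

# Default: a missing value forces integers to become float64 with NaN.
default_ser = pd.Series([1, 2, None])

# Nullable dtypes keep the intended type and use pd.NA for missing data.
int_ser = pd.Series([1, 2, None], dtype="Int64")
bool_ser = pd.Series([True, None, False], dtype="boolean")

print(default_ser.dtype, int_ser.dtype, bool_ser.dtype)
print(int_ser.sum())  # missing values are skipped
```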
Slide 24
Vaex / Modin
Vaex: memory mapped & lazy computation
– New string dtype (RAM efficient)
Modin sits on Pandas, new “algebra” for dfs
– Drop in replacement, easy to try
See talks on my blog:
Slide 25
Summary
Make it right then make it fast
Think about being performant
See blog for my classes
I’d love a postcard if you learned something new!
Slide 26
Covid 19’s effect on UK Economy?
Sharp decline in corporate registration after Lockdown – then apparent surge (perhaps just backed-up paperwork?).
Will the recovery “last”?
All open data, you can do similar things!