# Sprinting Pandas (London Python)

## Transcript

1. Sprinting Pandas (live in London)
Ian Ozsvald
2. Introductions
– Interim Chief Data Scientist
– 19+ years experience
– Team coaching & public courses
– I’m sharing from my Higher Performance Python course
– 2nd Edition!

3. Today’s goal
Pandas
– Saving RAM to fit in more data
– Calculating faster by dropping to NumPy
Has Covid 19 affected UK Company Registrations?
4. Strings are expensive and slow
5. Categoricals are cheap and fast!
Circa 1% of previous memory cost
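
A minimal sketch of the memory comparison between string and categorical columns. The company-type labels and row count are made-up stand-ins for the real registration data, so the exact ratio will differ from the figure on the slide.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data: 100,000 rows drawn from three repeated labels.
rng = np.random.default_rng(0)
company_types = [
    "Private Limited Company",
    "Limited Partnership",
    "Charitable Incorporated Organisation",
]
ser_str = pd.Series(rng.choice(company_types, size=100_000), dtype=object)

# Categoricals store each unique label once plus a small integer code per row.
ser_cat = ser_str.astype("category")

# deep=True counts the Python string objects, not just the 8-byte pointers.
str_bytes = ser_str.memory_usage(deep=True)
cat_bytes = ser_cat.memory_usage(deep=True)

print(f"object strings: {str_bytes:,} bytes")
print(f"categorical:    {cat_bytes:,} bytes ({cat_bytes / str_bytes:.1%} of the original)")
```

The saving is largest when a column has few unique values relative to its length; a column of mostly unique strings gains little.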

6. Categoricals
“.cat” accessor
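
As an illustration of what the `.cat` accessor exposes, here is a small sketch on made-up labels (not the talk's data). It shows the two pieces a categorical is built from: the unique categories and the per-row integer codes.

```python
import pandas as pd

# Hypothetical labels for illustration only.
ser = pd.Series(["ltd", "plc", "ltd", "llp", "ltd"], dtype="category")

# The unique labels, stored once.
categories = list(ser.cat.categories)

# One small integer per row, indexing into the categories.
codes = ser.cat.codes.tolist()

print(categories)
print(codes)
```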
7. Categoricals – over 10x speed up (on this data)!
8. Categoricals – index queries faster!
Circa 500x speed-up!
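
A sketch of the kind of index query being compared, assuming a made-up `company_type` column. The same lookup is run against an object index and a categorical index; it checks that the answers agree and leaves the timing to you (for example with `%timeit` in IPython), since the speed-up depends heavily on the data.

```python
import numpy as np
import pandas as pd

# Hypothetical data: a label column and a numeric column.
rng = np.random.default_rng(1)
df = pd.DataFrame(
    {
        "company_type": rng.choice(["ltd", "plc", "llp"], size=10_000),
        "value": rng.random(10_000),
    }
)

# Same data, indexed once by plain strings and once by a categorical.
df_obj = df.astype({"company_type": object}).set_index("company_type")
df_cat = df.astype({"company_type": "category"}).set_index("company_type")

# Identical query on both; only the index type differs.
sum_obj = df_obj.loc["plc", "value"].sum()
sum_cat = df_cat.loc["plc", "value"].sum()
```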

9. float64 is default and a bit expensive
10. float32 “half-price” and a bit faster
11. Make choices to save RAM
Including the index (previously we ignored it) we still save circa 50% RAM so you can fit in more rows of data
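
A minimal sketch of the float64 to float32 trade, using random stand-in columns. It measures memory with the index included, as the slide describes. Note that float32 keeps only about 7 significant decimal digits, so this is a choice to make per column, not a blanket rule.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric data; float64 is what Pandas gives you by default.
rng = np.random.default_rng(2)
df64 = pd.DataFrame({"a": rng.random(100_000), "b": rng.random(100_000)})

# Downcast every column to float32 (4 bytes per value instead of 8).
df32 = df64.astype("float32")

# index=True includes the index in the total.
bytes64 = df64.memory_usage(index=True).sum()
bytes32 = df32.memory_usage(index=True).sum()

print(f"float64: {bytes64:,} bytes")
print(f"float32: {bytes32:,} bytes ({bytes32 / bytes64:.0%} of float64)")
```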

13. Drop to NumPy if you know you can
Caveat: Pandas `mean` is not `np.mean`; the fair comparison is to `np.nanmean`, which is slower. See my blog or my PyData Amsterdam 2020 talk for details.
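
The caveat can be seen on a tiny made-up Series containing one missing value. This sketch only shows the difference in behaviour, not the timing.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value.
ser = pd.Series([1.0, 2.0, np.nan, 4.0])
arr = ser.to_numpy()  # drop to the underlying NumPy array

pandas_mean = ser.mean()          # skips NaN by default
numpy_mean = np.mean(arr)         # NaN propagates into the result
numpy_nanmean = np.nanmean(arr)   # skips NaN, matching Pandas

print(pandas_mean, numpy_mean, numpy_nanmean)
```

Dropping to `np.mean` is only safe when you know the column has no missing values.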

(ser.sum())
25 files, 83 functions
Very few NumPy calls!
Thanks! https://github.com/ianozsvald/callgraph_james_powell

18 files, 51 functions
Many fewer Pandas calls (but still a lot!)

17. Is Pandas unnecessarily slow – NO!
https://github.com/pandas-dev/pandas/issues/34773 – the truth is a bit complicated!

18. Being highly performant
Install optional (but great!) Pandas dependencies
– bottleneck
– numexpr
Investigate https://github.com/ianozsvald/dtype_diet
Investigate my ipython_memory_usage (PyPI/Conda)
https://pandas.pydata.org/pandas-docs/stable/user_guide/enhancingperf.html
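
A quick way to check whether the optional dependencies are present in the current environment. This is a sketch using only the standard library; it reports whether each package can be imported and does not install anything.

```python
import importlib.util

# The two optional Pandas accelerators named on the slide.
optional = ["bottleneck", "numexpr"]

# find_spec returns None when a top-level package cannot be found.
installed = {name: importlib.util.find_spec(name) is not None for name in optional}

for name, ok in installed.items():
    print(f"{name}: {'installed' if ok else 'missing'}")
```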

19. Pure Python is “slow” and expressive
Deliberately poor function – pretend this is clever but slow!

20. Compile to Numba judiciously
Near 10x speed-up!
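
A minimal sketch of compiling a loop-heavy function with Numba. The function below is a hypothetical stand-in, not the one from the talk, and the `ImportError` fallback is an assumption added so the example still runs (uncompiled) where Numba is not installed.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Assumption for illustration: without Numba, fall back to plain Python.
    def njit(func):
        return func


def slow_sum_of_squares(values):
    """Deliberately loop-based so there is something worth compiling."""
    total = 0.0
    for v in values:
        total += v * v
    return total


# Wrap the same function; Numba compiles it on the first call.
fast_sum_of_squares = njit(slow_sum_of_squares)

values = np.arange(10_000, dtype=np.float64)
result_slow = slow_sum_of_squares(values)
result_fast = fast_sum_of_squares(values)
```

When timing, call the compiled function once first so the one-off compilation cost is not counted.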

21. Parallelise with Dask for multi-core
Make plain-Python code multi-core

Note: I had to drop the text index column due to a speed hit

Data copy cost can overwhelm any benefits, so (always) profile & time

22. Being highly performant
Mistakes slow us down (PAY ATTENTION!)
– Try nullable Int64 & boolean, forthcoming Float64
– Write tests (unit & end-to-end)
– Lots more material & my newsletter on my blog IanOzsvald.com
– Time saving docs:
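
A short sketch of the nullable `Int64` and `boolean` dtypes on made-up values. The capital "I" in `Int64` matters: it selects the nullable extension type, so a missing value does not force the column to float.

```python
import pandas as pd

# Hypothetical values with one missing entry each.
ints = pd.Series([1, None, 3], dtype="Int64")
flags = pd.Series([True, None, False], dtype="boolean")

n_missing = int(ints.isna().sum())  # missing values are tracked as pd.NA
total = int(ints.sum())             # reductions skip missing values by default

print(ints.dtype, flags.dtype, n_missing, total)
```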
23. Vaex / Modin
Memory mapped & lazy computation
– New string dtype (RAM efficient)
Modin sits on Pandas, new “algebra” for dfs
– Drop in replacement, easy to try
See talks on my blog

24. When to try Vaex
– You have a huge dataset on a single hard drive
– Memory mapped files (HDF5) are best
– NumPy types and simpler Pandas-like functions
– Investment – similar but different API to Pandas
https://github.com/vaexio/vaex/issues/968

25. You want Pandas but ran out of RAM on 1 machine
– You want multi-machine cluster scalability
– You want multi-core support for operations like groupby on parallelisable datasets
– Investment – quick start then a learning curve
26. When to try Modin
– You want all of Pandas
– You have lots of RAM and many CPUs
– You’re doing groupby operations on many columns
– Investment – easy to try
https://github.com/modin-project/modin/issues/1390

27. Covid 19’s effect on UK Economy?
Sharp decline in corporate registration after Lockdown – then apparent surge (perhaps just backed-up paperwork?). Will the recovery “last”?
All open data, you can do similar things!

28. Make it right then make it fast