Slide 1

Faster Problem Solving with Pandas
Ian Ozsvald (@IanOzsvald – ianozsvald.com)
DevDays 2021 (my first!)

Slide 2

Today’s goal

• Get more into RAM & see what’s slow
• Vectorise for speed
• Debug groupbys
• Use Numba to compile rolling functions
• Install optional dependencies that make you faster

Slide 3

Introductions

• Interim Chief Data Scientist
• 19+ years’ experience
• Team coaching & public courses – I’m sharing from my Higher Performance Python course (2nd Edition!)

Slide 4

Benchmarking caveat

• Remember – all benchmarks are wrong
• Your data/OS/libraries will change the results
• You need to do your own experiments

Slide 5

Pandas background

• In-memory only
• Operations can be RAM-expensive
• Very wide & rich API
• Lots of timeseries tooling
• Mixed history with NumPy & Python datatypes

Slide 6

A quick look at the data (25M rows)

25M rows with lots of text columns: circa 20GB in RAM in Pandas before we even start any work (the Pickle file we loaded was 4GB).
https://www.gov.uk/government/collections/price-paid-data

Slide 7

Beware “info” – it underestimates

Text column sizes are always under-estimated!
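
The screenshot with the numbers isn’t in this export; a minimal sketch of the underestimate (column name and contents are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"street": ["10 Downing Street"] * 1_000_000})

df.info()                          # object columns counted at pointer size only
print(df.memory_usage(deep=True))  # inspects the actual Python strings
df.info(memory_usage="deep")       # makes info honest, at some cost in time
```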

Slide 8

Tip

• Do you need all that data?
• Could you use fewer columns?
• Could you subsample or discard rows?
• Smaller data == faster operations

Slide 9

The slice we’ll use (25M rows)

Slide 10

We’ll check RAM and then shrink this

Slide 11

Let’s use Category & int32 dtypes

10x RAM reduction! Categoricals were added relatively recently; they support all the Pandas types and regular operations work fine. If you have repeated data – use them! int32 is a NumPy 32-bit dtype, half the size of the normal int64 or float64.
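
The conversion code isn’t in this export; a minimal sketch, with a hypothetical file path and column names standing in for the price-paid data:

```python
import pandas as pd

df = pd.read_pickle("price_paid.pkl")  # hypothetical path to the 4GB pickle

# Repeated strings become small integer codes behind the scenes
df["property_type"] = df["property_type"].astype("category")
df["county"] = df["county"].astype("category")

# Half the width - only safe if every value fits in 32 bits
df["price"] = df["price"].astype("int32")

print(df.memory_usage(deep=True).sum() / 1e9, "GB")
```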

Slide 12

Check we don’t lose data!

Use .all() or .any() in an assert during research (or raise an Exception in production code) to make sure you don’t lose data. int64 and int32 have different capacities!

Slide 13

Let’s use Category & int32 dtypes

10x speed-up! By switching from strings to encoded data we get some huge speed improvements.

Slide 14

Inside the category dtype

This row has a 3, which is an S (semi-detached).
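
You can inspect the encoding directly; a toy sketch (the slide’s actual category ordering may differ):

```python
import pandas as pd

# Toy property-type column (D/S/T/F codes, as in the price-paid data)
prop = pd.Series(["D", "S", "T", "S", "F"], dtype="category")

print(prop.cat.categories)  # Index(['D', 'F', 'S', 'T'], dtype='object')
print(prop.cat.codes)       # compact int8 codes: 0, 2, 3, 2, 1
```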

Slide 15

Tip

• Categoricals are “see-through” – it is like having the original item in the column
• If you have repeated data then use them for memory and time savings

Slide 16

So how many sales do we see?

Slide 17

GroupBy and apply is easy but slow

Slide 18

Vectorised version will be faster

Slide 19

Vectorised precalculation is faster

1.5x faster – but maybe harder to read, and to remember in a month what was happening?
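
The slides’ code isn’t in this export; a toy sketch of the pattern from Slides 17–19, with an illustrative computation and column names:

```python
import pandas as pd

df = pd.DataFrame({"county": ["Kent", "Kent", "Devon"],
                   "price": [200_000, 300_000, 150_000]})

# Easy but slow: the lambda is plain Python, called once per group
slow = df.groupby("county")["price"].apply(lambda s: (s * 1.05).mean())

# Vectorised precalculation: do the arithmetic on the whole column once,
# then use the fast built-in aggregation
df["adj_price"] = df["price"] * 1.05
fast = df.groupby("county")["adj_price"].mean()

assert (slow == fast).all()
```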

Slide 20

Tip – opening up the groupby

.groups is a dictionary mapping each group key to the index labels of its matching rows.

Slide 21

Extracting an item from the groupby
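
A minimal sketch of both tricks on toy data:

```python
import pandas as pd

df = pd.DataFrame({"county": ["Kent", "Kent", "Devon"],
                   "price": [200_000, 300_000, 150_000]})
gb = df.groupby("county")

# .groups maps each key to the index labels of its rows
print(gb.groups)  # {'Devon': [2], 'Kent': [0, 1]}

# get_group pulls out one group's sub-DataFrame - handy when debugging
print(gb.get_group("Kent"))
```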

Slide 22

dir(…) to introspect (there’s lots!)

Slide 23

Rolling and Numba

• .rolling lets us make a “rolling window” on our data
• Great for time-series, e.g. rolling 1w or 10min
• User-defined functions are traditionally crazy-slow
• Numba is a Just-in-Time (JIT) compiler
• Now we can make faster functions for little effort

Slide 24

Numba on rolling operations

8x speed-up! “Raw arrays” are the underlying NumPy arrays – this only works for NumPy dtypes, not Pandas extension types.
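
The slide’s code isn’t in this export; a minimal sketch of the technique (window size and function are illustrative; numba must be installed):

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.random.random(1_000_000))

def mean_of_window(window):
    return window.mean()

# raw=True hands the UDF a plain NumPy array rather than a Series;
# engine="numba" JIT-compiles it on first call
result = ser.rolling(60).apply(mean_of_window, raw=True, engine="numba")
```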

Slide 25

rolling+Numba on our data

Slide 26

rolling+Numba on our data

4x speed-up on our custom NumPy-only function.

Slide 27

Using Numba

• Remember that Numba doesn’t “know Pandas” yet
• Use NumPy dtypes and np.* functions
• A groupby-apply with Numba is in development

Slide 28

GroupBy or numpy-groupies?

numpy-groupies is a Numba-based, NumPy- (and Pandas-) focused aggregator; it can compile common and custom functions. Getting the keys (factorize) is still slow, but the aggregation can be much faster. Here, factorize is faster than astype('category'), and mean and nanmean take the same time.
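
A minimal sketch of the pattern on toy data (the quoted timings come from the 25M-row frame):

```python
import numpy_groupies as npg
import pandas as pd

df = pd.DataFrame({"county": ["Kent", "Devon", "Kent"],
                   "price": [200_000.0, 150_000.0, 300_000.0]})

# factorize turns the keys into dense integer ids (this step stays slow)
group_idx, keys = pd.factorize(df["county"])

# the aggregation itself is compiled - 'mean' is built in, callables work too
means = npg.aggregate(group_idx, df["price"].to_numpy(), func="mean")
print(dict(zip(keys, means)))
```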

Slide 29

Beware well-meant “improvements”

Selecting “just the right columns” can involve a copy, which slows things down!

Slide 30

Bottleneck & numexpr – install these!

• Some helpful dependencies aren’t installed by default
• You should install these – especially bottleneck

Slide 31

Install “bottleneck” for faster math

Bottleneck (and numexpr) are not installed by default and they offer free speed-ups, so install them!
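
Both are a "pip install bottleneck numexpr" away; a quick sketch for checking what your environment actually uses:

```python
import importlib.util
import pandas as pd

# Pandas picks these up automatically when they are importable
for lib in ("bottleneck", "numexpr"):
    print(lib, "installed:", importlib.util.find_spec(lib) is not None)

# Their use can also be toggled explicitly via Pandas options
pd.set_option("compute.use_bottleneck", True)
pd.set_option("compute.use_numexpr", True)
```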

Slide 32

Drop to NumPy if you know you can

Note that NumPy’s mean is not NaN-aware but Pandas’ is, so the Pandas version is doing more work. This only works if you know you have no NaNs. See my PyData Amsterdam 2020 talk for more details (see my blog).
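
A minimal sketch, with random data standing in for the real column:

```python
import numpy as np
import pandas as pd

prices = pd.Series(np.random.random(1_000_000))

pandas_mean = prices.mean()            # NaN-aware, so it does extra work
numpy_mean = prices.to_numpy().mean()  # faster, but wrong if NaNs are present

assert np.isclose(pandas_mean, numpy_mean)
```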

Slide 33

My cheatsheet

Get my cheatsheet: http://bit.ly/hpp_cheatsheet

Slide 34

Useful reading

Slide 35

Tips

• Use Categoricals for speed-ups and to save RAM
• Try int32 and float32 – half the size, sometimes faster
• Install “bottleneck” and “numexpr” for free speed-ups
• Investigate the new Numba options in Pandas

Slide 36

Summary

• Pandas has lots of new options for speed
• My newsletter has tips (and jobs)
• See my blog for my classes + many past talks
• I’d love a postcard if you learned something new!

Slide 37

Appendix

Slide 38

Dask for larger datasets

• Rich library wrapping DataFrames, Arrays and arbitrary Python functions (very powerful, quite complex)
• Popular for scaling Pandas
• Wraps Pandas DataFrames

Slide 39

Dask for larger datasets

• Bigger than RAM, or “I want to use all my cores”
• Generally you’ll use Parquet (or CSV, or many other choices)
• The Dashboard gives rich diagnostics
• Write Pandas-like code
• Lazy – use “.compute()”

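The Dask code isn’t in this export; a minimal sketch of the lazy, Pandas-like style (path and column names are hypothetical):

```python
import dask.dataframe as dd

# A hypothetical partitioned Parquet copy of the price-paid data
ddf = dd.read_parquet("price_paid/*.parquet")

# Pandas-like and lazy - nothing executes until .compute()
mean_price_by_county = ddf.groupby("county")["price"].mean()
print(mean_price_by_county.compute())
```
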
Slide 40

Resampling across 100 partitions

Reasonably fast; uses 1GB of RAM across 8 processes. 100 Parquet files of 20MB each (probably 100MB each would be a better size).

Slide 41

Time series resampling across 100 partitions

The Dask Dashboard’s task list serves as a diagnostic.

Slide 42

Top Dask Tips

• Keep it simple – many moving parts
• Few processes, lots of RAM per process
• Always call .compute()
• ddf.sample(frac=0.01) to subsample makes things faster
• Check the Array, Bag and Delayed options in Dask too

Slide 43

Vaex – a newer DataFrame system

• A lesser-known Pandas alternative
• Designed for billions of rows, probably in HDF5 (a single big file)
• Memory-maps to keep RAM usage low
• Efficient groupby, faster strings, NumPy-focused (also ML, visualisation and more), lazy expressions

Slide 44

Vaex – a newer DataFrame system

• Typically you convert data to HDF5 (with helper functions)
• Once converted, opening the file and running queries is very quick
• Results look like Pandas DataFrames but aren’t

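A minimal sketch of the Vaex workflow (file name and column are hypothetical):

```python
import vaex

# vaex memory-maps the HDF5 file, so opening is near-instant
df = vaex.open("price_paid.hdf5")

# Expressions are lazy and evaluated out-of-core
print(df.mean(df.price))
```
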
Slide 45

Completions per day

Slide 46

Dask and Vaex?

• Dask – multi-core, multi-machine & multi-file; Vaex – multi-core, single-machine and typically a single file
• Dask allows most of Pandas and supports huge NumPy Arrays, a Bag, Delayed functions and lots more
• Both can handle billions of rows and ML; both have readable code

Slide 47

File sizes

Format                                  Size on disk (GB)
CSV (original)                          4.4
Pickle 5                                3.6
Parquet uncompressed (faster to load)   2.2
Parquet + Snappy (small speed penalty)  1.5
SQLite                                  4.6
HDF5 (Vaex)                             4.5