Pandas 2, Polars and Dask (PyDataLondon 2023)

Slide 1

Slide 1 text

Pandas 2, Polars or Dask? PyData London 2023 Talk @IanOzsvald – ianozsvald.com @GilesWeaver

Slide 2

Slide 2 text

We are Ian Ozsvald & Giles Weaver
Interim Chief Data Scientist. Data Scientist. 2nd Edition!
By [ian]@ianozsvald[.com] and [email protected] @gilesweaver Ian Ozsvald

Slide 3

Slide 3 text

3 interesting DataFrame libraries
Lots of change in the ecosystem in recent years
Which library should you use?
We learned Polars in 2 weeks
We benchmark. All benchmarks are lies

Slide 4

Slide 4 text

Motoscape Charity Rally
Ian: “Let’s do something silly”
2,000 mile round trip
<£1k car
Ideally it shouldn’t break or explode
https://bit.ly/JustGivingIan

Slide 5

Slide 5 text

Car Test Data (UK DVLA)
17 years of roadtest passes and fails
30M vehicles/year, [C|T]SV text files
Text→Parquet made easy with Dask
600M rows in total

Slide 6

Slide 6 text

Pandas 2 – what’s new?
Pandas is 15 years old and NumPy based
PyArrow is first class alongside NumPy
Internal clean-ups, so less RAM is used
Copy on Write (off by default)

Slide 7

Slide 7 text

PyArrow vs NumPy – which to use?
NumExpr & bottleneck both installed
Checks for identical results in notebook
String dtype
Nullable integer dtype
Backend
NumPy strings are expensive in RAM, e.g. 82M rows: 39GB NumPy, 11GB Arrow

Slide 8

Slide 8 text

Pandas+Arrow, query, Seaborn
You can optimise by hand: mask, then choose columns to go faster
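
The slide's code is not in this transcript, so the following is only a guess at what "mask, then choose columns" means in practice: build a boolean mask, then pull just the needed columns for the matching rows in a single `.loc` call. The column names and data are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the car test data
rng = np.random.default_rng(0)
n = 100_000
df = pd.DataFrame(
    {
        "make": rng.choice(["VOLVO", "FORD", "TOYOTA"], size=n),
        "test_mileage": rng.integers(0, 250_000, size=n),
        "test_result": rng.choice(["P", "F"], size=n),
        "postcode_area": rng.choice(["N", "SW", "M"], size=n),
    }
)
wanted = ["test_mileage", "test_result"]

# Convenient form: query filters every column, then we keep two of them
via_query = df.query("make == 'VOLVO'")[wanted]

# Hand-written form: compute the mask once, then select rows and columns together
mask = df["make"] == "VOLVO"
via_mask = df.loc[mask, wanted]

# Both routes should give the same rows and columns
print(via_query.equals(via_mask))
```

Which form is faster depends on the data and the pandas version, so time both (for example with `%timeit` in a notebook) before relying on either.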

Slide 9

Slide 9 text

Polars – what’s in it?
Rust based, Python front-end, 3 years old
Arrow (not NumPy)
Inherently multi-core and parallelised
Eager and Lazy API (+Query Planner)
Beta out-of-core (medium data) support

Slide 10

Slide 10 text

Polars – same query & Seaborn
(Lazy df is even faster)

Slide 11

Slide 11 text

A more advanced query
Polars eager (no “lazy() / collect()” call) takes 6s
Pandas+NumPy takes 25s (i.e. slower)
Possibly we can further optimise this by hand (?)
Enables the Query Planner optimisations

Slide 12

Slide 12 text

First conclusions
Pandas+Arrow probably faster than Pandas+NumPy
Polars seems to be faster than Pandas+Arrow
Maybe you can make Pandas “as fast”, but you have to experiment; Polars is “just fast”
All benchmarks are lies: your mileage will vary

Slide 13

Slide 13 text

Should I buy Volvo V50 – mileage?

Slide 14

Slide 14 text

Volvo V50 lasts <24 hours
https://bit.ly/JustGivingIan

Slide 15

Slide 15 text

Resampling a timeseries
This dataset is in-RAM (2021-2022)
There’s a limit to how much we can instantiate into memory, even if we’re careful with sub-selection and dtypes
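
The resampling code from the slide is not in this transcript. As a minimal illustration of the idea in pandas, with made-up dates and column names, monthly test counts could be computed like this:

```python
import pandas as pd

# Hypothetical in-RAM sample of test records
tests = pd.DataFrame(
    {
        "test_date": pd.to_datetime(
            ["2021-01-05", "2021-01-20", "2021-02-03", "2021-02-17", "2021-02-25"]
        ),
        "test_result": ["P", "F", "P", "P", "F"],
    }
)

# Resample on a datetime index: "MS" groups rows into month-start buckets
monthly_counts = tests.set_index("test_date").resample("MS")["test_result"].count()
print(monthly_counts)
```

The same grouping works at other frequencies (for example `"D"` for daily), as long as the frame fits in memory.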

Slide 16

Slide 16 text

Scanning 640M rows of larger dataset
Implicit Lazy DataFrame
11 seconds, 640M rows, circa 850 partitions (files)

Slide 17

Slide 17 text

April drop was due to lockdown

Slide 18

Slide 18 text

Vehicle ownership increases, Hybrids growing
We have to touch all Parquet files, so we can’t easily use Pandas
MOT after 3 years of age for all vehicles

Slide 19

Slide 19 text

Dask scales Pandas (and lots more)

Slide 20

Slide 20 text

Adding columns=[...] saves 20s

Slide 21

Slide 21 text

For the rally we bought a ‘99 Passat
Dead before 2023
Still alive
Us
https://bit.ly/JustGivingIan

Slide 22

Slide 22 text

3min+ with default 4 workers (*4 threads)
1min with 12 workers (*1 thread), hand tuned
Giles had to push directives to the Arrow read, and set shuffle on set_index and agg

Slide 23

Slide 23 text

Issues encountered

Slide 24

Slide 24 text

Thoughts on our testing
Haven’t checked to_numpy(), Numba, apply, rolling, writing partitioned Parquet (Polars)
NaN / Missing behaviour different Polars/Pandas
sklearn partial support (sklearn assumes Pandas API), but maybe Pandas+Arrow has copy issues too?
Arrow timeseries/str different to Pandas NumPy?

Slide 25

Slide 25 text

Medium-data conclusions
Dask ddf and Polars can perform similarly
Dask learning curve harder, especially for performance
Dask does a lot more (e.g. Bag, ML, NumPy, clusters, diagnostics)

Slide 26

Slide 26 text

Summary
Experiment, we have options!
I love receiving postcards (email me)
Sponsor Ian (JustGiving)
Join after for our discussion
https://bit.ly/JustGivingIan

Slide 27

Slide 27 text

Appendix

Slide 28

Slide 28 text

Arrow RAM usage great!
Polars is similar as it uses Arrow

Slide 29

Slide 29 text

Manual Query Planning

Slide 30

Slide 30 text

3min+ with default 4 workers (*4 threads)
1min with 12 workers (*1 thread), hand tuned
Giles had to sort the Parquet (6 mins) & change groupby agg shuffle, else performance was much worse