PyDataUK_Making_Pandas_Fly.pdf

Slide 1

Slide 1 text

Making Pandas Fly (live from London) @IanOzsvald – ianozsvald.com Ian Ozsvald PyDataUK 2020

Slide 2

Slide 2 text

Introductions
• Interim Chief Data Scientist
• 19+ years experience
• Team coaching & public courses – Higher Performance!
2nd Edition May 2020
By [ian]@ianozsvald[.com] Ian Ozsvald

Slide 3

Slide 3 text

Today’s goal
• Pandas
  – Saving RAM
  – Calculating faster by dropping to NumPy & Numba
• A brief look at Modin for in-RAM faster Pandas ops
• What does Covid 19 do to the (UK) economy?
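As a rough sketch of the "saving RAM" goal, the snippet below converts a repetitive string column to `category` and narrows a float column to `float32`, then compares `memory_usage(deep=True)`. The column names and data are invented for illustration and are not from the talk.

```python
import numpy as np
import pandas as pd

# Hypothetical data: a repetitive string column and a float column.
n_rows = 100_000
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "city": rng.choice(["London", "Leeds", "Bristol"], size=n_rows),
    "value": rng.random(n_rows),
})

bytes_before = df.memory_usage(deep=True).sum()

# Few distinct strings: store small integer codes plus a lookup table.
# Halve the float width where the reduced precision is acceptable.
df_small = df.assign(
    city=df["city"].astype("category"),
    value=df["value"].astype("float32"),
)
bytes_after = df_small.memory_usage(deep=True).sum()

print(f"before: {bytes_before / 1e6:.1f} MB, after: {bytes_after / 1e6:.1f} MB")
```

Whether `float32` is acceptable depends on the precision the analysis needs, so treat the downcast as a choice to check rather than a default.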

Slide 4

Slide 4 text

NumPy vs Pandas overhead
25 files, 83 functions
Very few NumPy calls!

Slide 5

Slide 5 text

Overhead...

Slide 6

Slide 6 text

Overhead with ser.values.sum()
18 files, 51 functions
Many fewer Pandas calls (but still a lot!)
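A minimal sketch of the comparison this slide refers to: summing a Series through Pandas versus calling `sum()` on the underlying NumPy array via `.values`. The Series contents and the repeat count are assumptions chosen so that per-call overhead is visible; timings will vary by machine and library version.

```python
import timeit

import numpy as np
import pandas as pd

ser = pd.Series(np.arange(1_000, dtype="float64"))  # hypothetical small Series

total_pandas = ser.sum()        # Pandas reduction: extra checks and dispatch
total_numpy = ser.values.sum()  # ndarray.sum() on the underlying NumPy array

# Time many repeats so that per-call overhead dominates on this small Series.
n_calls = 1_000
t_pandas = timeit.timeit(ser.sum, number=n_calls)
t_numpy = timeit.timeit(lambda: ser.values.sum(), number=n_calls)

print(f"same result: {total_pandas == total_numpy}")
print(f"ser.sum():        {t_pandas / n_calls * 1e6:.2f} µs per call")
print(f"ser.values.sum(): {t_numpy / n_calls * 1e6:.2f} µs per call")
```

Note that `ser.sum()` skips NaN by default while `ndarray.sum()` propagates it, so the two only agree when the data has no missing values.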

Slide 7

Slide 7 text

Modin
• A new “algebra” for DataFrames, reimplemented functions & Pandas fallback
• Young project, drop-in replacement
• Uses Ray for parallel computation
• Easy to experiment with

Slide 8

Slide 8 text

Modin

Slide 9

Slide 9 text

ex Modin
https://github.com/modin-project/modin/issues/1390

Slide 10

Slide 10 text

Covid 19 UK economic impact?

Slide 11

Slide 11 text

Summary
• Modin if big df, else check your Pandas choices. Swifter multicore
• See blog for my classes, also Thoughts & Jobs email list
• I’d love a postcard if you learned something new

Slide 12

Slide 12 text

Vaex
• “New” project (not “Pandas”)
• Memory mapped, virtual columns & lazy computation
• New string dtype (RAM efficient)
• See article (single laptop, billions of samples) ->
https://towardsdatascience.com/ml-impossible-train-a-1-billion-sample-model-in-20-minutes-with-vaex-and-scikit-learn-on-your-9e2968e6f385

Slide 13

Slide 13 text

When does Pandas get smelly?
• 10 million rows “probably fine” but needs 10s GB RAM
• Probably only single core, built for in-RAM computation
• Complex 10yr codebase, hard to optimise
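To make the RAM figure concrete, here is a back-of-envelope sketch of the raw size of a dense numeric table. The helper `estimate_ram_gb` and the 50-column shape are hypothetical, picked only to illustrate the arithmetic.

```python
def estimate_ram_gb(n_rows: int, n_cols: int, bytes_per_value: int = 8) -> float:
    """Raw size of a dense numeric table in GB (10**9 bytes).

    Assumes every value is fixed-width (8 bytes for float64/int64) and
    ignores the index, string columns and any temporary copies.
    """
    return n_rows * n_cols * bytes_per_value / 1e9


# Hypothetical shape: 10 million rows of 50 float64 columns.
raw_gb = estimate_ram_gb(10_000_000, 50)
print(f"raw data: {raw_gb:.1f} GB")
```

This is only a lower bound for the data itself. Operations that copy the frame need extra room on top, which is how a table of this size can end up wanting tens of GB.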

Slide 14

Slide 14 text

Dask Distributed DataFrame
• Mature project, Array (NumPy), Bag (list-like)
• Distributed dataframe for Pandas – row blocks, not cols
https://dask.readthedocs.io/en/latest/dataframe.html

Slide 15

Slide 15 text

Dask – remember to “.compute()”

Slide 16

Slide 16 text

Dask – mature & rich diagnostics
groupby task-graph

Slide 17

Slide 17 text

Dask Demo
• Live dataframe demo...

Slide 18

Slide 18 text

Dask Distributed DataFrame
• “Slower” than Pandas but happily works for 100GBs+
• Lots of docs & help on StackOverflow
• Great for 1 or n machines for bigger-than-RAM tasks
• Give Workers lots of RAM (else they die!)