May 05, 2020


  Making Pandas Fly (live from London)
    Ozsvald, PyDataUK 2020
  Introductions
    Interim Chief Data Scientist
    19+ years experience
    Team coaching & public courses – Higher Performance!
    2nd Edition May 2020
  Today's goal
    Pandas: saving RAM, calculating faster by dropping to NumPy & Numba
    A brief look at Modin for in-RAM faster Pandas ops
    What does Covid 19 do to the (UK) economy?
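A rough sketch of the "saving RAM" goal, not code from the talk. The DataFrame, its column names (`city`, `value`) and its contents are made up for illustration. It compares memory use before and after converting a low-cardinality string column to `category` and a `float64` column to `float32`.

```python
import numpy as np
import pandas as pd

# Hypothetical data: a repetitive string column and a float64 column.
rng = np.random.default_rng(0)
n = 100_000
df = pd.DataFrame({
    "city": rng.choice(["London", "Bristol", "Leeds"], size=n),
    "value": rng.random(n),
})

# deep=True also counts the memory held by the Python string objects.
before = df.memory_usage(deep=True).sum()

# category stores each distinct string once plus small integer codes;
# float32 uses half the bytes of float64 (at reduced precision).
df_small = df.assign(
    city=df["city"].astype("category"),
    value=df["value"].astype("float32"),
)
after = df_small.memory_usage(deep=True).sum()

print(f"before: {before:,} bytes, after: {after:,} bytes")
```

Whether `float32` precision is acceptable depends on the data, so treat the second conversion as optional.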
  NumPy vs Pandas overhead
    25 files, 83 functions
    Very few NumPy calls!
  Overhead...

  Overhead with ser.values.sum()
    18 files, 51 functions
    Many fewer Pandas calls (but still a lot!)
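The two overhead slides compare `ser.sum()` with `ser.values.sum()`. A minimal way to try that comparison yourself is sketched below. The Series contents and the repeat count are assumptions chosen for illustration, and timings will vary by machine and Pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# A small Series, so per-call overhead dominates over the actual summation.
ser = pd.Series(np.arange(1_000, dtype=np.float64))

total_pandas = ser.sum()         # goes through the Pandas machinery
total_numpy = ser.values.sum()   # drops to the underlying NumPy array

# The two agree here because the data contains no NaN.
# ser.sum() skips NaN by default, whereas ndarray.sum() propagates it.
assert np.isclose(total_pandas, total_numpy)

repeats = 1_000
t_pandas = timeit.timeit(ser.sum, number=repeats)
t_numpy = timeit.timeit(lambda: ser.values.sum(), number=repeats)

print(f"ser.sum():        {t_pandas / repeats * 1e6:.1f} us per call")
print(f"ser.values.sum(): {t_numpy / repeats * 1e6:.1f} us per call")
```

The NaN difference is the main thing to check before swapping one call for the other in real code.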
  Modin
    A new "algebra" for DataFrames, reimplemented functions & Pandas fallback
    Young project, drop-in replacement
    Uses Ray for parallel computation
    Easy to experiment with
  Modin

  ex Modin https://github.com/modin-project/modin/issues/1390

  Covid 19 UK economic impact?

  Summary
    Modin if big df, else check your Pandas choices.
    Swifter multicore
    See blog for my classes, also Thoughts & Jobs email list
    I'd love a postcard if you learned something new
  Vaex
    "New" project (not "Pandas")
    Memory mapped, virtual columns & lazy computation
    New string dtype (RAM efficient)
    See article (single laptop, billions of samples): https://towardsdatascience.com/ml-impossible-train-a-1-billion-sample-model-in-20-minutes-with-vaex-and-scikit-learn-on-your-9e2968e6f385
  When does Pandas get smelly?
    10 million rows "probably fine" but needs 10s of GB of RAM
    Probably only single core, built for in-RAM computation
    Complex 10yr codebase, hard to optimise
  Dask Distributed DataFrame
    Mature project, Array (NumPy), Bag (list-like)
    Distributed dataframe for Pandas – row blocks, not cols
    https://dask.readthedocs.io/en/latest/dataframe.html
  Dask – remember to ".compute()"

  Dask – mature & rich diagnostics

    groupby task-graph
  Dask Demo
    Live dataframe demo...
  Dask Distributed DataFrame
    "Slower" than Pandas but happily works for 100GBs+
    Lots of docs & help on StackOverflow
    Great for 1 or n machines for bigger-than-RAM tasks
    Give Workers lots of RAM (else they die!)