Making Pandas Fly

Talk at PyDataBudapest on preprocessing data with Dask and then making Pandas run much faster and use less RAM: https://ianozsvald.com/2020/04/27/flying-pandas-and-making-pandas-fly-virtual-talks-this-weekend-on-faster-data-processing-with-pandas-modin-dask-and-vaex/

ianozsvald

April 27, 2020

Transcript

  1. Making Pandas Fly (live from London)
    @IanOzsvald – ianozsvald.com, Ian Ozsvald, PyDataBudapest 2020
  2. Introductions
    Interim Chief Data Scientist; 19+ years experience; Team coaching & public courses – Higher Performance! 2nd Edition May 2020. By [ian]@ianozsvald[.com] Ian Ozsvald
  3. Today’s goal
    Preparing data with Dask; Pandas – saving RAM, calculating faster by dropping to NumPy & Numba; What does Covid 19 do to the (UK) economy?
  4. When does Pandas get smelly?
    10 million rows is “probably fine” but needs 10s of GB of RAM; probably only single core, built for in-RAM computation; complex 10yr codebase, hard to optimise
  5. Dask Distributed DataFrame
    Mature project, Array (NumPy), Bag (list-like); Distributed dataframe for Pandas – row blocks, not cols. https://dask.readthedocs.io/en/latest/dataframe.html
  6. Dask – remember to “.compute()”

  7. Dask – mature & rich diagnostics
    groupby task-graph
  8. Dask Demo
    Live dataframe demo...

  9. Dask Distributed DataFrame
    “Slower” than Pandas but happily works for 100GBs+; Lots of docs & help on StackOverflow; Great for 1 or n machines for bigger-than-RAM tasks; Give Workers lots of RAM (else they die!)
  10. Making Pandas fly
    Get more into RAM with smaller dtypes; Use smaller dtypes for numbers; Make smarter function choices; Compile with Numba
  11. NumPy vs Pandas overhead
    25 files, 83 functions. Very few NumPy calls!
  12. Overhead...

  13. Overhead with ser.values.sum()
    18 files, 51 functions. Many fewer Pandas calls (but still a lot!)
  14. Covid 19 UK economic impact?

  15. Summary
    Dask on Bigger Data, Modin if in RAM. Vaex for many strings; See blog for my classes; I’d love a postcard if you learned something new
  16. Vaex
    “New” project (not “Pandas”); Memory mapped, virtual columns & lazy computation; New string dtype (RAM efficient); See article (single laptop, billions of samples): https://towardsdatascience.com/ml-impossible-train-a-1-billion-sample-model-in-20-minutes-with-vaex-and-scikit-learn-on-your-9e2968e6f385
  17. Modin
    A new “algebra” for DataFrames, reimplemented functions & Pandas fallback; Young project, drop-in replacement; Uses Ray for parallel computation; Easy to experiment with
  18. Modin

  19. ex Modin https://github.com/modin-project/modin/issues/1390
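
Slides 10 and 13 suggest two cheap wins: store columns in smaller dtypes, and call NumPy directly via `ser.values.sum()` to skip some Pandas overhead. The sketch below is not taken from the talk; the DataFrame, its column names, and the chosen target dtypes are assumptions made purely for illustration. It only shows that the memory footprint shrinks and that both sums agree on NaN-free data.

```python
import numpy as np
import pandas as pd

# Hypothetical data: column names and sizes are illustrative, not from the talk.
rng = np.random.default_rng(0)
n_rows = 100_000
df = pd.DataFrame({
    "region": rng.choice(["London", "Wales", "Scotland"], size=n_rows),
    "count": rng.integers(0, 100, size=n_rows),  # 64-bit integers by default
    "value": rng.random(n_rows),                 # float64 by default
})

# Memory used by the default dtypes (deep=True also counts the string data).
before = df.memory_usage(deep=True).sum()

# Smaller dtypes: a category for the repeated strings, int8 for values that
# fit in -128..127, and float32 where reduced precision is acceptable.
small = df.astype({"region": "category", "count": "int8", "value": "float32"})
after = small.memory_usage(deep=True).sum()

print(f"before: {before:,} bytes, after: {after:,} bytes")

# Dropping to NumPy: ser.values is the underlying ndarray, so .sum() on it
# bypasses the Pandas reduction machinery.
ser = df["value"]
total_pandas = ser.sum()
total_numpy = ser.values.sum()
print(total_pandas, total_numpy)
```

Two caveats apply. `float32` trades precision for space, so it suits only columns where that loss is tolerable. Also, `ser.sum()` skips NaN by default while `ser.values.sum()` returns NaN if any value is missing, so the two are only interchangeable on data without missing values.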