Making Pandas Fly

Talk at PyDataBudapest on preprocessing data with Dask and then making Pandas run much faster and use less RAM: https://ianozsvald.com/2020/04/27/flying-pandas-and-making-pandas-fly-virtual-talks-this-weekend-on-faster-data-processing-with-pandas-modin-dask-and-vaex/

ianozsvald

April 27, 2020

Transcript

  1. Introductions
     - Interim Chief Data Scientist
     - 19+ years experience
     - Team coaching & public courses: Higher Performance!
     - 2nd Edition, May 2020
     By [ian]@ianozsvald[.com] Ian Ozsvald
  2. Today's goal
     - Preparing data with Dask
     - Pandas: saving RAM, and calculating faster by dropping to NumPy & Numba
     - What does Covid 19 do to the (UK) economy?
  3. When does Pandas get smelly?
     - 10 million rows is "probably fine" but needs 10s of GB of RAM
     - Probably only single core, built for in-RAM computation
     - Complex 10yr codebase, hard to optimise
  4. Dask Distributed DataFrame
     - Mature project, Array (NumPy), Bag (list-like)
     - Distributed dataframe for Pandas: row blocks, not cols
     - https://dask.readthedocs.io/en/latest/dataframe.html
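The "row blocks, not cols" idea can be illustrated without Dask itself. The sketch below is plain Pandas and NumPy, not Dask's API: it splits a small, made-up DataFrame into row blocks, reduces each block independently, and combines the partial results. The data, the block count and the variable names are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 1,000 rows, two numeric columns.
df = pd.DataFrame({"a": np.arange(1000), "b": np.ones(1000)})

# Split by rows (not by columns) into 4 blocks.
# Each block is itself a small DataFrame holding every column.
n_blocks = 4
bounds = np.linspace(0, len(df), n_blocks + 1, dtype=int)
blocks = [df.iloc[start:stop] for start, stop in zip(bounds[:-1], bounds[1:])]

# Reduce each block independently; in a distributed setting this is
# the step that separate workers could run in parallel.
partial_sums = [block["a"].sum() for block in blocks]

# Combine the partial results into the final answer.
total = sum(partial_sums)
print(total == df["a"].sum())
```

With Dask installed, the equivalent partitioning is typically created with `dask.dataframe.from_pandas(df, npartitions=4)` and the work is triggered by `.compute()`; that variant is not shown here because it needs the extra dependency.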
  5. Dask Distributed DataFrame
     - "Slower" than Pandas but happily works for 100GBs+
     - Lots of docs & help on StackOverflow
     - Great for 1 or n machines for bigger-than-RAM tasks
     - Give Workers lots of RAM (else they die!)
  6. Making Pandas fly
     - Get more into RAM with smaller dtypes
     - Use smaller dtypes for numbers
     - Make smarter function choices
     - Compile with Numba
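A minimal sketch of the "smaller dtypes" point, assuming a made-up DataFrame with one low-cardinality string column and two numeric columns. The column names and value ranges are hypothetical. Note that converting to `float32` loses precision, so it only suits data that tolerates it.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical data: few distinct strings, small integers, floats.
df = pd.DataFrame({
    "city": rng.choice(["London", "Budapest", "Leeds"], size=n),
    "count": rng.integers(0, 100, size=n),   # int64 by default
    "price": rng.random(n) * 100,            # float64 by default
})
before = df.memory_usage(deep=True).sum()

small = df.copy()
# Repeated strings are stored once each, plus a small integer code per row.
small["city"] = small["city"].astype("category")
# Values 0..99 fit in an unsigned 8-bit integer.
small["count"] = pd.to_numeric(small["count"], downcast="unsigned")
# Halves the storage per value, at the cost of precision.
small["price"] = small["price"].astype("float32")
after = small.memory_usage(deep=True).sum()

print(f"before: {before:,} bytes, after: {after:,} bytes")
print(small.dtypes)
```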
  7. Overhead with ser.values.sum()
     - 18 files, 51 functions
     - Many fewer Pandas calls (but still a lot!)
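The comparison on this slide can be tried with a short sketch, assuming a made-up float Series. It checks that `ser.sum()` and `ser.values.sum()` agree on clean data and times both. Timings vary by machine and Pandas version, so none are quoted here. It also shows one caveat of dropping to NumPy: `ser.sum()` skips NaN by default, while the NumPy sum propagates it.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical data: one million floats with no missing values.
ser = pd.Series(np.arange(1_000_000, dtype=np.float64))

pandas_total = ser.sum()         # goes through the Pandas machinery
numpy_total = ser.values.sum()   # sums the underlying NumPy array directly

# Time both paths; results depend on hardware and library versions.
t_pandas = timeit.timeit(lambda: ser.sum(), number=50)
t_numpy = timeit.timeit(lambda: ser.values.sum(), number=50)
print(f"ser.sum(): {t_pandas:.4f}s  ser.values.sum(): {t_numpy:.4f}s")

# Caveat: the two calls treat missing values differently.
with_nan = pd.Series([1.0, np.nan, 2.0])
skipped = with_nan.sum()             # NaN is skipped by default
propagated = with_nan.values.sum()   # NaN propagates through the NumPy sum
print(skipped, propagated)
```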
  8. Summary
     - Dask on Bigger Data, Modin if in RAM. Vaex for many strings
     - See blog for my classes
     - I'd love a postcard if you learned something new
  9. Vaex
     - "New" project (not "Pandas")
     - Memory mapped, virtual columns & lazy computation
     - New string dtype (RAM efficient)
     - See article (single laptop, billions of samples): https://towardsdatascience.com/ml-impossible-train-a-1-billion-sample-model-in-20-minutes-with-vaex-and-scikit-learn-on-your-9e2968e6f385
  10. Modin
     - A new "algebra" for DataFrames, reimplemented functions & Pandas fallback
     - Young project, drop-in replacement
     - Uses Ray for parallel computation
     - Easy to experiment with