Flying Pandas - Modin, Dask and Vaex

Flying Pandas - Dask, Modin and Vaex (live from London)
Ian Ozsvald
Remote Pizza Python 2020
@IanOzsvald – ianozsvald.com

Introductions
• Interim Chief Data Scientist
• 19+ years experience
• Team coaching & public courses – Higher Performance!
2nd Edition May 2020
By [ian]@ianozsvald[.com] Ian Ozsvald

Today’s goal
• When to use Modin or Dask
• A quick peek at Vaex

When does Pandas get smelly?
• 10 million rows “probably fine” but needs 10s of GB of RAM
• Probably only single core, built for in-RAM computation
• Complex 10yr codebase, hard to optimise
• Following tools are Pandas-like (each with differences)
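
To make the RAM point concrete, here is a minimal sketch of measuring how much memory a DataFrame really uses. The column names and the 1 million row size are invented for illustration (scaled down from the 10 million on the slide); only `DataFrame.memory_usage` is the point.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 1 million rows stands in for a larger real dataset
n_rows = 1_000_000
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "user_id": rng.integers(0, 10_000, size=n_rows),         # 8 bytes per row
    "amount": rng.random(n_rows),                             # 8 bytes per row
    "country": rng.choice(["GB", "HU", "US"], size=n_rows),  # string column
})

# deep=True inspects string/object columns for their true size
bytes_per_column = df.memory_usage(deep=True)
total_mb = bytes_per_column.sum() / 1e6
print(bytes_per_column)
print(f"total: {total_mb:.0f} MB")
```

Intermediate copies made during joins, groupbys and similar operations come on top of this figure, which is one reason peak RAM can be several times the size of the data itself.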

Modin
• A new “algebra” for DataFrames, reimplemented functions & Pandas fallback
• Young project, drop-in replacement
• Uses Ray for parallel computation
• Easy to experiment with
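
A minimal sketch of the "drop-in replacement" idea: only the import line changes. This assumes Modin and an engine such as Ray are installed; the fallback to plain Pandas is added here only so the sketch still runs without them, and the example data is invented.

```python
# Assumption: Modin (plus an engine such as Ray) is installed.
# If it is not, fall back to plain Pandas so the sketch still runs.
try:
    import modin.pandas as pd
except ImportError:
    import pandas as pd

# Hypothetical example data
df = pd.DataFrame({
    "city": ["London", "Budapest", "London", "Budapest"],
    "sales": [10, 20, 30, 40],
})

# The same Pandas-style call works with either import
totals = df.groupby("city")["sales"].sum()
print(totals)
```

Because the rest of the code is unchanged, trying Modin on an existing script and timing it against Pandas is cheap.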

Modin

Modin
• ex
https://github.com/modin-project/modin/issues/1390

Dask Distributed DataFrame
• Mature project, Array (NumPy), Bag (list-like)
• Distributed dataframe for Pandas – row blocks, not cols
https://dask.readthedocs.io/en/latest/dataframe.html

Dask – remember to “.compute()”

Dask – mature & rich diagnostics
[Figure: groupby task-graph]

Dask Distributed DataFrame
• “Slower” than Pandas but happily works for 100GBs+
• Lots of docs & help on StackOverflow
• Great for 1 or n machines for bigger-than-RAM tasks
• Give Workers lots of RAM (else they die!)

Vaex
• “New” project (not “Pandas”)
• Memory mapped, virtual columns & lazy computation
• New string dtype (RAM efficient)
• See article (single laptop, billions of samples) ->
https://towardsdatascience.com/ml-impossible-train-a-1-billion-sample-model-in-20-minutes-with-vaex-and-scikit-learn-on-your-9e2968e6f385

Summary
• Dask on Bigger Data, Modin if in RAM. See Budapest talk for smaller dataframes on Monday
• See blog for my classes
• I’d love a postcard if you learned something new
meetup.com/PyData-Budapest
