Flying Pandas - Dask, Modin and Vaex (live from London)
@IanOzsvald – ianozsvald.com
Ian Ozsvald
Remote Pizza Python 2020
Slide 2
Interim Chief Data Scientist
19+ years' experience
Team coaching & public courses – Higher Performance!
Introductions
By [ian]@ianozsvald[.com] Ian Ozsvald
2nd Edition, May 2020
Slide 3
When to use Modin or Dask
A quick peek at Vaex
Today’s goal
Slide 4
10 million rows is “probably fine” but can need tens of GB of RAM
Probably only single core, built for in-RAM computation
Complex 10-year-old codebase, hard to optimise
The following tools are Pandas-like (each with differences)
When does Pandas get smelly?
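To get a feel for the RAM point above, here is a minimal sketch that measures a small sample and extrapolates to 10 million rows. The helper name `estimate_ram_gb` and the two example columns are illustrative assumptions, not from the talk. The estimate covers only the stored data; temporary copies made during operations can push real usage higher.

```python
import numpy as np
import pandas as pd


def estimate_ram_gb(sample: pd.DataFrame, target_rows: int) -> float:
    """Extrapolate in-RAM size from a sample, assuming rows are similar in size."""
    bytes_per_row = sample.memory_usage(deep=True).sum() / len(sample)
    return bytes_per_row * target_rows / 1e9


rng = np.random.default_rng(0)
n = 10_000
sample = pd.DataFrame({
    "value": rng.random(n),                             # float64: 8 bytes per row
    "label": rng.choice(["red", "green", "blue"], n),   # strings cost extra per row
})

# deep=True asks Pandas to count the string payloads, not just the pointers
print(sample.memory_usage(deep=True))
print(f"{estimate_ram_gb(sample, 10_000_000):.2f} GB for 10 million rows")
```

Running this on a sample of your own data (for example the first few thousand rows of a CSV) gives a rough idea of whether the full dataset will fit in RAM.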
Slide 5
A new “algebra” for DataFrames, reimplemented functions & Pandas fallback
Young project, drop-in replacement
Uses Ray for parallel computation
Easy to experiment with
Modin
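The “drop-in replacement” idea can be sketched as a one-line import change. The example below is an illustration under assumptions: the tiny DataFrame is made up, and the fallback to plain Pandas is added only so the script still runs where Modin is not installed.

```python
try:
    import modin.pandas as pd  # Modin's Pandas-compatible API
    BACKEND = "modin"
except ImportError:
    import pandas as pd  # fall back so the script still runs without Modin
    BACKEND = "pandas"

# Made-up example data; the rest of the code is ordinary Pandas-style code
df = pd.DataFrame({
    "city": ["London", "Budapest", "London", "Budapest"],
    "sales": [10, 20, 30, 40],
})
totals = df.groupby("city")["sales"].sum()
print(BACKEND, totals.to_dict())
```

Because only the import changes, it is cheap to try Modin on an existing script and switch back if an operation behaves differently.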
Slide 6
Modin
Slide 7
Modin
https://github.com/modin-project/modin/issues/1390
Slide 8
Mature project, also provides Array (NumPy-like) and Bag (list-like)
Distributed dataframe for Pandas – split into row blocks, not columns
Dask Distributed DataFrame
https://dask.readthedocs.io/en/latest/dataframe.html
Slide 9
Dask – remember to “.compute()”
Slide 10
Dask – mature & rich diagnostics
groupby task-graph
Slide 11
“Slower” than Pandas but happily works on 100+ GB of data
Lots of docs & help on StackOverflow
Great for 1 or n machines for bigger-than-RAM tasks
Give Workers lots of RAM (else they die!)
Dask Distributed DataFrame
Slide 12
“New” project (not “Pandas”)
Memory mapped, virtual columns & lazy computation
New string dtype (RAM efficient)
See article (single laptop, billions of samples) ->
Vaex
https://towardsdatascience.com/ml-impossible-train-a-1-billion-sample-model-in-20-minutes-with-vaex-and-scikit-learn-on-your-9e2968e6f385
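Virtual columns and lazy computation can be sketched on a tiny in-memory example. The arrays and the column name `r` are illustrative assumptions. Memory mapping applies when opening a file from disk (for example with `vaex.open`), which this in-RAM sketch does not show. The guarded import is only there so the script does not fail where Vaex is missing.

```python
import numpy as np

try:
    import vaex
except ImportError:
    vaex = None  # Vaex not installed; the sketch below is skipped

# Made-up data: two small float columns
x = np.arange(5, dtype=np.float64)
y = np.arange(5, dtype=np.float64)

if vaex is not None:
    df = vaex.from_arrays(x=x, y=y)

    # Virtual column: Vaex stores the expression, not a new array of values
    df["r"] = np.sqrt(df.x**2 + df.y**2)

    # The expression is only evaluated when an aggregate asks for it
    mean_r = float(df.mean(df.r))
    print("virtual columns:", list(df.virtual_columns))
    print("mean r:", mean_r)
```

Since a virtual column costs almost no extra RAM, derived features can be added freely even on very large datasets.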
Slide 13
Dask for bigger data, Modin if it fits in RAM.
See the Budapest talk on Monday for smaller dataframes
See blog for my classes
I’d love a postcard if you learned something new
Summary
meetup.com/PyData-Budapest