Introductions
- Interim Chief Data Scientist
- 19+ years experience
- Team coaching & public courses: Higher Performance!
- 2nd Edition May 2020
By [ian]@ianozsvald[.com] Ian Ozsvald
Today’s goal
- Pandas: saving RAM, calculating faster by dropping to NumPy & Numba
- A brief look at Modin for faster in-RAM Pandas ops
- What does Covid-19 do to the (UK) economy?
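A minimal sketch of the "saving RAM" and "dropping to NumPy" ideas, assuming a made-up DataFrame with a low-cardinality string column and a small-integer column (the column names and data are hypothetical, not from the talk):

```python
import numpy as np
import pandas as pd

# Hypothetical example data: a repeated string column and small integers.
rng = np.random.default_rng(0)
n_rows = 100_000
df = pd.DataFrame({
    "region": rng.choice(["London", "Wales", "Scotland"], size=n_rows),
    "count": rng.integers(0, 100, size=n_rows),  # int64 by default
})

before = df.memory_usage(deep=True).sum()

# Saving RAM: a categorical stores each distinct string once plus small integer codes.
df["region"] = df["region"].astype("category")
# The values are in 0..99, so the smallest unsigned integer type is enough.
df["count"] = pd.to_numeric(df["count"], downcast="unsigned")

after = df.memory_usage(deep=True).sum()
print(f"before: {before / 1e6:.1f} MB, after: {after / 1e6:.1f} MB")

# Dropping to NumPy: work on the underlying array instead of the Series.
values = df["count"].to_numpy()
total = values.sum(dtype=np.int64)
```

How much is saved depends on the data; `memory_usage(deep=True)` is a quick way to check before and after changing dtypes.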
Modin
- A new “algebra” for DataFrames, reimplemented functions & Pandas fallback
- Young project, drop-in replacement
- Uses Ray for parallel computation
- Easy to experiment with
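"Drop-in replacement" means that, in the simplest case, only the import line changes. A small sketch, assuming Modin and an execution engine such as Ray are installed; the fallback to plain Pandas and the toy DataFrame are additions for illustration only:

```python
# Swap the import; the rest of the code keeps using the name `pd`.
try:
    import modin.pandas as pd  # parallel DataFrame with a Pandas-compatible API
    backend = "modin"
except ImportError:
    import pandas as pd  # assumption: fall back to plain Pandas if Modin is missing
    backend = "pandas"

# Hypothetical toy data; real gains are only expected on large DataFrames.
df = pd.DataFrame({"a": range(10), "b": range(10, 20)})
df["c"] = df["a"] + df["b"]
total = df["c"].sum()
print(backend, total)
```

Because the API is shared, it is cheap to try Modin on an existing script and switch back if it does not help.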
Summary
- Modin if big df, else check your Pandas choices
- Swifter for multicore
- See blog for my classes, also Thoughts & Jobs email list
- I’d love a postcard if you learned something new
When does Pandas get smelly?
- 10 million rows “probably fine” but needs 10s of GB of RAM
- Probably only single core, built for in-RAM computation
- Complex 10yr codebase, hard to optimise
Dask Distributed DataFrame
- Mature project, Array (NumPy), Bag (list-like)
- Distributed dataframe for Pandas: row blocks, not cols
- https://dask.readthedocs.io/en/latest/dataframe.html
Dask Distributed DataFrame
- “Slower” than Pandas but happily works for 100GBs+
- Lots of docs & help on StackOverflow
- Great for 1 or n machines for bigger-than-RAM tasks
- Give Workers lots of RAM (else they die!)