A 10-minute talk at Remote Pizza Python advising on when you might replace Pandas with Modin, Dask or Vaex for bigger-than-RAM and parallelised computation.

When does Pandas get smelly?
By Ian Ozsvald, [ian]@ianozsvald[.com]
... RAM
Probably only single core, built for in-RAM computation
Complex 10yr codebase, hard to optimise
Following tools are Pandas-like (each with differences)

Modin
... fallback
Young project, drop-in replacement
Uses Ray for parallel computation
Easy to experiment with

Dask Distributed DataFrame
... for Pandas – row blocks, not cols
https://dask.readthedocs.io/en/latest/dataframe.html
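
To make "row blocks, not cols" concrete, here is a rough sketch of the idea in plain Python. It is not Dask's API: the helper names (`split_into_row_blocks`, `block_sum_and_count`, `blockwise_mean`) and the sample rows are made up for illustration. Dask applies the same pattern with Pandas dataframes as the blocks: each block is reduced on its own, then the small partial results are combined.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_row_blocks(rows, block_size):
    """Split a list of rows into consecutive blocks of at most block_size rows."""
    return [rows[i:i + block_size] for i in range(0, len(rows), block_size)]


def block_sum_and_count(block, column):
    """Partial aggregate for one block: (sum of the column, number of rows)."""
    values = [row[column] for row in block]
    return sum(values), len(values)


def blockwise_mean(rows, column, block_size=2):
    """Mean of one column, computed per block and then combined.

    Assumes rows is non-empty. Only one block's values need to be
    looked at by any single worker.
    """
    blocks = split_into_row_blocks(rows, block_size)
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(lambda b: block_sum_and_count(b, column), blocks))
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


# Hypothetical sample data, five rows split into blocks of two rows each.
rows = [
    {"city": "London", "price": 10.0},
    {"city": "Budapest", "price": 20.0},
    {"city": "London", "price": 30.0},
    {"city": "Budapest", "price": 40.0},
    {"city": "London", "price": 50.0},
]
mean_price = blockwise_mean(rows, "price")
print(mean_price)
```

Threads are used here only to show the shape of the computation; in CPython they will not speed up pure-Python work like this.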

Dask Distributed DataFrame
Lots of docs & help on StackOverflow
Great for 1 or n machines for bigger-than-RAM tasks
Give Workers lots of RAM (else they die!)

Vaex
... & lazy computation
New string dtype (RAM efficient)
See article (single laptop, billions of samples):
https://towardsdatascience.com/ml-impossible-train-a-1-billion-sample-model-in-20-minutes-with-vaex-and-scikit-learn-on-your-9e2968e6f385
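
"Lazy computation" means an expression is recorded when you write it and only calculated when a result is requested. The toy class below sketches that idea in plain Python. It is not Vaex's API: `LazyColumn`, `load_x` and the sample values are hypothetical names chosen for illustration.

```python
class LazyColumn:
    """A column expression that records work instead of doing it immediately."""

    def __init__(self, compute):
        # compute is a zero-argument function that returns a list of values
        self._compute = compute

    @classmethod
    def from_values(cls, values):
        return cls(lambda: list(values))

    def __add__(self, other):
        # Build a new lazy expression; no values are touched yet.
        return LazyColumn(
            lambda: [a + b for a, b in zip(self._compute(), other._compute())]
        )

    def __mul__(self, factor):
        return LazyColumn(lambda: [a * factor for a in self._compute()])

    def evaluate(self):
        """Run the recorded expression and return concrete values."""
        return self._compute()


calls = []  # records when the (pretend expensive) load actually runs


def load_x():
    calls.append("x")
    return [1, 2, 3]


x = LazyColumn(load_x)
y = LazyColumn.from_values([10, 20, 30])

expr = (x + y) * 2            # only builds the expression
calls_before = list(calls)    # still empty: nothing has been loaded

result = expr.evaluate()      # the load and the arithmetic happen here
print(result)
```

Because nothing runs until `evaluate()`, intermediate results such as `x + y` are never stored as separate full-size columns, which is part of why lazy tools can be frugal with RAM.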

Summary
... Budapest talk for smaller dataframes on Monday
See blog for my classes
I'd love a postcard if you learned something new
meetup.com/PyData-Budapest