Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Skinny Pandas Riding on a Rocket (PyDataGlobal 2020)

November 11, 2020

Skinny Pandas Riding on a Rocket (PyDataGlobal 2020)

With larger datasets we need to be smarter about how we use Pandas to get results. We’ll look at strategies to shrink our data to get more into RAM, offload computation to tools like Dask or Vaex, store with Parquet or SQLite, make calculations faster and retain debuggability.
For more talks like this see: https://ianozsvald.com/


November 11, 2020

More Decks by ianozsvald

Other Decks in Research


  1. Skinny Pandas Riding on a Rocket @IanOzsvald – ianozsvald.com Ian

    Ozsvald (The inaugural!) PyDataGlobal 2020
  2. •Pandas – get more into RAM & see what’s slow

    •Dask – work on larger datasets and on multi-core •SQLite – slight segue to and oldie and a goodie •Vaex – a Pandas+Dask competitor for certain workflows? •25M row UK Land Registry (house purchase) dataset Today’s goal By [ian]@ianozsvald[.com] Ian Ozsvald
  3.  Interim Chief Data Scientist  19+ years experience 

    Team coaching & public courses –I’m sharing from my Higher Performance Python course Introductions By [ian]@ianozsvald[.com] Ian Ozsvald 2nd Edition!
  4.  Remember – all benchmarks are wrong  Your data/OS/libraries

    will change the results  You need to do your own experiments Benchmarking caveat By [ian]@ianozsvald[.com] Ian Ozsvald
  5.  In-memory only  Operations can be RAM expensive 

    Very wide & rich API  Lots of timeseries tooling  Mixed history with NumPy & Python datatypes Pandas background By [ian]@ianozsvald[.com] Ian Ozsvald
  6. A quick look at the data (25M rows) By [ian]@ianozsvald[.com]

    Ian Ozsvald 25M rows, lots of text columns, circa 10GB in RAM in Pandas before we even start any work (the Pickle file we loaded was 4GB) https://www.gov.uk/government/collections/price-paid-data
  7. •Useful package to profile RAM usage in Python •Can track

    whole-program execution with “mprof” •Can track cost of single command with “%memit” memory_profiler By [ian]@ianozsvald[.com] Ian Ozsvald
  8. Note that groupby can be RAM hungry By [ian]@ianozsvald[.com] Ian

    Ozsvald Unfortunately the categorical version doesn’t save much RAM
  9. Install “bottleneck” for faster math By [ian]@ianozsvald[.com] Ian Ozsvald Bottleneck

    (and numexpr) are not installed by default and they offer free speed- ups, so install them!
  10. Drop to NumPy if you know you can By [ian]@ianozsvald[.com]

    Ian Ozsvald Note that NumPy’s mean is not-NaN-aware but Pandas’ is, so the Pandas version is doing more work. This example works if you know you have no NaNs See my PyDataAmsterdam 2020 talk for more details (see my blog)
  11.  Use Categoricals for speed-ups and to save RAM 

    Try Int32 and Float32 – half the size, sometimes faster  Install “bottleneck” and “numexpr” for free speed-ups  We can only take this so far, to scale we need newer tools Tips By [ian]@ianozsvald[.com] Ian Ozsvald
  12.  Rich library wrapping DataFrames, Arrays and arbitrary Python functions

    (very powerful, quite complex)  Popular for scaling Pandas  Wraps Pandas DataFrames Dask for larger datasets By [ian]@ianozsvald[.com] Ian Ozsvald
  13.  Bigger than RAM or “I want to use all

    my cores”  Generally you’ll use Parquet (or CSV or many choices)  The Dashboard gives rich diagnostics  Write Pandas-like code  Lazy – use “.compute()” Dask for larger datasets By [ian]@ianozsvald[.com] Ian Ozsvald
  14. Resampling across 100 partitions By [ian]@ianozsvald[.com] Ian Ozsvald Reasonably fast

    and uses 1GB of RAM and 8 processes 100 parquet files of 20MB each (probably 100MB would be a better size)
  15. memory_profiler’s mprof By [ian]@ianozsvald[.com] Ian Ozsvald 2 processes, 1.2GB total

    RAM footprint groupby similar to the Pandas version plus some preparation (which took 11GB) but takes 45 seconds Caveat – this one loads all columns to amplify memory usage (which we could optimise!) for this demo
  16. memory_profiler’s mprof By [ian]@ianozsvald[.com] Ian Ozsvald With 16 processes, 6GB

    total RAM usage but only 25 seconds You have control over the RAM footprint Note selecting only the subset of needed columns significantly speeds this up
  17.  Keep it simple – many moving parts  Few

    processes, lots of RAM per process  Always call .compute()  ddf.sample(frac=0.01) to subsample makes things faster  Check Array, Bag and Delayed options in Dask too Top Dask Tips By [ian]@ianozsvald[.com] Ian Ozsvald
  18.  20 years old, 2M tests  “Most widely deployed

    db on the planet”  “To be supported through to 2050”  1GB rows, 281TB max db, Android, Java, Python, Firefox, ... SQLite – why use “boring SQL”? By [ian]@ianozsvald[.com] Ian Ozsvald
  19.  Circa 90MB SQLite process in RAM  Happily reads

    large (many GB-TB) database files  “Fast enough” for many applications & multi-user  Sane binary format for data interchange (beats CSV?)  Read/write and could be added to source control SQLite - simple, portable By [ian]@ianozsvald[.com] Ian Ozsvald
  20.  A lesser known Pandas alternative  Designed for billions

    of rows, probably in HDF5 (single big file)  Memory maps to keep RAM usage low  Efficient groupby, faster strings, NumPy focused (also ML, visualisation and more), lazy expressions Vaex – a newer DataFrame system By [ian]@ianozsvald[.com] Ian Ozsvald
  21.  Typically you convert data to HDF5 (with helper functions)

     Once converted opening the file and running queries is very quick  Results look like Pandas DataFrames but aren’t Vaex – a newer DataFrame system By [ian]@ianozsvald[.com] Ian Ozsvald
  22. Vaex RAM usage on a groupby By [ian]@ianozsvald[.com] Ian Ozsvald

    Using the same groupby operations we had with Pandas (which cost up to 1GB) – Vaex is far more efficient with RAM usage than Pandas or Dask and much faster than Dask! Sidenote – string operations (e.g. lower, contains) much faster than Pandas
  23.  Dask – multi-core, -machine & -files, Vaex – multicore,

    single machine and typically single file  Dask allows most of Pandas and supports huge NumPy Arrays, a Bag, Delayed functions and lots more  Both can handle billions of rows and ML. Both have readable code Dask and Vaex? By [ian]@ianozsvald[.com] Ian Ozsvald
  24.  Stick to Dask if you need Pandas functions 

    Think about Vaex common DF operations  See blog for my classes + many past talks  I’d love a postcard if you learned something new! Summary By [ian]@ianozsvald[.com] Ian Ozsvald
  25. File sizes By [ian]@ianozsvald[.com] Ian Ozsvald Format Size on Disk

    (GB) CSV (original) 4.4 Pickle 5 3.6 Parquet uncompressed (faster to load) 2.2 Parquet + Snappy (small speed penalty) 1.5 SQLite 4.6 HDF5 (Vaex) 4.5
  26.  We can export from Pandas with df.to_sql  Works

    via SQLAlchemy  We can execute arbitrary SQL e.g. to “CREATE INDEX” SQLite – why use “boring SQL”? By [ian]@ianozsvald[.com] Ian Ozsvald