
Faster problem solving with Pandas (PyConBY 2021)

ianozsvald
March 12, 2021

"My Pandas is slow!" - I hear that a lot. We'll look at ways of making Pandas calculate faster, help you express your problem to fit Pandas more efficiently and look at process changes that'll make you waste less time debugging your Pandas. By attending this talk you'll get answers faster and more reliably with Pandas so your analytics and data science work will be more rewarding.

Presented at PyCon Belarus 2021.

Transcript

  1. Faster Problem Solving with Pandas

    Ian Ozsvald (@IanOzsvald, ianozsvald.com)
    PyCon Belarus 2021 (my first!)
  2. Today’s goal

    • Get more into RAM & see what’s slow
    • Vectorise for speed
    • Debug groupbys
    • Use Numba to compile rolling functions
    • Install optional dependencies that make you faster

    By [ian]@ianozsvald[.com] Ian Ozsvald
  3. Introductions

    • Interim Chief Data Scientist
    • 19+ years experience
    • Team coaching & public courses; I’m sharing from my Higher Performance Python course

    2nd Edition!
  4. Forgive me making a video!

  5. Benchmarking caveat

    • Remember: all benchmarks are wrong
    • Your data/OS/libraries will change the results
    • You need to do your own experiments
  6. Pandas background

    • In-memory only
    • Operations can be RAM expensive
    • Very wide & rich API
    • Lots of timeseries tooling
    • Mixed history with NumPy & Python datatypes
  7. A quick look at the data (25M rows)

    25M rows, lots of text columns, circa 20GB in RAM in Pandas before we even start any work (the Pickle file we loaded was 4GB).
    https://www.gov.uk/government/collections/price-paid-data
  8. Beware “info”: it underestimates

    Text column size always under-estimated!
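The under-estimate can be seen by comparing shallow and deep memory measurements. This is a minimal sketch on a hypothetical toy frame (the column names and values are invented, not the talk's data), and it assumes the text column is stored as Python string objects (object dtype).

```python
import pandas as pd

# Hypothetical toy frame; "town" is forced to object dtype so it holds
# ordinary Python strings.
df = pd.DataFrame(
    {
        "price": pd.Series([250_000, 180_000, 320_000] * 1000, dtype="int64"),
        "town": pd.Series(["LONDON", "MANCHESTER", "BRISTOL"] * 1000, dtype=object),
    }
)

# Shallow: counts only the pointer stored for each string.
shallow = df.memory_usage(index=False, deep=False)
# Deep: also measures the Python string objects the pointers refer to.
deep = df.memory_usage(index=False, deep=True)

print(pd.DataFrame({"shallow_bytes": shallow, "deep_bytes": deep}))
# df.info(memory_usage="deep") reports the deep total in one call.
```

The numeric column reports the same size either way; only the text column grows when measured deeply.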
  9. Tip

    • Do you need all that data?
    • Could you use fewer columns?
    • Could you subsample or discard rows?
    • Smaller data == faster operations
  10. The slice we’ll use (25M rows)

  11. We’ll check RAM and then shrink this

  12. Let’s use Category & int32 dtypes

    10x reduction!
    • Categoricals were added recently and support all the Pandas types
    • Regular operations work fine
    • If you have repeated data, use them!
    • int32 is a NumPy 32-bit dtype, half the size of the normal int64 or float64
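A minimal sketch of the conversion, assuming a hypothetical frame with a repeated text column and an integer column whose values fit in 32 bits. The actual saving depends on your data, so no ratio is claimed here.

```python
import numpy as np
import pandas as pd

# Hypothetical data: four repeated property-type labels and modest prices.
df = pd.DataFrame(
    {
        "property_type": pd.Series(["D", "S", "T", "F"] * 2500, dtype=object),
        "price": np.arange(10_000, dtype=np.int64) * 100,
    }
)
before = df.memory_usage(index=False, deep=True)

small = df.assign(
    property_type=df["property_type"].astype("category"),  # labels stored once
    price=df["price"].astype("int32"),                     # 4 bytes per value
)
after = small.memory_usage(index=False, deep=True)

print(pd.DataFrame({"before_bytes": before, "after_bytes": after}))

# Regular operations still work on the categorical column.
n_semis = (small["property_type"] == "S").sum()
```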
  13. Check we don’t lose data!

    Use .all() or .any() in an assert during research (or with an Exception for production code) to make sure you don’t lose data. int64 and int32 have different capacities!
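One way to apply this check is sketched below. `downcast_checked` is a hypothetical helper name, not a Pandas function; it raises for out-of-range values (the production style) and asserts equality after the cast (the research style).

```python
import numpy as np
import pandas as pd


def downcast_checked(s: pd.Series, dtype: str = "int32") -> pd.Series:
    """Cast an integer Series to a smaller integer dtype, refusing to lose data."""
    limits = np.iinfo(dtype)
    # Production-style check: raise if any value is outside the target range.
    if not s.between(limits.min, limits.max).all():
        raise ValueError(f"values do not fit in {dtype}")
    converted = s.astype(dtype)
    # Research-style check: every converted value must equal the original.
    assert (converted == s).all()
    return converted


fits = pd.Series([100_000, 250_000, 1_000_000], dtype="int64")
too_big = pd.Series([100_000, 3_000_000_000], dtype="int64")  # beyond int32 range

safe = downcast_checked(fits)
try:
    downcast_checked(too_big)
    rejected = False
except ValueError:
    rejected = True
print(safe.dtype, rejected)
```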
  14. Let’s use Category & int32 dtypes

    10x speed-up! By switching from strings to encoded data, we get some huge speed improvements.
  15. Inside the category dtype

    This row has a 3 which is an S (semi-detached)
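The encoding can be inspected through the `.cat` accessor. The sketch below uses a hypothetical handful of labels, chosen so that S happens to receive code 3 as on the slide; the codes in your own data depend on which labels are present.

```python
import pandas as pd

# Hypothetical labels, not the talk's data.
ptype = pd.Series(["D", "S", "T", "F", "O", "S"], dtype="category")

# The unique labels are stored once, sorted.
categories = list(ptype.cat.categories)
# Each row stores only a small integer pointing into that list.
codes = ptype.cat.codes.tolist()

print(categories)
print(codes)
```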
  16. Tip

    • Categoricals are “see-through”, it is like having the original item in the column
    • If you have repeated data then use them for memory and time savings
  17. So how many sales do we see?

  18. GroupBy and apply is easy but slow

  19. Vectorised version will be faster

  20. Vectorised-precalculation is faster

    1.5x faster, but maybe harder to read and to remember in a month what was happening?
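The slide code is not in the transcript, so the following is only a sketch of the general pattern on a hypothetical task: counting sales above a price threshold per property type. The first version calls a Python function per group; the second precalculates a boolean column in one vectorised step and then uses a built-in aggregation.

```python
import pandas as pd

# Hypothetical data and threshold, for illustration only.
df = pd.DataFrame(
    {
        "ptype": ["D", "S", "T", "D", "S", "D"],
        "price": [300_000, 150_000, 90_000, 250_000, 210_000, 120_000],
    }
)
THRESHOLD = 200_000

# Easy to write, but runs a Python lambda once per group.
slow = df.groupby("ptype")["price"].apply(lambda s: (s > THRESHOLD).sum())

# Precalculate the comparison for all rows at once, then sum per group.
flagged = df.assign(expensive=df["price"] > THRESHOLD)
fast = flagged.groupby("ptype")["expensive"].sum()

print(slow.tolist(), fast.tolist())
```

Both versions give the same counts; any speed difference should be measured on your own data.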
  21. Tip: Opening up the groupby

    .groups is a dictionary of grouped keys to the matching rows
  22. Extracting an item from the groupby
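A short sketch of both ideas on a hypothetical frame: `.groups` maps each key to the row labels in that group, and `get_group` pulls out one group as a DataFrame.

```python
import pandas as pd

# Hypothetical data, for illustration only.
df = pd.DataFrame(
    {
        "ptype": ["D", "S", "T", "D", "S", "D"],
        "price": [300_000, 150_000, 90_000, 250_000, 210_000, 120_000],
    }
)
gb = df.groupby("ptype")

# Dictionary-like mapping: group key -> Index of matching row labels.
rows_for_s = list(gb.groups["S"])

# Extract the rows for one key as an ordinary DataFrame.
semis = gb.get_group("S")

print(rows_for_s)
print(semis)
```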

  23. dir(…) to introspect (there’s lots!)

  24. Rolling and Numba

    • .rolling lets us make a “rolling window” on our data
    • Great for time-series e.g. rolling 1w or 10min
    • User-defined functions traditionally crazy-slow
    • Numba is a Just in Time compiler (JIT)
    • Now we can make faster functions for little effort
  25. Numba on rolling operations

    “raw arrays” are the underlying NumPy arrays; this only works for NumPy arrays, not Pandas Extension types. 8x speed-up.
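A minimal sketch of a user-defined rolling function run with and without Numba. `value_range` is a hypothetical function, not the one from the talk, and the code falls back to the default engine if Numba cannot be imported. Timings are left out because they depend on your machine and data.

```python
import numpy as np
import pandas as pd


def value_range(window):
    # With raw=True the window arrives as a plain NumPy array, not a Series.
    return np.max(window) - np.min(window)


s = pd.Series(np.arange(10, dtype="float64") ** 2)
roll = s.rolling(3)

# Default engine: calls the Python function once per window.
slow = roll.apply(value_range, raw=True)

# Numba engine: compiles the function; raw=True is required.
try:
    import numba  # noqa: F401  (optional dependency)

    fast = roll.apply(value_range, raw=True, engine="numba")
except ImportError:
    fast = slow  # Numba missing or unsupported version: keep the default result

print(slow.tolist())
```

The first call with the Numba engine includes compilation time, so benchmark a second call.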
  26. rolling+Numba on our data

  27. rolling+Numba on our data

    4x speed-up on our custom NumPy-only function
  28. Using Numba

    • Remember that Numba doesn’t “know Pandas” yet
    • Using NumPy dtypes and np. functions
    • A groupby-apply with Numba is in development
  29. Bottleneck & numexpr: install these!

    • Some helpful dependencies aren’t installed by default
    • You should install these, especially bottleneck
  30. Install “bottleneck” for faster math

    Bottleneck (and numexpr) are not installed by default and they offer free speed-ups, so install them!
  31. Drop to NumPy if you know you can

    Note that NumPy’s mean is not NaN-aware but Pandas’ is, so the Pandas version is doing more work. This example works if you know you have no NaNs. See my PyDataAmsterdam 2020 talk for more details (see my blog).
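The difference in NaN handling can be sketched on two hypothetical Series: the results agree when there are no NaNs and diverge when there are.

```python
import numpy as np
import pandas as pd

clean = pd.Series([1.0, 2.0, 3.0, 4.0])
with_nan = pd.Series([1.0, np.nan, 3.0])

# No NaNs: Pandas and NumPy agree, and NumPy skips the NaN handling.
pandas_clean = clean.mean()
numpy_clean = clean.to_numpy().mean()

# With a NaN: Pandas skips it by default, NumPy propagates it.
pandas_nan = with_nan.mean()
numpy_nan = with_nan.to_numpy().mean()

print(pandas_clean, numpy_clean, pandas_nan, numpy_nan)
```

Dropping to NumPy is only safe when you have already confirmed the column has no missing values.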
  32. Useful reading

  33. Tips

    • Use Categoricals for speed-ups and to save RAM
    • Try Int32 and Float32: half the size, sometimes faster
    • Install “bottleneck” and “numexpr” for free speed-ups
    • Investigate new Numba options in Pandas
  34. Summary

    • Pandas has lots of new options for speed
    • My newsletter has tips (and jobs)
    • See blog for my classes + many past talks
    • I’d love a postcard if you learned something new!
  35. Appendix

  36. Dask for larger datasets

    • Rich library wrapping DataFrames, Arrays and arbitrary Python functions (very powerful, quite complex)
    • Popular for scaling Pandas
    • Wraps Pandas DataFrames
  37. Dask for larger datasets

    • Bigger than RAM or “I want to use all my cores”
    • Generally you’ll use Parquet (or CSV or many choices)
    • The Dashboard gives rich diagnostics
    • Write Pandas-like code
    • Lazy: use “.compute()”
  38. Resampling across 100 partitions

    Reasonably fast and uses 1GB of RAM and 8 processes. 100 Parquet files of 20MB each (probably 100MB would be a better size).
  39. Time series resampling across 100 partitions

    Dask Dashboard Task-list as a diagnostic
  40. Top Dask Tips

    • Keep it simple: many moving parts
    • Few processes, lots of RAM per process
    • Always call .compute()
    • ddf.sample(frac=0.01) to subsample makes things faster
    • Check Array, Bag and Delayed options in Dask too
  41. Vaex: a newer DataFrame system

    • A lesser known Pandas alternative
    • Designed for billions of rows, probably in HDF5 (single big file)
    • Memory maps to keep RAM usage low
    • Efficient groupby, faster strings, NumPy focused (also ML, visualisation and more), lazy expressions
  42. Vaex: a newer DataFrame system

    • Typically you convert data to HDF5 (with helper functions)
    • Once converted, opening the file and running queries is very quick
    • Results look like Pandas DataFrames but aren’t
  43. Completions per day

  44. Dask and Vaex?

    • Dask: multi-core, multi-machine & multi-file. Vaex: multi-core, single machine and typically single file
    • Dask allows most of Pandas and supports huge NumPy Arrays, a Bag, Delayed functions and lots more
    • Both can handle billions of rows and ML. Both have readable code
  45. File sizes

    Format                                    Size on Disk (GB)
    CSV (original)                            4.4
    Pickle 5                                  3.6
    Parquet uncompressed (faster to load)     2.2
    Parquet + Snappy (small speed penalty)    1.5
    SQLite                                    4.6
    HDF5 (Vaex)                               4.5