
Faster problem solving with Pandas (PyConBY 2021)

ianozsvald
March 12, 2021


"My Pandas is slow!" - I hear that a lot. We'll look at ways of making Pandas calculate faster, help you express your problem to fit Pandas more efficiently and look at process changes that'll make you waste less time debugging your Pandas. By attending this talk you'll get answers faster and more reliably with Pandas so your analytics and data science work will be more rewarding.

Presented at PyCon Belarus 2021.


Transcript

  1. Faster Problem Solving with
    Pandas
    @IanOzsvald – ianozsvald.com
    Ian Ozsvald
    PyCon Belarus 2021 (my first!)


  2. •Get more into RAM & see what’s slow
    •Vectorise for speed
    •Debug groupbys
    •Use Numba to compile rolling functions
    •Install optional dependencies that make you faster
    Today’s goal


  3. Interim Chief Data Scientist

    19+ years experience

    Team coaching & public courses
    –I’m sharing from my Higher Performance
    Python course
    Introductions
    2nd Edition!


  4. Forgive me making a video!


  5. Remember – all benchmarks are wrong

    Your data/OS/libraries will change the results

    You need to do your own experiments
    Benchmarking caveat


  6. In-memory only

    Operations can be RAM expensive

    Very wide & rich API

    Lots of timeseries tooling

    Mixed history with NumPy & Python datatypes
    Pandas background

  7. A quick look at the data (25M rows)
    25M rows, lots of text
    columns, circa 20GB in RAM
    in Pandas before we even
    start any work (the Pickle file
    we loaded was 4GB)
    https://www.gov.uk/government/collections/price-paid-data


  8. Beware “info” - it underestimates
    Text column size always under-estimated!
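    A minimal sketch of getting honest memory numbers (the Pickle path is hypothetical; df stands for the 25M-row property-sales DataFrame):

        import pandas as pd

        df = pd.read_pickle("price_paid.pickle")   # hypothetical path to the 25M-row dataset
        df.info(memory_usage="deep")               # "deep" forces an accurate count for object (string) columns
        print(df.memory_usage(deep=True))          # per-column bytes, including the Python string objects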


  9. Do you need all that data?

    Could you use fewer columns?

    Could you subsample or discard rows?

    Smaller data == faster operations
    Tip
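    A sketch of trimming the data before any analysis starts (the column names are hypothetical):

        # df is the property-sales DataFrame loaded earlier
        # keep only the columns the question actually needs
        df = df[["price", "date", "postcode", "property_type"]]
        # or iterate on a 1% random sample while developing, then re-run on everything
        df_dev = df.sample(frac=0.01, random_state=0)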

  10. The slice we’ll use (25M rows)

  11. We’ll check RAM and then shrink this

  12. Let’s use Category & int32 dtypes
    10x reduction!
    Categoricals were added relatively recently; they support all the Pandas types
    Regular operations work fine
    If you have repeated data – use them!
    int32 is a 32-bit NumPy dtype, half the size of the normal int64 or float64

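    Roughly what the dtype conversion looks like, assuming hypothetical property_type and price columns:

        # df is the property-sales DataFrame from earlier
        df["property_type"] = df["property_type"].astype("category")  # repeated strings become small integer codes
        df["price"] = df["price"].astype("int32")                     # half the width of the default int64
        print(f"{df.memory_usage(deep=True).sum() / 1e9:.1f} GB")     # confirm the RAM saving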

  13. Check we don’t lose data!
    Use .all() or .any() in an assert during research
    (or with an Exception for production code) to
    make sure you don’t lose data
    int64 and int32 have different capacities!

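    One way to guard the down-cast with an assert, as the slide suggests (column name hypothetical):

        price_int32 = df["price"].astype("int32")
        # int32 tops out around 2.1 billion - check the round-trip is lossless before keeping it
        assert (price_int32.astype("int64") == df["price"]).all(), "int32 cannot hold these values"
        df["price"] = price_int32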

  14. Let’s use Category & int32 dtypes
    10x speed-up!
    By switching from strings
    to encoded data, we get
    some huge speed
    improvements


  15. Inside the category dtype
    This row has a 3, which is an S (semi-detached)

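    Peeking inside a categorical column (a sketch; the example letters follow the Price Paid property-type codes):

        prop = df["property_type"]    # hypothetical categorical column
        print(prop.cat.categories)    # e.g. Index(['D', 'F', 'O', 'S', 'T'], dtype='object')
        print(prop.cat.codes.head())  # the small integer codes that are actually stored per row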


  16. Categoricals are “see-through” – it is like having the original item in the column

    If you have repeated data then use them for memory and
    time savings
    Tip

  17. So how many sales do we see?

  18. GroupBy and apply is easy but slow
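    The easy-but-slow pattern looks something like this (a sketch; the column names and the 500,000 threshold are made up):

        # each group is handed to a Python-level function - flexible, but slow on 25M rows
        expensive_frac = df.groupby("property_type")["price"].apply(
            lambda prices: (prices > 500_000).mean()
        )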

  19. Vectorised version will be faster

  20. Vectorised-precalculation is faster
    1.5x faster – but maybe
    harder to read and to
    remember in a month what
    was happening?

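    The same question with a precalculated column, in the spirit of the vectorised version (names hypothetical):

        # precalculate the boolean once, vectorised over the whole column...
        df["is_expensive"] = df["price"] > 500_000
        # ...so the groupby only has to aggregate, with no Python-level calls per group
        expensive_frac = df.groupby("property_type")["is_expensive"].mean()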

  21. Tip - Opening up the groupby
    .groups is a dictionary of grouped keys to the matching rows


  22. Extracting an item from the groupby
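    Opening up the GroupBy object for debugging (a sketch; the column name and group key are hypothetical):

        gb = df.groupby("property_type")
        print(list(gb.groups)[:5])     # .groups maps each group key to the matching row labels
        one_group = gb.get_group("S")  # pull a single group out as an ordinary DataFrame
        print(dir(gb))                 # introspect - there's a lot available on the GroupBy object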

  23. dir(…) to introspect (there’s lots!)


  24. .rolling lets us make a “rolling window” on our data

    Great for time-series e.g. rolling 1w or 10min

    User-defined functions traditionally crazy-slow

    Numba is a Just-in-Time (JIT) compiler

    Now we can make faster functions for little effort
    Rolling and Numba

  25. Numba on rolling operations
    “raw arrays” are the underlying NumPy arrays – this only works for NumPy arrays, not Pandas Extension types
    8x speed-up

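    A sketch of the rolling + Numba pattern (requires numba to be installed; the window length and the function itself are made up):

        import numpy as np

        def mean_of_positives(window):
            # window arrives as a raw NumPy array because raw=True
            return np.mean(window[window > 0])

        result = (
            df["price"]
            .rolling(30)
            .apply(mean_of_positives, raw=True, engine="numba")  # compiled on first call, fast on later calls
        )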

  26. rolling+Numba on our data

  27. rolling+Numba on our data
    4x speed-up on our
    custom NumPy-only
    function



  28. Remember that Numba doesn’t “know Pandas” yet

    Use NumPy dtypes and np.* functions

    A groupby-apply with Numba is in development
    Using Numba


  29. Some helpful dependencies aren’t installed by default

    You should install these – especially bottleneck
    Bottleneck & numexpr – install these!

  30. Install “bottleneck” for faster math
    Bottleneck (and numexpr) are not installed by default and they offer free speed-ups, so install them!

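    Installing and confirming the optional accelerators (sketch):

        # python -m pip install bottleneck numexpr   (or the conda equivalents)
        import pandas as pd

        # both options default to True once the libraries are importable
        print(pd.get_option("compute.use_bottleneck"))
        print(pd.get_option("compute.use_numexpr"))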

  31. Drop to NumPy if you know you can
    Note that NumPy’s mean is not NaN-aware but Pandas’ is, so the Pandas version is doing more work. This example works if you know you have no NaNs
    See my PyData Amsterdam 2020 talk for more details (see my blog)

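    The kind of comparison shown on the slide, only safe when you know the column has no NaNs (column name hypothetical):

        prices = df["price"].to_numpy()
        pandas_mean = df["price"].mean()  # NaN-aware, so it does more bookkeeping
        numpy_mean = prices.mean()        # no NaN handling - faster, but wrong if NaNs are present
        assert abs(pandas_mean - numpy_mean) < 1e-6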

  32. Useful reading


  33. Use Categoricals for speed-ups and to save RAM

    Try Int32 and Float32 – half the size, sometimes faster

    Install “bottleneck” and “numexpr” for free speed-ups

    Investigate new Numba options in Pandas
    Tips


  34. Pandas has lots of new options for speed

    My newsletter has tips (and jobs)

    See blog for my classes + many past talks

    I’d love a postcard if you learned something
    new!
    Summary

  35. Appendix


  36. Rich library wrapping DataFrames, Arrays and arbitrary
    Python functions (very powerful, quite complex)

    Popular for scaling Pandas

    Wraps Pandas DataFrames
    Dask for larger datasets


  37. Bigger than RAM or “I want to use all my cores”

    Generally you’ll use Parquet (or CSV or many choices)

    The Dashboard gives rich diagnostics

    Write Pandas-like code

    Lazy – use “.compute()”
    Dask for larger datasets
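    Roughly what the Pandas-like Dask workflow looks like (the file pattern and column names are hypothetical):

        import dask.dataframe as dd
        from dask.distributed import Client

        client = Client()                          # local cluster; prints the Dashboard URL
        ddf = dd.read_parquet("sales_*.parquet")   # lazily reads one partition per file
        ddf = ddf.set_index("date")                # a datetime index enables resampling
        weekly = ddf["price"].resample("1W").mean()
        result = weekly.compute()                  # nothing actually runs until .compute()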

  38. Resampling across 100 partitions
    Reasonably fast and uses 1GB of RAM and 8 processes
    100 parquet files of 20MB each (probably 100MB would be a better size)


  39. Time series resampling across 100
    partitions
    The Dask Dashboard task list as a diagnostic



  40. Keep it simple – many moving parts

    Few processes, lots of RAM per process

    Always call .compute()

    ddf.sample(frac=0.01) to subsample makes things faster

    Check Array, Bag and Delayed options in Dask too
    Top Dask Tips


  41. A lesser-known Pandas alternative

    Designed for billions of rows, probably in HDF5 (single
    big file)

    Memory maps to keep RAM usage low

    Efficient groupby, faster strings, NumPy focused (also
    ML, visualisation and more), lazy expressions
    Vaex – a newer DataFrame system


  42. Typically you convert data to HDF5 (with helper
    functions)

    Once converted, opening the file and running queries is very quick

    Results look like Pandas DataFrames but aren’t
    Vaex – a newer DataFrame system
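    A minimal Vaex sketch (paths and column names are hypothetical):

        import vaex

        # one-off conversion; afterwards vaex.open memory-maps the HDF5 file
        # vaex.from_csv("price_paid.csv", convert="price_paid.hdf5")
        df = vaex.open("price_paid.hdf5")
        print(len(df))            # row count without loading the data into RAM
        print(df.mean(df.price))  # lazy expression, evaluated out-of-core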

  43. Completions per day


  44. Dask – multi-core, multi-machine & multi-file; Vaex – multi-core,
    single machine and typically a single file

    Dask allows most of Pandas and supports huge NumPy
    Arrays, a Bag, Delayed functions and lots more

    Both can handle billions of rows and ML. Both have
    readable code
    Dask and Vaex?

  45. File sizes
    Format                                    Size on Disk (GB)
    CSV (original)                            4.4
    Pickle 5                                  3.6
    Parquet uncompressed (faster to load)     2.2
    Parquet + Snappy (small speed penalty)    1.5
    SQLite                                    4.6
    HDF5 (Vaex)                               4.5
